Re: [EXTERNAL] Re: Famous operational issues

2021-10-01 Thread Ray Bellis
On 16/02/2021 22:51, Compton, Rich A wrote:

> There was the outage in 2014 when we got to 512K routes.  
> http://www.bgpmon.net/what-caused-todays-internet-hiccup/

There was a similar issue in 1998/9 or so when we got to 64K routes,
which broke the routing table index (which defaulted to a uint16_t) on
any FreeBSD box doing BGP.

Fortunately a quick kernel recompile with the type changed to uint32_t
fixed that.
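
A minimal sketch of that failure mode, in Python rather than kernel C. The function names are made up for illustration and this is not the FreeBSD code; it only assumes an index held in an unsigned 16-bit field, and shows why widening the type makes the problem go away.

```python
import ctypes

def next_index_16(current: int) -> int:
    """Increment an index stored in an unsigned 16-bit field (hypothetical helper)."""
    # ctypes integer types do no overflow checking, so the value wraps like C's uint16_t.
    return ctypes.c_uint16(current + 1).value

def next_index_32(current: int) -> int:
    """Same increment, but with the field widened to 32 bits."""
    return ctypes.c_uint32(current + 1).value

last_16 = 2**16 - 1  # 65535, the largest index a uint16_t can hold

print(next_index_16(last_16))  # 0: the index wraps back to the start
print(next_index_32(last_16))  # 65536: room for a few more routes
```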

Ray



Re: Famous operational issues

2021-03-12 Thread Mark Tinka
Hardly famous and not service-affecting in the end, but figured I'd 
share an incident from our side that occurred back in 2018.


While commissioning a new node in our Metro-E network, an IPv6 
point-to-point address was mis-typed. Instead of ending in /126, it 
ended in /12. This happened in Johannesburg.


We actually came across this by chance while examining the IGP table of 
another router located in Slough, and found an entry for 2c00::/12 
floating around. That definitely looked out of place, as we never carry 
parent blocks in our IGP.


Running the trace from Slough led us back to this one Metro-E device in 
Jo'burg.


It took everyone nearly an hour to figure out the typo, because for all 
the laser focus we had on the supposed link of the supposed box that was 
creating this problem, we all overlooked the fact that the /12 
configured on the point-to-point link was actually supposed to have been 
a /126.
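
To see how one missing digit turns a point-to-point link into a parent block, here is a small sketch using Python's standard ipaddress module. The interface address is hypothetical (the real one is not given above); only the /126 and /12 prefix lengths and the resulting 2c00::/12 come from the incident.

```python
import ipaddress

# Hypothetical point-to-point address, chosen only because it falls inside 2c00::/12.
ADDR = "2c0f:ffff:0:1::1"

intended = ipaddress.ip_interface(f"{ADDR}/126")
mistyped = ipaddress.ip_interface(f"{ADDR}/12")

print(intended.network)  # 2c0f:ffff:0:1::/126
print(mistyped.network)  # 2c00::/12

# A /126 covers 4 addresses; a /12 covers 2**116 of them.
print(intended.network.num_addresses)
print(mistyped.network.num_addresses == 2**116)
```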


The reason this never caused a service problem is that we do not 
redistribute our IGP into BGP (not that anyone should). And even if we 
did, there are a ton of filters and BGP communities on all devices to 
ensure a route such as that would never have made it out of our AS.


Also, the IGP contains the most specific paths to every node in our 
network, so the presence of the 2c00::/12 was mostly cosmetic. It would 
have never been used for routing decisions.
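
A rough illustration of why the stray /12 was harmless: with longest-prefix matching, any destination that has a more-specific route never touches the covering block. The route list and the best_route helper below are assumptions for the sake of the example, not our actual IGP table or any router's lookup code.

```python
import ipaddress

def best_route(destination, routes):
    """Return the longest-prefix match for destination, or None (illustrative helper)."""
    dest = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(r) for r in routes]
    matches = [n for n in networks if dest in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)

# Hypothetical IGP contents: the stray /12, a point-to-point /126, and a loopback /128.
igp = ["2c00::/12", "2c0f:ffff:0:1::/126", "2c0f:ffff::10/128"]

print(best_route("2c0f:ffff::10", igp))     # the /128 wins
print(best_route("2c0f:ffff:0:1::2", igp))  # the /126 wins
print(best_route("2c01::1", igp))           # only the /12 matches
```

The /12 is only ever selected for destinations that have no more-specific entry, which in an IGP carrying the most specific path to every node should not be internal traffic at all.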


Mark.


Re: Famous operational issues

2021-02-24 Thread Randy Bush
anyone else have the privilege of running 2321 data cells?  had a bunch.
unreliable as hell.  there was a job running continuously recovering
transactions off of log tapes.  one night at 3am, head of apps program
(i was systems) got a call that a tran tape was unmounted with a console
message that recovery was complete.  ops did not know what it meant or
what to do.  was the first time in over five years the data were stable.

wife of same head of apps grew more and more tired of 2am calls.
finally she answered one "david?  he said he was going in to work."
ops never called in the night again.

randy

---
ra...@psg.com
`gpg --locate-external-keys --auto-key-locate wkd ra...@psg.com`
signatures are back, thanks to dmarc header mangling


Re: Famous operational issues

2021-02-24 Thread Alain Hebert
    I personally did "disable vlan Xyz" instead of "delete vlan Xyz" on 
Extreme Networks gear... which proceeded to disable all the ports where the 
VLAN was present...


    Good thing it was a (local) remote pop and not on the core.

-
Alain Hebert                                aheb...@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911  http://www.pubnix.net  Fax: 514-990-9443

On 2/23/21 5:22 PM, Justin Streiner wrote:

An interesting sub-thread to this could be:

Have you ever unintentionally crashed a device by running a perfectly 
innocuous command?

1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
2. "clear interface XXX" on a Nexus 7K triggered a 
cascading/undocumented Sev1 bug that caused two linecards to crash and 
reload, and take down about two dozen buildings on campus at the .edu 
where I used to work.
3. For those that ever had the misfortune of using early versions of 
the "bcc" command shell* on Bay Networks routers, which was intended 
to make the CLI look and feel more like a Cisco router, you have 
my condolences.  One would reasonably expect "delete ?" to respond 
with a list of valid arguments for that command.  Instead, it deleted, 
well... everything, and prompted an on-site restore/reboot.


* BCC originally stood for "Bay Command Console", but we joked that it 
really stood for "Blatant Cisco Clone".


On Tue, Feb 16, 2021 at 2:37 PM John Kristoff wrote:


Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists including mine. So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session.  I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective.  I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John





Re: Famous operational issues

2021-02-23 Thread Valdis Klētnieks
On Tue, 23 Feb 2021 20:46:38 -0800, Randy Bush said:
> maybe late '60s or so, we had a few 2314 dasd monsters[0].  think maybe
> 4m x 2m with 9 drives with removable disk packs.
>
> a grave shift operator gets errors on a drive and wonders if maybe they
> swap it into another spindle.  no luck, so swapped those two drives with
> two others.  one more iteration, and they had wiped out the entire
> array.  at that point they called me; so i missed the really creative
> part.

I suspect every S/360 site that had 2314's had an operator who did that, as I
was witness to the same thing.  For at least a decade after that debacle, the
Manager of Operations was awarding Gold, Silver, and Bronze Danny awards for
operational screw-ups. (The 2314 event was the sole Platinum Danny :)

And yes, it was all too easy to hit the EPO button on the keyboard of an
IBM 4341 console; we got guards for the consoles after one of our operators
nailed the button a second time in a month.

And to tie the S/360 and 4341 together - we were one of the last sites that was
still running an S/360 Mod 65J.  And plans came through for a new server room
on the top floor of a new building.  Architect comes through, measures the S/360
and all the peripherals for floorspace and power/cooling - and the CPU, plus
*4* meg of memory, and 3 strings of 2314 drives chewed a lot of both.

Construction starts.   Meanwhile, IBM announces the 4341, and offers us a real
sweetheart deal because even at the high maintenance charges we were paying,
IBM was losing money. Something insane like the system and peripherals and
first 3 years of maintenance, for less than the old system per-year
maintenance. Oh, and the power requirements are like 10% of the 360s.

So we take delivery of the new system and it's looking pitiful, just one box
and 2 small strings of disk in 10K square feet.  Lots of empty space. Do all
the migrations to the new system over the summer, and life is good.   Until
fall and winter arrive, and we discover there is zero heat in the room, and the
ceiling is uninsulated, and it's below zero outside because this is way upstate
NY.  And if there was a 360 in the room, it would *still* be needing cooling
rather than heating. But it's a 4341 that's shedding only 10% of the heat...

Finally, one February morning, the 4341 throws a thermal check. Air was too
cold at the intakes.  Our IBM CE did a double-take because he'd been doing IBM
mainframes for 3 decades and had never seen a thermal check for too cold
before.

Lots of legal action threatened against the architect, who simply said "If you
had *told* me that the system was being replaced, I'd have put heat in the
room". A settlement was reached, revised plans were drawn up, there was a whole
mess of construction to get ductwork and insulation and other stuff into place,
and life was good for the decade or so before I left for a better gig.




Re: Famous operational issues

2021-02-23 Thread Randy Bush
maybe late '60s or so, we had a few 2314 dasd monsters[0].  think maybe
4m x 2m with 9 drives with removable disk packs.

a grave shift operator gets errors on a drive and wonders if maybe they
swap it into another spindle.  no luck, so swapped those two drives with
two others.  one more iteration, and they had wiped out the entire
array.  at that point they called me; so i missed the really creative
part.

[0] https://www.ibm.com/ibm/history/exhibits/storage/storage_2314.html

randy

---
ra...@psg.com
`gpg --locate-external-keys --auto-key-locate wkd ra...@psg.com`
signatures are back, thanks to dmarc header mangling


Re: Famous operational issues

2021-02-23 Thread Adam Kennedy via NANOG
While we're talking about raid types...

A few acquisitions ago, between 2006-2010, I worked at a Wireless ISP in
Northern Indiana. Our CEO decided to sell Internet service to school
systems because the e-rate funding was too much to resist. He had the idea
to install towers on the schools and sell service off that while paying the
school for roof rights. About two years into the endeavor, I wake up one
morning and walk to my car. Two FBI agents get out of an unmarked towncar.
About an hour later, they let me go to the office where I found an entire
barrage of FBI agents. It was a full raid and not the kind you want to see.
Hard drives were involved and being made redundant, but the redundant
copies were labeled and placed into boxes that were carried out to SUVs
that were as dark as the morning coffee these guys drank. There were a lot
of drives, all of our servers were in our server room at the office. There
were roughly five or six racks of varying amounts of equipment in each.

After some questioning and assisting them in their cataloging adventure,
the agents left us with a ton of questions and just enough equipment to
keep the customers connected. CEO became extremely paranoid at this point.
He told us to prepare to move servers to a different building. He went into
a tailspin trying to figure out where he could hide the servers to keep
things going without the bank or FBI seizing the assets. He was extremely
worried the bank would close the office down. We started moving all network
routing around to avoid using the office as our primary DIA.

One morning I get into the office and we hear the words we've been
dreading: "We're moving the servers". The plan was to move them to a tower
site that had a decent-sized shack on site. Connectivity was decent; we had
a licensed 11 GHz microwave backhaul capable of about 155 Mbps. The site was
part of the old MCI microwave long-distance network in the 80s and 90s. It
had redundant air conditioners, a large propane tank, and a generator
capable of keeping the site alive for about three days. We were told not to
notify any customers, which became problematic because two customers had
servers colocated in our building. We consolidated the servers into three
racks and managed to get things prepared with a decent UPS in each rack.
CEO decided to move the servers at nightfall to "avoid suspicion". Our
office was in an unsavory part of town, moving anything at night was
suspicious. So, under the cover of half-ass darkness, we loaded the racks
onto a flatbed truck and drove them 20 minutes to the tower. While we
unloaded the racks, an electrician we knew was wiring up the L5-20 outlets
for the UPS in each rack. We got the racks plugged in, servers powered up,
and then the two customers came that had colocated equipment. They got
their equipment powered up and all seemed ok.

Back at the office the next day we were told to gather our workstations and
start working from home. I've been working from home ever since and quite
enjoy it, but that's beside the point.

Summer starts and I tell the CEO we need to repair the AC units because
they are failing. He ignores it, claiming he doesn't want to lose money the
bank could take at any minute. About a month later, a nice hot summer day
rolls in and the AC units both die. I stumble upon an old portable AC unit
and put that at the site. Temperatures rise to 140F ambient. Server
overheat alarms start going off, things start failing. Our colocation
customers are extremely upset. They pull their servers and drop service.
The heat subsides, CEO finally pays to repair one of the AC units.

Eventually, the company declares bankruptcy and goes into liquidation.
Luckily another WISP catches wind of it, buys the customers and assets, and
hires me. My happiest day that year was moving all the servers into a
better-suited home, a real data center. I don't know what happened to the
CEO, but I know that I'll never trust anything he has his hands in ever
again.

Adam Kennedy
Systems Engineer
adamkenn...@watchcomm.net | 800-589-3837 x120
Watch Communications | www.watchcomm.net

3225 W Elm St, Suite A
Lima, OH 45805






Re: Famous operational issues

2021-02-23 Thread brutal8z via NANOG
My war story.

At one of our major POPs in DC we had a row of 7513's, and one of them
had intermittent problems. I had replaced every piece of removable
card/part in it over time, and it kept failing. Even the vendor flew in
a team to the site to try to figure out what was wrong. It was finally
decided to replace the whole router (about 200lbs?). Being the local
field tech, that was my job. On the night of the maintenance at 3am, the
work started. I switched off the rack power, which included a 2511
terminal server that was connected to half the routers in the row and
started to remove the router. A few minutes later I got a text, "You're
taking out the wrong router!" You can imagine the "Damn it, what have I
done?" feeling that runs through your mind and the way your heart stops
for a moment.

Okay, I wasn't taking out the wrong router. But unknown at the time,
that terminal server, when turned off, had a nasty habit of sending a
break to all the routers it was connected to, and all those routers
effectively stopped. The remote engineer that was in charge saw the
whole POP go red and assumed I was the cause. I was, but not because of
anything I could have known about. I had to power cycle the downed
routers to bring them back on-line, and then continue with the
maintenance. A disaster to all involved, but the router got replaced.

I gave a very detailed account of my actions in the postmortem. It was
clear they were convinced I had turned off the wrong rack/router, and
wasn't being honest about it. I was adamant I had done exactly what I
said, and even swore I would fess up if I had erred, and always would,
even if it cost me the job. I rarely made mistakes, if any, so it was an
easy thing for me to say. For the next two weeks everyone that was aware
of the work gave me the side eye.

About a week after that, the same thing happened to another field tech
in another state. That helped my case. They used my account to figure
out it was the TS that caused the problem. A few of them that had
questioned me harshly admitted to me my account helped them figure out
the cause.

And the worst part of this story? That router, completely replaced,
still had the same intermittent problem as before. It was a DC powered
POP, so they were all wired with the same clean DC power. In the end
they chalked it up to cosmic rays and gave up on it. I believe this
break issue was unique to the DC powered 2511's, and that we were the
first to use them, but I might be wrong on that.





Re: Famous operational issues

2021-02-23 Thread bzs


Anyone remember when DEC delivered a new VMS version (V5 I think)
whose backups didn't work, couldn't be restored?

BU did, the hard way, when the engineering dept's faculty and student
disk failed.

DEC actually paid thousands of dollars for typist services to come and
re-enter whatever was on paper and could be re-entered.

I think that was the day I won the Unix vs VMS wars at BU anyhow.

-- 
-Barry Shein

Software Tool & Die| b...@theworld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
The World: Since 1989  | A Public Information Utility | *oo*


Re: Famous operational issues

2021-02-23 Thread scott



On 2/23/2021 12:22 PM, Justin Streiner wrote:

An interesting sub-thread to this could be:
Have you ever unintentionally crashed a device by running a perfectly 
innocuous command?

---


There was that time in the later 1990s where I took most of a global
network down several times by typing "show ip bgp regexp " on most all
of the core routers.  It turned out to be a cisco bug.  I looked for a
reference, but cannot find one.  Ahh, the earlier days of the commercial
internet...gotta love'em.

scott


Re: Famous operational issues

2021-02-23 Thread Warren Kumari
Oh, dear. RAID that triggered 2 stories.
1: I worked at a small ISP in Westchester, NY. One day I'm doing stuff, and
want to kill process 1742, so I type 'kill -9 1' ... and then, before
pressing enter, I get distracted by our "Cisco AGS+ monitor" (a separate
story). After I get back to my desk I unlock my terminal, and call over a
friend to show just how close I'd gotten to making something go Boom. He
says "Nah, BSD is cleverer than that. I'm sure the kill command has some
check in to stop you killing init.". I disagree. He disagrees. I disagree
again. He calls me stupid. I bet him a soda.
He proves his point by typing 'su; kill -9 1' in the window he's logged
into -- and our primary NFS server (with all of the user sites)
obediently kills off init, and all of the child processes... We run over
to the front of the box and hit the power switch, while desperately looking
for a monitor and keyboard to watch it boot.
It does the BIOS checks, and then stops on the RAID controller, complaining
about the fact that there are *2* dead drives, and that the array is now
sad.
This makes no sense. I can understand one drive not recovering from a power
outage, but 2 seems a bit unlikely, especially because the machine hadn't
been beeping or anything like that... We try turning it off and on again a
few times, no change... We pull the machine out of the rack and rip the
cover off.
Sure enough, there is a RAID card - but the piezo-buzzer on it is, for some
reason, wrapped in a bunch of napkins, held in place with electrical tape.
I pull that off, and there is also some paper towel jammed into the hole
in the buzzer, and bits of a broken pencil...

After replacing the drives and starting an rsync restore from a backup
server, we investigate more...
It turns out that a few months ago(!) the machine had started beeping. The
night crew naturally found this annoying, and so they'd gone investigating
and discovered that it was this machine, and lifted the lid while still in
the rack. They traced the annoying noise to this small black thingie, and
poked it until it stopped, thus solving the problem once and for
all... yay!





2: I used to work at a company which was in one of the buildings next to
the twin-towers. For various clever reasons, they had their "datacenter" in
a corner of the office space... anyway, the planes hit, power goes out and
the building is evacuated - luckily no one is injured, but the entire
company/site is down. After a few weeks, my friend Joe is able to arrange
with a fire marshal to get access to the building so he can go and grab the
disks 

Re: Famous operational issues

2021-02-23 Thread Eric Kuhnke
I would be more interested in seeing someone who HASN'T crashed a Cisco
6500/7600, particularly one with a long uptime, by typing in a supposedly
harmless 'show' command.




Re: Famous operational issues

2021-02-23 Thread Shawn L via NANOG

That brings back memories... I had a similar experience.  First month on the 
job, a large Sun RAID array storing ~5k mailboxes dies in the middle of the 
afternoon.  So, I start troubleshooting and determine it's most likely a bad 
disk.  The CEO walked into the server room right about the time I had 20 disks 
laid out on a table.  He had a fit and called the desktop support guy to come 
and 'show me how to fix a pc'.

Never mind the fact that we had a 90% ready to go replacement box sitting at 
another site, and just needed to either go get it, or bring the disks to 
it. So we sat there until the desktop guy, who was 30 minutes away, got 
there.  He took one look at it and said 'never touched that thing before, looks 
like he knows what he's doing' and pointed to me.  4 hours later we were 
driving the new server to the data center strapped down in the back of a 
pickup.  Fun times.
 
 

Re: Famous operational issues

2021-02-23 Thread Justin Streiner
An interesting sub-thread to this could be:

Have you ever unintentionally crashed a device by running a perfectly
innocuous command?
1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocumented
Sev1 bug that caused two linecards to crash and reload, and take down about
two dozen buildings on campus at the .edu where I used to work.
3. For those that ever had the misfortune of using early versions of the
"bcc" command shell* on Bay Networks routers, which was intended to make
the CLI look and feel more like a Cisco router, you have my
condolences.  One would reasonably expect "delete ?" to respond with a list
of valid arguments for that command.  Instead, it deleted, well...
everything, and prompted an on-site restore/reboot.

* BCC originally stood for "Bay Command Console", but we joked that it really
stood for "Blatant Cisco Clone".



Re: Famous operational issues

2021-02-23 Thread Justin Streiner
Beyond the widespread outages, I have so many personal war stories that
it's hard to pick a favorite.

My first job out of college in the mid-late 90s was at an ISP in Pittsburgh
that I joined pretty early in its existence, and everyone did a bit of
everything. I was hired to do sysadmin stuff, networking, pretty much
whatever was needed. About a year after I started, we brought up a new mail
system with an external RAID enclosure for the mail store itself.  One day,
we saw indications that one of the disks in the RAID enclosure was starting
to fail, so I scheduled a maintenance window to replace the disk and let
the controller rebuild the data and integrate it back into the RAID set.
No big worries, right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that when I pulled
the failing drive would be a fine time to panic, and more or less turn
itself into a bit-blender, and take all the mailstore down with it.  After
a few hours of watching fsck make no progress on anything, in terms of
trying to un-fsck the mailstore, we made the decision in consultation with
the CEO to pull the plug on trying to bring the old RAID enclosure back to
life, and focus on finding suitable replacement hardware and rebuild from
scratch.  We also discovered that the most recent backups of the mailstore
were over a month old :(

I think our CEO ended up driving several hours to procure a suitable
enclosure.  By the time we got the enclosure installed, filesystems built,
and got whatever tape backups we had restored, and tested the integrity of
the system, it was now Thursday around 8 AM. Coincidentally, that was the
same day the company hosted a big VIP gathering (the mayor was there, along
with lots of investors and other bigwigs), so I had to come back and put on
a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
about the previous 3 days.  I still don't know how I got home that night
without wrapping my vehicle around a utility pole (due to being over-tired,
not due to alcohol).

Many painful lessons learned over that stretch of days, as is often the case
as a company grows from startup mode and builds more robust technology and
business processes as a consequence of growth.

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>


Re: Famous operational issues

2021-02-23 Thread Justin Streiner
On Thu, Feb 18, 2021 at 5:38 PM Warren Kumari  wrote:

>
> 2: A somewhat similar thing would happen with the Ascend TNT Max, which
> had side-to-side airflow. These were dial termination boxes, and so people
> would install racks and racks of them. The first one would draw in cool air
> on the left, heat it up and ship it out the right. The next one over would
> draw in warm air on the left, heat it up further, and ship it out the
> right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
> with the final one literally on fire, and still passing packets.
>

We had several racks of TNTs at the peak of our dial POP phase, and I
believe we ended up designing baffles for the sides of those racks to pull
in cool air from the front of the rack to the left side of the chassis and
exhaust it out the back from the right side.  It wasn't perfect, but it did
the job.

The TNTs with channelized T3 interfaces were a great way to terminate lots
of modems in a reasonable amount of rack space with minimal cabling.

Thank you
jms


Re: Famous operational issues

2021-02-23 Thread Warren Kumari
On Mon, Feb 22, 2021 at 7:31 PM  wrote:

>
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
>

At one of the AOL datacenters there was some convoluted fire marshal reason
why a specific door could not be locked "during business hours" (?!), and
so there was a guard permanently stationed outside. The door was all the
way around the back of the building, and so basically never used - and so
the guard would fall asleep outside it with a piece of cardboard saying
"Please wake me before entering". He was a nice guy (and it was less faff
than the main entrance), and so we'd either sneak in and just not tell
anyone, or talk loudly while going round the corner so he could pretend to
have been awake the whole time...

W




>
> Took a couple of times before we figured out the connection tho once
> someone made it to the hold button before it actually dumped.
>
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
>
> Some poor operator was holding the halon shot, it's a deadman's switch
> (well, button) and the building was doing its 110db thing could I come
> help? The building is being evac'd.
>
> So my boss who wasn't the sharpest knife in the drawer follows me down
> as I enter and I'm sweating like a pig with a floor panel sucker
> trying to figure out which zone tripped.
>
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
>
> I answered: well, maybe THERE'S A FIRE!!!
>
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
>
> Fun day.
>
> --
> -Barry Shein
>
> Software Tool & Die| b...@theworld.com |
> http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
> The World: Since 1989  | A Public Information Utility | *oo*
>


-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Famous operational issues

2021-02-22 Thread sronan
Let me tell you about my personal favorite.

It’s 2002 and I am working as an engineer for an electronic stock trading
platform (ECN). This platform happened to be the biggest platform for trading
stocks electronically, on some days bigger than NASDAQ itself. This platform
also happened to be run on DOS, FoxPro and a Novell file share, on a cluster of 
roughly 1,000 computers, two of which were the “engine” that matched all of the 
trades.

Well FoxPro has this “feature” where the ESC key halts the running program. We 
had the ability to remote control these DOS/FoxPro machines via some program we 
had written. Someone asked me to check the status of the process running on the 
primary matching engine, and when I was done, out of habit, I hit ESC. Trade 
processing grinds to a halt (phone calls have to be made to the SEC). I 
immediately called the NOC and told them it was me. Next thing I know, someone
from the NOC is at my desk with a screwdriver, pulling the ESC key from my
keyboard. I remained ESC keyless for the next several years until I left the
company. I was hazed pretty good over it, but was essentially given a one time 
pass.



> On Feb 22, 2021, at 7:30 PM, b...@theworld.com wrote:
> 
> 
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
> 
> Took a couple of times before we figured out the connection tho once
> someone made it to the hold button before it actually dumped.
> 
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
> 
> Some poor operator was holding the halon shot, it's a deadman's switch
> (well, button) and the building was doing its 110db thing could I come
> help? The building is being evac'd.
> 
> So my boss who wasn't the sharpest knife in the drawer follows me down
> as I enter and I'm sweating like a pig with a floor panel sucker
> trying to figure out which zone tripped.
> 
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
> 
> I answered: well, maybe THERE'S A FIRE!!!
> 
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
> 
> Fun day.
> 
> -- 
>-Barry Shein
> 
> Software Tool & Die| b...@theworld.com | 
> http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
> The World: Since 1989  | A Public Information Utility | *oo*


Re: Famous operational issues

2021-02-22 Thread bzs


At Boston Univ we discovered the hard way that a security guard's
walkie-talkie could cause a $5,000 (or $10K for the big machine room)
Halon dump.

Took a couple of times before we figured out the connection tho once
someone made it to the hold button before it actually dumped.

Speaking of halon one very hot day I'm goofing off drinking coffee at
a nearby sub shop when the owner tells me someone from the computing
center was on the phone, that never happened before.

Some poor operator was holding the halon shot, it's a deadman's switch
(well, button), and the building was doing its 110 dB thing. Could I come
help? The building is being evac'd.

So my boss who wasn't the sharpest knife in the drawer follows me down
as I enter and I'm sweating like a pig with a floor panel sucker
trying to figure out which zone tripped.

And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.

I answered: well, maybe THERE'S A FIRE!!!

At which point I notice the back of my shoulder is really bothering
me, which I say to him, and he says hmmm there's a big bee on your
back maybe he's stinging you?

Fun day.

-- 
-Barry Shein

Software Tool & Die| b...@theworld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
The World: Since 1989  | A Public Information Utility | *oo*


Re: Famous operational issues

2021-02-22 Thread Patrick W. Gilmore
On Feb 22, 2021, at 7:02 AM, t...@pelican.org wrote:
> On Thursday, 18 February, 2021 22:37, "Warren Kumari"  
> said:
> 
>> 4: Not too long after I started doing networking (and for the same small
>> ISP in Yonkers), I'm flying off to install a new customer. I (of course)
>> think that I'm hot stuff because I'm going to do the install, configure the
>> router, whee, look at me! Anyway, I don't want to check a bag, and so I
>> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
>> pre-9/11!). I'm going through security and the TSA[0] person opens my bag
>> and pulls the router out. "What's this?!" he asks. I politely tell him that
>> it's a router. He says it's not. I'm still thinking that I'm the new
>> hotness, and so I tell him in a somewhat condescending way that it is, and
>> I know what I'm talking about. He tells me that it's not a router, and is
>> starting to get annoyed. I explain using my "talking to a 5 year old" voice
>> that it most certainly is a router. He tells me that lying to airport
>> security is a federal offense, and starts looming at me. I adjust my
>> attitude and start explaining that it's like a computer and makes the
>> Internet work. He gruffly hands me back the router, I put it in my bag and
>> scurry away. As I do so, I hear him telling his colleague that it wasn't a
>> router, and that he certainly knows what a router is, because he does
>> woodwork...
> 
> Here in the UK we avoid that issue by pronouncing the packet-shifter as 
> "rooter", and only the wood-working tool as "rowter" :)

So wrong.

A “root” server is part of the DNS. A “route” server is part of BGP.


> Of course, it raises a different set of problems when talking to the 
> Australians…

Everything is weird down under. But I still like them. :-)

-- 
TTFN,
patrick



Re: Famous operational issues

2021-02-22 Thread Owen DeLong


> On Feb 18, 2021, at 9:04 PM, Jen Linkova  wrote:
> 
> On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari  wrote:
>> 4: Not too long after I started doing networking (and for the same small ISP 
>> in Yonkers), I'm flying off to install a new customer. I (of course) think 
>> that I'm hot stuff because I'm going to do the install, configure the 
>> router, whee, look at me! Anyway, I don't want to check a bag, and so I 
>> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all 
>> pre-9/11!). I'm going through security and the TSA[0] person opens my bag 
>> and pulls the router out. "What's this?!" he asks. I politely tell him that 
>> it's a router. He says it's not. I'm still thinking that I'm the new 
>> hotness, and so I tell him in a somewhat condescending way that it is, and I 
>> know what I'm talking about. He tells me that it's not a router, and is 
>> starting to get annoyed. I explain using my "talking to a 5 year old" voice 
>> that it most certainly is a router. He tells me that lying to airport 
>> security is a federal offense, and starts looming at me. I adjust my 
>> attitude and start explaining that it's like a computer and makes the 
>> Internet work. He gruffly hands me back the router, I put it in my bag and 
>> scurry away. As I do so, I hear him telling his colleague that it wasn't a 
>> router, and that he certainly knows what a router is, because he does 
>> woodwork...
> 
> OK, Warren, achievement unlocked. You've just made a network engineer
> google 'router'
> 
> P.S. I guess I'm obliged to tell a story if I respond to this thread...so...
> "Servers and the ice cream factory".
> Late spring/early summer in Moscow. The temperature above 30C (86°F).
> I worked for a local content provider.
> Aircons in our server room died, the technician ETA was 2 days ( I
> guess we were not the only ones with aircon problems).
> So we drove to the nearby ice cream factory and got *a lot* of dry
> ice. Then we had a roster: every few hours one person took a deep
> breath, grabbed a box of dry ice, ran into the server room and emptied
> the box on top of the racks. The backup person was watching through
> the glass door - just in case, you know, ready to start the rescue
> operation.
> We (and the servers) survived till the technician arrived. And we had
> a lot of dry ice to cool the beer..
> 
> -- 
> SY, Jen Linkova aka Furry

During a wood-working project for the Southern California Linux Expo (the tech
team that (among other things) runs the network for the show was building new
equipment carts), I came up with the following meme:



[I don’t know if NANOG will pass the image despite its small size, so textual
description: A bandaged hand with the index finger amputated at the second
knuckle with overlaid red text stating “Careless Routing May Lead to Urgent
Test of Self Healing Network”]

Fortunately, we didn’t have any such issues with the router, though we did have
one person suffer a crushed toe from a cabinet tip-over. Fortunately, the
person made a full recovery.

Owen



RE: Famous operational issues

2021-02-22 Thread Tony Wicks
Many years ago I experienced a very similar thing. The DC/Integrator I worked
for outsourced the co-location and operation of mainframe services for several
banks and government organisations. One of these banks had a significant
investment in AS/400s, and they decided that using our datacentres was so much
hassle and expense that they would start putting those nice small AS/400s
in computer rooms in their office buildings instead. One particular computer
room contained large line printers that the developers would use to print out
whatever it is such people print out. One Saturday morning I received a frantic
call from the customer to say that all their primary production AS/400s had
gone offline. After a short investigation I realised that all the offline
devices were in this particular computer room. It turns out that one of the
developers had brought his six-year-old son to work that Saturday and, upon
retrieval of a printout, said son had dutifully followed dad into the computer
room and was unable to resist the big red button sitting exposed on the wall by
the door. Shortly thereafter the embarrassed customer decided that perhaps it
was worth relocating their AS/400s to our expensive datacentres.



> 
>  During my younger days, that button was used a few times by the
> operator of a VM/370 to regain control from someone with a "curious
> mind" *cough* *cough*...
> 
Two horror stories I remember from long ago when I was a console jockey for a 
federal space agency that will remain nameless :P

1. A coworker brought her daughter to work with her on a Saturday overtime
shift because she couldn't get a babysitter. She parked the kid with a coloring
book and a pile of crayons at the only table in the console room with some
space, right next to the master console for our 3081. I asked her to make sure
she was well away from the console, and as she reached over to scoot the girl
and her coloring books further away she slipped, and reached out to steady
herself. Yep, planted her finger right down on the IML button (plexi covers? We
don' need no STEENKIN' plexi covers!). MVS and VM vanished, two dozen tape
drives rewound and several hours' worth of data merge jobs went blooey.




Re: Famous operational issues

2021-02-22 Thread Dovid Bender
On Mon, Feb 22, 2021 at 2:05 PM Warren Kumari  wrote:

>
>
> On Mon, Feb 22, 2021 at 12:50 PM Regis M. Donovan 
> wrote:
>
>> On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
>> > And to put it on topic, cover your EPOs
>>
>> I worked somewhere with an uncovered EPO, which was okay until we had a
>> telco tech in who was used to a different data center where a similar
>> looking button controlled the door access, so he reflexively hit it
>> on his way out to unlock the door.  Oops.
>>
>> Also, consider what's on generator and what's not.  I worked in a corporate
>> data center where we lost power.  The backup system kept all the machines
>> running, but the ventilation system was still down, so it was very warm very
>> fast as everyone went around trying to shut servers down gracefully while
>> other folks propped the doors open to get some cooler air in.
>>
>
> That reminds me of another one...
>
> In parts of NYC, there are noise abatement requirements, and so many
> places have their generators mounted on the roof -- it's cheap real-estate,
> the exhaust is easier, the noise issues are less, etc.
>
> The generators usually have a smallish diesel tank, and then a much larger
> one in the basement (diesel is heavy)...
>
> So, one of the buildings that I was in was really good about testing their
> gensets - they'd do weekly tests (usually at night), and the generators
> always worked perfectly -- right up until the time that it was actually
> needed.
> The generator fired up, the lights kept blinking, the disks kept spinning
> - but the transfer pump that pumped diesel from the basement to the roof
> was one of the few things that was not on the generator.
>
>
When we were looking at one of the big carrier hotels in NYC, they said
that they had the same issue (could be it was the same one). The elevators
were out as well. They resorted to having techs climb up and down 9
flights of stairs all day long with 5 gallon buckets of diesel, pouring
it into the generator.

>


Re: Famous operational issues

2021-02-22 Thread Jethro R Binks
On Fri, 19 Feb 2021, Andy Ringsmuth wrote:

> > I explain using my "talking to a 5 year old" voice that it 
> > most certainly is a router. He tells me that lying to airport security 
> > is a federal offense, and starts looming at me. I adjust my attitude 
> > and start explaining that it's like a computer and makes the Internet 
> > work. He gruffly hands me back the router, I put it in my bag and 
> > scurry away. As I do so, I hear him telling his colleague that it 
> > wasn't a router, and that he certainly knows what a router is, because 
> > he does woodwork…
> 
> Well, in his defense, he wasn’t wrong…   :-)

This is why, in the UK, we tend to pronounce "router" as "router", and 
"router" as "router", so there's no confusion.

You're welcome.

Jethro.

.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.


Re: Famous operational issues

2021-02-22 Thread Warren Kumari
On Mon, Feb 22, 2021 at 12:50 PM Regis M. Donovan 
wrote:

> On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
> > And to put it on topic, cover your EPOs
>
> I worked somewhere with an uncovered EPO, which was okay until we had a
> telco tech in who was used to a different data center where a similar
> looking button controlled the door access, so he reflexively hit it
> on his way out to unlock the door.  Oops.
>
> Also, consider what's on generator and what's not.  I worked in a corporate
> data center where we lost power.  The backup system kept all the machines
> running, but the ventilation system was still down, so it was very warm very
> fast as everyone went around trying to shut servers down gracefully while
> other folks propped the doors open to get some cooler air in.
>

That reminds me of another one...

In parts of NYC, there are noise abatement requirements, and so many places
have their generators mounted on the roof -- it's cheap real-estate, the
exhaust is easier, the noise issues are less, etc.

The generators usually have a smallish diesel tank, and then a much larger
one in the basement (diesel is heavy)...

So, one of the buildings that I was in was really good about testing their
gensets - they'd do weekly tests (usually at night), and the generators
always worked perfectly -- right up until the time that it was actually
needed.
The generator fired up, the lights kept blinking, the disks kept spinning -
but the transfer pump that pumped diesel from the basement to the roof was
one of the few things that was not on the generator.

W



>
> --r
>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Famous operational issues

2021-02-22 Thread Regis M. Donovan
On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
> And to put it on topic, cover your EPOs

I worked somewhere with an uncovered EPO, which was okay until we had a
telco tech in who was used to a different data center where a similar
looking button controlled the door access, so he reflexively hit it
on his way out to unlock the door.  Oops.

Also, consider what's on generator and what's not.  I worked in a corporate
data center where we lost power.  The backup system kept all the machines
running, but the ventilation system was still down, so it was very warm very
fast as everyone went around trying to shut servers down gracefully while
other folks propped the doors open to get some cooler air in.

--r



Re: Famous operational issues

2021-02-22 Thread Christopher Morrow
Long ago, in a galaxy far away I worked for a gov't contractor on site
at a gov't site...

We had our own cute little datacenter, and our 4 building complex had
a central power distribution setup from utility -> buildings.
It was really quite nice :) (the job, the buildings, the power and
cute little datacenter)

One fine Tues afternoon ~2pm local time, the building engineers
decided they would make a copy of the key used to turn the main /
utility power off...
Of course they also needed to make sure their copy worked, so... they
put the key in and turned it.

Shockingly, the key worked! and no power was provided to the buildings :(
It was very suddenly very dark and very quiet... (then the yelling started)

Ok, fast forward 7 days... rerun the movie... Yes, the same building
engineers made a new copy, and .. tested that new copy in the same
manner.

For neither of these events did someone tell the rest of us (and our
customers): "Hey, we MAY interrupt power to the buildings... FYI, BTW,
make sure your backups are current..." I recall we got the name of the
engineer the 1st time around, but not the second.

On Mon, Feb 22, 2021 at 12:26 PM Tony Finch  wrote:
>
> Patrick W. Gilmore  wrote:
> >
> >   Me: Did you order that EPO cover?
> >   Her: Nope.
>
> There are apparently two kinds of EPO cover:
>
> - the kind that stops you from pressing the button by mistake;
>
> - and the kind that doesn't, and instead locks the button down to make
> sure it isn't un-pressed until everything is safe.
>
> We had a series of incidents similar to yours, so an EPO cover was
> belatedly installed. We learned about the second kind of EPO cover when a
> colleague proudly demonstrated that the EPO button could no longer be
> pressed by accident, or so he thought.
>
> Tony.
> --
> f.anthony.n.finch  http://dotat.at/
> the quest for freedom and justice can never end


Re: Famous operational issues

2021-02-22 Thread Warren Kumari
On Mon, Feb 22, 2021 at 7:09 AM t...@pelican.org  wrote:

> On Thursday, 18 February, 2021 22:37, "Warren Kumari" 
> said:
>
> > 4: Not too long after I started doing networking (and for the same small
> > ISP in Yonkers), I'm flying off to install a new customer. I (of course)
> > think that I'm hot stuff because I'm going to do the install, configure the
> > router, whee, look at me! Anyway, I don't want to check a bag, and so I
> > stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
> > pre-9/11!). I'm going through security and the TSA[0] person opens my bag
> > and pulls the router out. "What's this?!" he asks. I politely tell him that
> > it's a router. He says it's not. I'm still thinking that I'm the new
> > hotness, and so I tell him in a somewhat condescending way that it is, and
> > I know what I'm talking about. He tells me that it's not a router, and is
> > starting to get annoyed. I explain using my "talking to a 5 year old" voice
> > that it most certainly is a router. He tells me that lying to airport
> > security is a federal offense, and starts looming at me. I adjust my
> > attitude and start explaining that it's like a computer and makes the
> > Internet work. He gruffly hands me back the router, I put it in my bag and
> > scurry away. As I do so, I hear him telling his colleague that it wasn't a
> > router, and that he certainly knows what a router is, because he does
> > woodwork...
>
> Here in the UK we avoid that issue by pronouncing the packet-shifter as
> "rooter", and only the wood-working tool as "rowter" :)
>
> Of course, it raises a different set of problems when talking to the
> Australians...
>

Yes. I discovered this when walking around Sydney wearing my "I have root
@ Google" t-shirt got me some odd looks/snickers...

W




>
> Cheers,
> Tim.
>
>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Famous operational issues

2021-02-22 Thread Tony Finch
Patrick W. Gilmore  wrote:
>
>   Me: Did you order that EPO cover?
>   Her: Nope.

There are apparently two kinds of EPO cover:

- the kind that stops you from pressing the button by mistake;

- and the kind that doesn't, and instead locks the button down to make
sure it isn't un-pressed until everything is safe.

We had a series of incidents similar to yours, so an EPO cover was
belatedly installed. We learned about the second kind of EPO cover when a
colleague proudly demonstrated that the EPO button could no longer be
pressed by accident, or so he thought.

Tony.
-- 
f.anthony.n.finch  http://dotat.at/
the quest for freedom and justice can never end


Re: Famous operational issues

2021-02-22 Thread Bruce H McIntosh




On 2/22/21 9:14 AM, Alain Hebert wrote:

     Well...

     During my younger days, that button was used a few times by the
operator of a VM/370 to regain control from someone with a "curious
mind" *cough* *cough*...


Two horror stories I remember from long ago when I was a console jockey 
for a federal space agency that will remain nameless :P


1. A coworker brought her daughter to work with her on a Saturday 
overtime shift because she couldn't get a babysitter. She parked the kid 
with a coloring book and a pile of crayons at the only table in the 
console room with some space, right next to the master console for our 
3081. I asked her to make sure she was well away from the console, and as 
she reached over to scoot the girl and her coloring books further away 
she slipped, and reached out to steady herself. Yep, planted her finger 
right down on the IML button (plexi covers? We don' need no STEENKIN' 
plexi covers!). MVS and VM vanished, two dozen tape drives rewound and 
several hours' worth of data merge jobs went blooey.


2. The 3081 was water cooled via a heat exchanger. The building chilled 
water feed had a very old, very clogged filter that was bypassed until 
it could be replaced. One day a new maintenance foreman came through the 
building doing his "clipboard and harried expression" thing, and spotted 
the filter in bypass (NO, I don't know WHY it hadn't been red-tagged. 
Someone clearly dropped that ball.) He thought, "Well that's not right" 
and reset all the valves to put it back inline, which of course, pretty 
much killed the chilled water flow through the heat exchanger. First 
thing we knew about it in Operations was when the 3081 started throwing 
thermal alarms and MVS crashed hard. IBM had to replace several modules 
in the CPUs.


--

Bruce H. McIntosh
Network Engineer II
University of Florida Information Technology
b...@ufl.edu
352-273-1066


Re: Famous operational issues

2021-02-22 Thread Alain Hebert

    Well...

    During my younger days, that button was used a few times by the
operator of a VM/370 to regain control from someone with a "curious
mind" *cough* *cough*...


-
Alain Hebert                                aheb...@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911  http://www.pubnix.net  Fax: 514-990-9443

On 2/20/21 4:07 AM, Henry Yen wrote:

On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:

In 1994, there was a major earthquake near the city of Los Angeles. City hall 
had to be evacuated and it would take over a year to reinforce the building to 
make it habitable again. My company moved all the systems in the basement of 
city hall to a new datacenter a mile or so away. After the install, we spent 
more than a week coaxing their ancient (even for 1994) machines back online, 
such as a Prime Computer and an AS400 with tons of DASD. Well, tons of 
cabinets, certainly less storage than my watch has now.

I was in the DC going over something with the lady in charge when someone 
walked in to ask her something. She said “just a second”. That person took 
one step to the side of the door and leaned against the wall - right on an EPO 
which had no cover.

Have you ever heard an entire row of DASD spin down instantly? Or taken 40 
minutes to IPL an AS400? In the middle of the business day? For the second most 
populous city in the country?

Me: Maybe you should get a cover for that?
Her: Good idea.

Couple weeks later, in the same DC, going over final checklist. A fedex guy 
walks in. (To this day, no idea how he got in a supposedly locked DC.) She says 
“just a second”, and I get a very strong deja vu feeling. He takes one step 
to the side and leans against the wall.

Me: Did you order that EPO cover?
Her: Nope.

some of the ibm 4300 series mini-mainframes came with a console terminal
that had a very large, raised (completely not flush), alternate power
button on the upper panel of the keyboard, facing the operator. in later
versions, the button was inset in a little open box with high sides. in
earlier versions, there was just a pair of raised ribs on either side of the
button. in the earliest version, if that panel needed to be replaced, the
replacement part didn't even have those protective ribs, this huge button
was just sitting there. on our 4341, someone had dropped the keyboard during
installation and the damaged panel was replaced with the
no-protection-whatsoever part.

i had an operator who, working a double shift into the overnight run,
fell asleep and managed to bang his head square on the button.
the overnight jobs running were left in various states of ruin.

third party manufacturers had an easy sell for lucite power/EPO button covers.

--
Henry Yen   Aegis Information Systems, Inc.
Senior Systems Programmer   Hicksville, New York




Re: Famous operational issues

2021-02-22 Thread t...@pelican.org
On Thursday, 18 February, 2021 22:37, "Warren Kumari"  said:

> 4: Not too long after I started doing networking (and for the same small
> ISP in Yonkers), I'm flying off to install a new customer. I (of course)
> think that I'm hot stuff because I'm going to do the install, configure the
> router, whee, look at me! Anyway, I don't want to check a bag, and so I
> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
> pre-9/11!). I'm going through security and the TSA[0] person opens my bag
> and pulls the router out. "What's this?!" he asks. I politely tell him that
> it's a router. He says it's not. I'm still thinking that I'm the new
> hotness, and so I tell him in a somewhat condescending way that it is, and
> I know what I'm talking about. He tells me that it's not a router, and is
> starting to get annoyed. I explain using my "talking to a 5 year old" voice
> that it most certainly is a router. He tells me that lying to airport
> security is a federal offense, and starts looming at me. I adjust my
> attitude and start explaining that it's like a computer and makes the
> Internet work. He gruffly hands me back the router, I put it in my bag and
> scurry away. As I do so, I hear him telling his colleague that it wasn't a
> router, and that he certainly knows what a router is, because he does
> woodwork...

Here in the UK we avoid that issue by pronouncing the packet-shifter as 
"rooter", and only the wood-working tool as "rowter" :)

Of course, it raises a different set of problems when talking to the 
Australians...

Cheers,
Tim.




Re: Famous operational issues

2021-02-21 Thread Ben Cannon
I’m embarrassed to say, I’ve done this.

Ms. Lady Benjamin PD Cannon, ASCE
6x7 Networks & 6x7 Telecom, LLC 
CEO 
b...@6by7.net
"The only fully end-to-end encrypted global telecommunications company in the 
world.”

FCC License KJ6FJJ

Sent from my iPhone via RFC1149.

> On Feb 19, 2021, at 12:55 AM, Wolfgang Tremmel  
> wrote:
> 
> Do you remember the Cisco HDCI connectors? 
> https://en.wikipedia.org/wiki/HDCI
> 
> I once shipped a Cisco 4500 plus some cables to a remote data center and 
> asked the local guys to cable them for me.
> With Cisco you could check the cable type and if they were properly attached. 
> They were not.
> 
> I asked for a check and the local guy confirmed to me three times that the 
> cables were properly plugged. 
> At the end I gave up, and took the 3 hour drive to the datacenter to check 
> myself.
> 
> Problem was that, while the casing of the connector is asymmetrical, the pins 
> inside are symmetrical.
> And the local guy was quite strong.
> 
> Yes, he managed to plug in the cables 180° flipped, bending the case, but he 
> got them in.
> He was quite embarrassed when I fixed the cabling problem in 10 seconds.
> 
> That must have been 1995 or so
> 
> Wolfgang
> 
> 
> 
>> On 16. Feb 2021, at 20:37, John Kristoff  wrote:
>> 
>> Which examples would make up your top three?
> 
> -- 
> Wolfgang Tremmel 
> 
> Phone +49 69 1730902 0  | wolfgang.trem...@de-cix.net
> Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: 
> AG Cologne, HRB 51135
> DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | 
> Germany | www.de-cix.net
> 


Re: Famous operational issues

2021-02-20 Thread Jörg Kost

Oh,

I actually wanted to keep this for my memoirs, but if we can name dangerous 
datacenter operational issues... sometime in the 2000s:


Somebody ran their own datacenter, which
- once had an active ant colony living under the raised floor and in the 
climate system,
- for a while had several electrical grounding defects, leading to the 
work instruction “don’t touch any metallic or conducting 
materials”,
- for a minute had a “look what we have bought on eBay” UPS 
system, until it started to roast after being turned on,
- from time to time had climate issues, with room temperatures peaking 
at around 68 degrees centigrade, and yes, some equipment 
survived and even continued to work.


Decided not to go back there after “look what we have bought on eBay, 
an argon fire extinguisher, we just need to mount it”.


On 20 Feb 2021, at 10:15, Eric Kuhnke wrote:

From a datacenter ROI and economics, cooling, HVAC perspective that might
just be the best colo customer ever. As long as they're paying full price
for the cabinet and nothing is *dangerous* about how they've hung the 2U
server vertically, using up all that space for just one thing has to be a
lot better than a customer that makes full and efficient use of space and
all the amperage allotted to them.




Re: Famous operational issues

2021-02-20 Thread Clayton Zekelman



Not a famous operational issue, but in 2000, we had a major outage of 
our dialup modem pool.


The owner of the building was re-skinning the outside using Styrofoam 
and stucco.  A bunch of the Styrofoam
had blocked the roof drains on the podium section of the building, 
immediately above our equipment room.


A flash rainstorm filled the entire flat roof, and water came back in 
over the flashings, and poured directly in
to our dialup modem pool through the hole in the concrete roof deck 
where the drain pipe protruded through.


In retrospect, it was a monumentally stupid place to put our main 
modem pool, but we didn't realize what was above the drop ceiling - 
that it was the roof, not the other 11 floors of the building.


1 bay of 6 shelves of USR TC 1000 HiperDSPs were now very wet and 
blinking funny patterns on their LEDs.


Fortunately, our vendor in Toronto (4 hour drive away) had stock of 
equipment that another customer kept delaying shipment on.  They got 
their staff in and started un-boxing and slotting cards.  We spent a 
few hours tearing out the old gear and getting ready for replacements.

We left Windsor, Ontario at around 12:00am - same time they left 
Toronto, heading towards us.  We coordinated
a meet at one of the rural exits along Highway 401 at a closed gas 
station at around 2am.


Everything was going so well until a cop pulled up, and asked us what 
we were doing, as we were slinging modem chassis between the back of 
the vendor's SUV and our van... We calmly explained what happened.  He 
looked between us a couple of times, shook his head and said "well, 
good luck with that", got back in his car and drove away.

We had everything back online within 14 hours of the initial outage.

At 02:37 PM 16/02/2021, John Kristoff wrote:

Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps  the
most notorious and likely to top many lists including mine.  So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session.  I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective.  I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John


--

Clayton Zekelman
Managed Network Systems Inc. (MNSi)
3363 Tecumseh Rd. E
Windsor, Ontario
N8W 1H4

tel. 519-985-8410
fax. 519-985-8409



Re: Famous operational issues

2021-02-20 Thread Eric Kuhnke
From a datacenter ROI and economics, cooling, HVAC perspective that might
just be the best colo customer ever. As long as they're paying full price
for the cabinet and nothing is *dangerous* about how they've hung the 2U
server vertically, using up all that space for just one thing has to be a
lot better than a customer that makes full and efficient use of space and
all the amperage allotted to them.


On Thu, Feb 18, 2021 at 11:38 AM t...@pelican.org  wrote:

> On Thursday, 18 February, 2021 16:23, "Seth Mattinen" 
> said:
>
> > I had a customer that tried to stack their servers - no rails except the
> > bottom most one - using 2x4's between each server. Up until then I
> > hadn't imagined anyone would want to fill their cabinet with wood, so I
> > made a rule to ban wood and anything tangentially related (cardboard,
> > paper, plastic, etc.). Easier to just ban all things. Fire reasons too
> > but mainly I thought a cabinet full of wood was too stupid to allow.
>
> On the "stupid racking" front, I give you most of a rack dedicated to a
> single server.  Not all that high a server, maybe 2U or so, but *way* too
> deep for the rack, so it had been installed vertically.  By looping some
> fairly hefty chain through the handles on either side of the front of the
> chassis, and then bolting the four chain ends to the four rack posts.  I
> wish I'd kept pictures of that one.  Not flammable, but a serious WTF
> moment.
>
> Cheers,
> Tim.
>
>
>


Re: Famous operational issues

2021-02-20 Thread Henry Yen
On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
> In 1994, there was a major earthquake near the city of Los Angeles. City hall 
> had to be evacuated and it would take over a year to reinforce the building 
> to make it habitable again. My company moved all the systems in the basement 
> of city hall to a new datacenter a mile or so away. After the install, we 
> spent more than a week coaxing their ancient (even for 1994) machines back 
> online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons 
> of cabinets, certainly less storage than my watch has now.
> 
> I was in the DC going over something with the lady in charge when someone 
> walked in to ask her something. She said “just a second”. That person 
> took one step to the side of the door and leaned against the wall - right on 
> an EPO which had no cover.
> 
> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 
> minutes to IPL an AS400? In the middle of the business day? For the second 
> most populous city in the country?
> 
>   Me: Maybe you should get a cover for that?
>   Her: Good idea.
> 
> Couple weeks later, in the same DC, going over final checklist. A fedex guy 
> walks in. (To this day, no idea how he got in a supposedly locked DC.) She 
> says “just a second”, and I get a very strong deja vu feeling. He takes 
> one step to the side and leans against the wall.
> 
>   Me: Did you order that EPO cover?
>   Her: Nope.

some of the ibm 4300 series mini-mainframes came with a console terminal
that had a very large, raised (completely not flush), alternate power
button on the upper panel of the keyboard, facing the operator. in later
versions, the button was inset in a little open box with high sides. in
earlier versions, there was just a pair of raised ribs on either side of the
button. in the earliest version, if that panel needed to be replaced, the
replacement part didn't even have those protective ribs, this huge button
was just sitting there. on our 4341, someone had dropped the keyboard during
installation and the damaged panel was replaced with the
no-protection-whatsoever part.

i had an operator who, working a double shift into the overnight run,
fell asleep and managed to bang his head square on the button.
the overnight jobs running were left in various states of ruin.

third party manufacturers had an easy sell for lucite power/EPO button covers.

--
Henry Yen   Aegis Information Systems, Inc.
Senior Systems Programmer   Hicksville, New York


Re: Famous operational issues

2021-02-19 Thread Tom Hill
On 16/02/2021 22:08, Jared Mauch wrote:
> I was thinking about how we need a war stories nanog track. My favorite was 
> being on call when the router was stolen. 

Enough time has (probably) elapsed since my escapades in a small data
centre in Manchester. The RFO was ten pages long, and I don't want to
spoil the ending, but ... I later discovered that Cumulus' then VP of
Engineering had elevated me to a veritable 'Hall of Infamy' for the
support ticket attached to that particular tale.

One day I'll be able to buy the guy that handled it a *lot* of whisky.
He deserved it.

-- 
Tom


Re: Famous operational issues

2021-02-19 Thread Sabri Berisha
- On Feb 19, 2021, at 3:07 AM, Daniel Karrenberg d...@ripe.net wrote:

Hi,

> Lessons: HW/SW mono-cultures are dangerous. Input testing is good
> practice at all levels of software. Operational co-ordination is key in
> times of crisis.

Well... Here is a very similar, fairly recent one. Albeit in this case, the
opposite is true: running one software train would have prevented an outage.
Some members on this list (hi, Brian!) will recognize the story.

Group XX within $company decided to deploy EVPN. All of the backbone was
running a single $vendor, but different software trains. Turns out that
between an early draft, implemented in version X, and the RFC, implemented in
version Y, a change was made in NLRI formats which was not backwards compatible.

Version X was in use on virtually all DC egress boxes, version Y was in use
on route reflectors. The moment the first EVPN NLRI was advertised, the 
entire backbone melted down. Dept-wide alert issued (at night), people trying
to log on to the VPN. Oh wait, the VPN requires yubikey, which requires the
corp network to access the interwebs, which is not accessible due to said
issue.

And, despite me complaining since the day of hire, no out of band network.
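
The lock-out described above is a circular dependency: fixing the backbone needs operator access, which needs the VPN, which needs the token check, which needs the very network that is down. A minimal sketch of how such a loop could be spotted ahead of time follows. The node names and the dependency map are made up for illustration and are not taken from the actual environment.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    `deps` maps each component to the components it needs in order to work
    (or to be repaired). Plain depth-first search with three node states.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current DFS path: we found a loop.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical model of the incident: every name here is an assumption.
NO_OOB = {
    "backbone": ["operator-access"],      # repairing it needs someone logged in
    "operator-access": ["vpn"],
    "vpn": ["yubikey-auth"],
    "yubikey-auth": ["corp-network"],
    "corp-network": ["backbone"],
}

# Same model, but operators reach the routers over an out-of-band console.
WITH_OOB = dict(NO_OOB, **{"operator-access": ["oob-console"], "oob-console": []})

if __name__ == "__main__":
    print(find_cycle(NO_OOB))
    print(find_cycle(WITH_OOB))
```

In this toy model the out-of-band console is what breaks the loop, which is the point being made about the missing OOB network.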

I didn't stay much longer after that.

Thanks,

Sabri 



Re: Famous operational issues

2021-02-19 Thread Warren Kumari
At a previous company we had a large number of Foundry Networks layer-3
switches. They participated in our OSPF network and had a *really* annoying
bug. Every now and then one of them would get somewhat confused and would
corrupt its OSPF database (there seemed to be some pointer that would end
up off by one).

It would then cleverly realize that its LSDB was different to everyone
else's and so would flood this corrupt database to all other OSPF speakers.
Some vendors would do a better job of sanity checking the LSAs and would
ignore the bad LSAs, other vendors would install them... and now you have
different link state databases on different devices and OSPF becomes
unhappy.

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5
Mask 10.160.8.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.

If you look at the output, you can see that there is some garbage in the
LSID field and the bit that should be there is now in the Mask section. I
also saw some more extreme versions of the same bug; in my favorite example
the mask was 115.104.111.119 and further down there was 105.110.116.114 --
if you take these as decimal numbers and look up their ASCII values we get
"show" and "inte" -- I wrote a tool to scrape bits from these errors and
ended up with a large amount of the CLI help text.
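
The scraping tool itself is not shown in the post, so the following is only a rough sketch of the idea: pull the "Mask" field out of each BADLSAMASK line and read every octet as one ASCII character. The sample log text, the regular expression and the function names are assumptions made for this example. Note that 105.110.116.114 as written above decodes to "intr"; the last octet would have to be 101 to give "inte", so one of the two was presumably mistyped.

```python
import re

# Sample input modelled on the messages quoted above. The last two masks are
# the "extreme" values mentioned in the text; the LSIDs on those lines are
# placeholders invented for this sketch.
LOG = """\
%OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 Mask 10.160.8.0 from 10.178.255.252
%OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.0.0.1 Mask 115.104.111.119 from 10.178.255.252
%OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.0.0.2 Mask 105.110.116.114 from 10.178.255.252
"""

MASK_RE = re.compile(r"Mask (\d{1,3}(?:\.\d{1,3}){3})")


def quad_to_text(quad):
    """Read each octet of a dotted quad as one ASCII character.

    Octets outside the printable ASCII range are shown as '.'.
    """
    chars = []
    for octet in quad.split("."):
        value = int(octet)
        chars.append(chr(value) if 32 <= value < 127 else ".")
    return "".join(chars)


def scrape(log):
    """Return the decoded text of every Mask field found in the log, in order."""
    return [quad_to_text(mask) for mask in MASK_RE.findall(log)]


if __name__ == "__main__":
    for fragment in scrape(LOG):
        print(repr(fragment))
```

An ordinary mask such as 10.160.8.0 decodes to nothing printable, which makes the leaked text fragments easy to pick out.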




Many years ago I worked for a small Mom-and-Pop type ISP in New York state
(I was the only network / technical person there) -- it was a very free
wheeling place and I built the network by doing whatever made sense at the
time.

One of my "favorite" customers (Joe somebody) was somehow related to the
owner of the ISP and was a gamer. This was back in the day when the gaming
magazines would give you useful tips like "Type 'tracert $gameserver' and
make sure that there are less than N hops".  Joe would call up tech
support, me, the owner, etc and complain that there were N+3 hops and most
of them were in our network. I spent much time explaining things about
packet-loss, latency, etc but couldn't shake his belief that hop count was
the only metric that mattered.
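
The argument that hop count says little about latency can be put in a few lines. This is a toy illustration only: the per-hop delays are invented numbers, not measurements from that network.

```python
# One-way delay contributed by each hop, in milliseconds.
# All figures are invented for illustration only.
FEW_HOPS = [40, 55, 30]           # 3 hops over slow or congested links
MANY_HOPS = [1, 1, 2, 2, 1, 3]    # 6 hops inside one well-run network


def summarize(per_hop_ms):
    """Return (hop count, total one-way delay in ms) for a path."""
    return len(per_hop_ms), sum(per_hop_ms)


def lower_latency(path_a, path_b):
    """Pick the path with the smaller total delay, ignoring hop count."""
    return min((path_a, path_b), key=sum)


if __name__ == "__main__":
    for name, path in (("few hops", FEW_HOPS), ("many hops", MANY_HOPS)):
        hops, delay = summarize(path)
        print(f"{name}: {hops} hops, {delay} ms")
```

Here the six-hop path is the faster one by a wide margin, which is the case Joe would not accept.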

Finally, one night he called me at home well after midnight (no, I didn't
give him my home phone number, he looked me up in the phonebook!) to
complain that his gaming was suffering because it was "too many hops to get
out of your network". I finally snapped and built a static GRE tunnel from
the RAS box that he connected to all over the network -- it was a thing of
beauty, it went through almost every device that we owned and took the most
convoluted path I could come up with. "Yay!", I figured, "now I can
demonstrate that latency is more important than hop count" and I went to
bed.

The next morning I get a call from him. He is ecstatic and wildly impressed
by how well the network is working for him now and how great his gaming
performance is. "Oh well", I think, "at least he is happy and will leave me
alone now". I don't document the purpose of this GRE anywhere and after
some time forget about it.

A few months later I am doing some routine cleanup work and stumble across
a weird looking tunnel -- it's bizarre, it goes all over the place and is
all kinds of crufty -- there are static routes and policy routing and
bizarre things being done on the RADIUS server to make sure some user
always gets a certain IP... I look in my pile of notes and old configs and
then decide to just yank it out.

That night I get an enraged call (at home again) from Joe *screaming* that
the network is all broken again because it is now way too many hops to get
out of the network and that people keep shooting him...

*What I learnt from this:*
1: Make sure you document everything (and no, the network isn't
documentation)
2: Gamers are weird.
3: Making changes to your network in anger provides short term pleasure but
long term pain.



On Fri, Feb 19, 2021 at 1:10 PM Andrew Gallo  wrote:

>
>
> On 2/16/2021 2:37 PM, John Kristoff wrote:
> > Friends,
> >
> > I'd like to start a thread about the most famous and widespread Internet
> > operational issues, outages or implementation incompatibilities you
> > have seen.
> >
> > Which examples would make up your top three?
>
>
> I don't believe I've seen this in any of the replies, but the AT&T
> cascading switch crashes of 1990 is a good one.  This link even has some
> pseudocode
> https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
>
>

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Famous operational issues

2021-02-19 Thread Andrew Gallo




On 2/16/2021 2:37 PM, John Kristoff wrote:

Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?



I don't believe I've seen this in any of the replies, but the AT&T 
cascading switch crashes of 1990 is a good one.  This link even has some 
pseudocode

https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse



Re: Famous operational issues

2021-02-19 Thread Andrey Kostin

Jen Linkova wrote on 2021-02-19 00:04:


OK, Warren, achievement unlocked. You've just made a network engineer
google 'router'


He meant what we call a "frezer" machine... (in our language ;)

I heard a similar story from my colleague who was working at that time 
for Huawei as a DWDM engineer and had to fly frequently with testing 
devices.
One time he tried to explain at airport security control what a DWDM 
spectrum analyser is for; the officer called another for help, and that 
one said something like this: "DWDM spectrum analyser? Pass it, usual 
thing..."


--
Kind regards,
Andrey Kostin


Re: Famous operational issues

2021-02-19 Thread Aaron C. de Bruyn via NANOG
All these stories remind me of two of my own from back in the late 90s.
I worked for a regional ISP doing some network stuff (under the real
engineer), and some software development.

Like a lot of ISPs in the 90s, this one started out in a rental house.
Over the months and years rooms were slowly converted to host more and more
equipment as we expanded our customer base and presence in the region.
If we needed a "rack", someone would go to the store and buy a 4-post metal
shelf [1] or...in some cases the dump to see what they had.

We had one that looked like an oversized filing cabinet with some sort of
rails on the sides.  I don't recall how the equipment was mounted, but I
think it was by drilling holes into the front lip and tapping the screws
in.  This was the big super-important rack.  It had the main router that
connected lines between 5 POPs around the region, and also several
connections to Portland Oregon about 60 miles away.  Since we were
making tons of money, we decided we should update our image and install
real racks in the "bedroom server room".  It was decided we were going to
do it with no downtime.

I was on the 2-man team that stood behind and in front of the rack with
2x4s dead-lifting them as equipment was unscrewed and lowered onto the
boards.  I was on the back side of the rack.  After all the equipment was
unscrewed, someone came in with a sawzall and cut the filing cabinet thing
apart.  The top half was removed and taken away, then we lifted up on the
boards and the bottom half was slid out of the way.  The new rack was
brought in, bolted to the floor, and then one by one equipment was taken
off the pile we were holding up with 2x4s, brought through the back of the
new rack, and then mounted.

I was pleasantly surprised and very relieved when we finished moving the
big router, several switches, a few servers, and a UPS unit over to the new
rack with zero downtime.  The entire team cheered and cracked beers.  I
stepped out from behind the rack...
...and snagged the power cable to the main router with my foot.  I don't
recall the Cisco model number after all this time...but I do remember the
excruciating 6-8 minutes it took for the damn thing to reboot, and the
sight of the 7 PRI cards in our phone system almost immediately jumping
from 5 channels in-use to being 100% full.

It's been 20 years, but I swear my arms are still sore from holding all
that equipment up for ~20 minutes, and I always pick my feet up very slowly
when I'm near a rack. ;)

The second story is a short one from the same time period.  Our POPs
consisted of the afore-mentioned 4-post metal shelves stacked with piles of
US Robotics 56k modems [2] stacked on top of each other.  They were wired
back to some sort of serial box that was in-turn connected to an ISA card
stuck in a Windows NT 4 server that used RADIUS to authenticate sessions
with an NT4 server back at the main office that had user accounts for all
our customers.  Every single modem had a wall-wart power brick for power,
an RJ11 phone line, and a big old serial cable.  It was an absolute rat's
nest of cables.  The small POP (which I think was a TuffShed in someone's
yard about 50 feet from the telco building) was always 100 degrees--even in
the dead of winter.

One year we made the decision to switch to 3Com Total Control Chassis with
PRI cards.  The cut-over was pretty seamless and immediately made shelves
stacked full of hundreds of modems completely useless.  As we started
disconnecting modems with the intent of selling them for a few bucks to
existing customers who wanted to upgrade or giving them to new customers to
get them signed up, we found a bunch of the stacks of modems had actually
melted together due to the temps.  That explained the handful of numbers in
the hunt group that would just ring and ring with no answer.  In the end we
went from a completely packed 10x20 shed to two small 3Com TCH boxes packed
with PRI cards and a handful of PRI cables with much more normal
temperatures.

I thoroughly enjoyed the "wild west" days of the internet.

If Eric and Dan are reading this, thanks for everything you taught me about
networking, business, hard work, and generally being a good person.

-A

[1] - https://www.amazon.com/dp/B01D54TICS

[2] - https://www.usr.com/products/56k-dialup-modem/usr5686g/



On Tue, Feb 16, 2021 at 11:39 AM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two 

Re: Famous operational issues

2021-02-19 Thread Daniel Karrenberg




On 16 Feb 2021, at 20:37, John Kristoff wrote:

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?



My absolute top one happened in 1995. Traffic engineering was not a widely 
used term then. A bright colleague who will remain un-named decided that 
he could make AS paths longer by repeating the same AS number more than 
once. Unfortunately the prevalent software on Cisco routers was not 
resilient to such trickery and reacted with a reboot. This caused an 
avalanche of yo-yo-ing routers. Think it through!
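
For readers who have not met the trick, here is a small sketch of what repeating an AS number in the path is meant to achieve. It models only the "shorter AS path wins, all else being equal" step of route selection and nothing else about BGP. The AS numbers come from the range reserved for documentation, and the function names are made up for this example.

```python
def prepend(as_path, asn, times):
    """Return a copy of as_path with `asn` added `times` more times in front."""
    return [asn] * times + list(as_path)


def prefer(paths):
    """Pick the candidate with the shortest AS path (ties go to the first)."""
    return min(paths, key=len)


ORIGIN = 64496       # the AS doing the traffic engineering
UPSTREAM_A = 64500   # the link the origin wants to make less attractive
UPSTREAM_B = 64501

# What a remote network might see for the same prefix via each upstream.
via_a = [UPSTREAM_A] + prepend([ORIGIN], ORIGIN, 2)   # origin repeated 3 times
via_b = [UPSTREAM_B, ORIGIN]

if __name__ == "__main__":
    print("via A:", via_a)
    print("via B:", via_b)
    print("preferred:", prefer([via_a, via_b]))
```

The repeated entries carry no new information; they only make one path look longer so that remote networks lean towards the other one. Software that did not expect the same AS number twice in a row is where the trouble in this story started.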


It took some time before that offending path could be purged from the 
whole Internet; yes we all roughly knew the topology and the players of 
the  BGP speaking parts of it at that time.  Luckily this happened 
during the set-up for the Danvers IETF and co-ordination between major 
operators was quick because most of their routing geeks happened to be 
in the same room, the ‘terminal room’; remember those?


Since at the time I personally had no responsibility for operations any 
more I went back to pulling cables and crimping RJ45s.


Lessons: HW/SW mono-cultures are dangerous. Input testing is good 
practice at all levels of software. Operational co-ordination is key in 
times of crisis.


Daniel



Re: Famous operational issues

2021-02-19 Thread Jen Linkova
On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari  wrote:
> 4: Not too long after I started doing networking (and for the same small ISP 
> in Yonkers), I'm flying off to install a new customer. I (of course) think 
> that I'm hot stuff because I'm going to do the install, configure the router, 
> whee, look at me! Anyway, I don't want to check a bag, and so I stuff the 
> Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). 
> I'm going through security and the TSA[0] person opens my bag and pulls the 
> router out. "What's this?!" he asks. I politely tell him that it's a router. 
> He says it's not. I'm still thinking that I'm the new hotness, and so I tell 
> him in a somewhat condescending way that it is, and I know what I'm talking 
> about. He tells me that it's not a router, and is starting to get annoyed. I 
> explain using my "talking to a 5 year old" voice that it most certainly is a 
> router. He tells me that lying to airport security is a federal offense, and 
> starts looming at me. I adjust my attitude and start explaining that it's 
> like a computer and makes the Internet work. He gruffly hands me back the 
> router, I put it in my bag and scurry away. As I do so, I hear him telling 
> his colleague that it wasn't a router, and that he certainly knows what a 
> router is, because he does woodwork...

OK, Warren, achievement unlocked. You've just made a network engineer
google 'router'

P.S. I guess I'm obliged to tell a story if I respond to this thread...so...
"Servers and the ice cream factory".
Late spring/early summer in Moscow. The temperature was above 30C (86°F).
I worked for a local content provider.
Aircons in our server room died, the technician ETA was 2 days (I
guess we were not the only ones with aircon problems).
So we drove to the nearby ice cream factory and got *a lot* of dry
ice. Then we had a roster: every few hours one person took a deep
breath, grabbed a box of dry ice, ran into the server room and emptied
the box on top of the racks. The backup person was watching through
the glass door - just in case, you know, ready to start the rescue
operation.
We (and the servers) survived till the technician arrived. And we had
a lot of dry ice to cool the beer..

-- 
SY, Jen Linkova aka Furry


Re: Famous operational issues

2021-02-19 Thread Owen DeLong
In the case of Exodus when I was working there, it was literally dictated to
us by the fire marshal of the city of Santa Clara (and enough other cities
where we had datacenters to make a universal policy the only sensible choice).

Owen
 
> On Feb 18, 2021, at 1:07 AM, Eric Kuhnke  wrote:
> 
> On that note, I'd be very interested in hearing stories of actual incidents 
> that are the cause of why cardboard boxes are banned in many facilities, due 
> to loose particulate matter getting into the air and setting off very 
> sensitive fire detection systems.
> 
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and 
> don't always clean up properly after themselves.
> 
> On Wed, Feb 17, 2021, 6:21 PM Owen DeLong wrote:
> Stolen isn’t nearly as exciting as what happens when your (used) 6509
> arrives and gets installed and operational before anyone realizes that the
> conductive packing peanuts that it was packed in have managed to work their
> way into various midplane connectors. Several hours later someone notices
> that the box is quite literally smoldering in the colo and the resulting
> combination of panic, fire drill, and management antics that ensue.
> 
> Owen
> 
> 
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch wrote:
> > 
> > I was thinking about how we need a war stories nanog track. My favorite was 
> > being on call when the router was stolen. 
> > 
> > Sent from my TI-99/4a
> > 
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff wrote:
> >> 
> >> Friends,
> >> 
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >> 
> >> Which examples would make up your top three?
> >> 
> >> To get things started, I'd suggest the AS 7007 event is perhaps  the
> >> most notorious and likely to top many lists including mine.  So if
> >> that is one for you I'm asking for just two more.
> >> 
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session.  I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective.  I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >> 
> >> Thanks in advance for your suggestions,
> >> 
> >> John
> 



Re: Famous operational issues

2021-02-19 Thread Wolfgang Tremmel
Do you remember the Cisco HDCI connectors? 
https://en.wikipedia.org/wiki/HDCI

I once shipped a Cisco 4500 plus some cables to a remote data center and asked 
the local guys to cable them for me.
With Cisco you could check the cable type and if they were properly attached. 
They were not.

I asked for a check and the local guy confirmed to me three times that the 
cables were properly plugged in. 
In the end I gave up, and took the 3 hour drive to the datacenter to check 
myself.

Problem was that, while the casing of the connector is asymmetrical, the pins 
inside are symmetrical.
And the local guy was quite strong.

Yes, he managed to plug in the cables 180° flipped, bending the case, but he 
got them in.
He was quite embarrassed when I fixed the cabling problem in 10 seconds.

That must have been 1995 or so

Wolfgang



> On 16. Feb 2021, at 20:37, John Kristoff  wrote:
> 
> Which examples would make up your top three?

-- 
Wolfgang Tremmel 

Phone +49 69 1730902 0  | wolfgang.trem...@de-cix.net
Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG 
Cologne, HRB 51135
DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany 
| www.de-cix.net



Re: Famous operational issues

2021-02-19 Thread Mark Tinka



On 2/19/21 10:40, Suresh Ramasubramanian wrote:

> He is. He asked a perfectly relevant question based on what he saw of 
> the physical setup in front of him.
>
> And he kept his cool when being talked down to.
>
> I’d hire him the next minute, personally speaking.



In the early 2000's, with that level of deduction, I'd have been 
surprised if he wasn't snatched up quickly. Unless, of course, it 
ultimately wasn't his passion.


Mark.


Re: Famous operational issues

2021-02-19 Thread Suresh Ramasubramanian
He is. He asked a perfectly relevant question based on what he saw of the 
physical setup in front of him.

And he kept his cool when being talked down to.

I’d hire him the next minute, personally speaking.

From: Sabri Berisha 
Date: Friday, 19 February 2021 at 2:02 PM
To: Suresh Ramasubramanian 
Cc: nanog 
Subject: Re: Famous operational issues
On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian  
wrote:

>> On 2/19/21 00:37, Warren Kumari wrote:

>> and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't
>> intended to be a condescending ass, but I think of that every time I realize
>> I might be assuming something about someone based on their attire/job/etc.

> Did you at least hire the janitor?

Well, it's funny that you mention that because I worked at a place where the
company ended up hiring a young lady who worked in the cafeteria. When she
graduated she was offered a job in HR, and turned out to be absolutely awesome.

At some point in my life, I was carrying 50lbs bags of potato starch. Now I have
two graduate degrees and am working on a third. That janitor may be awesome, 
too!

Thanks,

Sabri


Re: Famous operational issues

2021-02-19 Thread Sabri Berisha
On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian  
wrote: 

>> On 2/19/21 00:37, Warren Kumari wrote:

>> and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't
>> intended to be a condescending ass, but I think of that every time I realize
>> I might be assuming something about someone based on their attire/job/etc.

> Did you at least hire the janitor?

Well, it's funny that you mention that because I worked at a place where the
company ended up hiring a young lady who worked in the cafeteria. When she
graduated she was offered a job in HR, and turned out to be absolutely awesome.

At some point in my life, I was carrying 50lbs bags of potato starch. Now I have
two graduate degrees and am working on a third. That janitor may be awesome, 
too!

Thanks, 

Sabri



Re: Famous operational issues

2021-02-18 Thread Suresh Ramasubramanian
Did you at least hire the janitor?

From: NANOG  on behalf of Mark 
Tinka 
Date: Friday, 19 February 2021 at 10:20 AM
To: nanog@nanog.org 
Subject: Re: Famous operational issues

On 2/19/21 00:37, Warren Kumari wrote:

5: Another one. In the early 2000s I was working for a dot-com boom company. We 
are building out our first datacenter, and I'm installing a pair of Cisco 7206s 
in 811 10th Ave. These will run basically the entire company, we have some 
transit, we have some peering to configure, we have an AS, etc. I'm going to be 
configuring all of this; clearly I'm a router-god...
Anyway, while I'm getting things configured, this janitor comes past, wheeling 
a garbage bin. He stops outside the cage and says "Whatcha doin'?". I go into 
this long explanation of how these "routers" will connect to "the Internet" to 
allow my "servers" to talk to other "computers" on "the Internet". He pauses 
for a second, and says "'K. So, you doing a full iBGP mesh, or confeds?". I 
really hadn't intended to be a condescending ass, but I think of that every 
time I realize I might be assuming something about someone based on their 
attire/job/etc.

:-), cute.

Mark.


Re: Famous operational issues

2021-02-18 Thread George Herbert
Northridge quake.  I was #2 and on call at CRL.  That One Guy on dialup in 
Atlanta playing MUDs 23x7 pages that things are down.  I wander out to my 
computer to dial in and see what’s up, turned on TV walking past it, sat down 
and turned computer on, as it was booting on comes a live helicopter shot over 
Northridge showing the 1.5 remaining floors of the 3-story Cable and Wireless 
building our east coast connector went through.

Took a second to listen and make sure I understood what was happening, changed 
channels to verify it wasn’t a stunt, logged on and pinged our router there to 
confirm nothing there, call & wake up Jim: “East coast’s down because 
earthquake in Northridge and the C&W center fell down.”

“oh.”

And then there was the Sidekick outage...


-George 

Sent from my iPhone

> On Feb 18, 2021, at 4:37 PM, Patrick W. Gilmore  wrote:
> 
> On Feb 18, 2021, at 6:10 PM, Karl Auer  wrote:
>> 
>> I think it was Machiavelli who said that one should not ascribe to
>> malice anything adequately explained by incompetence…
> 
> https://en.wikipedia.org/wiki/Hanlon%27s_razor
>Never attribute to malice that which is adequately explained by stupidity.
> 
> I personally prefer this version from Robert A. Heinlein:
>Never underestimate the power of human stupidity.
> 
> And to put it on topic, cover your EPOs
> 
> In 1994, there was a major earthquake near the city of Los Angeles. City hall 
> had to be evacuated and it would take over a year to reinforce the building 
> to make it habitable again. My company moved all the systems in the basement 
> of city hall to a new datacenter a mile or so away. After the install, we 
> spent more than a week coaxing their ancient (even for 1994) machines back 
> online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons 
> of cabinets, certainly less storage than my watch has now.
> 
> I was in the DC going over something with the lady in charge when someone 
> walked in to ask her something. She said “just a second”. That person took 
> one step to the side of the door and leaned against the wall - right on an 
> EPO which had no cover.
> 
> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 
> minutes to IPL an AS400? In the middle of the business day? For the second 
> most populous city in the country?
> 
>Me: Maybe you should get a cover for that?
>Her: Good idea.
> 
> Couple weeks later, in the same DC, going over final checklist. A fedex guy 
> walks in. (To this day, no idea how he got in a supposedly locked DC.) She 
> says “just a second”, and I get a very strong deja vu feeling. He takes one 
> step to the side and leans against the wall.
> 
>Me: Did you order that EPO cover?
>Her: Nope.
> 
> -- 
> TTFN,
> patrick
> 


Re: Famous operational issues

2021-02-18 Thread bzs


One day I got called into the office supplies area because there was a
smell of something burning. Uh-oh.

To make a long story short there was a stainless steel bowl which was
focusing the sun from a window such that it was igniting a cardboard
box.

Talk about SMH and random bad luck which could have been a lot worse,
nothing really happened other than some smoke and char.

On February 18, 2021 at 01:07 eric.kuh...@gmail.com (Eric Kuhnke) wrote:
 > On that note, I'd be very interested in hearing stories of actual incidents
 > that are the cause of why cardboard boxes are banned in many facilities, due 
 > to
 > loose particulate matter getting into the air and setting off very sensitive
 > fire detection systems.
 > 
 > Or maybe it's more mundane and 99% of the reason is people unpack stuff and
 > don't always clean up properly after themselves.
 > 
 > On Wed, Feb 17, 2021, 6:21 PM Owen DeLong  wrote:
 > 
 > Stolen isn’t nearly as exciting as what happens when your (used) 6509
 > arrives and gets installed and operational before anyone realizes that the
 > conductive packing peanuts that it was packed in have managed to work their
 > way into various midplane connectors. Several hours later someone notices
 > that the box is quite literally smoldering in the colo and the resulting
 > combination of panic, fire drill, and management antics that ensue.
 > 
 > Owen
 > 
 > 
 > > On Feb 16, 2021, at 2:08 PM, Jared Mauch  wrote:
 > >
 > > I was thinking about how we need a war stories nanog track. My favorite
 > was being on call when the router was stolen.
 > >
 > > Sent from my TI-99/4a
 > >
 > >> On Feb 16, 2021, at 2:40 PM, John Kristoff  wrote:
 > >>
 > >> Friends,
 > >>
 > >> I'd like to start a thread about the most famous and widespread 
 > Internet
 > >> operational issues, outages or implementation incompatibilities you
 > >> have seen.
 > >>
 > >> Which examples would make up your top three?
 > >>
 > >> To get things started, I'd suggest the AS 7007 event is perhaps  the
 > >> most notorious and likely to top many lists including mine.  So if
 > >> that is one for you I'm asking for just two more.
 > >>
 > >> I'm particularly interested in this as the first step in developing a
 > >> future NANOG session.  I'd be particularly interested in any issues
 > >> that also identify key individuals that might still be around and
 > >> interested in participating in a retrospective.  I already have 
 > someone
 > >> that is willing to talk about AS 7007, which shouldn't be hard to 
 > guess
 > >> who.
 > >>
 > >> Thanks in advance for your suggestions,
 > >>
 > >> John
 > 
 > 

-- 
-Barry Shein

Software Tool & Die| b...@theworld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
The World: Since 1989  | A Public Information Utility | *oo*


Re: Famous operational issues

2021-02-18 Thread Mark Tinka



On 2/19/21 00:37, Warren Kumari wrote:



> 5: Another one. In the early 2000s I was working for a dot-com boom
> company. We are building out our first datacenter, and I'm installing
> a pair of Cisco 7206s in 811 10th Ave. These will run basically the
> entire company, we have some transit, we have some peering to
> configure, we have an AS, etc. I'm going to be configuring all of
> this; clearly I'm a router-god...
> Anyway, while I'm getting things configured, this janitor comes past,
> wheeling a garbage bin. He stops outside the cage and says "Whatcha
> doin'?". I go into this long explanation of how these "routers"
> will connect to "the Internet" to allow my "servers" to talk to
> other "computers" on "the Internet" <... with the waving of the
> hands>. He pauses for a second, and says "'K. So, you doing a full
> iBGP mesh, or confeds?". I really hadn't intended to be a
> condescending ass, but I think of that every time I realize I might
> be assuming something about someone based on their attire/job/etc.


:-), cute.

Mark.


Re: Famous operational issues

2021-02-18 Thread Andy Ringsmuth


> On Feb 18, 2021, at 4:37 PM, Warren Kumari  wrote:
> 
> 4: Not too long after I started doing networking (and for the same small ISP 
> in Yonkers), I'm flying off to install a new customer. I (of course) think 
> that I'm hot stuff because I'm going to do the install, configure the router, 
> whee, look at me! Anyway, I don't want to check a bag, and so I stuff the 
> Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). 
> I'm going through security and the TSA[0] person opens my bag and pulls the 
> router out. "What's this?!" he asks. I politely tell him that it's a router. 
> He says it's not. I'm still thinking that I'm the new hotness, and so I tell 
> him in a somewhat condescending way that it is, and I know what I'm talking 
> about. He tells me that it's not a router, and is starting to get annoyed. I 
> explain using my "talking to a 5 year old" voice that it most certainly is a 
> router. He tells me that lying to airport security is a federal offense, and 
> starts looming at me. I adjust my attitude and start explaining that it's 
> like a computer and makes the Internet work. He gruffly hands me back the 
> router, I put it in my bag and scurry away. As I do so, I hear him telling 
> his colleague that it wasn't a router, and that he certainly knows what a 
> router is, because he does woodwork… 

Well, in his defense, he wasn’t wrong…   :-)




Andy Ringsmuth
5609 Harding Drive
Lincoln, NE 68521-5831
(402) 304-0083
a...@andyring.com

“Better even die free, than to live slaves.” - Frederick Douglass, 1863



Re: Famous operational issues

2021-02-18 Thread Randy Bush
when employer had shipped 2xJ to london, had the circuits up, ...
the local office sat on their hands.  for weeks.  i finally was
pissed enough to throw my toolbag over my shoulder, get on a
plane, and fly over.  i walked into the fancy office and said
"hi, i am randy, vp eng, here to help you turn up the routers."
they managed to turn them up pretty quickly.


Re: Famous operational issues

2021-02-18 Thread Patrick W. Gilmore
On Feb 18, 2021, at 6:10 PM, Karl Auer  wrote:
> 
> I think it was Machiavelli who said that one should not ascribe to
> malice anything adequately explained by incompetence…

https://en.wikipedia.org/wiki/Hanlon%27s_razor
Never attribute to malice that which is adequately explained by 
stupidity.

I personally prefer this version from Robert A. Heinlein:
Never underestimate the power of human stupidity.

And to put it on topic, cover your EPOs

In 1994, there was a major earthquake near the city of Los Angeles. City hall 
had to be evacuated and it would take over a year to reinforce the building to 
make it habitable again. My company moved all the systems in the basement of 
city hall to a new datacenter a mile or so away. After the install, we spent 
more than a week coaxing their ancient (even for 1994) machines back online, 
such as a Prime Computer and an AS400 with tons of DASD. Well, tons of 
cabinets, certainly less storage than my watch has now.

I was in the DC going over something with the lady in charge when someone 
walked in to ask her something. She said “just a second”. That person took one 
step to the side of the door and leaned against the wall - right on an EPO 
which had no cover.

Have you ever heard an entire row of DASD spin down instantly? Or taken 40 
minutes to IPL an AS400? In the middle of the business day? For the second most 
populous city in the country?

Me: Maybe you should get a cover for that?
Her: Good idea.

Couple weeks later, in the same DC, going over final checklist. A fedex guy 
walks in. (To this day, no idea how he got in a supposedly locked DC.) She says 
“just a second”, and I get a very strong deja vu feeling. He takes one step to 
the side and leans against the wall.

Me: Did you order that EPO cover?
Her: Nope.

-- 
TTFN,
patrick



Re: Famous operational issues

2021-02-18 Thread Brian Knight via NANOG

On 2021-02-17 13:28, John Kristoff wrote:

On Wed, 17 Feb 2021 14:07:54 -0500
John Curran  wrote:


I have no idea what outages were most memorable for others, but the
Stanford transfer switch explosion in October 1996 resulted in much
of the Internet in the Bay Area simply not being reachable for
several days.


Thanks John.

This reminds me of two I've not seen anyone mention yet.  Both
coincidentally in the Chicago area that I learned before my entry
into netops full time.  One was a flood:

  

The other, at the dawn of an earlier era:

  



I wouldn't necessarily put those two in the top 3, but by some standard
for many they were certainly very significant and noteworthy.

John


Thanks for sharing these links John.  I was personally affected by the 
Hinsdale CO fire when I was a kid.  At the time, my family lived on the 
southern border of Hinsdale in the adjacent town of Burr Ridge.  It was 
weird like a power outage: you're reminded of the loss of service every 
time you perform the simple act of requesting service, picking up the 
phone or toggling a light switch.  But it lasted a lot longer than any 
loss of power: It was six or seven weeks that, to this day, felt a lot 
longer.


Anytime we needed to talk to someone long-distance, we had to drive to a 
cousin's house to make the call.  To talk to anyone local, you'd have to 
physically go and show up unannounced.  At 11 years old, I was the 
bicycle messenger between our house and my great-grandmother, who lived 
about two blocks away.  My mother and father kept the cars gassed up and 
extra fuel on hand in case there was an emergency.


Dad ran a home improvement business out of the house, so new business 
ground to a halt.  Mom worked for a publishing company, so their release 
dates were impacted.  The local grocery store's scanners wouldn't work, 
so they had to punch the orders into the register by hand, using the 
paper sticker prices on the items.


I clearly remember from the local papers that they had to special-order 
the replacement 5ESS at enormous cost.  I saw the big brick building 
after the fire with the burn marks around the front door.  In late May 
and early June, the Greyhound buses with the workers were parked around 
the block, power plants outside with huge cables snaking in right 
through the wide open front door.


When we heard that dial tone at last, everyone was happier than an 
iPhone with full bars. Lol


We're spoiled for choice in telecom networks these days.  Also, 
facilities management have learned plenty of lessons since then.  Like, 
install and maintain an FM-200 fire suppression system.  But 
nevertheless, sometimes when I step into a colo, I think of that outage 
and the impact it had.


-Brian


Re: Famous operational issues

2021-02-18 Thread Paul Ebersman
warren> 2: A somewhat similar thing would happen with the Ascend TNT
warren> Max, which had side-to-side airflow. These were dial termination
warren> boxes, and so people would install racks and racks of them. The
warren> first one would draw in cool air on the left, heat it up and
warren> ship it out the right. The next one over would draw in warm air
warren> on the left, heat it up further, and ship it out the
warren> right... Somewhere there is a fairly famous photo of a rack of
warren> TNT Maxes, with the final one literally on fire, and still
warren> passing packets.

The Ascend MAX (TNT was the T3 version, MAX took 2 T1s) was originally
an ISDN device. We got the first v.34 Rockwell modem version for
testing. An individual card had 4 daughter boards. They were burned in
for 24 hours at Ascend, then shipped to us. We were doing stress testing
in Fairfax VA. Turns out that the boards started to overheat at about 30
hours and caught fire a few hours after that... Completely melted the
daughterboards. They did fix that issue and upped the burn-in test period
to 48 hours.

And yeah, they vented side to side. They were designed for enclosed
racks where air flow was forced up. We were colocating at telco POPs so
we had to use center mount open relay racks. The air flow was as you
describe. Good time. Had by all...

Both we (UUNET, for MSN and Earthlink) and AOL were using these for
dialup access. 80k ports before we switched to the TNTs, 3+ million
ports on TNTs by the time I stopped paying attention.


Re: Famous operational issues

2021-02-18 Thread Karl Auer
On Thu, 2021-02-18 at 17:37 -0500, Warren Kumari wrote:
> Anyway, the subcontractor who made the power supplies for the vendor
> realized that they could save a few cents by not installing the
> little metal clip that held the heatsink to the MOSFET

I think it was Machiavelli who said that one should not ascribe to
malice anything adequately explained by incompetence...

> 3: I used to work for a small ISP in Yonkers, NY.

There is actually a place called "Yonkers"?!? I always thought it was a
joke placename. We don't really need joke placenames in Oz, since we
have real ones like Woolloomooloo, Burpengary and Humpty Doo. My
favourite is Numbugga (closely followed by Wonglepong).

> I cannot remember what we used to call airport security pre-TSA...

"Useful"?

Regards, K.

-- 
~~~
Karl Auer (ka...@biplane.com.au)
http://www.biplane.com.au/kauer

GPG fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170
Old fingerprint: 8D08 9CAA 649A AFEF E862 062A 2E97 42D4 A2A0 616D





Re: Famous operational issues

2021-02-18 Thread Warren Kumari
On Thu, Feb 18, 2021 at 8:31 AM Jared Mauch  wrote:

> On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> > On that note, I'd be very interested in hearing stories of actual
> incidents
> > that are the cause of why cardboard boxes are banned in many facilities,
> > due to loose particulate matter getting into the air and setting off very
> > sensitive fire detection systems.
> >
> > Or maybe it's more mundane and 99% of the reason is people unpack stuff
> and
> > don't always clean up properly after themselves.
>
> We had a plastic bag sucked into the intake of a router in a
> datacenter once that caused it to overheat and take the site down.  We
> had cameras in our cage and I remember seeing the photo from the site of
> the colo (I'll protect their name just because) taken as the tech was on
> the phone and pulled the bag out of the router.
>
> The time from the thermal warning syslog that it's getting warm
> to overheat and shutdown is short enough you can't really get a tech to
> the cage in time to prevent it.
>


1: A previous employer was a large customer of a (now defunct) L3 switch
vendor. The AC power inputs were along the bottom of the power supply, and
the big aluminium heatsinks in the power supplies were just above the AC
socket.
Anyway, the subcontractor who made the power supplies for the vendor
realized that they could save a few cents by not installing the little
metal clip that held the heatsink to the MOSFET, and instead relying on the
thermal adhesive to hold it...
This worked fine, until a certain number of hours had passed, at which
point the goop would dry out and the heatsink would fall down, directly
across the AC socket. This would A: trip the circuit that this was on,
but, more excitingly, set the aluminum on fire, which would then ignite the
other heatsinks in the PSU, leading to much fire...

2: A somewhat similar thing would happen with the Ascend TNT Max, which had
side-to-side airflow. These were dial termination boxes, and so people
would install racks and racks of them. The first one would draw in cool air
on the left, heat it up and ship it out the right. The next one over would
draw in warm air on the left, heat it up further, and ship it out the
right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
with the final one literally on fire, and still passing packets.
There is a related (and probably apocryphal) story regarding the launch of
the TNT. It was being shipped for a major trade-show, but got stuck in customs.
After many bizarre calls with the customs folk, someone goes to the customs
office to try and sort it out, and gets greeted by customs agents with guns.
They all walk into the warehouse, and discover that there is a large empty
area around the crate, which is a wooden cube, with "TNT" stencilled in big
red letters...

3: I used to work for a small ISP in Yonkers, NY. We had a customer in
Florida, and on a Friday morning their site goes down. We (of course) have
not paid for Cisco 4 hour support (or, honestly, any support) and they have
a strict SLA, so we are a little stuck.
We end up driving to JFK, and lugging a fully loaded Cisco 7507 to the
check in counter. It was just before the last flight of the day, so we
shrugged and said it was my checked bag. The excess baggage charges were
eye-watering, but it rode the conveyor belt with the rest of the luggage
onto the plane. It arrived with just a bent ejector handle, and the rest
was fine.

4: Not too long after I started doing networking (and for the same small
ISP in Yonkers), I'm flying off to install a new customer. I (of course)
think that I'm hot stuff because I'm going to do the install, configure the
router, whee, look at me! Anyway, I don't want to check a bag, and so I
stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
pre-9/11!). I'm going through security and the TSA[0] person opens my bag
and pulls the router out. "What's this?!" he asks. I politely tell him that
it's a router. He says it's not. I'm still thinking that I'm the new
hotness, and so I tell him in a somewhat condescending way that it is, and
I know what I'm talking about. He tells me that it's not a router, and is
starting to get annoyed. I explain using my "talking to a 5 year old" voice
that it most certainly is a router. He tells me that lying to airport
security is a federal offense, and starts looming at me. I adjust my
attitude and start explaining that it's like a computer and makes the
Internet work. He gruffly hands me back the router, I put it in my bag and
scurry away. As I do so, I hear him telling his colleague that it wasn't a
router, and that he certainly knows what a router is, because he does
woodwork...

5: Another one. In the early 2000s I was working for a dot-com boom
company. We are building out our first datacenter, and I'm installing a
pair of Cisco 7206s in 811 10th Ave. These will run basically the entire
company, we have some transit, we 

Re: Famous operational issues

2021-02-18 Thread Henry Yen
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities,

the datacenter manager's daughter's cat.

-- 
Henry Yen   Aegis Information Systems, Inc.
Senior Systems Programmer   Hicksville, New York


Re: Famous operational issues

2021-02-18 Thread Alain Hebert

A few I remember:

    . Some monitoring server SCSI drive failed (we're talking 
State/Province level govt)...  Got word back stating it would take 6 
months to get a replacement...


        Ended up choosing to use my own drive instead of leaving 
something that could have been deadly, unmonitored.


    . Metro interruption during rush hour (for a pop of 4M) due to an 
overloaded power bar in an MMR (Meet Me Room) during an unplanned deployment;


    . Cherry red and very angry looking 520-600V bus bar =D;

    . Fire fighters hitting the building generator emergency STOP 
button because some neighbor reported smoke on top of the building 
during a blackout...

    ( not their fault, local gov failure as usual )

    . Some idiots poured gasoline into a large pipe under a bridge...  
ended up demonstrating the lack of diversity to the DCs on that urban 
island;


    . Underground transformer blew up in downtown Mtl and took out the 
entire fiber bundle, demonstrating to those customers that their 
diversity was actually real =D.


        (took them a year to get that fixed)

and

    . Obviously: Any rack cabling I do...

-
Alain Hebert                                aheb...@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911  http://www.pubnix.net    Fax: 514-990-9443

On 2/18/21 2:37 PM, t...@pelican.org wrote:

On Thursday, 18 February, 2021 16:23, "Seth Mattinen"  said:


I had a customer that tried to stack their servers - no rails except the
bottom most one - using 2x4's between each server. Up until then I
hadn't imagined anyone would want to fill their cabinet with wood, so I
made a rule to ban wood and anything tangentially related (cardboard,
paper, plastic, etc.). Easier to just ban all things. Fire reasons too
but mainly I thought a cabinet full of wood was too stupid to allow.

On the "stupid racking" front, I give you most of a rack dedicated to a single 
server.  Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so 
it had been installed vertically.  By looping some fairly hefty chain through the handles 
on either side of the front of the chassis, and then bolting the four chain ends to the 
four rack posts.  I wish I'd kept pictures of that one.  Not flammable, but a serious WTF 
moment.

Cheers,
Tim.






Re: Famous operational issues

2021-02-18 Thread George Metz
Normally I reference this as an example of terrible government
bureaucracy, but in this case it's also how said bureaucracy can delay
operational changes.

I was a contractor for one of the many branches of the DoD in charge
of the network at a moderate-sized site. I'd been there about 4
months, and it was my first job with FedGov. I was sent a pair of
Cisco 6509-E routers, with all supervisors and blades needed, along
with a small mountain of SFPs, to replace the non-E 6509s we had
installed that were still using GBICs for their downlinks. These were
the distro switches for approximately half the site.

Problem was, we needed 84 new SC-LC fiber jumpers to replace the SC-SC
we had in place for the existing switch - GBICs to SFPs remember. We
hadn't received any with the shipment. So I reached out to the project
manager to ask about getting the fiber jumpers. "Oh, that should be
coming from the server farm folks, since it's being installed in a
server farm." Okay, that seems stupid to me, but $FedGov, who knows. I
tell him we're stalled out until we get those cables - we have the
routers configured and ready to go, just need the jumpers, can he get
them from the server farm folks? He'll do that.

It took FIFTEEN MONTHS to hash out who was going to pay for and order
the fiber jumpers. Any number of times as the months dragged on, I
seriously considered ordering them on Amazon Prime using my corporate
card. We had them installed a week and a half after we got them. Why
that long? Because we had to completely reconfigure them, and after 15
months, the urgency just wasn't there.

By the way, the project ended up buying them, not the server farm team.

On Tue, Feb 16, 2021 at 2:38 PM John Kristoff  wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John


Re: Famous operational issues

2021-02-18 Thread t...@pelican.org
On Thursday, 18 February, 2021 16:23, "Seth Mattinen"  said:

> I had a customer that tried to stack their servers - no rails except the
> bottom most one - using 2x4's between each server. Up until then I
> hadn't imagined anyone would want to fill their cabinet with wood, so I
> made a rule to ban wood and anything tangentially related (cardboard,
> paper, plastic, etc.). Easier to just ban all things. Fire reasons too
> but mainly I thought a cabinet full of wood was too stupid to allow.

On the "stupid racking" front, I give you most of a rack dedicated to a single 
server.  Not all that high a server, maybe 2U or so, but *way* too deep for the 
rack, so it had been installed vertically.  By looping some fairly hefty chain 
through the handles on either side of the front of the chassis, and then 
bolting the four chain ends to the four rack posts.  I wish I'd kept pictures 
of that one.  Not flammable, but a serious WTF moment.

Cheers,
Tim.




Re: Famous operational issues

2021-02-18 Thread Erik Sundberg
Worked a chronic support call where their internet would bounce at noon every
workday. The Cisco 1601 or 1700 router that had their T1 in it ended up being
on top of a microwave. Weeks of troubleshooting and shipping new routers on
this one.

Also had another one where the router was plugged in to an outlet that was 
controlled by a light switch, discovered this after shipping them two new 
routers.

Customer had their building remodeled and the techs couldn't find the T1
Smartjack for the building. The contractor who did the remodel job decided it
would be a good idea to cut out the section of wall where the telco equipment
was and mount it to the ceiling. Its new location was in the ladies' bathroom,
above the drop ceiling, mounted to the building's rafters 10' in the air.

Customer needed a new router, because the first one died. It was a machine shop
and they mounted the router to the wall next to a lathe or drill press that
used oil to cool the bit while it was cutting. It looked like someone had
dumped the router in a bucket of oil when we got it back.

Arrived at another large colo for a buildout, only to find that our ASR9K that
had arrived 2 weeks earlier was stored outside on the loading dock, which has
no roof or locked gate. I guess that's why Cisco puts the plastic bag over the
chassis when they're shipped.

Colo techs at another larger colo decided to unpack our router, which was a
fully loaded 1/2 rack chassis. Since they couldn't lift it, they tipped the
router on its side and walked it back by shifting the weight from one corner of
the chassis to another, bending the chassis. I could see the scrape marks in
the floor from it.

We had colo space on the top floor of an ATT CO where we put a Cisco 7513 to
terminate about a dozen CHDS3's. The roof was leaking and, instead of fixing
the roof, the fix was to put a sheet of plastic over our cabinet. It was more
like a tent over the cabinet. A pool of water formed in a divot at the top and
it was 120+ degrees under the plastic tarp.

Our office was in a work loft of an older building and they had the AC units
mounted to the ceiling with a drip pan underneath each. Well, the AC on the 2nd
floor had the pump for its drip pan die. Whoever installed the drip pan
didn't secure it or center it under the AC unit. It filled up with water and,
since it was not secured and was off center, the drip pan came crashing down
with a few gallons of water. The water worked its way over to the wall and
traveled down one story in the building. The floor below had all the telco
equipment mounted to that same wall, and the water flowed down right through a
couple of ATT's Cienas mounted to the wall, shorting them out. I was at the
Chicago NANOG Hackathon on Sunday and was called out to work that one.

Was working in the back of a cabinet that had -48 VDC power for a Cisco router;
a screw fell and shorted out the power. My coworker, who was standing in front
of the rack, wasn't happy because the ADC PowerWorx fuse panel was about 6" from
his face where he was working. It had those little black alarm fuses that have
the spring-loaded arm. When it tripped, a nice shower of sparks flew right
at his face. Luckily he wore glasses.

I was 18 at my first IT job and it was a brand-new building. I was plugging in
a 208VAC 30A APC UPS in the server room; the electrician had just energized and
checked the circuit. I plugged in the APC UPS and gave it a good turn for the
twist lock plug to catch and KA BAMB!!! Sparks came shooting out of the outlet
at me. I think I pooped myself that day. Turns out the electricians decided that
a single-gang electrical box was good enough for a 208 VAC 30A outlet that
barely fit in the box. They didn't put any tape around the wire terminals. When
they energized the circuit there was enough of an air gap that the hot screw
didn't ground out. When I gave it that good old twist while plugging in the
APC, I grounded the hot screw to the side of the electrical box.








Re: Famous operational issues

2021-02-18 Thread Seth Mattinen

On 2/18/21 1:07 AM, Eric Kuhnke wrote:
On that note, I'd be very interested in hearing stories of actual 
incidents that are the cause of why cardboard boxes are banned in many 
facilities, due to loose particulate matter getting into the air and 
setting off very sensitive fire detection systems.





I had a customer that tried to stack their servers - no rails except the 
bottom most one - using 2x4's between each server. Up until then I 
hadn't imagined anyone would want to fill their cabinet with wood, so I 
made a rule to ban wood and anything tangentially related (cardboard, 
paper, plastic, etc.). Easier to just ban all things. Fire reasons too 
but mainly I thought a cabinet full of wood was too stupid to allow.


The "no wood" rule has become a fun story to tell everyone who asks how 
that ended up being a rule. The wood customer turned out to be a 
complete a-hole anyway, wood was just the tip of the iceberg.


Re: Famous operational issues

2021-02-18 Thread Jared Mauch
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities,
> due to loose particulate matter getting into the air and setting off very
> sensitive fire detection systems.
> 
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and
> don't always clean up properly after themselves.

We had a plastic bag sucked into the intake of a router in a
datacenter once that caused it to overheat and take the site down.  We
had cameras in our cage and I remember seeing the photo from the site of
the colo (I'll protect their name just because) taken as the tech was on
the phone and pulled the bag out of the router.

The time from the thermal warning syslog that it's getting warm
to overheat and shutdown is short enough you can't really get a tech to
the cage in time to prevent it.

I assume it's also the latter above: people have varying
definitions of clean.

- Jared

-- 
Jared Mauch  | pgp key available via finger from ja...@puck.nether.net
clue++;  | http://puck.nether.net/~jared/  My statements are only mine.


Re: Famous operational issues

2021-02-18 Thread Eric Kuhnke
On that note, I'd be very interested in hearing stories of actual incidents
that are the cause of why cardboard boxes are banned in many facilities,
due to loose particulate matter getting into the air and setting off very
sensitive fire detection systems.

Or maybe it's more mundane and 99% of the reason is people unpack stuff and
don't always clean up properly after themselves.

On Wed, Feb 17, 2021, 6:21 PM Owen DeLong  wrote:

> Stolen isn’t nearly as exciting as what happens when your (used) 6509
> arrives and
> gets installed and operational before anyone realizes that the conductive
> packing
> peanuts that it was packed in have managed to work their way into various
> midplane
> connectors. Several hours later someone notices that the box is quite
> literally
> smoldering in the colo and the resulting combination of panic, fire drill,
> and
> management antics that ensue.
>
> Owen


Re: Famous operational issues

2021-02-17 Thread Owen DeLong
Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives 
and
gets installed and operational before anyone realizes that the conductive 
packing
peanuts that it was packed in have managed to work their way into various 
midplane
connectors. Several hours later someone notices that the box is quite literally
smoldering in the colo and the resulting combination of panic, fire drill, and
management antics that ensue.

Owen


> On Feb 16, 2021, at 2:08 PM, Jared Mauch  wrote:
> 
> I was thinking about how we need a war stories nanog track. My favorite was 
> being on call when the router was stolen. 
> 
> Sent from my TI-99/4a



Re: Famous operational issues

2021-02-17 Thread Rogier van Eeten via NANOG
Ahh, war stories. I like the one where I got a wake up call that our IRC 
server was on fire,  together with the rest of the DC.



Not that widespread, but we reached Slashdot. :)

November 2002, University of Twente, The Netherlands. Some idiot wanted 
to be a hero. He deflated people's tires, to help inflate them. One
morning he thought it would be a good idea to start a small fire and 
then extinguish it, so he would be the hero that stopped a fire. He 
failed and the building burned down. He got caught a few days later when 
he tried the same thing in a different building.


Almost all of the IT was in that building, including core network, 
uplinks to SURFNet (Dutch Educational Network) and to the 2000 students 
living on the campus. Ironically a new DC was already being built, so 
that was ready for use a few weeks later.


As we had quite a network for 2002 we hosted for instance 
security.debian.org. The students all had 100Mbit in their room, so some 
of them also hosted some popular websites. One I can remember was an 
image sharing site.


Some students immediately created a backup network; dhcp server, dns 
server with a catch all, website explaining what was going on, IRC 
server, etc..


A local ISP offered to sponsor 50Mbit for the residents, which was 
connected via a microwave relay and a temporary fiber was run through a 
ditch to connect two parts of the campus residences. At the end of the
day all 2000 students had their internet connection back, although all 
behind a single 50Mbit link.



Syslog message from the local SURFNet router:

lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC: 
%ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached 
WARNING level at 61(C)
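
For anyone who wants to alert on a line like the one above, here is a minimal
Python sketch of pulling the slot and temperature out of it. The two regular
expressions and the function name parse_temp_warning are assumptions made for
illustration, fitted to this one quoted message rather than taken from any
vendor documentation, so treat them as a starting point only.

```python
import re

# The syslog line quoted above, with the archive's line wrapping removed.
LINE = (
    "lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC: "
    "%ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached "
    "WARNING level at 61(C)"
)

# Assumed shape of the tag: %FACILITY-SEVERITY-MNEMONIC: free text
CISCO_TAG = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<text>.*)"
)
# Assumed shape of the free text, based only on the message above.
TEMPERATURE = re.compile(r"slot (?P<slot>\d+)\).*level at (?P<celsius>\d+)\(C\)")


def parse_temp_warning(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    tag = CISCO_TAG.search(line)
    if tag is None:
        return None
    reading = TEMPERATURE.search(tag.group("text"))
    if reading is None:
        return None
    return {
        "facility": tag.group("facility"),
        "severity": int(tag.group("severity")),
        "mnemonic": tag.group("mnemonic"),
        "slot": int(reading.group("slot")),
        "celsius": int(reading.group("celsius")),
    }


print(parse_temp_warning(LINE))
```

Returning None for anything that does not match keeps the parser from guessing
at messages it was never fitted to.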



(Disclaimer: Where I say we, I mean we as University. I wasn't working 
for the university, but was part of the students working on the backup 
network. There are probably some other people on list with some more 
details and I've probably missed some details, but this is the summary.)



On 16-02-2021 23:08, Jared Mauch wrote:

I was thinking about how we need a war stories nanog track. My favorite was 
being on call when the router was stolen.

Sent from my TI-99/4a




Re: Famous operational issues

2021-02-17 Thread John Curran
(resent - to list this time)
On 16 Feb 2021, at 2:37 PM, John Kristoff <j...@dataplane.org> wrote:
> 
> Friends,
> 
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
> 
> Which examples would make up your top three?

John  - 

I have no idea what outages were most memorable for others, but the Stanford
transfer switch explosion in October 1996 resulted in much of the Internet in
the Bay Area simply not being reachable for several days.

At the time there were three main power grids feeding Stanford – two from PG&E
and one from Stanford’s own CoGen plant – and somehow a rat crawling into one
of the two 12KVA transfer switches resulted in the switch disappearing in an
epic explosion that even took out a portion of the exterior wall of the
building.

The ensuing restoration involved lots of industry folks, GE power-on-wheel 
generating stations, anaconda-sized power cables, and all in all was quite the 
adventure. 

FYI,
 /John





Re: Famous operational issues

2021-02-17 Thread John Kristoff
On Wed, 17 Feb 2021 14:07:54 -0500
John Curran  wrote:

> I have no idea what outages were most memorable for others, but the
> Stanford transfer switch explosion in October 1996 resulted in much
> of the Internet in the Bay Area simply not being reachable for
> several days.

Thanks John.

This reminds me of two I've not seen anyone mention yet.  Both
coincidentally in the Chicago area, and both ones I learned about before
my entry into netops full time.  One was a flood:

  

The other, at the dawn of an earlier era:

  

I wouldn't necessarily put those two in the top 3, but by some standard
for many they were certainly very significant and noteworthy.

John


Re: Famous operational issues

2021-02-17 Thread Jared Mauch
The he.net side is interesting as you can see who their v4 transits are but 
they suppress their routes via v6, but (last I knew) lacked community support 
for their customers to do similar route suppression.

I’m not a fan of it, but it makes the commercial discussions much easier each 
time those networks come by to shop services to me in a personal or 
professional capacity.  “No, I need all the internet”.

- Jared

> On Feb 17, 2021, at 12:07 PM, David Guo via NANOG  wrote:
> 
> Cogentco still did not peer with Google and HE over IPv6 I guess.



Re: Famous operational issues

2021-02-17 Thread David Guo via NANOG
Cogentco still did not peer with Google and HE over IPv6 I guess.


From: NANOG  on behalf of Justin Wilson 
(Lists) 
Sent: Thursday, February 18, 2021 00:53
To: Miles Fidelman
Cc: nanog@nanog.org
Subject: Re: Famous operational issues

I remember when the big carriers de-peered with Cogent in the early 2000s.  They
underestimated the number of web sites being hosted by people using Cogent
exclusively.


Justin Wilson
j...@j2sw.com

—
https://j2sw.com - All things jsw (AS209109)
https://blog.j2sw.com - Podcast and Blog

> On Feb 17, 2021, at 10:29 AM, Miles Fidelman  
> wrote:
>
> John Kristoff wrote:
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
> Well... pre-Internet, but the great Northeast fiber cut comes to mind 
> (backhoe vs. fiber, backhoe won).
>
> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why.  ... unknown



Re: Famous operational issues

2021-02-17 Thread Justin Wilson (Lists)
I remember when the big carriers de-peered with Cogent in the early 2000s.  They
underestimated the number of web sites being hosted by people using Cogent
exclusively.


Justin Wilson
j...@j2sw.com

—
https://j2sw.com - All things jsw (AS209109)
https://blog.j2sw.com - Podcast and Blog

> On Feb 17, 2021, at 10:29 AM, Miles Fidelman  
> wrote:
> 
> John Kristoff wrote:
>> Friends,
>> 
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>> 
> Well... pre-Internet, but the great Northeast fiber cut comes to mind 
> (backhoe vs. fiber, backhoe won).
> 
> Miles Fidelman
> 
> -- 
> In theory, there is no difference between theory and practice.
> In practice, there is.   Yogi Berra
> 
> Theory is when you know everything but nothing works. 
> Practice is when everything works but no one knows why. 
> In our lab, theory and practice are combined: 
> nothing works and no one knows why.  ... unknown



Re: Famous operational issues

2021-02-17 Thread Miles Fidelman

John Kristoff wrote:

Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Well... pre-Internet, but the great Northeast fiber cut comes to mind 
(backhoe vs. fiber, backhoe won).


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.   Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why.  ... unknown



Re: Famous operational issues

2021-02-16 Thread bzs


 > On Tue, 16 Feb 2021, John Kristoff wrote:
 > 
 > > Friends,
 > >
 > > I'd like to start a thread about the most famous and widespread Internet
 > > operational issues, outages or implementation incompatibilities you
 > > have seen.
 > >

When Boston University joined the internet proper ca 1984 I was in
charge of that group.

We accidentally* submitted an initial HOSTS.TXT file which included
some internally used one-character host names (A, B, C) and one which
began with a digit (3B, an AT&T 3B5), both illegal for HOSTS.TXT back
then.

This put the BSD Unix program which converted from HOSTS.TXT to Unix'
/etc/hosts format into an infinite loop filling /tmp which in those
days crashed Unix and it often couldn't reboot successfully without
manual intervention.

On many, many hosts across the internet.

I hesitate to guess a number since scale has changed so much but some
of the more heated email claimed it brought down at least half the
internet by some count.

It was worsened by the fact that many hosts pulled and processed a new
HOSTS.TXT file via cron (time-based job scheduler) at midnight so no
one was around to fix and reboot systems.

The thread on the TCP-IP mailing list was: BU JOINS THE INTERNET!

It was a little embarrassing.

Today it probably would have landed me in Gitmo.

* There were two versions, the one we used internally, and the one to
be submitted which removed those host names. The wrong one got
submitted.
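
The two rules that tripped the converter can be checked up front. What follows
is a minimal Python sketch, assuming a rule set loosely modelled on RFC 952
(starts with a letter, at least two characters, at most 24, only letters,
digits and hyphens, not ending in a hyphen). The function names and the
BU-CS entry are hypothetical, and this is not the logic of the original BSD
program, which is not shown in this thread.

```python
import re

# Assumed rules, per the message above and loosely after RFC 952:
# first character a letter, last character a letter or digit,
# so the shortest legal name is two characters long.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]")


def is_legal_hostname(name):
    """True if the name satisfies the assumed HOSTS.TXT naming rules."""
    return len(name) <= 24 and VALID_NAME.fullmatch(name) is not None


def split_hosts(names):
    """Separate legal names from rejected ones, preserving input order."""
    legal = [n for n in names if is_legal_hostname(n)]
    rejected = [n for n in names if not is_legal_hostname(n)]
    return legal, rejected


# "BU-CS" is a made-up legal name; the others are the ones from the story.
legal, rejected = split_hosts(["BU-CS", "A", "B", "C", "3B"])
print("legal:", legal)
print("rejected:", rejected)
```

The design point is that a bad entry is set aside and processing continues,
rather than the whole conversion stalling on one line.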

-- 
-Barry Shein

Software Tool & Die| b...@theworld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
The World: Since 1989  | A Public Information Utility | *oo*


Re: Famous operational issues

2021-02-16 Thread Rich Kulawiec
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> Which examples would make up your top three?

Morris worm, November 1988.  Much confusion and eventually the realization
that John Brunner had called it from 13 years out ("The Shockwave Rider", 1975).
But sloppy coding meant it could be defeated with one line of /bin/sh.

---rsk


Re: Famous operational issues

2021-02-16 Thread Richard Golodner
That was the one with the most severe impact for my company. Seven Frame
circuits (UUNET), and we all saw what an update can do.


On 2/16/21 3:28 PM, Sean Donelan wrote:

Since you said operational issues, instead of just outage...

How about MCI Worldcom's 10-day operational disaster in 1999.


http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
How not to handle a network outage

[...]
MCI WorldCom issued an alert to its sales force, which was given the 
option to deliver a notice to customers by e-mail, hand delivery or 
telephone – or not at all. After a deafening silence from company 
executives on the 10-day network outage, MCI WorldCom CEO Bernie 
Ebbers finally took the podium to discuss the situation. How did he 
explain the failure, and reassure customers that the network would not 
suffer such a failure in the future? He didn't. Instead, he blamed 
Lucent.

[...]


Re: Famous operational issues

2021-02-16 Thread Mark Andrews



> On 17 Feb 2021, at 09:51, Sean Donelan  wrote:
> 
> 
> Biggest internet operational SUCCESS
> 
> 1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class of 
> security problems on the Internet.  But then HTTP took over everything, so a 
> good news/bad news.
> 
> 2. Internet worms massively reduced by changed default configurations and 
> default firewalls (Windows XP proved defaults could be changed). Still need 
> to work on DDOS amplification.
> 
> 3. Head of Line blocking in IX switches (although I miss Stephen Stuart 
> saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which is 
> a non-problem now.
> 
> 4. Classless Inter-Domain Routing and BGP4 changed how Internet routing 
> worked across the entire backbone, and it worked!  Vince Fuller et al rebuilt 
> the aircraft in flight, without crashing.
> 
> 5. Y2K was a huge success because a lot of people fixed things ahead of
> time, and almost nothing crashed (other than the National Security Agency's
> internal systems :-).  I'll be retired before Y2038, so that's someone else's
> problem.

Let's hope you aren’t depending on a piece of medical equipment with a Y2038
issue to keep you alive.

Y2038 is everybody's problem!
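
For readers who have not seen the arithmetic, here is a small Python sketch of
the rollover, assuming a system that keeps time as a signed 32-bit count of
seconds since 1970-01-01 UTC. The helper names as_int32 and to_utc are made up
for the illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed two's-complement int."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


def to_utc(seconds):
    """Convert seconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_good = to_utc(INT32_MAX)              # last second a signed 32-bit counter can hold
wrapped = to_utc(as_int32(INT32_MAX + 1))  # one tick later, after the counter wraps

print("last representable:", last_good.isoformat())
print("one second later  :", wrapped.isoformat())
```

One second past 03:14:07 UTC on 19 January 2038 the counter goes negative and
the clock lands in December 1901, which is what an unpatched device would see.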

Mark
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742  INTERNET: ma...@isc.org



Re: Famous operational issues

2021-02-16 Thread Joe
If we're just talking about outages historically, I recall the 1996 AOL
Email debacle, not really anything to do with network mishaps but more so
DNS configuration..

As well, I believe the North East 2003 blackout was a great DR test that no
one was expecting.

Of course we also have the big non-events too such as Y2K

Regards
-Joe B.


On Tue, Feb 16, 2021 at 1:38 PM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>


Re: Famous operational issues

2021-02-16 Thread Simon Lockhart
On Tue Feb 16, 2021 at 09:33:20PM +0100, Jörg Kost wrote:
> I don't want to classify and rate it, but would name 9/11.
> 
> You can read about the impacts on the list archives and there is also a
> presentation from NANOG '23 online.

For an operational perspective, I was part of the team trying to keep the
BBC website up and running through 9/11...

http://www.slimey.org/bbc_ticket_10083.txt

Simon


Re: Famous operational issues

2021-02-16 Thread Paul Ebersman
jlewis> This reminds me of one of the Sprint CO's we were colo'd in.

Ah, Sprint. Nothing like using your railroad to run phone lines...
Our routers in San Jose colo were black from the soot of the trains.

Fondly remember a major Sprint outage in the early 90s. All our data
circuits in the southeast went down at once and there were major voice
outages in the entire southeast.

Turns out a storm caused a mudslide which in turn derailed a train
carrying toxic waste, resulting in a wave of 6-10' of toxic mud taking
out the Sprint voice POP for the whole southeast, because it was
conveniently located right on said railroad tracks.

We were a big enough customer that PLSC in Atlanta gave us the real
story when we asked for an ETA on repair. They couldn't give us one
immediately until the HAZMAT crew let them in. Turned out to be a total
loss of all gear.

They yanked every tech east of the Mississippi and a 7ESS was Fedex
overnighted (stolen from some customer in the middle east?) and they had
to rebuild everything.

Was down less than 10 days. Good times.


Re: Famous operational issues

2021-02-16 Thread scott


On 2/16/2021 9:37 AM, John Kristoff wrote:

I'd suggest the AS 7007 event is perhaps the most notorious and 
likely to top many lists including mine. 




AS7007 is how I found NANOG.  We (Digital Island; first job out
of college) were in 10-20 countries around the planet at the time.
All of them went down while we were in Cisco training.  I kept
interrupting the class and telling my manager "everything's down!
We need to stop the training and get on it!"  We didn't because I
was new and no one believed that much could go down all at once.
They assumed it was a monitoring glitch.  So, the training
continued for a while until very senior engineers got involved.
One of the senior guys said something to the effect of "yeah, it's
all over NANOG."  I said what is NANOG?  I signed up for the list
and many of you have had to listen to me ever since... ;)

scott



Re: Famous operational issues

2021-02-16 Thread Pierre Emeriaud
On Tue, 16 Feb 2021 at 21:03, Job Snijders via NANOG wrote:
>
> https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/
>
> The experiment triggered a bug in some Cisco router models: affected
> Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
> Any peers of such Ciscos receiving this BGP update, would (according to
> then current RFCs) consider the BGP UPDATE corrupted, and would
> subsequently tear down the BGP sessions with the Ciscos. Because the
> corruption was not detected by the Ciscos themselves, whenever the
> sessions would come back online again they'd reannounce the corrupted
> update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
> global scale in both IBGP and EBGP! :-)

In a similar fashion, a network I know had a massive outage when a
failing linecard corrupted IS-IS LSPs, triggering a flood of purges
and taking down the whole backbone.

This was pre-RFC 6232, so you can guess that resolving the issue was a real PITA.

This kind of outage fuels my netops nightmares.


Re: [EXTERNAL] Re: Famous operational issues

2021-02-16 Thread Compton, Rich A
There was the outage in 2014 when we got to 512K routes.  
http://www.bgpmon.net/what-caused-todays-internet-hiccup/


On 2/16/21, 1:04 PM, "NANOG on behalf of Job Snijders via NANOG" wrote:

On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
> 
> Which examples would make up your top three?

This was a fantastic outage, one could really feel the tremors into the
far corners of the BGP default-free zone:


https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/

The experiment triggered a bug in some Cisco router models: affected
Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
Any peers of such Ciscos receiving this BGP update, would (according to
then current RFCs) consider the BGP UPDATE corrupted, and would
subsequently tear down the BGP sessions with the Ciscos. Because the
corruption was not detected by the Ciscos themselves, whenever the
sessions would come back online again they'd reannounce the corrupted
update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
global scale in both IBGP and EBGP! :-)

Luckily the industry took these, and many other lessons to heart: in
2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE
Messages") which specifies far more robust behaviour for BGP speakers.
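
The difference between the two error-handling styles can be sketched as a
toy simulation. This is a minimal sketch, not a BGP implementation: the
Session class, the policy names and the prefixes (RFC 5737 documentation
ranges) are all assumptions made for the example, and RFC 7606 applies
"treat-as-withdraw" only to certain classes of error, which the sketch
does not model.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of one BGP session on the receiving side (hypothetical)."""
    policy: str                      # "reset" or "treat-as-withdraw"
    up: bool = True
    flaps: int = 0
    routes: set = field(default_factory=set)

    def receive(self, prefix: str, malformed: bool) -> None:
        if not malformed:
            self.routes.add(prefix)
            return
        if self.policy == "reset":
            # Older behaviour: a corrupted UPDATE tears down the whole
            # session, and every route learned over it is lost.
            self.up = False
            self.flaps += 1
            self.routes.clear()
        else:
            # Treat-as-withdraw: drop only the affected prefix and keep
            # the session (and all other routes) up.
            self.routes.discard(prefix)


def simulate(policy: str, rounds: int = 3) -> Session:
    """The sender re-announces the same corrupted update after each reconnect."""
    session = Session(policy)
    for _ in range(rounds):
        session.up = True  # session re-establishes
        session.receive("192.0.2.0/24", malformed=False)
        session.receive("198.51.100.0/24", malformed=True)
    return session


if __name__ == "__main__":
    for policy in ("reset", "treat-as-withdraw"):
        s = simulate(policy)
        print(policy, "flaps:", s.flaps, "up:", s.up, "routes:", sorted(s.routes))
```

With the "reset" policy the session flaps once per round, which is the
bounce loop described above; with "treat-as-withdraw" the good route
survives and the session never drops.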

Kind regards,

Job



Re: Famous operational issues

2021-02-16 Thread Sean Donelan



Biggest internet operational SUCCESS

1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class 
of security problems on the Internet.  But then HTTP took over everything, 
so a good news/bad news.


2. Internet worms massively reduced by changed default configurations 
and default firewalls (Windows XP proved defaults could be changed). Still 
need to work on DDOS amplification.


3. Head of Line blocking in IX switches (although I miss Stephen Stuart 
saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which 
is a non-problem now.


4. Classless Inter-Domain Routing and BGP4 changed how Internet routing 
worked across the entire backbone, and it worked!  Vince Fuller et al 
rebuilt the aircraft in flight, without crashing.


5. Y2K was a huge success because a lot of people fixed things ahead of time, 
and almost nothing crashed (other than the National Security Agency's 
internal systems :-).  I'll be retired before Y2038, so that's someone 
else's problem.





Re: Famous operational issues

2021-02-16 Thread Jon Lewis

On Tue, 16 Feb 2021, Sabri Berisha wrote:


- On Feb 16, 2021, at 2:08 PM, Jared Mauch ja...@puck.nether.net wrote:

Hi,


I was thinking about how we need a war stories nanog track. My favorite was
being on call when the router was stolen.


Wait... what? I would love to listen to that call between you and your manager.

But, here is one for you then. I was once called to a POP where one of our main
routers was down. Due to political reasons, my access had been revoked. My
manager told me to do whatever I needed to do to fix the problem, he would cover
my behind. I did, and I "gently" removed the door. My manager held word.


This reminds me of one of the Sprint CO's we were colo'd in.  Access to 
the CLEC colo area was via a back door through the Men's room!  One 
weekend, I had to make the drive to that site to deal with an access 
server issue, and I found they'd locked the back door to the Men's room 
from the colo floor side, so no access.  Using supplies I found inside the 
CO, I managed to open the locked door and get to our gear.  That route, being 
our only access route was probably some kind of violation.  Not all of our 
techs were guys.


While we never had a router stolen, we did have a flash card stolen from 
one of our routers in a WCOM colo facility (most customers in open relay 
racks).  It was right after they'd upgraded the doors to the colo area 
from simplex locks to card access.  I was pissed for quite some time that 
WCOM knew who was in there (due to the card access system), but refused to 
tell us.  I figured it was probably one of their own people.


--
 Jon Lewis, MCP :)   |  I route
 StackPath, Sr. Neteng   |  therefore you are
_ http://www.lewis.org/~jlewis/pgp for PGP public key_


Re: Famous operational issues

2021-02-16 Thread Sabri Berisha
- On Feb 16, 2021, at 2:08 PM, Jared Mauch ja...@puck.nether.net wrote:

Hi,

> I was thinking about how we need a war stories nanog track. My favorite was
> being on call when the router was stolen.

Wait... what? I would love to listen to that call between you and your manager.

But, here is one for you then. I was once called to a POP where one of our main
routers was down. Due to political reasons, my access had been revoked. My 
manager told me to do whatever I needed to do to fix the problem, he would cover
my behind. I did, and I "gently" removed the door. My manager held word.

Another interesting one: entering a pop to find it flooded. Luckily there were
raised floors with only fiber underneath the floor panels. The NOC ignored the
warnings because "it was impossible for water to enter the building as it was
not raining". Yeah, but water pipes do burst from time to time.

But my favorite was pressing an undocumented combination of keys on a fire
alarm system which set off the Inergen protection without warning, immediately.
The noise and pressure of all that air entering the datacenter space with me
still in it is something I will never forget. Similar to the response of my
manager who, instead of asking me if I was ok, decided to try and light a piece
of paper. "Oh wow, it does work, I can't set anything on fire".

All of this was, obviously, in the late 1990s and early 2000s. These days,
things are -slightly- more professional.

Thanks,

Sabri


Re: Famous operational issues

2021-02-16 Thread Jared Mauch
I was thinking about how we need a war stories nanog track. My favorite was 
being on call when the router was stolen. 

Sent from my TI-99/4a

> On Feb 16, 2021, at 2:40 PM, John Kristoff  wrote:
> 
> Friends,
> 
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
> 
> Which examples would make up your top three?
> 
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
> 
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
> 
> Thanks in advance for your suggestions,
> 
> John


Re: Famous operational issues

2021-02-16 Thread Justin Streiner
Would this also extend to intentional actions that may have had unintended
consequences, such as provider A intentionally de-peering provider B, or
the monopoly telco for $country cutting itself off from the rest of the
global Internet for various reasons (technical, political, or otherwise)?

That said, I'd still have to stick with AS7007, the Baltimore tunnel fire,
and 9/11 as the most prominent examples of widespread issues/outages and
how those issues were addressed.

Honorable mention: $vendor BGP bugs, either due to $vendor ignoring the
relevant RFCs, implementing them incorrectly, or an outage exposed a design
flaw that the RFCs didn't catch.  Too many of those to list here :)

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>


Re: Famous operational issues

2021-02-16 Thread Jörg Kost

Oh well, MCI in 1999 was all about…
https://www.youtube.com/watch?v=7iM5nFNUG4U

On 16 Feb 2021, at 22:28, Sean Donelan wrote:


Since you said operational issues, instead of just outage...

How about MCI Worldcom's 10-day operational disaster in 1999.


http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
How not to handle a network outage

[...]
MCI WorldCom issued an alert to its sales force, which was given the 
option to deliver a notice to customers by e-mail, hand delivery or 
telephone – or not at all. After a deafening silence from company 
executives on the 10-day network outage, MCI WorldCom CEO Bernie 
Ebbers finally took the podium to discuss the situation. How did he 
explain the failure, and reassure customers that the network would not 
suffer such a failure in the future? He didn't. Instead, he blamed 
Lucent.

[...]


Re: Famous operational issues

2021-02-16 Thread Todd Underwood
There are all the hilarious leaks and blocks.

Pakistan blocks youtube and the announcement leaks internet-wide.
Turk telecom (AS9121 IIRC) leaks a full table out one of their providers.
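
Why a leaked block announcement can pull traffic away from the legitimate
origin comes down to longest-prefix match: a more-specific route wins over
the covering one. A minimal sketch, assuming documentation prefixes (RFC
5737) and made-up origin labels; the helper name best_route() is
hypothetical and this is not how any particular router implements lookup.

```python
import ipaddress


def best_route(table, destination):
    """Return the (network, origin) entry with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    candidates = [(net, origin) for net, origin in table if addr in net]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[0].prefixlen)


# Covering prefix announced by the legitimate origin (illustrative values).
table = [(ipaddress.ip_network("192.0.2.0/24"), "legitimate-origin")]

# A more-specific announcement leaks into the table.
leaked = table + [(ipaddress.ip_network("192.0.2.0/25"), "leaking-origin")]

if __name__ == "__main__":
    print(best_route(table, "192.0.2.10"))    # only the /24 matches
    print(best_route(leaked, "192.0.2.10"))   # the /25 is more specific
    print(best_route(leaked, "192.0.2.200"))  # outside the /25, /24 still used
```

Only destinations inside the more-specific prefix are diverted, which is
why such leaks tend to be announced as the smallest block that still gets
accepted.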

So many routing level incidents they're probably not even interesting any
more,  I suppose.

The huge power outages in the US northeast in 2003 (
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.998&rep=rep1&type=pdf)
were pretty decent.



On Tue, Feb 16, 2021 at 4:02 PM Damian Menscher via NANOG 
wrote:

> https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was
> an application-layer issue that affected the network layer.
>
> Damian
>
> On Tue, Feb 16, 2021 at 11:37 AM John Kristoff  wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps  the
>> most notorious and likely to top many lists including mine.  So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session.  I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective.  I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>


Re: Famous operational issues

2021-02-16 Thread Sean Donelan

Since you said operational issues, instead of just outage...

How about MCI Worldcom's 10-day operational disaster in 1999.


http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
How not to handle a network outage

[...]
MCI WorldCom issued an alert to its sales force, which was given the 
option to deliver a notice to customers by e-mail, hand delivery or 
telephone – or not at all. After a deafening silence from company 
executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers 
finally took the podium to discuss the situation. How did he explain the 
failure, and reassure customers that the network would not suffer such a 
failure in the future? He didn't. Instead, he blamed Lucent.

[...]


Re: Famous operational issues

2021-02-16 Thread Damian Menscher via NANOG
https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was an
application-layer issue that affected the network layer.

Damian

On Tue, Feb 16, 2021 at 11:37 AM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>


Re: Famous operational issues

2021-02-16 Thread Randy Bush
> actually, the 129/8 incident

a friend pointed out that it was the 128/9 incident

> but folk tend not to remember it

qed, eh?  :)


Re: Famous operational issues

2021-02-16 Thread Randy Bush
actually, the 129/8 incident was as damaging as 7007, but folk tend not
to remember it; maybe because it was a bit embarrassing

and the baltimore tunnel is a gift that gave a few times

and the quake/mudslides off taiwan

the tohoku quake was also fun, in some sense of the word

but the list of really damaging wet glass cuts is long


Re: Famous operational issues

2021-02-16 Thread Jörg Kost

Hi,

I don't want to classify and rate it, but would name 9/11.

You can read about the impacts on the list archives and there is also a 
presentation from NANOG '23 online.


Regards
Jörg

On 16 Feb 2021, at 20:37, John Kristoff wrote:


Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?



Re: Famous operational issues

2021-02-16 Thread Mikael Abrahamsson via NANOG

On Tue, 16 Feb 2021, John Kristoff wrote:


Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?


https://blogs.oracle.com/internetintelligence/longer-is-not-always-better

--
Mikael Abrahamsson    email: swm...@swm.pp.se

