Re: [uknof] Go daddy what happened
Hi Neil,

http://www.gossamer-threads.com/lists/nsp/outages/40837

Thomas

On 6 Oct 2012, at 05:29, Neil J. McRae n...@domino.org wrote:

But even if they didn't have RR, how do they get into a situation where a router starts switching in software? RR is a red herring in this failure scenario; even with full mesh this failure would still have happened. The root cause is that somewhere a wad of routes turned a lot of silicon into something useless. Does anyone know what kit this was?

Sent from my iPad

On 5 Oct 2012, at 22:40, Thomas Mangin thomas.man...@exa-networks.co.uk wrote:

http://inside.godaddy.com/inside-story-happened-godaddy-com-sept-10-2012/

Their conclusion about RR makes sense ...

Sent from my iPad
Re: [uknof] Go daddy what happened
On 6 Oct 2012, at 05:29, Neil J. McRae n...@domino.org wrote:

But even if they didn't have RR, how do they get into a situation where a router starts switching in software? RR is a red herring in this failure scenario; even with full mesh this failure would still have happened. The root cause is that somewhere a wad of routes turned a lot of silicon into something useless. Does anyone know what kit this was?

These sorts of designs are common in DC networks now, with increasing use of L3 to the edge. The key thing here is to keep your internet edge/core separated from your DC network.

Great preso here from Microsoft:
http://www.nanog.org/meetings/nanog55/abstracts.php?pt=MTk0MiZuYW5vZzU1nm=nanog55

--
Will Hargrave
+44 114 303
Re: [uknof] Go daddy what happened
Ah OK, now I understand Will's email; I didn't spot this reply :-) Your hypothesis seems very reasonable.

--
Neil J. McRae.
n...@domino.org

From: uknof-boun...@lists.uknof.org.uk [uknof-boun...@lists.uknof.org.uk] on behalf of Daniel Austin [dan...@kewlio.net]
Sent: 06 October 2012 07:54
To: uknof@lists.uknof.org.uk
Subject: Re: [uknof] Go daddy what happened

Hi,

On 06/10/2012 05:29, Neil J. McRae wrote:

But even if they didn't have RR, how do they get into a situation where a router starts switching in software? RR is a red herring in this failure scenario; even with full mesh this failure would still have happened. The root cause is that somewhere a wad of routes turned a lot of silicon into something useless. Does anyone know what kit this was?

I had a theory that they were using switches to route with a limited table, and accidentally pushed a full table to them.

When they say 210x normal routes... if they normally had around 2000 routes in the FIB, 210x this would be approximately a full table.

If they limited the route reflectors with a max-prefix setting, they could end up in a situation where their routers become islands.

These are the sorts of mistakes I'd expect from a new, inexperienced ISP, not someone the size of GoDaddy.

Thanks,

Dan.
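
For anyone wanting to sanity-check Dan's numbers, here is a rough sketch of the arithmetic plus a toy model of the max-prefix "island" case. Everything in it is an assumption made for illustration: the 2000-route baseline is Dan's guess, the full-table range is only a ballpark for the late-2012 IPv4 table, and the peer names (rr1, rr2) and the 10000 limit are made up. It assumes the common behaviour where a session that exceeds its max-prefix limit is torn down; it says nothing about what kit or config GoDaddy actually ran.

```python
# Back-of-the-envelope check of the "210x normal routes" theory.
# ASSUMPTIONS: the baseline is Dan's guess; the full-table range is a rough
# late-2012 IPv4 ballpark; peer names and limits below are hypothetical.

NORMAL_FIB_ROUTES = 2_000              # Dan's guess at the normal FIB size
MULTIPLIER = 210                       # "210x normal routes" from the thread
FULL_TABLE_RANGE = (400_000, 450_000)  # assumed ballpark, not a measured value


def routes_after_leak(normal_routes, multiplier):
    """Route count if the table grows by the given multiplier."""
    return normal_routes * multiplier


def looks_like_full_table(route_count, table_range=FULL_TABLE_RANGE):
    """True if the count falls inside the assumed full-table ballpark."""
    low, high = table_range
    return low <= route_count <= high


def surviving_sessions(prefixes_per_peer, max_prefix):
    """Toy model: any session advertising more than max_prefix is dropped."""
    return {peer: n for peer, n in prefixes_per_peer.items() if n <= max_prefix}


leaked = routes_after_leak(NORMAL_FIB_ROUTES, MULTIPLIER)
print(leaked, looks_like_full_table(leaked))  # 420000 True

# Both (hypothetical) route-reflector sessions carry the leaked table, so both
# exceed the limit and the router is left with no sessions at all: an island.
print(surviving_sessions({"rr1": leaked, "rr2": leaked}, max_prefix=10_000))  # {}
```

With a 2000-route baseline the 210x figure lands at 420000, which is inside the assumed range, so the "accidental full table" reading is at least arithmetically plausible.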