Re: Complain to your vendors (was Re: Did your BGP crash today?)

2010-09-01 Thread Neil J. McRae
Paul,

 Maybe the NANOG conference committee (or whatever its called) could get a
 couple of major router vendor gerbils to come to the next NANOG and talk to
 this issue?

 Maybe?

 Okay, I give up.

Recently I've been involved in some issues such as this working with
Alcatel Lucent and Cisco to jointly test how their routing protocols
interact with each other. As I think you try to point out, it was like
herding cats, pushing jelly up the wall, mowing the lawn with scissors
etc.

However one of the aspects that came out of this was that it required some
changes by service providers. Burgess from the RTG at Cisco has committed
to working with me and Alcatel to put together a presentation on this for
the NANOG community (hopefully it would be something that the PC would be
interested in). I doubt this will be ready for the next meeting but should
be for the one after.

If we allow vendors just to throw in the towel on these issues then it's
the service provider community that is to blame. In my view we have, bit by
bit and step by step, ended up in a very dark place. With our entire planet
now completely reliant on Internet and data networks, it's time for action.

Regards,
Neil.
--
Neil J. McRae -- Alive and Kicking.
n...@domino.org





Re: Did your BGP crash today?

2010-08-30 Thread Claudio Jeker
On Sun, Aug 29, 2010 at 10:12:35PM +0200, Thomas Mangin wrote:
  It would seem to me that there should actually be a better option, e.g.
  recognizing the malformed update, and simply discarding it (and sending the
  originator an error message) instead of resetting the session.
  
  Resetting of BGP sessions should only be done in the most dire of
  circumstances, to avoid a widespread instability incident.
 
 
 I had the same thought before giving up on it. 
 
 Negotiating a new error message could be a per peer option. BGP has
 capabilities for this exact reason.
 
 However to make sense you would need to find a resynchronisation point
 to only exclude the one faulty message. Initially I thought that the
 last received KEEPALIVE (for the receiver of the error message) could do
 - but you find yourselves with races conditions - so perhaps two
 KEEPALIVE back ?

Apart from one big vendor, most BGP speakers only send KEEPALIVEs when they
need to. So on my full feeds I see sessions running for more than 1 month
that have received fewer than 300 KEEPALIVE packets.

-- 
:wq Claudio



Re: Did your BGP crash today?

2010-08-30 Thread Thomas Mangin
 Apart from one big vendor most BGP speaker only send KEEPALIVES when they
 need to. So on my full feeds I see sessions running for more then 1 month
 which received less then 300 KEEPALIVE packets. 


The negotiated hold time is always the lower of the values presented by the two
routers. The default hold time for Cisco is 180 seconds and for Juniper 90.
So you should see a KEEPALIVE packet every minute from/to Cisco routers, and
one every 30 seconds between Junipers.
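
To make the arithmetic above concrete, here is a minimal Python sketch. It assumes the convention suggested in RFC 4271 of spacing KEEPALIVEs at one third of the negotiated hold time; the function names are invented for illustration and do not come from any vendor implementation.

```python
# Sketch only: names are hypothetical, not from any real BGP stack.

def negotiated_hold_time(local_offer: int, remote_offer: int) -> int:
    """Both peers use the smaller of the two offered hold times."""
    for offer in (local_offer, remote_offer):
        # Hold Time is a 16-bit field; it must be 0 or at least 3 seconds.
        if not 0 <= offer <= 0xFFFF or offer in (1, 2):
            raise ValueError(f"invalid hold time offer: {offer}")
    return min(local_offer, remote_offer)


def keepalive_interval(hold_time: int) -> float:
    """Assumed spacing: one third of the hold time (0 means no keepalives)."""
    return hold_time / 3


for a, b in [(180, 180), (90, 90), (180, 90)]:
    hold = negotiated_hold_time(a, b)
    print(f"offers {a}/{b}: hold {hold}s, keepalive every "
          f"{keepalive_interval(hold):.0f}s")
```

Note that this only gives the maximum quiet period a speaker should allow; it says nothing about which message type has to fill it.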

Should a BGP speaker not see any KEEPALIVE during $HOLDTIME, it will tear
the session down.
You are telling me that your effective holdtime is 2592000 seconds when the
HOLDTIME field is 16 bits ... hum ...
http://www.faqs.org/rfcs/rfc4271.html section 4.2

So unless you know something I don't, I believe you are totally mistaken :)

Thomas




Re: Did your BGP crash today?

2010-08-30 Thread Daniel Verlouw
On Mon, 2010-08-30 at 10:58 +0200, Thomas Mangin wrote:
 http://www.faqs.org/rfcs/rfc4271.html section 4.2
 
 So unless you know something I don't, I believe you are totally mistaken :)

updates serve as implicit keepalives.

in that same section:

Hold Time:

The calculated value indicates the maximum number of
 seconds that may elapse between the receipt of successive
 KEEPALIVE and/or UPDATE messages from the sender.

also check section 6.5:

If a system does not receive successive KEEPALIVE, UPDATE, and/or
   NOTIFICATION messages [...]
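
A rough way to see why a busy full feed needs so few KEEPALIVEs is sketched below in Python. It assumes a speaker that only sends a KEEPALIVE when it has sent nothing else for a whole keepalive interval; that policy, the function name and the timings are assumptions made for this illustration, not a description of any particular implementation.

```python
def keepalives_needed(update_times, duration, interval):
    """Count the KEEPALIVEs sent over `duration` seconds by a speaker that
    only sends one after `interval` seconds without any other message."""
    sent = 0
    last_tx = 0
    for t in sorted(update_times) + [duration]:
        # Fill any quiet gap before the next UPDATE with KEEPALIVEs.
        while t - last_tx > interval:
            last_tx += interval
            sent += 1
        last_tx = t  # the UPDATE itself restarts the peer's hold timer
    return sent


hour = 3600
print(keepalives_needed([], hour, 30))                    # idle session
print(keepalives_needed(range(0, hour, 10), hour, 30))    # steady churn
```

With an UPDATE going out every 10 seconds the speaker in this model never has to send a KEEPALIVE at all, while an idle session needs one per interval.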

  --Daniel




Re: Did your BGP crash today?

2010-08-30 Thread Thomas Mangin
 On Mon, 2010-08-30 at 10:58 +0200, Thomas Mangin wrote:
 http://www.faqs.org/rfcs/rfc4271.html section 4.2
 
 So unless you know something I don't, I believe you are totally mistaken :)
 
 updates serve as implicit keepalives.

Rule #1: do not post when you are not awake yet, and do not quote the text
which tells you that you are wrong .. broken :p

Thank you Claudio for showing me why it would not work.

Thomas




Re: Did your BGP crash today?

2010-08-30 Thread Pierre Francois


Thomas,

Wouldn't the confusion come from the fact that updates are considered as 
keepalives, so that Claudio sees so few type 4 messages because he receives 
updates ?


Sec 4.2, Hold Time :

The calculated value indicates the maximum number of
seconds that may elapse between the receipt of successive
KEEPALIVE and/or UPDATE messages from the sender.

Regards,

Pierre.









Re: Did your BGP crash today?

2010-08-30 Thread Jack Bates


Florian Weimer wrote:

This whole thread is quite schizophrenic because the consensus appears
to be that (a) a *researcher is not to blame* for sending out a BGP
message which eventually leads to session resets, and (b) an
*implementor is to blame* for sending out a BGP messages which
eventually leads to session resets.  You really can't have it both
ways.



As good a place to break in on the thread as any, I guess. Randy and 
others believe more testing should have been done. I'm not completely 
sure they didn't test against XR. They very likely could have tested in 
a 1 on 1 connection and everything looked fine.


I don't know the full details, but at what point did the corruption 
appear, and was it visible? We know that it was corrupt on the output 
which caused peer resets, but was it necessarily visible in the router 
itself?


Do we require a researcher to set up a chain of every vendor's BGP speaker
in every possible configuration and order to verify a bug doesn't cause 
things to break? In this case, one very likely would need an XR 
receiving and transmitting updates to detect the failure, so no less 
than 3 routers with the XR in the middle.


What about individual configurations? Perhaps the update is received and 
altered by one vendor due to specific configurations, sent to the next 
vendor, accepted and altered (due to the first alteration, whereas it
wouldn't be altered if the original update had been received) which 
causes the next vendor to reset. Then we add to this that it may pass 
silently through several middle vendor routers without problems and we 
realize the scope of such problems and why connecting to the Internet is 
so unpredictable.



Jack



Re: Did your BGP crash today?

2010-08-30 Thread Kevin Oberman
 Date: Mon, 30 Aug 2010 10:55:03 -0500
 From: Jack Bates jba...@brightok.net
 
 
 Florian Weimer wrote:
  This whole thread is quite schizophrenic because the consensus appears
  to be that (a) a *researcher is not to blame* for sending out a BGP
  message which eventually leads to session resets, and (b) an
  *implementor is to blame* for sending out a BGP messages which
  eventually leads to session resets.  You really can't have it both
  ways.
  
 
 As good a place to break in on the thread as any, I guess. Randy and 
 others believe more testing should have been done. I'm not completely 
 sure they didn't test against XR. They very likely could have tested in 
 a 1 on 1 connection and everything looked fine.
 
 I don't know the full details, but at what point did the corruption 
 appear, and was it visible? We know that it was corrupt on the output 
 which caused peer resets, but was it necessarily visible in the router 
 itself?
 
 Do we require a researcher to setup a chain of every vender BGP speaker 
 in every possible configuration and order to verify a bug doesn't cause 
 things to break? In this case, one very likely would need an XR 
 receiving and transmitting updates to detect the failure, so no less 
 than 3 routers with the XR in the middle.
 
 What about individual configurations? Perhaps the update is received and 
 altered by one vendor due to specific configurations, sent to the next 
 vendor, accepted and altered (due to the first alteration, where as it 
 wouldn't be altered if the original update had been received) which 
 causes the next vendor to reset. Then we add to this that it may pass 
 silently through several middle vendor routers without problems and we 
 realize the scope of such problems and why connecting to the Internet is 
 so unpredictable.

The only way they could have caught this one was to have tested against a
CRS which had another router to which it was announcing the attribute in
a malformed packet. Worse, the resets would just keep happening, as the
CRS would still have the route with the unknown attribute, which would
just generate another malformed update and cause the session to reset
again.

While it may be possible to recover from something like this, it sure
would not be easy.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: ober...@es.net  Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751



Re: Did your BGP crash today?

2010-08-30 Thread Mike Tancsa

At 12:40 PM 8/30/2010, Kevin Oberman wrote:


This only way they could have caught this one was to have tested to a
CRS which had another router to which it was announcing the attribute in
a mal-formed packet. Worse, the resets should just keep happening as the
CRS would still have the route with the unknown attribute which would
just generate another malformed update to cause the session to reset
again.

While it may be possible to recover from something like this, it sure
would not be easy.



We experienced something like this a year ago on a couple of Quagga
boxes. At least we had source code to go through and the resources to
make use of that source code to find the problem and implement a
quick workaround.  It's for situations like this that debug logging
is ohhh so important.


What did people do in this case to identify the issue?  Did you just
pass it off to your vendor, or did anyone try to diagnose it locally?
If so, what did you do?



---Mike






Mike Tancsa,  tel +1 519 651 3400
Sentex Communications,            m...@sentex.net
Providing Internet since 1994     www.sentex.net
Cambridge, Ontario Canada www.sentex.net/mike




Re: Did your BGP crash today?

2010-08-30 Thread Gary Buhrmaster
On Mon, Aug 30, 2010 at 15:55, Jack Bates jba...@brightok.net wrote:

...
 As good a place to break in on the thread as any, I guess. Randy and others
 believe more testing should have been done. I'm not completely sure they
 didn't test against XR. They very likely could have tested in a 1 on 1
 connection and everything looked fine.

 I don't know the full details, but at what point did the corruption appear,
 and was it visible? We know that it was corrupt on the output which caused
 peer resets, but was it necessarily visible in the router itself?

 Do we require a researcher to setup a chain of every vender BGP speaker in
 every possible configuration and order to verify a bug doesn't cause things
 to break? In this case, one very likely would need an XR receiving and
 transmitting updates to detect the failure, so no less than 3 routers with
 the XR in the middle.

 What about individual configurations? Perhaps the update is received and
 altered by one vendor due to specific configurations, sent to the next
 vendor, accepted and altered (due to the first alteration, where as it
 wouldn't be altered if the original update had been received) which causes
 the next vendor to reset. Then we add to this that it may pass silently
 through several middle vendor routers without problems and we realize the
 scope of such problems and why connecting to the Internet is so
 unpredictable.

I am not aware that anyone has provided the complete details at
this point which would include any test plans that may have been
performed.  From what I have been able to discern, it does seem
likely that a test plan that would have caught this almost had to
know of the specific issue in advance.  More testing would have
been better, but there is just too much variability out there to
assure you can do a complete test.

I am also not aware that the introduction of the attribute was
announced to the usual operational lists in advance just in
case (Ok, in this case, I mean NANOG).  This, in my mind,
is actually the bigger faux pas.  An Oh S*** moment has
happened to most of us.  It probably will happen again to
many of us.  But letting people know in advance of scheduled
changes is the important thing.

I would hope that in the future researchers will commit to
test plans to (at least) all the major vendor BGP speakers
(which, I admit, would likely not have caught this issue),
and that before introducing such new attributes into the
Internet, they would announce it to the usual operational
lists, again, just in case.  But my hopes are often dashed.

Gary



Re: Did your BGP crash today?

2010-08-29 Thread Mikael Abrahamsson

On Sat, 28 Aug 2010, Brett Frankenberger wrote:

The implementor is to blame because the code he wrote sent out BGP
messages which were not properly formed.


People talk about not dropping sessions but instead dropping malformed
messages. This is not safe. We've seen ISIS (which is TLV based and *can*
drop individual messages) wrongly implemented, with platforms dropping the
entire ISIS *packet* instead of the individual message when seeing
something malformed (or rather in this case, ISIS multi topology which the
implementation didn't understand), and this made the link state database
go out of sync and miss information for things it actually should have
understood.
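
The difference between the two behaviours can be sketched in a few lines of Python. This is a toy TLV parser (1-byte type, 1-byte length), not a real IS-IS implementation; the set of "known" type codes and the sample bytes are arbitrary placeholders chosen for the example.

```python
def parse_tlvs(data: bytes):
    """Split a byte string into (type, value) pairs."""
    tlvs, offset = [], 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = data[offset], data[offset + 1]
        value = data[offset + 2:offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        tlvs.append((tlv_type, value))
        offset += 2 + length
    return tlvs


KNOWN_TYPES = {1, 22}  # hypothetical: the TLV types this speaker understands


def accept_per_tlv(data: bytes):
    """Intended behaviour: skip only the TLVs that are not understood."""
    return [(t, v) for t, v in parse_tlvs(data) if t in KNOWN_TYPES]


def accept_all_or_nothing(data: bytes):
    """Faulty behaviour: one unknown TLV discards the whole packet."""
    tlvs = parse_tlvs(data)
    return tlvs if all(t in KNOWN_TYPES for t, _ in tlvs) else []


packet = bytes([1, 2, 0xAA, 0xBB, 222, 1, 0x00, 22, 1, 0x07])
print(accept_per_tlv(packet))         # keeps the two understood TLVs
print(accept_all_or_nothing(packet))  # silently loses them as well
```

Neither function reports anything to the sender, which is the point being made: the second one loses valid information without anyone noticing.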


This was *silent* error/corruption. I'm not sure I prefer to have silent
problems instead of tearing down the session, which is definitely
noticeable.


--
Mikael Abrahamsson    email: swm...@swm.pp.se



Re: Did your BGP crash today?

2010-08-29 Thread Randy Bush
 This was *silent* error/corruption. I'm not sure I prefer to have
 silent problems instead of tearing down the session which is
 definitely noticable.

i call the silent fix do-gooder software.  it means to do good.  when
it works, nobody notices or says thanks.  when it fails, there is hell
to pay.

randy



Re: Did your BGP crash today?

2010-08-29 Thread Paul Ferguson
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Sun, Aug 29, 2010 at 12:23 AM, Mikael Abrahamsson swm...@swm.pp.se
wrote:

 On Sat, 28 Aug 2010, Brett Frankenberger wrote:

 The implementor is to blame becuase the code he wrote send out BGP
 messages which were not properly formed.

 People talk about not dropping sessions but instead dropping malformed
 messages. This is not safe. We've seen ISIS (which is TLV based and *can*
 drop individual messages) been wrongly implemented and platforms drop the
 entire ISIS *packet* instead of the individual message when seeing
 something malformed (or rather in this case, ISIS multi topology which
 the
 implementation didn't understand), and this made the link state database
 go out of sync and miss information for things it actually should have
 understood.

 This was *silent* error/corruption. I'm not sure I prefer to have silent
 problems instead of tearing down the session which is definitely
 noticable.


It would seem to me that there should actually be a better option, e.g.
recognizing the malformed update, and simply discarding it (and sending the
originator an error message) instead of resetting the session.

Resetting of BGP sessions should only be done in the most dire of
circumstances, to avoid a widespread instability incident.

- - ferg

-BEGIN PGP SIGNATURE-
Version: PGP Desktop 9.5.3 (Build 5003)

wj8DBQFMegyGq1pz9mNUZTMRAr6tAKDHDZk2/Yk3bHNKTvCJeniTCEdPvwCg0zhk
HX/E0XsFOIURWI8UlfpM2Ms=
=PSz3
-END PGP SIGNATURE-



-- 
Fergie, a.k.a. Paul Ferguson
 Engineering Architecture for the Internet
 fergdawgster(at)gmail.com
 ferg's tech blog: http://fergdawg.blogspot.com/



Complain to your vendors (was Re: Did your BGP crash today?)

2010-08-29 Thread Adrian Chadd
Guys/girls/furry-creatures-from-!Earth,

Complaining on nanog-ml is likely to only achieve personal stress relief.

This is something you should bring up with your vendor. Say that you'll
move vendors if they don't start making better BGP implementations and
adding the features you guys want. Make the list of better features
open, public, and actively solicit alternatives. Follow up on your threat.
This is your business bottom line after all.

Don't just use it as a reason to get lower prices from your current vendor
and then continue complaining when dumb crap like this occurs.

It would be great if vendor(s) participated in a public interoperability
test suite where researchers could test their stuff against it before
unleashing it on the public internet. I'd love to see something public
-and- cross institutional, -and- include access to things like CRS-level
equipment.

Go on, I dare you. :)

2c,


Adrian




Re: Complain to your vendors (was Re: Did your BGP crash today?)

2010-08-29 Thread Paul Ferguson
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Sun, Aug 29, 2010 at 12:35 AM, Adrian Chadd adr...@creative.net.au
wrote:

 Guys/girls/furry-creatures-from-!Earth,

 Complaining on nanog-ml is likely to only achieve personal stress relief.

 This is something you should bring up with your vendor. Say that you'll
 move vendors if they don't start making better BGP implementations and
 adding the features you guys want. Make the list of better features
 open, public, and actively solicit alternatives. Follow up on your
 threat. This is your business bottom line after all.

 Don't just use it as a reason to get lower prices from your current
 vendor and then continue complaining when dumb crap like this occurs.

 It would be great if vendor(s) participated in a public interoperability
 test suite where researchers could test their stuff against it before
 unleashing it on the public internet. I'd love to see something public
 -and- cross institutional, -and- include access to things like CRS-level
 equipment.

 Go on, I dare you. :)


Maybe the NANOG conference committee (or whatever its called) could get a
couple of major router vendor gerbils to come to the next NANOG and talk to
this issue?

Maybe?

Okay, I give up.

- - ferg

-BEGIN PGP SIGNATURE-
Version: PGP Desktop 9.5.3 (Build 5003)

wj8DBQFMeg7Uq1pz9mNUZTMRAtLzAJwNzJMf4YwjP9C42CFANvESJCVoDQCg9trZ
lS5Wd5kpH27JBLKkDhibIOg=
=fdTs
-END PGP SIGNATURE-


-- 
Fergie, a.k.a. Paul Ferguson
 Engineering Architecture for the Internet
 fergdawgster(at)gmail.com
 ferg's tech blog: http://fergdawg.blogspot.com/



Re: Did your BGP crash today?

2010-08-29 Thread Dobbins, Roland

On Aug 29, 2010, at 2:30 PM, Paul Ferguson wrote:

 It would seem to me that there should actually be a better option, e.g. 
 recognizing the malformed update, and simply discarding it (and sending the 
 originator an error message) instead of resetting the session.

Generation of the error message should probably have a user toggle.

---
Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com

Injustice is relatively easy to bear; what stings is justice.

-- H.L. Mencken






Re: Did your BGP crash today?

2010-08-29 Thread Brett Frankenberger
On Sun, Aug 29, 2010 at 12:30:21AM -0700, Paul Ferguson wrote:
 
 It would seem to me that there should actually be a better option, e.g.
 recognizing the malformed update, and simply discarding it (and sending the
 originator an error message) instead of resetting the session.
 
 Resetting of BGP sessions should only be done in the most dire of
 circumstances, to avoid a widespread instability incident.

The only thing you know for sure when you receive a malformed update
is that the router on the other end of the connection is broken (or
that there's something in between the other router and you that is
corrupting messages, but for the purposes of this, that's essentially
the same thing).

Accepting information received from a router known to be broken, and
then passing that on to other routers, is a bad idea and something that
could lead to a widespread instability incident.  Of course, in theory,
you discard the bad updates and only pass on the good updates, but
doing that relies on the assumption that the known-to-be-broken router
on the other end of the connection is broken in such a way that ensures
that all the corrupted messages it sends will be recognizable as
malformed and can be discarded.  There's plenty of corruption that
can't be detected on the receiving end.

On top of that, there are problems with being out of sync with the router
on the other end.  For example, suppose a router developed a condition
that caused it to malform all withdraw messages (or, more precisely,
all UPDATE messages where the Withdrawn Routes Length field is
non-zero).  If we implement what you suggest above, then we'll accept
all the advertisements from that router, but ignore all the withdraws,
and end up sending that router a bunch of traffic that it won't
actually be able to handle.
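
The withdraw example can be played through with a toy model in Python. Everything here is a simplification assumed for illustration: an UPDATE is reduced to a (kind, prefix, malformed) tuple, the prefixes are documentation ranges, and a session reset is modelled as simply flushing everything learned from the peer.

```python
def routes_believed(updates, discard_malformed: bool):
    """Return the prefixes the receiver thinks the peer can still reach."""
    rib = set()
    for kind, prefix, malformed in updates:
        if malformed:
            if discard_malformed:
                continue      # proposed behaviour: drop just this message
            return set()      # current behaviour: reset flushes all routes
        if kind == "announce":
            rib.add(prefix)
        else:
            rib.discard(prefix)
    return rib


updates = [
    ("announce", "192.0.2.0/24", False),
    ("announce", "198.51.100.0/24", False),
    ("withdraw", "192.0.2.0/24", True),  # the peer's bug malforms withdraws
]

print(routes_believed(updates, discard_malformed=True))
print(routes_believed(updates, discard_malformed=False))
```

With discarding enabled the receiver keeps 192.0.2.0/24 even though the peer tried to withdraw it, so traffic for that prefix keeps flowing to a router that can no longer deliver it.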

 -- Brett



Re: Did your BGP crash today?

2010-08-29 Thread Bjørn Mork
Richard A Steenbergen r...@e-gerbil.net writes:

 Just out of curiosity, at what point will we as operators rise up 
 against the ivory tower protocol designers at the IETF and demand that 
 they add a mechanism to not bring down the entire BGP session because of 
 a single malformed attribute? Did I miss the memo about the meeting? 

I guess you did.

http://tools.ietf.org/html/draft-ietf-idr-optional-transitive-02


Bjørn



Re: Did your BGP crash today?

2010-08-29 Thread William Allen Simpson

On 8/29/10 3:23 AM, Mikael Abrahamsson wrote:

On Sat, 28 Aug 2010, Brett Frankenberger wrote:


The implementor is to blame becuase the code he wrote send out BGP messages 
which were not properly formed.


People talk about not dropping sessions but instead dropping malformed 
messages. This is not safe. We've seen ISIS (which is TLV based and *can* drop 
individual messages) been wrongly implemented and platforms drop the entire 
ISIS *packet* instead of the
individual message when seeing something malformed (or rather in this case, 
ISIS multi topology which the implementation didn't understand), and this made 
the link state database go out of sync and miss information for things it 
actually should have
understood.


Reminder: TCP itself has also been wrongly implemented with horrid 
consequences.

Unknown TCP options are supposed to be silently discarded.  Instead, some
middlebox vendors simply copy them into the return packet.

There are some circumstances where it makes sense to silently discard one TLV
option, and others where it makes sense to discard the whole packet, and still
others where it makes sense to drop the session.

A problem is that many of the early designers (BGP is a fairly early design)
used one-size-fits-all error handling.

There's not much anybody can do about bad implementation (as in this case)
that corrupts data.  But a lot more thought needs to go into error recovery!



This was *silent* error/corruption. I'm not sure I prefer to have silent 
problems instead of tearing down the session which is definitely noticable.


Personally, I've usually advocated returning an error message.  Many of the
protocols I've developed use this approach.

(Please forgive RADIUS, which for some odd reason is my most frequently cited
work according to Google.  My original draft had a Reject, subsequent WG
activity took it away.  All I could do is throw up my hands and walk away.)



Re: Did your BGP crash today?

2010-08-29 Thread Joel Jaeggli
On 8/29/10 9:31 AM, Bjørn Mork wrote:
 Richard A Steenbergen r...@e-gerbil.net writes:
 
 Just out of curiosity, at what point will we as operators rise up 
 against the ivory tower protocol designers at the IETF and demand that 
 they add a mechanism to not bring down the entire BGP session because of 
 a single malformed attribute? Did I miss the memo about the meeting? 
 
 I guess you did.
 
 http://tools.ietf.org/html/draft-ietf-idr-optional-transitive-02

rfc 4893 (4 octet as numbers) leverages the assumption that you can send
the as4_path attribute and that even routers that don't understand it
will forward it.

given that 4 byte as numbers exist in the internet and many non-4byte
aware routers exist, that seems like a reasonable assumption.
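
For reference, the forwarding behaviour being relied on here follows from the attribute flag bits in RFC 4271. The Python sketch below shows the decision for an attribute whose type code the router does not recognise; the function name and the returned action strings are made up for this example.

```python
# Attribute flag bits from RFC 4271 (high-order bits of the flags octet).
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20


def handle_unrecognized(flags: int):
    """Decide what to do with an attribute of unknown type.

    Returns (action, flags_to_propagate); the action names are illustrative.
    """
    if not flags & OPTIONAL:
        # An unknown well-known attribute is an error (NOTIFICATION).
        return ("notify", None)
    if flags & TRANSITIVE:
        # Keep it, mark it Partial, and pass it on to other speakers.
        return ("forward", flags | PARTIAL)
    # Optional non-transitive: quietly ignored, not passed on.
    return ("ignore", None)


as4_path_flags = OPTIONAL | TRANSITIVE
print(handle_unrecognized(as4_path_flags))
```

A router that has never heard of as4_path therefore still carries it along, which is also why a malformed optional transitive attribute can travel several hops before something chokes on it.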

 
 Bjørn
 
 




Re: Did your BGP crash today?

2010-08-29 Thread Joel Jaeggli
On 8/27/10 1:07 PM, Mike Gatti wrote:
 where's the change management process in all of this. 
 basically now we are going to starting changing things that can 
 potentially have an adverse affect on users without letting anyone know
 before hand  Interesting concept.

BGP is transitive, change management is not. you have a change
management process, your peer might integrate into that but have their
own, your peer's peers almost certainly do not.

Every time a wet-behind-the-ears network engineer connects a bgp speaker
to the edge of the network it's an experiment in the stability of
the Internet.

This on the face of it seems like a quite reasonable experiment, which
should have worked, except that it happened to tickle a bug...


 On Aug 27, 2010, at 3:33 PM, Dave Israel wrote:
 

 On 8/27/2010 3:22 PM, Jared Mauch wrote:
 When you are processing something, it's sometimes hard to tell if something
 just was mis-parsed (as I think the case is here with the missing-2-bytes)
 vs just getting garbage.  Perhaps there should be some way to re-sync when
 you are having this problem, or a parallel keepalive path similar to
 MACA/MCAS/MIDCAS/TCAS between the devices to talk when something bad is
 happening.

 I know it wasn't there originally, and isn't mandatory now, but there is
 an MD5 hash that can be added to the packet.  If the TCP hash checks
 out, then you know the packet wasn't garbled, and just contained
 information you didn't grok.  That seems like enough evidence to be able
 to shrug and toss the packet without dropping the session.

 -Dave



 
 =+=+=+=+=+=+=+=+=+=+=+=+=
 Mike Gatti  
 ekim.it...@gmail.com
 =+=+=+=+=+=+=+=+=+=+=+=+=
 
 
 
 
 




Re: Did your BGP crash today?

2010-08-29 Thread Thomas Mangin
 It seems that creating a worst case BGP test suite for all kinds of nastiness 
 (in light of the recent RIPE thing) might not be a bad idea - so that we can 
 all test the implementation ourselves before we deploy new code.

Normally those things are done by vendors - that's what we pay them good money
for in software updates and support.
You need a fully meshed network of all the vendors, as it would seem that you
need to check the outgoing packet as well as making sure the router is not
choking on the packet.
Definitely no rocket science, but still quite some work for something which
should not be the end-user's problem.

Thomas




Re: Did your BGP crash today?

2010-08-29 Thread Thomas Mangin
 It would seem to me that there should actually be a better option, e.g.
 recognizing the malformed update, and simply discarding it (and sending the
 originator an error message) instead of resetting the session.
 
 Resetting of BGP sessions should only be done in the most dire of
 circumstances, to avoid a widespread instability incident.


I had the same thought before giving up on it. 

Negotiating a new error message could be a per peer option. BGP has 
capabilities for this exact reason.

However, to make sense you would need to find a resynchronisation point to only
exclude the one faulty message. Initially I thought that the last received
KEEPALIVE (for the receiver of the error message) could do - but you find
yourself with race conditions - so perhaps two KEEPALIVEs back?
Each TCP packet can contain multiple messages, so the messages would then have
to be split and ACKed individually to find the faulty one. EOR could be used
for that purpose.

Still, it adds lots of complexity to the conversation - are we not going to
introduce bugs in that little used and little tested code path as well?
Unless you have a new ACK capability for each message - another idea, but
those are clearly discussions for outside NANOG.

Thomas






Re: Did your BGP crash today?

2010-08-29 Thread James Hess
On Sun, Aug 29, 2010 at 3:12 PM, Thomas Mangin
thomas.man...@exa-networks.co.uk wrote:
 However to make sense you would need to find a resynchronisation point to 
 only exclude the one faulty message. Initially I thought that the last 
 received KEEPALIVE (for the receiver of the error message) could do - but you 
 find yourselves with races conditions - so perhaps two KEEPALIVE back ?
 Each TCP packet can contain multiple message, so the messages would have to 
 be then split and ACK individually to find the faulty one and then ACK 
 individually. EOR could be used for that purpose.

Every BGP message header starts with 16 all-bits-1 octets, for
compatibility. This is distinctive enough that an implementation can guess
where the next message starts.
However, suppose you have an attacker. Say, for example, a BGP speaker
passes on too short a length value for an attribute, and the attacker knows
what length will be sent instead of the right one.

The attacker places an entry into the data portion that will appear to the
other peer to be the rest of the malformed update. Result: the malformed
update is received and appears to be perfectly valid.
The next thing the attacker inserts into the data portion of the
attribute is the 16 all-bits-1 octets, BGP header, update message,
and their malicious update.

This will appear properly formed, when the buggy BGP speaker sends it.
As far as the buggy BGP speaker is concerned,  it has propagated 1 route update.

As far as the buggy BGP speaker's other peers are concerned, they
have received 3 messages from the buggy speaker:
* The update completed in the attribute data section. (This is
  malformed, but intentionally not detectable as malformed.)
* The maliciously injected route. (This isn't supposed to exist.
  The buggy speaker is unaware of its existence; there is a
  disagreement between peers about how the message is interpreted.)
* A malformed message that does not make any sense.

If the injection were perfect, nothing would be detectable as malformed.
But alas, the attacker does not know exactly what other attributes or
prepending the buggy router will add to the message before passing it on.
They could work this out through trial and error; however, some admin
will hopefully notice all the CEASEs before the attacker achieves
complete success.
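
A much simplified Python sketch of the framing problem follows. It is not the exact scenario above (no buggy length field is modelled); it only shows that a receiver which re-synchronises by scanning for the 16-octet marker can be steered onto bytes that an attacker placed inside an attribute value. The payload strings and helper names are invented for the example.

```python
import struct

MARKER = b"\xff" * 16
HEADER_LEN = 19            # marker (16) + length (2) + type (1)
UPDATE, KEEPALIVE = 2, 4


def build(msg_type: int, body: bytes = b"") -> bytes:
    """Frame a body as a BGP message (header only, body left opaque)."""
    return MARKER + struct.pack("!HB", HEADER_LEN + len(body), msg_type) + body


def parse_by_length(stream: bytes):
    """Normal framing: trust each header's length field."""
    msgs, off = [], 0
    while off < len(stream):
        if stream[off:off + 16] != MARKER:
            raise ValueError(f"lost framing at offset {off}")
        length, msg_type = struct.unpack("!HB", stream[off + 16:off + 19])
        msgs.append((msg_type, stream[off + 19:off + length]))
        off += length
    return msgs


def resync_by_marker(stream: bytes, start: int) -> int:
    """Guess the next message boundary by scanning for the marker."""
    return stream.find(MARKER, start)


injected = build(UPDATE, b"attacker-route")
carrier = build(UPDATE, b"attr-data:" + injected)   # marker inside a value
stream = carrier + build(KEEPALIVE)

print(len(parse_by_length(stream)))        # framing by length: 2 messages
guess = resync_by_marker(stream, 1)        # receiver gave up on message 1
print(guess, len(carrier))                 # guessed vs. real boundary
print(parse_by_length(stream[guess:])[0])  # the injected UPDATE parses fine
```

The guessed boundary lands inside the first message, and what follows it is a perfectly well-formed UPDATE that the sending router never knew it was carrying.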


In this case, by the time the other speakers detect something as
malformed, the two preceding updates are already in the table, and
possibly even propagated further.
A CEASE rolls this back, by rolling back the entire session.

Peers could (perhaps) safely re-synchronize in this case if there
were an extension to partially roll back some of the updates
in a session and request a portion of the messages to be resent.

Or if an extension such as authentication is used to make it
impossible to inject BGP messages within the value of an attribute.
Through data quarantine: requiring all BGP speakers to disallow the
all-bits-1 sequence in any attribute value.

Or through peer-specific authentication mechanisms, or checksums and
digital signatures, in the message header portion of each BGP message.



--
-J



Re: Did your BGP crash today?

2010-08-29 Thread Randy Bush
 Every BGP message header has a portion that starts with  16
 all-bits-1  octets,  for compatibility.
 This is distinctive enough an implementation can guess where the next
 message starts.

i desperately feared reading this.  i do not want to bet the internet on
guessing where anything starts.

randy



Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
 I'm assuming that they weren't really expecting this to cause issues... Where 
 does one draw the line? I'm planning on announcing x.y.z.0/20 later in the 
 week -- x, y and z are all prime and the sum of all 3 is also a prime. There 
 is a non-zero chance that something somewhere will go flooie, shall I send 
 mail now or later?

In this case the researchers sent their peer a new packet that would never have 
been generated by any operational router before. They knew their packet would 
cause the router to run a less tested or untested code path in BGP. In 
their defence, the risk was low. 

That said, when I wrote my own BGP injector I accidentally sent badly formed 
known messages (like UPDATE, etc.) with bad attributes (like transitive when the 
RFC says it MUST NOT be, and vice versa) to my routers.
Juniper would kill the session at the validation stage and be quite verbose in 
the log, but Cisco - at least on the 7301 I tested last year with a then recent 
IOS - would accept the packet as is. Yep, IOS does accept INVALID packets.
I have no idea what happens after, but if a Cisco router is passing the packet 
to a Juniper router it could have the same effect as what we saw, again, and 
tear down a session which is not the one which initiated the badly formed 
packet. That said, I suspect that the message may not have been fully parsed, 
for performance reasons, with the outgoing packet partially generated following 
the RFC.
Quagga is even worse than Cisco when it comes to packet validation, but it 
should not surprise anyone :p

Now, should I research the described BGP behaviour (for a white hat conference, 
for example) and send my possibly risky packets to my peers without telling 
them?
Hell no! I am pretty sure that if I did I would lose quite a few sessions 
afterwards. People trust me not to abuse my BGP connections, and to send only 
safe known messages about my network and not some research stuff.

That said, vendors SHOULD research (and hopefully did) this kind of behaviour, 
but as yesterday showed, causing packet corruption through a router is bad for 
its stability :p

 Also, I would prefer that this gets discovered and dealt with (in this case 
 by stopping the announcement :-)) than having folk not willing to try things 
 and ending up with a weaponized version...

No argument here.

Thomas


Re: Did your BGP crash today?

2010-08-28 Thread Randy Bush
imiho, researchers injecting data into the control plane are responsible
to have tested it at least against major bgp speakers.  and, considering
its placement in the net (big core), i consider ios xr to be a major
speaker.

i suspect that these folk will test better next time.  i sure hope so.

randy



Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
On 28 Aug 2010, at 08:56, Randy Bush wrote:

 imiho, researchers injecting data into the control plane are responsible
 to have tested it at least against major bgp speakers.  and, considering
 its placement in the net (big core), i consider ios xr to be a major
 speaker.
 
 i suspect that these folk will test better next time.  i sure hope so.

Not sure the researcher can afford to buy a ios xr and may not have access to 
one !

Thomas 


Re: Did your BGP crash today?

2010-08-28 Thread Randy Bush
 i suspect that these folk will test better next time.  i sure hope so.
 Not sure the researcher can afford to buy a ios xr and may not have
 access to one !

then ask on *nog for someone against whom they can test.

randy



Re: Did your BGP crash today?

2010-08-28 Thread Saku Ytti
On (2010-08-28 09:22 +0100), Thomas Mangin wrote:

  i suspect that these folk will test better next time.  i sure hope so.
 
 Not sure the researcher can afford to buy a ios xr and may not have access to 
 one !

Indeed.

Also testing is hard, especially so when you essentially need to reinvent
the wheel every time, which might not even fit your time schedule. 

Maybe we as a community could build a 'BGPSpec' testing suite, simply a python
(or ruby yay!) script which has been taught at least to puke out UPDATEs
that are known to have broken implementations before. Test cases being unique
files for easy contribution.
This BGPSpec could then be run by vendors, researchers and operators, and
we could be sure that at least the same mistake is not made twice.
With this suite in place, it would be easier for a researcher to write a new
test case for the suite and then ask people to run it against their gear.
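
As a rough illustration of the "test cases being unique files" idea, here is a minimal Python sketch. Nothing in it comes from an existing project: the JSON layout, the field names `description` and `update_hex`, and the names `SpecCase`, `load_cases` and `write_example` are all hypothetical, and actually sending the bytes to a device under test is left out.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SpecCase:
    name: str          # taken from the file name
    description: str   # what broke, and where it was seen
    update: bytes      # raw BGP message to replay at the device under test


def load_cases(directory: Path) -> list[SpecCase]:
    """Load every *.json file in the directory as one test case."""
    cases = []
    for path in sorted(directory.glob("*.json")):
        raw = json.loads(path.read_text())
        cases.append(SpecCase(path.stem, raw["description"],
                              bytes.fromhex(raw["update_hex"])))
    return cases


def write_example(directory: Path) -> None:
    """Contribute one case: a minimal, empty UPDATE (placeholder payload)."""
    case = {
        "description": "placeholder: empty UPDATE, replace with a real reproducer",
        # marker (16 x ff) + length 0x0017 + type 2 + empty withdrawn/attribute lengths
        "update_hex": "ff" * 16 + "0017" + "02" + "00000000",
    }
    (directory / "empty-update.json").write_text(json.dumps(case, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    write_example(Path(tmp))
    cases = load_cases(Path(tmp))

for case in cases:
    print(case.name, len(case.update), case.description)
```

With one file per case, contributing a new reproducer would mean adding a single file and nothing else.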

From a global network security/reliability point of view BGP is pretty much
the only important protocol and as such maybe should enjoy special status in
collaborative quality assurance.

Considering this issue, late junos 32b ASN, mikrotik long AS path, this
http://www.cisco.com/en/US/products/products_security_advisory09186a0080094a58.shtml
and probably many others, it seems we've been exceptionally lucky that
someone hasn't been fuzzing Internet BGP with the target of breaking as much of
it as possible, as it wouldn't really have been that hard.

-- 
  ++ytti



Re: Did your BGP crash today?

2010-08-28 Thread bmanning
On Sat, Aug 28, 2010 at 09:22:34AM +0100, Thomas Mangin wrote:
 On 28 Aug 2010, at 08:56, Randy Bush wrote:
 
  imiho, researchers injecting data into the control plane are responsible
  to have tested it at least against major bgp speakers.  and, considering
  its placement in the net (big core), i consider ios xr to be a major
  speaker.
  
  i suspect that these folk will test better next time.  i sure hope so.
 
 Not sure the researcher can afford to buy a ios xr and may not have access to 
 one !
 
 Thomas 

while this is undoubtedly true for hobbyist researchers, there are
pretty good relationships between vendors and some research facilities 
with a strong interest in ensuring there is external review of the
code base(es).

(I am personally aware of at least five such facilities...:)

hence I am going to have to echo Randy's sentiments.  This was just
sloppy.

--bill



Re: Did your BGP crash today?

2010-08-28 Thread Randy Bush
 i suspect that these folk will test better next time.  i sure hope
 so.
 Not sure the researcher can afford to buy a ios xr and may not have
 access to one !
 Also testing is hard

so is cleaning up the mess when you screw up enough of the internet to
make the international press.

 Maybe we as community could build 'BGPSpec' testing suite, simply
 python (or ruby yay!) script which has been thought at least to puke
 out UPDATEs that have known to break implementations before. Test
 cases being unique files for easy contribution.

a bgp regression suite would not have caught this as it was not a
repeat.  but it sure would be useful to implementors.

randy



Re: Did your BGP crash today?

2010-08-28 Thread Mike


while this is undoubtedly true for hobbyist researchers, there are
pretty good relationships between vendors and some research facilities 
with a strong interest in ensuring there is external review of the
code base(es).

(I am personally aware of at least five such facilities...:)

hence I am going to have to echo Randy's sentiments.  This was just
sloppy.


  
I am really surprised by these attitudes. Guys (and gals), these 
incidents simply go to reinforce that the software we depend on has not 
received sufficient testing and that we all have gigantic exposures due 
to things outside of our direct control (eg: cisco, juniper or other 
router software quality control). You can't just demand that people 
don't do things that wind up being destructive to you on your production 
network; that's asking the world to be responsible for you. There are 
lots of bugs in this stuff and the sooner that we find out about them, 
the sooner we can get updates to address them and hopefully shorten the 
window in which those issues could be painful to us and cause us grief.






Re: Did your BGP crash today?

2010-08-28 Thread Saku Ytti
On (2010-08-28 18:20 +0900), Randy Bush wrote:

 a bgp regression suite would not have caught this as it was not a
 repeat.  but it sure would be useful to implementors.

Naturally 'proving' that non-trivial software works is practically
impossible. But stating what non-existing test-suite would or would not
have covered is not a topic I'm particularly interested to engage.


-- 
  ++ytti



Re: Did your BGP crash today?

2010-08-28 Thread Randy Bush
 I am really surprised by these attitudes. Guys (and gals), these
 incidents simply go to reinforce that the software we depend on, has
 not received sufficient testing and that we all have gigantic
 exposures due to things outside of our direct control

nice anti-vendor rant.  but over the last decades we the ops have asked
for a jillion features which creates massive code, and there is no hope
of testing all the code paths rigorously.  the vendors have large test
labs, and do what they can.  sure, they could do better.  so could we
all.

but it is also coders' responsibility, whether vendors or researchers or
hackers, to also test what they send.  in this case, clearly that was
not done sufficiently.

if i am sloppy in my receiving code, the pain is mine.  if you are
sloppy in your sending code, the pain is not yours.

randy



Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
 Quagga is even worse that Cisco when it comes to packet validation but it 
 should not surprise anyone :p

To substantiate my claim, my mercurial log tells me that for MP_REACH_NLRI and 
MP_UNREACH_NLRI, having the flag set as Transitive instead of Optional did not 
cause Quagga to complain. It just took the IPv4/IPv6 route.
Now it may have been fixed. I should really check and, if not, pass this to the 
quagga dev list. I am lazy.
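
For readers who want to see what such a check looks like, here is a minimal Python sketch of attribute flag validation. It is not Quagga's or any vendor's code; the table and the function name `check_flags` are illustrative. The Optional (0x80) and Transitive (0x40) bit values and the expected flags per attribute follow RFC 4271 and RFC 4760; the sketch deliberately ignores the Partial and Extended Length bits.

```python
from typing import Optional

OPTIONAL, TRANSITIVE = 0x80, 0x40

# type code -> (name, expected Optional/Transitive bits)
EXPECTED_FLAGS = {
    1: ("ORIGIN", TRANSITIVE),
    2: ("AS_PATH", TRANSITIVE),
    3: ("NEXT_HOP", TRANSITIVE),
    4: ("MULTI_EXIT_DISC", OPTIONAL),
    5: ("LOCAL_PREF", TRANSITIVE),
    6: ("ATOMIC_AGGREGATE", TRANSITIVE),
    7: ("AGGREGATOR", OPTIONAL | TRANSITIVE),
    14: ("MP_REACH_NLRI", OPTIONAL),
    15: ("MP_UNREACH_NLRI", OPTIONAL),
}


def check_flags(type_code: int, flags: int) -> Optional[str]:
    """Return an error string if a recognised attribute carries the wrong flags."""
    known = EXPECTED_FLAGS.get(type_code)
    if known is None:
        return None  # unrecognised attribute: there is nothing to compare against
    name, expected = known
    actual = flags & (OPTIONAL | TRANSITIVE)
    if actual != expected:
        return f"{name}: flags {actual:#04x}, expected {expected:#04x}"
    return None


# The case described above: MP_REACH_NLRI marked Transitive instead of Optional.
print(check_flags(14, TRANSITIVE))
print(check_flags(14, OPTIONAL))
```

A receiver that skips this comparison simply takes the route, which is the behaviour reported above.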

Thomas


Re: Did your BGP crash today?

2010-08-28 Thread Florian Weimer
* Christopher Morrow:

 (you are asking your vendors to run full bit sweeps of each protocol
 in a regimented manner checking for all possible edge cases and
 properly handling them, right?)

The real issue is that both spec and current practice say you need to
drop the session as soon as you encounter any unexpected data.  That's
just wrong---you can't really be sure that it's a temporary glitch
caused by your peer.  If it's not, you are unnecessarily taking
yourself off the net.

Of course, there is little you can do when the outer framing at the
internal BGP layer is wrong (resyncing is way too risky).  A tear-down
might be in order if you receive an unrecognized message type, too.
But a BGP update message which is malformed internally should just be
ignored.  From a theoretical point of view, it's no worse than the
operator configuring a prefix-list that filters out all the NLRI
entries.



Re: Did your BGP crash today?

2010-08-28 Thread Leen Besselink
On 08/28/2010 11:39 AM, Saku Ytti wrote:
 On (2010-08-28 18:20 +0900), Randy Bush wrote:

   
 a bgp regression suite would not have caught this as it was not a
 repeat.  but it sure would be useful to implementors.
 
 Naturally 'proving' that non-trivial software works is practically
 impossible. But stating what non-existing test-suite would or would not
 have covered is not a topic I'm particularly interested to engage.


   
I suggest the test-tool has 2 bgp-sessions and tests if what it put in
did or did not come out on the other side and in what shape or form.
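
A minimal Python sketch of the comparison step, assuming the tool already holds the path-attribute bytes it sent on one session and the bytes it received on the other. The session handling is left out, and the names `parse_attributes` and `compare` are hypothetical. The sketch also assumes that a change to the Partial bit is legitimate, since a speaker forwarding an unrecognised optional transitive attribute is expected to set it.

```python
import struct

PARTIAL, EXTENDED_LENGTH = 0x20, 0x10


def parse_attributes(data: bytes) -> dict[int, tuple[int, bytes]]:
    """Walk a path-attribute block and return {type_code: (flags, value)}."""
    attrs, pos = {}, 0
    while pos < len(data):
        header_len = 4 if data[pos] & EXTENDED_LENGTH else 3
        if pos + header_len > len(data):
            raise ValueError(f"truncated attribute header at offset {pos}")
        flags, type_code = data[pos], data[pos + 1]
        if header_len == 4:
            (length,) = struct.unpack("!H", data[pos + 2:pos + 4])
        else:
            length = data[pos + 2]
        pos += header_len
        if pos + length > len(data):
            raise ValueError(f"attribute {type_code} overruns the block")
        attrs[type_code] = (flags, data[pos:pos + length])
        pos += length
    return attrs


def compare(sent: bytes, received: bytes, watched: set[int]) -> list[str]:
    """Report how the watched attributes changed between input and output."""
    findings = []
    before, after = parse_attributes(sent), parse_attributes(received)
    for type_code in sorted(watched.intersection(before)):
        if type_code not in after:
            findings.append(f"attribute {type_code}: dropped")
            continue
        flags_in, value_in = before[type_code]
        flags_out, value_out = after[type_code]
        if (flags_in | PARTIAL) != (flags_out | PARTIAL):
            findings.append(
                f"attribute {type_code}: flags {flags_in:#04x} -> {flags_out:#04x}")
        if value_in != value_out:
            findings.append(f"attribute {type_code}: value changed")
    return findings


sent = bytes([0xC0, 99, 2, 1, 2])        # optional transitive, hypothetical type 99
forwarded = bytes([0xE0, 99, 2, 1, 2])   # same attribute with the Partial bit set
print(compare(sent, forwarded, {99}))
```

If the output block does not parse at all, `parse_attributes` raises `ValueError`, which is itself the finding.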

There are already at least 2 projects which have BGP-code which could
probably be adapted:
http://code.google.com/p/exabgp/
http://code.google.com/p/bgpsimple/

Can I suggest a fuzzer as well?




Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
Those tools are not suitable for regression testing (I know, I wrote exabgp), 
not saying they could not be adapted though.

Fuzzing may return crashes or issues with the daemon but it is unlikely. You 
need predictable input for regression testing, and in our particular case how do 
you detect a corruption without knowing what the behaviour of the router should 
be on that particular input?

If it was that simple vendors would have done it
---
from my iPhone

On 28 Aug 2010, at 13:09, Leen Besselink l...@consolejunkie.net wrote:

 On 08/28/2010 11:39 AM, Saku Ytti wrote:
 On (2010-08-28 18:20 +0900), Randy Bush wrote:
 
 
 a bgp regression suite would not have caught this as it was not a
 repeat.  but it sure would be useful to implementors.
 
 Naturally 'proving' that non-trivial software works is practically
 impossible. But stating what non-existing test-suite would or would not
 have covered is not a topic I'm particularly interested to engage.
 
 
 
 I suggest the test-tool has 2 bgp-sessions and tests if what it put in
 did or did not come out on the otherside and in what shape or form.
 
 There are already atleast 2 projects which have BGP-code which could
 probably be adapted:
 http://code.google.com/p/exabgp/
 http://code.google.com/p/bgpsimple/
 
 Can I suggest a fuzzer as wel ?
 
 



Re: Did your BGP crash today?

2010-08-28 Thread Florian Weimer
* Randy Bush:

 imiho, researchers injecting data into the control plane are
 responsible to have tested it at least against major bgp speakers.

Practically, this boils down to "don't do that", which is certainly
fine by me.

To carry out such experiments responsibly, you have to conduct so much
testing beforehand that the live test on the actual Internet will not
yield new insights (assuming you did your pre-experiment testing
properly).



Re: Did your BGP crash today?

2010-08-28 Thread Florian Weimer
* Randy Bush:

 a bgp regression suite would not have caught this as it was not a
 repeat.

Eh, it was just another corrupt-and-propagate issue combined with the
broken (but RFC-required) session reset policy on malformed updates.



Re: Did your BGP crash today?

2010-08-28 Thread Saku Ytti
On (2010-08-28 13:23 +0200), Thomas Mangin wrote:

 Those tools are not suitable for regression testing ( I know I wrote exabgp ) 
 not saying they could not be adapted though.
 
 Fizzing may return crashes or issues with the daemon but it is unlikely. You 
 need predictable input for regression testing and in our particular case how 
 do you detect a corruption without knowing what the behaviour of the router 
 should be on that particular input.

It doesn't actually matter how likely or unlikely one expects such a tool to
be to find new issues. There is already value in that researchers, like RIPE
in this case, could simply write a new test case instead of needing to build
the whole infrastructure.

-- 
  ++ytti



Re: Did your BGP crash today?

2010-08-28 Thread Claudio Jeker
On Sat, Aug 28, 2010 at 04:56:05PM +0900, Randy Bush wrote:
 imiho, researchers injecting data into the control plane are responsible
 to have tested it at least against major bgp speakers.  and, considering
 its placement in the net (big core), i consider ios xr to be a major
 speaker.
 
 i suspect that these folk will test better next time.  i sure hope so.
 

I think you blame the wrong people. The vendor should make sure that their
implementation does not violate the very basics of the BGP protocol.
This bug in the IOS XR BGP implementation was a ticking time bomb and it
was just a matter of when it would blow up.

I suspect that Cisco will test better next time when they release a new
version of their software. I sure hope so.

-- 
:wq Claudio



Re: Did your BGP crash today?

2010-08-28 Thread Christian Martin
I think that focusing on researchers (who we assume are good-intentioned) 
misses the point.  Any connected BGP speaker can inject any form of ugliness.  
The routers that mishandled these updates were bounded by routers that were 
able to 'properly' handle corrupted updates. 

The question of aggressive teardown of BGP sessions after a speaker receives 
garbage has been well considered for a long time.  Stop the problem at the 
edges.  The only difference here is that the edge moved one hop closer to the 
core.

/c

Sent from my iPhone

On Aug 28, 2010, at 7:31 AM, Florian Weimer f...@deneb.enyo.de wrote:

 * Randy Bush:
 
 imiho, researchers injecting data into the control plane are
 responsible to have tested it at least against major bgp speakers.
 
 Practically, this boils down to don't do that, which is certainly
 fine by me.
 
 To carry out such experiments responsibly, you have to conduct so much
 testing beforehand that the live test on the actual Internet will not
 yield new insights (assuming you did your pre-experiment testing
 properly).
 



Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
My point was not about crafted bgp message to test border cases - this is what 
one would expect in a regression suite.
It is about the use of a fuzzer to corrupt packet when you then do not know if 
the router is then behaving correctly or not.

---
from my iPhone

On 28 Aug 2010, at 13:36, Saku Ytti s...@ytti.fi wrote:

 On (2010-08-28 13:23 +0200), Thomas Mangin wrote:
 
 Those tools are not suitable for regression testing ( I know I wrote exabgp 
 ) not saying they could not be adapted though.
 
 Fizzing may return crashes or issues with the daemon but it is unlikely. You 
 need predictable input for regression testing and in our particular case how 
 do you detect a corruption without knowing what the behaviour of the router 
 should be on that particular input.
 
 It doesn't actually matter how likely or unlikely one expect such tool to
 be finding new issues. There is already value, that researchers like RIPE
 in this case, could simply write new test case, instead of needing to build
 whole infrastructure.
 
 -- 
  ++ytti
 



Re: Did your BGP crash today?

2010-08-28 Thread Claudio Jeker
On Sat, Aug 28, 2010 at 01:09:47PM +0200, Leen Besselink wrote:
 On 08/28/2010 11:39 AM, Saku Ytti wrote:
  On (2010-08-28 18:20 +0900), Randy Bush wrote:
 

  a bgp regression suite would not have caught this as it was not a
  repeat.  but it sure would be useful to implementors.
  
  Naturally 'proving' that non-trivial software works is practically
  impossible. But stating what non-existing test-suite would or would not
  have covered is not a topic I'm particularly interested to engage.
 
 

 I suggest the test-tool has 2 bgp-sessions and tests if what it put in
 did or did not come out on the otherside and in what shape or form.
 
 There are already atleast 2 projects which have BGP-code which could
 probably be adapted:
 http://code.google.com/p/exabgp/
 http://code.google.com/p/bgpsimple/
 
 Can I suggest a fuzzer as wel ?
 
 

There was once cert-bgp-testcases-28may03-final.tar.gz which did some
testing (including expected responses). I use it from time to time.
From the README:
For more information see the NANOG 28 (http://www.nanog.org) presentation
...

BGP Vulnerability Testing: Separating Fact from FUD
by Sean Convery (s...@cisco.com) and Matthew Franz (mfr...@cisco.com)

But my quick googling failed to locate a link to it.
-- 
:wq Claudio



Re: Did your BGP crash today?

2010-08-28 Thread Randy Bush
 To carry out such experiments responsibly, you have to conduct so much
 testing beforehand that the live test on the actual Internet will not
 yield new insights (assuming you did your pre-experiment testing
 properly).

you seem to assume the purpose of the test was to see if routers
crashed.  i certainly think more highly of ripe lans folk than that.

randy



Re: Did your BGP crash today?

2010-08-28 Thread Leen Besselink
On 08/28/2010 01:52 PM, Thomas Mangin wrote:
 My point was not about crafted bgp message to test border cases - this is 
 what one would expect in a regression suite.
 It is about the use of a fuzzer to corrupt packet when you then do not know 
 if the router is then behaving correctly or not.

   

I wasn't saying you should use both at the same time, but I thought it
might be a good idea to add a fuzzer so that it could be run separately.
Any bug we can find before it is in production causing problems is
useful.

Although most code I've seen which deals with the BGP-protocol directly
seemed to be pretty simple/smart about it.

 ---
 from my iPhone

 On 28 Aug 2010, at 13:36, Saku Ytti s...@ytti.fi wrote:

   
 On (2010-08-28 13:23 +0200), Thomas Mangin wrote:

 
 Those tools are not suitable for regression testing ( I know I wrote exabgp 
 ) not saying they could not be adapted though.

 Fizzing may return crashes or issues with the daemon but it is unlikely. 
 You need predictable input for regression testing and in our particular 
 case how do you detect a corruption without knowing what the behaviour of 
 the router should be on that particular input.
   
 It doesn't actually matter how likely or unlikely one expect such tool to
 be finding new issues. There is already value, that researchers like RIPE
 in this case, could simply write new test case, instead of needing to build
 whole infrastructure.

 -- 
  ++ytti

 

   




Re: Did your BGP crash today?

2010-08-28 Thread Florian Weimer
* Claudio Jeker:

 I think you blame the wrong people. The vendor should make sure that
 their implementation does not violate the very basics of the BGP
 protocol.

The curious thing here is that the peer that resets the session, as
required by the spec, causes the actual damage (the session reset),
and not the peer producing the wrong update.

This whole thread is quite schizophrenic because the consensus appears
to be that (a) a *researcher is not to blame* for sending out a BGP
message which eventually leads to session resets, and (b) an
*implementor is to blame* for sending out BGP messages which
eventually lead to session resets.  You really can't have it both
ways.

I'm fed up with this situation, and we will fix it this time.  My take
is that if you reset the session, you're part of the problem, and
consequently deserve part of the blame.  So if you receive a
properly-framed BGP update message you cannot parse, you should just
log it, but not take down the session.



Re: Did your BGP crash today?

2010-08-28 Thread lorddoskias
 Am I the only one on the list who saw the sentence in Cisco's 
Advisory, "Before sending the unknown attribute to peers, the IOS XR 
corrupted it", which clearly states this was a bug?!




Re: Did your BGP crash today?

2010-08-28 Thread Raymond Dijkxhoorn

Hi!


I think you blame the wrong people. The vendor should make sure that
their implementation does not violate the very basics of the BGP
protocol.



The curious thing here is that the peer that resets the session, as
required by the spec, causes the actual damage (the session reset),
and not the peer producing the wrong update.

This whole thread is quite schizophrenic because the consensus appears
to be that (a) a *researcher is not to blame* for sending out a BGP
message which eventually leads to session resets, and (b) an
*implementor is to blame* for sending out a BGP messages which
eventually leads to session resets.  You really can't have it both
ways.

I'm fed up with this situation, and we will fix it this time.  My take
is that if you reset the session, you're part of the problem, and
consequently deserve part of the blame.  So if you receive a
properly-framed BGP update message you cannot parse, you should just
log it, but not take down the session.


Not sure if the link was posted already ...

http://www.cisco.com/en/US/products/products_security_advisory09186a0080b4411f.shtml

'The vulnerability manifests itself when a BGP peer announces a prefix 
with a specific, valid but unrecognized transitive attribute. On receipt 
of this prefix, the Cisco IOS XR device will corrupt the attribute before 
sending it to the neighboring devices. Neighboring devices that receive 
this corrupted update may reset the BGP peering session.'


Bye,
Raymond.



Re: Did your BGP crash today?

2010-08-28 Thread Florian Weimer
* Raymond Dijkxhoorn:

 Not sure if the link was posted allready ...

 http://www.cisco.com/en/US/products/products_security_advisory09186a0080b4411f.shtml

Cisco posts their advisories to the NANOG list.

 'The vulnerability manifests itself when a BGP peer announces a prefix
 with a specific, valid but unrecognized transitive attribute. On
 receipt of this prefix, the Cisco IOS XR device will corrupt the
 attribute before sending it to the neighboring devices. Neighboring
 devices that receive this corrupted update may reset the BGP peering
 session.'

I'm not sure what you intend to say by quoting this part of the
advisory.  If you think that it's an IOS XR bug which only needs
fixing in IOS XR, you're showing the very attitude which has stopped
us from making the network more resilient to these types of events.



Re: Did your BGP crash today?

2010-08-28 Thread Florian Weimer
* Randy Bush:

 To carry out such experiments responsibly, you have to conduct so much
 testing beforehand that the live test on the actual Internet will not
 yield new insights (assuming you did your pre-experiment testing
 properly).

 you seem to assume the purpose of the test was to see if routers
 crashed.  i certainly think mor ehighly of ripe lans folk than that.

We don't yet know precisely what the point of the experiment was.

But it is very likely that it intended to study propagation of such
updates.  Not propagating them is a protocol violation, so in order to
observe anything beyond propagation times, they would have to intend
to cause protocol violations, which is, in fact, awfully close to
session resets (thanks to the BGP protocol design).



Re: Did your BGP crash today?

2010-08-28 Thread Raymond Dijkxhoorn

Hi!


Cisco posts their advisories to the NANOG list.



'The vulnerability manifests itself when a BGP peer announces a prefix
with a specific, valid but unrecognized transitive attribute. On
receipt of this prefix, the Cisco IOS XR device will corrupt the
attribute before sending it to the neighboring devices. Neighboring
devices that receive this corrupted update may reset the BGP peering
session.'



I'm not sure what you intend to say by quoting this part of the
advisory.  If you think that it's an IOS XR bug which only needs
fixing in IOS XR, you're showing the very attitude which has stopped
us from making the network more resilient to these types of events.


It's more a workaround than a bugfix ...

Don't try to write down what I might think. I am perfectly capable of 
explaining this myself. The narrow-minded response you just gave tells 
more about you than about me. So much for the rant.


I think I have been around long enough that you would not even consider 
thinking that I would say 'hey this is an IOS XR BUG. It's not.' I didn't 
say this at all. Did I?


It affected a large part of the traffic on the internet, and obviously 
so: it took down a couple of the larger networks.


http://www.ams-ix.net/cgi-bin/stats/16all?log=totalall;png=daily

You can clearly see the drop there also.

Call it a 'fix', 'bugfix', 'workaround', whatever you want; 
I still think it's good they released it, and fast. A more structural 
approach is nice but won't help a lot of networks right now.


I am sorry i tried to add something to the thread. Think about this 
Florian. We are not the bad guys.


Bye,
Raymond.






Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
We had ASN4, AS-PATH and this one. More or less we hit this session reset 
problem once a year but nothing has been done yet to change the RFC.

So I am to blame as much as every network engineer for not having pushed for a 
change, or at least for a comprehensive explanation of why the session teardown 
behaviour is like it is and should not be changed.

It is only our fault for not having dealt with the problem correctly the first 
time, and it will be again next time if nothing is changed once more.

I agree a correctly framed invalid packet should be discarded without tearing the 
session down.
---
from my iPhone

On 28 Aug 2010, at 14:27, Florian Weimer f...@deneb.enyo.de wrote:

 * Raymond Dijkxhoorn:
 
 Not sure if the link was posted allready ...
 
 http://www.cisco.com/en/US/products/products_security_advisory09186a0080b4411f.shtml
 
 Cisco posts their advisories to the NANOG list.
 
 'The vulnerability manifests itself when a BGP peer announces a prefix
 with a specific, valid but unrecognized transitive attribute. On
 receipt of this prefix, the Cisco IOS XR device will corrupt the
 attribute before sending it to the neighboring devices. Neighboring
 devices that receive this corrupted update may reset the BGP peering
 session.'
 
 I'm not sure what you intend to say by quoting this part of the
 advisory.  If you think that it's an IOS XR bug which only needs
 fixing in IOS XR, you're showing the very attitude which has stopped
 us from making the network more resilient to these types of events.
 



Re: Did your BGP crash today?

2010-08-28 Thread Thomas Mangin
 I agree correctly framed invalid packet should be discarded without tearing 
 the session down.

This statement is way too simplistic.
I would be interested if anyone has pointers toward any work which was done to 
sort this issue.

Thanks.

Thomas




Re: Did your BGP crash today?

2010-08-28 Thread Claudio Jeker
On Sat, Aug 28, 2010 at 02:51:17PM +0200, Thomas Mangin wrote:
 We had ASN4, AS-PATH and this one. More or less we hit this session
 reset problem once a year but nothing was done yet to change the RFC.
 
You are mixing up three totally different problems. Sure, the result was the
same (session drops). This time an IOS XR device was corrupting an
attribute before sending it out. The corruption had to be in the header
section of the attribute or the other side would not have detected it
(since the neighbor did not know about this attribute either). Now if a
system sends out corrupted BGP messages there is no way out, you need to
close the session because not doing so may result in much bigger mayhem.
It was not mentioned what the corruption actually was: was the length
wrong, or was the optional flag missing (making the attribute well known)? 

Unlike in the ASN4 issue this time the session to the faulty system was
dropped and by doing so stopped further issues.

 So I am to blame as much as every network engineer to not have pushed
 for a change or at least a comprehensive explanation on the session
 teardown behaviour is like it is and should not be changed.
 
 It is only our fault for not having dealt with the problem the first
 time correctly, and will be next time if nothing is changed once more.
 
 I agree correctly framed invalid packet should be discarded without
 tearing the session down.

Great, corrupting your RIB and FIB and every one of your peers' RIBs. Thanks a
lot for routing loops and wrong announcements. The only things you can drop
without causing trouble are (transitive) optional attributes. This is
covered by draft-ietf-idr-optional-transitive and hopefully it will be
adopted as an RFC and implemented by vendors.
If a well-known attribute like AS-PATH is corrupted then there is no
choice, the session needs to be reset. Which is bad when the AS-PATH
validation code has a bug.

-- 
:wq Claudio



Re: Did your BGP crash today?

2010-08-28 Thread Brett Frankenberger
On Sat, Aug 28, 2010 at 02:19:28PM +0200, Florian Weimer wrote:
 * Claudio Jeker:
 
  I think you blame the wrong people. The vendor should make sure that
  their implementation does not violate the very basics of the BGP
  protocol.
 
 The curious thing here is that the peer that resets the session, as
 required by the spec, causes the actual damage (the session reset),
 and not the peer producing the wrong update.
 
 This whole thread is quite schizophrenic because the consensus appears
 to be that (a) a *researcher is not to blame* for sending out a BGP
 message which eventually leads to session resets, and (b) an
 *implementor is to blame* for sending out a BGP messages which
 eventually leads to session resets.  You really can't have it both
 ways.

The researcher is not to blame because all the BGP messages he sent out
were properly formed.

The implementor is to blame because the code he wrote sent out BGP
messages which were not properly formed.

 I'm fed up with this situation, and we will fix it this time.  My take
 is that if you reset the session, you're part of the problem, and
 consequently deserve part of the blame.  So if you receive a
 properly-framed BGP update message you cannot parse, you should just
 log it, but not take down the session.

If you get your wish, and that gets implemented, in some number of years
there will be a NANOG posting (perhaps from you, perhaps not) arguing
that any malformed BGP message should result in the session being torn
down.  This will be after a router develops a failure that causes it to
send many incorrect messages, but only some of them malformed.  So the
malformed ones will be discarded, the remainder will be propagated
throughout the Internet.  If the ones that are incorrect but not
malformed are, say, filled with more specifics for large portions of
the Internet, someone will be asking: How could all the other routers
accept these advertisements from a router known to be broken ... it was
sending malformed advertisements, but instead of tearing down the
sessions, you decided to trust all the validly formed messages from
this known-to-be-broken router.

My point is:  we can't always look at the most recent failure to decide
what the correct policy is.  We have good data on the cases where
NOTIFY on any malformed packet has caused significant outages in the
Internet.  We don't have nearly as good data on the cases where
NOTIFY-on-any-malformed-packet saved the Internet from a significant
outage.

I don't claim to know which is the bigger problem.  But any serious
argument to change the behavior needs to consider the risk from
propagating information received from a router known to be broken, on
the theory that the brokenness only causes malformed messages (which
can be discarded) and does not also cause incorrect but correctly
formed messages to be sent.

 -- Brett



Re: Did your BGP crash today?

2010-08-28 Thread James Hess
On Fri, Aug 27, 2010 at 2:33 PM, Dave Israel da...@otd.com wrote:
 On 8/27/2010 3:22 PM, Jared Mauch wrote:
[snip]
 an MD5 hash that can be added to the packet.  If the TCP hash checks

Hello, layering violation. If the TCP MD5 option was used, the
MD5 checksum was probably correct.
These were malformed BGP protocol messages, not malformed TCP messages.

The BGP protocol runs on top of TCP as a non-packetized stream.
Dropping the IP packets would just mean that the IP packets
containing the malformed BGP data need to get resent (still
containing malformed BGP application protocol data when resent).

 out, then you know the packet wasn't garbled, and just contained
 information you didn't grok.  That seems like enough evidence to be able
 to shrug and toss the packet without dropping the session.

If the attribute is malformed, and in particular if the _length_ value
is malformed because more bits have been transmitted as part of an
update than indicated in the length, how do you reliably determine
exactly where the junk ends and the next attribute starts, and resume
the stream without loss of other critical data?

Without a valid length of the attribute, you don't know which bit the
next attribute starts at, or which bit the next update starts at.

If the apparent length of the update is wrong, the rest of your
session appears to be malformed.

If your guess is wrong, you could wind up interpreting part of the
attribute DATA portion as another route update, allowing an adversary
to exploit the malformed-message bug to possibly inject new routes.

A recovery  mechanism could lead to worse problems, or lead to
problems not being discovered.
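
The framing problem described above can be sketched with a small attribute walker. This is an illustrative parser under my own assumptions (the name walk_attributes is hypothetical), not the code of any router; it only shows that each attribute boundary is derived from the previous attribute's length field.

```python
import struct

EXTENDED_LENGTH = 0x10  # flag bit: the length field is two octets instead of one


def walk_attributes(buf: bytes):
    """Yield (flags, type_code, value) for each attribute in a path-attribute blob."""
    pos = 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated attribute header")
        flags, type_code = buf[pos], buf[pos + 1]
        pos += 2
        width = 2 if flags & EXTENDED_LENGTH else 1
        if pos + width > len(buf):
            raise ValueError("truncated attribute length")
        if width == 2:
            (length,) = struct.unpack_from("!H", buf, pos)
        else:
            length = buf[pos]
        pos += width
        if pos + length > len(buf):
            raise ValueError("attribute length runs past the end of the attributes")
        # The next header is assumed to start right after `length` octets.
        yield flags, type_code, buf[pos:pos + length]
        pos += length
```

If an attribute of type 99 claims length 0 but is followed by the octets 40 01 01 02, the walker reports an empty attribute 99 and then a perfectly valid-looking ORIGIN attribute that the sender never intended as one. With other stray data it raises an error instead; either way the receiver cannot tell where the real next attribute begins.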

 -Dave
-- 
-J



Re: Did your BGP crash today?

2010-08-28 Thread Christopher Morrow
On Sat, Aug 28, 2010 at 6:14 AM, Florian Weimer f...@deneb.enyo.de wrote:
 * Christopher Morrow:

 (you are asking your vendors to run full bit sweeps of each protocol
 in a regimented manner checking for all possible edge cases and
 properly handling them, right?)

 The real issue is that both spec and current practice say you need to
 drop the session as soon as you encounter any unexpected data.  That's

sorry, I conflated two things... or didn't mean to but did anyway.

1) users of gear that does BGP really need to ask loudly and longly
(and then go test for themselves) that their BGP speakers do the
'right thing' when faced with oddball scenarios. If someone sends you
a previously unknown attribute... don't corrupt it and pass it on,
pass if transitive, drop if not.

2) some thought and writing and code-changes need to go into how the
bgp-speakers of the world deal with bad-behaving bgp speakers. Is
'send notify and reset' the right answer? is there one 'right answer'
? Should some classes of fugly exchange end with a 'dropped that
update, moved along' and some end with 'pull eject handle!' ?

it's doubtful that 2 can get solved here (nanog), though certainly some
operational thought on the right thing would be great as guidance. i
would hope that 1 can get some traction here, via folks going back to
their vendors and asking: Did you run the Mu-security/Oulu-univ/etc
fuzzing test suites against this code? can I see the results? I hope
they match the results I'm going to be getting from my folks in
~2wks... or we'll be having a much more structured/loud
conversation...

another poster had a great point about 'all the world can screw with
you, you have no protections other than trust that the next guy won't
screw you over (inadvertently)'. There are no protections available to
you if someone sets (example) bit 77 in an ipv4 update message to 1
when it should by all accounts be 0. Or (apparently) if they send a
previously unknown attribute on a route :( You can put in max-prefix
limits, as-path limits (length and content), prefix-filters.. but
internal-message-content you are stuck hoping the vendors all followed
the same playbook. With everyone saying together: "Please
appropriately test your implementation for all boundary cases" maybe
we can get to where these happen less often (or nearly never) - every
3 months is a little tedious.

-chris



Re: Did your BGP crash today?

2010-08-28 Thread Deepak Jain
On BB, so top posting. Apologies.

It seems that creating a worst case BGP test suite for all kinds of nastiness 
(in light of the recent RIPE thing) might not be a bad idea - so that we can 
all test the implementation ourselves before we deploy new code.

Like all funky attributes, all funky AS SETs... With knobs for 1 to mem exhaust 
(for long data sets, etc). 

Unless BGP is massively more complicated than I remember, it's not a very 
advanced CS grad project.
I'm thinking a quagga or perl BGP talker would be a good place to start.
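
As a rough idea of what the message-building part of such a test suite could look like, here is a sketch that constructs UPDATE messages carrying an unknown optional transitive attribute of varying size. Everything here is an assumption for illustration: the helper names, the private AS 64512, the documentation addresses, and the chosen sizes. It builds bytes only; session setup (OPEN, KEEPALIVE) and actually sending the messages are left to whatever BGP talker you use.

```python
import struct

MARKER = b"\xff" * 16    # 16 octets of all ones at the start of every message
UPDATE = 2               # message type code for UPDATE
MAX_MESSAGE = 4096       # largest message allowed by RFC 4271
EXTENDED_LENGTH = 0x10   # attribute flag: two-octet length field


def attribute(flags: int, type_code: int, value: bytes) -> bytes:
    """Encode one path attribute, using a two-octet length when needed."""
    if len(value) > 255 or flags & EXTENDED_LENGTH:
        header = struct.pack("!BBH", flags | EXTENDED_LENGTH, type_code, len(value))
    else:
        header = struct.pack("!BBB", flags, type_code, len(value))
    return header + value


def nlri(prefix: str) -> bytes:
    """Encode an IPv4 prefix as its bit length plus the minimal number of octets."""
    address, bits = prefix.split("/")
    octets = bytes(int(part) for part in address.split("."))
    return bytes([int(bits)]) + octets[: (int(bits) + 7) // 8]


def update(attributes: bytes, prefixes: bytes) -> bytes:
    """Wrap attributes and NLRI in an UPDATE with no withdrawn routes."""
    body = struct.pack("!HH", 0, len(attributes)) + attributes + prefixes
    total = 19 + len(body)  # 19 octets of marker, length and type
    if total > MAX_MESSAGE:
        raise ValueError("message would exceed 4096 octets")
    return MARKER + struct.pack("!HB", total, UPDATE) + body


# Mandatory well-known attributes with placeholder values (assumptions).
MANDATORY = (
    attribute(0x40, 1, b"\x00")                             # ORIGIN = IGP
    + attribute(0x40, 2, struct.pack("!BBH", 2, 1, 64512))  # AS_PATH: one 2-octet AS
    + attribute(0x40, 3, bytes([192, 0, 2, 1]))             # NEXT_HOP
)


def unknown_attribute_cases(type_code: int = 99, sizes=(0, 1, 255, 256, 3000)):
    """Yield (size, message) pairs, one per attribute size to try."""
    for size in sizes:
        unknown = attribute(0xC0, type_code, bytes(size))  # optional transitive
        yield size, update(MANDATORY + unknown, nlri("198.51.100.0/24"))
```

The sizes tuple is the "knob": 255 and 256 straddle the point where the extended length flag becomes necessary, and anything that would push the message past 4096 octets is rejected by update() rather than sent.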

Deepak

- Original Message -
From: Christopher Morrow morrowc.li...@gmail.com
To: Florian Weimer f...@deneb.enyo.de
Cc: nanog@nanog.org nanog@nanog.org
Sent: Sun Aug 29 01:12:00 2010
Subject: Re: Did your BGP crash today?

[snip]



Did your BGP crash today?

2010-08-27 Thread Kasper Adel
Haven't seen a thread on this one so thought I'd start one.

RIPE tested a new attribute that crashed the internet, is that true?


Kim


Re: Did your BGP crash today?

2010-08-27 Thread Jared Mauch
I did see some attribute 99 stuff go around earlier today and have not yet 
researched it.

Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
Unknown BGP attribute 99 (flags: 240)
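
For anyone reading the log lines above: assuming the logged value is the attribute flags octet in decimal, it can be decoded bit by bit. The sketch below uses the flag bit positions from RFC 4271; the function name is hypothetical.

```python
# Flag bits of the BGP path attribute header, high-order bit first.
FLAG_NAMES = (
    (0x80, "optional"),
    (0x40, "transitive"),
    (0x20, "partial"),
    (0x10, "extended-length"),
)


def decode_flags(flags: int) -> list:
    """Return the names of the flag bits that are set in the flags octet."""
    return [name for bit, name in FLAG_NAMES if flags & bit]


print(decode_flags(240))
# ['optional', 'transitive', 'partial', 'extended-length']
```

So 240 (0xF0) would be an optional transitive attribute with a two-octet length, and with the Partial bit set, which is what a router sets when it passes on an optional transitive attribute it did not recognize.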

- Jared

On Aug 27, 2010, at 1:27 PM, Kasper Adel wrote:

 Havent seen a thread on this one so thought i'd start one.
 
 Ripe tested a new attribute that crashed the internet, is that true?
 
 
 Kim




Re: Did your BGP crash today?

2010-08-27 Thread Valdis . Kletnieks
On Fri, 27 Aug 2010 19:27:06 +0200, Kasper Adel said:
 Havent seen a thread on this one so thought i'd start one.
 
 Ripe tested a new attribute that crashed the internet, is that true?

If it in fact crashed the internet, as opposed to gave a few buggy routers
here and there indigestion, you wouldn't be posting to NANOG looking for
confirmation. :)




re: Did your BGP crash today?

2010-08-27 Thread Nick Olsen
No downtime here. It would have been all over the news and everything if it 
really did crash the internet.

Nick Olsen
Network Operations
(321) 205-1100 x106



From: Kasper Adel karim.a...@gmail.com
Sent: Friday, August 27, 2010 1:27 PM
To: NANOG list nanog@nanog.org
Subject: Did your BGP crash today?

Havent seen a thread on this one so thought i'd start one.

Ripe tested a new attribute that crashed the internet, is that true?

Kim



Re: Did your BGP crash today?

2010-08-27 Thread Nick Olsen
Well played, Sir.

Nick Olsen
Network Operations
(321) 205-1100 x106



From: valdis.kletni...@vt.edu
Sent: Friday, August 27, 2010 1:32 PM
To: Kasper Adel karim.a...@gmail.com
Subject: Re: Did your BGP crash today?

On Fri, 27 Aug 2010 19:27:06 +0200, Kasper Adel said:
 Havent seen a thread on this one so thought i'd start one.
 
 Ripe tested a new attribute that crashed the internet, is that true?

If it in fact crashed the internet, as opposed to gave a few buggy 
routers
here and there indigestion, you wouldn't be posting to NANOG looking for
confirmation. :)




Re: Did your BGP crash today?

2010-08-27 Thread Thomas Mangin
Looking at the graph of at least one of the European exchanges where RIS 
connects, it had an impact. Now saying it was nothing is like saying that the 
YouTube incident was nothing because you were not affected as you do not use YouTube.

Some people did feel the pain - lucky it was not you :)

Thomas
---
from my iPhone

On 27 Aug 2010, at 19:31, Nick Olsen n...@brevardwireless.com wrote:

 No down time here, Would have been all over the news and everything if it 
 really do crash the internet.
 
 Nick Olsen
 Network Operations
 (321) 205-1100 x106
 
 
 
 From: Kasper Adel karim.a...@gmail.com
 Sent: Friday, August 27, 2010 1:27 PM
 To: NANOG list nanog@nanog.org
 Subject: Did your BGP crash today?
 
 Havent seen a thread on this one so thought i'd start one.
 
 Ripe tested a new attribute that crashed the internet, is that true?
 
 Kim
 



RE: Did your BGP crash today?

2010-08-27 Thread Blake Pfankuch
Ignoring the fact that the original poster has a thing for the dramatic: of 
those who did feel minor pain from this, what hardware platforms were affected 
and what software versions? Just for curiosity's sake.

-Original Message-
From: Thomas Mangin [mailto:thomas.man...@exa-networks.co.uk] 
Sent: Friday, August 27, 2010 11:44 AM
To: n...@brevardwireless.com
Cc: nanog@nanog.org
Subject: Re: Did your BGP crash today?

Looking at the graph of at least one of the european exchange where RIS 
connect, it had an impact. Now saying it was nothing is like saying that the 
YouTube incident was nothing as you were not affected as you do not use YouTube.

Some people did feel the pain - lucky it was not you :)

Thomas
---
from my iPhone

On 27 Aug 2010, at 19:31, Nick Olsen n...@brevardwireless.com wrote:

 No down time here, Would have been all over the news and everything if 
 it really do crash the internet.
 
 Nick Olsen
 Network Operations
 (321) 205-1100 x106
 
 
 
 From: Kasper Adel karim.a...@gmail.com
 Sent: Friday, August 27, 2010 1:27 PM
 To: NANOG list nanog@nanog.org
 Subject: Did your BGP crash today?
 
 Havent seen a thread on this one so thought i'd start one.
 
 Ripe tested a new attribute that crashed the internet, is that true?
 
 Kim
 




Re: Did your BGP crash today?

2010-08-27 Thread Grzegorz Janoszka

On 27-08-10 19:31, valdis.kletni...@vt.edu wrote:

On Fri, 27 Aug 2010 19:27:06 +0200, Kasper Adel said:

Havent seen a thread on this one so thought i'd start one.

Ripe tested a new attribute that crashed the internet, is that true?


If it in fact crashed the internet, as opposed to gave a few buggy routers
here and there indigestion, you wouldn't be posting to NANOG looking for
confirmation. :)


https://www.ams-ix.net/statistics/

Not the whole internet, but a part. And the "few buggy routers here and 
there" were mostly Cisco CRS-1's which didn't understand the new 
attribute and sent a malformed message to all peers, causing them to 
close the BGP session.


I think most of the impact was limited to Europe, especially Amsterdam area.

--
Grzegorz Janoszka



Re: Did your BGP crash today?

2010-08-27 Thread Thomas Mangin
On 27 Aug 2010, at 19:27, Grzegorz Janoszka wrote:

 On 27-08-10 19:31, valdis.kletni...@vt.edu wrote:
 On Fri, 27 Aug 2010 19:27:06 +0200, Kasper Adel said:
 Havent seen a thread on this one so thought i'd start one.
 
 Ripe tested a new attribute that crashed the internet, is that true?
 
 If it in fact crashed the internet, as opposed to gave a few buggy routers
 here and there indigestion, you wouldn't be posting to NANOG looking for
 confirmation. :)
 
 https://www.ams-ix.net/statistics/
 
 Not whole internet, but a part. And the few buggy routers here and there 
 were mostly Cisco CRS-1's which didn't understand the new attribute and sent 
 a malformed message to all peers, causing them to close the BGP session.

In a way it reminds me of the ASN4 bug. Until a vendor fix is available I 
guess that the details are better left off public mailing lists.
http://www.uknof.org.uk/uknof12/Davidson-4_byte_asn.pdf

 I think most of the impact was limited to Europe, especially Amsterdam area.

Yes, it had an effect on ISPs which are connected to RIS. 
http://www.ripe.net/ris/
AFAIK this means ASes at LINX and AMS-IX. The LINX graph shows a similar (but 
smaller) dip of 50-60 GB.

Thomas


Re: Did your BGP crash today?

2010-08-27 Thread Lucy Lynch

FYI:

--
Dear Colleagues,

On Friday 27 August, from 08:41 to 09:08 UTC, the RIPE NCC Routing
Information Service (RIS) announced a route with an experimental BGP
attribute. During this announcement, some Internet Service Providers
reported problems with their networking infrastructure.

Investigation
--

Immediately after discovering this, we stopped the announcement and
started investigating the problem. Our investigation has shown that the
problem was likely to have been caused by certain router types
incorrectly modifying the experimental attribute and then further
announcing the malformed route to their peers. The announcements sent
out by the RIS were correct and complied with all standards.

The experimental attribute was part of an experiment conducted in
collaboration with a group from Duke University. This involved
announcing a large (3000 bytes) optional transitive attribute, using a
modified version of Quagga. The attribute used type code 99. The data
consisted of zeros. We used the prefix 93.175.144.0/24 for this and
announced from AS 12654 on AMS-IX, NL-IX and GN-IX to all our peers.

Reports from affected ISPs showed that the length of the attribute in
the attribute header, as seen by their routers, was not correct. The
header stated 233 bytes and the actual data in their samples was 237
bytes. This caused some routers to drop the session with the peer that
announced the route.

We have built a test set-up which is running identical software and
configurations to the live set-up. From this set-up, and the BGP packet
dumps as made by the RIS, we have determined that the length of the data
in the attribute as sent out by the RIS was indeed 3000 bytes and that
all lengths recorded in the headers of the BGP updates were correct.

Beyond the RIS systems, we can only do limited diagnosis. One possible
explanation is that the affected routers did not correctly use the
extended length flag on the attribute. This flag is set when the length
of the attribute exceeds 255 bytes i.e. when two octets are needed to
store the length.

It may be that the routers do not add the higher octet of the length to
the total length, which would lead, in our test set-up, to a total
packet length of 236 bytes. If, in addition, the routers also
incorrectly trim the attribute length, the problem could occur as
observed. It is worth noting that the difference between the reported
233 and 237 bytes is the size of the flags, type code and length in the
attribute.
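
The extended length mechanism mentioned above can be sketched as follows. This is only an illustration of the header encoding as I read RFC 4271, with a hypothetical helper name; it does not attempt to reproduce the 233, 236 or 237 byte figures, which depend on details of the affected routers that are not known here.

```python
import struct

EXTENDED_LENGTH = 0x10  # flag bit: the length field occupies two octets


def attribute_header(flags: int, type_code: int, length: int) -> bytes:
    """Encode flags, type code and length, switching to two octets above 255."""
    if length > 255:
        return struct.pack("!BBH", flags | EXTENDED_LENGTH, type_code, length)
    return struct.pack("!BBB", flags & ~EXTENDED_LENGTH & 0xFF, type_code, length)


# Optional transitive (0xC0) attribute, type code 99, 3000 bytes of data.
header = attribute_header(0xC0, 99, 3000)
high_octet, low_octet = header[2], header[3]

print(header.hex())              # d0630bb8
print(len(header))               # 4 octets: flags, type code, two length octets
print(high_octet, low_octet)     # 11 184, since 3000 == 11 * 256 + 184
```

A router that kept only the low octet would believe the attribute is 184 bytes long instead of 3000. The 4-octet header size matches the size of the flags, type code and length noted above.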

We will be further investigating this problem and will report any
findings. We regret any inconvenience caused.

Kind regards,

Erik Romijn

Information Services
RIPE NCC
___
tech-l mailing list
tec...@ams-ix.net
http://melix.ams-ix.net/mailman/listinfo/tech-l



- Lucy

On Fri, 27 Aug 2010, Grzegorz Janoszka wrote:


On 27-08-10 19:31, valdis.kletni...@vt.edu wrote:

On Fri, 27 Aug 2010 19:27:06 +0200, Kasper Adel said:

Havent seen a thread on this one so thought i'd start one.

Ripe tested a new attribute that crashed the internet, is that true?


If it in fact crashed the internet, as opposed to gave a few buggy 
routers

here and there indigestion, you wouldn't be posting to NANOG looking for
confirmation. :)


https://www.ams-ix.net/statistics/

Not whole internet, but a part. And the few buggy routers here and there 
were mostly Cisco CRS-1's which didn't understand the new attribute and sent 
a malformed message to all peers, causing them to close the BGP session.


I think most of the impact was limited to Europe, especially Amsterdam area.






Re: Did your BGP crash today?

2010-08-27 Thread Thomas Mangin
So much for "better left off public mailing lists"! Sigh!

Thomas

On 27 Aug 2010, at 19:42, Lucy Lynch wrote:

 FYI:
 
 [snip]
 
 
 




Re: Did your BGP crash today?

2010-08-27 Thread Lucy Lynch

sorry - found via google...

- Lucy

On Fri, 27 Aug 2010, Thomas Mangin wrote:


So much for better left off public mailing lists ! sigh !

Thomas

On 27 Aug 2010, at 19:42, Lucy Lynch wrote:


FYI:

[snip]










Re: Did your BGP crash today?

2010-08-27 Thread Grzegorz Janoszka

On 27-08-10 20:41, Thomas Mangin wrote:

I think most of the impact was limited to Europe, especially Amsterdam area.

Yes, It had an effect on ISPs which are connected to RIS. 
http://www.ripe.net/ris/
AFAIK this mean ASes at LINX and AMS-IX . The LINX graph shows a similar (but 
smaller) dip of 50-60 GB.


Not only. We don't peer with RIS, but about 8-10 of our peers announce 
RIS to us. We got the nasty update from a completely different AS, not RIS.


You may just check whether you see AS12654 - it is RIS.

--
Grzegorz Janoszka



Re: Did your BGP crash today?

2010-08-27 Thread Thomas Mangin
On 27 Aug 2010, at 20:03, Grzegorz Janoszka wrote:

 On 27-08-10 20:41, Thomas Mangin wrote:
 I think most of the impact was limited to Europe, especially Amsterdam area.
 Yes, It had an effect on ISPs which are connected to RIS. 
 http://www.ripe.net/ris/
 AFAIK this mean ASes at LINX and AMS-IX . The LINX graph shows a similar 
 (but smaller) dip of 50-60 GB.
 
 Not only. We don't peer with RIS, but about 8-10 our peers announce to us 
 RIS. The nasty update we got from completely different AS, not RIS.
 You may just check whether you see AS12654 - it is RIS.

Yes, the BGP message had a transitive attribute - sorry if I was not clear.
That said, you may want to ask why you are getting RIS routes if you are not 
peering with them directly :p

RIS is peering worldwide ( http://www.ripe.net/ris/docs/peering.html ) but the 
mail was only sent to linx-ops and tech-l, so the announcement may have been 
limited to Europe (for all I know).

Thomas




Re: Did your BGP crash today?

2010-08-27 Thread Richard A Steenbergen
On Fri, Aug 27, 2010 at 01:29:15PM -0400, Jared Mauch wrote:
 
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)

Just out of curiosity, at what point will we as operators rise up 
against the ivory tower protocol designers at the IETF and demand that 
they add a mechanism to not bring down the entire BGP session because of 
a single malformed attribute? Did I miss the memo about the meeting? 
I'll bring the punch and pie.

-- 
Richard A Steenbergen r...@e-gerbil.net   http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)



Re: Did your BGP crash today?

2010-08-27 Thread Jeroen Massar
On 2010-08-27 21:13, Richard A Steenbergen wrote:
 On Fri, Aug 27, 2010 at 01:29:15PM -0400, Jared Mauch wrote:

 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 
 Just out of curiosity, at what point will we as operators rise up 
 against the ivory tower protocol designers at the IETF and demand that 
 they add a mechanism to not bring down the entire BGP session because of 
 a single malformed attribute? Did I miss the memo about the meeting? 
 I'll bring the punch and pie.

Complain to your vendor; C & J especially have good enough
influence on the IETF to make such a change possible.


I can agree with tearing the session down when one encounters an
improperly formatted message, but an unknown attribute, while the rest
of the format of message is fine, is a silly thing to hang up on indeed.

Greets,
 Jeroen




Re: Did your BGP crash today?

2010-08-27 Thread Jared Mauch

On Aug 27, 2010, at 3:13 PM, Richard A Steenbergen wrote:

 On Fri, Aug 27, 2010 at 01:29:15PM -0400, Jared Mauch wrote:
 
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 
 Just out of curiosity, at what point will we as operators rise up 
 against the ivory tower protocol designers at the IETF and demand that 
 they add a mechanism to not bring down the entire BGP session because of 
 a single malformed attribute? Did I miss the memo about the meeting? 
 I'll bring the punch and pie.

I think it's actually an implementation problem where it got out-of-sync.

You can't exactly blame the IETF for a vendor having poor code quality.

(at least not in this case IMHO).

I seem to recall there was something like this in the past that caused
some significant problems with people also running XR/CRS-1.  They quickly
got a fix and cisco issued a PSIRT as a result:

http://www.cisco.com/en/US/products/products_security_advisory09186a0080af150f.shtml#summary

I would hope these people updated their software for that impact as well.

Without knowing what the defect impact was on those devices, and without
talking to PSIRT today, I don't know if an advisory is pending.  Perhaps
it's a new defect and the bug is going to be triggered again soon for
those that don't patch their devices.

- jared


Re: Did your BGP crash today?

2010-08-27 Thread Jared Mauch

On Aug 27, 2010, at 3:17 PM, Jeroen Massar wrote:

 On 2010-08-27 21:13, Richard A Steenbergen wrote:
 On Fri, Aug 27, 2010 at 01:29:15PM -0400, Jared Mauch wrote:
 
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 Unknown BGP attribute 99 (flags: 240)
 
 Just out of curiosity, at what point will we as operators rise up 
 against the ivory tower protocol designers at the IETF and demand that 
 they add a mechanism to not bring down the entire BGP session because of 
 a single malformed attribute? Did I miss the memo about the meeting? 
 I'll bring the punch and pie.
 
 Complain to your vendor; C & J especially have enough influence on the
 IETF to make such a change possible.
 
 
 I can agree with tearing the session down when one encounters an
 improperly formatted message, but an unknown attribute, while the rest
 of the format of message is fine, is a silly thing to hang up on indeed.

When you are processing something, it's sometimes hard to tell if something
just was mis-parsed (as I think the case is here with the missing-2-bytes)
vs just getting garbage.  Perhaps there should be some way to re-sync when
you are having this problem, or a parallel keepalive path similar to
MACA/MCAS/MIDCAS/TCAS between the devices to talk when something bad is
happening.
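
To make the mis-parse concern concrete, here is a minimal Python sketch. It is an illustration only, not the actual vendor defect, whose details are not known here. It assumes the RFC 4271 path attribute layout (flags byte, type byte, then a 1-byte length, or a 2-byte length when the Extended Length flag 0x10 is set). The `honor_extended` switch and the sample bytes are made up for the demonstration.

```python
import struct

EXTENDED_LENGTH = 0x10  # flag bit: the attribute length field is 2 bytes, not 1


def parse_attributes(data, honor_extended=True):
    """Walk a path-attribute block and return (type_code, value) pairs.

    honor_extended=False simulates a hypothetical parser that always
    reads a 1-byte length, even when the Extended Length bit is set.
    """
    attrs = []
    pos = 0
    while pos < len(data):
        if len(data) - pos < 3:
            raise ValueError("truncated attribute header at offset %d" % pos)
        flags, type_code = data[pos], data[pos + 1]
        pos += 2
        if honor_extended and flags & EXTENDED_LENGTH:
            if len(data) - pos < 2:
                raise ValueError("truncated extended length at offset %d" % pos)
            (length,) = struct.unpack_from("!H", data, pos)
            pos += 2
        else:
            length = data[pos]
            pos += 1
        if pos + length > len(data):
            raise ValueError("attribute type %d overruns the block" % type_code)
        attrs.append((type_code, data[pos:pos + length]))
        pos += length
    return attrs


# Made-up sample: ORIGIN, then attribute 99 with flags 240 (0xF0), then MED.
block = (bytes([0x40, 1, 1, 0])
         + bytes([0xF0, 99, 0x00, 0x03, 1, 2, 3])
         + bytes([0x80, 4, 4, 0, 0, 0, 10]))

print(parse_attributes(block))
try:
    parse_attributes(block, honor_extended=False)
except ValueError as err:
    # The one-byte misread shifts every later boundary until nothing fits.
    print("lost sync:", err)
```

With the length read correctly, all three attributes come out cleanly. With the one-byte misread, the parser invents attributes out of value bytes and only notices several bytes later, which is why the receiver cannot easily tell a mis-parse from garbage.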

- Jared



Re: Did your BGP crash today?

2010-08-27 Thread Dave Israel

On 8/27/2010 3:22 PM, Jared Mauch wrote:
 When you are processing something, it's sometimes hard to tell if something
 just was mis-parsed (as I think the case is here with the missing-2-bytes)
 vs just getting garbage.  Perhaps there should be some way to re-sync when
 you are having this problem, or a parallel keepalive path similar to
 MACA/MCAS/MIDCAS/TCAS between the devices to talk when something bad is
 happening.

I know it wasn't there originally, and isn't mandatory now, but there is
an MD5 hash that can be added to the packet.  If the TCP hash checks
out, then you know the packet wasn't garbled, and just contained
information you didn't grok.  That seems like enough evidence to be able
to shrug and toss the packet without dropping the session.

-Dave





Re: Did your BGP crash today?

2010-08-27 Thread Mike Gatti
where's the change management process in all of this?
basically now we are going to start changing things that can
potentially have an adverse effect on users without letting anyone know
beforehand? Interesting concept.

On Aug 27, 2010, at 3:33 PM, Dave Israel wrote:

 
 On 8/27/2010 3:22 PM, Jared Mauch wrote:
 When you are processing something, it's sometimes hard to tell if something
 just was mis-parsed (as I think the case is here with the missing-2-bytes)
 vs just getting garbage.  Perhaps there should be some way to re-sync when
 you are having this problem, or a parallel keepalive path similar to
 MACA/MCAS/MIDCAS/TCAS between the devices to talk when something bad is
 happening.
 
 I know it wasn't there originally, and isn't mandatory now, but there is
 an MD5 hash that can be added to the packet.  If the TCP hash checks
 out, then you know the packet wasn't garbled, and just contained
 information you didn't grok.  That seems like enough evidence to be able
 to shrug and toss the packet without dropping the session.
 
 -Dave
 
 
 

=+=+=+=+=+=+=+=+=+=+=+=+=
Mike Gatti  
ekim.it...@gmail.com
=+=+=+=+=+=+=+=+=+=+=+=+=






Re: Did your BGP crash today?

2010-08-27 Thread Christopher Morrow
On Fri, Aug 27, 2010 at 4:07 PM, Mike Gatti ekim.it...@gmail.com wrote:
 where's the change management process in all of this?
 basically now we are going to start changing things that can
 potentially have an adverse effect on users without letting anyone know
 beforehand? Interesting concept.

you are running bgp, you are connected to the 'internet'... congrats
you are part of the experiment.

I suppose one view is that at least it wasn't someone with ill
intent, or a misconfigured MikroTik!

(you are asking your vendors to run full bit sweeps of each protocol
in a regimented manner checking for all possible edge cases and
properly handling them, right?)

-chris









Re: Did your BGP crash today?

2010-08-27 Thread Clay Fiske

On Aug 27, 2010, at 12:13 PM, Richard A Steenbergen wrote:

 
 Just out of curiosity, at what point will we as operators rise up 
 against the ivory tower protocol designers at the IETF and demand that 
 they add a mechanism to not bring down the entire BGP session because of 
 a single malformed attribute? Did I miss the memo about the meeting? 
 I'll bring the punch and pie.

About the same time vendors' BGP implementations start to work correctly?

I agree such a knob would be useful, but seems to me that actually following 
the current standard would largely curb the issue by itself.

I recall one of the previous times something like this happened (and with a 
much wider impact), I believe it was $C that was accepting a bad attribute and 
passing it along. The effect was that other vendors ($F in particular, I think) 
would drop the session (per RFC), which made it look like they were the broken 
ones. Instead of saying "why was this accepted from its source?" the community 
reaction seemed more to me to be "hey, BGP is breaking the internet!"

If -everyone- dropped the session on a bad attribute, it likely wouldn't make 
it far enough into the wild to cause these problems in the first place.

-c




Re: Did your BGP crash today?

2010-08-27 Thread Valdis . Kletnieks
On Fri, 27 Aug 2010 13:43:39 PDT, Clay Fiske said:

 If -everyone- dropped the session on a bad attribute, it likely wouldn't
 make it far enough into the wild to cause these problems in the first
 place.

That works fine for malformed attributes.  It blows chunks for legally formed
but unknown attributes - how would you ever deploy a new attribute?





Re: Did your BGP crash today?

2010-08-27 Thread Richard A Steenbergen
On Fri, Aug 27, 2010 at 01:43:39PM -0700, Clay Fiske wrote:
 
 If -everyone- dropped the session on a bad attribute, it likely 
 wouldn't make it far enough into the wild to cause these problems in 
 the first place.

And if everyone filtered their BGP customers there would be no routing 
leaks, but we've seen how well that works. :)

The "if anything bad happens, drop the session" method of protection is 
only effective if EVERY BGP implementation catches EVERY malformed 
update EVERY time, which just doesn't match up with reality. Not only 
that, but a healthy number of the bgp update issues over the years have 
actually been the result of implementations detecting perfectly valid 
things as invalid, which means by definition the implementations which 
get it right and don't drop the session act as carriers and spread the 
problem route globally. How long are we going to continue to act like 
this method of protection is actually working?

Let's be reasonable, if your basic bgp message format is malformed you're 
going to need to drop the session. If the packet is corrupted or the 
size of the message doesn't match what's in the tlv, you're not going to 
be able to continue and you'll have to drop the session. But there are 
still a huge number of potential issues where it would be perfectly safe 
to drop the update you didn't like, and support for this could easily be 
negotiated and the sending side informed of the issue by a soft 
notification extension. I have yet to see a single argument against this 
which isn't political or philosophical in nature.
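
The distinction drawn above can be sketched in Python. This is only a sketch under stated assumptions: the attribute layout and the fixed lengths for ORIGIN, NEXT_HOP, MULTI_EXIT_DISC and LOCAL_PREF follow RFC 4271, while the `classify_update` function, the `FatalFramingError` class and the "discard-update" verdict are hypothetical names for the proposed behavior, not anything an existing implementation or standard defines.

```python
import struct


class FatalFramingError(Exception):
    """Lengths do not add up, so the next attribute cannot be located."""


# Value lengths fixed by RFC 4271: ORIGIN, NEXT_HOP, MULTI_EXIT_DISC, LOCAL_PREF.
FIXED_LENGTHS = {1: 1, 3: 4, 4: 4, 5: 4}


def classify_update(attr_block):
    """Return "accept" or "discard-update"; raise if framing is broken."""
    verdict = "accept"
    pos = 0
    while pos < len(attr_block):
        if len(attr_block) - pos < 3:
            raise FatalFramingError("truncated attribute header")
        flags, type_code = attr_block[pos], attr_block[pos + 1]
        if flags & 0x10:  # Extended Length: 2-byte length field
            if len(attr_block) - pos < 4:
                raise FatalFramingError("truncated extended length")
            (length,) = struct.unpack_from("!H", attr_block, pos + 2)
            pos += 4
        else:
            length = attr_block[pos + 2]
            pos += 3
        if pos + length > len(attr_block):
            raise FatalFramingError("attribute overruns the block")
        value = attr_block[pos:pos + length]
        pos += length

        # Framing is intact from here on, so a bad attribute only taints
        # this one update; the parser still knows where the next one starts.
        expected = FIXED_LENGTHS.get(type_code)
        if expected is not None and length != expected:
            verdict = "discard-update"
        elif type_code == 1 and value[0] > 2:  # ORIGIN must be 0, 1 or 2
            verdict = "discard-update"
    return verdict


print(classify_update(bytes([0x40, 1, 1, 0, 0x40, 3, 4, 192, 0, 2, 1])))
print(classify_update(bytes([0x40, 1, 1, 7])))
```

The first call is a clean ORIGIN plus NEXT_HOP and is accepted. The second carries an out-of-range ORIGIN value inside otherwise consistent framing, so only that update would need to be dropped. A block such as `bytes([0x40, 1, 5, 0])`, whose length overruns the data, raises `FatalFramingError`, which is the case where dropping the session is still the only option.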

-- 
Richard A Steenbergen r...@e-gerbil.net   http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)



Re: Did your BGP crash today?

2010-08-27 Thread bmanning

come on Chris,  is the Internet an experiment or not? :)
one would think that a responsible party would have made
efforts to let others in the playground know they were
going to try something different that could have ramifications
on an unknown distribution of some code bases.

I'm not asking my vendor or (in the case of OSS) me to run
full bit sweeps... but a heads up to some of the known
ops lists would have been not only welcome but expected.

as usual, YMMV

--bill


On Fri, Aug 27, 2010 at 04:11:32PM -0400, Christopher Morrow wrote:
 On Fri, Aug 27, 2010 at 4:07 PM, Mike Gatti ekim.it...@gmail.com wrote:
  where's the change management process in all of this?
  basically now we are going to start changing things that can
  potentially have an adverse effect on users without letting anyone know
  beforehand? Interesting concept.
 
 you are running bgp, you are connected to the 'internet'... congrats
 you are part of the experiment.
 
 I suppose one view is that at least it wasn't someone with ill
 intent, or a misconfigured MikroTik!
 
 (you are asking your vendors to run full bit sweeps of each protocol
 in a regimented manner checking for all possible edge cases and
 properly handling them, right?)
 
 -chris
 
 
 
 
 
 
 



Re: Did your BGP crash today?

2010-08-27 Thread Claudio Jeker
On Fri, Aug 27, 2010 at 04:57:17PM -0400, valdis.kletni...@vt.edu wrote:
 On Fri, 27 Aug 2010 13:43:39 PDT, Clay Fiske said:
 
  If -everyone- dropped the session on a bad attribute, it likely wouldn't
  make it far enough into the wild to cause these problems in the first
  place.
 
 That works fine for malformed attributes.  It blows chunks for legally formed
 but unknown attributes - how would you ever deploy a new attribute?
 
This is covered by the RFC. Unknown attributes are either dropped or
passed on depending on the attribute flags. The problem, as in the AS4
case, was that there were illegally formed unknown attributes that got
passed around and made RFC-compliant routers, which already handled AS4,
further down the chain fail. This problem was addressed in "Error Handling
for Optional Transitive BGP Attributes", but for some reason people think
it is necessary to make something simple more and more complex, so this
draft is still pending.
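
As a small illustration of the flag-dependent handling, the following Python sketch decodes the attribute flags octet and returns what RFC 4271 prescribes for an attribute the receiver does not recognize. The function names are made up for this example. The value 240 is the one from the "Unknown BGP attribute 99 (flags: 240)" log lines earlier in the thread.

```python
OPTIONAL = 0x80         # 0 means the attribute is well-known
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10  # length field is two bytes instead of one

FLAG_NAMES = [
    (OPTIONAL, "optional"),
    (TRANSITIVE, "transitive"),
    (PARTIAL, "partial"),
    (EXTENDED_LENGTH, "extended-length"),
]


def decode_flags(flags):
    """Return the names of the flag bits set in the attribute flags octet."""
    return [name for bit, name in FLAG_NAMES if flags & bit]


def unknown_attribute_action(flags):
    """What RFC 4271 says to do with an attribute type we do not recognize."""
    if not flags & OPTIONAL:
        # Well-known attributes must be recognized by every speaker.
        return "error: unrecognized well-known attribute"
    if flags & TRANSITIVE:
        return "keep and forward with the Partial bit set"
    return "quietly ignore"


print(decode_flags(240))
print(unknown_attribute_action(240))
```

Flags 240 (0xF0) decode to optional, transitive, partial and extended-length, so a speaker that does not know attribute 99 is expected to pass it along untouched. That is how a badly formed optional transitive attribute can travel several hops before reaching a router that actually tries to parse it.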

-- 
:wq Claudio



Re: Did your BGP crash today?

2010-08-27 Thread Warren Kumari


On Aug 27, 2010, at 5:37 PM, bmann...@vacation.karoshi.com wrote:



come on Chris,  is the Internet an experiment or not? :)
one would think that a responsible party would have made
efforts to let others in the playground know they were
going to try something different that could have ramifications
on an unknown distribution of some code bases.


I'm assuming that they weren't really expecting this to cause  
issues... Where does one draw the line? I'm planning on announcing  
x.y.z.0/20 later in the week -- x, y and z are all prime and the sum  
of all 3 is also a prime. There is a non-zero chance that something  
somewhere will go flooie, shall I send mail now or later?


Also, I would prefer that this gets discovered and dealt with (in this  
case by stopping the announcement :-)) than having folk not willing to  
try things and ending up with a weaponized version...


W




I'm not asking my vendor or (in the case of OSS) me to run
full bit sweeps... but a heads up to some of the known
ops lists would have been not only welcome but expected.

as usual, YMMV

--bill


On Fri, Aug 27, 2010 at 04:11:32PM -0400, Christopher Morrow wrote:
On Fri, Aug 27, 2010 at 4:07 PM, Mike Gatti ekim.it...@gmail.com wrote:

where's the change management process in all of this?
basically now we are going to start changing things that can
potentially have an adverse effect on users without letting anyone know
beforehand? Interesting concept.

you are running bgp, you are connected to the 'internet'... congrats
you are part of the experiment.

I suppose one view is that at least it wasn't someone with ill
intent, or a misconfigured MikroTik!

(you are asking your vendors to run full bit sweeps of each protocol
in a regimented manner checking for all possible edge cases and
properly handling them, right?)

-chris













--
What our ancestors would really be thinking, if they were alive today,  
is: Why is it so dark in here?


-- (Terry Pratchett, Pyramids)





Re: Did your BGP crash today?

2010-08-27 Thread Clay Fiske

On Aug 27, 2010, at 1:57 PM, valdis.kletni...@vt.edu wrote:

 On Fri, 27 Aug 2010 13:43:39 PDT, Clay Fiske said:
 
 If -everyone- dropped the session on a bad attribute, it likely wouldn't
 make it far enough into the wild to cause these problems in the first
 place.
 
 That works fine for malformed attributes.  It blows chunks for legally formed
 but unknown attributes - how would you ever deploy a new attribute?

By making it optional. Seems to me that's pretty well covered by the Path 
Attributes section of the RFC.

A bad attribute isn't simply unknown, it's malformed. My apologies for not 
wording that more precisely.

I do see the wisdom of fine-grained control of this behavior. I'm just saying, 
it'd be nice if we could have correct behavior on the basics in the first 
place. :) 

-c




Re: Did your BGP crash today?

2010-08-27 Thread Paul Ferguson

On Fri, Aug 27, 2010 at 5:02 PM, Clay Fiske c...@bloomcounty.org wrote:


 On Aug 27, 2010, at 1:57 PM, valdis.kletni...@vt.edu wrote:



 That works fine for malformed attributes.  It blows chunks for legally
 formed but unknown attributes - how would you ever deploy a new
 attribute?

 By making it optional. Seems to me that's pretty well covered by the Path
 Attributes section of the RFC.

 A bad attribute isn't simply unknown, it's malformed. My apologies for
 not wording that more precisely.

 I do see the wisdom of fine-grained control of this behavior. I'm just
 saying, it'd be nice if we could have correct behavior on the basics in
 the first place. :)


As an aside, I see that Cisco has released a late Friday afternoon security
advisory on this issue:

http://www.cisco.com/warp/public/707/cisco-sa-20100827-bgp.shtml

FYI,

- ferg



-- 
Fergie, a.k.a. Paul Ferguson
 Engineering Architecture for the Internet
 fergdawgster(at)gmail.com
 ferg's tech blog: http://fergdawg.blogspot.com/



Re: Did your BGP crash today?

2010-08-27 Thread Chris Adams
Once upon a time, Paul Ferguson fergdawgs...@gmail.com said:
 As an aside, I see that Cisco has released a late Friday afternoon security
 advisory on this issue:

Huh, I had an upstream (with Cisco gear on their end) do URGENT
maintenance last night with less than 12 hours notice.  I wonder if
this is why...

-- 
Chris Adams cmad...@hiwaay.net
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.



Re: Did your BGP crash today?

2010-08-27 Thread Randy Bush
 Just out of curiosity, at what point will we as operators rise up
 against the ivory tower protocol designers at the IETF and demand that
 they add a mechanism to not bring down the entire BGP session because
 of a single malformed attribute?

there is a problem underlying this.  bgp is not tlv.  so once a parser
detects an error, it can not *rigorously* know where to take up again.
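
One conceivable resync heuristic, and the reason it cannot be rigorous, can be sketched in Python. This is an assumption-laden illustration, not something any implementation is known to do: it scans a byte stream for the 19-byte BGP message header of RFC 4271 (16 marker bytes of 0xFF, a 2-byte length between 19 and 4096, a 1-byte type from 1 to 4). The `resync_candidates` name and the sample bytes are made up.

```python
import struct

MARKER = b"\xff" * 16
HEADER_LEN = 19
MAX_MESSAGE_LEN = 4096
KNOWN_TYPES = {1, 2, 3, 4}  # OPEN, UPDATE, NOTIFICATION, KEEPALIVE


def resync_candidates(stream):
    """Return every offset that looks like the start of a BGP message."""
    candidates = []
    for offset in range(len(stream) - HEADER_LEN + 1):
        if stream[offset:offset + 16] != MARKER:
            continue
        length, msg_type = struct.unpack_from("!HB", stream, offset + 16)
        if HEADER_LEN <= length <= MAX_MESSAGE_LEN and msg_type in KNOWN_TYPES:
            candidates.append(offset)
    return candidates


keepalive = MARKER + struct.pack("!HB", 19, 4)

# A message whose body happens to contain bytes that look like a header.
outer = MARKER + struct.pack("!HB", 38, 2) + keepalive

print(resync_candidates(outer))
```

The scan reports offsets 0 and 19, but offset 19 lies inside the body of the first message. Nothing in the format stops payload bytes from imitating a header, so a parser that has lost its place can guess at a restart point but cannot prove it has found one.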

randy



Re: Did your BGP crash today?

2010-08-27 Thread Randy Bush
 So much for better left off public mailing lists ! sigh !

damn!  security through obscurity busted again.  will people never
learn?
/sarcasm?

randy