Re: Validating multi-path in production?

2021-11-26 Thread Mark Tinka




On 11/12/21 23:47, Jeff Tantsura wrote:


> LAG - Micro BFD (RFC7130) provides per-constituent liveness detection.


Not sure if this has changed, but the last time I looked into it, Micro 
BFD for LAGs was only supported and functional on point-to-point 
Ethernet links.

In cases where you are running a LAN, it did not apply.

We gave up running BFD on LAGs on LANs because of this issue.

Mark.


Re: Validating multi-path in production?

2021-11-15 Thread Tom Beecher
It sounds like you want something like this:

https://github.com/facebookarchive/fbtracert

We have an internal tool that works on broadly similar principles, and it
works pretty well.

( I have no relationship with Facebook; I just always remember their presos
on UDPinger and FBTracert from my first NANOG meeting for whatever reason.
:) )



Re: Validating multi-path in production?

2021-11-14 Thread Martijn Schmidt via NANOG
If your ECMP hashing algorithm considers L4 data, I can recommend giving the TCP 
mode of the standard Linux MTR package a try. While the destination port 
remains constant (IIRC it defaults to TCP/80), each iteration uses a 
different TCP source port, which introduces enough entropy to see whether you 
get packet loss on some of the links in an ECMP-heavy forwarding path. 
This always worked wonders for me back in the day when hunting down a broken 
port in a pair of 5x10G LACP bundles (i.e. 10 possible paths), or when 
trying to find the rotten switching fabric in a chassis from a vendor with less 
than stellar debugging capabilities. Do keep in mind that you need to keep MTR 
running for a longer period to collect a statistically significant amount 
of data before concluding anything from the percentages; say, 10 minutes 
rather than 1 minute.
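
To put rough numbers on "statistically significant", here is a minimal
Python sketch. It ASSUMES every probe is hashed uniformly and independently
across the available paths, which real hash functions may not do; the
function names are illustrative and not part of MTR or any other tool.

```python
import math


def probes_to_hit_path(n_paths: int, confidence: float = 0.99) -> int:
    """Probes needed so that at least one lands on one particular path.

    ASSUMPTION: each probe picks one of n_paths uniformly and independently.
    """
    if n_paths < 2:
        return 1
    miss_one = 1.0 - 1.0 / n_paths  # chance a single probe misses that path
    return math.ceil(math.log(1.0 - confidence) / math.log(miss_one))


def expected_loss(n_paths: int, dead_paths: int = 1) -> float:
    """End-to-end loss fraction when dead_paths of n_paths drop everything."""
    return dead_paths / n_paths


if __name__ == "__main__":
    print(probes_to_hit_path(10, 0.99))
    print(expected_loss(10))
```

Under that assumption, one dead member out of 10 shows up as roughly 10%
loss overall, and it takes 44 probes before there is a 99% chance that at
least one of them has crossed the dead member at all. Far more are needed
before the loss percentage itself settles down.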

Best regards,
Martijn



Re: Validating multi-path in production?

2021-11-14 Thread Adam Thompson
The problem I'm looking to solve is the logical opposite, I think: I want to 
demonstrate that no links are malfunctioning in such a way that packets on a 
certain path are getting silently dropped.  Which has some "proving a negative" 
aspects to it, unfortunately.
I think the only way I can demonstrate it is to determine that every single 
multi-path/hashed-member link is working, which is... hard.  Especially if I 
need to deal with the combinatoric explosion - I *think* I can skip that part.
-Adam



Re: Validating multi-path in production?

2021-11-14 Thread James Bensley
On Fri, 12 Nov 2021 at 16:54, Adam Thompson  wrote:

> The best I've come up with so far is to have two test systems (typically
> VMs) that use adjacent IP addresses and adjacent MAC addresses, and test
> both inbound and outbound to/from those, blindly trusting/hoping that
> hashing algorithms will *probably* exercise both paths.
>

If the goal is to test that traffic *is* being distributed across multiple
links based on traffic headers, then you can definitely roll your own. I
think the problem is orchestrating it (feeding your topology data into the
tool, running the tool, getting the results out, and interpreting the
results, etc.).

A couple of public examples:
https://github.com/facebookarchive/UdpPinger
https://www.youtube.com/watch?v=PN-4JKjCAT0

If you do roll your own, you need to tailor the tests to your topology and
your equipment. For example, you can have two VMs as you mentioned, one at
each end of the network. Then, if your network uses a 5-tuple for ECMP
inside the core, you could send many flows between the two VMs while
rotating the source port, to ensure all links in a LAG or all ECMP paths
are used.
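
A minimal Python sketch of that idea, under stated assumptions: the far-end
VM runs something that echoes UDP payloads back to the sender (that
responder is hypothetical and not shown), the devices in between include
the L4 source port in their hash, and the function name and port numbers
are made up for illustration.

```python
import socket


def probe_source_ports(dst_host, dst_port, src_ports, timeout=1.0):
    """Send one UDP probe per source port; report which ones came back.

    ASSUMPTION: (dst_host, dst_port) echoes UDP payloads to the sender.
    Returns {src_port: True | False}, or None where the local port could
    not be bound (for example because it is already in use).
    """
    results = {}
    for port in src_ports:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.bind(("", port))  # fix the source port for this flow
            except OSError:
                results[port] = None
                continue
            payload = f"probe-{port}".encode()
            try:
                sock.sendto(payload, (dst_host, dst_port))
                data, _ = sock.recvfrom(2048)
                results[port] = data == payload
            except OSError:  # includes timeouts: no echo received
                results[port] = False
    return results
```

Calling it with, say, a few hundred consecutive source ports and repeating
the run several times gives a per-flow view: a source port that fails on
every run while its neighbours succeed points at one hashed member rather
than at general loss. A single lost probe proves nothing on its own.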

It's tricky to know the hashing algo for every type of device you have in
your network, and for each traffic type on each device type, if you have a
multi-vendor network. Also, if your network carries a mix of IPv4, IPv6,
PPP, MPLS L3 VPNs, MPLS L2 VPNs, GRE, GTP, IPsec, etc., the number of
test permutations you need to run, and of result sets you need to parse,
grows very rapidly.

Cheers,
James.


Re: Validating multi-path in production?

2021-11-13 Thread Saku Ytti
On Fri, 12 Nov 2021 at 17:55, Adam Thompson  wrote:

> The best I've come up with so far is to have two test systems (typically VMs) 
> that use adjacent IP addresses and adjacent MAC addresses, and test both 
> inbound and outbound to/from those, blindly trusting/hoping that hashing 
> algorithms will probably exercise both paths.

Add RFC5837 (ICMP extensions for interface and next-hop identification) to
your RFPs.

-- 
  ++ytti


Re: Validating multi-path in production?

2021-11-12 Thread Jeff Tantsura
LAG - Micro BFD (RFC7130) provides per-constituent liveness detection. MLAG is 
much more complicated (there's a proposal in the IETF, but it is not 
progressing), so LACP is pretty much the only option.
ECMP could use good old single-hop BFD per pair.
Practically - if you introduce enough flows with one of the hash keys 
monotonically changing, eventually you'd exercise every available path. That 
in itself would not help for end-to-end testing; it is usually integrated with 
some form of sFlow/NetFlow to provide "proof of transit" (POT).
In-band telemetry (choose your poison) does provide the basic ID of each device 
a packet has traversed, as well as POT in some cases.
Finally - there are public Microsoft presentations on how we use IP-in-IP encap 
to traverse a particular path on wide-radix ECMP fabrics.
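
As a rough guide to how many flows count as "enough", here is a minimal
Python sketch. It ASSUMES the hash spreads flows uniformly and independently
over the paths (a monotonically changing key through a real hash function
may do better or worse), and the function names are illustrative only.

```python
import math


def prob_all_paths_hit(n_paths: int, n_flows: int) -> float:
    """Probability that n_flows flows touch every one of n_paths.

    Inclusion-exclusion over the number k of paths that are missed.
    ASSUMPTION: each flow lands on a uniformly random, independent path.
    """
    return sum(
        (-1) ** k * math.comb(n_paths, k) * (1 - k / n_paths) ** n_flows
        for k in range(n_paths + 1)
    )


def flows_needed(n_paths: int, confidence: float = 0.999) -> int:
    """Smallest flow count that covers all paths with the given confidence."""
    n_flows = n_paths
    while prob_all_paths_hit(n_paths, n_flows) < confidence:
        n_flows += 1
    return n_flows


if __name__ == "__main__":
    for paths in (2, 4, 8, 16):
        print(paths, flows_needed(paths))
```

For the two-path case this comes out at 11 flows for 99.9% confidence that
both paths were exercised; the count grows faster than linearly with the
number of paths.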

Cheers,
Jeff



Validating multi-path in production?

2021-11-12 Thread Adam Thompson
Hello all.
Over time, we've run into occurrences of both bugs and human error, both in our 
own gear and in our partner networks' gear, specifically affecting multi-path 
forwarding, at pretty much all layers: Multi-chassis LAG, ECMP, and BGP MP.  
(Yes, I am a corner-case magnet.  Lucky me.)

Some of these issues were fairly obvious when they happened, but some were 
really hard to pin down.

We've found that typical network monitoring tools (Observium & Smokeping, not 
to mention plain old ping and traceroute) can't really detect a hashing-related 
or multi-path-related problem: either the packets get through or they don't.

Can anyone recommend either tools or techniques to validate that multi-path 
forwarding either is, or isn't, working correctly in a production network?  I'm 
looking to add something to our test suite for when we make changes to critical 
network gear.  Almost all the scenarios I want to test only involve two paths, 
if that helps.

The best I've come up with so far is to have two test systems (typically VMs) 
that use adjacent IP addresses and adjacent MAC addresses, and test both 
inbound and outbound to/from those, blindly trusting/hoping that hashing 
algorithms will probably exercise both paths.

Some of the problems we've seen show that merely looking at interface counters 
is insufficient, so I'm trying to find an explicit proof, not implicit.

Any suggestions?  Surely other vendors and/or admins have screwed this up in 
subtle ways enough times that this knowledge exists?  (My Google-fu is usually 
pretty good, but I'm striking out - maybe I'm using the wrong terms.)

-Adam

Adam Thompson
Consultant, Infrastructure Services
100 - 135 Innovation Drive
Winnipeg, MB, R3T 6A8
(204) 977-6824 or 1-800-430-6404 (MB only)
athomp...@merlin.mb.ca
www.merlin.mb.ca