Regarding the confusion about the AS numbers: Each node is indeed its own AS. However, to conserve AS numbers, some nodes share AS numbers. It seems like a contradiction, but it works anyway. The routers that share AS numbers are not in the same AS, because they do not have iBGP sessions connecting them and they do not have congruent routes as routers in the same AS would have.
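The point above can be illustrated with a small simulation (illustrative only, not from the thread; the topology follows Figure 1 with the full-mesh correction discussed below, and the AS numbers are made up): standard eBGP AS_PATH loop prevention makes a router reject any route that already carries its own AS, and since the routers that share an AS number sit on the same tier, no useful path ever needs to traverse both of them.

```python
from collections import deque

# eBGP sessions in a (fully meshed) 5-stage Clos, node numbers per Figure 1
links = {
    1: [3, 4], 2: [3, 4], 11: [9, 10], 12: [9, 10],        # ToRs
    3: [1, 2, 5, 6, 7, 8], 4: [1, 2, 5, 6, 7, 8],          # Tier-2, left
    9: [11, 12, 5, 6, 7, 8], 10: [11, 12, 5, 6, 7, 8],     # Tier-2, right
    5: [3, 4, 9, 10], 6: [3, 4, 9, 10],
    7: [3, 4, 9, 10], 8: [3, 4, 9, 10],                    # top tier
}

# Shared AS numbers as in the draft text; the values are made up
asn = {1: 65001, 2: 65002, 11: 65011, 12: 65012,
       3: 65034, 4: 65034, 9: 65910, 10: 65910,
       5: 65058, 6: 65058, 7: 65058, 8: 65058}

def propagate(origin):
    """Flood one prefix from `origin`, applying only the standard eBGP
    AS_PATH loop check; return the shortest AS_PATH learned per node."""
    best = {origin: []}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        path = [asn[node]] + best[node]
        for peer in links[node]:
            if asn[peer] in path:          # reject routes carrying own AS
                continue
            if peer not in best or len(path) < len(best[peer]):
                best[peer] = path
                queue.append(peer)
    return best

routes = propagate(11)
print(routes[1])   # [65034, 65058, 65910, 65011]: node 1 reaches node 11
```

Every node learns the prefix despite the shared AS numbers, which is why the scheme works without iBGP sessions between the tier-mates.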
-- Jakob Heitz.

On Nov 16, 2014, at 10:13 AM, Bruno Rijsman <[email protected]> wrote:

See >>> below for some comments on draft-filsfils-spring-segment-routing-msdc-00

-- Bruno

                             Tier-3
                            +-----+
                            |NODE |
                         +->|  5  |--+
                         |  +-----+  |
                 Tier-2  |           |   Tier-2
                +-----+  |  +-----+  |  +-----+
  +------------>|NODE |--+->|NODE |--+--|NODE |-------------+
  |       +-----|  3  |--+  |  6  |  +--|  9  |-----+       |
  |       |     +-----+     +-----+     +-----+     |       |
  |       |                                         |       |
  |       |     +-----+     +-----+     +-----+     |       |
  | +-----+---->|NODE |--+  |NODE |  +--|NODE |-----+-----+ |
  | |     | +---|  4  |--+->|  7  |--+--| 10  |---+ |     | |
  | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
  | |     | |            |           |            | |     | |
+-----+ +-----+          |  +-----+  |          +-----+ +-----+
|NODE | |NODE |  Tier-1  +->|NODE |--+  Tier-1  |NODE | |NODE |
|  1  | |  2  |             |  8  |             | 11  | | 12  |
+-----+ +-----+             +-----+             +-----+ +-----+
  | |     | |                 | |                 | |     | |
  A O     B O            <- Servers ->            Z O     O O

              Figure 1: 5-stage Clos topology

>>> Comment #1 start.

This figure appears to be a mixture of figure 1 (traditional topology) and figure 2 (3-stage folded Clos topology) in draft-ietf-rtgwg-bgp-routing-large-dc-00. In a proper 5-stage folded Clos topology, there would be a full mesh from each tier N to tier N+1 (e.g. node 3 would not only be connected to nodes 5 and 6 but also to nodes 7 and 8). Note that your numbering of the tiers is the reverse of the tier numbering in draft-ietf-rtgwg-bgp-routing-large-dc-00, which causes some confusion later on in this document.
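The missing full mesh can be made concrete with a few lines (a sketch, using the node numbers from Figure 1; a proper folded Clos requires every tier-2 node to connect to every tier-3 node):

```python
# Tier-2 -> Tier-3 links as actually drawn in Figure 1
links = {3: {5, 6}, 4: {7, 8}, 9: {5, 6}, 10: {7, 8}}
tier3 = [5, 6, 7, 8]

# List the tier-2 -> tier-3 links the figure is missing
missing = {(lo, hi) for lo in links for hi in tier3 if hi not in links[lo]}
print(sorted(missing))   # node 3 lacks 7 and 8, node 4 lacks 5 and 6, ...
```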
Here is a suggestion for a more accurate 5-stage folded Clos figure:

  +-------+  +-------+
  |  SS1  |  |  SS2  |                                 Super spine switches
  +-------+  +-------+

  +------+  +------+  +------+  +------+
  |  S1  |  |  S2  |  |  S3  |  |  S4  |               Spine switches
  +------+  +------+  +------+  +------+

  +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
  | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  Leaf switches (ToRs)
  +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+

   OOO    OOO    OOO    OOO    OOO    OOO    OOO    OOO    Servers

>>> Comment #1 end.

o  Each node is its own AS:

   For simple and efficient route propagation filtering, Nodes 5, 6, 7
   and 8 share the same AS, Nodes 3 and 4 share the same AS, nodes 9
   and 10 share the same AS.

   For efficient usage of the scarce 2-byte private AS pool, different
   tier-1 nodes might share the same AS.

   Without loss of generality, we will simplify these details in this
   document and assume that each node has its own AS.

>>> Comment #2 start.

The above section is somewhat confusing and appears to be self-contradictory. The title says that each node has its own AS. Then the first paragraph contradicts that and says that certain nodes definitely do have the same AS number. Then the next paragraph says that certain nodes might have the same AS number. Then the final paragraph says that each node has its own AS number.
Suggested text:

  In real-life deployments there are various ways in which AS numbers
  can be assigned to switches.

  One option is to use a single AS for all the switches in the entire
  data center. In this case, iBGP sessions are used everywhere, with
  next-hop-self policies to force the traffic along the desired path.

  Another option is to assign one AS per tier. In this case, all
  switches within a given tier are in the same AS, and eBGP sessions
  are used between the tiers.

  A third option is to assign a different AS number to each switch. In
  this case, each switch is a separate AS, and eBGP sessions are used
  everywhere.

  In this document we assume, without loss of generality, the third
  option.

>>> Comment #2 end.

o  The forwarding plane at Tier-2 and Tier-1 is MPLS.

o  The forwarding plane at Tier-3 is either IP2MPLS (if the host
   sends IP traffic) or MPLS2MPLS (if the host sends MPLS-
   encapsulated traffic).

>>> Comment #3 start.

The way you numbered the tiers in your diagram, MPLS is used in tier-2 and tier-3, and tier-1 is either IP2MPLS or MPLS2MPLS. (Note that the numbering of the tiers in your diagram is the reverse of the numbering in draft-ietf-rtgwg-bgp-routing-large-dc-00, which is likely the source of the confusion.)

Suggested text:

o  The forwarding plane between the spine switches and the super
   spine switches is MPLS.

o  The forwarding plane between the leaf switches (= ToR switches)
   and the spine switches is MPLS.

o  The forwarding plane between the servers and the leaf switches
   (= ToR switches) may be MPLS or IP.

>>> Comment #3 end.

In this document, we also refer to the Tier-3, Tier-2 and Tier-1 switches respectively as Spine, Leaf and ToR (top of rack) switches. When a ToR switch acts as a gateway to the "outside world", we call it a border switch.

>>> Comment #4 start.

I believe it is common to consider the ToR switches (i.e. the switches that are connected to the servers) to be the leaf switches.
I would suggest the following terminology:

o  ToR switch = leaf switch
o  Spine switch
o  Super spine switch

I also believe that it is more common to number the tiers starting from the bottom (leaf) with index 0:

o  ToR switch = leaf switch = tier 0
o  Spine switch = tier 1
o  Super spine switch = tier 2

>>> Comment #4 end.

Node 11 sends the following eBGP3107 update to Node 10:

. NLRI: 1.1.1.11/32
. Label: Implicit-Null
. Next-hop: Node11's interface address on the link to Node10
. AS Path: {11}
. BGP-Prefix Attribute: Index 11

Node 10 receives the above update. As it is SR capable, Node10 is able to interpret the BGP-Prefix Attribute and hence allocates the label 16011 to the NLRI (instead of asking for a "random/local" label from its label manager). The implicit-null label in the update signals to Node 10 that it is the penultimate hop and MUST pop the top label on the stack before forwarding traffic for this prefix to Node 11.

Then, Node 10 sends the following eBGP3107 update to Node 7:

. NLRI: 1.1.1.11/32
. Label: 16011
. Next-hop: Node10's interface address on the link to Node7
. AS Path: {10, 11}
. BGP-Prefix Attribute: Index 11

>>> Comment #5 start.

As described here, this proposal requires that node 10 has a-priori knowledge of the globally significant label which must be assigned to prefix 1.1.1.11/32. This is not only true for node 10; it is also true for every other node connected to node 11 which receives prefix 1.1.1.11/32 with an implicit-null label.

From a practical operational view, it might require that every node has a-priori knowledge of the binding of MPLS labels to nodes. What this means, in effect, is that the BGP signaling is only used for reachability detection, and not for the distribution of label-to-node bindings.

Providing each node with a-priori knowledge of the label-to-node binding for every node in the POD may be too much of a configuration burden for most operators.
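For concreteness, the derivation that makes 16011 "globally significant" can be sketched as follows (a sketch, not draft text; the SRGB base of 16000 is an assumption, chosen to be consistent with index 11 mapping to label 16011):

```python
SRGB_BASE = 16000   # assumed common SRGB base on every SR-capable node
IMPLICIT_NULL = 3   # well-known MPLS label value

def on_receive(index, received_label):
    """Receiver behavior as sketched in the walkthrough: derive the
    local label from the BGP-Prefix attribute index, and note PHP
    when the neighbor advertised Implicit-Null."""
    label = SRGB_BASE + index
    php = received_label == IMPLICIT_NULL
    return label, php

# Node 10 receives Node 11's update (Implicit-Null, index 11):
label, php = on_receive(11, IMPLICIT_NULL)
print(label, php)   # 16011 True  -> pop before forwarding to Node 11

# Node 7 receives Node 10's re-advertisement (label 16011, index 11):
label, php = on_receive(11, label)
print(label, php)   # 16011 False -> swap 16011 -> 16011
```

Because every node computes the same `SRGB_BASE + index`, the "a-priori knowledge" reduces to agreeing on the SRGB base and carrying the index in the update, which is exactly what the Prefix-SID mechanism discussed next provides.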
This problem is solved later on in this draft using the I-D.keyupate-idr-bgp-prefix-sid mechanism (although the problem is still there for transition scenarios).

Alternatively, avoiding the need for pre-configured label-to-prefix bindings can also be achieved by making the following BGP implementation changes without any on-the-wire protocol change:

1. Each ultimate-hop node is configured to advertise its global label instead of implicit null.

2. Each node which receives a BGP-LU advertisement is configured to:

   2a) Not allocate a new locally significant label when it does a next-hop-self. Instead, it keeps the received label. This behavior could be restricted to a particular label block.

   2b) Pop the label (i.e. do a PHP) when it is forwarding the packet to the ultimate hop (which can be detected using the AS path), despite the fact that the ultimate hop advertised a real label instead of an implicit-null label.

>>> Comment #5 end.

   -----------------------------------------------
   Incoming label    | outgoing label | Outgoing
   or IP destination |                | Interface
   ------------------+----------------+-----------
   16011             | 16011          | ECMP{7, 8}
   1.1.1.11/32       | 16011          | ECMP{7, 8}
   ------------------+----------------+-----------

          Figure 4: Node-4 Forwarding Table

>>> Comment #6 start.

In the example topology, the spine switches were not fully meshed to the super spine switches. In a real-life topology they would be fully meshed, and the ECMP set would be a 4-way ECMP set ECMP{5, 6, 7, 8}.

>>> Comment #6 end.

   -----------------------------------------------
   Incoming label    | outgoing label | Outgoing
   or IP destination |                | Interface
   ------------------+----------------+-----------
   16011             | 16011          | 10
   1.1.1.11/32       | 16011          | 10
   ------------------+----------------+-----------

          Figure 5: Node-7 Forwarding Table

>>> Comment #6 start.

In the example topology, the spine switches were not fully meshed to the super spine switches.
In a real-life topology they would be fully meshed, and the ECMP set would be a 2-way ECMP set ECMP{9, 10}.

>>> Comment #6 end.

3.3. Network Design Variation

A network design choice could consist of switching all the traffic through Tier-2 and Tier-3 as MPLS traffic. In this case, one could filter away the IP entries at Nodes 4, 7 and 10. This might be beneficial in order to optimize the forwarding table size.

A network design choice could consist of allowing the hosts to send MPLS-encapsulated traffic (based on the EPE use-case, [I-D.filsfils-spring-segment-routing-central-epe]). For example, Node 1 would receive Node11-destined MPLS-encapsulated traffic from its attached host A and would switch this traffic on the basis of the MPLS entry for 16011 (instead of classically receiving IP traffic from A and performing an IP2MPLS switching operation).

>>> Comment #7 start.

It would be good to point out explicitly that the second approach allows the hosts to send a multi-label stack (for example to implement egress peer engineering as described later in this I-D). If IP forwarding were used between the leaf switches, this would require some MPLS-over-IP tunnel approach (e.g. MPLS-over-GRE). Using MPLS as the base tunnel mechanism is more consistent and allows other features to be supported that cannot be implemented with MPLS-over-IP-tunnel based approaches (e.g. the capacity optimization use case described in section 4.4 of the I-D).

>>> Comment #7 end.

From a signaling viewpoint, nothing would change: even if Node6 does not understand the BGP-Prefix Segment attribute, it does propagate it unmodified to its neighbors. From a label allocation viewpoint, the only difference is that Node7 would allocate a dynamic label to the prefix 1.1.1.11/32 (e.g. 12345) and would advertise that label to its neighbor Node4.

>>> Comment #8 start.

There may be one change which is required on the legacy Node7.
We have to make sure that the MPLS label which is allocated by Node7 does not "collide" with the globally significant labels. For example, things would break if the legacy Node7 happens to dynamically allocate a 1600x label. This can be avoided by introducing a configuration knob in the label allocation subsystem of the switch for "off-limits label blocks".

>>> Comment #8 end.

_______________________________________________
spring mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/spring
