*See >>> below for some comments on
draft-filsfils-spring-segment-routing-msdc-00*

*-- Bruno*





                                   Tier-3
                                  +-----+
                                  |NODE |
                               +->|  5  |--+
                               |  +-----+  |
                       Tier-2  |           |   Tier-2
                      +-----+  |  +-----+  |  +-----+
        +------------>|NODE |--+->|NODE |--+--|NODE |-------------+
        |       +-----|  3  |--+  |  6  |  +--|  9  |-----+       |
        |       |     +-----+     +-----+     +-----+     |       |
        |       |                                         |       |
        |       |     +-----+     +-----+     +-----+     |       |
        | +-----+---->|NODE |--+  |NODE |  +--|NODE |-----+-----+ |
        | |     | +---|  4  |--+->|  7  |--+--| 10  |---+ |     | |
        | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
        | |     | |            |           |            | |     | |
      +-----+ +-----+          |  +-----+  |          +-----+ +-----+
      |NODE | |NODE | Tier-1   +->|NODE |--+   Tier-1 |NODE | |NODE |
      |  1  | |  2  |             |  8  |             | 11  | |  12 |
      +-----+ +-----+             +-----+             +-----+ +-----+
        | |     | |                                     | |     | |
        A O     B O            <- Servers ->            Z O     O O

                      Figure 1: 5-stage Clos topology





*>>> Comment #1 start.  *



*This figure appears to be a mixture of figure 1 (traditional
topology) and figure 2 (3-stage folded Clos topology) in
draft-ietf-rtgwg-bgp-routing-large-dc-00.  In a proper 5-stage folded Clos
topology, there would be a full mesh from each tier N to tier N+1 (e.g.
node 3 would not only be connected to nodes 5 and 6 but also to nodes 7 and
8).  Note that your numbering of the tiers is the reverse of the numbering
of the tiers in draft-ietf-rtgwg-bgp-routing-large-dc-00, which causes some
confusion later on in this document.  Here is a suggestion for a more
accurate 5-stage folded Clos figure:*



                          +-------+  +-------+
                          | SS1   |  | SS2   |  Super spine switches
                          +-------+  +-------+
                           | | | |    | | | |
         +-----------------+ | | |    | | | |
         |          +--------+ | |    | | | +-----------------+
         |   +-----------------|-|----+ | +--------+          |
         |   |      |          | +------------------------+   |
         |   |      |          +---------------+   |      |   |
         |   |      |   +---------------+      |   |      |   |
         |   |      |   |                      |   |      |   |
       +-------+  +-------+                  +-------+  +-------+
       | S1    |  | S2    |                  | S3    |  | S4    |  Spine switches
       +-------+  +-------+                  +-------+  +-------+
        | | | |    | | | |                    | | | |    | | | |
 +------+ | | |    | | | +------+    +------+ | | |    | | | +------+
 |   +----|-|-|----+ | |        |    |   +----|-|-|----+ | |        |
 |   |    | | +-------------+   |    |   |    | | +-------------+   |
 |   |    | +------+ | |    |   |    |   |    | +------+ | |    |   |
 |   |    |   +----|-+ |    |   |    |   |    |   +----|-+ |    |   |
 |   |    |   |    |   |    |   |    |   |    |   |    |   |    |   |
+-----+  +-----+  +-----+  +-----+  +-----+  +-----+  +-----+  +-----+
| L1  |  | L2  |  | L3  |  | L4  |  | L5  |  | L6  |  | L7  |  | L8  |  Leaf switches (ToRs)
+-----+  +-----+  +-----+  +-----+  +-----+  +-----+  +-----+  +-----+
 | | |    | | |    | | |    | | |    | | |    | | |    | | |    | | |
 O O O    O O O    O O O    O O O    O O O    O O O    O O O    O O O   Servers



*>>> Comment #1 end.  *





   o  Each node is its own AS:

         For simple and efficient route propagation filtering, Nodes 5,
         6, 7 and 8 share the same AS, Nodes 3 and 4 share the same AS,
         nodes 9 and 10 share the same AS.

         For efficient usage of the scarce 2-byte private AS pool,
         different tier-1 nodes might share the same AS.

         Without loss of generality, we will simplify these details in
         this document and assume that each node has its own AS.



*>>> Comment #2 start.*



*The above section is somewhat confusing and appears to be
self-contradictory.  The title says that each node has its own AS.  Then
the first paragraph contradicts that and says that certain nodes definitely
do have the same AS number.  Then the next paragraph says that certain
nodes might have the same AS number.  Then the final paragraph says that
each node has its own AS number.*



*Suggested text:*



*In real-life deployments there are various ways in which AS numbers can be
assigned to switches.*



*One option is to use a single AS for all the switches in the entire data
center.  In this case IBGP sessions are used everywhere with next-hop-self
policies to force the traffic along the desired path.*



*Another option is to make each tier an AS and use EBGP sessions between
the tiers. In this case all switches within a given tier are in the same AS
and EBGP sessions are used between the tiers.*



*A third option is to assign a different AS number to each switch. In this
case each switch is a separate AS and EBGP sessions are used everywhere.*



*In this document we assume, without loss of generality, the third option.*



*>>> Comment #2 end.*





   o  The forwarding plane at Tier-2 and Tier-1 is MPLS.

   o  The forwarding plane at Tier-3 is either IP2MPLS (if the host
      sends IP traffic) or MPLS2MPLS (if the host sends MPLS-
      encapsulated traffic).



*>>> Comment #3 start.*



*The way you numbered the tiers in your diagram, MPLS is in tier-2 and
tier-3, and tier-1 is either IP2MPLS or MPLS.  (Note that the numbering of
the tiers in your diagram is the reverse from the numbering in
draft-ietf-rtgwg-bgp-routing-large-dc-00 which is likely the source of the
confusion.)*



*Suggested text:*



*o The forwarding plane between the spine switches and the super spine
switches is MPLS*



*o The forwarding plane between the leaf switches (= ToR switches) and the
spine switches is MPLS*



*o The forwarding plane between the servers and the leaf switches (= ToR
switches) may be MPLS or IP.*



*>>> Comment #3 end.*





   In this document, we also refer to the Tier-3, Tier-2 and Tier-1
   switches respectively as Spine, Leaf and ToR (top of rack) switches.
   When a ToR switch acts as a gateway to the "outside world", we call
   it a border switch.



*>>> Comment #4 start.*



*I believe it is common to consider the ToR switches (i.e. the switches
that are connected to the servers) to be the leaf switches.*



*I would suggest the following terminology:*

*o ToR switch = leaf switch*

*o Spine switch*

*o Super spine switch*



*I also believe that it is more common to number the tiers starting from
the bottom (leaf) with index 0:*

*o ToR switch = leaf switch = tier 0*

*o Spine switch = tier 1*

*o Super spine switch = tier 2*



*>>> Comment #4 end.*



   Node 11 sends the following eBGP3107 update to Node 10:

   . NLRI:  1.1.1.11/32
   . Label: Implicit-Null
   . Next-hop: Node11's interface address on the link to Node10
   . AS Path: {11}
   . BGP-Prefix Attribute: Index 11

   Node 10 receives the above update.  As it is SR capable, Node10 is
   able to interpret the BGP-Prefix Attribute and hence allocates the
   label 16011 to the NLRI (instead of asking a "random/local" label
   from its label manager).  The implicit-null label in the update
   signals to Node 10 that it is the penultimate hop and MUST pop the
   top label on the stack before forwarding traffic for this prefix to
   Node 11.
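
The index-to-label derivation described above can be sketched as follows (a minimal illustration; the base of 16000 matches the 16011 example, but the block size is an assumed value):

```python
# Sketch: how an SR-capable node derives its local label from a
# BGP-Prefix Segment index.  Base 16000 matches the 16011 example
# above; the block size of 8000 is an assumption for illustration.
SRGB_BASE = 16000
SRGB_SIZE = 8000

def label_for_index(index):
    """Map a BGP-Prefix Segment index into the SRGB."""
    if not 0 <= index < SRGB_SIZE:
        raise ValueError("index outside the SRGB")
    return SRGB_BASE + index

# Node 10 receives Index 11 for 1.1.1.11/32 and allocates:
print(label_for_index(11))  # -> 16011
```

Because every SR-capable node applies the same deterministic mapping, the label for a given prefix is identical on every hop, which is what makes it "globally significant".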



   Then, Node 10 sends the following eBGP3107 update to Node 7:

   . NLRI:  1.1.1.11/32
   . Label: 16011
   . Next-hop: Node10's interface address on the link to Node7
   . AS Path: {10, 11}
   . BGP-Prefix Attribute: Index 11



*>>> Comment #5 start.*



*As described here, this proposal requires that node 10 has a-priori
knowledge of the globally significant label which must be assigned to
prefix 1.1.1.11/32.*



*This is not only true for node 10, but it is also true for every other
node connected to node 11 which receives prefix 1.1.1.11/32 with an
implicit null label.*



*From a practical operational view, it might require that every node has
a-priori knowledge of the binding of MPLS labels to nodes.  What this
means, in effect, is that the BGP signaling is only used for reachability
detection, and not for distribution of label-to-node bindings.*



*Providing each node with a-priori knowledge of label to node binding for
every node in the POD may be too much of a configuration burden for most
operators.*



*This problem is solved later on in this draft using the
I-D.keyupate-idr-bgp-prefix-sid mechanism (although the problem is still
there for transition scenarios).*



*Alternatively, avoiding the need for pre-configured label-to-prefix
binding can also be achieved by making the following BGP implementation
changes without any on-the-wire protocol change.*



*1. Each ultimate hop node is configured to advertise its global label
instead of implicit null.*



*2. Each node which receives a BGP-LU advertisement is configured to: *



*2a) Not allocate a new locally significant label when it does a next-hop
self.  Instead, it keeps the received label.  This behavior could be
restricted to a particular label block.*



*2b) Pop the label (i.e. do a PHP) when it is forwarding the packet to the
ultimate hop (which can be detected using the AS-path) despite the fact
that the ultimate hop advertised a real label instead of an implicit null
label.*
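
*The receive-side rules 2a/2b above could be sketched as follows (the helper name and the 16000-23999 global block are assumptions, not from the draft):*

```python
# Sketch of rules 2a/2b above (hypothetical helper; the global label
# block 16000-23999 is an assumed example range).
GLOBAL_BLOCK = range(16000, 24000)

def outgoing_label(received_label, as_path):
    """Label to use when re-advertising with next-hop-self.

    2b: pop (PHP) when the advertising neighbour is the ultimate hop,
        i.e. the AS path holds only the originating AS.
    2a: otherwise keep the received label if it is in the global block.
    """
    if len(as_path) == 1:
        return "pop"
    if received_label in GLOBAL_BLOCK:
        return received_label
    return "allocate-local"  # legacy behaviour outside the block

print(outgoing_label(16011, [11]))      # -> pop (Node 10 toward Node 11)
print(outgoing_label(16011, [10, 11]))  # -> 16011 (Node 7 keeps the label)
```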



*>>> Comment #5 end.*



              -----------------------------------------------
              Incoming label    | Outgoing label | Outgoing
              or IP destination |                | Interface
              ------------------+----------------+-----------
                   16011        |      16011     | ECMP{7, 8}
                1.1.1.11/32     |      16011     | ECMP{7, 8}
              ------------------+----------------+-----------

                     Figure 4: Node-4 Forwarding Table



*>>> Comment #6 start.*



*In the example topology, the spine switches are not fully meshed to the
super spine switches.  In a real-life topology they would be fully meshed,
and the ECMP set would be a 4-way ECMP set ECMP{5, 6, 7, 8}.*



*>>> Comment #6 end.*
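
The forwarding behaviour in Figure 4 can be sketched as a single FIB entry keyed by incoming label or IP FEC, with flow hashing across the ECMP set (the data structure and hash input are illustrative assumptions):

```python
# Sketch of Node 4's forwarding decision (Figure 4): each entry maps
# an incoming label or IP FEC to (outgoing label, ECMP next-hop set).
# With a full mesh to the super spines the set would be [5, 6, 7, 8].
FIB = {
    16011:         (16011, [7, 8]),  # MPLS-to-MPLS swap (same label)
    "1.1.1.11/32": (16011, [7, 8]),  # IP-to-MPLS push
}

def forward(key, flow_hash):
    """Return (outgoing label, chosen next-hop node) for a packet."""
    out_label, ecmp_set = FIB[key]
    return out_label, ecmp_set[flow_hash % len(ecmp_set)]

print(forward(16011, 3))  # -> (16011, 8)
```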





              -----------------------------------------------
              Incoming label    | Outgoing label | Outgoing
              or IP destination |                | Interface
              ------------------+----------------+-----------
                   16011        |      16011     |    10
                1.1.1.11/32     |      16011     |    10
              ------------------+----------------+-----------

                     Figure 5: Node-7 Forwarding Table





*>>> Comment #6 start.*



*In the example topology, the spine switches are not fully meshed to the
super spine switches.  In a real-life topology they would be fully meshed,
and the ECMP set would be a 2-way ECMP set ECMP{9, 10}.*



*>>> Comment #6 end.*







3.3.  Network Design Variation

   A network design choice could consist of switching all the traffic
   through tier-2 and tier-3 as MPLS traffic.  In this case, one could
   filter away the IP entries at nodes 4, 7 and 10.  This might be
   beneficial in order to optimize the forwarding table size.

   A network design choice could consist in allowing the hosts to send
   MPLS-encapsulated traffic (based on EPE use-case,
   [I-D.filsfils-spring-segment-routing-central-epe]).  For example,
   Node 1 would receive Node11-destined MPLS-encapsulated traffic from
   its attached host A and would switch this traffic on the basis of the
   MPLS entry for 16011 (instead of classically receiving IP traffic
   from A and performing an IPtoMPLS switching operation).



*>>> Comment #7 start.*



*It would be good to point out explicitly that the second approach allows
the hosts to send a multi-label stack (for example to implement egress peer
engineering as described later in this I-D).  If IP forwarding is used
between leaf switches, then this would require some MPLS over IP-tunnel
approach (e.g. MPLS-over-GRE).  Using MPLS as the base tunnel mechanism is
more consistent and allows other features to be supported that cannot be
implemented with MPLS-over-IP-tunnel based approaches (e.g. the capacity
optimization use case described in section 4.4 of the I-D).*



*>>> Comment #7 end.*
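
As a toy illustration of the multi-label-stack point above: 16011 is the prefix segment for 1.1.1.11/32 from the example, while the EPE peer-segment label 500001 is hypothetical.

```python
# Toy sketch: host A builds a two-label stack so Node 1 switches
# purely on MPLS.  The outer label (16011) steers the packet to the
# border node; the inner, hypothetical EPE label (500001) would pick
# the egress peer there.
def push_stack(payload, labels):
    """Return a packet as (label, ..., payload), outermost label first."""
    return tuple(labels) + (payload,)

pkt = push_stack("ip-payload", [16011, 500001])
print(pkt)  # -> (16011, 500001, 'ip-payload')
```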



   From a signaling viewpoint, nothing would change as even if Node6
   does not understand the BGP-Prefix Segment attribute, it does
   propagate it unmodified to its neighbors.

   From a label allocation viewpoint, the only difference is that Node7
   would allocate a dynamic label to the prefix 1.1.1.11/32 (e.g.
   12345) and would advertise that label to its neighbor Node4.



*>>> Comment #8 start.*



*There may be one change which is required on the legacy Node7.*



*We have to make sure that the MPLS label which is allocated by Node7 does
not "collide" with the globally significant labels.  For example, things
would break if legacy Node7 happens to dynamically allocate a 1600x
label.  This can be avoided by introducing a configuration knob in the
label allocation subsystem of the switch for "off-limits label blocks".*
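
*Such a knob could be sketched as follows (a minimal illustration; the off-limits range 16000-23999 is an assumed example):*

```python
# Sketch of an "off-limits label blocks" knob: the dynamic label
# allocator skips any configured reserved ranges so locally allocated
# labels never collide with the globally significant block.
OFF_LIMITS = [(16000, 23999)]  # assumed example range

def dynamic_labels(start=10000):
    """Yield dynamically allocatable labels, skipping off-limits blocks."""
    label = start
    while True:
        if not any(lo <= label <= hi for lo, hi in OFF_LIMITS):
            yield label
        label += 1

gen = dynamic_labels(15999)
print(next(gen))  # -> 15999
print(next(gen))  # -> 24000  (the 16000-23999 block is skipped)
```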



*>>> Comment #8 end.*
_______________________________________________
spring mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/spring
