Regarding the confusion about the AS numbers: Each node is indeed its own AS. However, to conserve AS numbers, some nodes share AS numbers. It seems like a contradiction, but it works anyway. The routers that share AS numbers are not in the same AS, because they do not have iBGP sessions connecting them and they do not have congruent routes as routers in the same AS would have.
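The point above can be illustrated with a small simulation (illustrative only, not from the thread; the topology follows Figure 1 with the full-mesh correction discussed below, and the AS numbers are made up): standard eBGP AS_PATH loop prevention makes a router reject any route that already carries its own AS, and since the routers that share an AS number sit on the same tier, no useful path ever needs to traverse both of them.

```python
from collections import deque

# eBGP sessions in a (fully meshed) 5-stage Clos, node numbers per Figure 1
links = {
    1: [3, 4], 2: [3, 4], 11: [9, 10], 12: [9, 10],        # ToRs
    3: [1, 2, 5, 6, 7, 8], 4: [1, 2, 5, 6, 7, 8],          # Tier-2, left
    9: [11, 12, 5, 6, 7, 8], 10: [11, 12, 5, 6, 7, 8],     # Tier-2, right
    5: [3, 4, 9, 10], 6: [3, 4, 9, 10],
    7: [3, 4, 9, 10], 8: [3, 4, 9, 10],                    # top tier
}

# Shared AS numbers as in the draft text; the values are made up
asn = {1: 65001, 2: 65002, 11: 65011, 12: 65012,
       3: 65034, 4: 65034, 9: 65910, 10: 65910,
       5: 65058, 6: 65058, 7: 65058, 8: 65058}

def propagate(origin):
    """Flood one prefix from `origin`, applying only the standard eBGP
    AS_PATH loop check; return the shortest AS_PATH learned per node."""
    best = {origin: []}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        path = [asn[node]] + best[node]
        for peer in links[node]:
            if asn[peer] in path:          # reject routes carrying own AS
                continue
            if peer not in best or len(path) < len(best[peer]):
                best[peer] = path
                queue.append(peer)
    return best

routes = propagate(11)
print(routes[1])   # [65034, 65058, 65910, 65011]: node 1 reaches node 11
```

Every node learns the prefix despite the shared AS numbers, which is why the scheme works without iBGP sessions between the tier-mates.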
-- Jakob Heitz.

On Nov 16, 2014, at 10:13 AM, Bruno Rijsman <[email protected]> wrote:

See >>> below for some comments on draft-filsfils-spring-segment-routing-msdc-00

-- Bruno

                             Tier-3
                            +-----+
                            |NODE |
                         +->|  5  |--+
                         |  +-----+  |
                 Tier-2  |           |   Tier-2
                +-----+  |  +-----+  |  +-----+
  +------------>|NODE |--+->|NODE |--+--|NODE |-------------+
  |       +-----|  3  |--+  |  6  |  +--|  9  |-----+       |
  |       |     +-----+     +-----+     +-----+     |       |
  |       |                                         |       |
  |       |     +-----+     +-----+     +-----+     |       |
  | +-----+---->|NODE |--+  |NODE |  +--|NODE |-----+-----+ |
  | |     | +---|  4  |--+->|  7  |--+--| 10  |---+ |     | |
  | |     | |   +-----+  |  +-----+  |  +-----+   | |     | |
  | |     | |            |           |            | |     | |
+-----+ +-----+          |  +-----+  |          +-----+ +-----+
|NODE | |NODE |  Tier-1  +->|NODE |--+  Tier-1  |NODE | |NODE |
|  1  | |  2  |             |  8  |             | 11  | | 12  |
+-----+ +-----+             +-----+             +-----+ +-----+
  | |     | |                 | |                 | |     | |
  A O     B O            <- Servers ->            Z O     O O

              Figure 1: 5-stage Clos topology

>>> Comment #1 start.

This figure appears to be a mixture of figure 1 (traditional topology) and figure 2 (3-stage folded Clos topology) in draft-ietf-rtgwg-bgp-routing-large-dc-00. In a proper 5-stage folded Clos topology, there would be a full mesh from each tier N to tier N+1 (e.g. node 3 would not only be connected to nodes 5 and 6 but also to nodes 7 and 8). Note that your numbering of the tiers is the reverse of the tier numbering in draft-ietf-rtgwg-bgp-routing-large-dc-00, which causes some confusion later on in this document.
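The missing full mesh can be made concrete with a few lines (a sketch, using the node numbers from Figure 1; a proper folded Clos requires every tier-2 node to connect to every tier-3 node):

```python
# Tier-2 -> Tier-3 links as actually drawn in Figure 1
links = {3: {5, 6}, 4: {7, 8}, 9: {5, 6}, 10: {7, 8}}
tier3 = [5, 6, 7, 8]

# List the tier-2 -> tier-3 links the figure is missing
missing = {(lo, hi) for lo in links for hi in tier3 if hi not in links[lo]}
print(sorted(missing))   # node 3 lacks 7 and 8, node 4 lacks 5 and 6, ...
```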
Here is a suggestion for a more accurate 5-stage folded Clos figure:

  +-------+  +-------+
  |  SS1  |  |  SS2  |                                 Super spine switches
  +-------+  +-------+

  +------+  +------+  +------+  +------+
  |  S1  |  |  S2  |  |  S3  |  |  S4  |               Spine switches
  +------+  +------+  +------+  +------+

  +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
  | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  Leaf switches (ToRs)
  +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+

   OOO    OOO    OOO    OOO    OOO    OOO    OOO    OOO    Servers

>>> Comment #1 end.

o  Each node is its own AS:

   For simple and efficient route propagation filtering, Nodes 5, 6, 7
   and 8 share the same AS, Nodes 3 and 4 share the same AS, nodes 9
   and 10 share the same AS.

   For efficient usage of the scarce 2-byte private AS pool, different
   tier-1 nodes might share the same AS.

   Without loss of generality, we will simplify these details in this
   document and assume that each node has its own AS.

>>> Comment #2 start.

The above section is somewhat confusing and appears to be self-contradictory. The title says that each node has its own AS. Then the first paragraph contradicts that and says that certain nodes definitely do have the same AS number. Then the next paragraph says that certain nodes might have the same AS number. Then the final paragraph says that each node has its own AS number.
Suggested text:

  In real-life deployments there are various ways in which AS numbers
  can be assigned to switches.

  One option is to use a single AS for all the switches in the entire
  data center. In this case, iBGP sessions are used everywhere, with
  next-hop-self policies to force the traffic along the desired path.

  Another option is to assign one AS per tier. In this case, all
  switches within a given tier are in the same AS, and eBGP sessions
  are used between the tiers.

  A third option is to assign a different AS number to each switch. In
  this case, each switch is a separate AS, and eBGP sessions are used
  everywhere.

  In this document we assume, without loss of generality, the third
  option.

>>> Comment #2 end.

o  The forwarding plane at Tier-2 and Tier-1 is MPLS.

o  The forwarding plane at Tier-3 is either IP2MPLS (if the host
   sends IP traffic) or MPLS2MPLS (if the host sends MPLS-
   encapsulated traffic).

>>> Comment #3 start.

The way you numbered the tiers in your diagram, MPLS is used in tier-2 and tier-3, and tier-1 is either IP2MPLS or MPLS2MPLS. (Note that the numbering of the tiers in your diagram is the reverse of the numbering in draft-ietf-rtgwg-bgp-routing-large-dc-00, which is likely the source of the confusion.)

Suggested text:

o  The forwarding plane between the spine switches and the super
   spine switches is MPLS.

o  The forwarding plane between the leaf switches (= ToR switches)
   and the spine switches is MPLS.

o  The forwarding plane between the servers and the leaf switches
   (= ToR switches) may be MPLS or IP.

>>> Comment #3 end.

In this document, we also refer to the Tier-3, Tier-2 and Tier-1 switches respectively as Spine, Leaf and ToR (top of rack) switches. When a ToR switch acts as a gateway to the "outside world", we call it a border switch.

>>> Comment #4 start.

I believe it is common to consider the ToR switches (i.e. the switches that are connected to the servers) to be the leaf switches.
I would suggest the following terminology:

o  ToR switch = leaf switch
o  Spine switch
o  Super spine switch

I also believe that it is more common to number the tiers starting from the bottom (leaf) with index 0:

o  ToR switch = leaf switch = tier 0
o  Spine switch = tier 1
o  Super spine switch = tier 2

>>> Comment #4 end.

Node 11 sends the following eBGP3107 update to Node 10:

. NLRI: 1.1.1.11/32
. Label: Implicit-Null
. Next-hop: Node11's interface address on the link to Node10
. AS Path: {11}
. BGP-Prefix Attribute: Index 11

Node 10 receives the above update. As it is SR capable, Node10 is able to interpret the BGP-Prefix Attribute and hence allocates the label 16011 to the NLRI (instead of asking for a "random/local" label from its label manager). The implicit-null label in the update signals to Node 10 that it is the penultimate hop and MUST pop the top label on the stack before forwarding traffic for this prefix to Node 11.

Then, Node 10 sends the following eBGP3107 update to Node 7:

. NLRI: 1.1.1.11/32
. Label: 16011
. Next-hop: Node10's interface address on the link to Node7
. AS Path: {10, 11}
. BGP-Prefix Attribute: Index 11

>>> Comment #5 start.

As described here, this proposal requires that node 10 has a-priori knowledge of the globally significant label which must be assigned to prefix 1.1.1.11/32. This is not only true for node 10; it is also true for every other node connected to node 11 which receives prefix 1.1.1.11/32 with an implicit-null label.

From a practical operational view, it might require that every node has a-priori knowledge of the binding of MPLS labels to nodes. What this means, in effect, is that the BGP signaling is only used for reachability detection, and not for the distribution of label-to-node bindings.

Providing each node with a-priori knowledge of the label-to-node binding for every node in the POD may be too much of a configuration burden for most operators.
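For concreteness, the derivation that makes 16011 "globally significant" can be sketched as follows (a sketch, not draft text; the SRGB base of 16000 is an assumption, chosen to be consistent with index 11 mapping to label 16011):

```python
SRGB_BASE = 16000   # assumed common SRGB base on every SR-capable node
IMPLICIT_NULL = 3   # well-known MPLS label value

def on_receive(index, received_label):
    """Receiver behavior as sketched in the walkthrough: derive the
    local label from the BGP-Prefix attribute index, and note PHP
    when the neighbor advertised Implicit-Null."""
    label = SRGB_BASE + index
    php = received_label == IMPLICIT_NULL
    return label, php

# Node 10 receives Node 11's update (Implicit-Null, index 11):
label, php = on_receive(11, IMPLICIT_NULL)
print(label, php)   # 16011 True  -> pop before forwarding to Node 11

# Node 7 receives Node 10's re-advertisement (label 16011, index 11):
label, php = on_receive(11, label)
print(label, php)   # 16011 False -> swap 16011 -> 16011
```

Because every node computes the same `SRGB_BASE + index`, the "a-priori knowledge" reduces to agreeing on the SRGB base and carrying the index in the update, which is exactly what the Prefix-SID mechanism discussed next provides.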
This problem is solved later on in this draft using the I-D.keyupate-idr-bgp-prefix-sid mechanism (although the problem is still there for transition scenarios).

Alternatively, avoiding the need for pre-configured label-to-prefix bindings can also be achieved by making the following BGP implementation changes without any on-the-wire protocol change:

1. Each ultimate-hop node is configured to advertise its global label instead of implicit null.

2. Each node which receives a BGP-LU advertisement is configured to:

   2a) Not allocate a new locally significant label when it does a next-hop-self. Instead, it keeps the received label. This behavior could be restricted to a particular label block.

   2b) Pop the label (i.e. do a PHP) when it is forwarding the packet to the ultimate hop (which can be detected using the AS path), despite the fact that the ultimate hop advertised a real label instead of an implicit-null label.

>>> Comment #5 end.

   -----------------------------------------------
   Incoming label    | outgoing label | Outgoing
   or IP destination |                | Interface
   ------------------+----------------+-----------
   16011             | 16011          | ECMP{7, 8}
   1.1.1.11/32       | 16011          | ECMP{7, 8}
   ------------------+----------------+-----------

          Figure 4: Node-4 Forwarding Table

>>> Comment #6 start.

In the example topology, the spine switches were not fully meshed to the super spine switches. In a real-life topology they would be fully meshed, and the ECMP set would be a 4-way ECMP set ECMP{5, 6, 7, 8}.

>>> Comment #6 end.

   -----------------------------------------------
   Incoming label    | outgoing label | Outgoing
   or IP destination |                | Interface
   ------------------+----------------+-----------
   16011             | 16011          | 10
   1.1.1.11/32       | 16011          | 10
   ------------------+----------------+-----------

          Figure 5: Node-7 Forwarding Table

>>> Comment #6 start.

In the example topology, the spine switches were not fully meshed to the super spine switches.
In a real-life topology they would be fully meshed, and the ECMP set would be a 2-way ECMP set ECMP{9, 10}.

>>> Comment #6 end.

3.3. Network Design Variation

A network design choice could consist of switching all the traffic through Tier-2 and Tier-3 as MPLS traffic. In this case, one could filter away the IP entries at Nodes 4, 7 and 10. This might be beneficial in order to optimize the forwarding table size.

A network design choice could consist of allowing the hosts to send MPLS-encapsulated traffic (based on the EPE use-case, [I-D.filsfils-spring-segment-routing-central-epe]). For example, Node 1 would receive Node11-destined MPLS-encapsulated traffic from its attached host A and would switch this traffic on the basis of the MPLS entry for 16011 (instead of classically receiving IP traffic from A and performing an IP2MPLS switching operation).

>>> Comment #7 start.

It would be good to point out explicitly that the second approach allows the hosts to send a multi-label stack (for example to implement egress peer engineering as described later in this I-D). If IP forwarding were used between the leaf switches, this would require some MPLS-over-IP tunnel approach (e.g. MPLS-over-GRE). Using MPLS as the base tunnel mechanism is more consistent and allows other features to be supported that cannot be implemented with MPLS-over-IP-tunnel based approaches (e.g. the capacity optimization use case described in section 4.4 of the I-D).

>>> Comment #7 end.

From a signaling viewpoint, nothing would change: even if Node6 does not understand the BGP-Prefix Segment attribute, it does propagate it unmodified to its neighbors. From a label allocation viewpoint, the only difference is that Node7 would allocate a dynamic label to the prefix 1.1.1.11/32 (e.g. 12345) and would advertise that label to its neighbor Node4.

>>> Comment #8 start.

There may be one change which is required on the legacy Node7.
We have to make sure that the MPLS label which is allocated by Node7 does not "collide" with the globally significant labels. For example, things would break if the legacy Node7 happens to dynamically allocate a 1600x label. This can be avoided by introducing a configuration knob in the label allocation subsystem of the switch for "off-limits label blocks".

>>> Comment #8 end.

_______________________________________________
spring mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/spring
