Hi Tony,

> if you're willing to provision things correctly yourself via S-PGP.

I am not willing to do that. I would like routing to do it for me
auto-magically. But yes, you got the question right. I was asking for
shortcuts between the last levels of the fabric, not so much between
Node 111 & Node 112 - but you said it is optional, so ok.

Side note: the PGP abbreviation means a completely different thing to the
vast majority of people than what you defined it to mean locally in your
draft. I highly recommend you rename it in the -05 version to PGD (policy
guided destination(s)) or PGR (policy guided reachability/routing).

Now, the requirement to switch off miscabling detection is not acceptable.
You are stating that only RIFT knows how I should cable my fabric? And
that if I cable it some other way it will be detected as miscabling? I
think in most correct-cabling checks you actually define the intent and
then the network detects whether that intent is met.

> Node112 can actually even go haywire

That may not be the best property of a routing protocol :)

Best,
R.

On Sat, Jan 13, 2018 at 7:16 PM, Tony Przygienda <tonysi...@gmail.com> wrote:
> So I thought over your horizontal link case to shortcut the spine levels
> for some kind of traffic again and I think the current draft actually
> covers that
>
> if you're willing to provision things correctly yourself via S-PGP.
> Let me see whether we agree on this picture first:
>
> . +--------+ +--------+
> . | | | | ^ N
> . |Spine 21| |Spine 22| |
> .Level 2 ++-+--+-++ ++-+--+-++ <-*-> E/W
> . | | | | | | | | |
> . P111/2| |P121 | | | | S v
> . ^ ^ ^ ^ | | | |
> . | | | | | | | |
> . +--------------+ | +-----------+ | | | +---------------+
> . | | | | | | | |
> . South +-----------------------------+ | | ^
> . | | | | | | | All TIEs
> . 0/0 0/0 0/0 +-----------------------------+ |
> . v v v | | | | |
> . | | +-+ +<-0/0----------+ | |
> . | | | | | | | |
> .+-+----++ optional +-+----++ ++----+-+ ++-----++
> .| | E/W link | +=====+ | | |
> .|Node111+----------+Node112| |Node121| |Node122|
> .+-+---+-+ ++----+-+ +-+---+-+ ++---+--+
> . | | | South | | | |
> . | +---0/0--->-----+ 0/0 | +----------------+ |
> . 0/0 | | | | | | |
> . | +---<-0/0-----+ | v | +--------------+ | |
> . v | | | | | | |
> .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
> .| | (L2L) | | | | Level 0 | |
> .|Leaf111~~~~~~~~~~~~Leaf112| |Leaf121| |Leaf122|
> .+-+-----+ +-+---+-+ +--+--+-+ +-+-----+
> . + + \ / + +
> . Prefix111 Prefix112 \ / Prefix121 Prefix122
> . multi-homed
> . Prefix
> .+---------- Pod 1 ---------+ +---------- Pod 2 ---------+
>
> I assume here that what you ask for is the following scenario:
> a) POD1 being a compute generating very heavy load towards storage in
> Prefix121.
> b) traffic from POD1 NOT being balanced through the spines but taking
> a horizontal link Node112 to Node121 to reach your storage in Prefix121
> to save bandwidth? or delay?
> c) The key to riches is -04 section 4.2.5.1
> <https://tools.ietf.org/html/draft-przygienda-rift-04#section-4.2.5.1>.
> Northbound SPF
>
> in the paragraph "Other south prefixes found when crossing E-W link MAY
> be used IIF". Now, we could make it a MUST (but it's really an
> implementation knob IMO) and what it says is that if you are willing to
> inject an "S-PGP" @ Node121 for Prefix121 it will get flooded to Node112
> and Node112 will have a more specific match than the default in N-SPF.
> From 121 normal RIFT takes over since the normal N-Prefix for Leaf121
> kicks in on Node121. I assume Node112 policy on the ingress is to not
> propagate the S-PGP south but use it for N-SPF only.
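
[Illustrative aside, not from the draft or the original mails: to make the
longest-prefix-match point in the quoted paragraph concrete, here is a
minimal Python sketch. The 10.1.21.0/24 value standing in for Prefix121 and
the next-hop names are invented; the point is only that the S-PGP-learned
specific on Node112 beats the N-SPF 0/0 default that points at the spines.]

import ipaddress

# Hypothetical RIB on Node112 (addresses and next-hop names invented for
# illustration; none of these values are prescribed by the draft).
node112_rib = {
    ipaddress.ip_network("0.0.0.0/0"):    ["Spine21", "Spine22"],  # N-SPF default, points north
    ipaddress.ip_network("10.1.21.0/24"): ["Node121"],             # Prefix121 via S-PGP over the E-W link
}

def lookup(rib, destination):
    """Longest-prefix match: the most specific covering prefix wins."""
    dst = ipaddress.ip_address(destination)
    candidates = [net for net in rib if dst in net]
    best = max(candidates, key=lambda net: net.prefixlen)
    return best, rib[best]

print(lookup(node112_rib, "10.1.21.7"))   # -> 10.1.21.0/24 via Node121 (E-W shortcut)
print(lookup(node112_rib, "10.2.50.9"))   # -> 0.0.0.0/0 via the spines

[This is also why point c) below matters: if Node112 were to balance some
Prefix121 flows over the default anyway, it would be ignoring the more
specific entry, i.e. the LPM violation Tony mentions.]
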
>
> Observe that
>
> a) you have to switch off miscabling detection for PoD# on those nodes
> since you are "crossing PoDs illegally"
>
> b) if you want whole Pod#1 to do that you either cable Node111 to Node121
> as well (which will load balance whole Pod#1 towards storage without using
> Spine) OR you propagate S-PGPs south towards leafs (which will cost you
> leaf FIB of course but make sure ALL traffic to storage goes over Node112
> only).
>
> c) Your forwarding on Node112 can actually even go haywire & load-balance
> some of the traffic to Prefix121 using this S-PGP over Node121 and some
> still using the default route towards spine (which here is a blatant
> violation of LPM of course) and RIFT will work just fine (unless you loop
> yourself to death with PGPs you install) but that of course is a deep
> rathole in itself. I just mention it to show why the "non-looping" design
> is so important and makes for the shortcomings for the SPF on a fabric,
> predicted by the non-directional mesh property that Dijkstra solved in his
> time.
>
> so?
>
> --- tony
>
> On Thu, Jan 11, 2018 at 10:54 AM, Robert Raszuk <rob...@raszuk.net> wrote:
>
>> Hi Tony,
>>
>> Thx for elaborating ...
>>
>> Two small comments:
>>
>> A) SID/SR use case in underlay could be as simple as gracefully taking a
>> fabric node out of service. Not much OPEX needed if your NMS is decent.
>> Otherwise in normal link state I can do overload bit, in BGP number of
>> solutions from shutdown to MED to LP ... depending what is your BGP design.
>> In RIFT how do you do that ? Note that overlay (if such exist) does not
>> help here.
>>
>> B) For horizontal links imagine you have servers with 40 GB ports to TOR.
>> Then you have Nx100 GB from TOR up. You are going to oversubscribe on TOR
>> (servers to fabric) most likely 2:1 .. 3:1 etc. So if I want to
>> interconnect TORs because I do know that servers behind those TORs need to
>> talk to each other in a non blocking fashion _and_ I have spare 100 GB
>> ports on TORs having routing protocol which does not allow me to do that
>> seems pretty limited - wouldn't you agree ?
>>
>> Thx,
>> R.
>>
>> On Thu, Jan 11, 2018 at 6:40 PM, Tony Przygienda <tonysi...@gmail.com>
>> wrote:
>>
>>> Robert, productive points, thanks for raising them ... I go a bit in
>>> depth
>>>
>>> 1. I saw no _real_ use-cases for SID in DC so far to be frank (once you
>>> run RIFT). The only one that comes up regularly is egress engineering and
>>> that IMO is equivalent to SID=leaf address (which could be a HV address of
>>> course once you have RIFT all way down to server) so really, what's the
>>> point to have a SID? It's probably much smarter to use IBGP & so on overlay
>>> to do this kind of synchronization if needed since labels/SIDs become very
>>> useful in overlay to distinguish lots stuff there like VPNs/services which
>>> you'd carry e.g. in MPLSoUDP. In underlay just use the destination v4/v6
>>> address. Having said that, discussion always to be had if you pay me dinner
>>> ;--) and I know _how_ we can do SIDs in RIFT since I thought it through but
>>> again, no _real_ use case so far.
>>> And if your only concern is to "shape towards a prefix" we have PGP in
>>> the draft which doesn't need new silicon ;-P And then ultimately, yes,
>>> if you really, really want a SID per prefix everywhere then you'll carry
>>> SIDs to everywhere since unicast SIDs are really just a glorified way to
>>> say "I have this non-aggregatable 20 bit IP host address" which
>>> architecturally is a very interesting proposition in terms of scaling
>>> (but then again, no account for taste and RFC1925 clause 3 applies) ...
>>> Your LSDB will be still much smaller, your SPF will be still simple on
>>> leaf in RIFT but your FIB will blow up and anything changing on a leaf
>>> shakes all other leafs (unless you start to run policies to control
>>> distribution @ which point in time you start to baby-sit your fabric @
>>> high OPEX). One of the reasons to do per-prefix SID would be non-ECMP
>>> anycast (where SIDs _are_ in fact useful) but if you read RIFT draft
>>> carefully you will observe that RIFT can do anycast without need for
>>> ECMP, i.e. true anycast in a sense and with that having anycast SID
>>> serves no real purpose in RIFT and is actually generally much harder to
>>> do since you need globally unique label blocks and so on ...
>>>
>>> 2. Horizontal links on CLOSes are not used that way normally all I saw
>>> since your blocking goes to hell unless you provision some kind of
>>> really massive parallel links between ToRs _and_ understand your load.
>>> We _could_ build RIFT that way but you give up balancing through the
>>> fabric and loop-free property in a sense (that's a longish discussion
>>> and scaling since now you have prefixes showing up all kind of crazy
>>> places instead of default). I see enough demand, we get there ...
>>> Otherwise RFC1925 clause 10 and 5.
>>>
>>> 3. PS1: Yes, lots of things "could" be done and then we "could" build a
>>> protocol to do that and RFC1925 clause 7 and 8 applies. Such horizontal
>>> links, unless provisioned correctly will pretty much just ruin your
>>> blocking/loss on the fabric is the experience (which the math supports).
>>> In a sense if you know your big flows you can build a specialized
>>> topology to do the optimal distribution (MPLS tunnels anyone ;-) but the
>>> point of fabric is that it's a fabric (i.e. load agnostic, cheap, no
>>> OPEX and easily scalable). Otherwise a good analogy would be that you
>>> like to build special RAM chips for the type of data structures you are
>>> storing and we know how well that scales over time. We know now that
>>> within 3-4 years characteristics of DC flows flip upside down without a
>>> sweat when people go from server/client to microservices, from servers
>>> to containers and so on and so on. So if you can't predict your load all
>>> the time you need a _regular_ topology where _regular_ is more of a
>>> mathematical than a protocol discussion. Fabric analogy of "buy more RAM
>>> chips in Fry's and just stick them in" applies here. So RIFT is done
>>> largely to serve a well-known structure called a "lattice" (with some
>>> restrictions) since we need an "up" and "down".
>>> Things like hypercubes, toroidal meshes and so on and so on exist but
>>> CLOS won for a very good reason in history for that kind of problems
>>> (once you move to NUMA other things win ;-) And if you know your loads
>>> and you can heft the OPEX and you like to play with protocols generally
>>> and if you can support the scale in terms of leaf FIB sizes, flooding,
>>> slower convergence & so on & so on and you run flat IGP on some kind of
>>> stuff that you build that doesn't even have to be regular in any sense.
>>> We spent many years solving THAT problem obviously and doing something
>>> like RIFT to replace normal IGP is of limited interest IMO (albeit
>>> certain aspects having to do with modern implementation techniques may
>>> get us there one day but it's much less of pressing problem than solving
>>> specialized DC routing well IMO again).
>>>
>>> 3. PS2: RIFT cannot build an "unsupported topology" no matter how you
>>> cable (that's the point of it) or rather we have miscabling detection
>>> and do not form adjacencies when you read the draft carefully. That's
>>> your "flash red light" and it comes included for free with my
>>> compliments ;-) ... Otherwise RFC1925 clause 10.
>>>
>>> Otherwise, if you have concrete charter points you'd like to add, be
>>> more specific in your asks and we see what the list thinks after ...
>>>
>>> thanks
>>>
>>> --- tony
>>>
>>> On Thu, Jan 11, 2018 at 1:30 AM, Robert Raszuk <rob...@raszuk.net>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have one little question/doubt on scalability point of RIFT ...
>>>>
>>>> Assume that someone would like to signal IPv6 prefix SID for Segment
>>>> Routing in the underlay within RIFT.
>>>>
>>>> Wouldn't it result in amount of protocol state in full analogy to
>>>> massive deaggregation - which as of today is designed to be very careful
>>>> and limited operation only at moments of failure(s) ?
>>>>
>>>> I sort of find it a bit surprising that RIFT draft does not provide
>>>> encoding for SID distribution when it is positioned as an alternative to
>>>> other protocols (IGPs or BGP) which already provide ability to carry all
>>>> types of SIDs.
>>>>
>>>> Cheers,
>>>> Robert.
>>>>
>>>> PS1: Horizontal links which were discussed could be installed to
>>>> offload from fabric transit massive amount of data (ex: storage mirroring)
>>>> directly between leafs or L3 TORs and not to be treated as "backup".
>>>>
>>>> PS2: Restricting any protocol to specific topologies seems like pretty
>>>> slippery slope to me. In any case if protocol does that it should also
>>>> contain self detection mechanism of "unsupported topology" and flash red
>>>> light in any NOC.
>>>>
>>>>
>>>> _______________________________________________
>>>> Dcrouting mailing list
>>>> dcrout...@ietf.org
>>>> https://www.ietf.org/mailman/listinfo/dcrouting
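
[Illustrative aside on Tony's PS2 point about miscabling detection, not
from the draft: at its core the check boils down to comparing the levels
advertised on each link before forming an adjacency. The sketch below is my
simplification in Python - the function name and parameters are invented,
and the draft's actual adjacency rules also factor in PoD membership, E-W
link policy, ZTP-derived levels and more.]

# Rough sketch of a level-based adjacency/miscabling check (simplified;
# not the RIFT draft's actual FSM).
def adjacency_allowed(my_level: int, neighbor_level: int,
                      allow_east_west: bool = False) -> bool:
    if abs(my_level - neighbor_level) == 1:           # normal north/south link
        return True
    if allow_east_west and my_level == neighbor_level:
        return True                                    # optional E-W link at the same level
    return False                                       # anything else: flag as miscabling

# A leaf (level 0) accidentally cabled to a spine (level 2) is rejected:
assert not adjacency_allowed(0, 2)
# Node112 <-> Node121 (both level 1) only comes up if E-W links are allowed:
assert adjacency_allowed(1, 1, allow_east_west=True)
assert not adjacency_allowed(1, 1)
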
_______________________________________________
spring mailing list
spring@ietf.org
https://www.ietf.org/mailman/listinfo/spring