Re: TACACS+ server recommendations?
Christopher Morrow writes:
> On Wed, Sep 20, 2023 at 1:22 PM Jim wrote:
>>
>> Router operating systems still typically use only passwords with
>> SSH, then those devices send the passwords over that insecure channel.
>> I have yet to see much in terms of routers capable to Tacacs+ Authorize
>> users based on users' openSSH certificate, Public key id, or
>> ed25519-sk security key id, etc.
> There is active work with vendors (3 or 4 of the folk you may even
> use?) to support ssh with ssh-certificates, I believe this mostly works
> today, though configuring it and distributing your ssh-ca-cert may be
> fun...

Ahem... Cisco supports SSH authentication using *X.509* certificates.
Unfortunately this is not compatible with OpenSSH (the dominant SSH
client implementation we use), which only supports *OpenSSH*
certificates.

Not sure about other vendors, but when we found this out we decided that
this wasn't a workable solution for us.
-- Simon.
Re: BGP and The zero window edge
Job Snijders via NANOG writes:
> *RIGHT NOW* (at the moment of writing), there are a number of zombie
> routes visible in the IPv6 Default-Free Zone:

[Reversing the order of your two examples]

> Another one is
> http://lg.ring.nlnog.net/prefix_detail/lg01/ipv6?q=2a0b:6b86:d24::/48
> 2a0b:6b86:d24::/48 via:
>     BGP.as_path: 201701 9002 6939 42615 212232
>     BGP.as_path: 34927 9002 6939 42615 212232
>     BGP.as_path: 207960 34927 9002 6939 42615 212232
>     BGP.as_path: 44103 50673 9002 6939 42615 212232
>     BGP.as_path: 208627 207910 34927 9002 6939 42615 212232
>     BGP.as_path: 3280 34927 9002 6939 42615 212232
>     BGP.as_path: 206628 34927 9002 6939 42615 212232
>     BGP.as_path: 208627 207910 34927 9002 6939 42615 212232
> (first announced March 24th, last withdrawn March 24th, 2021)

So that one was resolved at AS9002, see Alexandre's followup (thanks!)
AS9002 had also been my guess when I read this, because it's the
leftmost common AS in the paths observed.

> One example is
> http://lg.ring.nlnog.net/prefix_detail/lg01/ipv6?q=2a0b:6b86:d15::/48
> 2a0b:6b86:d15::/48 via:
>     BGP.as_path: 204092 57199 35280 6939 42615 42615 212232
>     BGP.as_path: 208627 207910 57199 35280 6939 42615 42615 212232
>     BGP.as_path: 208627 207910 57199 35280 6939 42615 42615 212232
> (first announced April 15th, last withdrawn April 15th, 2021)

Applying the same logic, I'd suspect that the withdrawal is stuck in
AS57199 in this case. I'll try to contact them.

Here's a (partial) RIPE RIS BGPlay view of the last lifecycle of the
2a0b:6b86:d15::/48 beacon:

https://stat.ripe.net/widget/bgplay#w.resource=2a0b:6b86:d15::/48=true=1618444740=1618542000=0,1,2,4,10,12,20,21=null=bgp

Cheers,
-- Simon.
Re: Netflow collector that can forward flows to another collector based on various metrics.
Speaking as the maintainer of samplicator, I'm not sure it's what Drew
is looking for.

Samplicator just sends copies of entire UDP packets. It doesn't
understand NetFlow/IPFIX or whatever else those packets might contain.
If I understand correctly, Drew wants to forward some of the
NetFlow/IPFIX flows, based on source/destination addresses *within those
flows*. Samplicator cannot do that (by a long shot).

pmacct sounds like a good suggestion.

(I used to have a Lisp program that could also do this, and adding an
API would have been trivial... but the program has been decommissioned
recently after >20 years of service. Also I never got around to cleaning
that up so that I could distribute the source. :-)
-- Simon.
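To illustrate the distinction above, here is a minimal sketch of the samplicator approach: every received datagram is copied verbatim to each configured destination, without ever parsing the NetFlow/IPFIX payload - which is exactly why it cannot filter on addresses inside the flows. The port numbers and destination addresses are illustrative, not samplicator's actual defaults.

```python
import socket

# Illustrative only - not samplicator's real configuration format.
DESTINATIONS = [("192.0.2.10", 2055), ("192.0.2.11", 9996)]

def forward(packet, tx_sock, destinations):
    """Send one datagram, byte-for-byte unmodified, to every destination.
    No NetFlow/IPFIX parsing happens here - the payload is opaque."""
    for dest in destinations:
        tx_sock.sendto(packet, dest)

def replicate(listen=("0.0.0.0", 2055), destinations=DESTINATIONS):
    """Receive loop: replicate each incoming UDP packet to all targets."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(listen)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        packet, _src = rx.recvfrom(65535)
        forward(packet, tx, destinations)  # whole packet, no inspection
```

Filtering on flow contents would require decoding the NetFlow/IPFIX records inside `packet` before deciding where (or whether) to send them - which is the part samplicator deliberately does not do.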
Re: cloud automation BGP
Randy Bush writes:
> have folk looked at https://github.com/nttgin/BGPalerter

We use it, and have it configured to send alerts to the NOC team's chat
tool (Mattermost). Seems pretty nice and stable. Kudos to Massimo and
NTT for making it available and for maintaining it!

The one issue we see is that the server often logs disconnections from
the RIS service (to its logfile, fortunately not generating alerts).
-- Simon.
Re: Bottlenecks and link upgrades
m Taichi writes:
> Just my curiosity. May I ask how we can measure the link capacity
> loading? What does it mean by a 50%, 70%, or 90% capacity loading?
> Load sampled and measured instantaneously, or averaging over a certain
> period of time (granularity)?

Very good question! With tongue in cheek, one could say that measured
instantaneously, the load on a link is always either zero or 100% link
rate...

ISPs typically sample link load in 5-minute intervals and look at graphs
that show load (at this 5-minute sampling resolution) over ~24 hours, or
longer-term graphs where the resolution has been "downsampled", where
downsampling usually smooths out short-term peaks.

From my own experience, upgrade decisions are made by looking at those
graphs and checking whether peak traffic (possibly ignoring "spikes" :-)
crosses the threshold repeatedly. At some places this might be codified
in terms of percentiles, e.g. "the Nth percentile of the M-minute
utilization samples exceeds X% of link capacity over a Y-day period". I
doubt that anyone uses such rules to automatically issue upgrade orders,
but maybe to generate alerts like "please check this link, we might want
to upgrade it".

I'd be curious whether other operators have such alert rules, and what
N/M/X/Y they use - might well be different values for different kinds of
links.
-- Simon.

PS. We use the "stare at graphs" method, but if we had automatic alerts,
I guess it would be something like "the 95th percentile of 5-minute
samples exceeds 50% over 30 days".

PPS. My colleagues remind me that we do alert on output queue drops.

> These are questions that have bothered me for long. Don't know if I can
> ask about these by the way. I take care of the radio access network
> performance at work. Found many things unknown in transport network.
> Thanks and best regards, > Taichi > On Wed, Aug 12, 2020 at 3:54 PM Mark Tinka wrote: > On 12/Aug/20 09:31, Hank Nussbacher wrote: > At what point do commercial ISPs upgrade links in their backbone as well as > peering and transit links that are congested? At 80% > capacity? 90%? 95%? > We start the process at 50% utilization, and work toward completing the > upgrade by 70% utilization. > The period between 50% - 70% is just internal paperwork. > Mark.
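A hedged sketch of the percentile-based alert rule floated above: flag a link when the Nth percentile of its M-minute utilization samples exceeds X% of capacity over a Y-day window. The values used here (95th percentile, 5-minute samples, 50%, 30 days) are just the example from the PS, not anyone's confirmed production policy.

```python
import math

# 5-minute samples in a 30-day window (the "M" and "Y" of the rule).
SAMPLES_PER_30_DAYS = 30 * 24 * 60 // 5  # = 8640

def percentile(samples, pct):
    """Nearest-rank percentile of a list of utilization fractions."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def needs_upgrade(samples, pct=95, threshold=0.50):
    """samples: per-interval link utilization as a fraction of capacity.
    True if the pct-th percentile exceeds the threshold fraction."""
    return percentile(samples, pct) > threshold
```

Using the nearest-rank percentile means a few isolated spikes (less than 5% of samples) won't trip the alert, which matches the "possibly ignoring spikes" practice described in the post.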
BGP unnumbered examples from data center network using RFC 5549 et al. [was: Re: RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists?]
Mark Tinka writes:
> On 29/Jul/20 15:51, Simon Leinen wrote:
>>
>> Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ  Up/Down State/PfxRcd
>> sw-o(swp16)     4 65108  953559  938348      0   0    0 03w5d00h          688
>> sw-m(swp18)     4 65108  885442  938348      0   0    0 03w5d00h          688
>> s0001(swp1s0.3) 4 65300  748971  748977      0   0    0 03w5d00h            1
>> [...]
>>
>> Note the host names/interface names - this is how you generally refer to
>> neighbors, rather than using literal (IPv6) addresses.

> Are the names based on DNS look-ups, or is there some kind of protocol
> association between the device underlay and its hostname, as it pertains
> to neighbors?

As Nick mentions, the hostnames are from the BGP hostname extension. I
should have noticed that, but we use "BGP unnumbered"[1][2], which uses
RAs to discover the peer's IPv6 link-local address, and then builds an
IPv6 BGP session (that uses RFC 5549 to transfer IPv4 NLRIs as well).

Here are some excerpts of the configuration on such a leaf router.

General BGP boilerplate:
--
router bgp 65111
 bgp router-id 10.1.1.46
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 !
 address-family ipv4 unicast
  network 10.1.1.46/32
  redistribute connected
  redistribute static
 exit-address-family
 !
 address-family ipv6 unicast
  network 2001:db8:1234:101::46/128
  redistribute connected
  redistribute static
 exit-address-family
--

Leaf switch <-> server connection:

(We use a 802.1q tagged subinterface for the BGP peering and L3 server
traffic; the untagged interface is used only for netbooting the servers
when (re)installing the OS. Here, servers just get IPv4+IPv6 default
routes, and each server will only announce a single IPv4+IPv6 (loopback)
address, i.e. the leaf/server links are also "unnumbered". Very simple
redundant setup without any LACP/MLAG protocols... it's all just
BGP+IPv6 ND. You can basically connect any server to any switch port and
things will "just work" without special inter-switch links etc.)
--
interface swp1s0
 description s0001.s1.scloud.switch.ch p8p1
!
interface swp1s0.3
 description s0001.s1.scloud.switch.ch p8p1
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
[...]
router bgp 65111
 neighbor servers peer-group
 neighbor servers remote-as external
 neighbor servers capability extended-nexthop
 neighbor swp1s0.3 interface peer-group servers
 !
 address-family ipv4 unicast
  neighbor servers default-originate
  neighbor servers soft-reconfiguration inbound
  neighbor servers prefix-list DEFAULTV4-PERMIT out
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor servers activate
  neighbor servers default-originate
  neighbor servers soft-reconfiguration inbound
  neighbor servers prefix-list DEFAULTV6-PERMIT out
 exit-address-family
!
ip prefix-list DEFAULT-PERMIT permit 0.0.0.0/0
!
ipv6 prefix-list DEFAULTV6-PERMIT permit ::/0
--

Leaf <-> spine:

--
interface swp16
 description sw-o port 22
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
[...]
router bgp 65111
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric capability extended-nexthop
 neighbor swp16 interface peer-group fabric
 !
 address-family ipv4 unicast
  neighbor fabric soft-reconfiguration inbound
 !
 address-family ipv6 unicast
  neighbor fabric activate
  neighbor fabric soft-reconfiguration inbound
--

Note the "remote-as external" - this will accept any AS other than the
router's own AS.

AS numbering in this DC setup is a bit weird if you're used to BGP...
each leaf switch has its own AS, all spine switches should have the same
AS number (for reasons...), and all servers have the same AS because who
cares. (We are talking about three disjoint sets of AS numbers for
leaves/spines/servers though.)
-- Simon.

[1] https://cumulusnetworks.com/blog/bgp-unnumbered-overview/
[2] https://support.cumulusnetworks.com/hc/en-us/articles/212561648-Configuring-BGP-Unnumbered-with-Cisco-IOS
Re: RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists?
Douglas Fischer writes:
> And today, I reached on https://tools.ietf.org/html/rfc5549
[...]
> But the questions are:
> There is any network that really implements RFC5549?

We've been using it for more than two years in our data center networks.
We use the Cumulus/FRR implementation on switches and FRR on Ubuntu on
servers.

> Can anyone share some information about it?

Sure. We found the FRR/Cumulus implementation very easy to set up. We
have leaf/spine networks interconnecting hundreds of servers (IPv4+IPv6)
with very minimalistic configuration. In particular, you generally don't
have to configure neighbor addresses or AS numbers, because those are
autodiscovered.

I think we're basically following the recommendations in the "BGP in the
Data Center" book including the "BGP on the Host" part (though our
installation predates the book, so there might be some differences).

The network has been working very reliably for us, so we never really
had anything to debug.

If you're coming from a world where you used separate BGP sessions to
exchange IPv4 and IPv6 reachability information, then the operational
commands take a little getting used to, but in the end I find it very
intuitive. For example, here's one of the "show bgp ... summary"
commands on a leaf switch:

leinen@sw-f:mgmt-vrf:~$ net show bgp ipv6 uni sum

BGP router identifier 10.1.1.46, local AS number 65111 vrf-id 0
BGP table version 96883
RIB entries 1528, using 227 KiB of memory
Peers 54, using 1041 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor        V    AS MsgRcvd MsgSent TblVer InQ OutQ  Up/Down State/PfxRcd
sw-o(swp16)     4 65108  953559  938348      0   0    0 03w5d00h          688
sw-m(swp18)     4 65108  885442  938348      0   0    0 03w5d00h          688
s0001(swp1s0.3) 4 65300  748971  748977      0   0    0 03w5d00h            1
s0002(swp1s1.3) 4 65300  661787  661794      0   0    0 03w1d23h            1
s0003(swp1s2.3) 4 65300  748970  748977      0   0    0 03w5d00h            1
s0004(swp1s3.3) 4 65300  661868  661875      0   0    0 03w1d23h            1
s0005(swp2s0.3) 4 65300  748970  748976      0   0    0 03w5d00h            1
[...]
Note the host names/interface names - this is how you generally refer to neighbors, rather than using literal (IPv6) addresses. Otherwise it should look very familiar if you have used vendor C's "industry-standard CLI" before. (In case you're wondering, the first two neighbors in the output are spine switches, the others are servers.) Cheers, -- Simon.
Re: Hi-Rise Building Fiber Suggestions
Randy Bush writes:
> since we're at this layer, should i worry about going 3m with dacs at
> low speed, i.e. 10g? may need to do runs to neighbor rack.

No, 3m is totally fine for passive DAC, never had any issues with those.
(5m should also be fine, we just have less experience with that because
we use DAC mostly for server/ToR cabling, usually using QSFP(28) to
SFP+/SFP28 break-out cables.)
-- Simon.
Re: akamai yesterday - what in the world was that
Paul Nash writes:
> A bit of perspective on bandwidth and feeling old. The first
> non-academic connection from Africa (Usenet and Email, pre-Internet)
> ran at about 9600 bps over a Telebit Trailblazer in my living room.

For your amusement, this latest e-bloodbath, erm -sports update, at 48GB
("PC" version), would take about 463 days (~15 months) to complete at
9600 bps (not counting overhead like packet headers etc.)

At 64kbps (ISDN/Antarctica) you could do it in 69 days, maybe even
finishing before the next - undoubtedly bigger - release comes out.
-- Simon.

[I conservatively used decimal Gigabytes, not "Gibibytes" - at 48GiB the
numbers would be 497 or 74.5 days respectively.]
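For the curious, the back-of-envelope arithmetic above, as a sketch: an idealized transfer time that ignores protocol overhead, retransmissions, and so on.

```python
def transfer_days(size_bytes, bits_per_second):
    """Idealized transfer time in days: bytes -> bits -> seconds -> days."""
    return size_bytes * 8 / bits_per_second / 86400

GB = 10**9   # decimal gigabytes, as used in the post
GiB = 2**30  # binary gibibytes, for the footnote's variant
```

So `transfer_days(48 * GB, 9600)` comes out to about 463 days, and at 64 kbps about 69 days - matching the figures in the post.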
Re: RIPE our of IPv4
Matthew Kaufman writes:
> This is a great example (but just one of many) of how server software
> development works:

Small addition/correction to this example (which I find interesting and
also sad):

> Kubernetes initial release June 2014. Developed by Google engineers.
[...]
> Full support including CoreDNS support in 1.13, December 2018.

Support for dual-stack pods[1]: alpha in 1.16, October 2019.
-- Simon.

[1] https://kubernetes.io/docs/concepts/services-networking/dual-stack/
Re: Fwd: wither cyclops?
> Did this tool die on the vine?
> https://cyclops.cs.ucla.edu/

Not sure I would express it that way:

https://www.cs.ucla.edu/thousandeyes-a-look-inside-two-ucla-alumnis-273-million-startup/
-- Simon.
Re: CVV
Todd Underwood writes:
> [interesting and plausible reasoning about why no chip in US]
> anyway, let's talk about networks, no?

This topic is obviously "a little" off-topic, but I find some
contributions (like yours) relevant for understanding adoption dynamics
(or not) of proposed security mechanisms on the Internet (RPKI, route
filtering in general, DNSSEC etc.).

In general the regulatory environment in the Internet is quite different
from that of the financial sector. But I guess credit-card security
trade-offs are still made mostly by private actors. (Maybe they
sometimes discuss BGP security on their mailing lists :-)
-- Simon.
Re: Proving Gig Speed
> For a horrifying moment, I misread this as Google surfacing
> performance stats via a BGP stream by encoding stat_name:value as
> community:value
> /me goes searching for mass quantities of caffeine

Because you'll be spending the night writing up that Internet-Draft? :-)
-- Simon.
Talk extract: Submarine cable systems 101 for AWS partners
Amazon held their "re:Invent" event two weeks ago. Wasn't there, but I'm
a James Hamilton fan so I started watching the recordings of his talks.

In one, he talks about fiber optic cables under the oceans. Here's the
start of that section:

https://youtu.be/AyOAjFNPAbA?t=672

Even though this is presented at a suitable level for a large event
(32'000 attendees total, holy cow) of mostly non-network specialists, I
learned a few interesting things, e.g. about dealing with shunt faults.

If you rewind to a few minutes before that section, he also talks about
Amazon's private inter-DC network and how it is all (N*) 100G now.
-- Simon.
Re: [TECH] Pica8 & Cumulus Networks
Yoann THOMAS writes:
> Under a Cloud project I ask myself to use equipment based on the Pica8
> or Cumulus Networks.

Ah, quite different beasts.

Cumulus Networks tries to really make the switch look like a Linux
system with hardware-accelerated forwarding, so you can use stock
programs that manipulate routing, e.g. Quagga, and all forwarding
between the high-speed ports is done "in hardware".

Most other systems including Pica8 treat the high-speed interfaces as
different; you need special software to manipulate the configuration of
the forwarding ASIC. I think in the case of Pica8 it's OpenFlow/Open
vSwitch, for other systems it will be some sort of an ASIC-specific SDK.

A colleague has built a proof-of-concept L3 leaf/spine network (using
OSPFv2/OSPFv3 according to local tradition) with six 32x40GE Quanta
switches running Cumulus Linux. So far it has been quite pleasant. There
have been a few glitches, but those usually get fixed pretty quickly. We
configure the switches very much like GNU/Linux servers, in our case
using Puppet (Ansible or Chef would work just as well).

> All in order to mount a Spine & Leaf architecture
> - Spine 40Gbps
> - Leaf in 10Gbps

One interesting option is to get (e.g. 1RU 32x) 40G switches for both
spine and leaf, and connect the servers using 1:4 break-out cables.
Fewer SKUs, better port density at the cost of funny cabling. Also gives
you a bit more flexibility with respect to uplinks (can have more than
6*40GE per leaf if needed) and downlinks (easy to connect some servers
at 40GE).

The new 32*100GE switches also look interesting, but they might still be
prohibitively expensive (although you can save on spine count and
cabling) unless you NEED the bandwidth or want to build something
future-proof. They are even more flexible in that you can drive the
ports as 4*10GE, 4*25GE (could be an attractive high-speed option once
25GE server adapters become common), 40GE, 2*50GE, 100GE.
We have looked at Edge-Core and Quanta and they both look pretty solid.
I think they are also both used by some of the Web "hypergiants". Others
may be just as good - basically it's always the same Broadcom switching
silicon (Trident II/II+ in the 40GE, Tomahawk in the 100GE switches)
with a bit of glue; there may be subtle differences between vendors in
quality, box design, airflow etc.

It's a bit unhealthy that Broadcom is so dominant in this market - but
probably not undeserved. There are a few alternative switching chipsets,
e.g. Mellanox, Cavium XPliant, that look competitive (at least on paper)
and that may be more "open" than Broadcom's. I think both the software
vendors (e.g. Cumulus Networks) and the ODMs (Edge-Core, Quanta etc.)
are interested in these.
-- Simon.
Re: Recommended L2 switches for a new IXP
Manuel Marín writes:
> Dear Nanog community
> [...]
> There are so many options that I don't know if it makes sense to start
> with a modular switch (usually expensive because the backplane, dual
> dc, dual CPU, etc) or start with a 1RU high density switch that support
> new protocols like Trill and that supposedly allow you to create
> Ethernet Fabric/Clusters. The requirements are simple, 1G/10G ports for
> exchange participants, 40G/100G for uplinks between switches and flow
> support for statistics and traffic analysis.

Stupid thought from someone who has never built an IXP, but has been
looking at recent trends in data center networks:

There are these white-box switches mostly designed for top-of-rack or
spine (as in leaf-spine/fat-tree datacenter networks) applications. They
have all the necessary port speeds - well, 100G seems to be a few months
off. I'm thinking of brands such as Edge-Core, Quanta etc. You can get
them as bare-metal versions with no switch OS on them, just a bootloader
according to the ONIE standard. Equipment cost seems to be on the order
of $100 per SFP+ port w/o optics for a second-to-last generation
(Trident-based) 48*10GE+4*40GE ToR switch.

Now, for the limited and somewhat special L2 needs of an IXP, couldn't
someone hack together a suitable switch OS based on Open Network Linux
(ONL) or something like that? You wouldn't even need MAC address
learning or most types of flooding, because at an IXP this often hurts
rather than helps. For building larger fabrics you might be using
something other (waves hands) than TRILL; maybe you could get away
without slightly complex multi-chassis multi-channel mechanisms, and so
on.

Flow support sounds somewhat tough, but full netflow support that would
get Roland Dobbins' "usable telemetry" seal of approval is probably out
of reach anyway - it's a high-end feature with classical gear.
With white-box switches, you could try to use the given 5-tuple flow
hardware capabilities - which might not scale that well -, or use packet
sampling, or try to use the built-in flow and counter mechanisms in an
application-specific way. (Except *that's* a lot of work on the software
side, and a usably efficient implementation requires slightly
sophisticated hardware/software interfaces.)

Instead of a Linux-based switch OS, one could also build an IXP
application using OpenFlow and some kind of central controller. (Not to
be confused with SDX: Software Defined Internet Exchange.)

Has anybody looked into the feasibility of this? The software could be
done as an open-source community project to make setting up regional
IXPs easier/cheaper. Large IXPs could sponsor this so they get better
scalability - although I'm not sure how well something like the
leaf-spine/fat-tree design maps to these IXPs, which are typically
distributed over several locations. Maybe they could use something like
Facebook's new design[1], treating each IXP location as a pod.
-- Simon.

[1] https://code.facebook.com/posts/360346274145943
Low-numbered ASes being hijacked? [Re: BGP Update Report]
cidr-report writes:
> BGP Update Report
> Interval: 20-Nov-14 -to- 27-Nov-14 (7 days)
> Observation Point: BGP Peering with AS131072
>
> TOP 20 Unstable Origin AS
> Rank  ASN   Upds    %    Upds/Pfx  AS-Name
> [...]
> 11 -  AS5   38861   0.6%    7.0 -- SYMBOLICS - Symbolics, Inc.,US

Disappointing to see Symbolics (AS5) on this list. I would expect these
Lisp Machines to have very stable BGP implementations, especially given
the leisurely release rhythm for Genera for the past few decades. Has
the size of the IPv4 unicast table started triggering global GCs?

Seriously, all these low-numbered ASes in the report look fishy. I would
have liked this to be an artifact of the reporting software (maybe an
issue with 4-byte ASes?), but I do see some strange paths in the BGP
table that make it look like (accidental or malicious) hijacking of
these low-numbered ASes.

Now the fact that these AS numbers are low makes me curious. If I wanted
to hijack other folks' ASes deliberately, I would probably avoid such
numbers because they stand out. Maybe these are just non-standard
private-use ASes that are leaked?

Some suspicious paths I'm seeing right now:

  133439 5
  197945 4

Hm, maybe 32-bit ASes do have something to do with this... Any ideas?
-- Simon. (Just curious)

> [...]
> 17 -  AS3   30043   0.4% 3185.0 -- MIT-GATEWAYS - Massachusetts Institute of Technology,US
> [...]
> TOP 20 Unstable Origin AS (Updates per announced prefix)
> Rank  ASN   Upds    %    Upds/Pfx  AS-Name
> [...]
> 13 -  AS5   38861   0.6%    7.0 -- SYMBOLICS - Symbolics, Inc.,US
> [...]
> 15 -  AS4   21237   0.3%  871.0 -- ISI-AS - University of Southern California,US
> [...]
> 19 -  AS4    5345   0.1% 1437.0 -- ISI-AS - University of Southern California,US
> 20 -  AS4    8784   0.1% 2303.0 -- ISI-AS - University of Southern California,US
Re: iOS 7 update traffic
Glen Kent writes:
> One of the earlier posts seems to suggest that if iOS updates were
> cached on the ISPs CDN server then the traffic would have been
> manageable since everybody would only contact the local server to get
> the image. Is this assumption correct?

Not necessarily. I think most of the iOS 7 update traffic WAS in fact
delivered from CDN servers (in particular Akamai). And many/most large
service providers already have Akamai servers in their networks. But
they may not have enough spare capacity for such a sudden demand -
either in terms of CDN (Akamai) servers or in terms of capacity between
their CDN servers and their customers.

> Do most big service providers maintain their own content servers? Is
> this what we're heading to these days?

Depends on what you mean by "their own". As I said, these days Akamai
has servers in many of the big networks. Google and possibly others
(Limelight, ...?) might have that as well. But I wouldn't call them
"their [the SPs'] own". Some SPs have also built their own CDNs (Level
3) or are talking about it. But that model seems to be less popular with
the content owners and the other SPs.
-- Simon.
Re: Real world sflow vs netflow?
James Braunegg writes:
> In the end I did real life testing comparing each platform

Great, thanks for sharing your results! (It would be nice if you could
tell us a little bit about the configuration, i.e. what kind of sampling
you used.)

[...]
> That being said both netflow and sflow both under-read by about 3% when
> compared to snmp port counters, which we put to the conclusion was
> broadcast traffic etc which the routers didn't see / flow.

That's one reason, but another reason would be that at least in Netflow
(but sFlow may be similar depending on how you use it), the reported
byte counts only include the sizes of the L3 packets, i.e. starting at
the IP header, while the SNMP interface counters (ifInOctets etc.)
include L2 overhead such as Ethernet frame headers and such.
-- Simon.
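The size of that gap can be sketched from the framing overhead alone. The 18 bytes per frame (14-byte Ethernet header + 4-byte FCS) is an assumption for plain untagged Ethernet; preamble and inter-frame gap are not counted by ifInOctets either way.

```python
# NetFlow byte counts start at the IP header; SNMP ifInOctets also
# counts L2 framing. 18 bytes/frame is assumed (untagged Ethernet).
L2_OVERHEAD = 14 + 4  # Ethernet header + FCS, in bytes

def flow_vs_snmp_gap(avg_l3_packet_bytes):
    """Fraction of SNMP-counted bytes that flow export never sees,
    for a given average L3 (IP) packet size on the link."""
    return L2_OVERHEAD / (avg_l3_packet_bytes + L2_OVERHEAD)
```

At an average IP packet size of around 580 bytes - plausible for a mixed traffic profile - the gap works out to about 3%, in the same ballpark as the discrepancy reported above.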
Re: Network Storage
Andrew Thrift writes:
> If you want something from a Tier1 the new Dell R720XD's will take 24x
> 900GB SAS disks
> or 12x 2TB 3.5" cheap slow SATA disks
> or 12x 3TB 3.5" more expensive slightly faster SAS disks

- if you take the (cheaper) 3.5"-disk variant of the R720xd chassis.

or 12x 3TB 3.5" cheap slow SATA disks if you buy them directly rather
than from Dell. (Presumably you'd have to buy Dell hot-swap trays)
-- Simon.

> and have 16 cores. If you order it with a SAS6-HBA you can add up to 8
> trays of 24 x 900GB SAS disks to provide 194TB of raw space at quite a
> reasonable cost.
Re: Apple updates - Effect on network
Matt Taylor writes:
> Would love to see some bandwidth graphs. :)

Here's one from another network.

[attachment: akamai-week.png]

Guess it was a good idea to upgrade that Akamai cluster's uplink to
10GE, even though 2*GE (or was it 4*GE) looked sufficient at the time.

Remember folks, overprovisioning is a misnomer, it should be called
provisioning for robustness and growth.
-- Simon.
Re: [routing-wg] The Cidr Report
Geoff Huston writes:
> Does anyone give a s**t about this any more?

I do; I check the weekly increase every week, and check who the top
offenders are. If someone from my vicinity/circles is on the list
(doesn't happen frequently; more often for the BGP updates report than
for CIDR), I may send them a note and ask what happened.

> From what I learned at the latest NANOG it's very clear that nobody
> reads this any more.

"Reads" may be an exaggeration, but I'm sure some look at it.

> Is there any good reason to persist in spamming the nanog list with
> this report?

I think it still provides an incentive for people not to mess things up
too badly; and a chance of some mishaps to be noticed quicker, with a
little help from your friends.
-- Simon.
Re: facebook spying on us?
> Data Center Knowledge posted about 20 minutes of very poorly shot
> video of Prineville. They're Open Compute servers in 'triplet' racks.
> [...]
> Their power supply (also open) runs across 2 legs of a 277/480 3-phase
> feed, which is usually what the substation supplies to your PDUs,
> which step it down further to 120/208. It also takes -48, and each
> pair of triplets has a 48V float string that will run the 180 servers
> for about 45 seconds.
> It's a nice setup. I plan to steal it. :-)

That's what they want you to do - check out the specs on
http://opencompute.org/
-- Simon.
Re: Cisco 7600 PFC3B(XL) and IPv6 packets with fragmentation header
> which traceroute? icmp? udp? tcp? Traceroute is not a single protocol.

Router processing is only dependent on noticing that TTL is expiring,
and being able to return an ICMP message (including a quote of part of
the original packet) to the sender.

> what is that limit? from a single port? from a single linecard? from a
> chassis? how about we remove complexity here and just deal with this
> in the fastpath?

> on a pfc3, the mls rate limiters deal with handling all punts from the
> chassis to the RP. It's difficult to handle this in any other way.

If the rate limit is done in hardware (which one should hope), then it
would be more natural to do it on a per-PFC/DFC basis. So on a box with
DFCs on all linecards, it would be per linecard, not per chassis. Maybe
someone who knows for sure can decide.

> My point in calling this all 'stupid' is that by now we all have been
> burned by this sort of behavior, vendors have heard from all of us
> that 'this is really not a good answer', enough is enough please stop
> doing this.

> This is a Hard Problem. There is a balance to be drawn between
> hardware complexity, cost and lifecycle. In the case of the PFC3,
> we're talking about hardware which was released in 2000 - 11 years
> ago.

Um, no, in 2000 there was no PFC3. That came out (on the Supervisor 720)
in March 2003.

> The ipv6 fragment punting problem was fixed in the pfc3c, which was
> released in 2003.

The PFC 3C was announced (with the RSP720) in December 2006.

> I'm aware that cisco is still selling the pfc3b, but they really only
> push the rsp720 for internet stuff (if they're pushing the 6500/7600
> line at all).

See Janos' reply, the Catalyst 6500 seems alive and kicking with the
Supervisor 2T. The 7600 is a somewhat different story. As far as I see,
all development is going into feature-rich ES+ cards and a few
relatively narrow applications such as mobile backhaul and FTTH
aggregation(?).

We have been using the 7600 as a cheap fast IPv4/IPv6 (and later also
MPLS) backbone router. According to Cisco we should probably move up to
the ASR9000 or CRS-3, but I'm tempted to downgrade to Catalyst 6500 with
Sup-2T (until we need 100G :-).
-- Simon.
Re: Network Equipment Discussion (HP and L2/10G)
Deepak Jain writes:
> The wrinkle here is that I can't use a normal enterprise 10G switch
> because of the need for DWDM optics (ideally 80km style).

80km DWDM optics in SFP+ format should be available now or RSN. Search
engines turn up a few purported vendors. The ones I found conform to the
100GHz grid, but 50GHz ones should be coming too. Haven't tried any of
those myself though.
-- Simon.
Re: Top webhosters offering v6 too?
Tim Chown writes:
> Which of the big boys are doing it?

Google - although they don't call themselves a web hoster, they can be
used for hosting web sites using services such as Sites or App Engine.
Both support IPv6, either using the opt-in mechanism or by using an
alternate CNAME (ghs46 instead of ghs.google.com). That's what I use.

None of the other large cloud providers seems to support IPv6 for their
users yet. In particular, neither Amazon's AWS nor Microsoft Azure have
much visible activity in this direction. Rackspace have announced IPv6
support for the first half of 2011.

Concerning the more traditional webhosting offerings, I have no idea
about the big boys. Here in Switzerland, a few smaller hosters support
IPv6. And I saw IPv6 mentioned in ads for some German server hosting
offering. Germany is interesting because it has a well-developed hosting
ecosystem with some really big players.
-- Simon.
Re: arin and ops fora
Randy Bush writes:
> one difference in north america from the other 'regions' is that there
> is a strong and very separate operator community and forum. this does
> not really exist in the other regions. ripe ate the eof years ago.
> apops is dormant aside from [...]

Right.

> observe that the main north american irr, radb, is not run by the rir,
> unlike in other regions. and i like that there are a number of diverse
> rir services in the region. it's healthy.
  ^^^
you mean rr, I think.

> so i would be perfectly happy if arin discussed operational matters
> here on nanog with the rest of us ops. i would not be pleased to see
> ops start to be subsumed by the rir here.

I'm sympathetic with that, but, like David said, the separation
(NANOG/ARIN) you have in North America does lead to issues such as not
being able to trust what's in the RR(s).

So I'm quite happy with the situation here in Europe, where RIPE
(deliberately ignoring the difference between RIPE NCC and the RIPE
community for a second) takes care of both running the address registry,
and running a routing registry that can leverage the same
authentication/authorization substrate. This makes the RR much more
trustworthy, and should really make the introduction of something like
RPKI much easier (albeit with the temptation to set it up in a more
centralized way than we might like).

Randy, what is the model you have in mind for running a routing registry
infrastructure that is sustainable and trustworthy enough for uses such
as RPKI, i.e. who could/should be running it?

I guess I'm arguing that from my non-North-American perspective, an ARIN
with a carefully extended mandate could be of much help here. So even if
you're unhappy with the current ARIN governance, maybe it would still be
worthwhile for the community to fix that issue - unless there are
credible alternatives.
-- Simon.
Re: Over a decade of DDOS--any progress yet?
Greg Whynott writes:
> i found it funny how M$ started giving away virus/security software
> for its OS. it can't fix the leaky roof, so it includes a roof patch
> kit. (and puts about 10 companies out of business at the same time)

I actually like the new arrangement better, where Microsoft provides
the security software to its OS customers for free. The previous setup
had third parties (anti-virus vendors) profiting from the weaknesses
in Microsoft's software. The new arrangement provides better
incentives for fixing the security weaknesses at the source, at least
as far as Microsoft is concerned. Even for third-party providers of
buggy software, Microsoft probably has better leverage over them than
the numerous anti-virus vendors do.

But then maybe my armchair economics are totally wrong.
--
Simon.
ICMPv6 rate limits breaking PMTUD (and traceroute) [Re: Comcast enables 6to4 relays]
Jack Bates writes:
> 1) Your originating host may be breaking PMTU (so the packet you send
> is too large and doesn't make it, you never resend a smaller packet,
> but it works when tracerouting from the other side due to PMTU
> working in that direction and you are responding with the same size
> packet).

Your mentioning PMTU discovery issues in connection with 6to4 prompts
me to confess how our open 6to4 relay has probably contributed to the
perception of brokenness of 6to4 for quite a while *blush*.

The relay runs on a Cisco 7600 with PFC3 - btw. this is an excellent
platform to run a 6to4 relay on, because it can do the encap/decap in
hardware if configured correctly. At some point, as the relay became
popular (load currently fluctuates between 80 Mb/s and 200 Mb/s), I
noticed that our router very often failed to send ICMPv6 messages such
as "packet too big". First I suspected our control-plane rate-limit
(CoPP) configuration, but couldn't find anything there. Finally I
found that I had to configure a generous "ipv6 icmp error-interval"[1],
because the (invisible) default configuration will only permit one
such ICMPv6 message to be generated every 100 milliseconds, and that's
WAY insufficient for a popular router. We currently use

  ipv6 icmp error-interval 2 100

(max. steady-state rate of 500 ICMPv6 messages/second - one every 2
milliseconds - with bursts of up to 100) with no ill effects. Note
that the same rate limit will also cause stars in IPv6 traceroutes
through popular routers if the default setting is used.

The issue is probably not restricted to Cisco, as the ICMPv6 standard
(RFC 4443) mandates that ICMPv6 error messages be rate limited. It
even has good (if hand-wavy) guidance on how to arrive at defaults -
the values used on our Cisco 7600 (and possibly all other IOS
devices?) correspond to the RFC's suggestion for a small/mid-size
device *hrmpf* (yes Randy, I know I should get real routers :-). Does
anybody know which defaults are used by other devices/vendors?
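For readers who haven't played with this knob: my reading of the
"error-interval 2 100" semantics is a simple token bucket - one token
refills every 2 ms, up to a burst of 100, and an ICMPv6 error is only
generated when a token is available. Here is a rough sketch in Python
of that model (my interpretation, not vendor code):

```python
class TokenBucket:
    """Token-bucket model of "ipv6 icmp error-interval <ms> <bucket>".

    One token refills every interval_ms milliseconds, up to
    bucket_size tokens; each generated ICMPv6 error consumes one.
    """

    def __init__(self, interval_ms, bucket_size):
        self.seconds_per_token = interval_ms / 1000.0
        self.capacity = bucket_size
        self.tokens = float(bucket_size)  # start with a full bucket
        self.last = 0.0                   # time of last refill

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed / self.seconds_per_token)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # generate the ICMPv6 error message
        return False      # rate-limited: error silently suppressed

# Our configuration: one token per 2 ms, burst of 100.
limiter = TokenBucket(2, 100)
```

With the (apparent) IOS default of one token per 100 ms and a small
bucket, it's easy to see how a busy relay suppresses almost every
"packet too big" it should be sending.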
In general, rate limits are very useful for protecting routers'
notoriously underpowered control planes, but (1) it's hard to come up
with reasonable defaults, and (2) I suspect that most people don't
monitor them (because that's often hard), and thus won't notice when
normal traffic levels trip these limits.
--
Simon.

[1] See
http://www.cisco.com/en/US/docs/ios/ipv6/command/reference/ipv6_06.html#wp2135326
Re: Restrictions on Ethernet L2 circuits?
Interesting questions. Here are a few thoughts from the perspective of
an education/research backbone operator that used to be IP-only but
has also been offering L2 point-to-point circuits for a few years.

> Should business customers expect to be able to connect several LANs
> through an Ethernet L2 circuit and build a layer 2 network spanning
> several locations?

At least for our customers, that is indeed important. The most popular
application here is for a customer to connect a remote location to
their campus network, and they want to (at least be able to) use any
of their existing VLANs at the remote site.

> Or should the service provider implement port security and limit the
> number of MAC addresses on the access ports, forcing the customer to
> connect a router at both ends and segment their network?

That would make the service less attractive, and also more complex to
set up and maintain. For point-to-point service, there is really no
reason for the network to care about customers' MAC addresses, VLAN
tags and such. As you said, EoMPLS doesn't care. (Ethernet over
L2TPv3 shouldn't care either. If I had cost-effective edge routers
that did L2TPv3 encapsulation/decapsulation at line rate, I'd switch
off MPLS in our core tomorrow.) Couldn't PBB or even Q-in-Q provide
that isolation as well, at least for point-to-point services? I must
say that I don't personally have much experience with those, because
we tend to connect our customers to EoMPLS-capable routers directly.

> Also, do you see a demand for multi-point layer 2 networks
> (requiring VPLS), or are point-to-point layer 2 circuits sufficient
> to meet market demand?

That's a big question for us right now... we're not sure yet. I'd
like to hear others' opinions on this.

> The most important argument for customers that choose Ethernet L2
> over MPLS IP-VPN is that they want full control over their routing;
> they don't want the involvement from the service provider.
> Some customers also argue that a flat layer 2 network spanning
> several locations is a simpler and better design for them, and they
> don't want the hassle with routers and network segmentation.

I have a good deal of sympathy for customers who think this way. Also
from the service provider point of view, I like the simplicity of the
offering - basically we're providing an emulation of a very long piece
of Ethernet cable. (My worry with multipoint L2 VPNs is that they
can't have such a simple service model.)

> But IMO the customer (and the service provider) is far better off by
> segmenting their network in the vast majority of cases. What do you
> think?

Maybe they already have a segmented network, but don't want to segment
it based on geography/topology.

As far as I'm concerned, enterprises should just connect their various
sites to the Internet independently, and use VPN techniques if and
where necessary to provide the illusion of a unified network. In
practice, this illusion of a single large LAN (or rather, multiple
organization-wide LANs) is very important to the typical enterprise,
because so much security policy is enforced based on IP addresses.
And the typical enterprise wants a central chokepoint that all traffic
must go through, for reasons that might have to do with security, or
support costs, or with (illusions of) control. The bridging function
required to maintain the illusion of a unified network is something
that most enterprises prefer to outsource.

I'd hope that at some point, better security mechanisms and/or better
VPN technologies will make these kinds of VPN services less relevant.
Until that happens, there's going to be demand for them. Of course
the telcos have known that for eons and have provided many generations
of expensive and hard-to-use services to address this. Point-to-point
Ethernet services are interesting because they are relatively easy to
provide for folks like us who only really know IP (and maybe some
MPLS).
And the more transparent they are, the easier it is for customers to
use them.
--
Simon.
Re: Layer 2 vs. Layer 3 to TOR
Tore Anderson writes:
> * Jonathan Lassoff
> > Are there any applications that absolutely *have* to sit on the
> > same LAN/broadcast domain and can't be configured to use unicast
> > or multicast IP?
>
> FCoE comes to mind.

Doesn't FCoE need even more than that, i.e. lossless Ethernet with
end-to-end flow control, such as IEEE DCB? As far as I understand,
traditional switched Ethernets don't fit the bill anyway. On the other
hand, iSCSI should be fine with routed IP paths; though Malte's mail
suggests that there are (broken?) implementations that aren't.
--
Simon.
Re: MRLG
> Thanks guys I got it...

Congratulations. But how/where?
--
Simon.
Re: SNMP and syslog forwarders
Sam Stickland writes:
> It's looking like running all of our traps and syslog through a
> couple of relay devices (and then onwards to the various NMS's) would
> be quite a win for us.

You can try the UDP samplicator:

  http://www.switch.ch/network/downloads/tf-tant/samplicator/

(The name indicates that it can also sample packets, but that is just
an option that can be ignored for your application.)

> These relay devices just need to be dumb forwarders (we don't
> require any filtering or storing, just reflection), but we need an
> HA pair (across two sites) without creating duplicates.

There is one complication with SNMP traps and also with typical Syslog
packets: the IP source address carries important information that is
not carried in the payload. So it's not sufficient for the relay to
simply re-send the UDP datagrams without loss of information.
Samplicator handles this with an option to spoof the IP source address
when it resends the packets. (With this option, it must run as root,
and you will have to drill holes in the ingress filters that you
hopefully have even for your own servers. :-)

> I have the coding skills to make this myself, but as coding skills
> come and go in our network team, we are looking for a commercial
> product so it will continue to work after I get: hit by a bus /
> amnesia / visions of grandeur.

Not commercial, sorry. Maybe someone can sell you support for it (or
life insurance). I should probably put it up on a code hosting service
so that the community can maintain it.

> Any recommendations / experience? This needs to scale to ~1,500
> devices.

Shouldn't be a problem. The main trick is to ensure that the
forwarder's UDP receive buffers are large enough to handle bursts that
might arrive while the forwarder/server is catching its breath.
Samplicator lets you tune this socket buffer size.
--
Simon.
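For illustration, the core of such a dumb relay fits in a few lines of
Python. This is only a sketch of the idea, not samplicator itself:
addresses and ports are made-up examples, and note that re-sending
from the relay's own address loses the original source IP - that's
exactly the problem samplicator's (root-requiring) spoofing option
solves:

```python
import socket

def make_socket(addr, rcvbuf=8 * 1024 * 1024):
    """Bind a UDP socket with a large receive buffer, so bursts of
    traps/syslog survive while the relay catches its breath. (The OS
    may silently clamp the buffer to its configured maximum.)"""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    s.bind(addr)
    return s

def forward_one(sock, targets):
    """Receive one datagram and re-send a copy to each target.
    The copies carry the relay's source address, not the original
    sender's - information that SNMP trap and syslog collectors
    usually need."""
    data, src = sock.recvfrom(65535)
    for target in targets:
        sock.sendto(data, target)
    return data, src

# A real relay would loop forever, e.g.:
#   relay = make_socket(("0.0.0.0", 5514))
#   while True:
#       forward_one(relay, [("192.0.2.10", 514), ("192.0.2.11", 514)])
```

The large SO_RCVBUF is the "main trick" mentioned above; everything
else is trivial.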
Re: DNS problems to RoadRunner - tcp vs udp
Jon Kibler writes:
> Also, other than "That's what the RFCs call for", why use TCP for
> data exchange instead of larger UDP packets?

TCP is more robust for large (greater than Path MTU) data transfers,
and less prone to spoofing.

A few months ago I sent a message to SwiNOG (like NANOG, only less
North American and more Swiss) about this topic, trying to explain
some of the tradeoffs:

  http://www.mail-archive.com/[EMAIL PROTECTED]/msg02612.html

Mostly I think that people who approach this only from a security
perspective often forget that by fencing in the(ir idea of the)
current status quo, they often prevent beneficial evolution of
protocols as well, contributing to the Internet's ossification.
--
Simon.
Re: [NANOG] Questions about NETCONF
Randy Bush writes:
[in response to John Payne [EMAIL PROTECTED]:]
> > I've personally been waiting for the data modeling to be
> > standardized. Yes, it's great and wonderful to have a consistent
> > method of talking to network devices, but I also want a standard
> > data model along with it.
>
> does this not imply that all devices would need to be semantically
> congruent? if so, is this realistic?

Personally, I don't think it is. The way that configuration is
structured is something that at least some vendors use to
differentiate themselves from each other. (Though other vendors make
a point of being compatible with some industry-standard CLI.) So if
you think that configurations in NETCONF should be similar to the
native configuration language, that doesn't bode well for
industry-wide standardization of a NETCONF configuration data model.

It might still be possible to have a common NETCONF data model, but
then that would probably be quite different from the (all) native
configuration languages; much in the same way as SNMP MIBs are
(structurally) different from how information is presented at the CLI.
Personally, I'm not sure that this would be a very useful outcome,
because there would necessarily be a large lag between when features
are implemented (with a native CLI to configure them, of course) and
when they can be configured through NETCONF.

Maybe the best we can shoot for is this:

* A common language to describe parts of NETCONF configuration. The
  newly chartered IETF NETMOD working group[1] is working on this.
  Vendors can then describe their specific NETCONF data models using
  this language, and tool writers can use these descriptions to
  generate code for applications that want to manipulate device
  configurations.

* Common data models for certain well-understood parts of NETCONF
  configuration. This could include simple atomic things such as how
  to write an IP address or a prefix in (NETCONF) XML, or
  configuration of standardized protocols such as OSPF, IPFIX etc.
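To make the distinction concrete, here is a hedged illustration of
what a NETCONF <edit-config> carrying such a data model might look
like. Only the NETCONF framing (RFC 4741) is standard; everything
under <config> - the namespace, element names, and values - is made up
for this example:

```xml
<rpc message-id="101"
     xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <edit-config>
    <target><candidate/></target>
    <config>
      <!-- Hypothetical vendor data model. A standardized model
           would fix the namespace, the element names, and the
           textual representation of things like prefixes. -->
      <interfaces xmlns="http://example.com/ns/interfaces">
        <interface>
          <name>ge-0/0/0</name>
          <ipv6-prefix>2001:db8::1/64</ipv6-prefix>
        </interface>
      </interfaces>
    </config>
  </edit-config>
</rpc>
```

The framing is the same for every vendor; it's the payload under
<config> where the (non-)standardization question lives.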
The problem is how well this will support migration from
vendor-specific configuration to standardized configuration - which,
as I said, is always bound to lag far behind. And even if/when an
aspect of a configuration model (let's say for OSPF) is standardized,
vendors are bound to extend that model to support not-yet-standardized
extensions (e.g. sub-second timers, BFD). This will be another
challenge to support. (But there are smart people working on this :-)
--
Simon.

[1] http://www.ietf.org/html.charters/netmod-charter.html

_______________________________________________
NANOG mailing list
NANOG@nanog.org
http://mailman.nanog.org/mailman/listinfo/nanog