Re: TACACS+ server recommendations?

2023-09-21 Thread Simon Leinen
Christopher Morrow writes:
> On Wed, Sep 20, 2023 at 1:22 PM Jim  wrote:
>> 
>> Router operating systems still typically use only passwords with
>> SSH, then those devices send the passwords over that insecure channel.  I 
>> have yet to
>> see much in terms of routers capable to Tacacs+ Authorize  users based on  
>> users'
>> openSSH certificate, Public key id,  or  ed25519-sk security key id, etc.

> There is active work with vendors (3 or 4 of the folk you may even
> use?) to support
> ssh with ssh-certificates, I believe this mostly works today, though
> configuring it and
> distributing your ssh-ca-cert may be fun...

Ahem... Cisco supports SSH authentication using *X.509* certificates.
Unfortunately this is not compatible with OpenSSH (the dominant SSH
client implementation we use), which only supports *OpenSSH*
certificates.
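
For anyone unfamiliar with the OpenSSH flavour: provisioning looks
roughly like this on the CA side (a sketch only, with made-up file
names and principals).  The missing piece is a router-side SSH server
that will trust certificates signed by such a CA.

  # create an SSH CA key, then sign an operator's public key with it
  ssh-keygen -t ed25519 -f router_ca -C "router-access CA"
  ssh-keygen -s router_ca -I alice@noc -n alice -V +52w alice_key.pub
  # -> alice_key-cert.pub, an *OpenSSH* certificate, not an X.509 one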

Not sure about other vendors, but when we found this out we decided that
this wasn't a workable solution for us.
-- 
Simon.


Re: BGP and The zero window edge

2021-04-24 Thread Simon Leinen
Job Snijders via NANOG writes:
> *RIGHT NOW* (at the moment of writing), there are a number of zombie
> route visible in the IPv6 Default-Free Zone:

[Reversing the order of your two examples]

> Another one is 
> http://lg.ring.nlnog.net/prefix_detail/lg01/ipv6?q=2a0b:6b86:d24::/48

> 2a0b:6b86:d24::/48 via:
> BGP.as_path: 201701 9002 6939 42615 212232
> BGP.as_path: 34927 9002 6939 42615 212232
> BGP.as_path: 207960 34927 9002 6939 42615 212232
> BGP.as_path: 44103 50673 9002 6939 42615 212232
> BGP.as_path: 208627 207910 34927 9002 6939 42615 212232
> BGP.as_path: 3280 34927 9002 6939 42615 212232
> BGP.as_path: 206628 34927 9002 6939 42615 212232
> BGP.as_path: 208627 207910 34927 9002 6939 42615 212232
> (first announced March 24th, last withdrawn March 24th, 2021)

So that one was resolved at AS9002, see Alexandre's followup (thanks!)

AS9002 had also been my guess when I read this, because it's the
leftmost common AS in the paths observed.
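
For what it's worth, the "leftmost common AS" heuristic is easy to
mechanize.  A quick Python sketch with the paths above hard-coded -
meant to illustrate the heuristic, not as a production tool:

  paths = [p.split() for p in [
      "201701 9002 6939 42615 212232",
      "34927 9002 6939 42615 212232",
      "207960 34927 9002 6939 42615 212232",
      "44103 50673 9002 6939 42615 212232",
      "208627 207910 34927 9002 6939 42615 212232",
      "3280 34927 9002 6939 42615 212232",
      "206628 34927 9002 6939 42615 212232",
  ]]
  # ASes present in every observed path
  common = set(paths[0]).intersection(*paths[1:])
  # among those, the one appearing furthest to the left in any path
  print(min(common, key=lambda a: min(p.index(a) for p in paths)))  # 9002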

> One example is 
> http://lg.ring.nlnog.net/prefix_detail/lg01/ipv6?q=2a0b:6b86:d15::/48

> 2a0b:6b86:d15::/48 via:
> BGP.as_path: 204092 57199 35280 6939 42615 42615 212232
> BGP.as_path: 208627 207910 57199 35280 6939 42615 42615 212232
> BGP.as_path: 208627 207910 57199 35280 6939 42615 42615 212232
> (first announced April 15th, last withdrawn April 15th, 2021)

Applying the same logic, I'd suspect that the withdrawal is stuck in
AS57199 in this case.  I'll try to contact them.

Here's a (partial) RIPE RIS BGPlay view of the last lifecycle of the
2a0b:6b86:d15::/48 beacon:

https://stat.ripe.net/widget/bgplay#w.resource=2a0b:6b86:d15::/48=true=1618444740=1618542000=0,1,2,4,10,12,20,21=null=bgp

Cheers,
-- 
Simon.


Re: Netflow collector that can forward flows to another collector based on various metrics.

2021-01-21 Thread Simon Leinen
Speaking as the maintainer of samplicator, I'm not sure it's what Drew
is looking for.

Samplicator just sends copies of entire UDP packets.  It doesn't
understand NetFlow/IPFIX or whatever else those packets might contain.

If I understand correctly, Drew wants to forward some of the
NetFlow/IPFIX flows, based on source/destination addresses *within those
flows*.  Samplicator cannot do that (by a long shot).
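
To illustrate the difference: the essence of what samplicator does is
the payload-blind loop below (a rough Python sketch, not the actual
code; addresses are made up).  Flow-based forwarding would instead have
to decode the NetFlow/IPFIX records inside each datagram before
deciding where to resend them.

  import socket

  DESTINATIONS = [("192.0.2.10", 2000), ("192.0.2.11", 2000)]  # made-up collectors
  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind(("0.0.0.0", 2000))
  while True:
      data, src = sock.recvfrom(65535)
      for dst in DESTINATIONS:
          sock.sendto(data, dst)   # copy the whole packet, never look inside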

pmacct sounds like a good suggestion.

(I used to have a Lisp program that could also do this, and adding an
API would have been trivial... but the program has been decommissioned
recently after >20 years of service.  Also I never got around to
cleaning that up so that I could distribute the source. :-)
-- 
Simon.


Re: cloud automation BGP

2020-09-29 Thread Simon Leinen
Randy Bush writes:
> have folk looked at https://github.com/nttgin/BGPalerter

We use it, and have it configured to send alerts to the NOC team's chat
tool (Mattermost).  Seems pretty nice and stable.  Kudos to Massimo and
NTT for making it available and for maintaining it!

The one issue we see is that the server often logs disconnections from
the RIS service (to its logfile, fortunately not generating alerts).
-- 
Simon.


Re: Bottlenecks and link upgrades

2020-08-13 Thread Simon Leinen
m Taichi writes:
> Just my curiosity. May I ask how we can measure the link capacity
> loading? What does it mean by a 50%, 70%, or 90% capacity loading?
> Load sampled and measured instantaneously, or averaging over a certain
> period of time (granularity)?

Very good question!

With tongue in cheek, one could say that measured instantaneously, the
load on a link is always either zero or 100% link rate...

ISPs typically sample link load in 5-minute intervals and look at graphs
that show load (at this 5-minute sampling resolution) over ~24 hours, or
longer-term graphs where the resolution has been "downsampled", which
usually smooths out short-term peaks.

From my own experience, upgrade decisions are made by looking at those
graphs and checking whether peak traffic (possibly ignoring "spikes" :-)
crosses the threshold repeatedly.

At some places this might be codified in terms of percentiles, e.g. "the
Nth percentile of the M-minute utilization samples exceeds X% of link
capacity over a Y-day period".  I doubt that anyone uses such rules to
automatically issue upgrade orders, but maybe to generate alerts like
"please check this link, we might want to upgrade it".

I'd be curious whether other operators have such alert rules, and what
N/M/X/Y they use - might well be different values for different kinds of
links.
-- 
Simon.
PS. We use the "stare at graphs" method, but if we had automatic alerts,
I guess it would be something like "the 95th percentile of 5-minute
samples exceeds 50% over 30 days".
PPS. My colleagues remind me that we do alert on output queue drops.

> These are questions have bothered me for long. Don't know if I can ask
> about these by the way. I take care of the radio access network
> performance at work. Found many things unknown in transport network.

> Thanks and best regards,
> Taichi

> On Wed, Aug 12, 2020 at 3:54 PM Mark Tinka  wrote:

>  On 12/Aug/20 09:31, Hank Nussbacher wrote:

>  At what point do commercial ISPs upgrade links in their backbone as well as
>  peering and transit links that are congested?  At 80% capacity?  90%?  95%?

>  We start the process at 50% utilization, and work toward completing the 
> upgrade by 70% utilization.

>  The period between 50% - 70% is just internal paperwork.

>  Mark.



BGP unnumbered examples from data center network using RFC 5549 et al. [was: Re: RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists?]

2020-07-30 Thread Simon Leinen
Mark Tinka writes:
> On 29/Jul/20 15:51, Simon Leinen wrote:

>> 
>> Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
>> sw-o(swp16)     4      65108  953559  938348        0    0    0 03w5d00h          688
>> sw-m(swp18)     4      65108  885442  938348        0    0    0 03w5d00h          688
>> s0001(swp1s0.3) 4      65300  748971  748977        0    0    0 03w5d00h            1
>> [...]
>> 
>> Note the host names/interface names - this is how you generally refer to
>> neighbors, rather than using literal (IPv6) addresses.

> Are the names based on DNS look-ups, or is there some kind of protocol
> association between the device underlay and its hostname, as it pertains
> to neighbors?

As Nick mentions, the hostnames are from the BGP hostname extension.

I should have noticed that, but we use "BGP unnumbered"[1][2], which
uses RAs to discover the peer's IPv6 link-local address, and then builds
an IPv6 BGP session (that uses RFC 5549 to transfer IPv4 NLRIs as well).

Here are some excerpts of the configuration on such a leaf router.

General BGP boilerplate:

--
router bgp 65111
 bgp router-id 10.1.1.46
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
!
 address-family ipv4 unicast
  network 10.1.1.46/32
  redistribute connected
  redistribute static
 exit-address-family
 !
 address-family ipv6 unicast
  network 2001:db8:1234:101::46/128
  redistribute connected
  redistribute static
 exit-address-family
--

Leaf switch <-> server connection: (we use an 802.1q tagged subinterface
for the BGP peering and L3 server traffic; the untagged interface is
used only for netbooting the servers when (re)installing the OS.  Here,
servers just get IPv4+IPv6 default routes, and each server will only
announce a single IPv4+IPv6 (loopback) address, i.e. the leaf/server
links are also "unnumbered".  Very simple redundant setup without any
LACP/MLAG protocols... it's all just BGP+IPv6 ND.  You can basically
connect any server to any switch port and things will "just work"
without special inter-switch links etc.)

--
interface swp1s0
 description s0001.s1.scloud.switch.ch p8p1
!
interface swp1s0.3
 description s0001.s1.scloud.switch.ch p8p1
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
[...]
router bgp 65111
 neighbor servers peer-group
 neighbor servers remote-as external
 neighbor servers capability extended-nexthop
 neighbor swp1s0.3 interface peer-group servers
 !
 address-family ipv4 unicast
  neighbor servers default-originate
  neighbor servers soft-reconfiguration inbound
  neighbor servers prefix-list DEFAULTV4-PERMIT out
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor servers activate
  neighbor servers default-originate
  neighbor servers soft-reconfiguration inbound
  neighbor servers prefix-list DEFAULTV6-PERMIT out
 exit-address-family
!
ip prefix-list DEFAULTV4-PERMIT permit 0.0.0.0/0
!
ipv6 prefix-list DEFAULTV6-PERMIT permit ::/0
--

Leaf <-> spine:

--
interface swp16
 description sw-o port 22
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
[...]
router bgp 65111
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric capability extended-nexthop
 neighbor swp16 interface peer-group fabric
 !
 address-family ipv4 unicast
  neighbor fabric soft-reconfiguration inbound
 !
 address-family ipv6 unicast
  neighbor fabric activate
  neighbor fabric soft-reconfiguration inbound
--

Note the "remote-as external" - this will accept any AS other than the
router's own AS.  AS numbering in this DC setup is a bit weird if you're
used to BGP... each leaf switch has its own AS, all spine switches
should have the same AS number (for reasons...), and all servers have
the same AS because who cares.  (We are talking about three disjoint
sets of AS numbers for leaves/spines/servers though.)
-- 
Simon.

[1] https://cumulusnetworks.com/blog/bgp-unnumbered-overview/
[2] 
https://support.cumulusnetworks.com/hc/en-us/articles/212561648-Configuring-BGP-Unnumbered-with-Cisco-IOS


Re: RFC 5549 - IPv4 Routes with IPv6 next-hop - Does it really exists?

2020-07-29 Thread Simon Leinen
Douglas Fischer writes:
> And today, I reached on https://tools.ietf.org/html/rfc5549
[...]
> But the questions are:
> There is any network that really implements RFC5549?

We've been using it for more than two years in our data center networks.
We use the Cumulus/FRR implementation on switches and FRR on Ubuntu on
servers.

> Can anyone share some information about it?

Sure.  We found the FRR/Cumulus implementation very easy to set up.  We
have leaf/spine networks interconnecting hundreds of servers (IPv4+IPv6)
with very minimalistic configuration.  In particular, you generally
don't have to configure neighbor addresses or AS numbers, because those
are autodiscovered.  I think we're basically following the
recommendations in the "BGP in the Data Center" book including the "BGP
on the Host" part (though our installation predates the book, so there
might be some differences).
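
For the curious, the host side is tiny.  Here's a hedged sketch of what
the frr.conf on one of the servers might look like - the AS number,
addresses and interface name below are made up for illustration, not
copied from our actual configuration:

--
interface enp8s0.3
 ipv6 nd ra-interval 3
 no ipv6 nd suppress-ra
!
router bgp 65300
 bgp router-id 10.1.2.1
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric capability extended-nexthop
 neighbor enp8s0.3 interface peer-group fabric
 !
 address-family ipv4 unicast
  network 10.1.2.1/32
 exit-address-family
 !
 address-family ipv6 unicast
  network 2001:db8:1234:201::1/128
  neighbor fabric activate
 exit-address-family
--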

The network has been working very reliably for us, so we never really
had anything to debug.  If you're coming from a world where you used
separate BGP sessions to exchange IPv4 and IPv6 reachability
information, then the operational commands take a little getting used
to, but in the end I find it very intuitive.

For example, here's one of the "show bgp ... summary" commands on a leaf
switch:

leinen@sw-f:mgmt-vrf:~$ net show bgp ipv6 uni sum
BGP router identifier 10.1.1.46, local AS number 65111 vrf-id 0
BGP table version 96883
RIB entries 1528, using 227 KiB of memory
Peers 54, using 1041 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
sw-o(swp16)     4      65108  953559  938348        0    0    0 03w5d00h          688
sw-m(swp18)     4      65108  885442  938348        0    0    0 03w5d00h          688
s0001(swp1s0.3) 4      65300  748971  748977        0    0    0 03w5d00h            1
s0002(swp1s1.3) 4      65300  661787  661794        0    0    0 03w1d23h            1
s0003(swp1s2.3) 4      65300  748970  748977        0    0    0 03w5d00h            1
s0004(swp1s3.3) 4      65300  661868  661875        0    0    0 03w1d23h            1
s0005(swp2s0.3) 4      65300  748970  748976        0    0    0 03w5d00h            1
[...]

Note the host names/interface names - this is how you generally refer to
neighbors, rather than using literal (IPv6) addresses.

Otherwise it should look very familiar if you have used vendor C's
"industry-standard CLI" before.

(In case you're wondering, the first two neighbors in the output are
spine switches, the others are servers.)

Cheers,
-- 
Simon.


Re: Hi-Rise Building Fiber Suggestions

2020-02-26 Thread Simon Leinen
Randy Bush writes:
> since we're at this layer, should i worry about going 3m with dacs at
> low speed, i.e. 10g?  may need to do runs to neighbor rack.

No, 3m is totally fine for passive DAC, never had any issues with those.
(5m should also be fine, we just have less experience with that because
we use DAC mostly for server/ToR cabling, usually with QSFP(28) to
SFP+/SFP28 break-out cables.)
-- 
Simon.


Re: akamai yesterday - what in the world was that

2020-01-24 Thread Simon Leinen
Paul Nash writes:
> A bit of perspective on bandwidth and feeling old.  The first
> non-academic connection from Africa (Usenet and Email, pre-Internet)
> ran at about 9600 bps over a Telebit Trailblazer in my living room.

For your amusement, this latest e-bloodbath, erm, e-sports update, at 48GB
("PC" version), would take about 463 days (~15 months) to complete at
9600 bps (not counting overhead like packet headers etc.)

At 64kbps (ISDN/Antarctica) you could do it in 69 days, maybe even
finishing before the next - undoubtedly bigger - release comes out.
-- 
Simon.
[I conservatively used decimal Gigabytes, not "Gibibytes" - at 48GiB the
 numbers would be 497 or 74.5 days respectively.]


Re: RIPE out of IPv4

2019-12-01 Thread Simon Leinen
Matthew Kaufman writes:
> This is a great example (but just one of many) of how server software
> development works:

Small addition/correction to this example
(which I find interesting and also sad):

> Kubernetes initial release June 2014. Developed by Google engineers.
[...]
> Full support including CoreDNS support in 1.13, December 2018.

Support for dual-stack pods[1]: alpha in 1.16, October 2019.
-- 
Simon.
[1] https://kubernetes.io/docs/concepts/services-networking/dual-stack/


Re: Fwd: wither cyclops?

2019-02-14 Thread Simon Leinen
> Did this tool die on the vine?
> https://cyclops.cs.ucla.edu/

Not sure I would express it that way...

https://www.cs.ucla.edu/thousandeyes-a-look-inside-two-ucla-alumnis-273-million-startup/
-- 
Simon.


Re: CVV

2018-11-08 Thread Simon Leinen
Todd Underwood writes:
> [interesting and plausible reasoning about why no chip in US]
> anyway, let's talk about networks, no?

This topic is obviously "a little" off-topic, but I find some
contributions (like yours) relevant for understanding adoption dynamics
(or not) of proposed security mechanisms on the Internet (RPKI, route
filtering in general, DNSSEC etc.).

In general the regulatory environment in the Internet is quite different
from that of the financial sector.  But I guess credit-card security
trade-offs are still made mostly by private actors.
(Maybe they sometimes discuss BGP security on their mailing lists :-)
-- 
Simon.


Re: Proving Gig Speed

2018-07-18 Thread Simon Leinen
> For a horrifying moment, I misread this as Google surfacing
> performance stats via a BGP stream by encoding stat_name:value as
> community:value

> /me goes searching for mass quantities of caffeine

Because you'll be spending the night writing up that Internet-Draft? :-)
-- 
Simon.


Talk extract: Submarine cable systems 101 for AWS partners

2016-12-10 Thread Simon Leinen
Amazon held their "re:Invent" event two weeks ago.  Wasn't there, but
I'm a James Hamilton fan so I started watching the recordings of his
talks.  In one, he talks about fiber optic cables under the oceans.
Here's the start of that section:

https://youtu.be/AyOAjFNPAbA?t=672

Even though this is presented at a suitable level for a large event
(32'000 attendees total, holy cow) of mostly non-network specialists, I
learned a few interesting things, e.g. about dealing with shunt faults.

If you rewind to a few minutes before that section, he also talks about
Amazon's private inter-DC network and how it is all (N*) 100G now.
-- 
Simon.


Re: [TECH] Pica8 & Cumulus Networks

2015-11-02 Thread Simon Leinen
Yoann THOMAS writes:
> Under a Cloud project I ask myself to use equipment based on the Pica8
> or Cumulus Networks.

Ah, quite different beasts.

Cumulus Networks tries to really make the switch look like a Linux
system with hardware-accelerated forwarding, so you can use stock
programs that manipulate routing, e.g. Quagga, and all forwarding
between the high-speed ports is done "in hardware".

Most other systems including Pica8 treat the high-speed interfaces as
different; you need special software to manipulate the configuration of
the forwarding ASIC.  I think in the case of Pica8 it's OpenFlow/Open
vSwitch; for other systems it will be some sort of ASIC-specific SDK.

A colleague has built a proof-of-concept L3 leaf/spine network (using
OSPFv2/OSPFv3 according to local tradition) with six 32x40GE Quanta
switches running Cumulus Linux.  So far it has been quite pleasant.
There have been a few glitches, but those usually get fixed pretty
quickly.  We configure the switches very much like GNU/Linux servers, in
our case using Puppet (Ansible or Chef would work just as well).

> All in order to mount a Spine & Leaf architecture

> - Spine 40Gbps
> - Leaf in 10Gbps

One interesting option is to get (e.g. 1RU 32x) 40G switches for both
spine and leaf, and connect the servers using 1:4 break-out cables.
Fewer SKUs, better port density at the cost of funny cabling.  Also
gives you a bit more flexibility with respect to uplinks (can have more
than 6*40GE per leaf if needed) and downlinks (easy to connect some
servers at 40GE).

The new 32*100GE switches also look interesting, but they might still be
prohibitively expensive (although you can save on spine count and
cabling) unless you NEED the bandwidth or want to build something
future-proof.  They are even more flexible in that you can drive the
ports as 4*10GE, 4*25GE (could be an attractive high-speed option once
25GE server adapters become common), 40GE, 2*50GE, 100GE.

We have looked at Edge-Core and Quanta and they both look pretty solid.
I think they are also both used by some of the Web "hypergiants".
Others may be just as good - basically it's always the same Broadcom
switching silicon (Trident II/II+ in the 40GE, Tomahawk in the 100GE
switches) with a bit of glue; there may be subtle differences between
vendors in quality, box design, airflow etc.

It's a bit unhealthy that Broadcom is so dominant in this market - but
probably not undeserved.  There are a few alternative switching
chipsets, e.g. Mellanox, Cavium XPliant that look competitive (at least
on paper) and that may be more "open" than Broadcom's.  I think both the
software vendors (e.g. Cumulus Networks) and the ODMs (Edge-Core, Quanta
etc.) are interested in these.
-- 
Simon.


Re: Recommended L2 switches for a new IXP

2015-01-13 Thread Simon Leinen
Manuel Marín writes:
> Dear Nanog community
> [...] There are so many options that I don't know if it makes sense to
> start with a modular switch (usually expensive because the backplane,
> dual dc, dual CPU, etc) or start with a 1RU high density switch that
> support new protocols like Trill and that supposedly allow you to
> create Ethernet Fabric/Clusters. The requirements are simple, 1G/10G
> ports for exchange participants, 40G/100G for uplinks between switches
> and flow support for statistics and traffic analysis.

Stupid thought from someone who has never built an IXP,
but has been looking at recent trends in data center networks:

There are these white-box switches mostly designed for top-of-rack or
spine (as in leaf-spine/fat-tree datacenter networks) applications.
They have all the necessary port speeds - well 100G seems to be a few
months off.  I'm thinking of brands such as Edge-Core, Quanta etc.

You can get them as bare-metal versions with no switch OS on them,
just a bootloader according to the ONIE standard.  Equipment cost
seems to be on the order of $100 per SFP+ port w/o optics for a
second-to-last generation (Trident-based) 48*10GE+4*40GE ToR switch.

Now, for the limited and somewhat special L2 needs of an IXP, couldn't
someone hack together a suitable switch OS based on Open Network Linux
(ONL) or something like that?

You wouldn't even need MAC address learning or most types of flooding,
because at an IXP this often hurts rather than helps.  For building
larger fabrics you might be using something other (waves hands) than
TRILL; maybe you could get away without slightly complex multi-chassis
multi-channel mechanisms, and so on.

Flow support sounds somewhat tough, but full netflow support that
would get Roland Dobbins' usable telemetry seal of approval is
probably out of reach anyway - it's a high-end feature with classical
gear.  With white-box switches, you could try to use the given 5-tuple
flow hardware capabilities - which might not scale that well -, or use
packet sampling, or try to use the built-in flow and counter mechanisms
in an application-specific way.  (Except *that's* a lot of work on the
software side, and a usably efficient implementation requires slightly
sophisticated hardware/software interfaces.)

Instead of a Linux-based switch OS, one could also build an IXP
application using OpenFlow and some kind of central controller.
(Not to be confused with SDX: Software Defined Internet Exchange.)

Has anybody looked into the feasibility of this?

The software could be done as an open-source community project to make
setting up regional IXPs easier/cheaper.

Large IXPs could sponsor this so they get better scalability - although
I'm not sure how well something like the leaf-spine/fat-tree design maps
to these IXPs, which are typically distributed over several locations.
Maybe they could use something like Facebook's new design[1], treating each
IXP location as a pod.
-- 
Simon.
[1] https://code.facebook.com/posts/360346274145943


Low-numbered ASes being hijacked? [Re: BGP Update Report]

2014-11-30 Thread Simon Leinen
cidr-report  writes:
> BGP Update Report
> Interval: 20-Nov-14 -to- 27-Nov-14 (7 days)
> Observation Point: BGP Peering with AS131072

> TOP 20 Unstable Origin AS
> Rank  ASN    Upds    %   Upds/Pfx   AS-Name
[...]
>  11 - AS5   38861  0.6%   7.0 -- SYMBOLICS - Symbolics, Inc.,US

Disappointing to see Symbolics (AS5) on this list.  I would expect these
Lisp Machines to have very stable BGP implementations, especially given
the leisurely release rhythm for Genera for the past few decades.  Has
the size of the IPv4 unicast table started triggering global GCs?

Seriously, all these low-numbered ASes in the report look fishy.  I
would have liked this to be an artifact of the reporting software (maybe
an issue with 4-byte ASes?), but I do see some strange paths in the BGP
table that make it look like (accidental or malicious) hijacking of
these low-numbered ASes.

Now the fact that these AS numbers are low makes me curious.  If I
wanted to hijack other folks' ASes deliberately, I would probably avoid
such numbers because they stand out.  Maybe these are just non-standard
private-use ASes that are leaked?

Some suspicious paths I'm seeing right now:

  133439 5
  197945 4

Hm, maybe 32-bit ASes do have something to do with this...

Any ideas?
-- 
Simon. (Just curious)

[...]
>  17 - AS3   30043  0.4%  3185.0 -- MIT-GATEWAYS - Massachusetts Institute of Technology,US
[...]

> TOP 20 Unstable Origin AS (Updates per announced prefix)
> Rank  ASN    Upds    %   Upds/Pfx   AS-Name
[...]
>  13 - AS5   38861  0.6%   7.0 -- SYMBOLICS - Symbolics, Inc.,US
[...]
>  15 - AS4   21237  0.3% 871.0 -- ISI-AS - University of Southern California,US
[...]
>  19 - AS45345  0.1%  1437.0 -- ISI-AS - University of Southern California,US
>  20 - AS48784  0.1%  2303.0 -- ISI-AS - University of Southern California,US


Re: iOS 7 update traffic

2013-09-23 Thread Simon Leinen
Glen Kent writes:
> One of the earlier posts seems to suggest that if iOS updates were
> cached on the ISPs CDN server then the traffic would have been
> manageable since everybody would only contact the local server to get
> the image. Is this assumption correct?

Not necessarily.  I think most of the iOS 7 update traffic WAS in fact
delivered from CDN servers (in particular Akamai).  And many/most large
service providers already have Akamai servers in their networks.  But
they may not have enough spare capacity for such a sudden demand -
either in terms of CDN (Akamai) servers or in terms of capacity between
their CDN servers and their customers.

> Do most big service providers maintain their own content servers? Is
> this what we're heading to these days?

Depends on what you mean by "their own".  As I said, these days Akamai
has servers in many of the big networks.  Google and possibly others
(Limelight, ...?) might have that as well.  But I wouldn't call them
"their [the SPs'] own".

Some SPs have also built their own CDNs (Level 3) or are talking about
it.  But that model seems to be less popular with the content owners and
the other SPs.
-- 
Simon.



Re: Real world sflow vs netflow?

2012-07-17 Thread Simon Leinen
James Braunegg writes:
> In the end I did real life testing comparing each platform

Great, thanks for sharing your results!

(It would be nice if you could tell us a little bit about the
configuration, i.e. what kind of sampling you used.)

[...]
> That being said both netflow and sflow both under read by about 3%
> when compared to snmp port counters, which we put to the conclusion
> was broadcast traffic etc which the routers didn't see / flow.

That's one reason, but another reason would be that at least in Netflow
(but sFlow may be similar depending on how you use it), the reported
byte counts only include the sizes of the L3 packets, i.e. starting at
the IP header, while the SNMP interface counters (ifInOctets etc.)
include L2 overhead such as Ethernet frame headers and such.
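
A rough back-of-the-envelope check (the average packet size is a
made-up assumption): ifInOctets typically counts about 18 extra bytes
per frame (Ethernet header plus FCS) that the L3 byte counts don't see,
so

  l2_overhead = 14 + 4        # Ethernet header + FCS, seen by ifInOctets
  avg_l3_size = 600           # assumed average IP packet size in bytes
  print(f"{l2_overhead / (avg_l3_size + l2_overhead):.1%}")   # -> 2.9%

which is in the same ballpark as the ~3% you observed.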
-- 
Simon.



Re: Network Storage

2012-04-16 Thread Simon Leinen
Andrew Thrift writes:
> If you want something from a Tier1 the new Dell R720XD's will take 24x
> 900GB SAS disks

or 12x 2TB 3.5" cheap & slow SATA disks
or 12x 3TB 3.5" more expensive & slightly faster SAS disks

- if you take the (cheaper) 3.5"-disk variant of the R720xd chassis.

or 12x 3TB 3.5" cheap & slow SATA disks if you buy them directly rather
than from Dell.  (Presumably you'd have to buy Dell hot-swap trays)
-- 
Simon.

> and have 16 cores.  If you order it with a SAS6-HBA you can add up to
> 8 trays of 24 x 900GB SAS disks to provide 194TB of raw space at quite
> a reasonable cost.



Re: Apple updates - Effect on network

2011-10-15 Thread Simon Leinen
Matt Taylor writes:
> Would love to see some bandwidth graphs. :)

Here's one from another network.
[attachment: akamai-week.png]

Guess it was a good idea to upgrade that Akamai cluster's uplink to
10GE, even though 2*GE (or was it 4*GE) looked sufficient at the time.
Remember folks, overprovisioning is a misnomer, it should be called
provisioning for robustness and growth.
-- 
Simon.


Re: [routing-wg] The Cidr Report

2011-10-15 Thread Simon Leinen
Geoff Huston writes:
> Does anyone give a s**t about this any more?

I do; I check the weekly increase every week, and check who the top
offenders are.  If someone from my vicinity/circles is on the list
(doesn't happen frequently; more often for the BGP updates report than
for CIDR), I may send them a note and ask what happened.

> From what I learned at the latest NANOG it's very clear that nobody
> reads this any more.

"Reads" may be an exaggeration, but I'm sure some look at it.

> Is there any good reason to persist in spamming the nanog list with
> this report?

I think it still provides an incentive for people not to mess things up
too badly; and a chance of some mishaps to be noticed quicker, with a
little help from your friends.
-- 
Simon.



Re: facebook spying on us?

2011-10-02 Thread Simon Leinen
> Data Center Knowledge posted about 20 minutes of very poorly shot
> video of Prineville.  They're Open Compute servers in 'triplet' racks.
[...]
> Their power supply (also open) runs across 2 legs of a 277/480 3-phase
> feed, which is usually what the substation supplies to your PDUs,
> which step it down further to 120/208.  It also takes -48, and each
> pair of triplets has a 48V float string that will run the 180 servers
> for about 45 seconds.

> It's a nice setup.  I plan to steal it.  :-)

That's what they want you to do - check out the specs on

http://opencompute.org/
-- 
Simon.



Re: Cisco 7600 PFC3B(XL) and IPv6 packets with fragmentation header

2011-10-01 Thread Simon Leinen
> which traceroute?  icmp?  udp?  tcp?  Traceroute is not a single protocol.

Router processing is only dependent on noticing that TTL is expiring,
and being able to return an ICMP message (including a quote of part of
the original packet) to the sender.

> what is that limit? from a single port? from a single linecard? from a
> chassis? how about we remove complexity here and just deal with this
> in the fastpath?

> on a pfc3, the mls rate limiters deal with handling all punts from the
> chassis to the RP.  It's difficult to handle this in any other way.

If the rate limit is done in hardware (which one should hope), then it
would be more natural to do it on a per-PFC/DFC basis.  So on a box with
DFCs on all linecards, it would be per linecard, not per chassis.

Maybe someone who knows for sure can decide.

> My point in calling this all 'stupid' is that by now we all have been
> burned by this sort of behavior, vendors have heard from all of us
> that 'this is really not a good answer', enough is enough please stop
> doing this.

> This is a Hard Problem.  There is a balance to be drawn between
> hardware complexity, cost and lifecycle.  In the case of the PFC3,
> we're talking about hardware which was released in 2000 - 11 years
> ago.

Um, no, in 2000 there was no PFC3.  That came out (on the Supervisor
720) in March 2003.

> The ipv6 fragment punting problem was fixed in the pfc3c, which was
> released in 2003.

The PFC 3C was announced (with the RSP720) in December 2006.

> I'm aware that cisco is still selling the pfc3b, but they really only
> push the rsp720 for internet stuff (if they're pushing the 6500/7600
> line at all).

See Janos' reply, the Catalyst 6500 seems alive and kicking with the
Supervisor 2T.

The 7600 is a somewhat different story.  As far as I see, all
development is going into feature-rich ES+ cards and a few relatively
narrow applications such as mobile backhaul and FTTH aggregation(?).

We have been using the 7600 as a cheap fast IPv4/IPv6 (and later also
MPLS) backbone router.  According to Cisco we should probably move up
to the ASR9000 or CRS-3, but I'm tempted to downgrade to Catalyst 6500
with Sup-2T (until we need 100G :-).
-- 
Simon.



Re: Network Equipment Discussion (HP and L2/10G)

2011-05-14 Thread Simon Leinen
Deepak Jain writes:
> The wrinkle here is that I can't use a normal enterprise 10G switch
> because of the need for DWDM optics (ideally 80km style).

80km DWDM optics in SFP+ format should be available now or RSN.  Search
engines turn up a few purported vendors.  The ones I found conform to
the 100GHz grid, but 50GHz ones should be coming too.

Haven't tried any of those myself though.
-- 
Simon.



Re: Top webhosters offering v6 too?

2011-02-06 Thread Simon Leinen
Tim Chown writes:
> Which of the big boys are doing it?

Google - although they don't call themselves a web hoster, they can be
used for hosting web sites using services such as Sites or App Engine.
Both support IPv6, either using the opt-in mechanism or by using an
alternate CNAME (ghs46 instead of ghs.google.com).  That's what I use.

None of the other large cloud providers seems to support IPv6 for
their users yet.  In particular, neither Amazon's AWS nor Microsoft
Azure have much visible activity in this direction.  Rackspace have
announced IPv6 support for the first half of 2011.

Concerning the more traditional webhosting offerings, I have no idea
about the big boys.  Here in Switzerland, a few smaller hosters
support IPv6.  And I saw IPv6 mentioned in ads for some German server
hosting offering.  Germany is interesting because it has a
well-developed hosting ecosystem with some really big players.
-- 
Simon.



Re: arin and ops fora

2011-01-08 Thread Simon Leinen
Randy Bush writes:
> one difference in north america from the other 'regions' is that there
> is a strong and very separate operator community and forum.  this does
> not really exist in the other regions.  ripe ate the eof years ago.
> apops is dormant aside from [...]

Right.

> observe that the main north american irr, radb, is not run by the rir,
> unlike in other regions.  and i like that there are a number of
> diverse rir services in the region.  it's healthy.
          ^^^ you mean rr I think.

> so i would be perfectly happy if arin discussed operational matters
> here on nanog with the rest of us ops.  i would not be pleased to see
> ops start to be subsumed by the rir here.

I'm sympathetic with that, but, like David said, the separation
(NANOG/ARIN) you have in North America does lead to issues such as not
being able to trust what's in the RR(s).

So I'm quite happy with the situation here in Europe, where RIPE
(deliberately ignoring the difference between RIPE NCC and the RIPE
community for a second) takes care of both running the address registry,
and running a routing registry that can leverage the same
authentication/authorization substrate.  This makes the RR much more
trustworthy, and should really make the introduction of something like
RPKI much easier (albeit with the temptation to set it up in a more
centralized way than we might like).

Randy, what is the model you have in mind for running a routing registry
infrastructure that is sustainable and trustworthy enough for uses such
as RPKI, i.e. who could/should be running it? I guess I'm arguing that
from my non-North-American perspective, an ARIN with a carefully
extended mandate could be of much help here.  So even if you're unhappy
with the current ARIN governance, maybe it would still be worthwhile for
the community to fix that issue - unless there are credible alternatives.
-- 
Simon.



Re: Over a decade of DDOS--any progress yet?

2010-12-11 Thread Simon Leinen
Greg Whynott writes:
> i found it funny how M$ started giving away virus/security software
> for its OS.  it can't fix the leaky roof,  so it includes a roof patch
> kit. (and puts about 10 companies out of business at the same time)

I actually like the new arrangement better, where Microsoft provides the
security software to its OS customers for free.

The previous setup had third parties (anti-virus vendors) profiting from
the weaknesses in Microsoft's software.

The new arrangement provides better incentives for fixing the security
weaknesses at the source, at least as far as Microsoft is concerned.
Even for third-party providers of buggy software, Microsoft probably has
better leverage over them than the numerous anti-virus vendors do.

But then maybe my armchair economics are totally wrong.
-- 
Simon.



ICMPv6 rate limits breaking PMTUD (and traceroute) [Re: Comcast enables 6to4 relays]

2010-09-01 Thread Simon Leinen
Jack Bates writes:
> 1) Your originating host may be breaking PMTU (so the packet you send
> is too large and doesn't make it, you never resend a smaller packet,
> but it works when tracerouting from the other side due to PMTU working
> in that direction and you are responding with the same size packet).

Your mentioning PMTU discovery issues in connection with 6to4 prompts me
to confess how our open 6to4 relay has probably contributed to the
perception of brokenness of 6to4 for quite a while *blush*.

The relay runs on a Cisco 7600 with PFC3 - btw. this is an excellent
platform to run an 6to4 relay on, because it can do the encap/decap in
hardware if configured correctly.

At some point of the relay becoming popular (load currently fluctuates
between 80 Mb/s and 200 Mb/s), I noticed that our router very often
failed to send ICMPv6 messages such as "packet too big".

First I suspected our control-plane rate-limit (CoPP) configuration, but
couldn't find anything there.

Finally I found that I had to configure a generous ipv6 icmp
error-interval[1], because the (invisible) default configuration will
only permit one such ICMPv6 message to be generated every 100
milliseconds, and that's WAY insufficient for a popular router.
We currently use

 ipv6 icmp error-interval 2 100

(max. steady state rate 500 ICMPv6s/second - one every 2 milliseconds -
with bursts up to 100) with no ill effects.

Note that the same rate-limit will also cause stars in IPv6 traceroutes
through popular routers if the default setting is used.

The issue is probably not restricted to Cisco, as the ICMPv6 standard
(RFC 4443) mandates that ICMPv6 error messages be rate limited.  It even
has good (if hand-wavy) guidance on how to arrive at defaults - the
values used on our Cisco 7600 (and possibly all other IOS devices?)
correspond to the RFC's suggestion for a small/mid-size device *hrmpf*
(yes Randy, I know I should get real routers :-).

Does anybody know which defaults are used by other devices/vendors?

In general, rate limits are very useful for protecting routers'
notoriously underpowered control planes, but (1) it's hard to come up
with reasonable defaults, and (2) I suspect that most people don't
monitor them (because that's often hard), and thus won't notice when
normal traffic levels trip these limits.
-- 
Simon.
[1] See 
http://www.cisco.com/en/US/docs/ios/ipv6/command/reference/ipv6_06.html#wp2135326



Re: Restrictions on Ethernet L2 circuits?

2009-12-31 Thread Simon Leinen
Interesting questions.  Here are a few thoughts from the perspective of
an education/research backbone operator that used to be IP only but has
also been offering L2 point-to-point circuits for a few years.

> Should business customers expect to be able to connect several LANs
> through an Ethernet L2 circuit and build a layer 2 network spanning
> several locations?

At least for our customers, that is indeed important.  The most popular
application here is for a customer to connect a remote location to their
campus network, and that want to (at least be able to) use any of their
existing VLANs at the remote site.

> Or should the service provider implement port security and limit the
> number of MAC addresses on the access ports, forcing the customer to
> connect a router in both ends and segment their network?

That would make the service less attractive, and also more complex to
set up and maintain.  For point-to-point service, there is really no
reason for the network to care about customers' MAC addresses, VLAN tags
and such.  As you said, EoMPLS doesn't care.  (Ethernet over L2TPv3
shouldn't care either.  If I had cost-effective edge routers that did
L2TPv3 encapsulation/decapsulation at line rate, I'd switch off MPLS in
our core tomorrow.)

Couldn't PBB or even Q-in-Q provide that isolation as well, at least for
point-to-point services? I must say that I don't personally have much
experience with those, because we tend to connect our customers to
EoMPLS-capable routers directly.
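
For the point-to-point case, the provider-side configuration can be as
small as this - a hedged IOS-style sketch with a made-up peer loopback
and VC ID, attaching one customer port to a pseudowire in port mode,
i.e. transparent to the customer's VLANs and MAC addresses:

  interface GigabitEthernet1/1
   description customer-a, hand-off to remote site
   xconnect 192.0.2.2 100 encapsulation mpls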

> Also, do you see a demand for multi-point layer 2 networks (requiring
> VPLS), or are point-to-point layer 2 circuits sufficient to meet
> market demand?

That's a big question for us right now... we're not sure yet.  I'd like to
hear others' opinions on this.

> The most important argument for customers that choose Ethernet L2 over
> MPLS IP-VPN is that they want full control over their routing, they
> don't want the involvement from the service provider. Some customers
> also argue that a flat layer 2 network spanning several locations is a
> simpler and better design for them, and they don't want the hassle
> with routers and network segmentation.

I have a good deal of sympathy for customers who think this way.  Also
from the service provider point of view, I like the simplicity of the
offering - basically we're providing an emulation of a very long piece
of Ethernet cable.  (My worry with multipoint L2 VPNs is that they can't
have such a simple service model.)

> But IMO the customer (and the service provider) is far better off by
> segmenting their network in the vast majority of cases. What do you
> think?

Maybe they already have a segmented network, but don't want to segment
it based on geography/topology.

As far as I'm concerned, enterprises should just connect their various
sites to the Internet independently, and use VPN techniques if and where
necessary to provide the illusion of a unified network.  In practice,
this illusion of a single large LAN (or rather, multiple
organization-wide LANs) is very important to the typical enterprise,
because so much security policy is enforced based on IP addresses.  And
the typical enterprise wants a central chokepoint that all traffic must
go through, for reasons that might have to do with security, or support
costs, or with (illusions of) control.

This bridging function required to maintain the illusion of a unified
network is something that most enterprises prefer to outsource.  I'd
hope that at some point, better security mechanisms and/or better VPN
technologies make these kinds of VPN services less relevant.  Until that
happens, there's going to be demand for them.  Of course the telcos have
known that for eons and provided many generations of expensive and
hard-to-use services to address this.  Point-to-point Ethernet services
are interesting because they are relatively easy to provide for folks
like us who only really know IP (and maybe some MPLS).  And the more
transparent they are, the easier it is for customers to use them.
-- 
Simon.



Re: Layer 2 vs. Layer 3 to TOR

2009-11-15 Thread Simon Leinen
Tore Anderson writes:
> * Jonathan Lassoff
>> Are there any applications that absolutely *have* to sit on the same
>> LAN/broadcast domain and can't be configured to use unicast or multicast
>> IP?

> FCoE comes to mind.

Doesn't FCoE need even more than that, i.e. lossless Ethernet with
end-to-end flow control, such as IEEE DCB? As far as I understand,
traditional switched Ethernets don't fit the bill anyway.

On the other hand iSCSI should be fine with routed IP paths; though
Malte's mail suggests that there are (broken?) implementations that aren't.
-- 
Simon.



Re: MRLG

2009-08-29 Thread Simon Leinen
> Thanks guys I got it...

Congratulations.  But how/where?
-- 
Simon.



Re: SNMP and syslog forwarders

2009-03-04 Thread Simon Leinen
Sam Stickland writes:
> It's looking like running all of our traps and syslog through a couple
> of relay devices (and then onwards to the various NMS's) would be
> quite a win for us.

You can try the UDP samplicator:

http://www.switch.ch/network/downloads/tf-tant/samplicator/

(The name indicates that it can also sample packets, but that is just an
option that can be ignored for your application.)

> These relay devices just need to be dumb forwarders (we don't
> require any filtering or storing, just reflection), but we need an HA
> pair (across two sites) without creating duplicates.

There is one complication with SNMP traps and also with typical Syslog
packets: The IP source address carries important information that is not
carried in the payload.  So it's not sufficient for the relay to simply
re-send the UDP datagrams without loss of information.

Samplicator handles this with an option to spoof the IP source address
when it resends the packets.  (With this option, it must run as root,
and you will have to drill holes in the ingress filters that you
hopefully have even for your own servers. :-)

> I have the coding skills to make this myself, but as coding skills
> come and go in our network team, we are looking for a commercial
> product so it will continue to work after I get:  hit by a bus /
> amnesia / visions of grandeur.

Not commercial, sorry.  Maybe someone can sell you support for it (or
life insurance).  I should probably put it up on a code hosting service
so that the community can maintain it.

> Any recommendations / experience? This needs to scale to ~1,500 devices.

Shouldn't be a problem.  The main trick is to ensure that the
forwarder's UDP receive buffers are large enough to handle bursts that
might arrive while the forwarder/server is catching its breath.
Samplicator lets you tune this socket buffer size.
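
For the record, an invocation could look roughly like this (written
from memory - double-check the option spellings against the usage
output of your version):

  # listen on UDP/2000, keep (spoof) the original source addresses,
  # enlarge the receive socket buffer, fan out to two collectors
  samplicate -p 2000 -S -b 2097152 collector-a/2000 collector-b/2000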
-- 
Simon.



Re: DNS problems to RoadRunner - tcp vs udp

2008-06-14 Thread Simon Leinen
Jon Kibler writes:
> Also, other than "That's what the RFCs call for", why use TCP for
> data exchange instead of larger UDP packets?

TCP is more robust for large (> Path MTU) data transfers, and less
prone to spoofing.

A few months ago I sent a message to SwiNOG (like NANOG only less
North American and more Swiss) about this topic, trying to explain
some of the tradeoffs:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02612.html

Mostly I think that people approaching this from a security
perspective only often forget that by fencing in the(ir idea of the)
current status quo, they often prevent beneficial evolution of
protocols as well, contributing to the Internet's ossification.
-- 
Simon.



Re: [NANOG] Questions about NETCONF

2008-05-16 Thread Simon Leinen
Randy Bush writes:
[in response to John Payne [EMAIL PROTECTED]:]
>> I've personally been waiting for the data modeling to be
>> standardized.  Yes, it's great and wonderful to have a consistent
>> method of talking to network devices, but I also want a standard
>> data model along with it.

> does this not imply that all devices would need to be semantically
> congruent?  if so, is this realistic?

Personally I don't think it is.

The way that configuration is structured is something that at least
some vendors use to differentiate themselves from each other.  (Though
other vendors make a point of being compatible with some industry
standard CLI.)

So if you think that configurations in NETCONF should be similar to
the native configuration language, that doesn't bode well for
industry-wide standardization of a NETCONF configuration data model.

It might still be possible to have a common NETCONF data model, but
then that would probably be quite different from the (all) native
configuration languages; much in the same way as SNMP MIBs are
(structurally) different from how information is presented at the
CLI.  Personally I'm not sure that this would be a very useful
outcome, because there would necessarily be a large lag between when
features are implemented (with a native CLI to configure them of
course) and when they can be configured through NETCONF.

Maybe the best we can shoot for is this:

* A common language to describe parts of NETCONF configuration.  The
  newly chartered IETF NETMOD working group[1] is working on this.
  Vendors can then describe their specific NETCONF data models using
  this language, and tool writers can use these descriptions to
  generate code for applications that want to manipulate device
  configurations.

* Common data models for certain well-understood parts of NETCONF
  configuration.  This could include simple atomic things such as
  how to write an IP address or a prefix in (NETCONF) XML, or
  configuration of standardized protocols such as OSPF, IPFIX etc.

  The problem is how well will this support migration from
  vendor-specific configuration to standardized configuration - which,
  as I said, is always bound to lag far behind.  And even if/when an
  aspect of a configuration model (let's say for OSPF) is
  standardized, vendors are bound to extend that model to support
  not-yet-standardized extensions (e.g. sub-second timers, BFD).  This
  will be another challenge to support.  (But there are smart people
  working on this :-)
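
To make the second point a bit more concrete, here is a purely
illustrative fragment of what such a standardized piece of NETCONF
configuration might look like - the element names and the model
namespace are invented for the example, not taken from any existing
standard:

  <config xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
    <routing xmlns="urn:example:yang:example-ospf">
      <ospf>
        <area>
          <id>0.0.0.0</id>
          <interface>
            <name>ge-0/0/0.0</name>
            <hello-interval>10</hello-interval>
          </interface>
        </area>
      </ospf>
    </routing>
  </config>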
-- 
Simon.
[1] http://www.ietf.org/html.charters/netmod-charter.html

___
NANOG mailing list
NANOG@nanog.org
http://mailman.nanog.org/mailman/listinfo/nanog