Re: Opengear alternatives that support 5g?

2024-04-26 Thread Saku Ytti
On Fri, 26 Apr 2024 at 19:43, Warren Kumari  wrote:

> I've been on the same quest, and I have some additional requests / features. 
> Ideally it:
>
> 1: would be small - my particular use-case is for a "traveling rack", and so 
> 0U is preferred.
> 2: would be fairly cheap.
> 3: would not be a Raspberry-Pi, a USB hub and USB-to-serial cables. We tried 
> that for a while, and it was clunky — the SD card died a few times (and 
> jumped out entirely once!), people kept futzing with the OS and fighting over 
> which console software to use, installing other packages, etc.
> 4: support modern SSH clients (it seems like you shouldn't have to say this, 
> but… )
> 5: actually be designed as a termserver - the current thing we are using 
> doesn't really understand terminals, and so we need to use 'socat 
> -,raw,echo=0,escape=0x1d TCP::' to get things like 
> tab-completion and "up-arrow for last command" to work.
> 6: support logging of serial (e.g crash-messages) to some sort of log / 
> buffer / similar (it's useful to be able to see what a device barfed all over 
> the console when it crashes.

Decouple your needs: use whatever hardware to translate RS232 into
SSH, and then use 'conserver' to maintain 24/7 logging and to
multiplex SSH sessions to each console port. Then you have your logs
on your existing NMS box filesystem, and a consistent UX, independent
of hardware, to reach, monitor and multiplex consoles.
For me Cisco is great here, because it's something an organisation
already knows how to source, turn up, upgrade, troubleshoot and
maintain. And you get a broad set of features you might want: IPsec,
DMVPN, BGP, IS-IS, and so forth.

I keep wondering why everyone is so focused on OOB hardware cost,
when in my experience the ethernet connection is ~200-300 USD MRC
(150 USD can be just the xconn). So in 10 years you'll pay 24k to 36k
just for the OOB WAN, dwarfing the hardware price. And 10 years, to
me, doesn't sound like a particularly long time for a console setup.

>
>
> The Get Console Airconsole TS series meets many of these requirements, but it 
> doesn't do #6. It also doesn't really feel like they have been updating / 
> maintaining these.
>
> Yes, I fully acknowledge that #3 falls into the "Doctor, Doctor, it hurts 
> when I do this" camp, but, well…
>
> W
>


--
  ++ytti


Re: Opengear alternatives that support 5g?

2024-04-26 Thread Saku Ytti
On Fri, 26 Apr 2024 at 19:43, Warren Kumari  wrote:

>> Curious if anyone has particular hardware they like for OOB / serial 
>> management, similar to OpenGear, but preferably with 5G support, maybe even 
>> T-Mobile support? It’s becoming increasingly difficult to get static IP 4g 
>> machine accounts out of Verizon, and the added speed would be nice too. Or 
>> do you separate the serial from the access device (cell+firewall, etc.)?

Does it? To me the OP implied they need 5G because they can get a
static IP on the 5G product, but not on 4G. So if the need for a
static IP is solved, they can keep their existing investments.

-- 
  ++ytti


Re: Opengear alternatives that support 5g?

2024-04-25 Thread Saku Ytti
On Fri, 26 Apr 2024 at 03:11, David H  wrote:

> Curious if anyone has particular hardware they like for OOB / serial 
> management, similar to OpenGear, but preferably with 5G support, maybe even 
> T-Mobile support?  It’s becoming increasingly difficult to get static IP 4g 
> machine accounts out of Verizon, and the added speed would be nice too.  Or 
> do you separate the serial from the access device (cell+firewall, etc.)?

You could get a 5G Catalyst with an async NIM or SM.

But I think you're setting yourself up for unnecessary costs and
failures by designing your OOB to require static IPs. You could
design it so that the OOB spokes dial in to the central OOB hub, and
the OOB hub doesn't care what IP they come from, using certificates
or PSK for identity instead of IP.

-- 
  ++ytti


Re: constant FEC errors juniper mpc10e 400g

2024-04-21 Thread Saku Ytti
On Sun, 21 Apr 2024 at 09:05, Mark Tinka  wrote:

> Technically, what you are describing is EoS (Ethernet over SONET, Ethernet 
> over SDH), which is not the same as WAN-PHY (although the working groups that 
> developed these nearly confused each other in the process, ANSI/ITU for the 
> former vs. IEEE for the latter).
>
> WAN-PHY was developed to be operated across multiple vendors over different 
> media... SONET/SDH, DWDM, IP/MPLS/Ethernet devices and even dark fibre. The 
> goal of WAN-PHY was to deliver a low-cost Ethernet interface that was 
> SONET/SDH-compatible, as EoS interfaces were too costly for operators and 
> their customers.
>
> As we saw in real life, 10GE ports out-sold STM-64/OC-192 ports, as networks 
> replaced SONET/SDH backbones with DWDM and OTN.

The key difference is that WAN-PHY does not provide synchronous
timing, so it is not SDH/SONET-compatible in the strict sense, but it
does have the frame format. And the optical systems which could
regenerate SONET/SDH framing didn't care about timing; they just
needed to parse and generate those frames, which they could, but
which they could not do for ethernet frames.

I think it is pretty clear the driver was to support long-haul
regeneration, so it was always going to be a stop-gap solution. Even
though I know some networks who specifically wanted WAN-PHY for its
error-reporting capabilities, I don't think that was the majority
driver; the majority driver almost certainly was 'that's the only
thing we can put on this circuit'.

-- 
  ++ytti


Re: constant FEC errors juniper mpc10e 400g

2024-04-20 Thread Saku Ytti
On Sat, 20 Apr 2024 at 14:35, Mark Tinka  wrote:

> Even when our market seeks OTN from European backhaul providers to extend 
> submarine access into Europe and Asia-Pac, it is often for structured 
> capacity grooming, and not for OAM benefit.
>
> It would be interesting to learn whether other markets in the world still 
> make a preference for OTN in lieu of Ethernet, for the OAM benefit, en masse. 
> When I worked in Malaysia back in the day (2007 - 2012), WAN-PHY was 
> generally asked for for 10G services, until about 2010; when folk started to 
> choose LAN-PHY. The reason, back then, was to get that extra 1% of pipe 
> bandwidth :-).

Oh, I don't think OTN or WAN-PHY have any large deployment future.
The cheapest option is 'good enough', and whatever value you could
extract from OTN or WAN-PHY will be difficult to capitalise; people
usually don't even capitalise the capabilities they already pay for
in the cheaper technologies.
Of course WAN-PHY is dead post-10GE; a big reason for it to exist was
very old optical systems which simply could not regenerate ethernet
framing, not any features or functional benefits.



-- 
  ++ytti


Re: constant FEC errors juniper mpc10e 400g

2024-04-20 Thread Saku Ytti
On Sat, 20 Apr 2024 at 10:00, Mark Tinka  wrote:

> This would only matter on ultra long haul optical spans where the signal 
> would need to be regenerated, where - among many other values - FEC would 
> need to be decoded, corrected and re-applied.

In most cases, modern optical long haul has a transponder, which
terminates your FEC, because clients offer gray light and you'd like
something a bit less depressing, like 1570.42nm.

This is not just FEC-terminating but also, to a degree,
autonegotiation-terminating: an RFI signal would run between you and
the transponder. So these connections can be, and regularly are,
provided without proper end-to-end hardware liveness, and even if
they were delivered and tested with proper end-to-end HW liveness,
that may change during operation. So line faults may or may not be
propagated to both ends as an RFI assertion, and even when they are,
they may be delayed to allow optical protection to engage, which may
be undesirable, as it eats into your convergence budget.

Of course the higher we go in the abstraction, the less likely you
are to get things like HW liveness detection; I don't really see
anyone asking for this in their pseudowire services, even though it's
something that actually can be delivered. In Junos it's a single
config stanza on the interface to assert RFI to the client port if
the pseudowire goes down in the operator network.

-- 
  ++ytti


Re: constant FEC errors juniper mpc10e 400g

2024-04-19 Thread Saku Ytti
On Fri, 19 Apr 2024 at 10:55, Mark Tinka  wrote:

> FEC is amazing.

> At higher data rates (100G and 400G) for long and ultra long haul optical 
> networks, SD-FEC (Soft Decision FEC) carries a higher overhead penalty 
> compared to HD-FEC (Hard Decision FEC), but the net OSNR gain more than 
> compensates for that, and makes it worth it to increase transmission distance 
> without compromising throughput.

Of course there are limits to this, as FEC is hop-by-hop, so in
long-haul you'll know about circuit quality up to the transponder,
not end-to-end. Unlike with WAN-PHY or OTN, where you know both.

Technically, optical transport could induce FEC errors on the client
port if there are FEC errors on any hop, so consumers of optical
networks would not need access to the optical network to know whether
the circuit is end-to-end clean. Much like cut-through switching can
induce errors, via some symbols, to communicate that CRC errors
happened earlier, so the receiver doesn't have to worry about
problems on their end.

-- 
  ++ytti


Re: constant FEC errors juniper mpc10e 400g

2024-04-19 Thread Saku Ytti
On Thu, 18 Apr 2024 at 21:49, Aaron Gould  wrote:

> Thanks.  What "all the ethernet control frame juju" might you be referring 
> to?  I don't recall Ethernet, in and of itself, just sending stuff back and 
> forth.  Does anyone know if this FEC stuff I see concurring is actually 
> contained in Ethernet Frames?  If so, please send a link to show the ethernet 
> frame structure as it pertains to this 400g fec stuff.  If so, I'd really 
> like to know the header format, etc.

The frames in FEC are idle frames between actual ethernet frames. So
you recall right: without FEC, you won't see this idle traffic.

It's very, very good, because now you actually know, before putting
the circuit in production, whether the circuit works or not.

A lot of people have processes to ping from router to router for N
time, trying to determine circuit correctness before putting traffic
on it, which looks absolutely childish compared to FEC, both in terms
of how reliable the presumed outcome is and how long it takes to get
to that presumed outcome.
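
To put numbers on that, a back-of-the-envelope sketch (my own
illustration, not from the thread): the chance of catching a subtle
bit error rate with an hour of pings versus one second of FEC
counters on a 400GE:

  import math

  def p_detect(ber: float, bits: float) -> float:
      # probability of observing at least one errored bit
      return 1 - math.exp(-ber * bits)

  ber = 1e-12                   # a subtly broken circuit
  ping_bits = 3600 * 100 * 8    # 1 hour of 1pps pings, 100B packets
  fec_bits = 400e9              # 1 second of 400GE line rate seen by FEC

  print(f"1h ping test: {p_detect(ber, ping_bits):.6f}")  # ~0.000003
  print(f"1s of FEC:    {p_detect(ber, fec_bits):.2f}")   # ~0.33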

-- 
  ++ytti


Re: TFTP over anycast

2024-04-06 Thread Saku Ytti
On Sat, 6 Apr 2024 at 12:00, Bill Woodcock  wrote:

> That’s been the normal way of doing it for some 35 years now.  iBGP 
> advertise, or don’t advertise, the service address, which is attached to the 
> loopback, depending whether you’re ready to service traffic.

If we are talking about eBGP, then pulling routes makes sense. If we
are talking about iBGP and a controlled environment, you should never
pull anycast routes, because eventually you will hit the failure mode
where the check mechanism itself is broken, and you'll pull all
routes.
If, instead of pulling the routes, you make them inferior, you are
covered for the failure mode of the check itself being broken.
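
A minimal sketch of the 'inferior, not withdrawn' idea, assuming an
ExaBGP-style route injector; the service address, health check
command and MED values are all hypothetical:

  import subprocess

  SERVICE_ROUTE = "192.0.2.53/32"  # hypothetical anycast service address

  def service_healthy() -> bool:
      # Hypothetical local check; substitute a real probe of the service.
      return subprocess.call(
          ["systemctl", "is-active", "--quiet", "dns"]) == 0

  def announcement() -> str:
      # Healthy: attractive MED. Unhealthy: stay announced but inferior,
      # so a globally broken check cannot withdraw the service everywhere.
      med = 10 if service_healthy() else 1000
      return f"announce route {SERVICE_ROUTE} next-hop self med {med}"

  print(announcement())  # feed this to the BGP speaker's API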


-- 
  ++ytti


Re: Open source Netflow analysis for monitoring AS-to-AS traffic

2024-03-29 Thread Saku Ytti
On Fri, 29 Mar 2024 at 20:10, Steven Bakker  wrote:

> To top it off, both the sFlow and IPFIX specs are sufficiently vague about 
> the meaning of the "frame size", so vendors can implement whatever they want 
> (include/exclude padding, include/exclude FCS). This implies that you 
> shouldn't trust these fields.

I share this concern, but in my experience the market simply does
not care at all what the data means. People happily graph L3 rate
from Junos and L2 rate from other boxes, using them interchangeably
as well as using them to determine whether or not there is
congestion.
While in reality what you really want is the L1 rate, so you can
actually see if the interface is full or not. Luckily we are starting
to see more and more devices also support peak-buffer-util over the
previous N seconds, which is far more useful for congestion
monitoring; unfortunately it is not in IF-MIB, so most will never
collect it.

Note, it is possible to get most Juniper gear to report L2 rate as
IF-MIB specifies, but it's a non-standard configuration option and
therefore very rarely used.
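
To make the difference concrete, a small sketch (my own illustration;
it assumes untagged ethernet, with the L2 rate counting MAC header
plus FCS) of converting an observed L3 rate into the L1 rate actually
occupying the wire:

  L2_OVERHEAD = 18      # 14B MAC header + 4B FCS
  L1_OVERHEAD = 8 + 12  # preamble+SFD, inter-frame gap

  def l1_bps(l3_bps: float, pps: float) -> float:
      return l3_bps + pps * (L2_OVERHEAD + L1_OVERHEAD) * 8

  pps = 1e6                     # 1Mpps of 100-byte L3 packets
  l3 = pps * 100 * 8            # = 800Mbps at L3
  print(l1_bps(l3, pps) / 1e9)  # ~1.10Gbps actually on the wire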

I also wholeheartedly agree on inline templates being near peak
insanity. Huge complexity, for an upside that is completely beyond my
understanding. If I decide to collect a new metric, then punching in
the metric number+name somewhere is the least of my worries. The idea
that costs are lowered by having machines dynamically determine what
is being collected and monitored is just bizarre. Most of the cost of
starting to collect a new metric is figuring out how it is
actionable, what needs to happen to the metric to trigger a given
action, and how exactly we are extracting value from this action.
Definitely Netflow v9/v10 should have done out-of-band templates, and
left it as an operator concern to communicate to the collector what
it is seeing.

Even exceedingly trivial things in v9/v10 entities can be broken for
years and years before anyone notices. For example, the original
sampling entities are deprecated, replaced with new entities which
communicate 'every N packets, sample C packets'. This is very, very
good, because it allows you to do stateless sampling while still
filling the export packet to MTU size or larger, keeping the export
PPS rate the same before and after axing the cache. However, by the
time I was looking into this, only pmacct correctly understood how to
use these entities; nfcapd and Arbor either didn't understand them or
understood them incorrectly (both were fixed in a timely manner by
the responsible maintainers, thank you).

-- 
  ++ytti


Re: Open source Netflow analysis for monitoring AS-to-AS traffic

2024-03-29 Thread Saku Ytti
On Fri, 29 Mar 2024 at 02:15, Nick Hilliard  wrote:

> Overall, sflow has one major advantage over netflow/ipfix, namely that
> it's a stateless sampling mechanism.  Once you have hardware that can

> Obviously, not all netflow/ipfix implementations implement flow state,
> but most do; some implement stateless sampling ala sflow. Also many

> Tools should be chosen to fit the job. There are plenty of situations
> where sflow is ideal. There are others where netflow is preferable.

This seems like a long-winded way of saying that sFlow is a perfect
subset of IPFIX.

We will increasingly see IPFIX implementations omit state, because
state doesn't do anything anymore in high-volume networks: you only
ever create the flow in cache, then delay exporting the information
for some seconds, but the flow is never hit twice, so you pay a
massive cost for caching without getting anything out of it. Anyone
who actually needs caching will have to buy specialised devices, as
it will no longer be economical for peering routers to offer the
memory bandwidth and cache sizes at which caches actually do
something.
In a particular network we tried 1:5000 and 1:500, and in both cases
flow records were 1 packet long, at which point we hit the record
export policer limit and couldn't determine at which sampling rate we
would start to see the cache being useful.

I've wondered for a long time what a graph would look like where you
plot sampling ratio against the percentage of flows observed. It will
be linear up to very high sampling ratios, but eventually it will
start to taper off; I just don't have any intuitive idea when. And I
don't think anyone really knows what share of flows they are
observing in sFlow/IPFIX: if you keep the sampling ratio static over
a period of time, say a decade, you continuously reduce your
resolution, seeing a smaller and smaller percentage of flows. This
worries me a lot, because a statistician would say that you need
such-and-such a share of volume, or of flows, to use the data in a
given way with a given confidence. So if we think about the problem
formally, we should constantly adjust our sampling ratios to fit our
statistical model, to keep the same promises about data quality.
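
For intuition, a small sketch (mine, under the simplifying assumption
of independent 1:N packet sampling) of the probability that a flow of
a given length is observed at all:

  def p_flow_observed(n_packets: int, n: int) -> float:
      return 1 - (1 - 1 / n) ** n_packets

  for pkts in (1, 10, 100, 1_000, 10_000):
      print(f"{pkts:>6}-pkt flow @ 1:5000 -> "
            f"{p_flow_observed(pkts, 5000):.2%}")
  # ~0.02%, 0.20%, 1.98%, 18.13%, 86.47%: short flows are nearly
  # invisible, and a static ratio sees an ever smaller share of flows
  # as traffic grows.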

-- 
  ++ytti


Re: Open source Netflow analysis for monitoring AS-to-AS traffic

2024-03-29 Thread Saku Ytti
On Thu, 28 Mar 2024 at 20:36, Peter Phaal  wrote:

> The documentation for IOS-XR suggests that enabling extended-router in the 
> sFlow configuration should export "Autonomous system path to the 
> destination", at least on the 8000 series routers:
> https://www.cisco.com/c/en/us/td/docs/iosxr/cisco8000/netflow/command/reference/b-netflow-cr-cisco8k/m-sflow-commands.html
> I couldn't find a similar option in the NetFlow/IPFIX configuration guide, 
> but I might have missed it.

Hope this clarifies.

--- 
https://www.cisco.com/c/en/us/td/docs/routers/asr9000/software/asr9k-r7-9/configuration/guide/b-netflow-cg-asr9k-79x/configuring-netflow.html
Use the record ipv4 [peer-as] command to record peer AS. Here, you
collect and export the peer AS numbers.
Note
Ensure that the bgp attribute-download command is configured. Else, no
AS is collected when the record ipv4 or record ipv4 peer-as command is
configured.


-- 
  ++ytti


Re: Open source Netflow analysis for monitoring AS-to-AS traffic

2024-03-28 Thread Saku Ytti
Hey,

On Thu, 28 Mar 2024 at 17:49, Peter Phaal  wrote:

> sFlow was mentioned because I believe Brian's routers support the feature and 
> may well export the as-path data directly via sFlow (I am not aware that it 
> is a feature widely supported in vendor NetFlow/IPFIX implementations?).

Exporting AS information is a wire-format-agnostic feature; where
supported, it can equally be injected into sFlow, NetFlow v5 (src and
dst only), NetFlow v9 and IPFIX. The cost is that you need to program
the information into FIB entries, so that it becomes available at
lookup time for record creation.

In the OP's case (IOS-XR) this means enabling 'attribute-download'
for BGP, and I believe IOS-XR will never download any ASNs other than
src and dst, so the full information cannot be injected into any
emitted wire format.
-- 
  ++ytti


Re: Open source Netflow analysis for monitoring AS-to-AS traffic

2024-03-28 Thread Saku Ytti
On Wed, 27 Mar 2024 at 21:02, Peter Phaal  wrote:

> Brian, you may want to see if your routers support sFlow (vendors have added 
> the feature over the last few years).

Why is this a solution; what does it solve for the OP? Why is it
meaningful what the wire format of the records is? I read the OP's
question at a much higher level, about how to interact with and
reason about the data, rather than how to emit it.

Ultimately sFlow is a perfect subset of IPFIX: when you run IPFIX
without caching, you get the functional equivalent of sFlow (there is
an IPFIX entity for emitting n bytes of the frame as data as well).

-- 
  ++ytti


Re: Best TAC Services from Equipment Vendors

2024-03-06 Thread Saku Ytti
On Wed, 6 Mar 2024 at 22:57, michael brooks - ESC  wrote:

> Funny you should mention this now, we were just discussing (more like 
> lamenting...) if support is a dying industry. It seems as though vendor 
> budgets are shrinking to the point they only have a Sales/Pre-Sales 
> department, and from Day Two on you are on your own. Dramatic take of course, 
> but if we are speaking in trajectories

My personal experience, extending across three different decades, is
that there has been no meaningful change in support quality or in the
number of issues encountered.

Support quality has always been very modest, unless you specifically
pay for access to named engineers. And this is not because the
quality of the engineers changes; it is because the vast majority of
support cases are useless cases, and to handle this massive volume,
support tries to guess which cases are legitimate problems, which are
PEBKAC, and in which cases the user already solved their problem by
the time the ticket was read and will never respond back. The last
case is so common that every first line adopts the strategy of
'pinging' you: regardless of how good and clear the information you
provide is, they ask some soft-ball question to see if you're still
engaged.
Having a named engineer changes this process, because the engineer
will quickly learn that you don't open useless cases, that the issue
you're having is legitimate, and will actually read the ticket and
think about the problem.

To me this seems an inevitable outcome: if your product is popular,
most of its users are users who don't do their homework and don't
respect the support line's time, which ends up being a disservice to
the whole ecosystem, because legitimate problems take longer to fix,
or, in the case of open-source software, the authors just burn out
and kill the project.

What shocks me more than the low-quality support is the low-quality
software. Decades pass, and everyone is still having
show-stopper-level issues in basic functions on a regular basis; the
software quality is absolutely abysmal. I fear low software quality
is organically market-driven: no one is trying to make a poor NOS,
it's just that market incentives drive poor-quality NOS. When no one
has a high-quality NOS, there is no reason to develop one, because
most of your revenue is support contracts, not hardware sales, and if
the NOS weren't outright broken, needing to be recompiled regularly
to get basic things working, a lot of users might stop buying
support, because they don't need the hand-holding part of it; they
just need working software.
This is not something that vendors actively drive; I'm sure most
companies believe they are making an honest attempt to improve
quality, but it is visible in where investments are put. One vendor
had a very promising project to take a holistic look into their NOS
quality issues, staffed by senior subject-matter experts; this
project was killed (I'm sure the funding was needed somewhere with
better returns), and the responsible senior person went to Amazon
instead.



>
> michael brooks
> Sr. Network Engineer
> Adams 12 Five Star Schools
> michael.bro...@adams12.org
> 
> "flying is learning how to throw yourself at the ground and miss"
>
>
>
> On Wed, Mar 6, 2024 at 11:25 AM Pascal Masha  wrote:
>>
>> Thought about it but so far I believe companies from China provide better 
>> and fast TAC responses to their customers than the likes of Cisco and 
>> perhaps that’s why some companies(where there are no restrictions)prefer 
>> them for critical services.
>>
>> For a short period in TAC call you can have over 10 R engineers and 
>> solutions provided in a matter of hours even if it involves software 
>> changes.. while these other companies even before you get in a call with a 
>> TAC engineer it’s hours and when they join you hear something like “my shift 
>> ended 15 minutes ago, hold let me look for another engineer”. WHY? Thoughts



-- 
  ++ytti


Re: Network chatter generator

2024-02-23 Thread Saku Ytti
On Fri, 23 Feb 2024 at 19:42, Brandon Martin  wrote:

> Before I go to the trouble of making one myself, does anybody happen to
> know of a pre-canned program to generate realistic and scalable amounts
> of broadcast/broad-multicast network background "chatter" seen on
> typical consumer and business networks?  This would be things like lots
> of ARP traffic to/from various sources/destinations within a subnet,
> SSDP, MDNS-SD, SMB browser traffic, DHCP requests, etc.?

For protocol fuzzing I've used 'Codenomicon', which has since been
acquired by Synopsys (this is about offering various types of bad
PDUs to a protocol):
https://www.synopsys.com/software-integrity/security-testing/fuzz-testing.html

For volumetric protocol testing I've used 'Spirent Avalanche' (this
is more like https or imaps users etc.):
https://www.spirent.com/products/avalanche-security-testing

There are other commercial options in this space and I'm not familiar
with recent developments.

Not sure if either really fits your bill. I guess you could ask
someone with a chatty LAN to record it, and play the pcap back.
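
If you do end up rolling your own, a starting-point sketch with Scapy
(my suggestion, not a tool from the thread; the interface name and
subnet are placeholder assumptions):

  import random
  from scapy.all import ARP, DNS, DNSQR, Ether, IP, UDP, sendp

  IFACE = "eth0"  # placeholder

  def arp_chatter(n: int = 10) -> None:
      # gratuitous ARP broadcasts from random hosts in the subnet
      for _ in range(n):
          host = f"192.168.1.{random.randint(2, 254)}"
          pkt = (Ether(dst="ff:ff:ff:ff:ff:ff")
                 / ARP(op="who-has", psrc=host, pdst=host))
          sendp(pkt, iface=IFACE, verbose=False)

  def mdns_query() -> None:
      # an mDNS service-discovery query to the well-known multicast group
      pkt = (Ether(dst="01:00:5e:00:00:fb")
             / IP(dst="224.0.0.251", ttl=255)
             / UDP(sport=5353, dport=5353)
             / DNS(qd=DNSQR(qname="_services._dns-sd._udp.local",
                            qtype="PTR")))
      sendp(pkt, iface=IFACE, verbose=False)

  arp_chatter()
  mdns_query()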

-- 
  ++ytti


Re: Twelve99 / AWS usw2 significant loss

2024-01-26 Thread Saku Ytti
On Fri, 26 Jan 2024 at 10:23, Phil Lavin via NANOG  wrote:


> 88.99.88.67 to 216.147.3.209:
>      Host                                     Loss%   Snt   Last    Avg   Best   Wrst  StDev
>  1.  10.88.10.254                              0.0%   176    0.2    0.1    0.1    0.3    0.1
>  7.  nug-b1-link.ip.twelve99.net               0.0%   176    3.3    3.5    3.1   24.1    1.6
>  8.  hbg-bb2-link.ip.twelve99.net             86.9%   175   18.9   18.9   18.7   19.2    0.1
>  9.  ldn-bb2-link.ip.twelve99.net             92.0%   175   30.5   30.6   30.4   30.8    0.1
> 10.  nyk-bb1-link.ip.twelve99.net              4.6%   175   99.5   99.5   99.3  100.1    0.2
> 11.  sjo-b23-link.ip.twelve99.net             56.3%   175  296.8  306.0  289.7  315.0    5.5
> 12.  amazon-ic-366608.ip.twelve99-cust.net    80.5%   175  510.0  513.5  500.7  539.7    8.4

This implies the problem is not on this path: #10 is not
experiencing the loss, possibly because its return packets happen to
take another path, but it certainly shows the problem had not yet
happened in this direction at #10. Because #8 and #9 do see it, they
must already be seeing it in the other direction.


> 44.236.47.236 to 178.63.26.145:
>      Host                                           Loss%   Snt   Last    Avg   Best   Wrst  StDev
>  1.  ip-10-96-50-153.us-west-2.compute.internal      0.0%   267    0.2    0.2    0.2    0.4    0.0
> 11.  port-b3-link.ip.twelve99.net                    0.0%   267    5.8    5.9    5.6   11.8    0.5
> 12.  palo-b24-link.ip.twelve99.net                   4.9%   267   21.1   21.5   21.0   58.4    3.1
> 13.  sjo-b23-link.ip.twelve99.net                    0.0%   266   21.4   22.7   21.3   86.2    6.5
> 14.  nyk-bb1-link.ip.twelve99.net                   58.1%   266  432.7  422.7  407.2  438.5    6.5
> 15.  ldn-bb2-link.ip.twelve99.net                   98.1%   266  485.6  485.4  481.6  491.1    3.9
> 16.  hbg-bb2-link.ip.twelve99.net                   92.5%   266  504.1  499.8  489.8  510.1    5.9
> 17.  nug-b1-link.ip.twelve99.net                    55.5%   266  523.5  519.6  504.4  561.7    7.6
> 18.  hetzner-ic-340780.ip.twelve99-cust.net         53.6%   266  524.4  519.2  506.0  545.5    6.9
> 19.  core22.fsn1.hetzner.com                        70.2%   266  521.7  519.2  498.5  531.7    6.6
> 20.  static.213-239-254-150.clients.your-server.de  33.2%   266  382.4  375.4  364.9  396.5    4.1
> 21.  static.145.26.63.178.clients.your-server.de    62.0%   266  529.9  518.4  506.9  531.3    6.1

This suggests the congestion point is from sjo to nyk, in 1299, not
in AWS at all.

You could try fixing SPORT/DPORT and iterating over several SPORT
options, to see if the loss goes away with some of them, to determine
whether all LAG members are full or just one.


At any rate, this seems business as usual; sometimes the internet is
very lossy. You should contact your service provider, which I guess
is AWS here, so they can contact their service provider, 1299.

-- 
  ++ytti


Re: "Hypothetical" Datacenter Overheating

2024-01-17 Thread Saku Ytti
On Wed, 17 Jan 2024 at 03:18,  wrote:

> Others have pointed to references, I found some others, it's all
> pretty boring but perhaps one should embrace the general point that
> some equipment may not like abrupt temperature changes.

Can you share them? The only one I've found is:
https://www.ashrae.org/file%20library/technical%20resources/bookstore/supplemental%20files/referencecard_2021thermalguidelines.pdf

It quotes 20C/h, which is a much higher rate than almost anyone has
the ability to produce in their DC ambient. But it gives no
explanation of where this comes from.

I believe in reality there is immense complexity here:
 - Gradient tolerance depends on the processes and materials used in
manufacturing (pre- and post-RoHS units will certainly have different
gradients)
 - Gradient has directionality, unlike the ASHRAE quote, because
devices are engineered to go from 20C to 90C in a very short moment
when turned on, but there was less engineering pressure for similar
cooling rates
 - Gradient has positionality: going 20C between any two pairs of
points does not mean equal risk

And likely no one knows well, because no one has had to know well,
because it's not expensive enough to derisk.

But what we do know well:
- ASHRAE quotes a rate which you are unlikely to be able to hit
- Devices that travel with you regularly see 50C instant ambient
gradients, in both directions, multiple times a day
- Devices see large, fast gradients when turned on, but slower ones
when turned off
- Compute people quote ASHRAE, networking people appear not to;
perhaps, like you say, spindles are ultimately the reason for the
limits to exist

I think generally we have a bias in that we like to identify risks
and then add them as organisational knowledge, but ultimately all
these new rules and exceptions you introduce increase cost and
complexity and reduce efficiency and productivity. So we should be
very critical about them. It is fine to realise risks, and to use
realised risks as data to analyse whether avoiding those risks makes
sense. It's very easy to build poorly defined rules on top of poorly
defined rules and arrive at high-cost, low-efficiency operations.
This 'few centigrades per hour' is an exceedingly palatable rule of
thumb; it sounds good, unless you stop to think about it.

I would not recommend spending any time or money derisking gradients.
I would hope that the rules that derisk condensation are enough to
also cover gradients, and I would re-evaluate after sufficient
realised risks.
-- 
  ++ytti


Re: "Hypothetical" Datacenter Overheating

2024-01-16 Thread Saku Ytti
On Tue, 16 Jan 2024 at 12:22, Nathan Ward  wrote:

> Here’s some manufacturer specs:
> https://www.dell.com/support/manuals/en-nz/poweredge-r6515/per6515_ts_pub/environmental-specifications?guid=guid-debd273c-0dc8-40d8-abbc-be059a0ce59c=en-us
>
> 3rd section, “Maximum temperature gradient”.

Thanks. It seems quite a few compute contexts quote ASHRAE gradients,
but in a networking-kit context they seem very rarely quoted (except
indirectly via NEBS), while intuitively I wouldn't expect their
tolerances to be significantly different.

-- 
  ++ytti


Re: "Hypothetical" Datacenter Overheating

2024-01-16 Thread Saku Ytti
On Tue, 16 Jan 2024 at 11:00, William Herrin  wrote:

> You have a computer room humidified to 40% and you inject cold air
> below the dew point. The surfaces in the room will get wet.

I think humidity and condensation are well understood, and indeed
documented by NEBS and vendors as verboten.

I am more interested in temperature changes when not condensing and
causing water damage. We could theorise that some solder joints will
expand/contract too fast and break, or various other scenarios one
might guess without context; and indeed electronics often have to
experience large temperature gradients and appear to survive.
When you turn these things on, various parts rapidly heat from
ambient to 80-90C. So I have some doubts whether this is actually a
problem you need to consider, in the absence of condensation.

-- 
  ++ytti


Re: "Hypothetical" Datacenter Overheating

2024-01-15 Thread Saku Ytti
On Tue, 16 Jan 2024 at 08:51,  wrote:

> A rule of thumb is a few degrees per hour change but YMMV, depends on
> the equipment. Sometimes manufacturer's specs include this.

Is this common sense, or do you have a reference for it, like a paper
showing what damage occurs at what rate of temperature change?

I regularly bring fine electronics, say an iPhone, through
significant temperature gradients, as do most people who live in
places where inside and outside can be wildly different temperatures,
with no particular observable effect. The iPhone does go into
'thermometer' mode when it overheats, though.

Manufacturers, say Juniper and Cisco, describe humidity, storage and
operating temperatures, but do not define a temperature change rate.
Does NEBS have an opinion on this, or is this just a rule of thumb of
yours?

-- 
  ++ytti


Re: IPv6 Traffic Re: IPv6? Re: Where to Use 240/4 Re: 202401100645.AYC Re: IPv4 address block

2024-01-15 Thread Saku Ytti
On Mon, 15 Jan 2024 at 21:08, Michael Thomas  wrote:

> An ipv4 free network would be nice, but is hardly needed. There will
> always be a long tail of ipv4 and so what? You deal with it at your

I mean an IPv4-free Internet DFZ, so that everyone is not forced to
maintain two stacks at extra cost, fragility and time. Any protocols
inside networks are fine, as long as you're meeting the Internet with
an IPv6-only stack. I'm sure there are CLNS, IPX, AppleTalk etc.
networks out there, but those don't impose a cost on everyone wanting
to play.

-- 
  ++ytti


Re: IPv6 Traffic Re: IPv6? Re: Where to Use 240/4 Re: 202401100645.AYC Re: IPv4 address block

2024-01-15 Thread Saku Ytti
On Mon, 15 Jan 2024 at 10:59, jordi.palet--- via NANOG  wrote:

> No, I’m not saying that. I’m saying "in actual deployments", which doesn’t 
> mean that everyone is deploying, we are missing many ISPs, we are missing 
> many enterprises.

Because of the low entropy of A-B pairs in bps volume, seeing
massive amounts of IPv6 in IPv6-enabled networks is not indicative of
IPv6 success. I don't disagree with your assertion; I just think it's
damaging, because readers without context will form the idea that
things are going smoothly. We should rightly be in panic mode, forget
all the IPv4-extension crap, and start thinking about how we ensure
IPv6 happens and how we get back to a single-stack Internet.

IPv6 is very much an afterthought, a 2nd-class citizen today. You can
deploy new features and software without IPv6, and it's fine. IPv6
can be broken, and it's not an all-hands-on-deck problem; no one is
calling.

-- 
  ++ytti


Re: IPv6 Traffic Re: IPv6? Re: Where to Use 240/4 Re: 202401100645.AYC Re: IPv4 address block

2024-01-15 Thread Saku Ytti
On Mon, 15 Jan 2024 at 10:05, jordi.palet--- via NANOG  wrote:

> In actual customer deployments I see the same levels, even up to 85% of IPv6 
> traffic. It basically depends on the usage of the caches and the % of 
> residential vs corporate customers.

You think you are contributing to the IPv6 cause by explaining how
positive the situation is. But in reality you are damaging it
greatly, because you're not communicating that we are not on a path
to an IPv4-free Internet. If we had been on such a path, we would
have been IPv4-free for more than a decade already. And unless we
admit we are not on that path, we will not work to get on that path.

-- 
  ++ytti


Re: IPv6 Traffic Re: IPv6? Re: Where to Use 240/4 Re: 202401100645.AYC Re: IPv4 address block

2024-01-14 Thread Saku Ytti
On Mon, 15 Jan 2024 at 06:18, Forrest Christian (List Account)  wrote:

> If 50% of the servers and 50% of the clients can do IPv6, the amount of
> IPv6 traffic will be around 25% since both ends have to do IPv6.

This assumes the cosmological principle applies to the Internet, but
Internet traffic is not uniformly distributed.

It is entirely possible, and even reasonable, that the AMS-IX ~5% and
GOOG 40% figures are both correct bps shares: AMS-IX sees large
entropy between A-B end-points, while GOOG sees very low entropy, it
being always the B.

A certain tier 1 transit network could see traffic being >50% IPv6
between two specific pops, so great IPv6 adoption? Except it was a
single CDN sending traffic from itself to itself; excluding that
CDN's flows between the pops, the IPv6 traffic share was a low
single-digit percentage.
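
With toy numbers (mine, not measurements), the effect is easy to see:

  flows = [("cdn", "cdn", 900e9, 0.60)]   # one hypergiant pair, 60% IPv6
  flows += [(f"a{i}", f"b{i}", 1e9, 0.03)
            for i in range(100)]          # 100 small pairs, 3% IPv6 each

  def v6_share(flows):
      total = sum(bps for _, _, bps, _ in flows)
      return sum(bps * share for _, _, bps, share in flows) / total

  print(f"with the CDN pair:    {v6_share(flows):.1%}")      # ~54%
  print(f"without the CDN pair: {v6_share(flows[1:]):.1%}")  # 3.0%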

I am not saying IPv6 traffic is not increasing. I am saying that we
are not doing anyone any favours by pretending we are on track, that
this will happen, and that there are organic drivers which will
ensure we end up with an IPv6-only Internet.

-- 
  ++ytti


Re: 202401100645.AYC Re: IPv4 address block

2024-01-11 Thread Saku Ytti
On Thu, 11 Jan 2024 at 12:57, Christopher Hawker  wrote:

> Reclassifying this space, would add 10+ years onto the free pool for each 
> RIR. Looking at the APNIC free pool, I would estimate there is about 1/6th of 
> a /8 pool available for delegation, another 1/6th reserved. Reclassification 
> would see available pool volumes return to pre-2010 levels.

Just enough time for us to retire comfortably and let some other fool
fix the mess we built?

We don't need to extend IPv4, we need to figure out why we are in this
dual-stack mess, which was never intended, and how to get out of it.

We've created this stupid anti-competitive IPv4 market, and as far as
I can foresee, we will never organically stop using IPv4. We've added
CAPEX and OPEX costs and a lot of useless work, for no other reason
but our failure to provide a reasonable transition path from IPv4 to
IPv6.

I can't come up with a less stupid way to fix this than major
players jointly signing a pledge to drop IPv4 at their edges on
2040-01-01, or some such date. That would finally create an incentive
and a date by which you need to get your IPv6 affairs in order, and
it would fix the IPv4 antitrust issue.
The only reason people need IPv4 to offer a service is that people
offering connectivity have no incentive to offer IPv6. In fact, if
you've done any IPv6 at all, you're wasting money and acting against
the best interest of your shareholders, because there is no good
reason to spend time and money on IPv6; but there should be.

-- 
  ++ytti


Re: Sufficient Buffer Sizes

2024-01-03 Thread Saku Ytti
On Wed, 3 Jan 2024 at 01:05, Mike Hammett  wrote:

> It suggests that 60 meg is what you need at 10G. Is that per interface? Would 
> it be linear in that I would need 600 meg at 100G?

Not at all.

You need to understand WHY buffering is needed, to determine how much
you want to offer buffering.

Big buffering is needed when:
   - The sender is faster than the receiver
   - The receiver wants to receive a single flow at maximum rate
   - The sender is sending window growth at sender rate, instead of
at the estimated receiver rate (the common case, but easy to change,
as Linux already estimates receiver rate, and the 'tc' command can
change this behaviour)

The amount of big buffering needed depends on:
- How much the window can grow when it grows. Windows grow
exponentially, so you need (RTT*receiver-rate)/2; /2 because when the
window grows, the first half is already done and is dropping in at
receiver rate as the ACKs come by.


Let's imagine your sender is 100GE connected and your receiver is
10GE connected, and you want to achieve a 10Gbps single-flow rate.

10ms RTT: 12.5MB window size; worst case you need to buffer the
6.25MB of growth, less ~10%, because some of the growth you can send
on to the receiver instead of buffering all of it, so you'd need
5.5-6MB.
100ms RTT would be ~60MB.
200ms RTT would be ~120MB.
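
The arithmetic above as a quick sketch (my own code, reproducing the
same formula; the receiver rate and RTTs are the example's
assumptions):

  def buffer_bytes(rtt_s: float, receiver_bps: float) -> float:
      window = rtt_s * receiver_bps / 8   # receiver-rate BDP in bytes
      return window / 2 * 0.9             # half is growth; ~10% drains
                                          # to the receiver meanwhile

  for rtt in (0.010, 0.100, 0.200):
      mb = buffer_bytes(rtt, 10e9) / 1e6
      print(f"{rtt*1000:>4.0f} ms RTT -> ~{mb:.0f} MB")
  # ~6, ~56, ~113 MB: roughly the figures above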


Now decide the answer you want to give in your products for these:
at what RTT do you want to guarantee what single-flow maximum rate?

I do believe many of the CDNs are already using the estimated
receiver rate to grow windows, which basically removes the need for
buffering. But any standard cubic without tuning (i.e. every stock
OS) will burst window growth at line rate, creating the need for
buffering.

-- 
  ++ytti


Re: CPE/NID options

2023-11-27 Thread Saku Ytti
On Mon, 27 Nov 2023 at 21:45, Josh Luthman  wrote:

> Can you have an ethernet switch with dying gasp?
> Our ONTs (Calix, PON) have it but I don't see how you'd do it with
> ethernet.

At least via EFM-OAM you can have a dying gasp.

You could probably add it to autonegotiation, by sending some symbol.
There is already something similar in autonegotiation: it can inform
the far end when the local end is administratively shut down. That
is, if I have an A-B link, and B does 'shutdown' on the interface, A
could emit the syslog 'far-end administratively down'. This is
supported by many common PHYs, but for some reason I've never seen a
software implementation.
Of course this same 'admin down' signal could be reused by sending it
whenever you know you are going down. So an adventurous operator who
controls their environment could add this today with just code.

-- 
  ++ytti


Re: swedish dns zone enumerator

2023-11-02 Thread Saku Ytti
On Thu, 2 Nov 2023 at 10:32, Mark Andrews  wrote:

> You missed the point I was trying to make.  While I think that that source is 
> trying to enumerate some part of the namespace.  NS queries by themselves 
> don’t indicate an attack. Others would probably see the series of NS queries 
> as a signature of an attack when they are NOT.  There needs to be much more 
> than that to make that conclusion.

I might be reading this wrong, but I don't think the point Randy was
trying to make was 'NS queries are an attack', 'UDP packets are an
attack' or 'IP packets are an attack'. I base this on the list of
queries Randy decided to include as relevant to the thesis he was
making, instead of a wholesale warning about IP, UDP or NS queries.

-- 
  ++ytti


Re: Congestion/latency-aware routing for MPLS?

2023-10-18 Thread Saku Ytti
On Wed, 18 Oct 2023 at 17:39, Tom Beecher  wrote:

> Auto-bandwidth won't help here if the bandwidth reduction is 'silent' as 
> stated in the first message. A 1G interface , as far as RSVP is concerned, is 
> a 1G interface, even if radio interference across it means it's effectively a 
> 500M link.

Jason also explained the TWAMP + latency solution, which is an
active solution and doesn't rely on the operator or auto-bandwidth
providing information: the network automatically measures latency and
encodes this information in IS-IS, allowing traffic engineering to
automatically choose the lowest-latency path for an LSP.
I believe Jason's proposal is exactly what the OP is looking for.

-- 
  ++ytti


Re: MX204 tunnel services BW

2023-10-16 Thread Saku Ytti
On Mon, 16 Oct 2023 at 22:49,  wrote:

> JTAC says we must disable a physical port to allocate BW for tunnel-services. 
>  Also leaving tunnel-services bandwidth unspecified is not possible on the 
> 204.  I haven't independently tested / validated in lab yet, but this is what 
> they have told me.  I advised JTAC to update the MX204 "port-checker" tool 
> with a tunnel-services knob to make this caveat more apparent.

Did they explain why you need to disable the physical port? I'd love
to hear that explanation.

The MX204 is a single Trio EA, so you can't even waste SerDes sending
the packet to a remote PFE after the first lookup; it would only
bounce between the local XM/MQ and LU/XL, wasting that SerDes.

-- 
  ++ytti


Re: MX204 tunnel services BW

2023-10-16 Thread Saku Ytti
On Tue, 17 Oct 2023 at 00:28, Delong.com  wrote:


> The MX-204 appears to be an entirely fixed configuration chassis and looks 
> from the literature like it is based on pre-trio chipset technology. 
> Interesting that there are 100Gbe interfaces implemented with this seemingly 
> older technology, but yes, looks like the PFE on the MX-204 has all the same 
> restrictions as a DPC-based line card in other MX-series routers.

It is 100% normal Trio EA.

-- 
  ++ytti


Re: Add communities on direct routes in Juniper

2023-10-15 Thread Saku Ytti
Unfortunately not yet, as far as I know. A long time ago I gave this
feature request to my account team:

Title: Direct routes must support tag and or community
Platform:  Trio, priority MX80, MPC2
JunOS: 12.4Rx
Command:   'set interface ge-4/2.0 family inet address 10.42.42.1/24
tag|community X'
JTAC:  n/a
ER:
  - Router must be able to add tags/communities to direct routes directly, like
it does for static routes

Usage Case:
  Trivial way to signal route information to BGP. Often tag/community is used
  by service providers to signal 'this is PI/PA prefix, leak it to internet' or
  'this is backup route, reduce its MED'. However for some reason this is only
  supported for static routes, while the usage scenario and benefits are exactly
  the same for direct routes.

On Sun, 15 Oct 2023 at 15:27, Stanislav Datskevych via NANOG  wrote:
>
> Dear all,
>
> Is there a way to add BGP communities on direct (interface) routes in 
> Junipers? The task looks to be simple but the solution eludes me.
> In Cisco/Arista, for example, I could use "network 192.0.2.0/24 route-map 
> ".
>
> In Juniper it seems to be impossible. I even tried putting interface-routes 
> into rib-group with an import policy.
> But it seems the import policy only works on importing routes into Secondary 
> routing tables (e.g. inet.50), and not into the Primary one (inet.0).
>
> I know it's possible to add communities on later stage while announcing 
> networks to peers, in [protocols bgp group  export]. But I'd better 
> slap the community on the routes right when they're imported into RIB, not 
> when they announced to peers.
>
> Thanks in advance.
>


-- 
  ++ytti


Re: Using RFC1918 on Global table as Loopbacks

2023-10-05 Thread Saku Ytti
On Thu, 5 Oct 2023 at 20:45, Niels Bakker  wrote:

> The recommendation is to make Router-IDs globally unique. They're used
> in collision detection. What if you and a peer pick the same non
> globally unique address? Any session will never come up.

https://datatracker.ietf.org/doc/html/rfc6286

  If the BGP Identifiers of the peers involved in the connection
  collision are identical, then the connection initiated by the BGP
  speaker with the larger AS number is preserved

-- 
  ++ytti


Re: MX204 tunnel services BW

2023-10-03 Thread Saku Ytti
On Mon, 2 Oct 2023 at 20:21, Jeff Behrns via NANOG  wrote:

> Encountered an issue with an MX204 using all 4x100G ports and a logical
> tunnel to hairpin a VRF.  The tunnel started dropping packets around 8Gbps.
> I bumped up tunnel-services BW from 10G to 100G which made the problem
> worse; the tunnel was now limited to around 1.3Gbps.  To my knowledge with
> Trio PFE you shouldn't have to disable a physical port to allocate bandwidth
> for tunnel-services.  Any helpful info is appreciated.

You might have more luck in j-nsp.

But yes, you don't need any physical interface in Trio to do
tunneling. I can't explain your problem, and you probably need JTAC's
help. I would appreciate it if you'd circle back and tell us what the
problem was.

How it works is that when the PPE decides it needs to tunnel the
packet, the packet is sent back to the MQ via SerDes (which will then
send it again to some PPE, not the same one). I think what that
bandwidth command does is change the stream allocation; you should
see it in 'show  <#> stream'.

In theory, because a PPE can process a packet forever (well, until
the watchdog kills the PPE for thinking it is stuck), you could very
cheaply do outer+inner at the local PPE, but I think that would mean
certain features like QoS would not work on the inner interface. So I
think all this expensive recirculation and SerDes consumption is
there to satisfy a quite limited need, and it should be possible to
implement some 'performance mode' for tunneling, where these
MQ/XM-provided features are not available, but where the performance
cost in most cases is negligible.

In parallel to opening the JTAC case, you might want to experiment
with which FPC/PIC you set the tunneling bandwidth on. I don't
understand how the tunneling would work if the MQ/XM is remote; would
you then also steal fabric capacity every time you tunnel, not just
the MQ>LU>MQ>LU SerDes, but MQ>LU>MQ>FAB>MQ>LU? So intuitively I
would recommend ensuring you have the bandwidth configured at the
local PFE; if you don't know which PFE is local, just configure it
everywhere.
You could also consult various counters to see whether some stream or
fabric is congested, and whether these tunneled packets are being
sent over congested fabric, every time at lower fabric QoS.

I don't understand why the bandwidth command is a thing, and why you
can configure where it applies. To me it seems obvious they should
always handle tunneling strictly locally, never over fabric, because
you always end up stealing more capacity if you send it to a remote
MQ. That is, implicitly it should be on for every MQ, and every PPE
should tunnel via its local MQ.

-- 
  ++ytti


Re: maximum ipv4 bgp prefix length of /24 ?

2023-10-02 Thread Saku Ytti
On Sun, 1 Oct 2023 at 21:19, Matthew Petach  wrote:

> Unfortunately, many coders today have not read Godel, Escher, Bach: An 
> Eternal Golden Braid,
> and like the unfortunate Crab, consider their FIB compression algorithms to 
> be unbreakable[0].
>
> In short: if you count on FIB compression working at a compression ratio 
> greater than 1 in order for your network to function, you had better have a 
> good plan for what to do when your phone rings at 3am because your FIB has 
> just become incompressible.   ^_^;

I think if we make the argument 'devices must always work', no
device satisfies it today. There are already a lot of assumptions and
compromises which cause them to work 'highly likely, in most
practical scenarios'. Certainly if we were to try to formally prove
things, we could prove that everything is terrible: PPS in the
worst-case situation is beyond useless on devices people intuitively
consider 'wire speed'.

I fully agree that fundamentally FIB compression is not safe, but
also that ship has sailed; nothing we do is safe. But is it
marketable? The likely answer is resoundingly yes.

I do feel that people often underestimate the amount of risk they
carry, and overestimate the importance of the risks they understand,
since the vast majority of risks are carried without being
understood. But intuitively we like to think we have good visibility
into our risks, so any recognised risk automatically becomes an
important risk.
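
For a flavour of what such compression does, my own minimal
illustration (one simple scheme: drop more-specifics whose next-hop
matches their covering prefix; real implementations are far more
involved):

  import ipaddress

  rib = {
      "10.0.0.0/8":   "A",
      "10.1.0.0/16":  "A",  # redundant: same next-hop as covering 10/8
      "10.2.0.0/16":  "B",  # must stay
      "192.0.2.0/24": "C",
  }

  def compress(rib):
      fib = {}
      # insert covering prefixes first, then drop redundant more-specifics
      for prefix, nh in sorted(
              rib.items(),
              key=lambda kv: ipaddress.ip_network(kv[0]).prefixlen):
          net = ipaddress.ip_network(prefix)
          covers = [p for p in fib
                    if ipaddress.ip_network(p).supernet_of(net)]
          best = max(covers, default=None,
                     key=lambda p: ipaddress.ip_network(p).prefixlen)
          if best is None or fib[best] != nh:
              fib[prefix] = nh
      return fib

  print(compress(rib))  # 10.1.0.0/16 compressed away; a next-hop
                        # change can make the same FIB incompressible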

-- 
  ++ytti


Re: maximum ipv4 bgp prefix length of /24 ?

2023-10-01 Thread Saku Ytti
On Sun, 1 Oct 2023 at 06:07, Owen DeLong via NANOG  wrote:

> Not sure why you think FIB compression is a risk or will be a mess. It’s a 
> pretty straightforward task.

Also, people falsely assume that the parts they don't know about are
risk-free and simple.

While in reality there are tons of proprietary engineering choices
made so that devices perform in expected environments, not in
arbitrary environments. So already today you could in many cases
construct a specific FIB which exposes these compromises and makes
devices not perform. There are dragons everywhere, but we can remain
largely ignorant of them, as these engineering choices tend to be
reasonable. Sometimes they are abused by shops like EANTC and Miercom
for marketing reasons, in ostensibly 'independent' tests.

I think this compression is part of this continuum: magic inside the
box that I hope works, because I can't begin to have a comprehensive
understanding of exactly how much risk I am carrying.

Pretty much all performant boxes no longer have the bandwidth to
store all packets in memory (partial buffering), and many of them
have 'hot' and 'cold' prefixes. You just have to hope; you're not
going to be able to prove anything, and by trying to do so, you're
more likely to increase your costs through false positives than to
find an actionable problem. Most problems don't matter, and figuring
out which problems need to be fixed is hard.

-- 
  ++ytti


Re: maximum ipv4 bgp prefix length of /24 ?

2023-09-30 Thread Saku Ytti
On Sat, 30 Sept 2023 at 09:42, Mark Tinka  wrote:

> > But when everybody upgrades, memory and processor unit prices
> > decrease.. Vendors gain from demand.
> >
> I am yet to see that trend...

Indeed. If you look at Juniper's 10-K/10-Q filings, their business is
fairly stable in revenue and ports sold, so a 1GE port costs about
the same as a 1TE port, not more, not less. If there were a reduction
in port prices over time, then revenue would have to go down, or
ports sold up.
Of course all this makes perfect sense: the sand order doesn't affect
the sand price; all the cost is in people thinking about how the sand
should be ordered and then designing machines which put the sand
together.


-- 
  ++ytti


Re: maximum ipv4 bgp prefix length of /24 ?

2023-09-30 Thread Saku Ytti
On Fri, 29 Sept 2023 at 23:43, William Herrin  wrote:

> My understanding of Juniper's approach to the problem is that instead
> of employing TCAMs for next-hop lookup, they use general purpose CPUs
> operating on a radix tree, exactly as you would for an all-software

They use proprietary NPUs with a proprietary instruction
architecture, called 'Trio'.

A single Trio can have hundreds of PPEs, packet processing engines;
these are all identical.

Packets are sprayed to the PPEs. PPEs do not run in constant time, so
reordering always occurs.

Juniper is a pioneer in FIB-in-DRAM, and has patent-gated it to a
degree. So it takes a very, very long time to get an answer from
memory.

To amortise this, PPEs have a lot of threads, and while waiting for
memory, another packet is worked on. But there is no pre-emption;
there is no moving of registers/memory around and no cache misses
here as a function of FIB size. A PPE does all the work it has, then
requests an answer from memory, then goes to sleep, then comes back
when the answer arrives and does all the work it has, never
pre-empted.

But there is a lot more complexity here. The memory in the original
Trio was RLDRAM, which was a fairly simple setup. Once they changed
to HMC, they added a cache in front of the memory, a proprietary chip
called CAE. IFLs were dynamically allocated to one of multiple CAEs
they'd use to access memory. A single CAE wouldn't have 'wire rate'
performance. So if you had a pathological setup, like 2 IFLs, and you
got unlucky, on some boots both IFLs would be assigned to the same
CAE instead of being spread across two CAEs, and on those boots you
would see lower PPS performance than on others, because you were
hot-banking the CAE. This is the only type of cache problem I can
recall related to Juniper.

But these devices are entirely proprietary, things move relatively
fast, and complexity increases all the time.

> router. This makes each lookup much slower than a TCAM can achieve.
> However, that doesn't matter much: the lookup delays are much shorter
> than the transmission delays so it's not noticeable to the user. To

In DRAM lookups, like what Juniper does, most of the time you're
waiting for the memory. With DRAM, FIB size is a trivial engineering
problem; memory bandwidth and latency are the hard problems. Juniper
does not do TCAMs on its service-provider-class devices.



-- 
  ++ytti


Re: maximum ipv4 bgp prefix length of /24 ?

2023-09-28 Thread Saku Ytti
On Fri, 29 Sept 2023 at 08:24, William Herrin  wrote:

> Maybe. That's where my comment about CPU cache starvation comes into
> play. I haven't delved into the Juniper line cards recently so I could
> easily be wrong, but if the number of routes being actively used
> pushes past the CPU data cache, the cache miss rate will go way up and
> it'll start thrashing main memory. The net result is that the
> achievable PPS drops by at least an order of magnitude.

When you say you've not delved into the Juniper line cards recently,
which specific Juniper line card does your comment apply to?
-- 
  ++ytti


Re: what is acceptible jitter for voip and videoconferencing?

2023-09-20 Thread Saku Ytti
On Wed, 20 Sept 2023 at 19:06, Chris Boyd  wrote:

> We run Teams Telephony in $DAYJOB, and it does use SILK.
>
> https://learn.microsoft.com/en-us/microsoftteams/platform/bots/calls-and-meetings/real-time-media-concepts

Looks like codecs are still rapidly evolving in walled gardens. I
just learned about 'Satin'.

https://en.wikipedia.org/wiki/Satin_(codec)

https://ibb.co/jfrD6yk - notice 'payload description' from the Teams
admin portal. So at least in some cases Teams switches from SILK to
Satin; the wiki suggests 1-on-1 calls only, but I can't confirm or
deny this.

--
  ++ytti


Re: what is acceptible jitter for voip and videoconferencing?

2023-09-20 Thread Saku Ytti
On Wed, 20 Sept 2023 at 03:15, Dave Taht  wrote:

> I go back many, many years as to baseline numbers for managing voip networks, 
> including things like CISCO LLQ, diffserv, fqm prioritizing vlans, and running
> voip networks entirely separately... I worked on codecs, such as oslec, and 
> early sip stacks, but that was over 20 years ago.

I don't believe LLQ has utility in hardware-based routers; packets
stay inside hardware-based routers for single-digit microseconds,
with nanoseconds of jitter. For software-based devices, I'm sure the
situation is different.
Practical example: a tier 1 network running 3 vendors, with no LLQ,
can carry traffic across the globe with lower jitter (microseconds)
than I get pinging 127.0.0.1 on my M1 laptop, because I have to do
context switches and the network does not. This is in the BE queue,
measured in real operation over long periods, without any engineering
effort to achieve low jitter.

> The thing is, I have been unable to find much research (as yet) as to why my 
> number exists. Over here I am taking a poll as to what number is most correct 
> (10ms, 30ms, 100ms, 200ms),

I know there are academic papers as well as vendor graphs showing
the impact of jitter on quality. Here is one:
https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses
- this appears to roughly say that '20ms' G.711 is fine. But I'm sure
this is actually very complex to answer well, and I'm sure the choice
of codec greatly impacts the answer; for example WhatsApp uses Opus,
and Skype uses SILK (maybe Teams too?). And there are many more
rare/exotic codecs optimised for very specific scenarios, like
massive packet loss.

-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-10 Thread Saku Ytti
On Sat, 9 Sept 2023 at 21:36, Benny Lyne Amorsen
 wrote:

> The Linux TCP stack does not immediately start backing off when it
> encounters packet reordering. In the server world, packet-based
> round-robin is a fairly common interface bonding strategy, with the
> accompanying reordering, and generally it performs great.

If you have
Linux - 1RU cat-or-such - Router - Internet

Mostly, round-robin between Linux and the 1RU is gonna work, because
it satisfies the requirements: a) not congested, b) equal RTT, c)
non-distributed (a single-pipeline ASIC switch honors ingress order on
egress). But it is quite a special case, and of course there is only
round-robin on one link in one direction.

Between 3.6 and 4.4 all multipath in Linux was broken, and I still to
this day help people with multipath problems, complaining it doesn't
perform (in a LAN!).

3.6 introduced a FIB to replace the flow-cache, making multipath essentially random
4.4 replaced random with hash

When I ask them 'do you see reordering', people mostly reply 'no',
because they look at a PCAP and it doesn't look important to the human
observer, it is such an insignificant amount. Invariably the problem
goes away with hashing. (netstat -s is better than intuition on PCAP.)
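
To make the netstat point concrete, a minimal sketch (assuming a Linux
host; the exact counter names, like TCPSACKReorder, vary by kernel
version) that pulls the reordering counters out of /proc/net/netstat,
which is where netstat -s gets them:

# /proc/net/netstat comes in header/value line pairs per extension (TcpExt, IpExt)
with open('/proc/net/netstat') as f:
    lines = f.read().splitlines()

for header, values in zip(lines[::2], lines[1::2]):
    for key, val in zip(header.split()[1:], values.split()[1:]):
        if 'Reorder' in key and int(val) > 0:
            print(key, val)

If those counters keep climbing while the PCAP 'looks fine' to the
eye, you have your answer.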


-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-08 Thread Saku Ytti
On Fri, 8 Sept 2023 at 09:17, Mark Tinka  wrote:

> > Unfortunately that is not strict round-robin load balancing.
>
> Oh? What is it then, if it's not spraying successive packets across
> member links?

I believe the suggestion is that round-robin out-performs random
spray. Random spray is what the HPC world is asking for, not
round-robin. Now I've not operated a network where per-packet is
useful, so I'm not sure why you'd want round-robin over random spray,
but I can easily see why you'd want either a) random traffic or b)
random spray. If neither is true, i.e. you have strict round-robin and
non-random traffic, say every other packet is a big data delivery and
every other packet is a small ACK, you can easily synchronise one link
to 100% util and another to near 0% with true round-robin, but not
with random spray (see the sketch below).
I don't see what downside random spray would have over round-robin,
but I wouldn't be shocked if there is one.
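
A toy simulation of that synchronisation effect, under the stated
assumptions (two links, traffic strictly alternating between 1500-byte
data packets and 64-byte ACKs; everything here is made up for
illustration):

import random

DATA, ACK = 1500, 64                  # bytes; strictly alternating traffic
packets = [DATA, ACK] * 100_000

def load(pick_link):
    # bytes per link for a given link-picking strategy
    sent = [0, 0]
    for i, size in enumerate(packets):
        sent[pick_link(i)] += size
    return sent

print('round-robin :', load(lambda i: i % 2))                # [150000000, 6400000]
print('random spray:', load(lambda i: random.randrange(2)))  # roughly even split

With true round-robin one link carries ~96% of the bytes while the
other idles; random spray splits the bytes evenly in expectation.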


I see this thread is mostly starting to loop around two debates

1) Reordering is not a problem
   - if you control the application, you can make it zero problem
   - if you use the TCP shipping in Android, iOS, macOS, Windows,
Linux and BSD, reordering is in practice as bad as packet loss
   - the people on this list who know this don't know it because they
read it; they know it because they got caught with their pants down
and learned it, when reordering destroyed their TCP performance even
at very low reorder rates
   - we could design TCP congestion control that is very tolerant to
reordering, but I cannot say if it would be an overall win or loss

2) Reordering won't happen with per-packet, if there is no congestion
and latencies are equal
   - receiving distributed routers (~all of them) do not have global
synchronisation; they make no guarantee that ingress order is honored
on egress when ingress is >1 interface, and the amount of reordering
this alone causes will destroy customer expectations of TCP
performance
   - we could quite easily guarantee order as long as interfaces are
in the same hardware complex, but it would be very difficult to
guarantee between hardware complexes


-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-07 Thread Saku Ytti
On Thu, 7 Sept 2023 at 15:45, Benny Lyne Amorsen
 wrote:

> Juniper's solution will cause way too much packet reordering for TCP to
> handle. I am arguing that strict round-robin load balancing will
> function better than hash-based in a lot of real-world
> scenarios.

And you will be wrong. A packet arriving out of order will make the
host consider the previous packet lost, and the host will signal the
need for a resend.

-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-07 Thread Saku Ytti
On Thu, 7 Sept 2023 at 00:00, David Bass  wrote:

> Per packet LB is one of those ideas that at a conceptual level are great, but 
> in practice are obvious that they’re out of touch with reality.  Kind of like 
> the EIGRP protocol from Cisco and using the load, reliability, and MTU 
> metrics.

Those multi-metrics are in ISIS as well (if you don't use wide). And I
agree those are not for common cases, but I wouldn't be shocked if
someone has a legitimate multi-topology routing use-case where
different metric-type topologies are very useful. But as long as we
keep the context to the Internet, true.

100% reordering does not work for the Internet, not without changing
all end hosts. And by changing those, it's not immediately obvious how
we end up in a better place; if we wait a bit longer to signal
packet-loss, we likely end up in a worse place, as reordering is just
so dang rare today, because congestion control choices have made sure
no one reorders, or customers will yell at you, yet packet-loss
remains common.
Perhaps if congestion control used latency or FEC instead of loss, we
could tolerate reordering while not underperforming under loss, but
I'm sure in the decades following that decision we'd learn new ways in
which we don't understand any of this.

But for non-internet applications, where you control the hosts,
per-packet is used and needed; I think HPC applications, GPU farms
etc. are the users who asked JNPR to implement this.



-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-06 Thread Saku Ytti
On Wed, 6 Sept 2023 at 19:28, Mark Tinka  wrote:

> Yes, this has been my understanding of, specifically, Juniper's
> forwarding complex.

Correct, a packet is sprayed to some PPE, and PPEs do not run in
deterministic time; after the PPEs there is a reorder block that
restores order within a flow, if it has to.
EZchip is the same with its TOPs.

> Packets are chopped into near-same-size cells, sprayed across all
> available fabric links by the PFE logic, given a sequence number, and
> protocol engines ensure oversubscription is managed by a request-grant
> mechanism between PFE's.

This isn't the mechanism that causes reordering; it's the ingress and
egress lookup, where a Packet or PacketHead is sprayed to some PPE,
where reordering can occur.

Can find some patents on it:
https://www.freepatentsonline.com/8799909.html
When a PPE 315 has finished processing a header, it notifies a Reorder
Block 321. The Reorder Block 321 is responsible for maintaining order
for headers belonging to the same flow, and pulls a header from a PPE
315 when that header is at the front of the queue for its reorder
flow.

Note this reordering happens even when you have exactly 1 ingress
interface and exactly 1 egress interface; as long as you have enough
PPS, you will reorder outside flows, even without the fabric being
involved.

-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-06 Thread Saku Ytti
On Wed, 6 Sept 2023 at 17:10, Benny Lyne Amorsen
 wrote:

> TCP looks quite different in 2023 than it did in 1998. It should handle
> packet reordering quite gracefully; in the best case the NIC will

I think the opposite is true: TCP was designed to be order agnostic.
But everyone uses cubic, and for cubic reordering is the same as
packet loss. This is a good trade-off. You need to decide if you want
to recover fast from occasional packet loss, or if you want to be
tolerant of reordering.
The moment a cubic receiver gets a segment one ahead of the one it
expects, it acks the previous segment again, signalling packet loss
and causing an unnecessary resend and window size reduction.
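
A toy illustration of that duplicate-ACK behaviour (cumulative ACKs
only, ignoring SACK and everything else real TCP does):

def acks_for(segments):
    # a cumulative-ACK receiver always acks the next in-order segment it expects
    received, expected, acks = set(), 1, []
    for seg in segments:
        received.add(seg)
        while expected in received:
            expected += 1
        acks.append(expected)
    return acks

print(acks_for([1, 2, 3, 4]))  # [2, 3, 4, 5] - in order, clean ACKs
print(acks_for([1, 2, 4, 3]))  # [2, 3, 3, 5] - reorder looks like loss of 3

Enough of those duplicate ACKs and the sender retransmits and shrinks
its window, even though nothing was actually lost.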

> will never even know they were reordered. Unfortunately current
> equipment does not seem to offer per-packet load balancing, so we cannot
> test how well it works.

For example Juniper offers true per-packet, I think mostly used in
high performance computing.

-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-06 Thread Saku Ytti
On Wed, 6 Sept 2023 at 10:27, Mark Tinka  wrote:

> I recognize what happens in the real world, not in the lab or text books.

Fun fact about the real world: devices do not internally guarantee
order. That is, even if you have identical-latency links and 0
congestion, order is not guaranteed between packet1 coming from
ingress interface I1 and packet2 coming from ingress interface I2;
which packet first goes out of egress interface E1 is unspecified.
This is because packets inside the lookup complex can be sprayed to
multiple lookup engines, and order is lost even for packets coming
from interface1 exclusively; after the lookup the order is restored
per _flow_, not between flows, so packets coming from interface1 with
random ports won't leave interface2 in the same order.

So order is only restored inside a single lookup complex (interfaces
are not guaranteed to be in the same complex) and only for actual
flows.

It is designed this way because no one runs networks which rely on
order outside these parameters, and no one even knows their kit works
like this, because they don't have to.



-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-02 Thread Saku Ytti
On Fri, 1 Sept 2023 at 22:56, Mark Tinka  wrote:

> PTX1000/10001 (Express) offers no real configurable options for load
> balancing the same way MX (Trio) does. This is what took us by surprise.

What in particular are you missing?

As I explained, PTX/MX both allow, for example, speculating on transit
pseudowires having a CW on them, which is non-default and requires
'zero-control-word'. You should be looking at 'hash-key' on PTX and
'enhanced-hash-key' on MX. You don't appear to have a single stanza
configured, but I do wonder what you wanted to configure when you
noticed the missing ability to do so.

-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-01 Thread Saku Ytti
On Fri, 1 Sept 2023 at 18:37, Lukas Tribus  wrote:

> On the hand a workaround at the edge at least for EoMPLS would be to
> enable control-word.

Juniper LSR can actually do heuristics on pseudowires with CW.

-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-01 Thread Saku Ytti
On Fri, 1 Sept 2023 at 16:46, Mark Tinka  wrote:

> Yes, this was our conclusion as well after moving our core to PTX1000/10001.

Personally I would recommend turning off LSR payload heuristics,
because there is no accurate way for an LSR to tell what the label is
carrying, and a wrong guess, while rare, will be extremely hard to
root-cause, because you will never hear of it: the person suffering
from it is too many hops away for the problem to be on your horizon.
I strongly believe the edge imposing entropy or FAT labels is the
right way to give an LSR hashing hints.


-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-01 Thread Saku Ytti
On Fri, 1 Sept 2023 at 14:54, Mark Tinka  wrote:

> When we switched our P devices to PTX1000 and PTX10001, we've had
> surprisingly good performance of all manner of traffic across native
> IP/MPLS and 802.1AX links, even without explicitly configuring FAT for
> EoMPLS traffic.

PTX and MX as LSR look inside the pseudowire to see if it's IP (a
dangerous guess for an LSR to make); CSR/ASR9k does not. So PTX and MX
LSRs will balance your pseudowire even without FAT. I've had no
problem having an ASR9k LSR balance FAT PWs.

However this is a bit of a sidebar, because the original problem is
about elephant flows, which FAT does not help with. But adaptive
balancing does.


-- 
  ++ytti


Re: Lossy cogent p2p experiences?

2023-09-01 Thread Saku Ytti
On Thu, 31 Aug 2023 at 23:56, Eric Kuhnke  wrote:

> The best working theory that several people I know in the neteng community 
> have come up with is because Cogent does not want to adversely impact all 
> other customers on their router in some sites, where the site's upstreams and 
> links to neighboring POPs are implemented as something like 4 x 10 Gbps. In 
> places where they have not upgraded that specific router to a full 100 Gbps 
> upstream. Moving large flows >2Gbps could result in flat topping a traffic 
> chart on just 1 of those 10Gbps circuits.

It is a very plausible theory, and everyone has this problem to a
lesser or greater degree. There was a time when edge interfaces were
much lower capacity than backbone interfaces, but I don't think that
time will ever come back. So this problem is systemic.
Luckily there is quite a reasonable solution to the problem, called
'adaptive load balancing', where software monitors balancing, and
biases the hash_result => egress_interface tables to improve balancing
when dealing with elephant flows.
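
A minimal sketch of the idea (my illustration, not any vendor's
implementation): flows still hash to buckets as usual, but a
background process re-points buckets from the hottest member link to
the coldest one:

NUM_BUCKETS = 256
LINKS = [0, 1, 2, 3]                                  # member link ids
bucket_to_link = [b % len(LINKS) for b in range(NUM_BUCKETS)]
bucket_bytes = [0] * NUM_BUCKETS                      # measured per-bucket load

def rebalance():
    load = {l: 0 for l in LINKS}
    for b, l in enumerate(bucket_to_link):
        load[l] += bucket_bytes[b]
    hot = max(LINKS, key=load.get)
    cold = min(LINKS, key=load.get)
    # move the busiest bucket off the hottest link; only that bucket's
    # flows see a one-off reorder, every other flow keeps its path
    hot_buckets = [b for b, l in enumerate(bucket_to_link) if l == hot]
    bucket_to_link[max(hot_buckets, key=lambda b: bucket_bytes[b])] = cold

bucket_bytes[7] = 10**9        # an elephant flow lands in bucket 7 (link 3)
rebalance()
print(bucket_to_link[7])       # bucket 7 re-pointed to an idle link

Only the remapped bucket risks a momentary reorder, instead of the
elephant flow flat-topping one member link forever.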

-- 
  ++ytti


Re: JunOS config yacc grammar?

2023-08-22 Thread Saku Ytti
On Tue, 22 Aug 2023 at 03:30, Lyndon Nerenberg (VE7TFX/VE6BBM)
 wrote:

> Because I've been writing yacc grammars for decades.  I just wanted to
> see if someone had already done it, as that would save me some time.
> But if there's nothing out there I'll just roll one myself.

I sympathise with your problem and I've always wanted vendors to
publish their parsers; there are many use cases.

But as no such thing exists, this avenue of attack seems very
problematic, unless this whole network lives and dies with you. If
not, then your feature velocity now depends on someone adding support
for new keywords to the parser, and no one who comes after you will
thank you for adding this dependency to the process. But they might
call you and pay stupid money for a 5 min job, so maybe it is a great
idea.

-- 
  ++ytti


rfc5837 in the wild?

2023-08-04 Thread Saku Ytti
Does anyone have a traceroute path example where transit responds with
RFC5837 EO?

https://github.com/8enet/traceroute/blob/master/traceroute/extension.c#L101

Output should be '2/x: '

At least JNPR seems to support this:
https://www.juniper.net/documentation/us/en/software/junos/transport-ip/topics/topic-map/icmp.html
- although support may be just QFX5100, documentation is ambiguous.
There is also a patch (
https://lore.kernel.org/all/6a7f33a5-13ca-e009-24ac-fde59fb1c...@gmail.com/T/
) for linux, but it's not included in the kernel.

--
  ++ytti


Re: Test Dual Queue L4S (if you are on Comcast)

2023-06-17 Thread Saku Ytti
This seems worse :)

'we are collecting data about you, but didn't bother thinking if it is needed'

On Fri, 16 Jun 2023 at 22:55, Livingood, Jason via NANOG
 wrote:
>
> In the meantime please just select some unrelated industry on the form. We 
> don’t care – it seems to be boilerplate.
>
>
>
> From: "Livingood, Jason" 
> Date: Friday, June 16, 2023 at 15:46
> To: "Eric C. Miller" , nanog 
> Subject: Re: [EXTERNAL] RE: Test Dual Queue L4S (if you are on Comcast)
>
>
>
> We’re working to fix that. Sorry!
>
>
>
> From: "Eric C. Miller" 
> Date: Friday, June 16, 2023 at 15:18
> To: Jason Livingood , nanog 
> Subject: [EXTERNAL] RE: Test Dual Queue L4S (if you are on Comcast)
>
>
>
> FYI, when trying to sign up, it tells me that my input isn’t required because 
> I work in the telco industry.
>
>
>
> Eric
>
>
>
> From: NANOG  On Behalf Of 
> Livingood, Jason via NANOG
> Sent: Friday, June 16, 2023 2:30 PM
> To: nanog 
> Subject: Test Dual Queue L4S (if you are on Comcast)
>
>
>
> FYI that today we (Comcast) have announced the start of low latency 
> networking (L4S) field trials. If you are a customer and would like to 
> volunteer, please visit this page.
>
>
>
> For more info, there is a blog post that just went up at 
> https://corporate.comcast.com/stories/comcast-kicks-off-industrys-first-low-latency-docsis-field-trials
>
>
>
> We anticipate testing with several different cable modems and a range of 
> applications that are marking. We plan to share detailed results of the trial 
> at IETF-118 in November.
>
>
>
> Any app developers interested in working with us can either email me 
> direction or low-latency-partner-inter...@comcast.com.
>
>
>
> Thanks!
> Jason
>
>
>
>
>
>
>
>
>
>



-- 
  ++ytti


Re: Do ISP's collect and analyze traffic of users?

2023-05-16 Thread Saku Ytti
I can't tell what large is. But I've worked for enterprise and
consumer ISPs, and none of the shops I worked for had the capability
to monetise the information they had. And the information they had
was increasingly low resolution. Infrastructure providers are
notoriously bad at monetising even their infra.

I'm sure some do monetise. But generally service providers are not
interesting, nor do they have active shareholders, so there is very
little pressure to make more money; hence firesales happen all the
time, as infrastructure is increasingly seen as a liability, not an
asset. They are generally boring companies, and internally no one has
an incentive to monetise data, as it wouldn't improve their personal
compensation. And regulations like GDPR create problems people would
rather not solve, unless pressured.

Technically, most people started 20 years ago with some netflow
sampling ratio, and they still use the same sampling ratio, despite
many orders of magnitude more packets. Meaning the share of flows
captured was previously magnitudes higher than today; today only very
few flows of a typical application are seen at all, and netflow is
largely for volumetric DDoS and high-level ingressAS=>egressAS
metrics.

Hardware offered increasingly does IPFIX as if it were sflow, that is,
0 cache, exported immediately after sampling, because you'd need
something like 1:100 or better resolution to have any significant
luck hitting the same flow twice. PTX has stopped supporting
flow-cache entirely because of this: at a sampling rate where the
cache would do something, the cache would overflow.

Of course there are other monetisation opportunities via mechanisms
other than data-in-the-wire, like DNS.


On Tue, 16 May 2023 at 15:57, Tom Beecher  wrote:
>
> Two simple rules for most large ISPs.
>
> 1. If they can see it, as long as they are not legally prohibited, they'll 
> collect it.
> 2. If they can legally profit from that information, in any way, they will.
>
> Now, ther privacy policies will always include lots of nice sounding clauses, 
> such as 'We don't see your personally identifiable information'. This of 
> course allows them to sell 'anonymized' sets of that data, which sounds great 
> , except as researchers have proven, it's pretty trivial to scoop up 
> multiple, discrete anonymized data sets, and cross reference to identify 
> individuals. Netflow data may not be as directly 'valuable' as other types of 
> data, but it can be used in the blender too.
>
> Information is the currency of the realm.
>
>
>
> On Mon, May 15, 2023 at 7:00 PM Michael Thomas  wrote:
>>
>>
>> And maybe try to monetize it? I'm pretty sure that they can be compelled
>> to do that, but do they do it for their own reasons too? Or is this way
>> too much overhead to be doing en mass? (I vaguely recall that netflow,
>> for example, can make routers unhappy if there is too much "flow").
>>
>> Obviously this is likely to depend on local laws but since this is NANOG
>> we can limit it to here.
>>
>> Mike
>>


-- 
  ++ytti


Re: Reverse DNS for eyeballs?

2023-04-21 Thread Saku Ytti
On Fri, 21 Apr 2023 at 20:44, Jason Healy via NANOG  wrote:


> This is not intended as snark: what do people recommend for IPv6?  I try to 
> maintain forward/reverse for all my server/infrastructure equipment.  But 
> clients?  They're making up temporary addresses all day long.  So far, I've 
> given up on trying to keep track of those addresses, even though it's a 
> network under my direct control.

Stateless generation at query time -
https://github.com/cmouse/pdns-v6-autorev/blob/master/rev.pl

I wrote some POCs quite a long time ago

http://p.ip.fi/L5PK - base36
http://p.ip.fi/CAtB - rfc2289
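
The idea in those POCs, as a minimal sketch (my illustration of the
stateless approach, not the pdns-v6-autorev code itself; the zone name
is made up): derive the PTR name from the address itself, so any
client address resolves with zero per-host state, and the matching
AAAA can be recomputed from the name:

import ipaddress, string

ALPHABET = string.digits + string.ascii_lowercase   # base36

def name_for(addr, zone='dyn.example.net'):
    # encode the 128-bit address as a base36 token; reversible, so the
    # forward record can be synthesized back from the PTR target
    n = int(ipaddress.IPv6Address(addr))
    token = ''
    while n:
        n, r = divmod(n, 36)
        token = ALPHABET[r] + token
    return '%s.%s' % (token or '0', zone)

def addr_for(name):
    return ipaddress.IPv6Address(int(name.split('.')[0], 36))

print(name_for('2001:db8::1:2:3'))
print(addr_for(name_for('2001:db8::1:2:3')))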

-- 
  ++ytti


Re: 1.1.1.1 support?

2023-03-22 Thread Saku Ytti
On Wed, 22 Mar 2023 at 16:04, Alexander Huynh via NANOG  wrote:

> I'll take this feedback to our developers.

Many thanks.

> I took a look at the above tickets, and it seems that one of the egress
> ranges from that datacenter cannot connect to the authoritative
> nameservers of `www.moi.gov.cy`: `ns01.gov.cy` and `ns02.gov.cy`.
>
> Here's a redacted pcap for those who like details, showing no response:
>
>  IP a.b.c.d.56552 > 212.31.118.19.53: 51873+ [1au] A? www.moi.gov.cy. (55)
>  IP a.b.c.d.51718 > 212.31.118.20.53: 31021+ [1au] A? www.moi.gov.cy. (55)
>
> TCP behaves similarly.

The recursor response suggests a loop, so a network problem is highly likely.

> I'm filing an internal ticket right now to investigate, but I'd
> appreciate if you could also help us on your end for any possible
> solutions regarding this connectivity failure.

Sure. You might also want to look into the NLNOG RING, which gives a
broad perspective on issues.

> As a general note regarding the two community posts: the straight deep
> dive into technical information makes it more difficult for others to
> interpret the request. As you said in a later post here:

This is a very difficult subject: how to get help. If I had made it
more generic, we could refute it as not containing the needed
information. If I had made it longer, we could refute it as not terse
enough. However we submit it, we can argue it wasn't the right way.
As seen in the original post, I fully appreciate that almost every
single case about 1.1.1.1 is incorrect and user error. But I proposed
a mechanism to bypass community forums and reach people who are able
to help and understand. If there is disagreement between 1.1.1.1,
8.8.8.8 and 9.9.9.9, then let humans analyse it. The ticket volume
would be trivial, if we look at community forums and see how many
1.1.1.1 complaints would get past this filter.

> Not everyone in the Community Forum (nor our company) can pull out the
> specific datacenter used, the specific machine(s) used, and the source
> ASN from the `my.ip.fi` curl.

I gave the specific unicast ID for the DNS server in addition to my
IP. I cannot glean any other information.

I don't think we can fairly fault either of the cases in the community
forum. We must fault the process itself and look for ways to improve.
-- 
  ++ytti


Re: 1.1.1.1 support?

2023-03-22 Thread Saku Ytti
Yes, it works at every other CF site except LCA. Thank you for the
additional data point.

You can use `dig CHAOS TXT id.server @1.1.1.1 +nsid` to get two
unicast identifiers for the server you got the response from.

On Wed, 22 Mar 2023 at 15:49, Josh Luthman  wrote:
>
> Try asking dns-operati...@lists.dns-oarc.net for someone at CloudFlare.
>
> For what it's worth, it works for me.  I'm in Troy, OH.
>
> C:\Users\jluthman>dig www.moi.gov.cy @1.1.1.1 +short
> 212.31.118.26
>
>
> On Wed, Mar 22, 2023 at 9:43 AM Saku Ytti  wrote:
>>
>>
>>
>> On Wed, 22 Mar 2023 at 15:26, Matt Harris  wrote:
>>
>>>
>>> When something is provided at no cost, I don't see how it can be unethical 
>>> unless they are explicitly lying about the ways in which they use the data 
>>> they gather.
>>> Ultimately, you're asking them to provide a costly service (support for 
>>> end-users, the vast majority of whom will not ask informed, intelligent 
>>> questions like the members of this list would be able to, but would still 
>>> demand the same level of support) on top of a service they are already 
>>> providing at no cost. That's both unrealistic and unnecessary. There's an 
>>> exceedingly simple solution, here, after all: if you don't like their 
>>> service or it isn't working for you as an end-user, don't use it.
>>
>>
>> Thank you for the philosophical perspective, but currently my interest is 
>> not to debate merits or lack thereof in laissez-faire economics.
>>
>> The problem is, a large number of people will use 1.1.1.1, 8.8.8.8 or 
>> 9.9.9.9 despite my or your position about it. There is incentive for 
>> providers to provide it 'for free', as it adds value to their products as 
>> users are compensating providers with the data.
>>
>> Occasionally things don't work and when they do not, we need a way to inform 
>> the provider 'hey you have a problem'. You could be anywhere in this chain, 
>> with no ability to impact any of the decisions.
>>
>> I know there is a real problem, I know real users are impacted, I know 
>> almost none of them will have the ability to understand why there is a 
>> problem or remediate it.
>>
>> --
>>   ++ytti



--
  ++ytti


Re: 1.1.1.1 support?

2023-03-22 Thread Saku Ytti
On Wed, 22 Mar 2023 at 15:26, Matt Harris  wrote:


> When something is provided at no cost, I don't see how it can be unethical
> unless they are explicitly lying about the ways in which they use the data
> they gather.
> Ultimately, you're asking them to provide a costly service (support for
> end-users, the vast majority of whom will not ask informed, intelligent
> questions like the members of this list would be able to, but would still
> demand the same level of support) on top of a service they are already
> providing at no cost. That's both unrealistic and unnecessary. There's an
> exceedingly simple solution, here, after all: if you don't like their
> service or it isn't working for you as an end-user, don't use it.
>

Thank you for the philosophical perspective, but currently my interest is
not to debate merits or lack thereof in laissez-faire economics.

The problem is, a large number of people will use 1.1.1.1, 8.8.8.8 or
9.9.9.9 despite my or your position about it. There is incentive for
providers to provide it 'for free', as it adds value to their products as
users are compensating providers with the data.

Occasionally things don't work and when they do not, we need a way to
inform the provider 'hey you have a problem'. You could be anywhere in this
chain, with no ability to impact any of the decisions.

I know there is a real problem, I know real users are impacted, I know
almost none of them will have the ability to understand why there is a
problem or remediate it.

-- 
  ++ytti


Re: 1.1.1.1 support?

2023-03-22 Thread Saku Ytti
If you wish to consult people on how to configure DNS, please reach
out to the responsible folk.

I am discussing a specific recursor in an anycasted setup not
resolving a domain, and a provider offering no remediation channel.

These are two entirely different classes of problem and collapsing
them into a single problem is not going to help in either case.

On Wed, 22 Mar 2023 at 12:25, Mark Andrews  wrote:
>
> What about the zone not having a single point of failure?  Both servers
> are covered by the same /24.
>
> % dig www.moi.gov.cy @212.31.118.19 +norec +dnssec
>
> ; <<>> DiG 9.19.11-dev <<>> www.moi.gov.cy @212.31.118.19 +norec +dnssec
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17380
> ;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 3
>
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4096
> ; COOKIE: 6387183a6031ef182fa6ade7641ad4ff2a078213f4e24fc9 (good)
> ;; QUESTION SECTION:
> ;www.moi.gov.cy. IN A
>
> ;; ANSWER SECTION:
> www.moi.gov.cy. 3600 IN A 212.31.118.26
>
> ;; AUTHORITY SECTION:
> moi.gov.cy. 3600 IN NS ns01.gov.cy.
> moi.gov.cy. 3600 IN NS ns02.gov.cy.
>
> ;; ADDITIONAL SECTION:
> ns02.gov.cy. 86400 IN A 212.31.118.20
> ns01.gov.cy. 86400 IN A 212.31.118.19
>
> ;; Query time: 374 msec
> ;; SERVER: 212.31.118.19#53(212.31.118.19) (UDP)
> ;; WHEN: Wed Mar 22 21:14:23 AEDT 2023
> ;; MSG SIZE  rcvd: 157
>
> %
>
> > On 22 Mar 2023, at 19:36, Saku Ytti  wrote:
> >
> > Am I correct to understand that 1.1.1.1 only does support via community 
> > forum?
> >
> > They had just enough interest in the service to collect user data to
> > monetise, but 0 interest in trying to figure out how to detect and
> > solve problems?
> >
> > Why not build a web form where they ask you to explain what is not
> > working, in terms of automatically testable. Like no A record for X.
> > Then after you submit this form, they test against all 1.1.1.1 and
> > some 9.9.9.9 and 8.8.8.8 and if they find a difference in behaviour,
> > the ticket is accepted and sent to someone who understands DNS? If
> > there is no difference in behaviour, direct people to community
> > forums.
> > This trivial, cheap and fast to produce support channel would ensure
> > virtually 0 trash support cases, so you wouldn't even have to hire
> > people to support your data collection enterprise.
>
> The number of times that 8.8.8.8 “works” but there is an actual error
> is enormous.  8.8.8.8 tolerates lots of protocol errors which ends up
> causing support cases for others where the result is “the servers are
> broken in this way”.  You then try to report the issue but the report
> is ignored because “It works with 8.8.8.8”.
>
> > Very obviously they selfishly had no interest in ensuring 1.1.1.1
> > actually works, as long as they are getting the data. I do not know
> > how to characterise this as anything but unethical.
> >
> > https://community.cloudflare.com/t/1-1-1-1-wont-resolve-www-moi-gov-cy-in-lca-235m3/487469
> > https://community.cloudflare.com/t/1-1-1-1-failing-to-resolve/474228
> >
> > If you can't due to resources or competence support DNS, do not offer one.
> >
> > --
> >  ++ytti, cake having and cake eating user
>
> --
> Mark Andrews, ISC
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742  INTERNET: ma...@isc.org
>


-- 
  ++ytti


1.1.1.1 support?

2023-03-22 Thread Saku Ytti
Am I correct to understand that 1.1.1.1 only does support via community forum?

They had just enough interest in the service to collect user data to
monetise, but 0 interest in trying to figure out how to detect and
solve problems?

Why not build a web form where they ask you to explain what is not
working, in terms of automatically testable. Like no A record for X.
Then after you submit this form, they test against all 1.1.1.1 and
some 9.9.9.9 and 8.8.8.8 and if they find a difference in behaviour,
the ticket is accepted and sent to someone who understands DNS? If
there is no difference in behaviour, direct people to community
forums.
This trivial, cheap and fast to produce support channel would ensure
virtually 0 trash support cases, so you wouldn't even have to hire
people to support your data collection enterprise.

Very obviously they selfishly had no interest in ensuring 1.1.1.1
actually works, as long as they are getting the data. I do not know
how to characterise this as anything but unethical.

https://community.cloudflare.com/t/1-1-1-1-wont-resolve-www-moi-gov-cy-in-lca-235m3/487469
https://community.cloudflare.com/t/1-1-1-1-failing-to-resolve/474228

If you can't due to resources or competence support DNS, do not offer one.

-- 
  ++ytti, cake having and cake eating user


Re: Reverse Traceroute

2023-02-27 Thread Saku Ytti
On Mon, 27 Feb 2023 at 10:16, Rolf Winter  wrote:

> "https://downforeveryoneorjustme.com/;. But, somebody might use your
> server for this. How do people feel about this? Restrict the reverse
> traceroute operation to be done back to the source or allow it more
> freely to go anywhere?

What are the pros and cons of this? Let's call it the destination TLV.

If I am someone who wants to do a volumetric attack, I won't set any
destination TLV, because without a destination TLV, and by spoofing my
source, I get more leverage. If my source and destination TLV differ,
I have less leverage. So in this sense it adds no security
implications, but it adds a massive amount of diagnostic power, as one
very common request is for a traceroute between nodes you have no
access to.

What it would allow is port-knocking the ports used, through a proxy;
whether this matters or not might be debatable.

Perhaps the standard should consider some abilities to be default-on
and others default-off, and let the operator decide if they want to
turn some default-off abilities on, such as honoring the destination
TLV.

-- 
  ++ytti


Re: intuit DNS

2023-02-11 Thread Saku Ytti
dig NS intuit.com | grep ^intuit | ruby -nae 'puts $F[-1]' | while read dns; do
  echo $dns:; dig smartlinks.intuit.com @$dns | grep CNAME
done
a7-66.akam.net.:
smartlinks.intuit.com. 30 IN CNAME cegnotificationsvc.intuit.com.
a11-64.akam.net.:
smartlinks.intuit.com. 30 IN CNAME cegnotificationsvc.intuit.com.
a24-67.akam.net.:
smartlinks.intuit.com. 30 IN CNAME cegnotificationsvc.intuit.com.
a1-182.akam.net.:
smartlinks.intuit.com. 30 IN CNAME cegnotificationsvc.intuit.com.
a6-66.akam.net.:
smartlinks.intuit.com. 30 IN CNAME cegnotificationsvc.intuit.com.
a18-64.akam.net.:
smartlinks.intuit.com. 30 IN CNAME cegnotificationsvc.intuit.com.
dns1.p01.nsone.net.:
dns2.p01.nsone.net.:
dns3.p01.nsone.net.:
dns4.p01.nsone.net.:


On Sat, 11 Feb 2023 at 23:01, Daniel Sterling  wrote:
>
> Someone at Intuit please look into why your DNS for this A record
> hasn't been consistently resolving, this has been going on for several
> days if not weeks
>
> https://dnschecker.org/#A/smartlinks.intuit.com
>
> -- Dan



-- 
  ++ytti


Re: Typical last mile battery runtime (protecting against power cuts)

2023-02-04 Thread Saku Ytti
On Sun, 5 Feb 2023 at 07:50, Chris Adams  wrote:

> Electric heat pumps are great for power efficiency until the temperature
> drops and they switch over to pure electric heat.

Here is a graph from the popular Mitsubishi MSZ/MUZ 25 air heat pump:
https://scanoffice.fi/wp-content/uploads/2022/09/rw-vtt-tuntikeskiarvo.jpg
https://scanoffice.fi/wp-content/uploads/2022/09/rw-vttn-testitulos.png

At -30C external, with +20C internal, the units produce heat at
approximately 2x the electric input.

But many other units do not perform that well even at -20C external.
And these units are premium priced. Modern R32 units consistently
outperform old R410A units.

--
  ++ytti


Re: Typical last mile battery runtime (protecting against power cuts)

2023-02-03 Thread Saku Ytti
On Fri, 3 Feb 2023 at 16:15, Israel G. Lugo  wrote:

> Could anyone with last mile experience help with some ballpark figures?
> I.e. 15 min vs 8h or 8 days.

This would be highly market specific. In many cases, probably most
cases, there is no regulatory requirement for availability for
internet service whatsoever.

One specific case where it is regulated is Finland, where the
regulation is available in Finnish, Swedish and English; the English
document is available at:
https://www.finlex.fi/data/normit/47143/05_Regulation_on_resilience_of_communications_networks_and_services_and_of_synchronisation_of_communications_networks.pdf

It classifies service into five priority classes with different
availability requirements. From your ballpark, 8h would be the closest
fit, but in theory the higher priorities have indefinite availability,
by means of generation, before the system is exhausted.

In practice I would default to expecting 0 min availability during a
power outage, regardless of how resilient my CPE is. We can scarcely
make the Internet work at the best of times.

-- 
  ++ytti


Re: MX204 and MPC7E-MRATE EoL - REVOKED

2023-01-27 Thread Saku Ytti
On Sat, 28 Jan 2023 at 08:48, Mark Tinka  wrote:

> Apparently, the shortage of chips for the MX204 and MPC7E is now resolved, 
> and there is no longer any need to force customers to move to the MX304.

There is still just Micron for HMC, and as far as I can find, they
have not revoked their EOL. You can't find the HMC product page under
Micron's 'products' anymore, and there are hardly any mentions of it
anywhere. Everyone is now focusing on HBM3.

https://www.micron.com/about/blog/2018/august/micron-announces-shift-in-high-performance-memory-roadmap-strategy

Whatever led to this problem, and what led to this EOL revocation is
not something Juniper has communicated.

If I had to stab in the dark based on nothing, I'd imagine they forgot
HMC is no longer shipping, then panicked and EOLd all HMC boxes, until
someone did more work and gathered they can probably support a few HMC
platforms with the existing HMC parts they have.
I would be very uneasy committing to HMC gear unless I had a better
understanding of what the problem was and why it is no longer a
problem. My concern would be: if they were wrong once in EOLing
everything, then wrong again in revoking some of the EOLs, can I trust
them now to have HMC parts for any RMAs over the remaining life
expectancy? It is not at all uncommon to run a box for a decade in an
SP network, and Juniper released all-new HMC gear after Micron
announced the HMC EOL.

For HBM there is Samsung, Hynix and Micron coming up, so HBM seems
safe. It is unclear how safe HBM2 is now that HBM3 is shipping, for
the life expectancy SP gear has. Obviously most of the market moves
faster; no one is going to run HBM2 GPUs a decade from now. We are a
kinda shitty market: few units, long sales times, long cycles.

-- 
  ++ytti


Re: Large RTT or Why doesn't my ping traffic get discarded?

2022-12-21 Thread Saku Ytti
On Thu, 22 Dec 2022 at 08:41, William Herrin  wrote:

> Suppose you have a loose network cable between your Linux server and a
> switch. Layer 1. That RJ45 just isn't quite solid. It's mostly working
> but not quite right. What does it look like at layer 2? One thing it
> can look like is a periodic carrier flash where the NIC thinks it has
> no carrier, then immediately thinks it has enough of a carrier to
> negotiate speed and duplex. How does layer 3 respond to that?

Agreed. But then once the resolution happens, and Linux floods the
queued pings out, the responses would come ~immediately. So the delta
between the RTTs would remain at the send interval, in this case 1s.
Here instead we see the RTT decreasing as if a buffer were being
purged, until it seems to be filled again, up until 5s or so.

I don't exclude the rationale, I just think it's not likely based on
the latencies observed. But at any rate, with so little data, my
confidence to include or exclude any specific explanation is low.

>
> 1s: send ping toward default router
> 1.1s: ping response from remote server
> 2s: send ping toward default router
> 2.1s: ping response from remote server
> 2.5s: carrier down
> 2.501s: carrier up
> 3s: queue ping, arp for default router, no response
> 4s: queue ping, arp for default router, no response
> 5s: queue ping, arp for default router, no response
> 6s: queue ping, arp for default router, no response
> 7s: queue ping, arp for default router
> 7.01s: arp response, send all 5 queued pings but note that the
> earliest is more than 4 seconds old.
> 7.1s: response from all 5 queued pings.
>
> Cable still isn't right though, so in a few seconds or a few minutes
> you're going to get another carrier flash and the pattern will repeat.
>
> I've also seen some cheap switches get stuck doing this even after the
> faulty cable connection is repaired, not clearing until a reboot.
>
> Regards,
> Bill Herrin
>
>
> --
> For hire. https://bill.herrin.us/resume/



-- 
  ++ytti


Re: Large RTT or Why doesn't my ping traffic get discarded?

2022-12-21 Thread Saku Ytti
There certainly aren't any temporal buffers in SP gear limiting the
buffer to 100ms, nor are there any mechanisms to decrement TTL or
hop-limit over time. Some devices may expose temporal configuration in
the UX, but that is just a multiplier for max_buffer_bytes; what is
programmed is a fixed amount of bytes, not a temporal limit as a
function of observed traffic rate.
This is important, because HW may support tens or even hundreds of
thousands of queues, since HW may support a large number of logical
interfaces with HQoS and multiple queues each. If such a device is
run with a single logical interface which is low speed, either
physically or shaped, you may end up with very, very long temporal
queues, not because anyone intends to queue that long, but because
understanding all of this requires a lot of context and platform
information which isn't readily available, nor is it solved by 'just
remove those buffers from devices physically, it's bufferbloat'.
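
To make the fixed-bytes-versus-time point concrete, a
back-of-the-envelope sketch (the buffer size is an arbitrary example,
not any specific platform's number):

BUFFER_BYTES = 64 * 1024 * 1024            # say the queue owns 64 MB

for rate_bps in (100e9, 10e9, 1e9, 100e6, 10e6):
    delay_ms = BUFFER_BYTES * 8 / rate_bps * 1000
    print('%6d Mbps -> %8.1f ms of queue' % (rate_bps / 1e6, delay_ms))

The same buffer that is a harmless ~5ms at 100G is almost a minute of
queue on a 10Mbps shaped interface.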

Like others have pointed out, there is not much information to go on
and this could be many things. One of those could be 'buffer bloat'
like Taht pointed out; this might be true given the cyclical nature of
the ping, a buffer getting filled and drained. I don't really think
ARP/ND is a good candidate like Herrin suggested, because the pattern
is cyclical instead of exactly a single event, but it's not impossible.

We'd really need to see full mtr output, whether or not this affects
other destinations, whether it affects just ICMP or also DNS, and
ideally a reverse traceroute as well. I can tell that I'm not
observing the issue, nor did I expect to observe it, as I expect the
problem to be close to your network, and therefore affecting a lot of
destinations.


On Thu, 22 Dec 2022 at 07:35, Jerry Cloe  wrote:
>
>
> Because there is no standard for discarding "old" traffic, only discard is 
> for packets that hop too many times. There is, however, a standard for 
> decrementing TTL by 1 if a packet sits on a device for more than 1000ms, and 
> of course we all know what happens when TTL hits zero. Based on that, your 
> packet could have floated around for another 53 seconds. Having said that, 
> I'm not sure many devices actually do this (but its not likely it would have 
> had a significant impact on this traffic anyway).
>
>
>
> -Original message-
> From: Jason Iannone 
> Sent: Wed 12-21-2022 11:11 am
> Subject: Large RTT or Why doesn‘t my ping traffic get discarded?
> To: North American Network Operators‘ Group ;
> Here's a question I haven't bothered to ask until now. Can someone please 
> help me understand why I receive a ping reply after almost 5 seconds? As I 
> understand it, buffers in SP gear are generally 100ms. According to my math 
> this round trip should have been discarded around the 1 second mark, even in 
> a long path. Maybe I should buy a lottery ticket. I don't get it. What is 
> happening here?
>
> Jason
>
> 64 bytes from 4.2.2.2: icmp_seq=392 ttl=54 time=4834.737 ms
> 64 bytes from 4.2.2.2: icmp_seq=393 ttl=54 time=4301.243 ms
> 64 bytes from 4.2.2.2: icmp_seq=394 ttl=54 time=3300.328 ms
> 64 bytes from 4.2.2.2: icmp_seq=396 ttl=54 time=1289.723 ms
> Request timeout for icmp_seq 400
> Request timeout for icmp_seq 401
> 64 bytes from 4.2.2.2: icmp_seq=398 ttl=54 time=4915.096 ms
> 64 bytes from 4.2.2.2: icmp_seq=399 ttl=54 time=4310.575 ms
> 64 bytes from 4.2.2.2: icmp_seq=400 ttl=54 time=4196.075 ms
> 64 bytes from 4.2.2.2: icmp_seq=401 ttl=54 time=4287.048 ms
> 64 bytes from 4.2.2.2: icmp_seq=403 ttl=54 time=2280.466 ms
> 64 bytes from 4.2.2.2: icmp_seq=404 ttl=54 time=1279.348 ms
> 64 bytes from 4.2.2.2: icmp_seq=405 ttl=54 time=276.669 ms



-- 
  ++ytti


Re: Large prefix lists/sets on IOS-XR

2022-12-09 Thread Saku Ytti
On Fri, 9 Dec 2022 at 20:19, t...@pelican.org  wrote:

Hey Tim,

> Or at least, you've moved the problem from "generate config" to "have 
> complete and correct data".  Which statement should probably come with some 
> kind of trigger-warning...

I think it's a lot easier than you think. I understand that all older
networks and practical access networks have this problem: the data is
in the network. It's of course not the right way to do it, but it's
the way they are. But there is no reason to get discouraged.
First you gotta ignore the waterfall model; you can never order
something ready-made and get utility out of it, because you have no
data.

What you can do, day 1:

a) copy configs as-is, as templates
b) only edit the templates
c) push templates to the network

Boom, now you are far along, and that took an hour or a day depending
on the person.

Maybe you feel like you've not accomplished much, but you have. Now
you can start modelling data out of the templates into the database,
and keep shrinking the 'blobs'. You can do this at whatever pace is
convenient, and you can trivially measure which blob to do next, i.e.
which one will reduce total blob bytes the most. You will see
constant, measurable progress. And you always know the network state
is what is in your files, as you are now always replacing the entire
config with the generated config.


-- 
  ++ytti


Re: Large prefix lists/sets on IOS-XR

2022-12-09 Thread Saku Ytti
On Fri, 9 Dec 2022 at 17:58, Joshua Miller  wrote:


> In terms of structured vs unstructured data, sure, assembling text is not a 
> huge lift. Though, when you're talking about layering on complex use cases, 
> then it gets more complicated. Especially if you want to compute the inverse 
> configuration to remove service instances that are no longer needed. In terms 
> of vendor support, I'd hope that if you're paying that kind of money, you're 
> getting a product that meets your requirements. Something that should be 
> assessed during vendor selection and procurement. That's just my preference; 
> do whatever works best for your use cases.

Deltas are _super_ hard. But you never need to do them. Always produce
a complete config, and let the vendor deal with the problem.

We've done this with Junos, IOS-XR, EOS (compass, not arista, RIP) and
SROS (MDCLI) for years.

If you remove the need for deltas, the whole problem becomes extremely
trivial: fill in all the templates with data, push it.
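
A minimal sketch of the 'templates plus data, full config every time'
approach (using Jinja2 as the template engine; the template and data
here are made-up examples):

from jinja2 import Template

TEMPLATE = Template("""\
hostname {{ name }}
{% for ifd in interfaces -%}
interface {{ ifd.name }}
 description {{ ifd.desc }}
{% endfor -%}
""")

device = {
    'name': 'edge1.example.net',
    'interfaces': [
        {'name': 'GigabitEthernet0/0/0/0', 'desc': 'core'},
    ],
}

# always render the COMPLETE config from data; no deltas, the device's
# own load/replace machinery figures out what actually changed
print(TEMPLATE.render(**device))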
--
  ++ytti


Re: Large prefix lists/sets on IOS-XR

2022-12-09 Thread Saku Ytti
On Fri, 9 Dec 2022 at 17:30, Tom Beecher  wrote:

> Pushing thousands of lines via CLI/expect automation is def not a great idea, 
> no. Putting everything into a file, copying that to the device, and loading 
> from there is generally best regardless. The slowness you refer to is almost 
> certainly just because of how XR handles config application. If I'm following 
> correctly, that seems to be the crux of your question.

If you read carefully, that is what Steffann is doing. He is doing
'load location:file' + 'commit'. He is not punching anything in by hand.

So the answer we are looking for is how to make that go faster.

In Junos the answer would be 'ephemeral config', but in IOS-XR, as far
as I know, the only thing you can do is improve the 'load' part by
moving the server closer; other than that, you get what you get.
-- 
  ++ytti


Re: Large prefix lists/sets on IOS-XR

2022-12-09 Thread Saku Ytti
On Fri, 9 Dec 2022 at 17:07, Joshua Miller  wrote:

> I don't know that Netconf or gRPC are any faster than loading cli. Those 
> protocols facilitate automation so that the time it takes to load any one 
> device is not a significant factor, especially when you can roll out changes 
> to devices in parallel. Also, it's easier to build the changes into a 
> structured format than assemble the right syntax to interact with the CLI.

As a programmer I don't really find the output format to be a
significant cost. If I have the source of data, how I emit it doesn't
matter much. I accept the preferences people have, but I don't think
it is an important part of the solution.

Andrian mentioned paramiko, and if we imagine paramiko logging into
IOS-XR and doing 'load http://...' + 'commit', we've automated the
task.
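
For illustration, roughly what that looks like with paramiko (host,
credentials and URL are placeholders, and the command sequence is from
memory, so treat it as a sketch rather than something tested against a
real XR box):

import time
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('router.example.net', username='automation', password='...')

shell = client.invoke_shell()
for cmd in ('configure',
            'load http://nms.example.net/cfg/prefix-sets.txt',
            'commit'):
    shell.send(cmd + '\n')
    time.sleep(2)              # crude; real code would wait for the prompt
print(shell.recv(65535).decode())
client.close()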

Depending on your platform, netconf/yang/grpc can be an asset or a
liability. I put IOS-XR strongly on the liability side, because they
don't have proper data-first infrastructure; they don't even have a
proper module for handling configurations, but configurations are
owned by individual component teams (like the tunnel team owns the GRE
config and so forth). Contrast this with Juniper, which is data-first,
where even the CLI is a 2nd-class citizen taking formal data from the
XML RPC.
In IOS-XR you will find all kinds of gaps where you can't rely on
netconf/yang, which you will then spend cycles on with the vendor,
compared to the people who use the first-class-citizen approach, the
CLI format, who are already done.

I did not read Steffann as though he'd be punching anything in
manually; he wants to make the process itself faster, without any
delays introduced by humans. And I personally have nothing to offer
him, except: put your server closer to the router, so you can deal
with the limited TCP window sizes that hurt transfer speed.

-- 
  ++ytti


Re: Large prefix lists/sets on IOS-XR

2022-12-09 Thread Saku Ytti
Can Andrian and Joshua explain what they specifically mean, and how
they expect it to perform compared to what Steffann is already doing
(e.g. load https://nms/cfg/router.txt)? How much faster will it be,
and why?

Can Steffann explain how large a file they are copying, over what
protocol, how long it takes, and how long the commit takes.

We used to have configurations in excess of a million lines before
'or-longer' halved them, and we've seen much longer times than 30min
to get a new config pushed+committed. We use FTP, and while the FTP
does take its sweet time, the commit itself is very long as well.

I refrain from expressing my disillusionment with the utility of doing
IRR-based filtering.

On Fri, 9 Dec 2022 at 15:38, Andrian Visnevschi via NANOG
 wrote:
>
> Two options:
> - gRPC
> - Netconf
>
> You can use tools like paramiko,netmiko or napalm that are widely used to 
> programmatically configure and manage your XR router.
>
>
> On Fri, Dec 9, 2022 at 2:24 AM Joshua Miller  wrote:
>>
>> Netconf is really nice for atomic changes to network devices, though it 
>> would still take some time for the device to process such a large change.
>>
>> On Thu, Dec 8, 2022 at 6:05 PM Sander Steffann  wrote:
>>>
>>> Hi,
>>>
>>> What is the best/most efficient/most convenient way to push large prefix 
>>> lists or sets to an XR router for BGP prefix filtering? Pushing thousands 
>>> of lines through the CLI seems foolish, I tried using the load command but 
>>> it seems horribly slow. What am I missing? :)
>>>
>>> Cheers!
>>> Sander
>>>
>>> ---
>>> for every complex problem, there’s a solution that is simple, neat, and 
>>> wrong
>
>
>
> --
>
> Cheers,
>
> Andrian Visnevschi
>
>


-- 
  ++ytti


Re: Newbie Concern: (BGP) AS-Path Oscillation

2022-11-27 Thread Saku Ytti
I don't think this is normal; I think this is a fault and needs to be
addressed. There should be significant reachability problems, because
rerouting is neither immediate, nor in lock-step between SW and HW,
nor synchronous between nodes.

What exactly needs to be done, I can't tell without looking at the
specific case.

I'm not sure I understand 'tail-end' and 'origin announcer' as
synonyms; tail to me means receiver, head means advertiser, but origin
announcer to me means advertiser. So I'm not sure which position you
are in. But if you are the source of this prefix, then you can
probably fix the situation; if you are not, then you probably cannot.

On Mon, 28 Nov 2022 at 07:56, Pirawat WATANAPONGSE via NANOG
 wrote:
>
> Dear Guru(s),
>
>
> My apologies upfront if this question has already been asked.
> If that’s the case, please kindly point me to the solution|thread so that the 
> mailing list bandwidth is not wasted.
>
> Situation:
> On one of our prefixes, we are detecting continuous “BGP AS-Path Changes” in 
> the order of 1,000 announcements per hour---practically one every 3-4 seconds.
> Those paths oscillate between two of our immediate upstreams.
>
> Questions:
> 1. Is this number of events “normal” for a prefix?
> 2. Is there any way we, as the tail-end (Origin Announcer), can do to reduce 
> it? Or should I just “let it be”?
> 3. [Extra] Is this kind of oscillation affecting user experience, say, 
> throughput and/or latency?
>
> Thank you in advance for all the pointers and help.
>
>
> Best Regards,
>
> Pirawat.
>


-- 
  ++ytti


Re: Random Early Detect and streaming video

2022-11-08 Thread Saku Ytti
Hey,


On Mon, 7 Nov 2022 at 21:58, Graham Johnston  wrote:


> I've been involved in service provider networks, small retail ISPs, for 20+ 
> years now. Largely though, we've never needed complex QoS, as at 
> $OLD_DAY_JOB, we had been consistently positioned to avoid regular link 
> congestion by having  sufficient capacity. In the few instances when we've 
> had link congestion, egress priority queuing met our needs.

What does 'egress priority queueing' mean? Do you mean 'send all X
before any Y, send all Y before any Z'? If so, then this must have
been quite some time ago, as since traffic managers were implemented
in hardware ages ago, this hasn't been available. The only thing that
has been available is 'X has guaranteed rate X1, Y has Y1 and Z has
Z1', and love it or hate it, that's the QoS tool the industry has
decided you need.

> combine that with the buffering and we should adjust the drop profile to kick 
> in at a higher percentage. Today we use 70% to start triggering the drop 
> behavior, but my head tells me it should be higher. The reason I am saying 
> this is that we are dropping packets ahead of full link congestion, yes that 
> is what RED was designed to do, but I surmise that we are making this 
> application worse than is actually intended.

I wager almost no one knows what their RED curve is; different vendors
have different default curves, which is then the curve almost everyone
uses. Some use a RED curve such that everything is basically tail drop
(Juniper: 0% drop at 96% fill and 100% drop at 98% fill). Some are
linear. Some allow defining just two points, some allow defining 64
points. And almost no one has any idea what their curve is, i.e.
mostly it doesn't matter. If it usually mattered, we'd all know what
the curve is and why.
In your case, I assume you have at least two points: 0% drop at 69%
fill, then a linear curve from 70% to 100% fill with 1% to 100% drop
(sketched below). It doesn't seem outright wrong to me. You have 2-3
goals here: to avoid synchronising TCP flows so that you have a steady
fill instead of wave-like behaviour, and to reduce queueing delay for
packets not dropped, which would experience as long a delay as there
is queue if tail-dropped. You could have a 3rd possible goal: if you
map more than 1 class of packets into the same queue, you can still
give them different curves, so during congestion a single queue can
show two different behaviours depending on the packet.
So what is the problem you're trying to fix? Can you measure it?
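
The assumed two-point curve, written out just to make the shape
explicit (my arithmetic, not a vendor default):

def drop_probability(fill):
    # nothing below 70% queue fill, then linear from 1% to 100% drop
    if fill < 0.70:
        return 0.0
    return min(1.0, 0.01 + (fill - 0.70) / 0.30 * 0.99)

for fill in (0.60, 0.70, 0.85, 1.00):
    print('%3d%% fill -> %5.1f%% drop' % (fill * 100, drop_probability(fill) * 100))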

I suspect that in a modern high-speed network with massive amounts of
flows the wave-like synchronisation is not a problem. If you can't
measure it, or if your only goal is to reduce queueing delay because
you have 'strategic' congestion, then perhaps instead of worrying
about RED, use tail drop only and reduce the queue size to something
tolerable, 1-5ms max?

-- 
  ++ytti


Re: Router ID on IPv6-Only

2022-09-09 Thread Saku Ytti
On Fri, 9 Sept 2022 at 09:31, Crist Clark  wrote:

> As I said in the original email, I realize router IDs just need to be
> unique in
> an AS. We could have done random ones with IPv4, but using a well chosen

In some far future this will be true. We meet eBGP speakers across the
world, and not everyone supports route refresh, _TODAY_, I suspect
mostly because internally developed eBGP implementations were written
by developers not very familiar with how real-life BGP works. RFC6286
is not supported by all common implementations, much less uncommon
ones. And even for common implementations it requires a very new image
(20.4 for Junos; many are even on 17.4 still).

So while we can consider the BGP router-id to be only locally
significant when RFC6286 is implemented, in practice you want to be
defensive in your router-id strategy, i.e. at least avoid a scheme of
1,2,3,4,5,6... on the thesis that it will be a common scheme and
liable to increase support costs down the line due to the collision
probability being higher. It might also add a commercial advantage for
transit providers to have a low router-id, to win billable traffic.

> And to get even a little more specific about our particular use case and
> the
> suggestion here to build the device location into the ID, we're
> generally not

I would strongly advise against any information-to-ID mapping schemes.
They add complexity, reduce flexibility and require you to know the
complete problem ahead of time, which is difficult; only have rules
you absolutely must have. I am sure most people here have experience
with too-cutesy addressing schemes some time in their past, where
forming an IP address had unnecessary rules in it, which just created
complexity and cost later.
If you can add an arbitrary 32b ID to your database, this problem
becomes very easy. If not, it's tricky.
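
A minimal sketch of the easy case (a plain set stands in for the
database's uniqueness constraint; everything here is illustrative):

import random
import ipaddress

allocated = set()          # stand-in for a UNIQUE column in the device table

def new_router_id():
    # any AS-unique non-zero 32-bit value works; being meaningless is the point
    while True:
        rid = random.randint(0x01000000, 0xFFFFFFFE)
        if rid not in allocated:
            allocated.add(rid)
            return str(ipaddress.IPv4Address(rid))   # dotted quad for configs

print(new_router_id())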

-- 
  ++ytti


Re: Router ID on IPv6-Only

2022-09-08 Thread Saku Ytti
On Thu, 8 Sept 2022 at 10:22, Bjørn Mork  wrote:

> I'm not used to punching anything, so I probably have too simple a view
> of the world.
>
> But I still don't understand how this changes the ID allocation scheme,
> which is how I understood the question.  I assume the punched value was
> based on input from somewhere?

Today

1. Don't punch one in - won't work, you have to (Junos)
2. Punch in an IPv4 one - won't work

So what to do tomorrow?


-- 
  ++ytti


Re: Router ID on IPv6-Only

2022-09-08 Thread Saku Ytti
On Thu, 8 Sept 2022 at 10:01, Bjørn Mork  wrote:

> Why would you do it differently than for dual-stack routers, except that
> you skip the step where you configure the ID as a loopback address?

Because you may not have an option: if you're IPv6-only, vendors (e.g.
Junos) may expect you to punch it manually. Of course most of us punch
it manually as the loopback0 IPv4 address, to have more control over
the outcome.
The question is legitimate and represents a change where previously
used mechanisms do not apply, therefore the OP is right to ask 'well,
what should I do now?'.

-- 
  ++ytti


Re: Router ID on IPv6-Only

2022-09-07 Thread Saku Ytti
Hey,

> Well, now there is no IPv4. But BGP, OSPFv3, and other routing protocols 
> still use 32-bit router IDs for IPv6. On the one hand, there are plenty of 
> 32-bit numbers to use. Generally speaking, router IDs just need to be unique 
> inside of an AS to do their job, but (a) for humans or automation to generate 
> them and (b) to easily recognize them, it's convenient to have some algorithm 
> or methodology for assigning them.

Second-hand knowledge, but when this was discussed early on in
standardisation, someone argued against a 128b ID because it would
require too much bandwidth in their OSPF network. The joys of
everyone-plays standardisation.

> Has anyone thought about this or have a good way to do it? We had ideas like 
> use bits 32-63 from an interface. Seems like it could work, but also could 
> totally break down if we're using >64-bit prefixes for things like 
> router-to-router links or pulling router loopbacks out of a common /64.

If your data is in a database, I think the best bet is to
algorithmically generate multiple forms of IDs in your device and
interface rows, to satisfy the various restrictions on what forms of
IDs are accepted, and then use these IDs. If your data is in configs,
you don't have really good solutions, but you could take the 32 bits
from the right side of your IPv6 loopback :/.
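
A minimal sketch of that last resort (the address is an example;
assumes your loopbacks are unique in their low 32 bits):

import ipaddress

def router_id_from_v6(addr: str) -> str:
    # Take the low 32 bits of the IPv6 loopback and render them in
    # the dotted-quad form router-id configuration expects.
    low32 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(low32))

print(router_id_from_v6("2001:db8::c000:201"))  # 192.0.2.1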

-- 
  ++ytti


Re: End of Cogent-Sprint peering wars?

2022-09-07 Thread Saku Ytti
On Thu, 8 Sept 2022 at 01:06, Jawaid Bazyar  wrote:

> $1 deals usually come with an operation in the red, or assumption of 
> significant debts.

To me this looks like a continuation of the game of attrition for
infrastructure players. No one seems to know how to capitalise
infrastructure, and ostensibly cheap deals have brought shops down
before due to naive buyers (GTT).
But I do think this makes sense for both TMUS and CCOI: for TMUS the
infrastructure is a bad risk, and they can always afford to procure
the service at market price by moving costs to customers. CCOI doesn't
have much choice but to figure out how to turn infrastructure into
money; if they can't, they're dead anyhow, now they're just dead a
little bit sooner, so it seems like a good risk for CCOI.
I am a little bit more optimistic about CCOI leadership's ability to
capitalise this than I was about GTT's, and wish them good luck.


-- 
  ++ytti


Re: IoT - The end of the internet

2022-08-10 Thread Saku Ytti
On Wed, 10 Aug 2022 at 12:48, Pascal Thubert (pthubert)
 wrote:


Hey,

> I do not share that view:

I'm not sure how you read my view. I was not attempting to communicate
anything negative about IPv6. What I attempted to communicate:

- the near future looks to improve IoT security posture significantly,
as the IoT LAN won't share a network with your user LAN; you'll go via
a GW
- Thread+Matter give me optimism that IoT is being taken seriously,
good progress is being made, and the standards look largely well
thought out

> 1) Thread uses 6LoWPAN so nodes are effectively IPv6 even though it doesn’t 
> show in the air.

I believe I implied that strongly, considering I called the Thread
addressing scheme 'forced marketing of IPv6'. Mind you, I don't think
it is a big deal, it might even be positive, but I would probably have
inlined the role in the PDU instead.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-10 Thread Saku Ytti
On Wed, 10 Aug 2022 at 06:48,  wrote:

> How do you propose to fairly distribute market data feeds to the market if 
> not multicast?

I expected your aggressive support for small packets was for fintech.
An anecdote:

one of the largest exchanges in the world used MX for multicast
replication, which is btree, or today utree, replication; that is,
each NPU gets the replicated packet at a wildly different time, and
therefore so do the receivers. This wasn't a problem for them, because
they didn't know that's how it works and suffered no negative
consequence from it, which arguably should have been a show stopper if
receivers truly need to receive it at a remotely similar time.

Also, this is not in disagreement with my statement that it is not an
addressable market, because this market can use products which do not
do 64B wire-rate, for two separate reasons, either or both: a) the
port is nowhere near congested, b) the market is not cost sensitive;
they buy the device with many WAN ports and don't provision it so
densely that they can't get 64B on each port actually used.


-- 
  ++ytti


Re: IoT - The end of the internet

2022-08-09 Thread Saku Ytti
On Wed, 10 Aug 2022 at 07:54, Pascal Thubert (pthubert) via NANOG
 wrote:

> On a more positive note, the IPv6 IoT can be seen as an experiment on how we 
> can scale the internet another order of magnitude or 2 without taking the 
> power or the spectrum consumption to the parallel levels.

I think at least the next 20 years of IoT is Thread (and WiFi for high
BW) + Matter, and IoT devices won't have an IP that is addressable
even from the user LAN; you go via a GW, none of which you configure.

Some bits of it look like unnecessarily forced perspective: the
addressing scheme, where instead of inlining your role in the PDU we
use this cutesy addressing scheme, looks like a bit of forced
marketing of IPv6. It doesn't seem necessary, but it's not really an
important decision either way. Overall I think Thread+Matter are well
designed, and they make me quite optimistic about reasonable IoT
outcomes.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-08 Thread Saku Ytti
On Mon, 8 Aug 2022 at 14:37, Masataka Ohta
 wrote:

> With such an imaginary assumption, according to the end to end
> principle, the customers (the ends) should use paced TCP instead

I fully agree, unfortunately I do not control the whole problem
domain, and the solutions available with partial control over the
domain are less than elegant.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-08 Thread Saku Ytti
On Mon, 8 Aug 2022 at 14:02, Masataka Ohta
 wrote:

> which is, unlike Yttinet, the reality.

Yttinet has pesky customers who care about single-TCP-flow performance
over long fat links, and who observe poor performance with shallow
buffers at the provider end. Yttinet is cost sensitive and does not
want to do work unless sufficiently motivated by paying customers.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-08 Thread Saku Ytti
On Mon, 8 Aug 2022 at 13:03, Masataka Ohta
 wrote:

> If RTT is large, your 100G runs over several 100/400G
> backbone links with many other traffic, which makes the
> burst much slower than 10G.

In Ohtanet, I presume.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-07 Thread Saku Ytti
On Sun, 7 Aug 2022 at 14:16, Masataka Ohta
 wrote:

> When many TCPs are running, burst is averaged and traffic
> is poisson.

If you grow a window, and the sender sends the delta at 100G, and the
receiver is 10G, eventually you'll hit that 10G port at 100G rate.
It's largely an edge problem, not a core problem.
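
Back-of-envelope for the buffering this implies (my numbers, just a
sketch):

def peak_queue_bytes(burst: int, in_bps: float, out_bps: float) -> float:
    # A burst arriving at in_bps and draining at out_bps peaks the
    # queue at burst * (1 - out/in).
    return burst * (1 - out_bps / in_bps)

# 1 MB of window growth sent at 100G into a 10G port:
print(peak_queue_bytes(1_000_000, 100e9, 10e9))  # 900000.0 bytes buffered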

> People who use irrationally small packets will suffer, which is
> not a problem for the rest of us.

Quite. Unfortunately, the problem I have exists in the Internet; the
problem you're solving exists in Ohtanet. Ohtanet is much more
civilised and allows for elegant solutions. The Internet just has
different shades of bad solutions to pick from.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-07 Thread Saku Ytti
On Sun, 7 Aug 2022 at 17:58,  wrote:

> There are MANY real world use cases which require high throughput at 64 byte 
> packet size. Denying those use cases because they don’t fit your world view 
> is short sighted. The word of networking is not all I-Mix.

Yes, but it's not an addressable market. Such a market will just buy
silly putty for 2 bucks and modify the existing face-plate to do 64B.

No one will ship that box for you, because the addressable market will
gladly take more WAN ports as a trade-off for a larger minimum mean
packet size.
-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-07 Thread Saku Ytti
On Sun, 7 Aug 2022 at 12:16, Masataka Ohta
 wrote:

> I'm afraid you imply too much buffer bloat only to cause
> unnecessary and unpleasant delay.
>
> With 99% load M/M/1, 500 packets (750kB for 1500B MTU) of
> buffer is enough to make packet drop probability less than
> 1%. With 98% load, the probability is 0.0041%.

I feel like I'll live to regret asking: which congestion control
algorithm are you thinking of? If we estimate BW and pace TCP window
growth at the estimated BW, we don't need much buffering at all.
But Cubic and Reno will burst TCP window growth at the sender rate,
which may be much higher than the receiver rate; someone has to store
that growth and pace it out at the receiver rate, otherwise the window
won't grow and the receiver rate won't be achieved.
So in an ideal scenario, no, we don't need a lot of buffer; in
practical situations today, yes, we need quite a bit of buffer.
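
For reference, the quoted numbers match the textbook M/M/1 tail bound,
where the buffer-overflow probability at load rho with K packets of
buffer is approximated as rho**K; a quick check:

def mm1_overflow(rho: float, k: int) -> float:
    # P(queue depth > k) ~= rho**k for an M/M/1 queue at load rho.
    return rho ** k

print(f"{mm1_overflow(0.99, 500):.4%}")  # ~0.66%, i.e. under 1%
print(f"{mm1_overflow(0.98, 500):.4%}")  # 0.0041%

The disagreement is not with that arithmetic, but with whether edge
traffic is Poisson in the first place.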

Now add to this multiple logical interfaces, each having 4-8 queues,
and it adds up. 'Big buffers are bad, 'kay' is frankly simplistic and
inaccurate.

Also, the shallow ingress buffers discussed in the thread are not
delay buffers, and the problem is complex, because no marketable
device can accept wire rate at minimum packet size. So what trade-offs
do we carry when we get bad traffic at wire rate at small packet size?
We can't empty the ingress buffers fast enough; do we have physical
memory for each port, do we share, and how do we share?

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-07 Thread Saku Ytti
On Sat, 6 Aug 2022 at 17:08,  wrote:


> For a while, GSR and CRS type systems had linecards where each card had a 
> bunch of chips that together built the forwarding pipeline.  You had chips 
> for the L1/L2 interfaces, chips for the packet lookups, chips for the 
> QoS/queueing math, and chips for the fabric interfaces.  Over time, we 
> integrated more and more of these things together until you (more or less) 
> had a linecard where everything was done on one or two chips, instead of a 
> half dozen or more.  Once we got here, the next step was to build linecards 
> where you actually had multiple independent things doing forwarding -- on the 
> ASR9k we called these "slices".  This again multiplies the performance you 
> can get, but now both the software and the operators have to deal with the 
> complexity of having multiple things running code where you used to only have 
> one.  Now let's jump into the 2010's where the silicon integration allows you 
> to put down multiple cores or pipelines on a single chip, each of these is 
> now (more or less) it's own forwarding entity.  So now you've got yet ANOTHER 
> layer of abstraction.  If I can attempt to draw out the tree, it looks like 
> this now:

> 1) you have a chassis or a system, which has a bunch of linecards.
> 2) each of those linecards has a bunch of NPUs/ASICs
> 3) each of those NPUs has a bunch of cores/pipelines

Thank you for this. I think we may have some ambiguity here. I'll
ignore multichassis designs, as those went out of fashion, for now,
and describe only 'NPU', not Express/BRCM-style pipelines.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices;
unsure if the EZchip vocabulary is the same here)
4) each NPU has 1 or more identical cores (well, I can't really name
any with 1 core; an NPU, like a GPU, pretty inherently has many, many
cores, and unlike some in this thread, I don't think they are ever the
ARM instruction set. That makes no sense: you create an instruction
set targeting the application at hand, which the ARM instruction set
is not. Maybe some day we'll have some forwarding-IA, allowing
customers to provide ucode that runs on multiple targets, but this
would reduce the pace of innovation)

Some of those NPU core architectures are flat, like Trio, where a
single core handles the entire packet. Other core architectures, like
FP, are matrices, where you have multiple lines, and a packet picks
one of the lines and traverses each core in that line. (FP has many
more cores per line compared to the Leaba/Pacific stuff.)

-- 
  ++ytti


Re: cogent - Sales practices

2022-08-07 Thread Saku Ytti
On Sat, 6 Aug 2022 at 23:08, Eric Kuhnke  wrote:

> I have a morbid curiosity about what the CRM database looks like inside 
> Cogent, for the stale/cold leads that get passed on to a new junior sales rep 
> every six months.
>
> The amount of peoples' names/email addresses/phone numbers in there must be 
> stupendous.
>
> All of this can probably serve as a useful training tool for smaller/mid 
> sized ISPs with their own outbound sales staff on how not to treat the 
> potential customers.

What is the benchmark? What are we trying to do? Where does CCOI fail
and some competitor succeed? Purely from a 10-Q POV, CCOI seems to be
doing fine and improving. Many shops that we could consider
competitors have been run into the ground or been sold off to avoid
losing more.

I fear that what they are doing objectively works from a business POV,
and nanog is not representative of the addressable market; if
anything, it may be a net win to get rid of us, as we may be more
expensive to support than the mean.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-05 Thread Saku Ytti
On Fri, 5 Aug 2022 at 20:31,  wrote:

Hey LJ,

> Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately 
> familiar with any of these devices, but I'm familiar with the high level 
> tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, 
> especially once you get into the second- and third-order implementation 
> details.  Your mileage will vary...   ;-)

I expect it may come to this: my question may be too specific to be
answered without violating some NDA.

> If you have a model where one core/block does ALL of the processing, you 
> generally benefit from lower latency, simpler programming, etc.  A major 
> downside is that to do this, all of these cores have to have access to all of 
> the different memories used to forward said packet.  Conversely, if you break 
> up the processing into stages, you can only connect the FIB lookup memory to 
> the cores that are going to be doing the FIB lookup, and only connect the 
> encap memories to the cores/blocks that are doing the encapsulation work.  
> Those interconnects take up silicon space, which equates to higher cost and 
> power.

While an interesting answer (that is, the statement that the cost of
giving cores access to memory, versus having a harder-to-program
pipeline of cores, is a balanced trade-off), I don't think it applies
to my specific question, though it may apply to generic ones. We can
roughly think of FP as having a similar number of lines as Trio has
PPEs; therefore, a similar number of cores need access to memory, and
possibly a higher number, as more than one core in a line will need
memory access.
So the question is more: why many less performant cores, where
performance is achieved by forming a pipeline, compared to fewer more
performant cores, where individual cores work on a packet to
completion, when the former has a similar number of core lines as the
latter has cores?

> Packaging two cores on a single device is beneficial in that you only have 
> one physical chip to work with instead of two.  This often simplifies the 
> board designers' job, and is often lower power than two separate chips.  This 
> starts to break down as you get to exceptionally large chips as you bump into 
> the various physical/reticle limitations of how large a chip you can actually 
> build.  With newer packaging technology (2.5D chips, HBM and similar 
> memories, chiplets down the road, etc) this becomes even more complicated, 
> but the answer to "why would you put two XYZs on a package?" is that it's 
> just cheaper and lower power from a system standpoint (and often also from a 
> pure silicon standpoint...)

Thank you for this; it does confirm that the benefits aren't perhaps
as revolutionary as the presentation proposed. The presentation
divided Trio evolution into 3 phases, and multiple Trios on a package
was presented as one of those big evolutions; perhaps some other
division of generations could have been more communicative.

> Lots and lots of Smart People Time has gone into different memory designs 
> that attempt to optimize this problem, and it's a major part of the 
> intellectual property of various chip designs.

I choose to read this as 'where a lot of innovation happens, a lot of
mistakes happen'. Hopefully we'll figure out a good answer here soon,
as the answers vendors are ending up with are becoming increasingly
visible compromises in the field. I suspect a large part of this is
that cloudy shops represent, if not disproportionate revenue,
disproportionate focus, and their networks tend to be a lot more
static in config and traffic than access/SP networks. And when you
have that quality, you can make increasingly broad assumptions,
assumptions which don't play as well in SP networks.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-08-05 Thread Saku Ytti
Thank you for this.

I wish there would have been a deeper dive into the lookup side. My open questions:

a) The Trio model, where a packet stays in a single PPE until done,
vs. the FP model of a line of PPEs (identical cores). I don't
understand the advantages of the FP model; the Trio model advantages
are clear to me. Obviously the FP model has to have some advantages,
so what are they?

b) What exactly are the gains of putting two Trios on-package in
Trio6? There is no local switching between the WANs of the Trios
in-package; they are, as far as I can tell, ships in the night, and
packets between the Trios go via fabric, as they would with separate
Trios. I can understand the benefit of putting a Trio and HBM2 on the
same package, to reduce distance so wattage goes down or frequency
goes up.

c) What evolution are they thinking of for the shallow ingress buffers
in Trio6? The collateral damage potential is significant, because the
WAN port which asks most gets most, instead of each having their fair
share; thus a potentially arbitrarily-low-rate WAN ingress might not
get access to the ingress buffer, causing drops. Would it be
practical, in terms of wattage/area, to add some sort of pre-QoS
towards the shallow ingress buffer, so each WAN ingress has a fair
guaranteed rate to the shallow buffers?

On Fri, 5 Aug 2022 at 02:18, Jeff Tantsura  wrote:
>
> Apologies for garbage/HTMLed email, not sure what happened (thanks
> Brian F for letting me know).
> Anyway, the podcast with Juniper (mostly around Trio/Express) has been
> broadcasted today and is available at
> https://www.youtube.com/watch?v=1he8GjDBq9g
> Next in the pipeline are:
> Cisco SiliconOne
> Broadcom DNX (Jericho/Qumran/Ramon)
> For both - the guests are main architects of the silicon
>
> Enjoy
>
>
> On Wed, Aug 3, 2022 at 5:06 PM Jeff Tantsura  wrote:
> >
> > Hey,
> >
> >
> >
> > This is not an advertisement but an attempt to help folks to better 
> > understand networking HW.
> >
> >
> >
> > Some of you might know (and love ) “between 0x2 nerds” podcast Jeff Doyle 
> > and I have been hosting for a couple of years.
> >
> >
> >
> > Following up the discussion we have decided to dedicate a number of 
> > upcoming podcasts to networking HW, the topic where more information and 
> > better education is very much needed (no, you won’t have to sign NDA before 
> > joining ), we have lined up a number of great guests, people who design 
> > and build ASICs and can talk firsthand about evolution of networking HW, 
> > complexity of the process, differences between fixed and programmable 
> > pipelines, memories and databases. This Thursday (08/04) at 11:00PST we are 
> > joined by one and only Sharada Yeluri - Sr. Director ASIC at Juniper. Other 
> > vendors will be joining in the later episodes, usual rules apply – no 
> > marketing, no BS.
> >
> > More to come, stay tuned.
> >
> > Live feed: https://lnkd.in/gk2x2ezZ
> >
> > Between 0x2 nerds playlist, videos will be published to: 
> > https://www.youtube.com/playlist?list=PLMYH1xDLIabuZCr1Yeoo39enogPA2yJB7
> >
> >
> >
> > Cheers,
> >
> > Jeff
> >
> >
> >
> > From: James Bensley
> > Sent: Wednesday, July 27, 2022 12:53 PM
> > To: Lawrence Wobker; NANOG
> > Subject: Re: 400G forwarding - how does it work?
> >
> >
> >
> > On Tue, 26 Jul 2022 at 21:39, Lawrence Wobker  wrote:
> >
> > > So if this pipeline can do 1.25 billion PPS and I want to be able to 
> > > forward 10BPPS, I can build a chip that has 8 of these pipelines and get 
> > > my performance target that way.  I could also build a "pipeline" that 
> > > processes multiple packets per clock, if I have one that does 2 
> > > packets/clock then I only need 4 of said pipelines... and so on and so 
> > > forth.
> >
> >
> >
> > Thanks for the response Lawrence.
> >
> >
> >
> > The Broadcom BCM16K KBP has a clock speed of 1.2Ghz, so I expect the
> >
> > J2 to have something similar (as someone already mentioned, most chips
> >
> > I've seen are in the 1-1.5Ghz range), so in this case "only" 2
> >
> > pipelines would be needed to maintain the headline 2Bpps rate of the
> >
> > J2, or even just 1 if they have managed to squeeze out two packets per
> >
> > cycle through parallelisation within the pipeline.
> >
> >
> >
> > Cheers,
> >
> > James.
> >
> >



-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-07-27 Thread Saku Ytti
On Tue, 26 Jul 2022 at 23:15, Jeff Tantsura  wrote:

> In general, if we look at the whole spectrum, on one side there’re massively 
> parallelized “many core” RTC ASICs, such as Trio, Lightspeed, and similar (as 
> the last gasp of Redback/Ericsson venture - we have built 1400 HW threads 
> ASIC (Spider).
> On another side of the spectrum - fixed pipeline ASICs, from BCM Tomahawk at 
> its extreme (max speed/radix - min features) moving with BCM Trident, 
> Innovium, Barefoot(quite different animal wrt programmability), etc - usually 
> shallow on chip buffer only (100-200M).
>
> In between we have got so called programmable pipeline silicon, BCM DNX and 
> Juniper Express are in this category, usually a combo of OCB + off chip 
> memory (most often HBM), (2-6G), usually have line-rate/high scale 
> security/overlay encap/decap capabilities. Usually have highly optimized RTC 
> blocks within a pipeline (RTC within macro). The way and speed to access DBs, 
> memories is evolving with each generation, number/speed of non networking 
> cores(usually ARM)  keeps growing - OAM, INT, local optimizations are primary 
> users of it.

What do we call Nokia FP, where you have a pipeline of identical cores
doing different things, and the packet has to hit each core in the
line in order? How do we contrast this to an NPU where a given packet
hits exactly one core?

I think ASIC, NPU, pipeline and RTC are all quite ambiguous. When we
say pipeline, people usually assume purpose-built unique HW blocks the
packet travels through (like DNX, Express), and not a fully flexible
pipeline of identical cores like FP.

So I guess I would consider a 'true pipeline' a pipeline of unique HW
blocks, a 'true NPU' one where a given packet hits exactly 1 core, and
anything else more or less a hybrid.

I expect that once you get into the details of implementation, all of
these generalisations lose communicative power.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-07-27 Thread Saku Ytti
On Tue, 26 Jul 2022 at 21:28,  wrote:

> >No you are right, FP has much much more PPEs than Trio.
>
> Can you give any examples?

Nokia FP is like >1k; Juniper Trio is closer to 100 (earlier Trio LUs
had far fewer). I could give exact numbers for EA and YT if needed;
they are visible in the CLI, and the end user can even profile them,
to see what ucode function they are spending their time on.

-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-07-26 Thread Saku Ytti
On Tue, 26 Jul 2022 at 10:52, Vasilenko Eduard 
wrote:

> Juniper is pipeline-based too (like any ASIC). They just invented one
> special stage in 1996 for lookup (sequence search by nibble in the big
> external memory tree) – it was public information up to 2000year. It is a
> different principle from TCAM search – performance is traded for
> flexibility/simplicity/cost.
>

How do you define a pipeline? My understanding is that fabric and WAN
connections are in a chip called MQ; the 'head' of the packet, some
320B or so (a bit less on more modern Trio, didn't measure
specifically), is then sent to the LU complex for lookup.
LU then sprays packets to one of many PPEs, but once a packet hits a
PPE, it is processed until done; it doesn't jump to another PPE.
Reordering will occur; order is later restored within flows, but
across flows reorder may remain.
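
A toy of that per-flow order restoration (my sketch, not Juniper's
mechanism; assumes each packet gets a per-flow sequence number when
sprayed):

from collections import defaultdict

def make_reorderer():
    expected = defaultdict(int)  # next sequence number due, per flow
    parked = defaultdict(dict)   # out-of-order packets held, per flow

    def deliver(flow, seq, pkt, out):
        # Release in order within a flow; flows don't block each other.
        if seq != expected[flow]:
            parked[flow][seq] = pkt
            return
        out.append(pkt)
        expected[flow] += 1
        while expected[flow] in parked[flow]:
            out.append(parked[flow].pop(expected[flow]))
            expected[flow] += 1

    return deliver

out = []
deliver = make_reorderer()
deliver("a", 1, "a1", out)  # early, parked until a0 shows up
deliver("b", 0, "b0", out)  # another flow passes straight through
deliver("a", 0, "a0", out)  # releases a0, then the parked a1
print(out)                  # ['b0', 'a0', 'a1']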

I don't know what the cores are, but I'm comfortable betting money
they are not ARM. I know Cisco used EZchip in the ASR9k but is now
jumping to their own NPU called Lightspeed, and Lightspeed, like CRS-1
and ASR1k, uses Tensilica cores, which are decidedly not ARM.

Nokia, as mentioned, kind of has a pipeline, because a single packet
hits every core in the line, and each core does a separate thing.

>
>
> Network Processors emulate stages on general-purpose ARM cores. It is a
> pipeline too (different cores for different functions, many cores for every
> function), just it is a virtual pipeline.
>
>
>
> Ed/
>
> -Original Message-
> From: NANOG [mailto:nanog-bounces+vasilenko.eduard=huawei@nanog.org]
> On Behalf Of Saku Ytti
> Sent: Monday, July 25, 2022 10:03 PM
> To: James Bensley 
> Cc: NANOG 
> Subject: Re: 400G forwarding - how does it work?
>
>
>
> On Mon, 25 Jul 2022 at 21:51, James Bensley 
> wrote:
>
>
>
> > I have no frame of reference here, but in comparison to Gen 6 Trio of
>
> > NP5, that seems very high to me (to the point where I assume I am
>
> > wrong).
>
>
>
> No you are right, FP has much much more PPEs than Trio.
>
>
>
> For fair calculation, you compare how many lines FP has to PPEs in Trio.
> Because in Trio single PPE handles entire packet, and all PPEs run
> identical ucode, performing same work.
>
>
>
> In FP each PPE in line has its own function, like first PPE in line could
> be parsing the packet and extracting keys from it, second could be doing
> ingressACL, 3rd ingressQoS, 4th ingress lookup and so forth.
>
>
>
> Why choose this NP design instead of Trio design, I don't know. I don't
> understand the upsides.
>
>
>
> Downside is easy to understand, picture yourself as ucode developer, and
> you get task to 'add this magic feature in the ucode'.
>
> Implementing it in Trio seems trivial, add the code in ucode, rock on.
>
> On FP, you might have to go 'aww shit, I need to do this before PPE5 but
> after PPE3 in the pipeline, but the instruction cost it adds isn't in the
> budget that I have in the PPE4, crap, now I need to shuffle around and
> figure out which PPE in line runs what function to keep the PPS we promise
> to customer.
>
>
>
> Let's look it from another vantage point, let's cook-up IPv6 header with
> crapton of EH, in Trio, PPE keeps churning it out, taking long time, but
> eventually it gets there or raises exception and gives up.
>
> Every other PPE in the box is fully available to perform work.
>
> Same thing in FP? You have HOLB, the PPEs in the line after thisPPE are
> not doing anything and can't do anything, until the PPE before in line is
> done.
>
>
>
> Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL
> before and after lookup, before is normally needed for ingressACL but after
> lookup ingressACL is needed for CoPP (we only know after lookup if it is
> control-plane packet). Nokia doesn't do this at all, and I bet they can't
> do it, because if they'd add it in the core where it needs to be in line,
> total PPS would go down. as there is no budget for additional ACL. Instead
> all control-plane packets from ingressFP are sent to control plane FP, and
> inshallah we don't congest the connection there or it.
>
>
>
>
>
> >
>
> > Cheers,
>
> > James.
>
>
>
>
>
>
>
> --
>
>   ++ytti
>


-- 
  ++ytti


Re: 400G forwarding - how does it work?

2022-07-25 Thread Saku Ytti
On Mon, 25 Jul 2022 at 21:51, James Bensley  wrote:

> I have no frame of reference here, but in comparison to Gen 6 Trio of
> NP5, that seems very high to me (to the point where I assume I am
> wrong).

No, you are right, FP has many more PPEs than Trio.

For a fair calculation, you compare how many lines FP has to how many
PPEs Trio has, because in Trio a single PPE handles the entire packet,
and all PPEs run identical ucode, performing the same work.

In FP each PPE in the line has its own function: the first PPE in the
line could be parsing the packet and extracting keys from it, the
second could be doing ingress ACL, the 3rd ingress QoS, the 4th
ingress lookup, and so forth.

Why choose this NP design instead of the Trio design? I don't know; I
don't understand the upsides.

The downside is easy to understand: picture yourself as a ucode
developer who gets the task to 'add this magic feature in the ucode'.
Implementing it in Trio seems trivial: add the code in ucode, rock on.
On FP, you might have to go 'aww shit, I need to do this before PPE5
but after PPE3 in the pipeline, but the instruction cost it adds isn't
in the budget that I have in PPE4, crap, now I need to shuffle around
and figure out which PPE in the line runs what function to keep the
PPS we promise to customers'.

Let's look at it from another vantage point: let's cook up an IPv6
header with a crapton of EHs. In Trio, the PPE keeps churning on it,
taking a long time, but eventually it gets there, or raises an
exception and gives up. Every other PPE in the box is fully available
to perform work.
Same thing in FP? You have HOLB: the PPEs in the line after this PPE
are not doing anything and can't do anything until the PPE before them
in the line is done.
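
A toy model of that HOLB difference (my sketch, nothing like real NPU
scheduling; same aggregate work and core count in both cases):

import heapq

def pipeline_makespan(times):
    # times[p][s]: service time of packet p at stage s; FIFO order,
    # standard flow-shop recurrence: a slow packet at one stage
    # delays every packet behind it (head-of-line blocking).
    done = [0.0] * len(times[0])
    for per_stage in times:
        for s, t in enumerate(per_stage):
            done[s] = max(done[s], done[s - 1] if s else 0.0) + t
    return done[-1]

def rtc_makespan(costs, cores):
    # Run-to-completion: each packet occupies one core from start to
    # finish; a slow packet ties up only its own core.
    free = [0.0] * cores
    heapq.heapify(free)
    for c in costs:
        heapq.heappush(free, heapq.heappop(free) + c)
    return max(free)

# 1000 packets of 8 work units each, plus one pathological packet of
# 507 units; pipeline is 8 stages of 1 unit (the bad packet spends
# 500 at one stage), RTC is 8 cores.
normal = [[1.0] * 8 for _ in range(999)]
bad = [[1.0] * 7 + [500.0]]
print(pipeline_makespan(bad + normal))         # 1506.0, all stall
print(rtc_makespan([507.0] + [8.0] * 999, 8))  # ~1064, pool absorbs it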

Today Cisco and Juniper do 'proper' CoPP, that is, they do ingress ACL
both before and after lookup; 'before' is normally needed for the
ingress ACL, but the after-lookup ingress ACL is needed for CoPP (we
only know after lookup if it is a control-plane packet). Nokia doesn't
do this at all, and I bet they can't, because if they added it in the
core where it needs to be in the line, total PPS would go down, as
there is no budget for the additional ACL. Instead all control-plane
packets from the ingress FP are sent to the control-plane FP, and
inshallah we don't congest the connection there, or it.


>
> Cheers,
> James.



-- 
  ++ytti


  1   2   3   4   5   6   7   8   9   >