Re: 400G forwarding - how does it work?

2022-07-26 Thread Jeff Tantsura
As Lincoln said - all of us directly working with BCM/other silicon vendors 
have signed numerous NDAs.
However if you ask a well crafted question - there’s always a way to talk about 
it ;-)

In general, if we look at the whole spectrum, on one side there are massively 
parallelized "many-core" RTC ASICs, such as Trio, Lightspeed, and similar (as 
the last gasp of the Redback/Ericsson venture we built a 1400-HW-thread ASIC, 
Spider).
On the other side of the spectrum are fixed-pipeline ASICs, from BCM Tomahawk 
at its extreme (max speed/radix, min features), moving through BCM Trident, 
Innovium, Barefoot (quite a different animal wrt programmability), etc - 
usually with a shallow on-chip buffer only (100-200M).

In between we have so-called programmable-pipeline silicon; BCM DNX and 
Juniper Express are in this category. These are usually a combo of OCB + 
off-chip memory (most often HBM, 2-6GB), usually with line-rate/high-scale 
security/overlay encap/decap capabilities, and usually with highly optimized 
RTC blocks within a pipeline (RTC within a macro). The way and speed of 
accessing DBs and memories evolves with each generation, and the number/speed 
of non-networking cores (usually ARM) keeps growing - OAM, INT, and local 
optimizations are their primary users.
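
As a rough feel for what that off-chip buffer means in time terms - a 
back-of-the-envelope sketch, with the HBM size and port rate as illustrative 
assumptions rather than any particular chip:

    hbm_bytes = 4e9            # assume 4GB of HBM on the package
    port_rate_bps = 400e9      # one congested 400G port

    buffer_ms = hbm_bytes * 8 / port_rate_bps * 1e3
    print(f"~{buffer_ms:.0f} ms of buffering")   # ~80ms; in practice the HBM
                                                 # is shared by all ports/queues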

Cheers,
Jeff

> On Jul 25, 2022, at 15:59, Lincoln Dale  wrote:
> 
> 
>> On Mon, Jul 25, 2022 at 11:58 AM James Bensley  
>> wrote:
> 
>> On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker  wrote:
>> > This is the parallelism part.  I can take multiple instances of these 
>> > memory/logic pipelines, and run them in parallel to increase the 
>> > throughput.
>> ...
>> > I work on/with a chip that can forward about 10B packets per second… so 
>> > if we go back to the order-of-magnitude number that I’m doing about “tens” 
>> > of memory lookups for every one of those packets, we’re talking about 
>> > something like a hundred BILLION total memory lookups… and since memory 
>> > does NOT give me answers in 1 picosecond… we get back to pipelining and 
>> > parallelism.
>> 
>> What level of parallelism is required to forward 10Bpps? Or 2Bpps like
>> my J2 example :)
> 
> I suspect many folks know the exact answer for J2, but it's likely under NDA 
> to talk about said specific answer for a given thing.
> 
> Without being platform or device-specific, the core clock rate of many 
> network devices is often in a "goldilocks" zone of (today) 1 to 1.5GHz with a 
> goal of 1 packet forwarded 'per-clock'. As LJ described the pipeline that 
> doesn't mean a latency of 1 clock ingress-to-egress but rather that every 
> clock there is a forwarding decision from one 'pipeline', and the MPPS/BPPS 
> packet rate is achieved by having enough pipelines in parallel to achieve 
> that.
> The number here is often "1" or "0.5" so you can work the number backwards. 
> (e.g. it emits a packet every clock, or every 2nd clock).
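> 
> As a back-of-the-envelope sketch of working that number backwards (all values 
> here are illustrative assumptions, not any specific device):
> 
>     clock_hz = 1.25e9          # assumed core clock in the "goldilocks" zone
>     pkts_per_clock = 1.0       # one forwarding decision per clock, per pipeline
>     target_pps = 2.0e9         # e.g. a ~2 Bpps device
> 
>     pipelines = target_pps / (clock_hz * pkts_per_clock)
>     print(f"~{pipelines:.1f} parallel pipelines")   # ~1.6 -> build 2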
> 
> It's possible to build an ASIC/NPU to run a faster clock rate, but that gets 
> back to what I'm hand-waving describing as "goldilocks". Look up power vs 
> frequency and you'll see it's non-linear.
> Just as CPUs can scale by adding more cores (vs increasing frequency), the 
> same holds true on network silicon, and you can go wider with multiple 
> pipelines. But it's not 10K parallel slices; there are some parallel parts, 
> but each has multiple 'stages' doing different things.
> 
> Using your CPU comparison, there are some analogies here that do work:
>  - you have multiple cpu cores that can do things in parallel -- analogous to 
> pipelines
>  - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some 
> DRAM or LLC)  -- maybe some lookup engines, or centralized buffer/memory
>  - most modern CPUs use out-of-order execution, where under-the-covers a 
> cache-miss or DRAM fetch has a disproportionate hit on performance, so it's 
> hidden away from you as much as possible by speculative out-of-order execution
> -- no direct analogy to this one - it's unlikely most forwarding 
> pipelines do speculative execution like a general purpose CPU does - but they 
> definitely do 'other work' while waiting for a lookup to happen
> 
> A common-garden x86 is unlikely to achieve such a rate for a few different 
> reasons (rough numbers are sketched after this list):
>  - packets-in or packets-out go via DRAM, so you need sufficient DRAM (page 
> opens/sec, DRAM bandwidth) to sustain at least one write and one read per 
> packet. Look closer at DRAM and its speed; pay attention to page 
> opens/sec and what that consumes.
>  - one 'trick' is to not DMA packets to DRAM but instead have them go into SRAM 
> of some form - e.g. Intel DDIO, ARM Cache Stashing - which at least 
> potentially saves you that DRAM write+read per packet
>   - ... but then do e.g. an LPM lookup, and best case that is back to a memory 
> access/packet. Maybe it's in L1/L2/L3 cache, but likely at large table sizes 
> it isn't.
>  - ... do more things to the packet (urpf lookups, counters) and it's yet 
> more lookups.
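> 
> As a rough sketch of that first point, with packet size and rate as assumed, 
> illustrative numbers:
> 
>     pps = 1.0e9                # 1 Bpps target
>     avg_pkt_bytes = 350        # assumed average packet size
>     touches_per_pkt = 2        # one DMA write in, one read out
> 
>     bytes_per_sec = pps * avg_pkt_bytes * touches_per_pkt
>     print(f"~{bytes_per_sec / 1e9:.0f} GB/s of DRAM bandwidth, "
>           f"before any LPM/uRPF/counter lookups")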
> 
> Software can 

Re: 400G forwarding - how does it work?

2022-07-26 Thread dip
mandatory slide of laundry analogy for pipelining
https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html



On Tue, 26 Jul 2022 at 12:41, Lawrence Wobker  wrote:

> [snip]

Re: 400G forwarding - how does it work?

2022-07-26 Thread Lawrence Wobker
>
>
> "Pipeline" in the context of networking chips is not a terribly
> well-defined term.  In some chips, you'll have a pipeline that is built
> from very rigid hardware logic blocks -- the first block does exactly one
> part of the packet forwarding, then hands the packet (or just the header
> and metadata) to the second block, which does another portion of the
> forwarding.  You build the pipeline out of as many blocks as you need to
> solve your particular networking problem, and voila!



"Pipeline", in the context of networking chips, is not a terribly
well-defined term!  In some chips, you'll have an almost-literal pipeline
that is built from very rigid hardware logic blocks.  The first block does
exactly one part of the packet forwarding, then hands the packet (or just
the header and metadata) to the second block, which does another portion of
the forwarding.  You build the pipeline out of as many blocks as you need
to solve your particular networking problem, and voila!
The advantages here are that you can make things very fast and power
efficient, but they aren't all that flexible, and deity help you if you
ever need to do something in a different order than your pipeline!

You can also build a "pipeline" out of software functions - write up some
Python code (because everyone loves Python, right?) where function A calls
function B and so on.  At some level, you've just built a pipeline out of
different software functions.  This is going to be a lot slower (C code
will be faster but nowhere near as fast as dedicated hardware) but it's WAY
more flexible.  You can more or less dynamically build your "pipeline" on a
packet-by-packet basis, depending on what features and packet data you're
dealing with.
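
To make that concrete, a minimal sketch of such a software "pipeline" - each 
stage does one piece of the forwarding decision and hands its result to the 
next (stage names and values are purely illustrative):

    def parse_headers(pkt):
        # stage 1: pull out the fields later stages need
        return {"dst": pkt["dst"], "ttl": pkt["ttl"]}

    def ingress_acl(meta):
        # stage 2: drop anything the (toy) ACL says to drop
        return None if meta["dst"] == "10.0.0.99" else meta

    def route_lookup(meta):
        # stage 3: stand-in for a longest-prefix-match lookup
        meta["egress_port"] = 7
        return meta

    PIPELINE = [parse_headers, ingress_acl, route_lookup]

    def forward(pkt):
        data = pkt
        for stage in PIPELINE:
            data = stage(data)
            if data is None:       # packet dropped mid-pipeline
                return None
        return data

    print(forward({"dst": "192.0.2.1", "ttl": 64}))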

"Microcode" is really just a term we use for something like "really
optimized and limited instruction sets for packet forwarding".  Just like
an x86 or an ARM has some finite set of instructions that it can execute,
so do current networking chips.  The larger that instruction space is and
the more combinations of those instructions you can store, the more
flexible your code is.  Of course, you can't make that part of the chip
bigger without making something else smaller, so there's another tradeoff.

MOST current chips are really a hybrid/combination of these two extremes.
You have some set of fixed logic blocks that do exactly One Set Of Things,
and you have some other logic blocks that can be reconfigured to do A Few
Different Things.  The degree to which the programmable stuff is
programmable is a major input to how many different features you can do on
the chip, and at what speeds.  Sometimes you can use the same hardware
block to do multiple things on a packet if you're willing to sacrifice some
packet rate and/or bandwidth.  The constant "law of physics" is that you
can always do a given function in less power/space/cost if you're willing
to optimize for that specific thing -- but you're sacrificing flexibility
to do it.  The more flexibility ("programmability") you want to add to a
chip, the more logic and memory you need to add.

From a performance standpoint, on current "fast" chips, many (but certainly
not all) of the "pipelines" are designed to forward one packet per clock
cycle for "normal" use cases.  (Of course we sneaky vendors get to decide
what is normal and what's not, but that's a separate issue...)  So if I
have a chip that has one pipeline and it's clocked at 1.25GHz, that means
that it can forward 1.25 billion packets per second.  Note that this does
NOT mean that I can forward a packet in "a one-point-two-five-billionth of
a second" -- but it does mean that every clock cycle I can start on a new
packet and finish another one.  The length of the pipeline impacts the
latency of the chip, although this part of the latency is often a rounding
error compared to the number of times I have to read and write the packet
into different memories as it goes through the system.

So if this pipeline can do 1.25 billion PPS and I want to be able to
forward 10BPPS, I can build a chip that has 8 of these pipelines and get my
performance target that way.  I could also build a "pipeline" that
processes multiple packets per clock, if I have one that does 2
packets/clock then I only need 4 of said pipelines... and so on and so
forth.  The exact details of how the pipelines are constructed and how much
parallelism I built INSIDE a pipeline as opposed to replicating pipelines
is sort of Gooky Implementation Details, but it's a very very important
part of doing the chip level architecture as those sorts of decisions drive
lots of Other Important Decisions in the silicon design...
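
Spelling out that arithmetic as a quick sketch (the pipeline depth is an 
assumed number, purely for illustration):

    clock_hz = 1.25e9
    target_pps = 10e9

    for pkts_per_clock in (1, 2):
        pipelines = target_pps / (clock_hz * pkts_per_clock)
        print(f"{pkts_per_clock} pkt/clock -> {pipelines:.0f} pipelines")

    stages = 20                              # assumed pipeline depth
    latency_ns = stages / clock_hz * 1e9
    print(f"pipeline depth adds ~{latency_ns:.0f} ns of latency")  # throughput unchanged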

--lj


Re: 400G forwarding - how does it work?

2022-07-26 Thread jwbensley+nanog
On 25 July 2022 19:02:50 UTC, Saku Ytti  wrote:
>On Mon, 25 Jul 2022 at 21:51, James Bensley  wrote:
>
>> I have no frame of reference here, but in comparison to Gen 6 Trio of
>> NP5, that seems very high to me (to the point where I assume I am
>> wrong).
>
>No you are right, FP has much much more PPEs than Trio.

Can you give any examples?


>Why choose this NP design instead of Trio design, I don't know. I
>don't understand the upsides.

I think one use case is fixed latency. If you have minimal variation in your 
traffic you can provide a guaranteed upper bound on latency. This should be 
possible with the RTC model too of course, just harder, because any variation 
in traffic at all will result in a different run-time duration. I also imagine 
it is easier to measure, find, and fix/tune chunks of code running on separate 
cores (like in a pipeline) than one larger body of code all running on one core 
(like in RTC). So that's possibly a second benefit: maybe FP is easier to debug 
and measure changes on?
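
A toy illustration of that trade-off (the per-stage budgets and per-packet 
costs are made-up numbers): in a line of PPEs every packet sees the same total 
latency and the slowest stage gates the line's packet rate, while in RTC the 
latency varies with how much work each packet needs:

    stage_budget_ns = [10, 12, 15, 12]               # assumed budget per stage

    pipeline_latency_ns = sum(stage_budget_ns)       # the same for every packet
    pipeline_pps = 1 / (max(stage_budget_ns) * 1e-9) # slowest stage gates the line

    rtc_latency_ns = {"plain IPv4": 30, "IPv6 + many EHs": 400}  # varies per packet

    print(f"pipeline: {pipeline_latency_ns} ns fixed, "
          f"~{pipeline_pps / 1e6:.0f} Mpps per line")
    print(f"RTC: {rtc_latency_ns}")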

>Downside is easy to understand, picture yourself as ucode developer,
>and you get task to 'add this magic feature in the ucode'.
>Implementing it in Trio seems trivial, add the code in ucode, rock on.
>On FP, you might have to go 'aww shit, I need to do this before PPE5
>but after PPE3 in the pipeline, but the instruction cost it adds isn't
>in the budget that I have in the PPE4, crap, now I need to shuffle
>around and figure out which PPE in line runs what function to keep the
>PPS we promise to customer.

That's why we have packet recirc 

>Let's look it from another vantage point, let's cook-up IPv6 header
>with crapton of EH, in Trio, PPE keeps churning it out, taking long
>time, but eventually it gets there or raises exception and gives up.
>Every other PPE in the box is fully available to perform work.
>Same thing in FP? You have HOLB, the PPEs in the line after thisPPE
>are not doing anything and can't do anything, until the PPE before in
>line is done.

This is exactly the benefit of FP vs NPU: less flexible, more throughput. The NPU 
has served us (the industry) well at the edge, and FP is serving us well in the 
core.

>Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL
>before and after lookup, before is normally needed for ingressACL but
>after lookup ingressACL is needed for CoPP (we only know after lookup
>if it is control-plane packet). Nokia doesn't do this at all, and I
>bet they can't do it, because if they'd add it in the core where it
>needs to be in line, total PPS would go down. as there is no budget
>for additional ACL. Instead all control-plane packets from ingressFP
>are sent to control plane FP, and inshallah we don't congest the
>connection there or it.

Interesting.

Cheers,
James.


Re: Akamai Peering

2022-07-26 Thread Jared Mauch
I'll provide a bit more detail - We have certainly been
standardizing on 100G for a number of years now and have a decreasing
number of devices where 10G is appropriate.

for public peering we certainly do have an open peering policy,
if you are encountering an issue please reach out and I can identify
what the root cause is.  If you have a ticket number, etc.. that can
help as well.  I don't personally monitor the ticket queue.

For private interconnect, 100G is the port speed for most of the
americas, some markets may vary.

For public peering, so much depends on the IX/IXP.  EdgeconneX
in Denver does not have access to the Denver IX and we are working to
extend there.  There's at least 4 different sites in Denver for
interconnection, and it's impractical to be in them all.

Some more details would be helpful (in private) so we can move
the traffic to a direct path.

If you have a 10G port and want to swap it to a 100G port, we
should have that conversation.

- Jared

On Tue, Jul 26, 2022 at 08:27:09AM -0500, Paul Emmons wrote:
> Akamai isn't supporting 10g ports on IXPs.  I'd be surprised if they allowed
> it on PNIs.  As for not being on the IXPs, that's odd.
> 
> On Tue, Jul 26, 2022 at 8:23 AM Jawaid Bazyar 
> wrote:
> 
> > [snip]

-- 
Jared Mauch  | pgp key available via finger from ja...@puck.nether.net
clue++;  | http://puck.nether.net/~jared/  My statements are only mine.


Re: Akamai Peering

2022-07-26 Thread Paul Emmons
Akamai isn't supporting 10g ports on IXPs.  I'd be surprised if they allowed
it on PNIs.  As for not being on the IXPs, that's odd.

On Tue, Jul 26, 2022 at 8:23 AM Jawaid Bazyar 
wrote:

> [snip]


Akamai Peering

2022-07-26 Thread Jawaid Bazyar
Hi,

We had Akamai servers in our data center for many years until a couple years 
ago, when they said they’d changed their policies and decommissioned the 
servers.

I understand that maintaining many server sites and being responsible for that 
hardware, even if you pay nothing for power or colocation, must be costly. And 
at the time, we didn’t have much traffic to them.

Today, however, we’re hitting 6 Gbps with them nightly. Not sure what traffic 
it is they’re hosting but it’s surely video of some sort.

We are in the same data center with them, Edgeconnex Denver, and they refuse to 
peer because they say their minimum traffic level for peering is 30 Gbps.

Their peeringdb entry says “open peering”, and in my book that’s not open 
peering.

So this seems to be exactly backward from where every other major content 
provider is going – free peering with as many eyeball networks as possible.

Google – no bandwidth minimum, and they cover costs on 1st and every other 
cross connect
Amazon – peers at two Denver IX
Apple – peers at two Denver IX
Netflix – free peering everywhere

And, on top of that, Akamai is not at either of the two Denver exchange points, 
which push together probably half a terabit of traffic.

What is the financial model for Akamai to restrict peering this way? Surely 
it’s not the 10G ports and optics, which are cheap as dirt these days.

Doesn’t this policy encourage eyeballs to move this traffic to their cheapest 
possible transit links, with a potential degradation of service for Akamai’s 
content customers?

Thanks for the insight,

Jawaid


Jawaid Bazyar
Chief Technical Officer
VERO Broadband
303-815-1814
jbaz...@verobroadband.com
https://verobroadband.com
2347 Curtis St, Denver, CO 80205



RE: 400G forwarding - how does it work?

2022-07-26 Thread Vasilenko Eduard via NANOG
Pipeline stages are like separate computers (with their own ALU) sharing the 
same memory.
In the ASIC case, the computers are of different types (different capabilities).
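
A minimal sketch of that picture, with stage names and fields as illustrative 
placeholders only - each "computer" reads and updates the same shared 
per-packet record:

    meta = {"dst": "192.0.2.1", "vrf": 0}      # the shared, enriched header

    def stage_parse(m):  m["l3_offset"] = 14
    def stage_lookup(m): m["egress_port"] = 7
    def stage_qos(m):    m["queue"] = 3

    for stage in (stage_parse, stage_lookup, stage_qos):
        stage(meta)                            # each stage works on the same record
    print(meta)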

From: Etienne-Victor Depasquale [mailto:ed...@ieee.org]
Sent: Tuesday, July 26, 2022 2:05 PM
To: Saku Ytti 
Cc: Vasilenko Eduard ; NANOG 
Subject: Re: 400G forwarding - how does it work?

How do you define a pipeline?

[snip]

Re: 400G forwarding - how does it work?

2022-07-26 Thread Etienne-Victor Depasquale via NANOG
>
> How do you define a pipeline?


For what it's worth, and
with just a cursory look through this email, and
without wishing to offend anyone's knowledge:

a pipeline in processing is the division of the instruction cycle into a
number of stages.
General purpose RISC processors are often organized into five such stages.
Under optimal conditions,
which can be fairly, albeit loosely,
interpreted as "one instruction does not affect its peers which are already
in one of the stages",
then a pipeline can increase the number of instructions retired per second,
often quoted as MIPS (millions of instructions per second)
by a factor equal to the number of stages in the pipeline.
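
A worked version of that statement, with arbitrary numbers: once the pipeline 
is full, one instruction retires per cycle, so the steady-state speedup 
approaches the number of stages:

    cycle_ns = 1.0
    stages = 5                 # classic five-stage RISC pipeline
    instructions = 1_000_000

    unpipelined_ns = instructions * stages * cycle_ns
    pipelined_ns = (stages + instructions - 1) * cycle_ns
    print(f"ideal speedup ~= {unpipelined_ns / pipelined_ns:.2f}x")   # -> ~5x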


Cheers,

Etienne


On Tue, Jul 26, 2022 at 10:56 AM Saku Ytti  wrote:

> [snip]

RE: 400G forwarding - how does it work?

2022-07-26 Thread Vasilenko Eduard via NANOG
Nope, ASIC vendors are not ARM-based for the PFE. Every “stage” is a very 
specialized ASIC with limited programmability (not so limited for P4 and some 
latest-generation ASICs).
ARM cores are for Network Processors (NPs). ARM cores (with the proper 
microcode) could emulate any “stage” of an ASIC. That is the typical 
explanation for why NPs are more flexible than an ASIC.

Stages are connected to the common internal memory where enriched packet 
headers are stored. The pipeline is just the order of the stages that process 
these internal enriched headers.
The size of this internal header is a critical restriction of the ASIC, never 
disclosed or discussed (but people know it anyway for the most popular ASICs – 
it is possible to google “key buffer”).
Hint: the smallest one in the industry is 128 bytes, the biggest 384 bytes. It 
is not possible to process longer headers in one PFE pass.
A non-compressed SRv6 header could be 208 bytes (+TCP/UDP +VLAN +L2 
+ASIC-internal stuff). Hence the need for compressed SRv6.
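
One plausible way to reach that 208-byte figure (the SID-list depth here is an 
assumption, purely for illustration):

    ipv6_fixed = 40         # bytes, fixed IPv6 header
    srh_fixed = 8           # SRH base header
    sid_bytes = 16          # one 128-bit SID
    sid_count = 10          # assumed SID-list depth

    srv6_bytes = ipv6_fixed + srh_fixed + sid_count * sid_bytes
    print(srv6_bytes)       # 208 -> already past a 128-byte key buffer,
                            # before adding L2/VLAN/L4 fields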

It was a big marketing announcement from one famous ASIC vendor just a few 
years ago that some ASIC stages are capable of dynamically sharing a common big 
external memory (used for MAC/IP/filters).
It may be internal memory too for small scale, but typically it is external 
memory. This memory is always discussed in detail – it is needed by the 
operations team.

This is only about headers. The packet itself (the payload) is stored in a 
separate memory (the buffer) that is not visible to the pipeline stages.

There were times when it was difficult to squeeze everything into one ASIC. 
Then one chip would prepare the internal (enriched) header and maybe do some 
processing (some simple stages), then send this header to the next chip for the 
other “stages” (especially the complicated lookup with external memory 
attached). That is mostly a historical artifact now.

Ed/
From: Saku Ytti [mailto:s...@ytti.fi]
Sent: Tuesday, July 26, 2022 11:53 AM
To: Vasilenko Eduard 
Cc: James Bensley ; NANOG 
Subject: Re: 400G forwarding - how does it work?


[snip]

Re: 400G forwarding - how does it work?

2022-07-26 Thread Saku Ytti
On Tue, 26 Jul 2022 at 10:52, Vasilenko Eduard 
wrote:

> Juniper is pipeline-based too (like any ASIC). They just invented one
> special stage in 1996 for lookup (sequence search by nibble in the big
> external memory tree) – it was public information up to 2000year. It is a
> different principle from TCAM search – performance is traded for
> flexibility/simplicity/cost.
>

How do you define a pipeline? My understanding is that the fabric and WAN
connections are in a chip called MQ; the 'head' of the packet, being some 320B
or so (a bit less on more modern Trio, didn't measure specifically), is then
sent to the LU complex for lookup.
LU then sprays packets to one of many PPEs, but once a packet hits a PPE, it is
processed until done; it doesn't jump to another PPE.
Reordering will occur, which is later restored for flows, but outside flows
reorder may remain.
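
A toy sketch of that spray-then-restore behaviour (the flows and completion 
order below are invented): packets finish out of order across PPEs, and order 
is put back per flow only:

    from collections import defaultdict

    completed = [("flowB", 2), ("flowA", 3), ("flowA", 1), ("flowB", 4)]

    per_flow = defaultdict(list)
    for flow, seq in completed:
        per_flow[flow].append(seq)

    for flow in per_flow:
        per_flow[flow].sort()        # order restored within each flow only
    print(dict(per_flow))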

I don't know what the cores are, but I'm comfortable to bet money they are
not ARM. I know Cisco used ezchip in ASR9k but is now jumping to their own
NPU called Lightspeed, and Lightspeed, like CRS-1 and ASR1k, uses Tensilica
cores, which are decidedly not ARM.

Nokia, as mentioned, kind of has a pipeline, because a single packet hits
every core in the line, and each core does a separate thing.

>
>
> Network Processors emulate stages on general-purpose ARM cores. It is a
> pipeline too (different cores for different functions, many cores for every
> function), just it is a virtual pipeline.
>
>
>
> Ed/
>
> [snip]


-- 
  ++ytti


RE: 400G forwarding - how does it work?

2022-07-26 Thread Vasilenko Eduard via NANOG
All high-performance networking devices on the market have a pipeline 
architecture.

The pipeline consists of "stages".



ASICs have stages fixed to particular functions.

Well, some stages are driven by code these days (a little flexibility).



Juniper is pipeline-based too (like any ASIC). They just invented one special 
stage in 1996 for lookup (sequential search by nibble in a big external memory 
tree) – it was public information up to the year 2000. It is a different 
principle from TCAM search – performance is traded for 
flexibility/simplicity/cost.



Network Processors emulate stages on general-purpose ARM cores. It is a 
pipeline too (different cores for different functions, many cores for every 
function), it is just a virtual pipeline.



Ed/

-Original Message-
From: NANOG [mailto:nanog-bounces+vasilenko.eduard=huawei@nanog.org] On 
Behalf Of Saku Ytti
Sent: Monday, July 25, 2022 10:03 PM
To: James Bensley 
Cc: NANOG 
Subject: Re: 400G forwarding - how does it work?



On Mon, 25 Jul 2022 at 21:51, James Bensley  wrote:

> I have no frame of reference here, but in comparison to Gen 6 Trio of
> NP5, that seems very high to me (to the point where I assume I am
> wrong).

No you are right, FP has much much more PPEs than Trio.

For fair calculation, you compare how many lines FP has to PPEs in Trio. 
Because in Trio single PPE handles entire packet, and all PPEs run identical 
ucode, performing same work.

In FP each PPE in line has its own function, like first PPE in line could be 
parsing the packet and extracting keys from it, second could be doing 
ingressACL, 3rd ingressQoS, 4th ingress lookup and so forth.

Why choose this NP design instead of Trio design, I don't know. I don't 
understand the upsides.

Downside is easy to understand, picture yourself as ucode developer, and you 
get task to 'add this magic feature in the ucode'.
Implementing it in Trio seems trivial, add the code in ucode, rock on.
On FP, you might have to go 'aww shit, I need to do this before PPE5 but after 
PPE3 in the pipeline, but the instruction cost it adds isn't in the budget that 
I have in the PPE4, crap, now I need to shuffle around and figure out which PPE 
in line runs what function to keep the PPS we promise to customer.

Let's look it from another vantage point, let's cook-up IPv6 header with 
crapton of EH, in Trio, PPE keeps churning it out, taking long time, but 
eventually it gets there or raises exception and gives up.
Every other PPE in the box is fully available to perform work.
Same thing in FP? You have HOLB, the PPEs in the line after thisPPE are not 
doing anything and can't do anything, until the PPE before in line is done.

Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL before 
and after lookup, before is normally needed for ingressACL but after lookup 
ingressACL is needed for CoPP (we only know after lookup if it is control-plane 
packet). Nokia doesn't do this at all, and I bet they can't do it, because if 
they'd add it in the core where it needs to be in line, total PPS would go 
down. as there is no budget for additional ACL. Instead all control-plane 
packets from ingressFP are sent to control plane FP, and inshallah we don't 
congest the connection there or it.

> Cheers,
> James.

--
  ++ytti