Re: Why do soft interrupt coalescing?

2001-10-17 Thread Kenneth D. Merry

On Mon, Oct 15, 2001 at 11:35:51 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
[ ... ]
> > > This is actually very bad: you want to drop packets before you
> > > insert them into the queue, rather than after they are in the
> > > queue.  This is because you want the probability of the drop
> > > (assuming the queue is not maxed out: otherwise, the probability
> > > should be 100%) to be proportional to the exponential moving
> > > average of the queue depth, after that depth exceeds a drop
> > > threshold.  In other words, you want to use RED.
> > 
> > Which queue?  The packets are dropped before they get to ether_input().
> 
> The easy answer is "any queue", since what you are becoming
> concerned with is pool retention time: you want to throw
> away packets before a queue overflow condition makes it so
> you are getting more than you can actually process.

Ahh.

[ ...RED... ]

> > > Maybe I'm being harsh in calling it "spam'ming".  It does the
> > > wrong thing, by dropping the oldest unprocessed packets first.
> > > A FIFO drop is absolutely the wrong thing to do in an attack
> > > or overload case, when you want to shed load.  I consider the
> > > packet that is being dropped to have been "spam'med" by the
> > > card replacing it with another packet, rather than dropping
> > > the replacement packet instead.
> > >
> > > The real place for this drop is "before it gets to card memory",
> > > not "after it is in host memory"; Floyd, Jacobsen, Mogul, etc.,
> > > all agree on that.
> > 
> > As I mentioned above, how would you do that without some sort of traffic
> > shaper on the wire?
> 
> The easiest answer is to RED queue in the card firmware.

Ahh, but then the packets are likely going to be in card memory already.
A card with a reasonable amount of cache (e.g. the Tigon II) and onboard
firmware will probably dump the packet into an already-set-up spot in
memory.

You'd probably need a special mode of interaction between the firmware and
the packet receiving hardware to tell it when to drop packets.

> > My focus with gigabit ethernet was to get maximal throughput out of a small
> > number of connections.  Dealing with a large number of connections is a
> > different problem, and I'm sure it uncovers lots of nifty bugs.
> 
> 8-).  I guess that you are more interested in intermediate hops
> and site to site VPN, while I'm more interested in connection
> termination (big servers, SSL termination, and single client VPN).

Actually, the specific application I worked on for my former employer was
moving large amounts (many gigabytes) of video at high speed via FTP
between FreeBSD-based video servers.  (And between the video servers and
video editor PCs, and data backup servers.)

It actually worked fairly well, and is in production at a number of TV
stations now. :)

> > > I'd actually prefer to avoid the other DMA; I'd also like
> > > to avoid the packet receipt order change that results from
> > > DMA'ing over the previous contents, in the case that an mbuf
> > > can't be allocated.  I'd rather just let good packets in with
> > > a low (but non-zero) statistical probability, relative to a
> > > slew of bad packets, rather than letting a lot of bad packets
> > > from a persistent attacker push my good data out with the bad.
> > 
> > Easier said than done -- dumping random packets would be difficult with a
> > ring-based structure.  Probably what you'd have to do is have an extra pool
> > of mbufs lying around that would get thrown in at random times when mbufs
> > run out to allow some packets to get through.
> > 
> > The problem is, once you exhaust that pool, you're back to the same old
> > problem if you're completely out of mbufs.
> > 
> > You could probably also start shrinking the number of buffers in the ring,
> > but as I said before, you'd need a method for the kernel to notify the
> > driver that more mbufs are available.
> 
> You'd be better off shrinking the window size across all
> the connections, I think.
> 
> As to difficult to do, I actually have RED queue code, which
> I adapted from the formula in a paper.  I have no problem
> giving that code out.
> 
> The real issue is that the BSD queue macros involved in the
> queues really need to be modified to include an "elements on
> queue" count for the calculation of the moving average.

[ ... ]

> > > OK, I will rediff and generate context diffs; expect them to
> > > be sent in 24 hours or so from now.
> > 
> > It's been longer than that...
> 
> Sorry; I've been doing a lot this weekend.  I will redo them
> at work today, and resend them tonight... definitely.
> 
> 
> > > > Generally the ICMP response tells you how big the maximum MTU is, so you
> > > > don't have to guess.
> > >
> > > Maybe it's the ICMP response; I still haven't had a chance to
> > > hold Michael down and drag the information out of him.  8-).
> > 
> > Maybe what's the ICMP response?
> 
> The difference between working and not working.

Yes, with TCP, the ICMP re

Re: Why do soft interrupt coalescing?

2001-10-15 Thread Terry Lambert

"Kenneth D. Merry" wrote:
> Dropping packets before they get into card memory would only be possible
> with some sort of traffic shaper/dropping mechanism on the wire to drop
> things before they get to the card at all.

Actually, DEC had a congestion control mechanism that worked by
marking all packets over a certain level of congestion (this
was sometimes called the "DECbit" approach).  You can do the
same thing with any intermediate hop router, so long as it is
better at moving packets than your destination host.

It turns out that even if the intermediate hop and the host at
the destination are running the same hardware and OS, the cost
is going to be higher to do the terminal processing than it is
to do the network processing, so you are able to use the tagging
to indicate to the terminal hop which flows to drop packets out
of before processing.

Cisco routers can do this (using the CMU firmware) on a per
flow basis, leaving policy up to the end node.  Very neat.
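
A toy version of the marking decision (field and threshold names are
invented here): over a threshold, the hop sets a bit instead of
dropping, and the endpoints shrink their windows when the bit comes
back to them.

#define MARK_THRESHOLD 32

struct dhdr {
        unsigned int ce;        /* "congestion experienced" bit */
};

static void
mark_if_congested(struct dhdr *h, int avg_depth)
{
        if (avg_depth >= MARK_THRESHOLD)
                h->ce = 1;      /* mark, don't drop */
}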


[ ... per connection overhead, and overcommit ... ]

> You could always just put 4G of RAM in the machine, since memory is so
> cheap now. :)
> 
> At some point you'll hit a limit in the number of connections the processor
> can actually handle.

In practice, particularly for HTTP or FTP flows, you can
halve the amount of memory expected to be used.  This is
because the vast majority of the data is generally pushed
in only one direction.

For HTTP 1.1 persistent connections, you can, for the most
part, also assume that the connections are bursty -- that
is, that there is a human attached to the other end, who
will spend some time examining the content before making
another request (you can assume the same thing for 1.0, but
that doesn't count against persistent connection count,
unless you also include time spent in TIME_WAIT).

So overcommit turns out to be O.K. -- which is what I was
trying to say in a back-handed way, in the last post.

If you include window control (i.e. you care about overall
throughput, and not about individual connections), then
you can safely service 1,000,000 connections with 4G on a
FreeBSD box.
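
The arithmetic behind that claim, ignoring pcb/vnode overhead (which
eats into the budget in practice):

#include <stdio.h>

int
main(void)
{
        long long mem = 4LL * 1024 * 1024 * 1024;       /* 4G of RAM */
        long conns = 1000000;
        long per_conn = (long)(mem / conns);            /* ~4294 bytes */

        /* roughly 2K per direction, so the 16K default windows
           have to be clamped way down for this to work */
        printf("%ld bytes/connection, ~%ld per direction\n",
            per_conn, per_conn / 2);
        return (0);
}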


> > This is actually very bad: you want to drop packets before you
> > insert them into the queue, rather than after they are in the
> > queue.  This is because you want the probability of the drop
> > (assuming the queue is not maxed out: otherwise, the probability
> > should be 100%) to be proportional to the exponential moving
> > average of the queue depth, after that depth exceeds a drop
> > threshold.  In other words, you want to use RED.
> 
> Which queue?  The packets are dropped before they get to ether_input().

The easy answer is "any queue", since what you are becoming
concerned with is pool retention time: you want to throw
away packets before a queue overflow condition makes it so
you are getting more than you can actually process.


> Dropping random packets would be difficult.

The "R" in "RED" is "Random" for "Random Early Detection" (or
"Random Early Drop", for a minority of the literature), true.

But the randomness involved is whether you drop vs. insert a
given packet, not whether you drop a random packet from the
queue contents.  Really dropping random queue elements would
be incredibly bad.

The problem is that, during an attack, the number of packets
you get is proportionally huge, compared to the non-attack
packets (the ones you want to let through).  A RED approach
will prevent new packets being enqueued: it protects the host
system's ability to continue degraded processing, by making
the link appear "lossy" -- the closer to queue full, the
more lossy the link.

If you were to drop random packets already in the queue,
then the proportional probability of dumping good packets is
equal to the queue depth times the number of bad packets
divided by the number of total packets.  In other words, a
continuous attack will almost certainly push all good packets
out of the queue before they reach the head.

Dropping packets prior to insertion maintains the ratio of
bad and good packets, so it doesn't inflate the danger to
the good packets by the relative queue depth: thus dropping
before insertion is a significantly better strategy than
dropping after insertion, for any queue depth over 1.
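
A minimal sketch of the drop-before-insert decision, adapted loosely
from the Floyd/Jacobson description (the real algorithm also scales
the probability by the count of packets since the last drop; the
thresholds and weight here are illustrative):

#include <stdlib.h>

#define RED_MINTH   32          /* start early drops above this avg */
#define RED_MAXTH   128         /* force drops above this avg */
#define RED_WSHIFT  8           /* EWMA weight = 1/256 */

struct red_q {
        int depth;              /* instantaneous queue depth */
        int limit;              /* hard queue limit */
        int avg_fp;             /* EWMA of depth, scaled by 2^RED_WSHIFT */
};

/* returns 1 if the arriving packet should be dropped, not enqueued */
static int
red_should_drop(struct red_q *q)
{
        int avg;

        q->avg_fp += ((q->depth << RED_WSHIFT) - q->avg_fp) >> RED_WSHIFT;
        avg = q->avg_fp >> RED_WSHIFT;

        if (q->depth >= q->limit || avg >= RED_MAXTH)
                return (1);     /* queue maxed out: probability 100% */
        if (avg < RED_MINTH)
                return (0);     /* below the drop threshold */
        /* probability ramps linearly between the two thresholds */
        return (rand() % (RED_MAXTH - RED_MINTH) < avg - RED_MINTH);
}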

> > Maybe I'm being harsh in calling it "spam'ming".  It does the
> > wrong thing, by dropping the oldest unprocessed packets first.
> > A FIFO drop is absolutely the wrong thing to do in an attack
> > or overload case, when you want to shed load.  I consider the
> > packet that is being dropped to have been "spam'med" by the
> > card replacing it with another packet, rather than dropping
> > the replacement packet instead.
> >
> > The real place for this drop is "before it gets to card memory",
> > not "after it is in host memory"; Floyd, Jacobsen, Mogul, etc.,
> > all agree on that.
> 
> As I mentioned above, how would you do that without some sort of traffic
> shaper on the wire?

The easiest answer is to RED queue in the card firmware.

Re: Why do soft interrupt coalescing?

2001-10-14 Thread Kenneth D. Merry

On Thu, Oct 11, 2001 at 01:02:09 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > If the receive ring for that packet size is full, it will hold off on
> > DMAs.  If all receive rings are full, there's no reason to send more
> > interrupts.
> 
> I think that this does nothing, in the FreeBSD case, since the
> data from the card will generally be drained much faster than
> it accrues, into the input queue.  Whether it gets processed
> out of there before you run out of mbufs is another matter.
> 
> [ ... ]
> 
> > Anyway, if all three rings fill up, then yes, there won't be a reason to
> > send receive interrupts.
> 
> I think this can't really happen, since interrupt processing
> has the highest priority, compared to stack processing or
> application level processing.  8-(.

Yep, it doesn't happen very often in the default case.

> > > OK, assuming you meant that the copies would stall, and the
> > > data not be copied (which is technically the right thing to
> > > do, assuming a source quench style livelock avoidance, which
> > > doesn't currently exist)...
> > 
> > The data isn't copied, it's DMAed from the card to host memory.  The card
> > will save incoming packets to a point, but once it runs out of memory to
> > store them it starts dropping packets altogether.
> 
> I think that the DMA will not be stalled, at least as the driver
> currently exists; you and I agreed on that already (see below).
> My concern in this case is that, if the card is using the bus to
> copy packets from card memory to the receive ring, then the bus
> isn't available for other work, which is bad.  It's better to
> drop the packets before putting them in card memory (FIFO drop
> fails to avoid the case where a continuous attack pushes all
> good packets out).

Dropping packets before they get into card memory would only be possible
with some sort of traffic shaper/dropping mechanism on the wire to drop
things before they get to the card at all.

> > > The problem is still that you end up doing interrupt processing
> > > until you run out of mbufs, and then you have the problem of
> > > not being able to transmit responses, for lack of mbufs.
> > 
> > In theory you would have configured your system with enough mbufs
> > to handle the situation, and the slowness of the system would cause
> > the windows on the sender to fill up, so they'll stop sending data
> > until the receiver starts responding again.  That's the whole purpose
> > of backoff and slow start -- to find a happy medium for the
> > transmitter and receiver so that data flows at a constant rate.
> 
> In practice, mbuf memory is just as overcommitted as all other
> memory, and given a connection count target, you are talking a
> full transmit and full receive window worth of data at 16k a
> pop -- 32k per connection.
> 
> Even a modest maximum connection count of ~30,000 connections --
> something even an unpatched 4.3 FreeBSD could handle -- means
> that you need 1G of RAM for the connections alone, if you disallow
> overcommit.  In practice, that would mean ~20,000 connections,
> when you count page table entries, open file table entries, vnodes,
> inpcb's, tcpcb's, etc.  And that's a generous estimate, which
> assumes that you tweak your kernel properly.

You could always just put 4G of RAM in the machine, since memory is so
cheap now. :)

At some point you'll hit a limit in the number of connections the processor
can actually handle.

> One approach to this is to control the window sizes based on
> the amount of free reserve you have available, but this will
> actually damage overall throughput, particularly on links
> with a higher latency.

Yep.

> > > In the ti driver case, the inability to get another mbuf to
> > > replace the one that will be taken out of the ring means that
> > > the mbuf gets reused for more data -- NOT that the data flow
> > > in the form of DMA from the card ends up being halted until
> > > mbufs become available.
> > 
> > True.
> 
> This is actually very bad: you want to drop packets before you
> insert them into the queue, rather than after they are in the
> queue.  This is because you want the probability of the drop
> (assuming the queue is not maxed out: otherwise, the probability
> should be 100%) to be proportional to the exponential moving
> average of the queue depth, after that depth exceeds a drop
> threshold.  In other words, you want to use RED.

Which queue?  The packets are dropped before they get to ether_input().

Dropping random packets would be difficult.

> > > Please look at what happens in the case of an allocation
> > > failure, for any driver that does not allow shrinking the
> > > ring of receive mbufs (the ti is one example).
> > 
> > It doesn't spam things, which is what you were suggesting before, but
> > as you pointed out, it will effectively drop packets if it can't get new
> > mbufs.
> 
> Maybe I'm being harsh in calling it "spam'ming".  It does the
> wrong thing, by dropping the oldest unprocessed packets first.

Re: Why do soft interrupt coalescing?

2001-10-11 Thread Terry Lambert

"Kenneth D. Merry" wrote:
> If the receive ring for that packet size is full, it will hold off on
> DMAs.  If all receive rings are full, there's no reason to send more
> interrupts.

I think that this does nothing, in the FreeBSD case, since the
data from the card will generally be drained much faster than
it accrues, into the input queue.  Whether it gets processed
out of there before you run out of mbufs is another matter.

[ ... ]

> Anyway, if all three rings fill up, then yes, there won't be a reason to
> send receive interrupts.

I think this can't really happen, since interrupt processing
has the highest priority, compared to stack processing or
application level processing.  8-(.


> > OK, assuming you meant that the copies would stall, and the
> > data not be copied (which is technically the right thing to
> > do, assuming a source quench style livelock avoidance, which
> > doesn't currently exist)...
> 
> The data isn't copied, it's DMAed from the card to host memory.  The card
> will save incoming packets to a point, but once it runs out of memory to
> store them it starts dropping packets altogether.

I think that the DMA will not be stalled, at least as the driver
currently exists; you and I agreed on that already (see below).
My concern in this case is that, if the card is using the bus to
copy packets from card memory to the receive ring, then the bus
isn't available for other work, which is bad.  It's better to
drop the packets before putting them in card memory (FIFO drop
fails to avoid the case where a continuous attack pushes all
good packets out).


> > The problem is still that you end up doing interrupt processing
> > until you run out of mbufs, and then you have the problem of
> > not being able to transmit responses, for lack of mbufs.
> 
> In theory you would have configured your system with enough mbufs
> to handle the situation, and the slowness of the system would cause
> the windows on the sender to fill up, so they'll stop sending data
> until the receiver starts responding again.  That's the whole purpose
> of backoff and slow start -- to find a happy medium for the
> transmitter and receiver so that data flows at a constant rate.

In practice, mbuf memory is just as overcommitted as all other
memory, and given a connection count target, you are talking a
full transmit and full receive window worth of data at 16k a
pop -- 32k per connection.

Even a modest maximum connection count of ~30,000 connections --
something even an unpatched 4.3 FreeBSD could handle -- means
that you need 1G of RAM for the connections alone, if you disallow
overcommit.  In practice, that would mean ~20,000 connections,
when you count page table entries, open file table entries, vnodes,
inpcb's, tcpcb's, etc.  And that's a generous estimate, which
assumes that you tweak your kernel properly.

One approach to this is to control the window sizes based on
the amount of free reserve you have available, but this will
actually damage overall throughput, particularly on links
with a higher latency.


> > In the ti driver case, the inability to get another mbuf to
> > replace the one that will be taken out of the ring means that
> > the mbuf gets reused for more data -- NOT that the data flow
> > in the form of DMA from the card ends up being halted until
> > mbufs become available.
> 
> True.

This is actually very bad: you want to drop packets before you
insert them into the queue, rather than after they are in the
queue.  This is because you want the probability of the drop
(assuming the queue is not maxed out: otherwise, the probability
should be 100%) to be proportional to the exponential moving
average of the queue depth, after that depth exceeds a drop
threshold.  In other words, you want to use RED.


> > Please look at what happens in the case of an allocation
> > failure, for any driver that does not allow shrinking the
> > ring of receive mbufs (the ti is one example).
> 
> It doesn't spam things, which is what you were suggesting before, but
> as you pointed out, it will effectively drop packets if it can't get new
> mbufs.

Maybe I'm being harsh in calling it "spam'ming".  It does the
wrong thing, by dropping the oldest unprocessed packets first.
A FIFO drop is absolutely the wrong thing to do in an attack
or overload case, when you want to shed load.  I consider the
packet that is being dropped to have been "spam'med" by the
card replacing it with another packet, rather than dropping
the replacement packet instead.

The real place for this drop is "before it gets to card memory",
not "after it is in host memory"; Floyd, Jacobsen, Mogul, etc.,
all agree on that.


> Yes, it could shrink the pool, by just not replacing those mbufs in the
> ring (and therefore not notifying the card that that slot is available
> again), but then it would likely need some mechanism to allow it to be
> notified that another buffer is available for it, so it can then allocate
> receive buffers again.
> 
> In practice I haven't found the number of mbufs in the system to be a
> problem that would really make this a big issue.

Re: Why do soft interrupt coalescing?

2001-10-10 Thread Kenneth D. Merry

On Wed, Oct 10, 2001 at 01:59:48 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > eh?  The card won't write past the point that has been acked by the kernel.
> > If the kernel hasn't acked the packets and one of the receive rings fills
> > up, the card will hold off on sending packets up to the kernel.
> 
> Uh, eh?
> 
> You mean the card will hold off on DMA and interrupts?  This
> has not been my experience.  Is this with firmware other than
> the default and/or the 4.3-RELEASE FreeBSD driver?

If the receive ring for that packet size is full, it will hold off on
DMAs.  If all receive rings are full, there's no reason to send more
interrupts.

Keep in mind that there are three receive rings by default on the Tigon
boards -- mini, standard and jumbo.  The size of the buffers in each ring
is configurable, but basically all packets smaller than a certain size will
get routed into the mini ring.  All packets larger than a certain size will
get routed into the jumbo ring.  All packets in between will get routed
into the standard ring.

If there isn't enough space in the mini or jumbo rings for a packet, it'll
get routed into the standard ring if there is space there.  (In the case of
a jumbo packet, it may take up multiple buffers on the ring.)
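
The size-based routing amounts to something like this (cutoffs
invented; the real firmware uses its own, and the fallback is the
standard ring as described above):

#define MINI_MAX 128            /* <= this goes to the mini ring */
#define STD_MAX  1536           /* <= this goes to the standard ring */

enum ring { RING_MINI, RING_STD, RING_JUMBO };

static enum ring
pick_ring(int len, int mini_full, int jumbo_full)
{
        if (len <= MINI_MAX && !mini_full)
                return (RING_MINI);
        if (len > STD_MAX && !jumbo_full)
                return (RING_JUMBO);
        return (RING_STD);      /* middle sizes, or overflow fallback */
}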

Anyway, if all three rings fill up, then yes, there won't be a reason to
send receive interrupts.

> > I agree that you can end up spending large portions of your time doing
> > interrupt processing, but I haven't seen instances of "buffer overrun", at
> > least not in the kernel.  The case where you'll see a "buffer overrun", at
> > least with the ti(4) driver, is when you have a sender that's faster than
> > the receiver.
> > 
> > So the receiver can't process the data in time and the card just drops
> > packets.
> 
> OK, assuming you meant that the copies would stall, and the
> data not be copied (which is technically the right thing to
> do, assuming a source quench style livelock avoidance, which
> doesn't currently exist)...

The data isn't copied, it's DMAed from the card to host memory.  The card
will save incoming packets to a point, but once it runs out of memory to
store them it starts dropping packets altogether.

> The problem is still that you end up doing interrupt processing
> until you run out of mbufs, and then you have the problem of
> not being able to transmit responses, for lack of mbufs.

In theory you would have configured your system with enough mbufs to handle
the situation, and the slowness of the system would cause the windows on
the sender to fill up, so they'll stop sending data until the receiver
starts responding again.  That's the whole purpose of backoff and slow
start -- to find a happy medium for the transmitter and receiver so that
data flows at a constant rate.

> In the ti driver case, the inability to get another mbuf to
> replace the one that will be taken out of the ring means that
> the mbuf gets reused for more data -- NOT that the data flow
> in the form of DMA from the card ends up being halted until
> mbufs become available.

True.

> The real problem here is that most received packets want a
> response; for things like web servers, where the average
> request is ~.5k and the average response is ~11k, this means
> that you would need to establish use-based watermarking, to
> separate the mbuf pool into transmit and receive resources;
> in practice, this doesn't really work, if you are getting
> your content from a separate server (e.g. an NFS server that
> provides content for a web farm, etc.).
> 
> 
> > That's a different situation from the card spamming the receive
> > ring over and over again, which is what you're describing.  I've
> > never seen that happen, and if it does actually happen, I'd be
> > interested in seeing evidence.
> 
> Please look at what happens in the case of an allocation
> failure, for any driver that does not allow shrinking the
> ring of receive mbufs (the ti is one example).

It doesn't spam things, which is what you were suggesting before, but
as you pointed out, it will effectively drop packets if it can't get new
mbufs.

Yes, it could shrink the pool, by just not replacing those mbufs in the
ring (and therefore not notifying the card that that slot is available
again), but then it would likely need some mechanism to allow it to be
notified that another buffer is available for it, so it can then allocate
receive buffers again.

In practice I haven't found the number of mbufs in the system to be a
problem that would really make this a big issue.  I generally configure the
number of mbufs to be high enough that it isn't a problem in the first
place.

> > > Without hacking firmware, the best you can do is to ensure
> > > that you process as much of all the traffic as you possibly
> > > can, and that means avoiding livelock.
> > 
> > Uhh, the Tigon firmware *does* drop packets when there is no
> > more room in the proper receive ring on the host side.  It
> > doesn't spam things.
> >

Re: Why do soft interrupt coalescing?

2001-10-10 Thread Terry Lambert

"Kenneth D. Merry" wrote:
> eh?  The card won't write past the point that has been acked by the kernel.
> If the kernel hasn't acked the packets and one of the receive rings fills
> up, the card will hold off on sending packets up to the kernel.

Uh, eh?

You mean the card will hold off on DMA and interrupts?  This
has not been my experience.  Is this with firmware other than
the default and/or the 4.3-RELEASE FreeBSD driver?


> I agree that you can end up spending large portions of your time doing
> interrupt processing, but I haven't seen instances of "buffer overrun", at
> least not in the kernel.  The case where you'll see a "buffer overrun", at
> least with the ti(4) driver, is when you have a sender that's faster than
> the receiver.
> 
> So the receiver can't process the data in time and the card just drops
> packets.

OK, assuming you meant that the copies would stall, and the
data not be copied (which is technically the right thing to
do, assuming a source quench style livelock avoidance, which
doesn't currently exist)...

The problem is still that you end up doing interrupt processing
until you run out of mbufs, and then you have the problem of
not being able to transmit responses, for lack of mbufs.

In the ti driver case, the inability to get another mbuf to
replace the one that will be taken out of the ring means that
the mbuf gets reused for more data -- NOT that the data flow
in the form of DMA from the card ends up being halted until
mbufs become available.

The real problem here is that most received packets want a
response; for things like web servers, where the average
request is ~.5k and the average response is ~11k, this means
that you would need to establish use-based watermarking, to
separate the mbuf pool into transmit and receive resources;
in practice, this doesn't really work, if you are getting
your content from a separate server (e.g. an NFS server that
provides content for a web farm, etc.).


> That's a different situation from the card spamming the receive
> ring over and over again, which is what you're describing.  I've
> never seen that happen, and if it does actually happen, I'd be
> interested in seeing evidence.

Please look at what happens in the case of an allocation
failure, for any driver that does not allow shrinking the
ring of receive mbufs (the ti is one example).


> > Without hacking firmware, the best you can do is to ensure
> > that you process as much of all the traffic as you possibly
> > can, and that means avoiding livelock.
> 
> Uhh, the Tigon firmware *does* drop packets when there is no
> more room in the proper receive ring on the host side.  It
> doesn't spam things.
> 
> What gives you that idea?  You've really got some strange ideas
> about what goes on with that board.  Why would someone design
> firmware so obviously broken?

The driver does it on purpose, by not giving away the mbuf
in the receive ring, until it has an mbuf to replace it.

Maybe this should be rewritten to not mark the packet as
received, and thus allow the card to overwrite it.  There
are two problems with that approach, however.  The first is
what happens when you reach mbuf exhaustion: the only way to
clear out received mbufs is to process the data in a user space
application which never gets to run, and which, when it does
get to run, can't write a response for a request and response
protocol, so it can't free up any mbufs.  The
second is that, in the face of a denial of service attack,
the correct approach (according to Van Jacobson) is to do a
random drop, and rely on the fact that the attack packets,
being proportionally more of the queue contents, get dropped
with a higher probability... so while you _can_ do this, it
is really a bad idea, if you are trying to make your stack
robust against attacks.

The other thing that you appear to be missing is that the
most common case is that you have plenty of mbufs, and you
keep getting interrupts, replacing the mbufs in the receive
ring, and pushing the data into the ether input by giving
away the full mbufs.

The problem occurs when you are receiving at such a high rate
that you don't have any free cycles to run NETISR, and thus
you can not empty the queue from which ipintr is called with
data.

In other words, it's not really the card's fault that the OS
didn't run the stack at hardware interrupt.

> > What this means is that you get more benefit in the soft
> > interrupt coalescing than you otherwise would get, when
> > you are doing LRP.
> >
> > But, you do get *some* benefit from doing it anyway, even
> > if your ether input processing is light: so long as it is
> > non-zero, you get benefit.
> >
> > Note that LRP itself is not a panacea for livelock, since
> > it just moves the scheduling problem from the IRQ<->NETISR
> > scheduling into the NETISR<->process scheduling.  You end
> > up needing to implement weighted fair share or other code
> > to ensure that the user space process is permitted to run,
> so you end up monitoring queue depth or something else,
> and deciding not to reenable interrupts on the card until
> you hit a low water mark, indicating processing has taken
> place (see the papers by Druschel et al. and Floyd et al.).

Re: Why do soft interrupt coalescing?

2001-10-09 Thread Kenneth D. Merry

On Tue, Oct 09, 2001 at 12:28:02 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> [ ... soft interrupt coalescing ... ]
> 
> > As you say above, this is actually a good thing.  I don't see how this ties
> > into the patch to introduce some sort of interrupt coalescing into the
> > ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> > on the board to do what you want.
> 
> I have tweaked all tunables on the board, and I have not gotten
> anywhere near the increased performance.
> 
> The limit on how far you can push this is based on how much
> RAM you can have on the card, and the limits to coalescing.
> 
> Here's the reason: when you receive packets to the board, they
> get DMA'ed into the ring.  No matter how large the ring, it
> won't matter, if the ring is not being emptied asynchronously
> relative to it being filled.
> 
> In the case of a full-on receiver livelock situation, the ring
> contents will be continuously overwritten.  This is actually
> what happens when you put a ti card into a machine with a
> slower processor, and hit it hard.

eh?  The card won't write past the point that has been acked by the kernel.
If the kernel hasn't acked the packets and one of the receive rings fills
up, the card will hold off on sending packets up to the kernel.

> In the case of interrupt processing, where you jam the data up
> through ether input at interrupt time, the buffer can
> potentially overrun as well.  Admittedly, you can spend a
> huge percentage of your CPU time in interrupt processing, and
> if your CPU is fast enough, unload the queue very quickly.
> 
> But if you then look at doing this for multiple gigabit cards
> at the same time, you quickly reach the limits... and you
> spend so much of your time in interrupt processing, that you
> spend none running NETISR.

I agree that you can end up spending large portions of your time doing
interrupt processing, but I haven't seen instances of "buffer overrun", at
least not in the kernel.  The case where you'll see a "buffer overrun", at
least with the ti(4) driver, is when you have a sender that's faster than
the receiver.

So the receiver can't process the data in time and the card just drops
packets.

That's a different situation from the card spamming the receive ring over
and over again, which is what you're describing.  I've never seen that
happen, and if it does actually happen, I'd be interested in seeing
evidence.

> So you have moved your livelock up one layer.
> 
> 
> In any case, doing the coalescing on the board delays the
> packet processing until that number of packets has been
> received, or a timer expires.  The timer latency must be
> increased proportionally to the maximum number of packets
> that you coalesce into a single interrupt.
> 
> In other words, you do not interleave your I/O when you
> do this, and the bursty conditions that result in your
> coelescing window ending up full or close to full are
> the conditions under which you should be attempting the
> maximum concurrency you can possibly attain.
> 
> Basically, in any case where the load is high enough to
> trigger the hardware coalescing, the ring would need to
> be the next power of two larger to ensure that the end
> does not overwrite the beginning of the ring.
> 
> In practice, the firmware on the card does not support
> this, so what you do instead is push a couple of packets
> that may have been corrupted by DMA occurring while they
> were being processed -- in other words, you drop packets.
> 
> This is arguably "correct", in that it permits you to shed
> load, _but_ the DMAs still occur into your rings; it would
> be much better if the load were shed by the card firmware,
> based on some knowledge of ring depth instead (RED Queueing),
> since this would leave the bus clear for other traffic (e.g.
> communication with main memory to provide network content for
> the cards for, e.g., an NFS server, etc.).
> 
> Without hacking firmware, the best you can do is to ensure
> that you process as much of all the traffic as you possibly
> can, and that means avoiding livelock.

Uhh, the Tigon firmware *does* drop packets when there is no more room in
the proper receive ring on the host side.  It doesn't spam things.

What gives you that idea?  You've really got some strange ideas about what
goes on with that board.  Why would someone design firmware so obviously
broken?

> [ ... LRP ... ]
> 
> > That sounds cool, but I still don't see how this ties into the patch you
> > sent out.
> 
> OK.  LRP removes NETISR entirely.
> 
> This is the approach Van Jacobson stated he used in his
> mythical TCP/IP stack, which we may never see.
> 
> What this does is push the stack processing down to the
> interrupt time for the hardware interrupt.  This is a
> good idea, in that it avoids the livelock for the NETISR
> never running because you are too busy taking hardware
> interrupts to be able to do any stack processing.
> 
> The way this ties into the patch is that doing the stack
> processing at interrupt time increases the per-ether-input
> processing cycle overhead.

Re: Why do soft interrupt coalescing?

2001-10-09 Thread Terry Lambert

"Kenneth D. Merry" wrote:
[ ... soft interrupt coalescing ... ]

> As you say above, this is actually a good thing.  I don't see how this ties
> into the patch to introduce some sort of interrupt coalescing into the
> ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> on the board to do what you want.

I have tweaked all tunables on the board, and I have not gotten
anywhere near the same performance increase.

The limit on how far you can push this is based on how much
RAM you can have on the card, and the limits to coalescing.

Here's the reason: when you receive packets to the board, they
get DMA'ed into the ring.  No matter how large the ring, it
won't matter, if the ring is not being emptied asynchronously
relative to it being filled.

In the case of a full-on receiver livelock situation, the ring
contents will be continuously overwritten.  This is actually
what happens when you put a ti card into a machine with a
slower processor, and hit it hard.

In the case of interrupt processing, where you jam the data up
through ether input at interrupt time, the buffer can
potentially overrun as well.  Admittedly, you can spend a
huge percentage of your CPU time in interrupt processing, and
if your CPU is fast enough, unload the queue very quickly.

But if you then look at doing this for multiple gigabit cards
at the same time, you quickly reach the limits... and you
spend so much of your time in interrupt processing, that you
spend none running NETISR.

So you have moved your livelock up one layer.


In any case, doing the coalescing on the board delays the
packet processing until that number of packets has been
received, or a timer expires.  The timer latency must be
increased proportionally to the maximum number of packets
that you coalesce into a single interrupt.

In other words, you do not interleave your I/O when you
do this, and the bursty conditions that result in your
coalescing window ending up full or close to full are
the conditions under which you should be attempting the
maximum concurrency you can possibly attain.
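
Rough numbers for that proportionality, assuming full-size (1500
byte) frames at gigabit line rate:

#include <stdio.h>

int
main(void)
{
        double bits = (1500 + 38) * 8.0;  /* frame plus wire overhead */
        double pps = 1e9 / bits;          /* ~81,000 packets/sec */
        double fill_us = 32.0 / pps * 1e6;

        /* ~394 us just to accumulate a 32-packet batch at line rate;
           below line rate, the coalescing timer bounds the latency */
        printf("%.0f pps, %.0f us to fill a 32-packet window\n",
            pps, fill_us);
        return (0);
}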

Basically, in any case where the load is high enough to
trigger the hardware coalescing, the ring would need to
be the next power of two larger to ensure that the end
does not overwrite the beginning of the ring.

In practice, the firmware on the card does not support
this, so what you do instead is push a couple of packets
that may have been corrupted by DMA occurring while they
were being processed -- in other words, you drop packets.

This is arguably "correct", in that it permits you to shed
load, _but_ the DMAs still occur into your rings; it would
be much better if the load were shed by the card firmware,
based on some knowledge of ring depth instead (RED Queueing),
since this would leave the bus clear for other traffic (e.g.
communication with main memory to provide network content for
the cards for, e.g., an NFS server, etc.).

Without hacking firmware, the best you can do is to ensure
that you process as much of all the traffic as you possibly
can, and that means avoiding livelock.


[ ... LRP ... ]

> That sounds cool, but I still don't see how this ties into the patch you
> sent out.

OK.  LRP removes NETISR entirely.

This is the approach Van Jacobson stated he used in his
mythical TCP/IP stack, which we may never see.

What this does is push the stack processing down to the
interrupt time for the hardware interrupt.  This is a
good idea, in that it avoids the livelock for the NETISR
never running because you are too busy taking hardware
interrupts to be able to do any stack processing.
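
Schematically (the helpers are stand-ins, not the real kernel API),
the two receive paths differ like this:

struct mbuf;

void enqueue_ipintrq(struct mbuf *m);   /* stand-in */
void schednetisr_ip(void);              /* stand-in */
void ip_input_now(struct mbuf *m);      /* stand-in for stack entry */

/* stock path: cheap at interrupt time, but NETISR may starve */
void
rx_intr_stock(struct mbuf *m)
{
        enqueue_ipintrq(m);
        schednetisr_ip();       /* processed later, if ever */
}

/* LRP path: more cycles here, but no software interrupt to starve */
void
rx_intr_lrp(struct mbuf *m)
{
        ip_input_now(m);        /* full stack processing now */
}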

The way this ties into the patch is that doing the stack
processing at interrupt time increases the per-ether-input
processing cycle overhead.

What this means is that you get more benefit in the soft
interrupt coalescing than you otherwise would get, when
you are doing LRP.

But, you do get *some* benefit from doing it anyway, even
if your ether input processing is light: so long as it is
non-zero, you get benefit.

Note that LRP itself is not a panacea for livelock, since
it just moves the scheduling problem from the IRQ<->NETISR
scheduling into the NETISR<->process scheduling.  You end
up needing to implement weighted fair share or other code
to ensure that the user space process is permitted to run,
so you end up monitoring queue depth or something else,
and deciding not to reenable interrupts on the card until
you hit a low water mark, indicating processing has taken
place (see the papers by Druschel et al. and Floyd et al.).
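
The low water mark idea, sketched with invented names:

#define RX_LOWAT 16

struct nic {
        int backlog;            /* packets queued to user space */
};

static int  process_one(struct nic *n);         /* stand-in */
static void hw_intr_enable(struct nic *n);      /* stand-in */

static void
rx_drain(struct nic *n)
{
        while (n->backlog > 0 && process_one(n))
                n->backlog--;
        if (n->backlog < RX_LOWAT)
                hw_intr_enable(n);      /* processing has caught up */
}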


> > > It isn't terribly clear what you're doing in the patch, since it isn't a
> > > context diff.
> >
> > It's a "cvs diff" output.  You could always check out a sys
> > tree, apply it, and then cvs diff -c (or -u or whatever your
> > favorite option is) to get a diff more to your tastes.
> 
> As Peter Wemm pointed out, we can't use non-context diffs safely without
> the exact time, date and branch of the source files.  This introduces
> an additional burden for no real reason other than you neglected to
> use -c or -u with cvs diff.

Re: Why do soft interrupt coalescing?

2001-10-08 Thread Alfred Perlstein

* Mike Smith <[EMAIL PROTECTED]> [011009 00:25] wrote:
> > * Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> > > 
> > > As you say above, this is actually a good thing.  I don't see how this ties
> > > into the patch to introduce some sort of interrupt coalescing into the
> > > ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> > > on the board to do what you want.
> > 
> > No matter how hard you tweak the board, an interrupt may still
> > trigger while you process a hardware interrupt, this causes an
> > additional poll which can cause additional coalescing.
> 
> I don't think I understand what sort of crack you are smoking.
> 
> If an interrupt-worthy condition is asserted on the board, you aren't 
> going to leave your typical interrupt handler anyway; this sort of 
> coalescing already happens without any "help".

After talking to you on IRC it's become obvious that this doesn't
exactly happen without help.  It's more of a side effect of the
way _some_ of the drivers are written.

What I understand from talking to you:
  Smarter or higher-performance drivers will check whether the
  tx/rx rings have been modified by the hardware, and will consume
  those packets as well.

However, most drivers have code like this:

        if (ifp->if_flags & IFF_RUNNING) {
                /* Check RX return ring producer/consumer */
                ti_rxeof(sc);

                /* Check TX ring producer/consumer */
                ti_txeof(sc);
        }

Now if more packets come in while in ti_txeof() it seems that
you'll need to take an additional interrupt to get at them.

So Terry's code isn't wrong, but it's not as amazing as one
would initially think; it just avoids a race that can happen
when more packets arrive while you are transmitting, or when
the transmit queue drains while you are receiving.
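
The soft coalescing itself is just a bounded re-poll before leaving
the handler; a sketch (ti_work_pending() is invented -- the real test
compares saved consumer indices against the producer indices the card
updates in host memory):

#define MAX_PASSES 4

struct ti_softc;

void ti_rxeof(struct ti_softc *sc);
void ti_txeof(struct ti_softc *sc);
int  ti_work_pending(struct ti_softc *sc);      /* invented helper */

void
ti_intr_coalesced(struct ti_softc *sc)
{
        int pass;

        /* the pass bound keeps a flood from pinning us at intr level */
        for (pass = 0; pass < MAX_PASSES; pass++) {
                ti_rxeof(sc);   /* drain the RX return ring */
                ti_txeof(sc);   /* reap completed TX descriptors */
                if (!ti_work_pending(sc))
                        break;  /* rings went quiescent */
        }
}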

Now, when one is doing a lot more work in the interrupt context
(or perhaps just running on a slower host processor), Terry's
patches make a lot more sense as there's a much larger window
available for this race.

The fact that receiving is done before transmitting (at least in
the 'ti' driver) makes it an even smaller race as you're less likely
to be performing a lengthy operation inside the tx routine than if
you were doing some magic in the rx routine with incoming packets.

Or at least that's how it seems to me.

Either way, no need to get your latex in a bunch, Mike. :-)

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'




Re: Why do soft interrupt coalescing?

2001-10-08 Thread Kenneth D. Merry

On Tue, Oct 09, 2001 at 00:18:57 -0500, Alfred Perlstein wrote:
> * Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> > 
> > As you say above, this is actually a good thing.  I don't see how this ties
> > into the patch to introduce some sort of interrupt coalescing into the
> > ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> > on the board to do what you want.
> 
> No matter how hard you tweak the board, an interrupt may still
> trigger while you process a hardware interrupt, this causes an
> additional poll which can cause additional coalescing.

At least in the case of the Tigon, it won't interrupt while there is a '1'
written in mailbox 0.  (This happens in ti_intr().)
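
The shape of that handshake, with mbox_write() standing in for the
driver's actual register write:

struct ti_softc;

void mbox_write(struct ti_softc *sc, int mbox, int val);  /* stand-in */
void ti_rxeof(struct ti_softc *sc);
void ti_txeof(struct ti_softc *sc);

void
ti_intr_sketch(struct ti_softc *sc)
{
        mbox_write(sc, 0, 1);   /* masked: card won't interrupt */
        ti_rxeof(sc);
        ti_txeof(sc);
        mbox_write(sc, 0, 0);   /* unmasked: interrupts may resume */
}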

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]




Re: Why do soft interrupt coalescing?

2001-10-08 Thread Mike Smith

> * Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> > 
> > As you say above, this is actually a good thing.  I don't see how this ties
> > into the patch to introduce some sort of interrupt coalescing into the
> > ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> > on the board to do what you want.
> 
> No matter how hard you tweak the board, an interrupt may still
> trigger while you process a hardware interrupt, this causes an
> additional poll which can cause additional coalescing.

I don't think I understand what sort of crack you are smoking.

If an interrupt-worthy condition is asserted on the board, you aren't 
going to leave your typical interrupt handler anyway; this sort of 
coalescing already happens without any "help".

-- 
... every activity meets with opposition, everyone who acts has his
rivals and unfortunately opponents also.  But not because people want
to be opponents, rather because the tasks and relationships force
people to take different points of view.  [Dr. Fritz Todt]
   V I C T O R Y   N O T   V E N G E A N C E






Re: Why do soft interrupt coalescing?

2001-10-08 Thread Alfred Perlstein

* Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> 
> As you say above, this is actually a good thing.  I don't see how this ties
> into the patch to introduce some sort of interrupt coalescing into the
> ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
> on the board to do what you want.

No matter how hard you tweak the board, an interrupt may still
trigger while you process a hardware interrupt, this causes an
additional poll which can cause additional coalescing.

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'




Re: Why do soft interrupt coalescing?

2001-10-08 Thread Kenneth D. Merry

On Sun, Oct 07, 2001 at 00:56:44 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > [ I don't particularly want to get involved in this thread...but... ]
> > 
> > Can you explain why the ti(4) driver needs a coalescing patch?  It already
> has in-firmware coalescing parameters that are tuneable by the user.  It
> > also already processes all outstanding BDs in ti_rxeof() and ti_txeof().
> 
> 
> The answer to your question is that the card will continue to DMA
> into the ring buffer, even though you are in the middle of the
> interrupt service routine, and that the amount of time taken in
> ether input is long enough that you can have more packets come in
> while you are processing (this is actually a good thing).
> 
> This is even *more* likely with hardware interrupt coalescing,
> since the default setting is to coalesce 32 packets into a
> single interrupt, meaning that you have up to 32 iterations of
> ether input to call, and thus the amount of time spent processing
> them actually affords *more* time for additional packets to come
> in.

As you say above, this is actually a good thing.  I don't see how this ties
into the patch to introduce some sort of interrupt coalescing into the
ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
on the board to do what you want.

> In my own personal situation, I have also implemented Lazy
> Receiver Processing (per the research done by Rice University
> and in the "Click Router" project; no relation to "ClickArray"),
> which does all stack processing at the hardware interrupt, rather
> then queueing between the hardware interrupt and NETISR, so my
> processing path is actually longer; I get more benefit from the
> change than you would, but on a heavily loaded system, you would
> also get some benefit, if you were able to load the wire heavily
> enough.
> 
> The LRP implementation should be considered by FreeBSD as well,
> since it takes the connection rate from ~7,000/second up to
> ~23,000/second, by avoiding the NetISR.  Rice University did
> an implementation in 2.2.x, and then another one (using resource
> containers -- I recommend against this one, not only because of
> license issues with the second implementation) for 4.2; both
> sets of research were done in FreeBSD.  Unfortunately, neither
> implementation was production quality (among other things, they
> broke RFC 1323, and they have to run a complete duplicate stack
> as a different protocol family because some of their assumptions
> make it non-interoperable with other protocol stacks).

That sounds cool, but I still don't see how this ties into the patch you
sent out.

> > It isn't terribly clear what you're doing in the patch, since it isn't a
> > context diff.
> 
> It's a "cvs diff" output.  You could always check out a sys
> tree, apply it, and then cvs diff -c (or -u or whatever your
> favorite option is) to get a diff more to your tastes.

As Peter Wemm pointed out, we can't use non-context diffs safely without
the exact time, date and branch of the source files.  This introduces an
additional burden for no real reason other than you neglected to use -c or
-u with cvs diff.

> > You also never gave any details behind your statement last week:
> > "Because at the time the Tigon II was released, the jumbogram
> > wire format had not solidified.  Therefore cards built during
> > that time used different wire data for the jumbogram framing."
> > 
> > I asked, in response:
> > 
> > "Can you give more details?  Did someone decide on a different ethertype
> > than 0x8870 or something?
> > 
> > That's really the only thing that's different between a standard ethernet
> > frame and a jumbo frame.  (other than the size)"
> 
> I believe it was the implementation of the length field.  I
> would have to get more information from the person who did
> the interoperability testing for the autonegotiation (which
> failed between the Tigon II and the Intel Gigabit cards).  I
> can assure you anecdotally, however, that autonegotiation
> _did_ fail.

I would believe that autonegotiation (i.e. 10/100/1000) might fail,
especially if you're using 1000BaseT Tigon II boards.  However, I would
like more details on the failure.  It's entirely possible that it could be
fixed in the firmware, probably without too much trouble.

I find it somewhat hard to believe that Intel would ship a gigabit board
that didn't interoperate with the board that, up until recently, was
probably the predominant gigabit board out there.

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]




Re: Why do soft interrupt coalescing?

2001-10-07 Thread Terry Lambert

"Kenneth D. Merry" wrote:
> [ I don't particularly want to get involved in this thread...but... ]
> 
> Can you explain why the ti(4) driver needs a coalescing patch?  It already
has in-firmware coalescing parameters that are tuneable by the user.  It
> also already processes all outstanding BDs in ti_rxeof() and ti_txeof().


The answer to your question is that the card will continue to DMA
into the ring buffer, even though you are in the middle of the
interrupt service routine, and that the amount of time taken in
ether input is long enough that you can have more packets come in
while you are processing (this is actually a good thing).

This is even *more* likely with hardware interrupt coalescing,
since the default setting is to coalesce 32 packets into a
single interrupt, meaning that you have up to 32 iterations of
ether input to call, and thus the amount of time spent processing
them actually affords *more* time for additional packets to come
in.

In my own personal situation, I have also implemented Lazy
Receiver Processing (per the research done by Rice University
and in the "Click Router" project; no relation to "ClickArray"),
which does all stack processing at the hardware interrupt, rather
than queueing between the hardware interrupt and NETISR, so my
processing path is actually longer; I get more benefit from the
change than you would, but on a heavily loaded system, you would
also get some benefit, if you were able to load the wire heavily
enough.

The LRP implementation should be considered by FreeBSD as well,
since it takes the connection rate from ~7,000/second up to
~23,000/second, by avoiding the NetISR.  Rice University did
an implementation in 2.2.x, and then another one (using resource
containers -- I recommend against this one, not only because of
license issues with the second implementation) for 4.2; both
sets of research were done in FreeBSD.  Unfortunately, neither
implementation was production quality (among other things, they
broke RFC 1323, and they have to run a complete duplicate stack
as a different protocol family because some of their assumptions
make it non-interoperable with other protocol stacks).


> It isn't terribly clear what you're doing in the patch, since it isn't a
> context diff.

It's a "cvs diff" output.  You could always check out a sys
tree, apply it, and then cvs diff -c (or -u or whatever your
favorite option is) to get a diff more to your tastes.


> You also never gave any details behind your statement last week:
> "Because at the time the Tigon II was released, the jumbogram
> wire format had not solidified.  Therefore cards built during
> that time used different wire data for the jumbogram framing."
> 
> I asked, in response:
> 
> "Can you give more details?  Did someone decide on a different ethertype
> than 0x8870 or something?
> 
> That's really the only thing that's different between a standard ethernet
> frame and a jumbo frame.  (other than the size)"

I believe it was the implementation of the length field.  I
would have to get more information from the person who did
the interoperability testing for the autonegotiation (which
failed between the Tigon II and the Intel Gigabit cards).  I
can assure you anecdotally, however, that autonegotiation
_did_ fail.

-- Terry
