Re: Why do soft interrupt coalescing?
On Mon, Oct 15, 2001 at 11:35:51 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:

[ ... ]

> > > This is actually very bad: you want to drop packets before you
> > > insert them into the queue, rather than after they are in the
> > > queue.  This is because you want the probability of the drop
> > > (assuming the queue is not maxed out: otherwise, the probability
> > > should be 100%) to be proportional to the exponential moving
> > > average of the queue depth, after that depth exceeds a drop
> > > threshold.  In other words, you want to use RED.
> >
> > Which queue?  The packets are dropped before they get to
> > ether_input().
>
> The easy answer is "any queue", since what you are becoming
> concerned with is pool retention time: you want to throw away
> packets before a queue overflow condition makes it so you are
> getting more than you can actually process.

Ahh.

[ ...RED... ]

> > > Maybe I'm being harsh in calling it "spam'ming".  It does the
> > > wrong thing, by dropping the oldest unprocessed packets first.
> > > A FIFO drop is absolutely the wrong thing to do in an attack
> > > or overload case, when you want to shed load.  I consider the
> > > packet that is being dropped to have been "spam'med" by the
> > > card replacing it with another packet, rather than dropping
> > > the replacement packet instead.
> > >
> > > The real place for this drop is "before it gets to card memory",
> > > not "after it is in host memory"; Floyd, Jacobson, Mogul, etc.,
> > > all agree on that.
> >
> > As I mentioned above, how would you do that without some sort of
> > traffic shaper on the wire?
>
> The easiest answer is to RED queue in the card firmware.

Ahh, but then the packets are likely going to be in card memory
already.  A card with a reasonable amount of cache (e.g. the Tigon II)
and onboard firmware will probably dump the packet into an
already-set-up spot in memory.

You'd probably need a special mode of interaction between the firmware
and the packet receiving hardware to tell it when to drop packets.

> > My focus with gigabit ethernet was to get maximal throughput out
> > of a small number of connections.  Dealing with a large number of
> > connections is a different problem, and I'm sure it uncovers lots
> > of nifty bugs.
>
> 8-).  I guess that you are more interested in intermediate hops
> and site to site VPN, while I'm more interested in connection
> termination (big servers, SSL termination, and single client VPN).

Actually, the specific application I worked on for my former employer
was moving large amounts (many gigabytes) of video at high speed via
FTP between FreeBSD-based video servers.  (And between the video
servers and video editor PCs, and data backup servers.)

It actually worked fairly well, and is in production at a number of TV
stations now.  :)

> > > I'd actually prefer to avoid the other DMA; I'd also like
> > > to avoid the packet receipt order change that results from
> > > DMA'ing over the previous contents, in the case that an mbuf
> > > can't be allocated.  I'd rather just let good packets in with
> > > a low (but non-zero) statistical probability, relative to a
> > > slew of bad packets, rather than letting a lot of bad packets
> > > from a persistent attacker push my good data out with the bad.
> >
> > Easier said than done -- dumping random packets would be difficult
> > with a ring-based structure.  Probably what you'd have to do is
> > have an extra pool of mbufs lying around that would get thrown in
> > at random times when mbufs run out to allow some packets to get
> > through.
> >
> > The problem is, once you exhaust that pool, you're back to the
> > same old problem if you're completely out of mbufs.
> >
> > You could probably also start shrinking the number of buffers in
> > the ring, but as I said before, you'd need a method for the kernel
> > to notify the driver that more mbufs are available.
>
> You'd be better off shrinking the window size across all
> the connections, I think.
>
> As to difficult to do, I actually have RED queue code, which
> I adapted from the formula in a paper.  I have no problem
> giving that code out.
>
> The real issue is that the BSD queue macros involved in the
> queues really need to be modified to include an "elements on
> queue" count for the calculation of the moving average.

[ ... ]

> > > OK, I will rediff and generate context diffs; expect them to
> > > be sent in 24 hours or so from now.
> >
> > It's been longer than that...
>
> Sorry; I've been doing a lot this weekend.  I will redo them
> at work today, and resend them tonight... definitely.

> > > > Generally the ICMP response tells you how big the maximum MTU
> > > > is, so you don't have to guess.
> > >
> > > Maybe it's the ICMP response; I still haven't had a chance to
> > > hold Michael down and drag the information out of him.  8-).
> >
> > Maybe what's the ICMP response?
>
> The difference between working and not working.

Yes, with TCP, the ICMP response [ ... ]
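The "elements on queue" count Terry wants is easy to picture.  Below is
a minimal sketch with invented names (not the actual sys/queue.h
macros): a singly-linked tail queue that keeps its depth in the head,
so RED code can read the depth in O(1) when it updates the moving
average, instead of walking the list.

```c
#include <stddef.h>

/*
 * Hypothetical counted-queue head: identical in spirit to an STAILQ,
 * plus a depth counter maintained on insert/remove.  All names here
 * are illustrative, not from any actual FreeBSD patch.
 */
struct cq_entry {
	struct cq_entry  *next;
};

struct cq_head {
	struct cq_entry  *first;
	struct cq_entry **last;
	unsigned int      count;	/* elements currently on the queue */
};

static void
cq_init(struct cq_head *h)
{
	h->first = NULL;
	h->last = &h->first;
	h->count = 0;
}

static void
cq_insert_tail(struct cq_head *h, struct cq_entry *e)
{
	e->next = NULL;
	*h->last = e;
	h->last = &e->next;
	h->count++;
}

static struct cq_entry *
cq_remove_head(struct cq_head *h)
{
	struct cq_entry *e = h->first;

	if (e != NULL) {
		h->first = e->next;
		if (h->first == NULL)
			h->last = &h->first;
		h->count--;
	}
	return (e);
}
```

The point of the counter is purely the O(1) depth read; the queue
semantics are unchanged.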
Re: Why do soft interrupt coalescing?
"Kenneth D. Merry" wrote:
> Dropping packets before they get into card memory would only be
> possible with some sort of traffic shaper/dropping mechanism on the
> wire to drop things before they get to the card at all.

Actually, DEC had a congestion control mechanism that worked by
marking all packets over a certain level of congestion (this was
sometimes called the "DECbit" approach).  You can do the same thing
with any intermediate hop router, so long as it is better at moving
packets than your destination host.

It turns out that even if the intermediate hop and the host at the
destination are running the same hardware and OS, the cost is going to
be higher to do the terminal processing than it is to do the network
processing, so you are able to use the tagging to indicate to the
terminal hop which flows to drop packets out of before processing.

Cisco routers can do this (using the CMU firmware) on a per flow
basis, leaving policy up to the end node.  Very neat.

[ ... per connection overhead, and overcommit ... ]

> You could always just put 4G of RAM in the machine, since memory is
> so cheap now.  :)
>
> At some point you'll hit a limit in the number of connections the
> processor can actually handle.

In practice, particularly for HTTP or FTP flows, you can halve the
amount of memory expected to be used.  This is because the vast
majority of the data is generally pushed in only one direction.

For HTTP 1.1 persistent connections, you can, for the most part, also
assume that the connections are bursty -- that is, that there is a
human attached to the other end, who will spend some time examining
the content before making another request (you can assume the same
thing for 1.0, but that doesn't count against persistent connection
count, unless you also include time spent in TIME_WAIT).

So overcommit turns out to be O.K. -- which is what I was trying to
say in a back-handed way, in the last post.  If you include window
control (i.e. you care about overall throughput, and not about
individual connections), then you can safely service 1,000,000
connections with 4G on a FreeBSD box.

> > This is actually very bad: you want to drop packets before you
> > insert them into the queue, rather than after they are in the
> > queue.  This is because you want the probability of the drop
> > (assuming the queue is not maxed out: otherwise, the probability
> > should be 100%) to be proportional to the exponential moving
> > average of the queue depth, after that depth exceeds a drop
> > threshold.  In other words, you want to use RED.
>
> Which queue?  The packets are dropped before they get to
> ether_input().

The easy answer is "any queue", since what you are becoming concerned
with is pool retention time: you want to throw away packets before a
queue overflow condition makes it so you are getting more than you can
actually process.

> Dropping random packets would be difficult.

The "R" in "RED" is "Random" for "Random Early Detection" (or "Random
Early Drop", for a minority of the literature), true.  But the
randomness involved is whether you drop vs. insert a given packet, not
whether you drop a random packet from the queue contents.

Really dropping random queue elements would be incredibly bad.  The
problem is that, during an attack, the number of packets you get is
proportionally huge, compared to the non-attack packets (the ones you
want to let through).

A RED approach will prevent new packets being enqueued: it protects
the host system's ability to continue degraded processing, by making
the link appear "lossy" -- the closer to queue full, the more lossy
the link.

If you were to drop random packets already in the queue, then the
proportional probability of dumping good packets is equal to the queue
depth times the number of bad packets divided by the number of total
packets.  In other words, a continuous attack will almost certainly
push all good packets out of the queue before they reach the head.

Dropping packets prior to insertion maintains the ratio of bad and
good packets, so it doesn't inflate the danger to the good packets by
the relative queue depth: thus dropping before insertion is a
significantly better strategy than dropping after insertion, for any
queue depth over 1.

> > Maybe I'm being harsh in calling it "spam'ming".  It does the
> > wrong thing, by dropping the oldest unprocessed packets first.
> > A FIFO drop is absolutely the wrong thing to do in an attack
> > or overload case, when you want to shed load.  I consider the
> > packet that is being dropped to have been "spam'med" by the
> > card replacing it with another packet, rather than dropping
> > the replacement packet instead.
> >
> > The real place for this drop is "before it gets to card memory",
> > not "after it is in host memory"; Floyd, Jacobson, Mogul, etc.,
> > all agree on that.
>
> As I mentioned above, how would you do that without some sort of
> traffic shaper on the wire?

The easiest answer is to RED queue in the card firmware.
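The drop-before-insertion decision being described fits in a few lines
of C.  This is a simplified sketch of the RED decision from Floyd and
Jacobson's scheme; the thresholds, the fixed-point EWMA weight, and
every name here are illustrative, not taken from any real driver or
firmware.

```c
#include <stdlib.h>

#define RED_WSHIFT	9	/* EWMA weight w = 1/512 */
#define RED_MINTH	32	/* start dropping above this average depth */
#define RED_MAXTH	128	/* drop everything above this average depth */
#define RED_MAXP_INV	10	/* max drop probability 1/10 at RED_MAXTH */

static long red_avg;		/* average queue depth, scaled << RED_WSHIFT */

/*
 * Called once per arriving packet with the instantaneous queue depth;
 * returns nonzero if the packet should be dropped BEFORE it is
 * inserted into the queue.
 */
static int
red_should_drop(int qlen)
{
	long avg;

	/* avg += w * (qlen - avg), in fixed point */
	red_avg += (((long)qlen << RED_WSHIFT) - red_avg) >> RED_WSHIFT;
	avg = red_avg >> RED_WSHIFT;

	if (avg < RED_MINTH)
		return (0);
	if (avg >= RED_MAXTH)
		return (1);

	/* drop probability rises linearly from 0 to 1/RED_MAXP_INV */
	return (rand() % ((RED_MAXTH - RED_MINTH) * RED_MAXP_INV) <
	    (int)(avg - RED_MINTH));
}
```

Because the moving average, not the instantaneous depth, drives the
decision, short bursts pass untouched while sustained overload makes
the link increasingly lossy -- exactly the behavior argued for above.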
Re: Why do soft interrupt coalescing?
On Thu, Oct 11, 2001 at 01:02:09 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > If the receive ring for that packet size is full, it will hold off
> > on DMAs.  If all receive rings are full, there's no reason to send
> > more interrupts.
>
> I think that this does nothing, in the FreeBSD case, since the
> data from the card will generally be drained much faster than
> it accrues, into the input queue.  Whether it gets processed
> out of there before you run out of mbufs is another matter.
>
> [ ... ]
>
> > Anyway, if all three rings fill up, then yes, there won't be a
> > reason to send receive interrupts.
>
> I think this can't really happen, since interrupt processing
> has the highest priority, compared to stack processing or
> application level processing.  8-(.

Yep, it doesn't happen very often in the default case.

> > > OK, assuming you meant that the copies would stall, and the
> > > data not be copied (which is technically the right thing to
> > > do, assuming a source quench style livelock avoidance, which
> > > doesn't currently exist)...
> >
> > The data isn't copied, it's DMAed from the card to host memory.
> > The card will save incoming packets to a point, but once it runs
> > out of memory to store them it starts dropping packets altogether.
>
> I think that the DMA will not be stalled, at least as the driver
> currently exists; you and I agreed on that already (see below).
> My concern in this case is that, if the card is using the bus to
> copy packets from card memory to the receive ring, then the bus
> isn't available for other work, which is bad.  It's better to
> drop the packets before putting them in card memory (FIFO drop
> fails to avoid the case where a continuous attack pushes all
> good packets out).

Dropping packets before they get into card memory would only be
possible with some sort of traffic shaper/dropping mechanism on the
wire to drop things before they get to the card at all.

> > > The problem is still that you end up doing interrupt processing
> > > until you run out of mbufs, and then you have the problem of
> > > not being able to transmit responses, for lack of mbufs.
> >
> > In theory you would have configured your system with enough mbufs
> > to handle the situation, and the slowness of the system would
> > cause the windows on the sender to fill up, so they'll stop
> > sending data until the receiver starts responding again.  That's
> > the whole purpose of backoff and slow start -- to find a happy
> > medium for the transmitter and receiver so that data flows at a
> > constant rate.
>
> In practice, mbuf memory is just as overcommitted as all other
> memory, and given a connection count target, you are talking a
> full transmit and full receive window worth of data at 16k a
> pop -- 32k per connection.
>
> Even a modest maximum connection count of ~30,000 connections --
> something even an unpatched 4.3 FreeBSD could handle -- means
> that you need 1G of RAM for the connections alone, if you disallow
> overcommit.  In practice, that would mean ~20,000 connections,
> when you count page table entries, open file table entries, vnodes,
> inpcb's, tcpcb's, etc..  And that's a generous estimate, which
> assumes that you tweak your kernel properly.

You could always just put 4G of RAM in the machine, since memory is so
cheap now.  :)

At some point you'll hit a limit in the number of connections the
processor can actually handle.

> One approach to this is to control the window sizes based on
> the amount of free reserve you have available, but this will
> actually damage overall throughput, particularly on links
> with a higher latency.

Yep.

> > > In the ti driver case, the inability to get another mbuf to
> > > replace the one that will be taken out of the ring means that
> > > the mbuf gets reused for more data -- NOT that the data flow
> > > in the form of DMA from the card ends up being halted until
> > > mbufs become available.
> >
> > True.
>
> This is actually very bad: you want to drop packets before you
> insert them into the queue, rather than after they are in the
> queue.  This is because you want the probability of the drop
> (assuming the queue is not maxed out: otherwise, the probability
> should be 100%) to be proportional to the exponential moving
> average of the queue depth, after that depth exceeds a drop
> threshold.  In other words, you want to use RED.

Which queue?  The packets are dropped before they get to
ether_input().

Dropping random packets would be difficult.

> > > Please look at what happens in the case of an allocation
> > > failure, for any driver that does not allow shrinking the
> > > ring of receive mbufs (the ti is one example).
> >
> > It doesn't spam things, which is what you were suggesting before,
> > but as you pointed out, it will effectively drop packets if it
> > can't get new mbufs.
>
> Maybe I'm being harsh in calling it "spam'ming".  It does the
> wrong thing, by dropping the oldest unprocessed packets first.
> A FIFO drop is absolutely the wrong thing to do in an attack
> or overload case, when you want to shed load.  I consider the
> packet that is being dropped to have been "spam'med" by the
> card replacing it with another packet, rather than dropping
> the replacement packet instead.
>
> The real place for this drop is "before it gets to card memory",
> not "after it is in host memory"; Floyd, Jacobson, Mogul, etc.,
> all agree on that.
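The "reuse the mbuf, drop the packet" behavior under discussion is
worth seeing in miniature.  Below is a schematic of the refill policy,
with malloc() standing in for mbuf allocation and all names invented
for illustration -- this is not the actual ti(4) source.

```c
#include <stdlib.h>

/* One receive-ring slot; buf stands in for an mbuf cluster. */
struct rx_slot {
	void	*buf;
};

static int rx_drops;	/* packets dropped for lack of a replacement */

/*
 * Called when the card has filled slot->buf.  If a replacement
 * buffer cannot be allocated, the old buffer stays in the slot for
 * the next DMA, i.e. the packet just received is dropped.  This is
 * the FIFO-ish drop the thread argues against.
 */
static void *
rx_process_slot(struct rx_slot *slot, size_t buflen)
{
	void *full = slot->buf;
	void *fresh = malloc(buflen);	/* stands in for MGETHDR/MCLGET */

	if (fresh == NULL) {
		rx_drops++;
		return (NULL);		/* packet dropped, slot reused */
	}
	slot->buf = fresh;		/* give the ring a fresh buffer */
	return (full);			/* hand the full buffer up the stack */
}
```

The key property is that the DMA never stalls: the card always has a
buffer in the slot, so under allocation failure the drop falls on the
newest data in that slot rather than pausing the flow.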
Re: Why do soft interrupt coalescing?
"Kenneth D. Merry" wrote:
> If the receive ring for that packet size is full, it will hold off
> on DMAs.  If all receive rings are full, there's no reason to send
> more interrupts.

I think that this does nothing, in the FreeBSD case, since the data
from the card will generally be drained much faster than it accrues,
into the input queue.  Whether it gets processed out of there before
you run out of mbufs is another matter.

[ ... ]

> Anyway, if all three rings fill up, then yes, there won't be a
> reason to send receive interrupts.

I think this can't really happen, since interrupt processing has the
highest priority, compared to stack processing or application level
processing.  8-(.

> > OK, assuming you meant that the copies would stall, and the
> > data not be copied (which is technically the right thing to
> > do, assuming a source quench style livelock avoidance, which
> > doesn't currently exist)...
>
> The data isn't copied, it's DMAed from the card to host memory.  The
> card will save incoming packets to a point, but once it runs out of
> memory to store them it starts dropping packets altogether.

I think that the DMA will not be stalled, at least as the driver
currently exists; you and I agreed on that already (see below).  My
concern in this case is that, if the card is using the bus to copy
packets from card memory to the receive ring, then the bus isn't
available for other work, which is bad.  It's better to drop the
packets before putting them in card memory (FIFO drop fails to avoid
the case where a continuous attack pushes all good packets out).

> > The problem is still that you end up doing interrupt processing
> > until you run out of mbufs, and then you have the problem of
> > not being able to transmit responses, for lack of mbufs.
>
> In theory you would have configured your system with enough mbufs
> to handle the situation, and the slowness of the system would cause
> the windows on the sender to fill up, so they'll stop sending data
> until the receiver starts responding again.  That's the whole
> purpose of backoff and slow start -- to find a happy medium for the
> transmitter and receiver so that data flows at a constant rate.

In practice, mbuf memory is just as overcommitted as all other memory,
and given a connection count target, you are talking a full transmit
and full receive window worth of data at 16k a pop -- 32k per
connection.

Even a modest maximum connection count of ~30,000 connections --
something even an unpatched 4.3 FreeBSD could handle -- means that you
need 1G of RAM for the connections alone, if you disallow overcommit.
In practice, that would mean ~20,000 connections, when you count page
table entries, open file table entries, vnodes, inpcb's, tcpcb's,
etc..  And that's a generous estimate, which assumes that you tweak
your kernel properly.

One approach to this is to control the window sizes based on the
amount of free reserve you have available, but this will actually
damage overall throughput, particularly on links with a higher
latency.

> > In the ti driver case, the inability to get another mbuf to
> > replace the one that will be taken out of the ring means that
> > the mbuf gets reused for more data -- NOT that the data flow
> > in the form of DMA from the card ends up being halted until
> > mbufs become available.
>
> True.

This is actually very bad: you want to drop packets before you insert
them into the queue, rather than after they are in the queue.  This is
because you want the probability of the drop (assuming the queue is
not maxed out: otherwise, the probability should be 100%) to be
proportional to the exponential moving average of the queue depth,
after that depth exceeds a drop threshold.  In other words, you want
to use RED.

> > Please look at what happens in the case of an allocation
> > failure, for any driver that does not allow shrinking the
> > ring of receive mbufs (the ti is one example).
>
> It doesn't spam things, which is what you were suggesting before,
> but as you pointed out, it will effectively drop packets if it can't
> get new mbufs.

Maybe I'm being harsh in calling it "spam'ming".  It does the wrong
thing, by dropping the oldest unprocessed packets first.  A FIFO drop
is absolutely the wrong thing to do in an attack or overload case,
when you want to shed load.  I consider the packet that is being
dropped to have been "spam'med" by the card replacing it with another
packet, rather than dropping the replacement packet instead.

The real place for this drop is "before it gets to card memory", not
"after it is in host memory"; Floyd, Jacobson, Mogul, etc., all agree
on that.

> Yes, it could shrink the pool, by just not replacing those mbufs in
> the ring (and therefore not notifying the card that that slot is
> available again), but then it would likely need some mechanism to
> allow it to be notified that another buffer is available for it, so
> it can then allocate receive buffers again.
>
> In practice I haven't found the number of mbufs in the system to be
> a problem that would really make this a big issue.  I generally
> configure the number of mbufs to be high enough that it isn't a
> problem in the first place.
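The 1G figure above is straightforward arithmetic: a full 16k window
in each direction is 32k per connection.  A throwaway helper to make
the numbers concrete (the function name is invented):

```c
/*
 * Per-connection buffering at full windows: one transmit plus one
 * receive window.  30,000 connections * 32 KB comes to roughly 1 GB,
 * matching the estimate in the thread.
 */
static unsigned long long
conn_window_bytes(unsigned long conns, unsigned long window)
{
	return ((unsigned long long)conns * 2ULL * window);
}
```

conn_window_bytes(30000, 16UL * 1024) gives 983,040,000 bytes, just
under 1 GB.  Halving it for mostly one-directional HTTP/FTP flows, as
suggested earlier in the thread, brings the requirement back to about
half a gigabyte.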
Re: Why do soft interrupt coalescing?
On Wed, Oct 10, 2001 at 01:59:48 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > eh?  The card won't write past the point that has been acked by
> > the kernel.  If the kernel hasn't acked the packets and one of the
> > receive rings fills up, the card will hold off on sending packets
> > up to the kernel.
>
> Uh, eh?
>
> You mean the card will hold off on DMA and interrupts?  This
> has not been my experience.  Is this with firmware other than
> the default and/or the 4.3-RELEASE FreeBSD driver?

If the receive ring for that packet size is full, it will hold off on
DMAs.  If all receive rings are full, there's no reason to send more
interrupts.

Keep in mind that there are three receive rings by default on the
Tigon boards -- mini, standard and jumbo.  The size of the buffers in
each ring is configurable, but basically all packets smaller than a
certain size will get routed into the mini ring.  All packets larger
than a certain size will get routed into the jumbo ring.  All packets
in between will get routed into the standard ring.

If there isn't enough space in the mini or jumbo rings for a packet,
it'll get routed into the standard ring if there is space there.  (In
the case of a jumbo packet, it may take up multiple buffers on the
ring.)

Anyway, if all three rings fill up, then yes, there won't be a reason
to send receive interrupts.

> > I agree that you can end up spending large portions of your time
> > doing interrupt processing, but I haven't seen instances of
> > "buffer overrun", at least not in the kernel.  The case where
> > you'll see a "buffer overrun", at least with the ti(4) driver, is
> > when you have a sender that's faster than the receiver.
> >
> > So the receiver can't process the data in time and the card just
> > drops packets.
>
> OK, assuming you meant that the copies would stall, and the
> data not be copied (which is technically the right thing to
> do, assuming a source quench style livelock avoidance, which
> doesn't currently exist)...

The data isn't copied, it's DMAed from the card to host memory.  The
card will save incoming packets to a point, but once it runs out of
memory to store them it starts dropping packets altogether.

> The problem is still that you end up doing interrupt processing
> until you run out of mbufs, and then you have the problem of
> not being able to transmit responses, for lack of mbufs.

In theory you would have configured your system with enough mbufs to
handle the situation, and the slowness of the system would cause the
windows on the sender to fill up, so they'll stop sending data until
the receiver starts responding again.  That's the whole purpose of
backoff and slow start -- to find a happy medium for the transmitter
and receiver so that data flows at a constant rate.

> In the ti driver case, the inability to get another mbuf to
> replace the one that will be taken out of the ring means that
> the mbuf gets reused for more data -- NOT that the data flow
> in the form of DMA from the card ends up being halted until
> mbufs become available.

True.

> The real problem here is that most received packets want a
> response; for things like web servers, where the average
> request is ~.5k and the average response is ~11k, this means
> that you would need to establish use-based watermarking, to
> separate the mbuf pool into transmit and receive resources;
> in practice, this doesn't really work, if you are getting
> your content from a separate server (e.g. an NFS server that
> provides content for a web farm, etc.).
>
> > That's a different situation from the card spamming the receive
> > ring over and over again, which is what you're describing.  I've
> > never seen that happen, and if it does actually happen, I'd be
> > interested in seeing evidence.
>
> Please look at what happens in the case of an allocation
> failure, for any driver that does not allow shrinking the
> ring of receive mbufs (the ti is one example).

It doesn't spam things, which is what you were suggesting before, but
as you pointed out, it will effectively drop packets if it can't get
new mbufs.

Yes, it could shrink the pool, by just not replacing those mbufs in
the ring (and therefore not notifying the card that that slot is
available again), but then it would likely need some mechanism to
allow it to be notified that another buffer is available for it, so it
can then allocate receive buffers again.

In practice I haven't found the number of mbufs in the system to be a
problem that would really make this a big issue.  I generally
configure the number of mbufs to be high enough that it isn't a
problem in the first place.

> > > Without hacking firmware, the best you can do is to ensure
> > > that you process as much of all the traffic as you possibly
> > > can, and that means avoiding livelock.
> >
> > Uhh, the Tigon firmware *does* drop packets when there is no
> > more room in the proper receive ring on the host side.  It
> > doesn't spam things.
> >
> > What gives you that idea?  You've really got some strange ideas
> > about what goes on with that board.  Why would someone design
> > firmware so obviously broken?

[ ... ]
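The shrink-and-replenish idea described in words above might look like
the sketch below.  The deficit counter and the mbuf-available callback
are invented for illustration -- no such hook exists in the stack
being discussed.

```c
#include <stdlib.h>

#define RING_SIZE 64

static void	*ring[RING_SIZE];	/* NULL = slot has no buffer */
static int	 ring_deficit;		/* slots currently empty */

/*
 * Refill empty slots while buffers can be allocated.  On failure the
 * ring simply stays shrunk (the card is never told the slot is
 * available), and we try again on the next notification.
 */
static void
ring_refill(size_t buflen)
{
	int i;

	for (i = 0; i < RING_SIZE && ring_deficit > 0; i++) {
		if (ring[i] == NULL) {
			void *buf = malloc(buflen);

			if (buf == NULL)
				return;		/* still out of buffers */
			ring[i] = buf;		/* would also notify the card */
			ring_deficit--;
		}
	}
}

/* Hypothetical hook: called by the allocator when buffers free up. */
static void
mbuf_available_callback(void)
{
	if (ring_deficit > 0)
		ring_refill(2048);
}
```

This is the mechanism Ken says would be needed: the driver shrinks
silently on failure, and grows back only when something tells it
buffers exist again.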
Re: Why do soft interrupt coalescing?
"Kenneth D. Merry" wrote:
> eh?  The card won't write past the point that has been acked by the
> kernel.  If the kernel hasn't acked the packets and one of the
> receive rings fills up, the card will hold off on sending packets up
> to the kernel.

Uh, eh?

You mean the card will hold off on DMA and interrupts?  This has not
been my experience.  Is this with firmware other than the default
and/or the 4.3-RELEASE FreeBSD driver?

> I agree that you can end up spending large portions of your time
> doing interrupt processing, but I haven't seen instances of "buffer
> overrun", at least not in the kernel.  The case where you'll see a
> "buffer overrun", at least with the ti(4) driver, is when you have a
> sender that's faster than the receiver.
>
> So the receiver can't process the data in time and the card just
> drops packets.

OK, assuming you meant that the copies would stall, and the data not
be copied (which is technically the right thing to do, assuming a
source quench style livelock avoidance, which doesn't currently
exist)...

The problem is still that you end up doing interrupt processing until
you run out of mbufs, and then you have the problem of not being able
to transmit responses, for lack of mbufs.

In the ti driver case, the inability to get another mbuf to replace
the one that will be taken out of the ring means that the mbuf gets
reused for more data -- NOT that the data flow in the form of DMA from
the card ends up being halted until mbufs become available.

The real problem here is that most received packets want a response;
for things like web servers, where the average request is ~.5k and the
average response is ~11k, this means that you would need to establish
use-based watermarking, to separate the mbuf pool into transmit and
receive resources; in practice, this doesn't really work, if you are
getting your content from a separate server (e.g. an NFS server that
provides content for a web farm, etc.).

> That's a different situation from the card spamming the receive
> ring over and over again, which is what you're describing.  I've
> never seen that happen, and if it does actually happen, I'd be
> interested in seeing evidence.

Please look at what happens in the case of an allocation failure, for
any driver that does not allow shrinking the ring of receive mbufs
(the ti is one example).

> > Without hacking firmware, the best you can do is to ensure
> > that you process as much of all the traffic as you possibly
> > can, and that means avoiding livelock.
>
> Uhh, the Tigon firmware *does* drop packets when there is no
> more room in the proper receive ring on the host side.  It
> doesn't spam things.
>
> What gives you that idea?  You've really got some strange ideas
> about what goes on with that board.  Why would someone design
> firmware so obviously broken?

The driver does it on purpose, by not giving away the mbuf in the
receive ring, until it has an mbuf to replace it.

Maybe this should be rewritten to not mark the packet as received, and
thus allow the card to overwrite it.  There are two problems with that
approach, however.  The first is what happens when you reach mbuf
exhaustion, and the only way you can clear out received mbufs is to
process the data in a user space application which never gets to run,
and when it does get to run, can't write a response for a request and
response protocol, such that it can't free up any mbufs?

The second is that, in the face of a denial of service attack, the
correct approach (according to Van Jacobson) is to do a random drop,
and rely on the fact that the attack packets, being proportionally
more of the queue contents, get dropped with a higher probability...
so while you _can_ do this, it is really a bad idea, if you are trying
to make your stack robust against attacks.

The other thing that you appear to be missing is that the most common
case is that you have plenty of mbufs, and you keep getting
interrupts, replacing the mbufs in the receive ring, and pushing the
data into the ether input by giving away the full mbufs.

The problem occurs when you are receiving at such a high rate that you
don't have any free cycles to run NETISR, and thus you can not empty
the queue from which ipintr is called with data.

In other words, it's not really the card's fault that the OS didn't
run the stack at hardware interrupt.

> > What this means is that you get more benefit in the soft
> > interrupt coalescing than you otherwise would get, when
> > you are doing LRP.
> >
> > But, you do get *some* benefit from doing it anyway, even
> > if your ether input processing is light: so long as it is
> > non-zero, you get benefit.
> >
> > Note that LRP itself is not a panacea for livelock, since
> > it just moves the scheduling problem from the IRQ<->NETISR
> > scheduling into the NETISR<->process scheduling.  You end
> > up needing to implement weighted fair share or other code
> > to ensure that the user space process is permitted to run,
> > so you [ ... ]
Re: Why do soft interrupt coalescing?
On Tue, Oct 09, 2001 at 12:28:02 -0700, Terry Lambert wrote: > "Kenneth D. Merry" wrote: > [ ... soft interrupt coelescing ... ] > > > As you say above, this is actually a good thing. I don't see how this ties > > into the patch to introduce some sort of interrupt coalescing into the > > ti(4) driver. IMO, you should be able to tweak the coalescing parameters > > on the board to do what you want. > > I have tweaked all tunables on the board, and I have not gotten > anywhere near the increased performance. > > The limit on how far you can push this is based on how much > RAM you can have on the card, and the limits to coelescing. > > Here's the reason: when you receive packets to the board, they > get DMA'ed into the ring. No matter how large the ring, it > won't matter, if the ring is not being emptied asynchronously > relative to it being filled. > > In the case of a full-on receiver livelock situation, the ring > contents will be continuously overwritten. This is actually > what happens when you put a ti card into a machine with a > slower processor, and hit it hard. eh? The card won't write past the point that has been acked by the kernel. If the kernel hasn't acked the packets and one of the receive rings fills up, the card will hold off on sending packets up to the kernel. > In the case of interrupt processing, where you jam the data up > through ether input at interrupt time, the buffer will be able > to potentially overrun, as well. Admittedly, you can spend a > huge percentage of your CPU time in interrupt processing, and > if your CPU is fast enough, unload the queue very quickly. > > But if you then look at doing this for multiple gigabit cards > at the same time, you quickly reach the limits... and you > spend so much of your time in interrupt processing, that you > spend none running NETISR. 
I agree that you can end up spending large portions of your time doing interrupt processing, but I haven't seen instances of "buffer overrun", at least not in the kernel. The case where you'll see a "buffer overrun", at least with the ti(4) driver, is when you have a sender that's faster than the receiver. So the receiver can't process the data in time and the card just drops packets. That's a different situation from the card spamming the receive ring over and over again, which is what you're describing. I've never seen that happen, and if it does actually happen, I'd be interested in seeing evidence. > So you have moved your livelock up one layer. > > > In any case, doing the coelescing on the board delays the > packet processing until that number of packets has been > received, or a timer expires. The timer latency must be > increased proportionally to the maximum number of packets > that you coelesce into a single interrupt. > > In other words, you do not interleave your I/O when you > do this, and the bursty conditions that result in your > coelescing window ending up full or close to full are > the conditions under which you should be attempting the > maximum concurrency you can possibly attain. > > Basically, in any case where the load is high enough to > trigger the hardware coelescing, the ring would need to > be the next power of two larger to ensure that the end > does not overwrite the beginning of the ring. > > In practice, the firmware on the card does not support > this, so what you do instead is push a couple of packets > that may have been corrupted through DMA occurring during > the fact -- in other words, you drop packets. > > This is arguably "correct", in that it permits you to shed > load, _but_ the DMAs still occur into your rings; it would > be much better if the load were shed by the card firmware, > based on some knowledge of ring depth instead (RED Queueing), > since this would leave the bus clear for other traffinc (e.g. 
> communication with main memory to provide network content for the
> cards for, e.g., an NFS server, etc.).
>
> Without hacking firmware, the best you can do is to ensure that you
> process as much of all the traffic as you possibly can, and that
> means avoiding livelock.

Uhh, the Tigon firmware *does* drop packets when there is no more room
in the proper receive ring on the host side.  It doesn't spam things.
What gives you that idea?  You've really got some strange ideas about
what goes on with that board.  Why would someone design firmware that
was so obviously broken?

> [ ... LRP ... ]
>
> > That sounds cool, but I still don't see how this ties into the
> > patch you sent out.
>
> OK.  LRP removes NETISR entirely.
>
> This is the approach Van Jacobson stated he used in his mythical
> TCP/IP stack, which we may never see.
>
> What this does is push the stack processing down to the interrupt
> time for the hardware interrupt.  This is a good idea, in that it
> avoids the livelock of the NETISR never running because you are too
> busy taking hardware interrupts to be able to do any stack
> processing.
>
> The way this ties into the patch is that doing the stack processing
> at interrupt time increases the per-ether-input processing cycle
> overhead.
Re: Why do soft interrupt coelescing?
"Kenneth D. Merry" wrote: [ ... soft interrupt coelescing ... ] > As you say above, this is actually a good thing. I don't see how this ties > into the patch to introduce some sort of interrupt coalescing into the > ti(4) driver. IMO, you should be able to tweak the coalescing parameters > on the board to do what you want. I have tweaked all tunables on the board, and I have not gotten anywhere near the increased performance. The limit on how far you can push this is based on how much RAM you can have on the card, and the limits to coelescing. Here's the reason: when you receive packets to the board, they get DMA'ed into the ring. No matter how large the ring, it won't matter, if the ring is not being emptied asynchronously relative to it being filled. In the case of a full-on receiver livelock situation, the ring contents will be continuously overwritten. This is actually what happens when you put a ti card into a machine with a slower processor, and hit it hard. In the case of interrupt processing, where you jam the data up through ether input at interrupt time, the buffer will be able to potentially overrun, as well. Admittedly, you can spend a huge percentage of your CPU time in interrupt processing, and if your CPU is fast enough, unload the queue very quickly. But if you then look at doing this for multiple gigabit cards at the same time, you quickly reach the limits... and you spend so much of your time in interrupt processing, that you spend none running NETISR. So you have moved your livelock up one layer. In any case, doing the coelescing on the board delays the packet processing until that number of packets has been received, or a timer expires. The timer latency must be increased proportionally to the maximum number of packets that you coelesce into a single interrupt. 
In other words, you do not interleave your I/O when you do this, and
the bursty conditions that result in your coalescing window ending up
full or close to full are the conditions under which you should be
attempting the maximum concurrency you can possibly attain.

Basically, in any case where the load is high enough to trigger the
hardware coalescing, the ring would need to be the next power of two
larger to ensure that the end does not overwrite the beginning of the
ring.

In practice, the firmware on the card does not support this, so what
you do instead is push up a couple of packets that may have been
corrupted by DMA occurring during processing -- in other words, you
drop packets.

This is arguably "correct", in that it permits you to shed load, _but_
the DMAs still occur into your rings; it would be much better if the
load were shed by the card firmware, based on some knowledge of ring
depth instead (RED queueing), since this would leave the bus clear for
other traffic (e.g. communication with main memory to provide network
content for the cards for, e.g., an NFS server, etc.).

Without hacking firmware, the best you can do is to ensure that you
process as much of all the traffic as you possibly can, and that means
avoiding livelock.

[ ... LRP ... ]

> That sounds cool, but I still don't see how this ties into the patch
> you sent out.

OK.  LRP removes NETISR entirely.

This is the approach Van Jacobson stated he used in his mythical
TCP/IP stack, which we may never see.

What this does is push the stack processing down to the interrupt time
for the hardware interrupt.  This is a good idea, in that it avoids
the livelock of the NETISR never running because you are too busy
taking hardware interrupts to be able to do any stack processing.

The way this ties into the patch is that doing the stack processing at
interrupt time increases the per-ether-input processing cycle
overhead.
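[Editor's note: for reference, the RED discipline Terry points at keeps an
exponential moving average of queue (or ring) depth and drops with a
probability that ramps between two thresholds.  A minimal sketch of the
textbook algorithm follows; the parameter values are made up, and this is
not firmware anyone has shipped for the Tigon.]

```c
#include <assert.h>

/* Illustrative RED parameters; the values are arbitrary. */
#define RED_MIN_TH  32.0    /* below this average depth, never drop    */
#define RED_MAX_TH  96.0    /* at or above this, always drop           */
#define RED_MAX_P    0.10   /* drop probability as avg nears max_th    */
#define RED_WEIGHT   0.02   /* EWMA weight for the instantaneous depth */

/* Exponential moving average of the queue depth. */
static double red_avg(double avg, int depth)
{
    return (1.0 - RED_WEIGHT) * avg + RED_WEIGHT * (double)depth;
}

/* Drop probability as a function of the averaged depth: zero below
 * min_th, ramping linearly toward RED_MAX_P, then certain drop. */
static double red_drop_prob(double avg)
{
    if (avg < RED_MIN_TH)
        return 0.0;
    if (avg >= RED_MAX_TH)
        return 1.0;
    return RED_MAX_P * (avg - RED_MIN_TH) / (RED_MAX_TH - RED_MIN_TH);
}
```

Because the decision uses the averaged depth, short bursts are tolerated
while sustained overload sheds load early -- the "drop before it gets to
card memory" behavior being argued for.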
What this means is that you get more benefit from the soft interrupt
coalescing than you otherwise would, when you are doing LRP.

But you do get *some* benefit from doing it anyway, even if your ether
input processing is light: so long as it is non-zero, you get benefit.

Note that LRP itself is not a panacea for livelock, since it just
moves the scheduling problem from the IRQ<->NETISR scheduling into the
NETISR<->process scheduling.  You end up needing to implement weighted
fair share or other code to ensure that the user space process is
permitted to run, so you end up monitoring queue depth or something
else, and deciding not to re-enable interrupts on the card until you
hit a low water mark, indicating processing has taken place (see the
papers by Druschel et al. and Floyd et al.).

> > > It isn't terribly clear what you're doing in the patch, since it
> > > isn't a context diff.
> >
> > It's a "cvs diff" output.  You could always check out a sys tree,
> > apply it, and then cvs diff -c (or -u or whatever your favorite
> > option is) to get a diff more to your tastes.
>
> As Peter Wemm pointed out, we can't use non-context diffs safely
> without the exact time, date and branch of the source files.  This
> introduces an additional burden for no real reason other than you
> neglected to use -c or -u with cvs diff.
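[Editor's note: the low-water-mark idea in the paragraph above can be
sketched as a tiny state machine: mask card interrupts when the protocol
queue passes a high-water mark, and unmask only once processing has
drained it to the low-water mark.  All names and thresholds below are
invented for the example.]

```c
#include <assert.h>
#include <stdbool.h>

#define Q_HIGH_WATER 256   /* mask interrupts at this queue depth   */
#define Q_LOW_WATER   64   /* unmask only once drained back to this */

struct mitigation {
    int  depth;            /* packets queued for the stack/process  */
    bool intr_enabled;
};

/* Interrupt path: count the packet in, and mask the card once the
 * queue backs up past the high-water mark. */
static void on_enqueue(struct mitigation *m)
{
    m->depth++;
    if (m->intr_enabled && m->depth >= Q_HIGH_WATER)
        m->intr_enabled = false;
}

/* Drain path (NETISR or the user process): only when the backlog
 * falls to the low-water mark does the card get to interrupt again. */
static void on_dequeue(struct mitigation *m)
{
    if (m->depth > 0)
        m->depth--;
    if (!m->intr_enabled && m->depth <= Q_LOW_WATER)
        m->intr_enabled = true;
}
```

The gap between the two marks provides hysteresis, so the card is not
masked and unmasked on every packet at the threshold.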
Re: Why do soft interrupt coelescing?
* Mike Smith <[EMAIL PROTECTED]> [011009 00:25] wrote:
> > * Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> > > As you say above, this is actually a good thing.  I don't see
> > > how this ties into the patch to introduce some sort of interrupt
> > > coalescing into the ti(4) driver.  IMO, you should be able to
> > > tweak the coalescing parameters on the board to do what you
> > > want.
> >
> > No matter how hard you tweak the board, an interrupt may still
> > trigger while you process a hardware interrupt; this causes an
> > additional poll which can cause additional coalescing.
>
> I don't think I understand what sort of crack you are smoking.
>
> If an interrupt-worthy condition is asserted on the board, you
> aren't going to leave your typical interrupt handler anyway; this
> sort of coalescing already happens without any "help".

After talking to you on IRC it's become obvious that this doesn't
exactly happen without help.  It's more of a side effect of the way
_some_ of the drivers are written.

What I understand from talking to you: most of the smarter or higher
performance drivers will check whether the tx/rx rings have been
modified by the hardware and will consume those packets as well.
However, most drivers have code like this:

	if (ifp->if_flags & IFF_RUNNING) {
		/* Check RX return ring producer/consumer */
		ti_rxeof(sc);

		/* Check TX ring producer/consumer */
		ti_txeof(sc);
	}

Now if more packets come in while we're in ti_txeof(), it seems that
you'll need to take an additional interrupt to get at them.

So Terry's code isn't wrong, but it's not as amazing as one would
initially think; it just avoids a race that can happen when more
packets arrive while transmitting, or when the transmit queue drains
while receiving.

Now, when one is doing a lot more work in the interrupt context (or
perhaps just running on a slower host processor), Terry's patches make
a lot more sense, as there's a much larger window available for this
race.
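[Editor's note: the difference between the quoted single-pass handler and
the "loop until quiet" style can be shown with a toy model.  Everything
here is a stand-in (the fake_* names are invented, not the real ti(4)
code); the model just makes the race concrete -- one packet arrives while
the TX ring is being drained.]

```c
#include <assert.h>
#include <stdbool.h>

struct fake_sc {
    int  rx_pending;   /* packets the card has queued for the host */
    int  tx_done;      /* TX descriptors the card has completed    */
    int  rx_handled;
    bool raced;        /* lets the modeled race fire exactly once  */
};

static bool hw_has_work(struct fake_sc *sc)
{
    return sc->rx_pending > 0 || sc->tx_done > 0;
}

static void fake_rxeof(struct fake_sc *sc)
{
    sc->rx_handled += sc->rx_pending;
    sc->rx_pending = 0;
}

static void fake_txeof(struct fake_sc *sc)
{
    sc->tx_done = 0;
    if (!sc->raced) {          /* model: a packet arrives while we */
        sc->raced = true;      /* are busy draining the TX ring    */
        sc->rx_pending++;
    }
}

/* Single-pass handler, as in the quoted driver code: work arriving
 * during the pass must wait for another hardware interrupt. */
static void intr_single(struct fake_sc *sc)
{
    fake_rxeof(sc);
    fake_txeof(sc);
}

/* Looping handler: re-check the rings before returning, so the raced
 * packet is consumed without a second interrupt. */
static void intr_loop(struct fake_sc *sc)
{
    do {
        fake_rxeof(sc);
        fake_txeof(sc);
    } while (hw_has_work(sc));
}
```

The loop closes exactly the window Alfred describes: whatever lands while
the handler is already running is picked up on the next pass instead of
costing a fresh interrupt.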
The fact that receiving is done before transmitting (at least in the
'ti' driver) makes it an even smaller race, as you're less likely to
be performing a lengthy operation inside the tx routine than if you
were doing some magic in the rx routine with incoming packets.

Or at least that's how it seems to me.  Either way, no need to get
your latex in a bunch, Mike.  :-)

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s
technology," start asking why software is ignoring 30 years of
accumulated wisdom.'

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
Re: Why do soft interrupt coelescing?
On Tue, Oct 09, 2001 at 00:18:57 -0500, Alfred Perlstein wrote:
> * Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> > As you say above, this is actually a good thing.  I don't see how
> > this ties into the patch to introduce some sort of interrupt
> > coalescing into the ti(4) driver.  IMO, you should be able to
> > tweak the coalescing parameters on the board to do what you want.
>
> No matter how hard you tweak the board, an interrupt may still
> trigger while you process a hardware interrupt; this causes an
> additional poll which can cause additional coalescing.

At least in the case of the Tigon, it won't interrupt while there is a
'1' written in mailbox 0.  (This happens in ti_intr().)

Ken
--
Kenneth Merry
[EMAIL PROTECTED]
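[Editor's note: Ken's mailbox observation amounts to a simple interrupt
mask.  A toy model of that convention follows; it is hypothetical
illustration code -- the real protocol lives in the Tigon firmware and in
ti_intr().]

```c
#include <assert.h>

struct fake_card {
    int mailbox0;       /* host writes 1 here to hold off interrupts */
    int intrs_raised;
};

/* Card side: an interrupt-worthy event raises an interrupt only when
 * mailbox 0 is clear. */
static void card_event(struct fake_card *c)
{
    if (c->mailbox0 == 0)
        c->intrs_raised++;
}

/* Host side, per the convention described above: mask on entry to the
 * handler, unmask when the handler is done. */
static void host_intr_enter(struct fake_card *c) { c->mailbox0 = 1; }
static void host_intr_leave(struct fake_card *c) { c->mailbox0 = 0; }
```

So additional events during handler execution do not stack up extra
interrupts; the handler's ring sweep finds that work instead.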
Re: Why do soft interrupt coelescing?
> * Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> > As you say above, this is actually a good thing.  I don't see how
> > this ties into the patch to introduce some sort of interrupt
> > coalescing into the ti(4) driver.  IMO, you should be able to
> > tweak the coalescing parameters on the board to do what you want.
>
> No matter how hard you tweak the board, an interrupt may still
> trigger while you process a hardware interrupt; this causes an
> additional poll which can cause additional coalescing.

I don't think I understand what sort of crack you are smoking.

If an interrupt-worthy condition is asserted on the board, you aren't
going to leave your typical interrupt handler anyway; this sort of
coalescing already happens without any "help".

--
... every activity meets with opposition, everyone who acts has his
rivals and unfortunately opponents also.  But not because people want
to be opponents, rather because the tasks and relationships force
people to take different points of view.  [Dr. Fritz Todt]
           V I C T O R Y   N O T   V E N G E A N C E
Re: Why do soft interrupt coelescing?
* Kenneth D. Merry <[EMAIL PROTECTED]> [011009 00:11] wrote:
> As you say above, this is actually a good thing.  I don't see how
> this ties into the patch to introduce some sort of interrupt
> coalescing into the ti(4) driver.  IMO, you should be able to tweak
> the coalescing parameters on the board to do what you want.

No matter how hard you tweak the board, an interrupt may still trigger
while you process a hardware interrupt; this causes an additional poll
which can cause additional coalescing.

--
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using "1970s
technology," start asking why software is ignoring 30 years of
accumulated wisdom.'
Re: Why do soft interrupt coelescing?
On Sun, Oct 07, 2001 at 00:56:44 -0700, Terry Lambert wrote:
> "Kenneth D. Merry" wrote:
> > [ I don't particularly want to get involved in this
> > thread...but... ]
> >
> > Can you explain why the ti(4) driver needs a coalescing patch?  It
> > already has in-firmware coalescing parameters that are tuneable by
> > the user.  It also already processes all outstanding BDs in
> > ti_rxeof() and ti_txeof().
>
> The answer to your question is that the card will continue to DMA
> into the ring buffer, even though you are in the middle of the
> interrupt service routine, and that the amount of time taken in
> ether input is long enough that you can have more packets come in
> while you are processing (this is actually a good thing).
>
> This is even *more* likely with hardware interrupt coalescing, since
> the default setting is to coalesce 32 packets into a single
> interrupt, meaning that you have up to 32 iterations of ether input
> to call, and thus the amount of time spent processing them actually
> affords *more* time for additional packets to come in.

As you say above, this is actually a good thing.  I don't see how this
ties into the patch to introduce some sort of interrupt coalescing
into the ti(4) driver.  IMO, you should be able to tweak the
coalescing parameters on the board to do what you want.

> In my own personal situation, I have also implemented Lazy Receiver
> Processing (per the research done by Rice University and in the
> "Click Router" project; no relation to "ClickArray"), which does all
> stack processing at the hardware interrupt, rather than queueing
> between the hardware interrupt and NETISR, so my processing path is
> actually longer; I get more benefit from the change than you would,
> but on a heavily loaded system, you would also get some benefit, if
> you were able to load the wire heavily enough.
> The LRP implementation should be considered by FreeBSD as well,
> since it takes the connection rate from ~7,000/second up to
> ~23,000/second, by avoiding the NETISR.  Rice University did an
> implementation in 2.2.x, and then another one (using resource
> containers -- I recommend against this one, not only because of
> license issues with the second implementation) for 4.2; both sets of
> research were done in FreeBSD.  Unfortunately, neither
> implementation was production quality (among other things, they
> broke RFC 1323, and they have to run a complete duplicate stack as a
> different protocol family, because some of their assumptions make it
> non-interoperable with other protocol stacks).

That sounds cool, but I still don't see how this ties into the patch
you sent out.

> > It isn't terribly clear what you're doing in the patch, since it
> > isn't a context diff.
>
> It's a "cvs diff" output.  You could always check out a sys tree,
> apply it, and then cvs diff -c (or -u or whatever your favorite
> option is) to get a diff more to your tastes.

As Peter Wemm pointed out, we can't use non-context diffs safely
without the exact time, date and branch of the source files.  This
introduces an additional burden for no real reason other than you
neglected to use -c or -u with cvs diff.

> > You also never gave any details behind your statement last week:
> >
> > "Because at the time the Tigon II was released, the jumbogram wire
> > format had not solidified.  Therefore cards built during that time
> > used different wire data for the jumbogram framing."
> >
> > I asked, in response:
> >
> > "Can you give more details?  Did someone decide on a different
> > ethertype than 0x8870 or something?
> >
> > That's really the only thing that's different between a standard
> > ethernet frame and a jumbo frame.  (other than the size)"
>
> I believe it was the implementation of the length field.
> I would have to get more information from the person who did the
> interoperability testing for the autonegotiation (which failed
> between the Tigon II and the Intel Gigabit cards).  I can assure you
> anecdotally, however, that autonegotiation _did_ fail.

I would believe that autonegotiation (i.e. 10/100/1000) might fail,
especially if you're using 1000BaseT Tigon II boards.  However, I
would like more details on the failure.  It's entirely possible that
it could be fixed in the firmware, probably without too much trouble.

I find it somewhat hard to believe that Intel would ship a gigabit
board that didn't interoperate with the board that, until fairly
recently, was probably the predominant gigabit board out there.

Ken
--
Kenneth Merry
[EMAIL PROTECTED]
Re: Why do soft interrupt coelescing?
"Kenneth D. Merry" wrote: > [ I don't particularly want to get involved in this thread...but... ] > > Can you explain why the ti(4) driver needs a coalescing patch? It already > has in-firmware coalescing paramters that are tuneable by the user. It > also already processes all outstanding BDs in ti_rxeof() and ti_txeof(). The answer to your question is that the card will continue to DMA into the ring buffer, even though you are in the middle of the interrupt service routine, and that the amount of time taken in ether input is long enough that you can have more packets come in while you are processing (this is actually a good thing). This is even *more* likely with hardware interrupt coelescing, since the default setting is to coelesce 32 packets into a single interrupt, meaning that you have up to 32 iterations of ether input to call, and thus the amount of time spent processing them actually affords *more* time for additional packets to come in. In my own personal situation, I have also implemented Lazy Receiver Processing (per the research done by Rice University and in the "Click Router" project; no relation to "ClickArray"), which does all stack processing at the hardware interrupt, rather then queueing between the hardware interrupt and NETISR, so my processing path is actually longer; I get more benefit from the change than you would, but on a heavily loaded system, you would also get some benefit, if you were able to load the wire heavily enough. The LRP implementation should be considered by FreeBSD as well, since it takes the connection rate from ~7,000/second up to ~23,000/second, by avoiding the NetISR. Rice University did an implementation in 2.2.x, and then another one (using resource containers -- I recommend against this one, not only because of license issues with the second implementation) for 4.2; both sets of research were done in FreeBSD. 
Unfortunately, neither implementation was production quality (among
other things, they broke RFC 1323, and they have to run a complete
duplicate stack as a different protocol family, because some of their
assumptions make it non-interoperable with other protocol stacks).

> It isn't terribly clear what you're doing in the patch, since it
> isn't a context diff.

It's a "cvs diff" output.  You could always check out a sys tree,
apply it, and then cvs diff -c (or -u or whatever your favorite option
is) to get a diff more to your tastes.

> You also never gave any details behind your statement last week:
>
> "Because at the time the Tigon II was released, the jumbogram wire
> format had not solidified.  Therefore cards built during that time
> used different wire data for the jumbogram framing."
>
> I asked, in response:
>
> "Can you give more details?  Did someone decide on a different
> ethertype than 0x8870 or something?
>
> That's really the only thing that's different between a standard
> ethernet frame and a jumbo frame.  (other than the size)"

I believe it was the implementation of the length field.  I would have
to get more information from the person who did the interoperability
testing for the autonegotiation (which failed between the Tigon II and
the Intel Gigabit cards).  I can assure you anecdotally, however, that
autonegotiation _did_ fail.

--
Terry