Re: [PATCH 1/2] e1000: fix netpoll with NAPI
(trimmed CC to just netdev) One of our engineers (on the I/O AT team) has been tasked with modifying the Linux kernel to properly support multiple hardware queues (both TX and RX). We'll make sure that he looks at the netpoll interface as part of that process. Might I ask who this is? I might like to ping him/her on this topic. There is potentially some overlap with wireless, at least on the transmit side. John W. Linville Hi John, so yeah we want multiple TX queues on wired ethernet for QoS, same as wireless. Also for SMP scaling (maybe not needed until 10G+). Did wireless people settle on a design since the email thread in January? Regards -- Andy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, Jun 14, 2006 at 04:44:56PM -0700, Mitch Williams wrote: One of our engineers (on the I/O AT team) has been tasked with modifying the Linux kernel to properly support multiple hardware queues (both TX and RX). We'll make sure that he looks at the netpoll interface as part of that process. Might I ask who this is? I might like to ping him/her on this topic. There is potentially some overlap with wireless, at least on the transmit side. John -- John W. Linville [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
Andy Grover [EMAIL PROTECTED] is the guy. I'm not sure when he'll be working on this; it's somewhere in his TODO pile. On Thu, 15 Jun 2006, John W. Linville wrote: On Wed, Jun 14, 2006 at 04:44:56PM -0700, Mitch Williams wrote: One of our engineers (on the I/O AT team) has been tasked with modifying the Linux kernel to properly support multiple hardware queues (both TX and RX). We'll make sure that he looks at the netpoll interface as part of that process. Might I ask who this is? I might like to ping him/her on this topic. There is potentially some overlap with wireless, at least on the transmit side. John -- John W. Linville [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Mon, Jun 12, 2006 at 02:06:00PM -0400, Neil Horman wrote: On Mon, Jun 12, 2006 at 09:42:14AM -0700, Mitch Williams wrote: On Sun, 2006-06-11 at 17:13 -0700, Neil Horman wrote: Any further thoughts on this guys? I still think my last solution solves all of the netpoll problems, and isn't going to have any noticable impact on performance. I haven't had time to evaluate performance on your patch (sorry!), but after thinking about it, I agree that it should not have any noticeable impact. OTOH, performance tuning is a funny thing, and things you think won't cause problems often do. Thats ok, I just didn't hear out of anyone on friday, so I was curious as to where we were on this. I don't have the ability to do any real world performance testing here, but I'll try to record the run time of the interrupt routine on a limited number of frames here. Hey, as promised, I've done some rudimentary performance benchmarking on various ways that we have talked about to solve this problem. As I previously mentioned I didn't have the equipment to do any real full scale testing here, so what I did was take a read of the real time counter at the start and end of the e1000_intr routine with various patches applied, and I recorded the number of ticks elapsed on the tsc during its run. I did this on my single cpu x86_64 machine here, using the latest unpatched e1000 driver as a base, and then comparing it to the e1000 driver using my patch and separately with a patch that spinlocks the e1000_clean_rx_irq routine (so as to serialize the critical section that would otherwise be subject to data corruption. Here are my results: Base line: Avg. 8145 Ticks on the tsc. With my patch: http://marc.theaimsgroup.com/?l=linux-netdevm=114970807606096w=2 Avg. 8159 Ticks on the tsc. (+0.17% increase) With a spinlock added to e1000_clean_rx_irq: Avg. 8238 Ticks on the tsc. (+1.1% increase) If you like I can send you the time stamp counter patch that I used, as well as the patch which adds spinlocks to the clean routine. Note that the free running counter values will vary so you probably want to look at percentage increase. Either way, I think either solution provides very little impact on interrupt run time. Given that my patch (granted using my test methodology here) is the faster of the two, and arguably the more correct in terms of not using the poll controller method to recieve frames, We should go with that patch. Thoughts/opinions? Neil -- /*** *Neil Horman *Software Engineer *gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu ***/ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, 14 Jun 2006, Neil Horman wrote: Hey, as promised, I've done some rudimentary performance benchmarking on various ways that we have talked about to solve this problem. As I previously mentioned We see the same results here, Neil. However, we've got a much less invasive patch undergoing internal review, and which we will post to netdev once everybody gets happy with it. Basically, we just do our NAPI scheduling on the real netdev structure instead of our polling netdev, in the case where we only have one RX queue. Since this is the case for all our currently-shipping parts under Linux, netpoll works again across the board. It's a short-term fix because we do want to support multiple queues going forward, but for now we need to get everybody working. One of our engineers (on the I/O AT team) has been tasked with modifying the Linux kernel to properly support multiple hardware queues (both TX and RX). We'll make sure that he looks at the netpoll interface as part of that process. Stay tuned for our impending patch. -Mitch - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Mon, Jun 12, 2006 at 09:42:14AM -0700, Mitch Williams wrote: On Sun, 2006-06-11 at 17:13 -0700, Neil Horman wrote: Any further thoughts on this guys? I still think my last solution solves all of the netpoll problems, and isn't going to have any noticable impact on performance. I haven't had time to evaluate performance on your patch (sorry!), but after thinking about it, I agree that it should not have any noticeable impact. OTOH, performance tuning is a funny thing, and things you think won't cause problems often do. Thats ok, I just didn't hear out of anyone on friday, so I was curious as to where we were on this. I don't have the ability to do any real world performance testing here, but I'll try to record the run time of the interrupt routine on a limited number of frames here. Anyway, I'm still not quite ready to ACK this because it's just not future-proof. Eventually, we will need to support multiple RX queues, and this solution will not work in that situation. Not sure I understand this. My patch: http://marc.theaimsgroup.com/?l=linux-netdevm=114970506406155w=2 still allows for the use of multiple rx queues in the nominal case. Only when we have to use a relatively slow netpoll driven operation (kgdb, netdump, etc), do we need to receive on the same interface that we transmit on. A simpler short-term solution is just to schedule our NAPI polling on the real netdev instead of our polling netdev. This is a trivial change and works correctly with a single queue. But, like your patch, it isn't future-proof. Again, not sure what you mean here. My last patch proposal: http://marc.theaimsgroup.com/?l=linux-netdevm=114970807606096w=2 Does precisely what you describe, but it still allows for multiple rx queues in the nominal (non poll_controller driven) receive case. So, I'm still thinking and pondering on this one. If we get a patch in to fix the recursive loop in netpoll, my original patch will work, right? Or is there still another issue? I assume you are referring to this patch: http://marc.theaimsgroup.com/?l=linux-netdevm=114970506406155w=2 If so, then no, that patch is still broken for the reasons Jeff outlined previously, The recursion patch that I proposed earlier today is related to a different recursion problem, and while the e1000 driver might be suceptable to it, your patch is also suceptible to the data corruption that arises from when the poll_controller calls adapter-clean_rx at the same that that dev-poll is called for the same adapter on another cpu. If that happens we can have two cpu's writing to the same private net_device data without the benefit of a spinlock to protect them. And yes, you can add a spin lock to protect the case where you have a dev-poll_controller and a dev-poll operation at the same time, but that seems to me like it will also re-serialize all the parallel operations that you could otherwise do with multiple rx queues I'll post again if I get a chance to do some performance measurements Regrds Neil -Mitch - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- /*** *Neil Horman *Software Engineer *gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu ***/ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Thu, Jun 08, 2006 at 01:29:00PM -0400, Jeff Moyer wrote: == Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Mitch Williams [EMAIL PROTECTED] adds: mitch.a.williams On Wed, 2006-06-07 at 11:44 -0700, Jeff Moyer wrote: That patch locks around the tx clean routine. As such, it doesn't prevent the problem. mitch.a.williams The call to netif_rx_schedule_prep provides locking mitch.a.williams because it sets the __LINK_STATE_RX_SCHED bit atomically. mitch.a.williams The spinlock around e1000_clean_tx_irq is to protect it mitch.a.williams from other calls to the transmit routine, not NAPI. Yes, but what prevents recursion in the poll routine? Consider that the poll routine could end up triggerring a printk (think iptables, here). In that case, you end up calling into netpoll, and if the tx ring is full, we call the poll_controller routine. We've now recursed. The poll lock was originally introduced to prevent recursion, not concurrent access. Any further thoughts on this guys? I still think my last solution solves all of the netpoll problems, and isn't going to have any noticable impact on performance. Regards Neil -Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, 2006-06-07 at 11:44 -0700, Jeff Moyer wrote: That patch locks around the tx clean routine. As such, it doesn't prevent the problem. The call to netif_rx_schedule_prep provides locking because it sets the __LINK_STATE_RX_SCHED bit atomically. The spinlock around e1000_clean_tx_irq is to protect it from other calls to the transmit routine, not NAPI. -Mitch + disable_irq(adapter-pdev-irq); + if (likely(netif_rx_schedule_prep(adapter-polling_netdev[0]))) { + if (spin_trylock(adapter-tx_queue_lock)) { + e1000_clean_tx_irq(adapter, adapter-tx_ring[0]); + spin_unlock(adapter-tx_queue_lock); + } + adapter-clean_rx(adapter, adapter-rx_ring, + budget, netdev-weight); + clear_bit(__LINK_STATE_RX_SCHED, + adapter-polling_netdev[0].state); -Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
== Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Mitch Williams [EMAIL PROTECTED] adds: mitch.a.williams On Wed, 2006-06-07 at 11:44 -0700, Jeff Moyer wrote: That patch locks around the tx clean routine. As such, it doesn't prevent the problem. mitch.a.williams The call to netif_rx_schedule_prep provides locking mitch.a.williams because it sets the __LINK_STATE_RX_SCHED bit atomically. mitch.a.williams The spinlock around e1000_clean_tx_irq is to protect it mitch.a.williams from other calls to the transmit routine, not NAPI. Yes, but what prevents recursion in the poll routine? Consider that the poll routine could end up triggerring a printk (think iptables, here). In that case, you end up calling into netpoll, and if the tx ring is full, we call the poll_controller routine. We've now recursed. The poll lock was originally introduced to prevent recursion, not concurrent access. -Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, 2006-06-07 at 11:54 -0700, John W. Linville wrote: Pedantic objection, but I think this would read easier w/o the extra newline before disable_irq. Heh. I prefer to have a newline between declarations and code. The real problem is the position of the #ifdef -- that's what makes it difficult to read. The other solution would be { struct e1000_adapter *adapter = netdev_priv(netdev); #ifdef CONFIG_E1000_NAPI int budget = 0; #endif disable_irq(adapter-pdev-irq); #ifdef CONFIG_E1000_NAPI all that stuff #else rest of the stuff #endif } Which I think is worse to read. -Mitch - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Thu, Jun 08, 2006 at 10:23:56AM -0700, Mitch Williams wrote: On Wed, 2006-06-07 at 11:54 -0700, John W. Linville wrote: Pedantic objection, but I think this would read easier w/o the extra newline before disable_irq. Heh. I prefer to have a newline between declarations and code. The Normally I would agree. But in this case, I find the distraction of the random newline after the #else to be more compelling. real problem is the position of the #ifdef -- that's what makes it difficult to read. The other solution would be { struct e1000_adapter *adapter = netdev_priv(netdev); #ifdef CONFIG_E1000_NAPI int budget = 0; #endif disable_irq(adapter-pdev-irq); #ifdef CONFIG_E1000_NAPI all that stuff #else rest of the stuff #endif } Which I think is worse to read. I presume it is the double #ifdef that you find objectionable? I don't really like it, but at least that idiom is quite common. Given that the disable_irq appears in both code paths (almost by necessity), there is a certain appeal to having it outside of the #ifdef block. That seems more maintainable. To me, the idiomatic #ifdef placement seems more readable, if for no other reason than familiarity. I suppose we can agree to disagree. John -- John W. Linville [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
Matt, any ideas on this? Not at the moment. how about this for a solution? It doesn't make netpoll any more robust, but I think in the interests of efficiency it would be fair to require that, when netpolled, a driver must receive frames on the same net device for which it was polled. With this patch we detect that condition and handle it accordingly in e1000_intr. This eliminates the need for us to call the clean_rx method from the poll_controller when napi is configured, instead allowing the poll method to be called from napi_poll, as the netpoll model currently does. This fixes the netdump regression, and eliminates the layering violation and the potential race that we've been discussing. I've just tested it with netdump here and it works quite well. Thoughts appreciated. Thanks Regards Neil Signed-off-by: Neil Horman [EMAIL PROTECTED] e1000_main.c | 54 +- 1 files changed, 45 insertions(+), 9 deletions(-) --- linux-2.6.9/drivers/net/e1000/e1000_main.c.neil 2006-06-06 10:37:42.0 -0400 +++ linux-2.6.9/drivers/net/e1000/e1000_main.c 2006-06-07 10:48:22.0 -0400 @@ -3207,8 +3207,9 @@ * @pt_regs: CPU registers structure **/ + static irqreturn_t -e1000_intr(int irq, void *data, struct pt_regs *regs) +__e1000_intr(int irq, void *data, struct pt_regs *regs, int netpoll_op) { struct net_device *netdev = data; struct e1000_adapter *adapter = netdev_priv(netdev); @@ -3217,6 +3218,7 @@ #ifndef CONFIG_E1000_NAPI int i; #else + struct net_device *dev_to_sched; /* Interrupt Auto-Mask...upon reading ICR, * interrupts are masked. No need for the * IMC write, but it does mean we should @@ -3255,8 +3257,22 @@ E1000_WRITE_REG(hw, IMC, ~0); E1000_WRITE_FLUSH(hw); } - if (likely(netif_rx_schedule_prep(adapter-polling_netdev[0]))) - __netif_rx_schedule(adapter-polling_netdev[0]); + + /* +* netpoll operations, in the interests of efficiency +* only do napi polling on the device passed to the +* poll_controller. Therefore, if we are preforming +* a netpoll operation, then we can't schedule a receive +* to one of the dummy net devices that exist for sole +* purpose of spreading out rx schedules +*/ + if (netpoll_op) + dev_to_sched = netdev; + else + dev_to_sched = adapter-polling_netdev[0]; + + if (likely(netif_rx_schedule_prep(dev_to_sched))) + __netif_rx_schedule(dev_to_sched); else e1000_irq_enable(adapter); #else @@ -3288,6 +3304,13 @@ return IRQ_HANDLED; } +static irqreturn_t +e1000_intr(int irq, void *data, struct pt_regs *regs) +{ + return __e1000_intr(irq, data, regs, 0); +} + + #ifdef CONFIG_E1000_NAPI /** * e1000_clean - NAPI Rx polling callback @@ -3300,7 +3323,6 @@ struct e1000_adapter *adapter; int work_to_do = min(*budget, poll_dev-quota); int tx_cleaned = 0, i = 0, work_done = 0; - /* Must NOT use netdev_priv macro here. */ adapter = poll_dev-priv; @@ -3308,10 +3330,24 @@ if (!netif_carrier_ok(adapter-netdev)) goto quit_polling; - while (poll_dev != adapter-polling_netdev[i]) { - i++; - if (unlikely(i == adapter-num_rx_queues)) - BUG(); + /* +* only search for a matching polling_netdev in the event +* that this isn't a real registered net_device +* A real net device can be passed in here in the event +* that netdump has been activated (this comes through +* netpoll_poll_dev). We detect this by virtue of the +* fact that each polling_netdev-priv points to the private +* data of its parent (registered) netdev. So if: +* poll_dev-priv == netdev_priv(poll_dev), its a real device +* otherwise its a polling_netdev. +*/ + if (adapter != netdev_priv(poll_dev)) { + while (poll_dev != adapter-polling_netdev[i]) { + i++; + if (unlikely(i == adapter-num_rx_queues)) + BUG(); + } + } if (likely(adapter-num_tx_queues == 1)) { @@ -4624,7 +4660,7 @@ { struct e1000_adapter *adapter = netdev_priv(netdev); disable_irq(adapter-pdev-irq); - e1000_intr(adapter-pdev-irq, netdev, NULL); + __e1000_intr(adapter-pdev-irq, netdev, NULL, 1); e1000_clean_tx_irq(adapter, adapter-tx_ring); #ifndef CONFIG_E1000_NAPI adapter-clean_rx(adapter, adapter-rx_ring); -- /*** *Neil Horman *Software Engineer *gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu ***/ - To unsubscribe
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, Jun 07, 2006 at 11:05:22AM -0400, Neil Horman wrote: Matt, any ideas on this? Not at the moment. how about this for a solution? It doesn't make netpoll any more robust, but I think in the interests of efficiency it would be fair to require that, when netpolled, a driver must receive frames on the same net device for which it was polled. With this patch we detect that condition and handle it accordingly in e1000_intr. This eliminates the need for us to call the clean_rx method from the poll_controller when napi is configured, instead allowing the poll method to be called from napi_poll, as the netpoll model currently does. This fixes the netdump regression, and eliminates the layering violation and the potential race that we've been discussing. I've just tested it with netdump here and it works quite well. Thoughts appreciated. This looks pretty reasonable, mostly from the perspective that it doesn't put any further ugliness in netpoll. We might want to add a comment somewhere in netpoll of the new rule we're now observing. I'll let the e1000 guys comment on the particulars of the driver change. Signed-off-by: Neil Horman [EMAIL PROTECTED] Signed-off-by: Matt Mackall [EMAIL PROTECTED] -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
Matt Mackall wrote: On Wed, Jun 07, 2006 at 11:05:22AM -0400, Neil Horman wrote: Matt, any ideas on this? Not at the moment. how about this for a solution? It doesn't make netpoll any more robust, but I think in the interests of efficiency it would be fair to require that, when netpolled, a driver must receive frames on the same net device for which it was polled. With this patch we detect that condition and handle it accordingly in e1000_intr. This eliminates the need for us to call the clean_rx method from the poll_controller when napi is configured, instead allowing the poll method to be called from napi_poll, as the netpoll model currently does. This fixes the netdump regression, and eliminates the layering violation and the potential race that we've been discussing. I've just tested it with netdump here and it works quite well. Thoughts appreciated. This looks pretty reasonable, mostly from the perspective that it doesn't put any further ugliness in netpoll. We might want to add a comment somewhere in netpoll of the new rule we're now observing. I'll let the e1000 guys comment on the particulars of the driver change. Hi, we're not too happy with this as it puts a branch right in the regular receive path. We haven't ran the numbers on it yet but it is likely that this will lower performance significantly during normal receives for something that is not common use. Attached below a (revised) patch that adds proper locking around the rx_clean to prevent the race. Cheers, Auke --- e1000: fix netpoll with NAPI This fixes netpoll when using NAPI, and protects against a rare race condition in the netpoll routine, while staying out of our ways from the normal hotpath. Signed-off-by: Mitch Williams [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] --- e1000_main.c | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index bd709e5..876251c 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -4584,10 +4584,25 @@ static void e1000_netpoll(struct net_device *netdev) { struct e1000_adapter *adapter = netdev_priv(netdev); +#ifdef CONFIG_E1000_NAPI + int budget = 0; + + disable_irq(adapter-pdev-irq); + if (likely(netif_rx_schedule_prep(adapter-polling_netdev[0]))) { + if (spin_trylock(adapter-tx_queue_lock)) { + e1000_clean_tx_irq(adapter, adapter-tx_ring[0]); + spin_unlock(adapter-tx_queue_lock); + } + adapter-clean_rx(adapter, adapter-rx_ring, +budget, netdev-weight); + clear_bit(__LINK_STATE_RX_SCHED, +adapter-polling_netdev[0].state); + } +#else + disable_irq(adapter-pdev-irq); e1000_intr(adapter-pdev-irq, netdev, NULL); e1000_clean_tx_irq(adapter, adapter-tx_ring); -#ifndef CONFIG_E1000_NAPI adapter-clean_rx(adapter, adapter-rx_ring); #endif enable_irq(adapter-pdev-irq);
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
== Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Auke Kok [EMAIL PROTECTED] adds: auke-jan.h.kok Hi, auke-jan.h.kok we're not too happy with this as it puts a branch right in auke-jan.h.kok the regular receive path. We haven't ran the numbers on it auke-jan.h.kok yet but it is likely that this will lower performance auke-jan.h.kok significantly during normal receives for something that is auke-jan.h.kok not common use. auke-jan.h.kok Attached below a (revised) patch that adds proper locking auke-jan.h.kok around the rx_clean to prevent the race. That patch locks around the tx clean routine. As such, it doesn't prevent the problem. + disable_irq(adapter-pdev-irq); + if (likely(netif_rx_schedule_prep(adapter-polling_netdev[0]))) { + if (spin_trylock(adapter-tx_queue_lock)) { + e1000_clean_tx_irq(adapter, adapter-tx_ring[0]); + spin_unlock(adapter-tx_queue_lock); + } + adapter-clean_rx(adapter, adapter-rx_ring, + budget, netdev-weight); + clear_bit(__LINK_STATE_RX_SCHED, + adapter-polling_netdev[0].state); -Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, Jun 07, 2006 at 11:25:29AM -0700, Auke Kok wrote: @@ -4584,10 +4584,25 @@ static void e1000_netpoll(struct net_device *netdev) { struct e1000_adapter *adapter = netdev_priv(netdev); +#ifdef CONFIG_E1000_NAPI + int budget = 0; + + disable_irq(adapter-pdev-irq); + if (likely(netif_rx_schedule_prep(adapter-polling_netdev[0]))) { + if (spin_trylock(adapter-tx_queue_lock)) { + e1000_clean_tx_irq(adapter, adapter-tx_ring[0]); + spin_unlock(adapter-tx_queue_lock); + } + adapter-clean_rx(adapter, adapter-rx_ring, + budget, netdev-weight); + clear_bit(__LINK_STATE_RX_SCHED, + adapter-polling_netdev[0].state); + } +#else + disable_irq(adapter-pdev-irq); Pedantic objection, but I think this would read easier w/o the extra newline before disable_irq. John -- John W. Linville [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Wed, Jun 07, 2006 at 02:44:54PM -0400, Jeff Moyer wrote: == Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Auke Kok [EMAIL PROTECTED] adds: auke-jan.h.kok Hi, auke-jan.h.kok we're not too happy with this as it puts a branch right in auke-jan.h.kok the regular receive path. We haven't ran the numbers on it auke-jan.h.kok yet but it is likely that this will lower performance auke-jan.h.kok significantly during normal receives for something that is auke-jan.h.kok not common use. auke-jan.h.kok Attached below a (revised) patch that adds proper locking auke-jan.h.kok around the rx_clean to prevent the race. That patch locks around the tx clean routine. As such, it doesn't prevent the problem. Further to that, do tests on this if you like, but I would certainly think a properly formed conditional operation is going to provide better performance than a spin_lock operation in the receive path. Why not just turn the: if(netpoll_op) into an if(unlikely(netpoll_op)) I expect that will reduce the overhead of the conditional to effectively zero for the normal receive case. The following patch does that, and I expect you performance won't suffer at all: Signed-off-by: Neil Horman [EMAIL PROTECTED] e1000_main.c | 54 +- 1 files changed, 45 insertions(+), 9 deletions(-) --- linux-2.6.9/drivers/net/e1000/e1000_main.c.neil 2006-06-06 10:37:42.0 -0400 +++ linux-2.6.9/drivers/net/e1000/e1000_main.c 2006-06-07 10:48:22.0 -0400 @@ -3207,8 +3207,9 @@ e1000_update_stats(struct e1000_adapter * @pt_regs: CPU registers structure **/ + static irqreturn_t -e1000_intr(int irq, void *data, struct pt_regs *regs) +__e1000_intr(int irq, void *data, struct pt_regs *regs, int netpoll_op) { struct net_device *netdev = data; struct e1000_adapter *adapter = netdev_priv(netdev); @@ -3217,6 +3218,7 @@ e1000_intr(int irq, void *data, struct p #ifndef CONFIG_E1000_NAPI int i; #else + struct net_device *dev_to_sched; /* Interrupt Auto-Mask...upon reading ICR, * interrupts are masked. No need for the * IMC write, but it does mean we should @@ -3255,8 +3257,22 @@ e1000_intr(int irq, void *data, struct p E1000_WRITE_REG(hw, IMC, ~0); E1000_WRITE_FLUSH(hw); } - if (likely(netif_rx_schedule_prep(adapter-polling_netdev[0]))) - __netif_rx_schedule(adapter-polling_netdev[0]); + + /* +* netpoll operations, in the interests of efficiency +* only do napi polling on the device passed to the +* poll_controller. Therefore, if we are preforming +* a netpoll operation, then we can't schedule a receive +* to one of the dummy net devices that exist for sole +* purpose of spreading out rx schedules +*/ + if (unlikely(netpoll_op)) + dev_to_sched = netdev; + else + dev_to_sched = adapter-polling_netdev[0]; + + if (likely(netif_rx_schedule_prep(dev_to_sched))) + __netif_rx_schedule(dev_to_sched); else e1000_irq_enable(adapter); #else @@ -3288,6 +3304,13 @@ e1000_intr(int irq, void *data, struct p return IRQ_HANDLED; } +static irqreturn_t +e1000_intr(int irq, void *data, struct pt_regs *regs) +{ + return __e1000_intr(irq, data, regs, 0); +} + + #ifdef CONFIG_E1000_NAPI /** * e1000_clean - NAPI Rx polling callback @@ -3300,7 +3323,6 @@ e1000_clean(struct net_device *poll_dev, struct e1000_adapter *adapter; int work_to_do = min(*budget, poll_dev-quota); int tx_cleaned = 0, i = 0, work_done = 0; - /* Must NOT use netdev_priv macro here. */ adapter = poll_dev-priv; @@ -3308,10 +3330,24 @@ e1000_clean(struct net_device *poll_dev, if (!netif_carrier_ok(adapter-netdev)) goto quit_polling; - while (poll_dev != adapter-polling_netdev[i]) { - i++; - if (unlikely(i == adapter-num_rx_queues)) - BUG(); + /* +* only search for a matching polling_netdev in the event +* that this isn't a real registered net_device +* A real net device can be passed in here in the event +* that netdump has been activated (this comes through +* netpoll_poll_dev). We detect this by virtue of the +* fact that each polling_netdev-priv points to the private +* data of its parent (registered) netdev. So if: +* poll_dev-priv == netdev_priv(poll_dev), its a real device +* otherwise its a polling_netdev. +*/ + if (likely(adapter != netdev_priv(poll_dev))) { + while (poll_dev != adapter-polling_netdev[i]) { + i++; + if (unlikely(i == adapter-num_rx_queues)) + BUG
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Mon, Jun 05, 2006 at 04:11:25PM -0700, Kok, Auke wrote: Netpoll was broken due to the earlier addition of multiqueue. Signed-off-by: Mitch Williams [EMAIL PROTECTED] Signed-off-by: Auke Kok [EMAIL PROTECTED] --- drivers/net/e1000/e1000_main.c |9 - 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index ed15fca..7103a0e 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -4629,10 +4629,17 @@ static void e1000_netpoll(struct net_device *netdev) { struct e1000_adapter *adapter = netdev_priv(netdev); +#ifdef CONFIG_E1000_NAPI + int budget = 0; + + disable_irq(adapter-pdev-irq); + e1000_clean_tx_irq(adapter, adapter-tx_ring); + adapter-clean_rx(adapter, adapter-rx_ring, budget, netdev-weight); +#else + I've been speaking about this fix with a Jeff Moyer, and we've come up with some concerns regarding its implementation. Specifically the call to adapter-clean_rx in the case of the e1000 driver is rather a layering violation in the netpoll code, in the sense that the function pointed to by clean_rx is functionality that is nominally used by the dev-poll method. In fact in this case, it would appear possible since dev-poll is called under the poll_lock, but dev-poll_controller is not, that is is possible to have cpus in a system executing in e1000_clean_rx_irq_[ps] at the same time leading to data corruption: CPU0: netpoll_poll_dev dev-poll_controller (e1000_netpoll) adapter-clean_rx (e1000_clean_rx_irq) CPU1: napi_poll dev-poll (e1000_clean) e1000_clean_rx_irq I'm not sure what the right fix is here. A spinlock in e1000_clean_rx_irq[_ps] would seem to be an easy and direct fix, but I don't know that thats the best solution. Something I don't understand is why the call to adapter-clean_rx is required in the first place when the napi_poll routine calls it itself directly. All other drivers schedule a napi poll and receive frames via that path. My guess is that it has to do with the fact that we schedule polls on the device in the polling_netdev array, rather than the actual registered netdev. Looking at the driver code I note that while an entire array is allocated for polling netdevs, we only ever use entry 0. Would it make sense to just remove the the polling_netdev array and use the registered device like all the other drivers. I expect that would eliminate the need for this patch as well. Regards Neil disable_irq(adapter-pdev-irq); e1000_intr(adapter-pdev-irq, netdev, NULL); e1000_clean_tx_irq(adapter, adapter-tx_ring); -#ifndef CONFIG_E1000_NAPI adapter-clean_rx(adapter, adapter-rx_ring); #endif enable_irq(adapter-pdev-irq); -- Auke Kok [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- /*** *Neil Horman *Software Engineer *gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu ***/ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Tue, 2006-06-06 at 09:52 -0400, Neil Horman wrote: I've been speaking about this fix with a Jeff Moyer, and we've come up with some concerns regarding its implementation. Specifically the call to adapter-clean_rx in the case of the e1000 driver is rather a layering violation in the netpoll code, in the sense that the function pointed to by clean_rx is functionality that is nominally used by the dev-poll method. In fact in this case, it would appear possible since dev-poll is called under the poll_lock, but dev-poll_controller is not, that is is possible to have cpus in a system executing in e1000_clean_rx_irq_[ps] at the same time leading to data corruption: CPU0: netpoll_poll_dev dev-poll_controller (e1000_netpoll) adapter-clean_rx (e1000_clean_rx_irq) CPU1: napi_poll dev-poll (e1000_clean) e1000_clean_rx_irq Hmmm. You may have a point. I don't think a spinlock is required, but we do need to check if the poll is already scheduled on another CPU, like netpoll does in poll_napi(). In practice, of course, we never see this. The only netpoll client in the kernel is netconsole, which doesn't do receives. A few Major Distros use netdump, which does do receives, but only after the system has crashed. In that case, all other CPUs are stopped anyway. However, just for the sake of correctness (and paranoia), I'll whip up another patch that does this check. Jeff, please do not commit this patch. -Mitch NB: The root of this problem is that e1000 uses a dummy netdev to schedule NAPI polling. Since netpoll doesn't know about the dummy netdev, it checks the real e1000 netdev struct to see if it's scheduled for polling. Since this is never the case, netpoll fails when e1000 is configured to use NAPI. Hence, this patch. Why is the dummy netdev in place? To support multi-queue RX. Our PCIe adapters support this, but the kernel's not _quite_ ready yet. Hopefully, the VJ net channel stuff will enable this feature. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
On Tue, Jun 06, 2006 at 09:39:25AM -0700, Mitch Williams wrote: On Tue, 2006-06-06 at 09:52 -0400, Neil Horman wrote: I've been speaking about this fix with a Jeff Moyer, and we've come up with some concerns regarding its implementation. Specifically the call to adapter-clean_rx in the case of the e1000 driver is rather a layering violation in the netpoll code, in the sense that the function pointed to by clean_rx is functionality that is nominally used by the dev-poll method. In fact in this case, it would appear possible since dev-poll is called under the poll_lock, but dev-poll_controller is not, that is is possible to have cpus in a system executing in e1000_clean_rx_irq_[ps] at the same time leading to data corruption: CPU0: netpoll_poll_dev dev-poll_controller (e1000_netpoll) adapter-clean_rx (e1000_clean_rx_irq) CPU1: napi_poll dev-poll (e1000_clean) e1000_clean_rx_irq Hmmm. You may have a point. I don't think a spinlock is required, but we do need to check if the poll is already scheduled on another CPU, like netpoll does in poll_napi(). In practice, of course, we never see this. The only netpoll client in the kernel is netconsole, which doesn't do receives. A few Major Distros use netdump, which does do receives, but only after the system has crashed. In that case, all other CPUs are stopped anyway. You are probably right, this is a rare case, if it ever happens at all, but I think there is (at least as Jeff explained it to me) a corner case, where netconsole on transmit, notes an exhaustion of tx descriptors, and in response calls the poll controller method of the driver to clean up and make some of those descriptors available: printk release_console_sem call_console_drivers netconsole.c:write_msg netpoll_send_udp netpoll_send_skb if (netif_queue_stopped(np-dev)) --- out of descriptors ? netpoll_poll(np); --- trigger a poll to clean up If this is being done at the same time as a napi_poll on another cpu, we have a real set of conditions under which a corruption can occur. However, just for the sake of correctness (and paranoia), I'll whip up another patch that does this check. Thanks for the quick feedback! Regards Neil Jeff, please do not commit this patch. -Mitch NB: The root of this problem is that e1000 uses a dummy netdev to schedule NAPI polling. Since netpoll doesn't know about the dummy netdev, it checks the real e1000 netdev struct to see if it's scheduled for polling. Since this is never the case, netpoll fails when e1000 is configured to use NAPI. Hence, this patch. Why is the dummy netdev in place? To support multi-queue RX. Our PCIe adapters support this, but the kernel's not _quite_ ready yet. Hopefully, the VJ net channel stuff will enable this feature. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- /*** *Neil Horman *Software Engineer *gpg keyid: 1024D / 0x92A74FA1 - http://pgp.mit.edu ***/ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
Neil Horman wrote: On Tue, Jun 06, 2006 at 09:39:25AM -0700, Mitch Williams wrote: On Tue, 2006-06-06 at 09:52 -0400, Neil Horman wrote: [snip] However, just for the sake of correctness (and paranoia), I'll whip up another patch that does this check. Thanks for the quick feedback! Regards Neil Jeff, please do not commit this patch. Jeff, I've popped the patch off from our gitserver, so you can pull the two outstanding patches while we revamp this one. Auke - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
== Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Mitch Williams [EMAIL PROTECTED] adds: mitch On Tue, 2006-06-06 at 09:52 -0400, Neil Horman wrote: I've been speaking about this fix with a Jeff Moyer, and we've come up with some concerns regarding its implementation. Specifically the call to adapter- clean_rx in the case of the e1000 driver is rather a layering violation in the netpoll code, in the sense that the function pointed to by clean_rx is functionality that is nominally used by the dev-poll method. In fact in this case, it would appear possible since dev-poll is called under the poll_lock, but dev-poll_controller is not, that is is possible to have cpus in a system executing in e1000_clean_rx_irq_[ps] at the same time leading to data corruption: CPU0: netpoll_poll_dev dev- poll_controller (e1000_netpoll) adapter- clean_rx (e1000_clean_rx_irq) CPU1: napi_poll dev- poll (e1000_clean) e1000_clean_rx_irq mitch Hmmm. You may have a point. I don't think a spinlock is required, but mitch we do need to check if the poll is already scheduled on another CPU, mitch like netpoll does in poll_napi(). mitch In practice, of course, we never see this. The only netpoll client in mitch the kernel is netconsole, which doesn't do receives. A few Major mitch Distros use netdump, which does do receives, but only after the system mitch has crashed. In that case, all other CPUs are stopped anyway. Don't forget about driver/net/kgdb_eth.c, which is in-tree, and does sends and receives. -Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
Jeff Moyer wrote: == Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Auke Kok [EMAIL PROTECTED] adds: auke-jan.h.kok Neil Horman wrote: On Tue, Jun 06, 2006 at 09:39:25AM -0700, Mitch Williams wrote: On Tue, 2006-06-06 at 09:52 -0400, Neil Horman wrote: auke-jan.h.kok [snip] However, just for the sake of correctness (and paranoia), I'll whip up another patch that does this check. Thanks for the quick feedback! Regards Neil Jeff, please do not commit this patch. auke-jan.h.kok Jeff, auke-jan.h.kok I've popped the patch off from our gitserver, so you can pull the two auke-jan.h.kok outstanding patches while we revamp this one. Would you please send patches to this list? I'd certainly like to review them. I don't think the problem needs solving in the e1000 driver. I think it is an issue that should be handled properly by netpoll. ??? that message was directed to Jeff Garzik, perhaps that was too confusing. They were sent here in the first place: http://marc.theaimsgroup.com/?l=linux-netdevm=114954878711789w=2 Cheers, Auke - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] e1000: fix netpoll with NAPI
== Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Auke Kok [EMAIL PROTECTED] adds: auke-jan.h.kok Jeff Moyer wrote: == Regarding Re: [PATCH 1/2] e1000: fix netpoll with NAPI; Auke Kok [EMAIL PROTECTED] adds: auke-jan.h.kok Neil Horman wrote: On Tue, Jun 06, 2006 at 09:39:25AM -0700, Mitch Williams wrote: On Tue, 2006-06-06 at 09:52 -0400, Neil Horman wrote: auke-jan.h.kok [snip] However, just for the sake of correctness (and paranoia), I'll whip up another patch that does this check. Thanks for the quick feedback! Regards Neil Jeff, please do not commit this patch. auke-jan.h.kok Jeff, auke-jan.h.kok I've popped the patch off from our gitserver, so you can pull the two auke-jan.h.kok outstanding patches while we revamp this one. Would you please send patches to this list? I'd certainly like to review them. I don't think the problem needs solving in the e1000 driver. I think it is an issue that should be handled properly by netpoll. auke-jan.h.kok ??? auke-jan.h.kok that message was directed to Jeff Garzik, perhaps that was too confusing. I figured it was directed at Jeff G. auke-jan.h.kok They were sent here in the first place: auke-jan.h.kok http://marc.theaimsgroup.com/?l=linux-netdevm=114954878711789w=2 Thanks for the pointer. As you noted, e1000 now uses a separate device for polling. Netpoll should be able to deal with this, though. I'd like to solicit mpm's input on this, as I'm having difficulties coming up with a clean solution. For some background, the netpoll_poll_lock calls were introduced to prevent recursion in a device driver's -poll routine. By essentially calling the poll routine from the poll_controller routine, you are no longer prevented from such recursion. It would be best if the poll_controller routine was kept simple. It should do the equivalent of the interrupt processing portion of the work, and leave the delivery to the network stack for a call to the -poll routine. Solving this at the netpoll layer will be a bit difficult, since we have no insight into the driver design (as your driver illustrates). Up until now, our model worked well. We could, potentially, walk the list of devices scheduled for a poll much the same as net_rx_action does. However, we don't want to process work pending on other network adapters, only the one associated with the netpoll net device. I can think of at least one way to make this distinction, but it feels too much like a hack. Matt, any ideas on this? -Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html