Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
On Nov 23 Roland Dreir sent a patch for interrupt handling, but it doesn't apply on -current since the file rt2661.c changed slightly a few weeks earlier (1.51, date: 2009/11/01).

two es in Dreier :)

This patch just changes Roland's patch to update against rt2661.c r1.51 from the OpenBSD repository instead of Roland's patch which is against his private GIT repo.

Sorry about that... I was testing on a 4.5 box, so even though I had the patch against -current, I sent the wrong (backported) one.

I've been running with this for just over a day, including some time copying kernels and snaps both ways non-stop (after removing the ifconfig down/up from crontab). It has locked up only twice in 24 hrs, a definite improvement.

Thanks for testing and keeping this patch alive. I would like to see this committed since I have multiple reports of this improving stability for people, and also I think that it is pretty clearly correct on a theoretical level too. However I have not seen any response from damien@ unfortunately.

In my setup (slow VIA mini-itx box used as an AP) I've not seen any lockups with the patch applied. Could you give a quick description of your setup? Are the lockups you see the same as before -- ie the interface stops with OACTIVE set, and recovers if you do ifconfig up/down? (Is that the problem you were having before?)

I sent another patch (http://www.mail-archive.com/t...@openbsd.org/msg01261.html) that helps my setup a little more (avoids the "interface stays up but no longer sends broadcasts or multicasts" problem I saw -- not sure why exactly but avoiding sending the adapter garbage descriptors seems like a good idea in any case). You could try with that too and see if it helps at all.

I do still see another problem that I have not figured out yet, namely the ral interface on the AP stops sending for 20 or 30 seconds and then recovers by itself. If I run ping to the AP on a client box and leave something like tcpdump -i ral0 -n icmp running on the AP, then I see that requests continue to be received during the interruption, but no replies are sent. Also I can see that OACTIVE is not set during the interruption. But I don't know why this is happening yet. Is it possible that this is what you're hitting too?

Thanks,
Roland
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
Mind sharing your hostname.ral0 and the tools you use to trigger this situation? I've tried hping, tcpbench, ping -f, rsync, etc to no avail.

max ~8000 intr/s with hping
2.5MB/s with scp

hostname.ral0 is:

inet 10.2.0.1 255.255.0.0 NONE \
	mode 11g \
	mediaopt hostap \
	nwid \
	wpa \
	wpaprotos wpa2 \
	wpapsk 0x \
	wpaakms psk \
	chan 1
inet6 alias 2001:470:8379:2::1

and this system is basically my home wireless AP -- so it's routing between wired ethernet hooked up to my cable modem and my laptops etc.

I see the interface get stuck intermittently under pretty much any heavy traffic from my laptop -- rsync over ssh to a system on wired ethernet, uploading big files to the external internet, etc. I think maybe having a lot of small ack packets to send exposes the race the best, since typically I see the problem when I am sending a lot via TCP from the laptop through the slow AP.

If you search the web for soekris and rt2661 then you can find several other people that seem to be hitting this bug from many months ago, which makes sense -- a geode is probably a slow enough CPU to make the races bigger.

cpu0: AMD Athlon(tm) XP 2500+ (AuthenticAMD 686-class, 512KB L2 cache) 1.84 GHz

Your CPU may be too fast... my system has:

cpu0: VIA Samuel 2 (CentaurHauls 686-class) 602 MHz

If your system can service TX interrupts fast enough that there is never more than one packet being completed, the standard driver should work fine.

- R.
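
The point about fast versus slow CPUs can be illustrated with a toy model. This is only a sketch, not driver code: RING_SIZE, struct ring, enqueue(), tx_intr_one() and simulate() are all hypothetical names made up for the illustration, and the model assumes the stock handler retires exactly one ring slot per TX interrupt, as described in the patch message.

```c
#include <assert.h>

#define RING_SIZE 8	/* hypothetical; not the driver's real ring size */

struct ring {
	int cur;	/* next slot the driver will fill */
	int stat;	/* next slot the driver expects to complete */
	int queued;	/* slots the driver believes are still in flight */
};

/* Queue one frame; returns 0 when the ring looks full to the driver. */
static int
enqueue(struct ring *r)
{
	if (r->queued == RING_SIZE)
		return 0;
	r->cur = (r->cur + 1) % RING_SIZE;
	r->queued++;
	return 1;
}

/* Stock-style handler in this model: retire one slot per interrupt. */
static void
tx_intr_one(struct ring *r)
{
	if (r->queued == 0)
		return;
	r->stat = (r->stat + 1) % RING_SIZE;
	r->queued--;
}

/*
 * Try to send `frames` frames while the hardware completes `burst`
 * frames before each interrupt gets serviced.  Returns how many frames
 * were queued before the ring looked full.
 */
static int
simulate(int frames, int burst)
{
	struct ring r = { 0, 0, 0 };
	int sent = 0, pending = 0;

	assert(burst > 0);
	while (sent < frames) {
		if (!enqueue(&r))
			break;		/* driver thinks the ring is full */
		sent++;
		if (++pending == burst) {
			tx_intr_one(&r);	/* one interrupt for the whole burst */
			pending = 0;
		}
	}
	return sent;
}
```

In this model simulate(100, 1) queues all 100 frames, while simulate(100, 2) stalls after 15: every coalesced interrupt leaves one slot unaccounted for until the ring looks full, which is roughly the stuck-with-OACTIVE state a slow CPU would run into.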
How can I get my driver bug fix committed?
About two weeks ago, I sent a fix for the ral(4) driver ([PATCH] Fix interrupt handling in ral(4) for RT2661 under load, http://www.mail-archive.com/t...@openbsd.org/msg01155.html). I have not gotten any response from any OpenBSD developers, and I have not seen the patch committed. I'm happy to answer any questions about the patch, revise it, or do whatever it takes to move this forward, but it is very disheartening to have my efforts to contribute a bug fix simply get ignored. What am I doing wrong? Thanks, Roland
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
Hi Damien / OpenBSD devs, Did anyone get a chance to look at this diff? These fixes are the difference for me between ral being usable as an AP and getting stuck almost immediately under heavy load. Is there anything I need to do to get this committed? Thanks, Roland
Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load
Does it do anything for 2860? I have that as an AP now and every once in a while it stops working and I need to restart the interface.

No, the driver code is a completely different C file. It's possible there are analogous bugs for 2860 though, since the hardware and driver are both closely related to 2661.

- R.
[PATCH] Fix interrupt handling in ral(4) for RT2661 under load
The interrupt handling in ral(4) for RT2661 has a couple of problems, which cause the interface to get stuck under heavy load with OACTIVE set (the problems are likely especially severe on slow systems such as my 600MHz VIA system); bouncing the interface down and back up fixes things. As I describe below, I think I've been able to fix it, and I'd be happy to see the patch below reviewed and applied.

I've seen other reports that look similar to the problems I was having; eg bug kernel/5958 starts out talking about RT2860 (which is completely different code) but some of the "me too" replies are for RT2561S, which I hope this patch fixes (I've cc'ed those reporters; test reports welcome!). I've not looked at the RT2860 code due to lack of hardware, but if someone wants to send me a PCI card...

The first problem is that multiple TX completions may happen before the interrupt handler gets to rt2661_tx_intr(). When this happens, the TX interrupt handler only completes one entry in the TX ring, which leads to the driver getting behind the hardware. To fix this, I extended the qid field in the TX descriptor to contain the index in the TX ring as well as the queue ID, and then when an interrupt is missed, free the earlier TX entries as well as the entry that the interrupt is for. (I did see this code trigger under load.)

This exposes the second problem: there is a race that is inherent in separating TX completion handling between TX DMA interrupts and TX interrupts -- the driver may handle all the TX DMAs that finished when it called rt2661_tx_dma_intr(), but by the time it gets to rt2661_tx_intr(), another TX may have completed and the driver may end up processing a TX completion for which it hasn't handled the TX DMA completion. This ends up leaking mbufs if a new send is enqueued before the TX DMA interrupt has a chance to catch up.
(This happens in practice on my system as well.)

It is probably possible to fix this and keep the split DMA/TX handling, but that seems to require unneeded complexity. Instead, we can just ignore TX DMA interrupts and handle everything when the TX actually completes. This means we don't free the mbuf quite as soon, but since we can't reuse the slot in the TX ring anyway, I don't see this as a problem in practice.

With this patch applied, the ral interface on my access point is able to continue operating under load that would cause the interface to get stuck with the stock driver fairly quickly.

---
 rt2661.c    |  118 --
 rt2661reg.h |    3 +-
 rt2661var.h |    1 -
 3 files changed, 51 insertions(+), 71 deletions(-)

diff --git a/rt2661.c b/rt2661.c
index f838969..9a9cc53 100644
--- a/rt2661.c
+++ b/rt2661.c
@@ -97,9 +97,8 @@ void		rt2661_newassoc(struct ieee80211com *, struct ieee80211_node *,
 int		rt2661_newstate(struct ieee80211com *, enum ieee80211_state,
 		    int);
 uint16_t	rt2661_eeprom_read(struct rt2661_softc *, uint8_t);
+void		rt2661_free_tx_desc(struct rt2661_softc *, struct rt2661_tx_ring *);
 void		rt2661_tx_intr(struct rt2661_softc *);
-void		rt2661_tx_dma_intr(struct rt2661_softc *,
-		    struct rt2661_tx_ring *);
 void		rt2661_rx_intr(struct rt2661_softc *);
 #ifndef IEEE80211_STA_ONLY
 void		rt2661_mcu_beacon_expire(struct rt2661_softc *);
@@ -115,7 +114,7 @@ uint16_t	rt2661_txtime(int, int, uint32_t);
 uint8_t		rt2661_plcp_signal(int);
 void		rt2661_setup_tx_desc(struct rt2661_softc *,
 		    struct rt2661_tx_desc *, uint32_t, uint16_t, int, int,
-		    const bus_dma_segment_t *, int, int);
+		    const bus_dma_segment_t *, int, int, int);
 int		rt2661_tx_mgt(struct rt2661_softc *, struct mbuf *,
 		    struct ieee80211_node *);
 int		rt2661_tx_data(struct rt2661_softc *, struct mbuf *,
@@ -376,7 +375,7 @@ rt2661_alloc_tx_ring(struct rt2661_softc *sc, struct rt2661_tx_ring *ring,
 
 	ring->count = count;
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 
 	error = bus_dmamap_create(sc->sc_dmat, count * RT2661_TX_DESC_SIZE, 1,
 	    count * RT2661_TX_DESC_SIZE, 0, BUS_DMA_NOWAIT, &ring->map);
@@ -470,7 +469,7 @@ rt2661_reset_tx_ring(struct rt2661_softc *sc, struct rt2661_tx_ring *ring)
 	    BUS_DMASYNC_PREWRITE);
 
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 }
 
 void
@@ -881,6 +880,36 @@ rt2661_eeprom_read(struct rt2661_softc *sc, uint8_t addr)
 }
 
 void
+rt2661_free_tx_desc(struct rt2661_softc *sc, struct rt2661_tx_ring *txq)
+{
+	struct rt2661_tx_desc *desc = &txq->desc[txq->stat];
+	struct rt2661_tx_data *data = &txq->data[txq->stat];
+	struct ieee80211com *ic = &sc->sc_ic;
+
+	bus_dmamap_sync(sc->sc_dmat, data->map, 0,
+
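
Since the diff is cut off here, the catch-up idea from the patch description (carry the ring index in the qid field, then free every entry up to and including the one the interrupt reports) can be sketched in isolation. This is a minimal standalone model under stated assumptions, not the patch itself: the bit layout (QID_QUEUE_BITS), RING_SIZE, struct tx_ring and the helper names qid_pack(), qid_queue(), qid_index(), free_slot() and tx_complete() are all hypothetical; only the overall approach comes from the message above.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE	32	/* hypothetical ring size */
#define QID_QUEUE_BITS	4	/* hypothetical split of the qid field */
#define QID_QUEUE_MASK	((1u << QID_QUEUE_BITS) - 1)

/* Pack a queue ID and a ring index into one qid value. */
static uint32_t
qid_pack(unsigned queue, unsigned index)
{
	return (index << QID_QUEUE_BITS) | (queue & QID_QUEUE_MASK);
}

static unsigned
qid_queue(uint32_t qid)
{
	return qid & QID_QUEUE_MASK;
}

static unsigned
qid_index(uint32_t qid)
{
	return qid >> QID_QUEUE_BITS;
}

struct tx_ring {
	unsigned stat;		/* oldest slot not yet retired */
	unsigned queued;	/* slots currently in flight */
};

/* Retire the oldest slot; a real driver would also unload the DMA map
 * and free the mbuf here. */
static void
free_slot(struct tx_ring *r)
{
	r->stat = (r->stat + 1) % RING_SIZE;
	r->queued--;
}

/*
 * Handle a completion for the slot named in `qid`, first retiring any
 * older slots whose interrupts were missed.  Returns slots retired.
 */
static unsigned
tx_complete(struct tx_ring *r, uint32_t qid)
{
	unsigned done = qid_index(qid);
	unsigned ahead, i;

	assert(done < RING_SIZE);
	/* distance from the oldest outstanding slot to the reported one */
	ahead = (done + RING_SIZE - r->stat) % RING_SIZE;
	if (ahead >= r->queued)
		return 0;	/* index outside the in-flight window: ignore */
	for (i = 0; i <= ahead; i++)
		free_slot(r);
	return ahead + 1;
}
```

With stat = 30 and four slots in flight (30, 31, 0, 1), a completion reporting index 0 retires three slots in one call and leaves stat at 1, so the driver's view catches up with the hardware even when two interrupts were coalesced. The window check is an extra guard added for this sketch; whether the actual patch does something similar is not visible in the truncated diff.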
strange multicast send bug with ral(4) (was: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load)
By the way, I forgot to mention that even with this patch applied, I do have one odd problem with ral on my system -- after some time (hours it appears), the ral interface stops being able to send multicasts/broadcasts. All other traffic works fine, including receiving multicasts, but no multicasts go out. I only noticed this because my access point is running rtadvd for IPv6, and the clients stop receiving route advertisements. Bouncing the interface with ifconfig ral0 down; ifconfig ral0 up fixes this for a few more hours. I've not gotten very far debugging this yet, so I don't even know yet if the multicasts are making it to the driver or are getting lost higher in the stack. But maybe someone has seen this and has some idea of what's going on? (FWIW, this system is still running OpenBSD 4.5 with my patch applied, so possibly this is even something that was already fixed) Thanks, Roland
RT2661/ral(4) interface hangs with heavy traffic (OpenBSD 4.5)
Hi,

I have a mini-itx system being used as a wireless router with a Ralink RT2661 wifi card (driven by ral(4)) in host AP mode. I've found that if I copy a lot of data from my laptop connected via wifi to another system connected via wired (100 mbit) ethernet, say doing a big rsync, the ral interface sometimes hangs and the wifi-connected laptop is not able to send or receive anything any more. The router system is still up, and in fact just logging in on the console and doing ifconfig ral0 down; ifconfig ral0 up is enough to start things working again.

I don't see any messages in the kernel log or anywhere else when ral0 gets stuck. Thoughts on how to fix this or debug this further would be appreciated.

Thanks,
Roland

Full pcidump -v and dmesg output is below; I see problems when routing a lot of traffic from the ral0/RT2661 to the vr1/RhineIII interface.

Domain /dev/pci0:
 0:0:0: VIA VT8623 PCI
	0x: Vendor ID: 1106 Product ID: 3123
	0x0004: Command: 0006 Status ID: 2230
	0x0008: Class: 06 Subclass: 00 Interface: 00 Revision: 00
	0x000c: BIST: 00 Header Type: 00 Latency Timer: 08 Cache Line Size: 00
	0x0010: BAR mem prefetchable 32bit addr: 0xd000
	0x0014: BAR empty ()
	0x0018: BAR empty ()
	0x001c: BAR empty ()
	0x0020: BAR empty ()
	0x0024: BAR empty ()
	0x0028: Cardbus CIS:
	0x002c: Subsystem Vendor ID: 1106 Product ID: aa01
	0x0030: Expansion ROM Base Address:
	0x0038:
	0x003c: Interrupt Pin: 00 Line: 00 Min Gnt: 00 Max Lat: 00
	0x00a0: Capability 0x02: AGP
	0x00c0: Capability 0x01: Power Management
 0:1:0: VIA VT8633 AGP
	0x: Vendor ID: 1106 Product ID: b091
	0x0004: Command: 0107 Status ID: a230
	0x0008: Class: 06 Subclass: 04 Interface: 00 Revision: 00
	0x000c: BIST: 00 Header Type: 01 Latency Timer: 00 Cache Line Size: 00
	0x0010:
	0x0014:
	0x0018: Primary Bus: 0 Secondary Bus: 1 Subordinate Bus: 1 Secondary Latency Timer: 00
	0x001c: I/O Base: f0 I/O Limit: 00 Secondary Status: a220
	0x0020: Memory Base: dc00 Memory Limit: ddf0
	0x0024: Prefetch Memory Base: d800 Prefetch Memory Limit: dbf0
	0x0028: Prefetch Memory Base Upper 32 Bits:
	0x002c: Prefetch Memory Limit Upper 32 Bits:
	0x0030: I/O Base Upper 16 Bits: I/O Limit Upper 16 Bits:
	0x0038: Expansion ROM Base Address:
	0x003c: Interrupt Pin: 00 Line: 00 Bridge Control: 000c
	0x0080: Capability 0x01: Power Management
 0:15:0: VIA VT6105 RhineIII
	0x: Vendor ID: 1106 Product ID: 3106
	0x0004: Command: 0007 Status ID: 0210
	0x0008: Class: 02 Subclass: 00 Interface: 00 Revision: 8b
	0x000c: BIST: 00 Header Type: 00 Latency Timer: 20 Cache Line Size: 08
	0x0010: BAR io addr: 0xd000
	0x0014: BAR mem 32bit addr: 0xde00a000
	0x0018: BAR empty ()
	0x001c: BAR empty ()
	0x0020: BAR empty ()
	0x0024: BAR empty ()
	0x0028: Cardbus CIS:
	0x002c: Subsystem Vendor ID: 1106 Product ID: 0106
	0x0030: Expansion ROM Base Address:
	0x0038:
	0x003c: Interrupt Pin: 01 Line: 0b Min Gnt: 03 Max Lat: 08
	0x0044: Capability 0x01: Power Management
 0:16:0: VIA VT83C572 USB
	0x: Vendor ID: 1106 Product ID: 3038
	0x0004: Command: 0007 Status ID: 0210
	0x0008: Class: 0c Subclass: 03 Interface: 00 Revision: 80
	0x000c: BIST: 00 Header Type: 80 Latency Timer: 20 Cache Line Size: 08
	0x0010: BAR empty ()
	0x0014: BAR empty ()
	0x0018: BAR empty ()
	0x001c: BAR empty ()
	0x0020: BAR io addr: 0xd400
	0x0024: BAR empty ()
	0x0028: Cardbus CIS:
	0x002c: Subsystem Vendor ID: 1106 Product ID: aa01
	0x0030: Expansion ROM Base Address:
	0x0038:
	0x003c: Interrupt Pin: 01 Line: 0c Min Gnt: 00 Max Lat: 00
	0x0080: Capability 0x01: Power Management
 0:16:1: VIA VT83C572 USB
	0x: Vendor ID: 1106 Product ID: 3038
	0x0004: Command: 0007 Status ID: 0210
	0x0008: Class: 0c Subclass: 03 Interface: 00 Revision: 80
	0x000c: BIST: 00 Header Type: 80 Latency Timer: 20 Cache Line Size: 08
	0x0010: BAR empty ()
	0x0014: BAR empty ()
	0x0018: BAR empty ()
	0x001c: BAR empty ()
	0x0020: BAR io addr: 0xd800
	0x0024: BAR empty ()
	0x0028: Cardbus CIS:
	0x002c: Subsystem Vendor ID: 1106 Product ID: aa01
	0x0030: Expansion ROM Base Address:
	0x0038:
	0x003c: Interrupt Pin: 02