Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load

2009-12-31 Thread Roland Dreier
  On Nov 23 Roland Dreir sent a patch for interrupt handling, but it
  doesn't apply on -current since the file rt2661.c changed slightly
  a few weeks earlier (1.51, date: 2009/11/01).

two es in Dreier :)

  This patch just changes Roland's patch to update against rt2661.c
  r1.51 from the OpenBSD repository instead of Roland's patch which
  is against his private GIT repo.

Sorry about that... I was testing on a 4.5 box, so even though I had
the patch against -current, I sent the wrong (backported) one.

  I've been running with this for just over a day, including some
  time copying kernels and snaps both ways non-stop (after removing
  the ifconfig down/up from crontab). It has locked up only twice in
  24 hrs, a definite improvement.

Thanks for testing and keeping this patch alive.  I would like to see
this committed since I have multiple reports of this improving
stability for people, and also I think that it is pretty clearly
correct on a theoretical level too.  However I have not seen any
response from damien@ unfortunately.

In my setup (slow VIA mini-itx box used as an AP) I've not seen any
lockups with the patch applied.  Could you give a quick description of
your setup?  Are the lockups you see the same as before -- i.e. the
interface stops with OACTIVE set, and recovers if you do ifconfig
down/up?  (Is that the problem you were having before?)

I sent another patch 
(http://www.mail-archive.com/t...@openbsd.org/msg01261.html)
that helps my setup a little more (avoids the "interface stays up but
no longer sends broadcasts or multicasts" problem I saw -- not sure
why exactly but avoiding sending the adapter garbage descriptors seems
like a good idea in any case).  You could try with that too and see if
it helps at all.

I do still see another problem that I have not figured out yet, namely
the ral interface on the AP stops sending for 20 or 30 seconds and
then recovers by itself.  If I run ping to the AP on a client box and
leave something like tcpdump -i ral0 -n icmp running on the AP, then
I see that requests continue to be received during the interruption,
but no replies are sent.  Also I can see that OACTIVE is not set
during the interruption.  But I don't know why this is happening yet.
Is it possible that this is what you're hitting too?

Thanks,
  Roland



Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load

2009-12-05 Thread Roland Dreier
  Mind sharing your hostname.ral0 and the tools you use to trigger this
  situation? I've tried hping, tcpbench, ping -f, rsync, etc to no avail.
  
  max ~8000 intr/s with hping
  2.5MB/s with scp

hostname.ral0 is:

inet 10.2.0.1 255.255.0.0 NONE \
mode 11g \
mediaopt hostap \
nwid  \
wpa \
wpaprotos wpa2 \
wpapsk 0x \
wpaakms psk \
chan 1
inet6 alias 2001:470:8379:2::1

and this system is basically my home wireless AP -- so it's routing
between wired ethernet hooked up to my cable modem and my laptops
etc.  I see the interface get stuck intermittently under pretty much
any heavy traffic from my laptop -- rsync over ssh to a system on
wired ethernet, uploading big files to the external internet, etc.

I think maybe having a lot of small ack packets to send exposes the
race the best, since typically I see the problem when I am sending a
lot via TCP from the laptop through the slow AP.

If you search the web for soekris and rt2661 then you can find several
other people that seem to be hitting this bug from many months ago,
which makes sense -- a geode is probably a slow enough CPU to make the
races bigger.

  cpu0: AMD Athlon(tm) XP 2500+ (AuthenticAMD 686-class, 512KB L2 cache) 1.84 GHz

Your CPU may be too fast... my system has:

cpu0: VIA Samuel 2 (CentaurHauls 686-class) 602 MHz

If your system can service TX interrupts fast enough that there is
never more than one packet being completed, the standard driver should
work fine.
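
As a rough illustration of why CPU speed matters here, the following is
a toy model in C, not driver code.  It assumes a handler that retires
exactly one TX ring entry per interrupt, and counts how long it takes
for unretired entries to fill the ring.  The function name, the ring
size and the idea of a fixed number of completions per interrupt are
all assumptions made for the sketch.

```c
#include <assert.h>

/* Toy model, not driver code: every name here is hypothetical. */
#define RING_SIZE 32	/* assumed TX ring size, for illustration only */

/*
 * Simulate a handler that retires exactly one TX ring entry per
 * interrupt while the hardware finishes `done_per_intr` packets
 * between interrupts.  Returns the number of interrupts after which
 * the ring is full of unretired entries (the "stuck" state), or -1
 * if the ring never fills within `max_intr` interrupts.
 */
int
intrs_until_stuck(int done_per_intr, int max_intr)
{
	int backlog = 0;	/* finished by hardware, not yet retired */
	int i;

	assert(done_per_intr >= 1);
	for (i = 1; i <= max_intr; i++) {
		backlog += done_per_intr;	/* hardware completes packets */
		backlog -= 1;			/* handler retires only one */
		if (backlog >= RING_SIZE)
			return i;
	}
	return -1;
}
```

In this model intrs_until_stuck(1, n) never reports a stall, while two
completions per interrupt fill the assumed 32-entry ring after 32
interrupts.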

 - R.



How can I get my driver bug fix committed?

2009-12-04 Thread Roland Dreier
About two weeks ago, I sent a fix for the ral(4) driver ([PATCH] Fix
interrupt handling in ral(4) for RT2661 under load,
http://www.mail-archive.com/t...@openbsd.org/msg01155.html).  I have
not gotten any response from any OpenBSD developers, and I have not
seen the patch committed.  I'm happy to answer any questions about the
patch, revise it, or do whatever it takes to move this forward, but it
is very disheartening to have my efforts to contribute a bug fix
simply get ignored.  What am I doing wrong?

Thanks,
  Roland



Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load

2009-11-30 Thread Roland Dreier
Hi Damien / OpenBSD devs,

Did anyone get a chance to look at this diff?  These fixes are the
difference for me between ral being usable as an AP and getting stuck
almost immediately under heavy load.  Is there anything I need to do
to get this committed?

Thanks,
  Roland



Re: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load

2009-11-30 Thread Roland Dreier
  Does it do anything for 2860? I have that as an AP now and every once in
  a while it stops working, I need to restart the interface.

No, the driver code is a completely different C file.  It's possible
there are analogous bugs for 2860 though, since the hardware and
driver are both closely related to 2661.

 - R.



[PATCH] Fix interrupt handling in ral(4) for RT2661 under load

2009-11-22 Thread Roland Dreier
The interrupt handling in ral(4) for RT2661 has a couple of problems,
which cause the interface to get stuck under heavy load with OACTIVE
set (the problems are likely especially severe on slow systems such as
my 600MHz VIA system); bouncing the interface down and back up fixes
things.  As I describe below, I think I've been able to fix it, and
I'd be happy to see the patch below reviewed and applied.

I've seen other reports that look similar to the problems I was
having; e.g. bug kernel/5958 starts out talking about RT2860 (which is
completely different code) but some of the "me too" replies are for
RT2561S, which I hope this patch fixes (I've cc'ed those reporters;
test reports welcome!).  I've not looked at the RT2860 code due to
lack of hardware, but if someone wants to send me a PCI card

The first problem is that multiple TX completions may happen before
the interrupt handler gets to rt2661_tx_intr().  When this happens,
the TX interrupt handler only completes one entry in the TX ring,
which leads to the driver getting behind the hardware.  To fix this, I
extended the qid field in the TX descriptor to contain the index in
the TX ring as well as the queue ID, and then when an interrupt is
missed, free the earlier TX entries as well as the entry that the
interrupt is for.  (I did see this code trigger under load)
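
A minimal sketch of that idea, assuming a made-up bit layout for the
extended qid field (the real layout is whatever the patch defines in
rt2661reg.h).  The names toy_ring, qid_pack and toy_tx_complete are
hypothetical and only model the bookkeeping, not the actual driver:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 4 bits = queue ID, next 8 bits = ring index. */
#define QID_QUEUE_BITS	4
#define QID_QUEUE_MASK	((1u << QID_QUEUE_BITS) - 1)
#define QID_INDEX_MASK	0xffu

struct toy_ring {
	int count;	/* number of slots in the ring */
	int stat;	/* next slot the driver expects to complete */
	int queued;	/* slots handed to hardware, not yet retired */
};

static inline uint32_t
qid_pack(unsigned queue, unsigned index)
{
	return (queue & QID_QUEUE_MASK) |
	    ((index & QID_INDEX_MASK) << QID_QUEUE_BITS);
}

static inline unsigned
qid_queue(uint32_t v)
{
	return v & QID_QUEUE_MASK;
}

static inline unsigned
qid_index(uint32_t v)
{
	return (v >> QID_QUEUE_BITS) & QID_INDEX_MASK;
}

/*
 * Retire every slot from ring->stat up to and including the slot the
 * hardware reported, so a missed interrupt does not leave the driver
 * behind.  Returns how many slots were retired.
 */
int
toy_tx_complete(struct toy_ring *ring, uint32_t reported)
{
	int last = (int)qid_index(reported);
	int freed = 0;

	assert(last < ring->count);
	while (ring->queued > 0) {
		int slot = ring->stat;

		/* a real driver would release the slot's resources here */
		ring->stat = (ring->stat + 1) % ring->count;
		ring->queued--;
		freed++;
		if (slot == last)
			break;
	}
	return freed;
}
```

With stat at 5, four entries queued and the hardware reporting index 7,
this helper retires slots 5, 6 and 7 in one pass instead of only slot 5.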

This exposes the second problem: there is a race that is inherent in
separating TX completion handling between TX DMA interrupts and TX
interrupts -- the driver may handle all the TX DMAs that finished when
it called rt2661_tx_dma_intr(), but by the time it gets to
rt2661_tx_intr(), another TX may have completed and the driver may end
up processing a TX completion for which it hasn't handled the TX DMA
completion.  This ends up leaking mbufs if a new send is enqueued
before the TX DMA interrupt has a chance to catch up.  (This happens
in practice on my system as well)

It is probably possible to fix this and keep the split DMA/TX
handling, but that seems to require unneeded complexity.  Instead, we
can just ignore TX DMA interrupts and handle everything when the TX
actually completes.  This means we don't free the mbuf quite as soon,
but since we can't reuse the slot in the TX ring anyway, I don't see
this as a problem in practice.
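
To show the difference between the two schemes, here is a small model
in C.  It is a sketch under stated assumptions, not the driver: buffers
are plain integers, the ring has four slots, and the "split" path
models only the bad case where the TX DMA handler has not yet run when
the slot is retired.  All names (toy_txq, toy_send and so on) are
hypothetical.

```c
#include <assert.h>

#define SLOTS 4		/* assumed ring size, for illustration only */

struct toy_txq {
	int buf[SLOTS];	/* 0 = empty, otherwise a buffer id */
	int cur;	/* next slot to fill on send */
	int stat;	/* next slot to retire on TX completion */
	int queued;	/* slots in use */
	int freed;	/* buffers released properly */
	int leaked;	/* buffers overwritten before being released */
};

/* Enqueue a buffer; overwriting a live buffer id models a leak. */
void
toy_send(struct toy_txq *q, int buf)
{
	assert(q->queued < SLOTS);
	if (q->buf[q->cur] != 0)
		q->leaked++;
	q->buf[q->cur] = buf;
	q->cur = (q->cur + 1) % SLOTS;
	q->queued++;
}

/*
 * Split handling, racy case: the slot is retired on TX completion but
 * the buffer is left for a DMA-done handler that has not run yet.
 */
void
toy_tx_done_split(struct toy_txq *q)
{
	q->stat = (q->stat + 1) % SLOTS;
	q->queued--;
}

/* Unified handling: release the buffer and retire the slot together. */
void
toy_tx_done_unified(struct toy_txq *q)
{
	if (q->buf[q->stat] != 0) {
		q->buf[q->stat] = 0;
		q->freed++;
	}
	q->stat = (q->stat + 1) % SLOTS;
	q->queued--;
}
```

Filling the ring, retiring all four slots through the split path and
then sending once more overwrites a buffer that was never released;
the same sequence through the unified path releases all four first.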

With this patch applied, the ral interface on my access point is able
to continue operating under load that would cause the interface to get
stuck with the stock driver fairly quickly.
---
 rt2661.c    |  118 --
 rt2661reg.h |    3 +-
 rt2661var.h |    1 -
 3 files changed, 51 insertions(+), 71 deletions(-)

diff --git a/rt2661.c b/rt2661.c
index f838969..9a9cc53 100644
--- a/rt2661.c
+++ b/rt2661.c
@@ -97,9 +97,8 @@ void		rt2661_newassoc(struct ieee80211com *, struct ieee80211_node *,
 int		rt2661_newstate(struct ieee80211com *, enum ieee80211_state,
 		    int);
 uint16_t	rt2661_eeprom_read(struct rt2661_softc *, uint8_t);
+void		rt2661_free_tx_desc(struct rt2661_softc *, struct rt2661_tx_ring *);
 void		rt2661_tx_intr(struct rt2661_softc *);
-void		rt2661_tx_dma_intr(struct rt2661_softc *,
-		    struct rt2661_tx_ring *);
 void		rt2661_rx_intr(struct rt2661_softc *);
 #ifndef IEEE80211_STA_ONLY
 void		rt2661_mcu_beacon_expire(struct rt2661_softc *);
@@ -115,7 +114,7 @@ uint16_t	rt2661_txtime(int, int, uint32_t);
 uint8_t		rt2661_plcp_signal(int);
 void		rt2661_setup_tx_desc(struct rt2661_softc *,
 		    struct rt2661_tx_desc *, uint32_t, uint16_t, int, int,
-		    const bus_dma_segment_t *, int, int);
+		    const bus_dma_segment_t *, int, int, int);
 int		rt2661_tx_mgt(struct rt2661_softc *, struct mbuf *,
 		    struct ieee80211_node *);
 int		rt2661_tx_data(struct rt2661_softc *, struct mbuf *,
@@ -376,7 +375,7 @@ rt2661_alloc_tx_ring(struct rt2661_softc *sc, struct rt2661_tx_ring *ring,
 
 	ring->count = count;
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 
 	error = bus_dmamap_create(sc->sc_dmat, count * RT2661_TX_DESC_SIZE, 1,
 	    count * RT2661_TX_DESC_SIZE, 0, BUS_DMA_NOWAIT, &ring->map);
@@ -470,7 +469,7 @@ rt2661_reset_tx_ring(struct rt2661_softc *sc, struct rt2661_tx_ring *ring)
 	    BUS_DMASYNC_PREWRITE);
 
 	ring->queued = 0;
-	ring->cur = ring->next = ring->stat = 0;
+	ring->cur = ring->stat = 0;
 }
 
 void
@@ -881,6 +880,36 @@ rt2661_eeprom_read(struct rt2661_softc *sc, uint8_t addr)
 }
 
 void
+rt2661_free_tx_desc(struct rt2661_softc *sc, struct rt2661_tx_ring *txq)
+{
+	struct rt2661_tx_desc *desc = &txq->desc[txq->stat];
+	struct rt2661_tx_data *data = &txq->data[txq->stat];
+	struct ieee80211com *ic = &sc->sc_ic;
+
+	bus_dmamap_sync(sc->sc_dmat, data->map, 0,
+   

strange multicast send bug with ral(4) (was: [PATCH] Fix interrupt handling in ral(4) for RT2661 under load)

2009-11-22 Thread Roland Dreier
By the way, I forgot to mention that even with this patch applied, I
do have one odd problem with ral on my system -- after some time
(hours it appears), the ral interface stops being able to send
multicasts/broadcasts.  All other traffic works fine, including
receiving multicasts, but no multicasts go out.  I only noticed this
because my access point is running rtadvd for IPv6, and the clients
stop receiving route advertisements.  Bouncing the interface with
ifconfig ral0 down; ifconfig ral0 up fixes this for a few more hours.

I've not gotten very far debugging this yet, so I don't even know yet
if the multicasts are making it to the driver or are getting lost
higher in the stack.  But maybe someone has seen this and has some
idea of what's going on?

(FWIW, this system is still running OpenBSD 4.5 with my patch applied,
so possibly this is even something that was already fixed)

Thanks,
  Roland



RT2661/ral(4) interface hangs with heavy traffic (OpenBSD 4.5)

2009-07-08 Thread Roland Dreier
Hi, I have a mini-itx system being used as a wireless router with a
Ralink RT2661 wifi card (driven by ral(4)) in host AP mode.  I've
found that if I copy a lot of data from my laptop connected via wifi
to another system connected via wired (100 Mbit) ethernet, say doing a
big rsync, the ral interface sometimes hangs and the wifi-connected
laptop is no longer able to send or receive anything.

The router system is still up, and in fact just logging in on the
console and doing ifconfig ral0 down; ifconfig ral0 up is enough to
start things working again.  I don't see any messages in the kernel
log or anywhere else when ral0 gets stuck.

Thoughts on how to fix this or debug this further would be
appreciated.

Thanks,
  Roland

Full pcidump -v and dmesg output is below; I see problems when routing
a lot of traffic from the ral0/RT2661 to the vr1/RhineIII interface.

Domain /dev/pci0:
 0:0:0: VIA VT8623 PCI
0x0000: Vendor ID: 1106 Product ID: 3123
0x0004: Command: 0006 Status ID: 2230
0x0008: Class: 06 Subclass: 00 Interface: 00 Revision: 00
0x000c: BIST: 00 Header Type: 00 Latency Timer: 08 Cache Line Size: 00
0x0010: BAR mem prefetchable 32bit addr: 0xd000
0x0014: BAR empty ()
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR empty ()
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1106 Product ID: aa01
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 00 Line: 00 Min Gnt: 00 Max Lat: 00
0x00a0: Capability 0x02: AGP
0x00c0: Capability 0x01: Power Management
 0:1:0: VIA VT8633 AGP
0x0000: Vendor ID: 1106 Product ID: b091
0x0004: Command: 0107 Status ID: a230
0x0008: Class: 06 Subclass: 04 Interface: 00 Revision: 00
0x000c: BIST: 00 Header Type: 01 Latency Timer: 00 Cache Line Size: 00
0x0010: 
0x0014: 
0x0018: Primary Bus: 0 Secondary Bus: 1 Subordinate Bus: 1 
Secondary Latency Timer: 00
0x001c: I/O Base: f0 I/O Limit: 00 Secondary Status: a220
0x0020: Memory Base: dc00 Memory Limit: ddf0
0x0024: Prefetch Memory Base: d800 Prefetch Memory Limit: dbf0
0x0028: Prefetch Memory Base Upper 32 Bits: 
0x002c: Prefetch Memory Limit Upper 32 Bits: 
0x0030: I/O Base Upper 16 Bits:  I/O Limit Upper 16 Bits: 
0x0038: Expansion ROM Base Address: 
0x003c: Interrupt Pin: 00 Line: 00 Bridge Control: 000c
0x0080: Capability 0x01: Power Management
 0:15:0: VIA VT6105 RhineIII
0x0000: Vendor ID: 1106 Product ID: 3106
0x0004: Command: 0007 Status ID: 0210
0x0008: Class: 02 Subclass: 00 Interface: 00 Revision: 8b
0x000c: BIST: 00 Header Type: 00 Latency Timer: 20 Cache Line Size: 08
0x0010: BAR io addr: 0xd000
0x0014: BAR mem 32bit addr: 0xde00a000
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR empty ()
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1106 Product ID: 0106
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 01 Line: 0b Min Gnt: 03 Max Lat: 08
0x0044: Capability 0x01: Power Management
 0:16:0: VIA VT83C572 USB
0x0000: Vendor ID: 1106 Product ID: 3038
0x0004: Command: 0007 Status ID: 0210
0x0008: Class: 0c Subclass: 03 Interface: 00 Revision: 80
0x000c: BIST: 00 Header Type: 80 Latency Timer: 20 Cache Line Size: 08
0x0010: BAR empty ()
0x0014: BAR empty ()
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR io addr: 0xd400
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1106 Product ID: aa01
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 01 Line: 0c Min Gnt: 00 Max Lat: 00
0x0080: Capability 0x01: Power Management
 0:16:1: VIA VT83C572 USB
0x0000: Vendor ID: 1106 Product ID: 3038
0x0004: Command: 0007 Status ID: 0210
0x0008: Class: 0c Subclass: 03 Interface: 00 Revision: 80
0x000c: BIST: 00 Header Type: 80 Latency Timer: 20 Cache Line Size: 08
0x0010: BAR empty ()
0x0014: BAR empty ()
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR io addr: 0xd800
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1106 Product ID: aa01
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 02