Re: iwm/iwx: try to make roaming more reliable

2021-11-29 Thread Stuart Henderson
On 2021/11/27 12:44, Stefan Sperling wrote:
> The current implementation suffers from race conditions which can
> leave the interface in a state where it gets "stuck". I have seen
> this happen on iwm(4) 9560 in particular, while testing the driver
> with new firmware images recently published by Intel. This may well
> be related to other hangs people have reported in multi-AP environments
> on both iwm(4) and iwx(4).

Working here with 9560 iwm, I've forced roaming via fixing to a distant
AP and then removing the fixed chan, done it several times with no
problems, the only thing visible in ping with default timers is that the
RTTs go down; not a single drop.

iwm0: firmware has detected regulatory domain 'GB' (0x4742)
iwm0: sending deauth to 22:xx:xx:xx:xx:28 on channel 120 mode 11n
iwm0: roaming from 22:xx:xx:xx:xx:28 chan 120 to 76:xx:xx:xx:xx:e7 chan 136
iwm0: RUN -> AUTH
iwm0: sending auth to 76:xx:xx:xx:xx:e7 on channel 136 mode 11a
iwm0: AUTH -> ASSOC
iwm0: sending assoc_req to 76:xx:xx:xx:xx:e7 on channel 136 mode 11a
iwm0: ASSOC -> RUN
iwm0: associated with 76:xx:xx:xx:xx:e7 ssid "XX" channel 136 start MCS 0 
short preamble long slot time HT enabled
iwm0: missed beacon threshold set to 30 beacons, beacon interval is 100 TU
iwm0: received msg 1/4 of the 4-way handshake from 76:xx:xx:xx:xx:e7
iwm0: sending msg 2/4 of the 4-way handshake to 76:xx:xx:xx:xx:e7
iwm0: received msg 3/4 of the 4-way handshake from 76:xx:xx:xx:xx:e7
iwm0: sending msg 4/4 of the 4-way handshake to 76:xx:xx:xx:xx:e7
iwm0: sending addba_req to 76:xx:xx:xx:xx:e7 on channel 136 mode 11n
iwm0: sending addba_resp to 76:xx:xx:xx:xx:e7 on channel 136 mode 11n

OpenBSD 7.0-current (GENERIC.MP) #15: Mon Nov 29 20:22:21 GMT 2021
st...@bamboo.spacehopper.org:/sys/arch/amd64/compile/GENERIC.MP
real mem = 16926281728 (16142MB)
avail mem = 16267046912 (15513MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x77d49000 (64 entries)
bios0: vendor LENOVO version "N2HET63W (1.46 )" date 06/01/2021
bios0: LENOVO 20QF00B2UK
acpi0 at bios0: ACPI 6.1
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT SSDT SSDT SSDT SSDT TPM2 UEFI SSDT HPET APIC MCFG 
ECDT SSDT SSDT SSDT BOOT SLIC SSDT LPIT WSMT SSDT DBGP DBG2 MSDM BATB NHLT FPDT 
UEFI
acpi0: wakeup devices GLAN(S4) XHC_(S3) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) 
RP02(S4) PXSX(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) RP06(S4) 
PXSX(S4) RP07(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 2399 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1795.82 MHz, 06-8e-0c
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 24MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1795.82 MHz, 06-8e-0c
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1795.82 MHz, 06-8e-0c
cpu2: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 6 (application processor)
cpu3: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1715.39 MHz, 06-8e-0c
cpu3: 

Re: iwm/iwx: try to make roaming more reliable

2021-11-27 Thread Stefan Sperling
On Sat, Nov 27, 2021 at 09:57:45AM -0600, Aaron Poffenberger wrote:
> I see two differences. Before the patch, before "deauth" I see "sending
> delba" but not after patching, and before "firmware has detected
> regulatory domain 'US'", but not after.

I decided to try not sending a DELBA because it is immediately followed
by a DEAUTH anyway. The AP will clear block-ack session state in response
to DELBA, and delete *all* state in response to DEAUTH. So omitting DELBA
should be fine, and eliminates a frame we need to manage to send out
successfully in order to roam away. There is an #if 0 in the patch
which you can switch to an #if 1 in order to send the DELBA frame if
you would like to try that.

Not getting a regulatory notification from the firmware does not matter.
We do not do anything with such information yet beyond printing it to
dmesg in debug mode.



Re: iwm/iwx: try to make roaming more reliable

2021-11-27 Thread Aaron Poffenberger
On 2021-11-27 12:44 +0100, Stefan Sperling wrote:
> This patch reworks the steps involved in roaming to a new access
> point on iwm(4) and iwx(4) interfaces.
> 
> The current implementation suffers from race conditions which can
> leave the interface in a state where it gets "stuck". I have seen
> this happen on iwm(4) 9560 in particular, while testing the driver
> with new firmware images recently published by Intel. This may well
> be related to other hangs people have reported in multi-AP environments
> on both iwm(4) and iwx(4).

During testing I didn't see the race condition without the patch, but
in the past there have been occassions where I had to run `...
netstart.sh iwm0`. So far, I see no regressions with it.

The card is an 8265 in a Lenovo T450s.

iwm0 at pci2 dev 0 function 0 "Intel Dual Band Wireless-AC 8265" rev 0x78, msi

The wireless access points are both TP-Link EAP 245s, one a hardware
V1, and the other V3.


Before Patch:
iwm0: sending delba to 50:c7:bf:94:08:44 on channel 11 mode 11n
iwm0: sending deauth to 50:c7:bf:94:08:44 on channel 11 mode 11n
iwm0: firmware has detected regulatory domain 'US' (0x5553)
iwm0: roaming from 50:c7:bf:94:08:44 chan 11 to b0:95:75:6d:f0:97 chan 44
iwm0: RUN -> AUTH
iwm0: sending auth to b0:95:75:6d:f0:97 on channel 44 mode 11a
iwm0: AUTH -> ASSOC
iwm0: sending assoc_req to b0:95:75:6d:f0:97 on channel 44 mode 11a
iwm0: ASSOC -> RUN


After Patch:
iwm0: sending deauth to 50:c7:bf:94:08:44 on channel 11 mode 11n
iwm0: roaming from 50:c7:bf:94:08:44 chan 11 to b0:95:75:6d:f0:97 chan 44
iwm0: RUN -> AUTH
iwm0: sending auth to b0:95:75:6d:f0:97 on channel 44 mode 11a
iwm0: AUTH -> ASSOC
iwm0: sending assoc_req to b0:95:75:6d:f0:97 on channel 44 mode 11a
iwm0: ASSOC -> RUN


I see two differences. Before the patch, before "deauth" I see "sending
delba" but not after patching, and before "firmware has detected
regulatory domain 'US'", but not after.

Perhaps they're unimportant, or don't show up in all roaming
conditions.

Cheers,

--Aaron



iwm/iwx: try to make roaming more reliable

2021-11-27 Thread Stefan Sperling
This patch reworks the steps involved in roaming to a new access
point on iwm(4) and iwx(4) interfaces.

The current implementation suffers from race conditions which can
leave the interface in a state where it gets "stuck". I have seen
this happen on iwm(4) 9560 in particular, while testing the driver
with new firmware images recently published by Intel. This may well
be related to other hangs people have reported in multi-AP environments
on both iwm(4) and iwx(4).

With this patch in place, net80211 can now pass control to drivers
which provide an optional bgscan_done() handler. In this handler the
driver will do everything necessary to prepare the device for switching
APs, and then call back into net80211 to trigger the AP switch.

In the previous implementation, net80211 asked the driver to prepare
the device and immediately proceeded with switching APs. The driver
may need to wait for firmware commands to complete, which is done in a
task context. So we have a situation where the driver-side task and
net80211's switching of APs (which involves sending frames) kind of
happen in parallel, and things may break.

To test roaming, you need to do the following:

All APs involved need to use the same SSID for this to work as intended.
If you do not have such a setup, you cannot effectively test this patch
(unless you just want to run the patch anyway and look for regressions).

Use wifi with 'ifconfig iwm0/iwx0 debug' enabled to make roaming attempts
appear in /var/log/messages, and watch this file with a command such as:

  tail -f /var/log/messages

Move towards another access point and trigger a background scan by
running this command as root:

  ifconfig iwm0 scan

It may take a few scan attempts until roaming triggers.
Successful roaming displays the following in /var/log/messages:

  iwm0: roaming from 00:2b:a2:95:e3:e4 chan 1 to 00:2b:a2:95:e3:f4 chan 36
  iwm0: RUN -> AUTH
  iwm0: sending auth to 00:1a:dd:da:e3:f4 on channel 36 mode 11a
  iwm0: AUTH -> ASSOC
  iwm0: sending assoc_req to 00:2b:a2:95:e3:f4 on channel 36 mode 11a
  iwm0: ASSOC -> RUN

If it doesn't do this but enters INIT or SCAN or whatever, roaming failed.
This can happen for various reasons. One of them is that our attempt to
AUTH to the new AP times out, and I don't know how this could be fixed.
In any case, the driver should recover from any roaming failure by going
though the regular INIT->SCAN->AUTH->ASSOC->RUN sequence with one of
the available APs, and interface link should come back up.

If you want to force roaming from one particular AP to another, this can
be done by forcing the channel number corresponding to the first AP,
associating to this AP, and then clearing the forced channel number:

  # ifconfig iwm0 chan 1
  wait for association to succeed, then clear the forced channel:
  # ifconfig iwm0 -chan
  now trigger a background scan as usual:
  # ifconfig iwm0 scan
 
diff e5ddbb84043d48bc602408a6bf0e30fb062e3280 
af878e690f6195efea7bb4639bf75fe439f3cddc
blob - f1908d2923d90c7be84c010fc68342afda860d0c
blob + dfccd0bd1327325b7c834d1e60f1b37d7a19dc18
--- sys/dev/pci/if_iwm.c
+++ sys/dev/pci/if_iwm.c
@@ -477,6 +477,9 @@ voidiwm_add_task(struct iwm_softc *, struct taskq 
*, 
 void   iwm_del_task(struct iwm_softc *, struct taskq *, struct task *);
 intiwm_scan(struct iwm_softc *);
 intiwm_bgscan(struct ieee80211com *);
+void   iwm_bgscan_done(struct ieee80211com *,
+   struct ieee80211_node_switch_bss_arg *, size_t);
+void   iwm_bgscan_done_task(void *);
 intiwm_umac_scan_abort(struct iwm_softc *);
 intiwm_lmac_scan_abort(struct iwm_softc *);
 intiwm_scan_abort(struct iwm_softc *);
@@ -8287,6 +8290,81 @@ iwm_bgscan(struct ieee80211com *ic) 
return 0;
 }
 
+void
+iwm_bgscan_done(struct ieee80211com *ic,
+struct ieee80211_node_switch_bss_arg *arg, size_t arg_size)
+{
+   struct iwm_softc *sc = ic->ic_softc;
+
+   free(sc->bgscan_unref_arg, M_DEVBUF, sc->bgscan_unref_arg_size);
+   sc->bgscan_unref_arg = arg;
+   sc->bgscan_unref_arg_size = arg_size;
+   iwm_add_task(sc, sc->sc_nswq, >bgscan_done_task);
+}
+
+void
+iwm_bgscan_done_task(void *arg)
+{
+   struct iwm_softc *sc = arg;
+   struct ieee80211com *ic = >sc_ic;
+   struct iwm_node *in = (void *)ic->ic_bss;
+   struct ieee80211_node *ni = >in_ni;
+   int tid, err = 0, s = splnet();
+
+   if ((sc->sc_flags & IWM_FLAG_SHUTDOWN) ||
+   (ic->ic_flags & IEEE80211_F_BGSCAN) == 0 ||
+   ic->ic_state != IEEE80211_S_RUN) {
+   err = ENXIO;
+   goto done;
+   }
+
+   for (tid = 0; tid < IWM_MAX_TID_COUNT; tid++) {
+   int qid = IWM_FIRST_AGG_TX_QUEUE + tid;
+
+   if ((sc->tx_ba_queue_mask & (1 << qid)) == 0)
+   continue;
+
+   err = iwm_sta_tx_agg(sc, ni, tid, 0, 0, 0);
+   if (err)
+   goto done;
+   err = iwm_disable_txq(sc, IWM_STATION_ID,