Re: iwm/iwx: try to make roaming more reliable
On 2021/11/27 12:44, Stefan Sperling wrote: > The current implementation suffers from race conditions which can > leave the interface in a state where it gets "stuck". I have seen > this happen on iwm(4) 9560 in particular, while testing the driver > with new firmware images recently published by Intel. This may well > be related to other hangs people have reported in multi-AP environments > on both iwm(4) and iwx(4). Working here with 9560 iwm, I've forced roaming via fixing to a distant AP and then removing the fixed chan, done it several times with no problems, the only thing visible in ping with default timers is that the RTTs go down; not a single drop. iwm0: firmware has detected regulatory domain 'GB' (0x4742) iwm0: sending deauth to 22:xx:xx:xx:xx:28 on channel 120 mode 11n iwm0: roaming from 22:xx:xx:xx:xx:28 chan 120 to 76:xx:xx:xx:xx:e7 chan 136 iwm0: RUN -> AUTH iwm0: sending auth to 76:xx:xx:xx:xx:e7 on channel 136 mode 11a iwm0: AUTH -> ASSOC iwm0: sending assoc_req to 76:xx:xx:xx:xx:e7 on channel 136 mode 11a iwm0: ASSOC -> RUN iwm0: associated with 76:xx:xx:xx:xx:e7 ssid "XX" channel 136 start MCS 0 short preamble long slot time HT enabled iwm0: missed beacon threshold set to 30 beacons, beacon interval is 100 TU iwm0: received msg 1/4 of the 4-way handshake from 76:xx:xx:xx:xx:e7 iwm0: sending msg 2/4 of the 4-way handshake to 76:xx:xx:xx:xx:e7 iwm0: received msg 3/4 of the 4-way handshake from 76:xx:xx:xx:xx:e7 iwm0: sending msg 4/4 of the 4-way handshake to 76:xx:xx:xx:xx:e7 iwm0: sending addba_req to 76:xx:xx:xx:xx:e7 on channel 136 mode 11n iwm0: sending addba_resp to 76:xx:xx:xx:xx:e7 on channel 136 mode 11n OpenBSD 7.0-current (GENERIC.MP) #15: Mon Nov 29 20:22:21 GMT 2021 st...@bamboo.spacehopper.org:/sys/arch/amd64/compile/GENERIC.MP real mem = 16926281728 (16142MB) avail mem = 16267046912 (15513MB) random: good seed from bootblocks mpath0 at root scsibus0 at mpath0: 256 targets mainbus0 at root bios0 at mainbus0: SMBIOS rev. 3.1 @ 0x77d49000 (64 entries) bios0: vendor LENOVO version "N2HET63W (1.46 )" date 06/01/2021 bios0: LENOVO 20QF00B2UK acpi0 at bios0: ACPI 6.1 acpi0: sleep states S0 S3 S4 S5 acpi0: tables DSDT FACP SSDT SSDT SSDT SSDT SSDT TPM2 UEFI SSDT HPET APIC MCFG ECDT SSDT SSDT SSDT BOOT SLIC SSDT LPIT WSMT SSDT DBGP DBG2 MSDM BATB NHLT FPDT UEFI acpi0: wakeup devices GLAN(S4) XHC_(S3) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) RP02(S4) PXSX(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) RP06(S4) PXSX(S4) RP07(S4) [...] acpitimer0 at acpi0: 3579545 Hz, 24 bits acpihpet0 at acpi0: 2399 Hz acpimadt0 at acpi0 addr 0xfee0: PC-AT compat cpu0 at mainbus0: apid 0 (boot processor) cpu0: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1795.82 MHz, 06-8e-0c cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES cpu0: 256KB 64b/line 8-way L2 cache cpu0: smt 0, core 0, package 0 mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges cpu0: apic clock running at 24MHz cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE cpu1 at mainbus0: apid 2 (application processor) cpu1: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1795.82 MHz, 06-8e-0c cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES cpu1: 256KB 64b/line 8-way L2 cache cpu1: smt 0, core 1, package 0 cpu2 at mainbus0: apid 4 (application processor) cpu2: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1795.82 MHz, 06-8e-0c cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,SGX,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,MPX,RDSEED,ADX,SMAP,CLFLUSHOPT,PT,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES cpu2: 256KB 64b/line 8-way L2 cache cpu2: smt 0, core 2, package 0 cpu3 at mainbus0: apid 6 (application processor) cpu3: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz, 1715.39 MHz, 06-8e-0c cpu3:
Re: iwm/iwx: try to make roaming more reliable
On Sat, Nov 27, 2021 at 09:57:45AM -0600, Aaron Poffenberger wrote: > I see two differences. Before the patch, before "deauth" I see "sending > delba" but not after patching, and before "firmware has detected > regulatory domain 'US'", but not after. I decided to try not sending a DELBA because it is immediately followed by a DEAUTH anyway. The AP will clear block-ack session state in response to DELBA, and delete *all* state in response to DEAUTH. So omitting DELBA should be fine, and eliminates a frame we need to manage to send out successfully in order to roam away. There is an #if 0 in the patch which you can switch to an #if 1 in order to send the DELBA frame if you would like to try that. Not getting a regulatory notification from the firmware does not matter. We do not do anything with such information yet beyond printing it to dmesg in debug mode.
Re: iwm/iwx: try to make roaming more reliable
On 2021-11-27 12:44 +0100, Stefan Sperling wrote: > This patch reworks the steps involved in roaming to a new access > point on iwm(4) and iwx(4) interfaces. > > The current implementation suffers from race conditions which can > leave the interface in a state where it gets "stuck". I have seen > this happen on iwm(4) 9560 in particular, while testing the driver > with new firmware images recently published by Intel. This may well > be related to other hangs people have reported in multi-AP environments > on both iwm(4) and iwx(4). During testing I didn't see the race condition without the patch, but in the past there have been occassions where I had to run `... netstart.sh iwm0`. So far, I see no regressions with it. The card is an 8265 in a Lenovo T450s. iwm0 at pci2 dev 0 function 0 "Intel Dual Band Wireless-AC 8265" rev 0x78, msi The wireless access points are both TP-Link EAP 245s, one a hardware V1, and the other V3. Before Patch: iwm0: sending delba to 50:c7:bf:94:08:44 on channel 11 mode 11n iwm0: sending deauth to 50:c7:bf:94:08:44 on channel 11 mode 11n iwm0: firmware has detected regulatory domain 'US' (0x5553) iwm0: roaming from 50:c7:bf:94:08:44 chan 11 to b0:95:75:6d:f0:97 chan 44 iwm0: RUN -> AUTH iwm0: sending auth to b0:95:75:6d:f0:97 on channel 44 mode 11a iwm0: AUTH -> ASSOC iwm0: sending assoc_req to b0:95:75:6d:f0:97 on channel 44 mode 11a iwm0: ASSOC -> RUN After Patch: iwm0: sending deauth to 50:c7:bf:94:08:44 on channel 11 mode 11n iwm0: roaming from 50:c7:bf:94:08:44 chan 11 to b0:95:75:6d:f0:97 chan 44 iwm0: RUN -> AUTH iwm0: sending auth to b0:95:75:6d:f0:97 on channel 44 mode 11a iwm0: AUTH -> ASSOC iwm0: sending assoc_req to b0:95:75:6d:f0:97 on channel 44 mode 11a iwm0: ASSOC -> RUN I see two differences. Before the patch, before "deauth" I see "sending delba" but not after patching, and before "firmware has detected regulatory domain 'US'", but not after. Perhaps they're unimportant, or don't show up in all roaming conditions. Cheers, --Aaron
iwm/iwx: try to make roaming more reliable
This patch reworks the steps involved in roaming to a new access point on iwm(4) and iwx(4) interfaces. The current implementation suffers from race conditions which can leave the interface in a state where it gets "stuck". I have seen this happen on iwm(4) 9560 in particular, while testing the driver with new firmware images recently published by Intel. This may well be related to other hangs people have reported in multi-AP environments on both iwm(4) and iwx(4). With this patch in place, net80211 can now pass control to drivers which provide an optional bgscan_done() handler. In this handler the driver will do everything necessary to prepare the device for switching APs, and then call back into net80211 to trigger the AP switch. In the previous implementation, net80211 asked the driver to prepare the device and immediately proceeded with switching APs. The driver may need to wait for firmware commands to complete, which is done in a task context. So we have a situation where the driver-side task and net80211's switching of APs (which involves sending frames) kind of happen in parallel, and things may break. To test roaming, you need to do the following: All APs involved need to use the same SSID for this to work as intended. If you do not have such a setup, you cannot effectively test this patch (unless you just want to run the patch anyway and look for regressions). Use wifi with 'ifconfig iwm0/iwx0 debug' enabled to make roaming attempts appear in /var/log/messages, and watch this file with a command such as: tail -f /var/log/messages Move towards another access point and trigger a background scan by running this command as root: ifconfig iwm0 scan It may take a few scan attempts until roaming triggers. Successful roaming displays the following in /var/log/messages: iwm0: roaming from 00:2b:a2:95:e3:e4 chan 1 to 00:2b:a2:95:e3:f4 chan 36 iwm0: RUN -> AUTH iwm0: sending auth to 00:1a:dd:da:e3:f4 on channel 36 mode 11a iwm0: AUTH -> ASSOC iwm0: sending assoc_req to 00:2b:a2:95:e3:f4 on channel 36 mode 11a iwm0: ASSOC -> RUN If it doesn't do this but enters INIT or SCAN or whatever, roaming failed. This can happen for various reasons. One of them is that our attempt to AUTH to the new AP times out, and I don't know how this could be fixed. In any case, the driver should recover from any roaming failure by going though the regular INIT->SCAN->AUTH->ASSOC->RUN sequence with one of the available APs, and interface link should come back up. If you want to force roaming from one particular AP to another, this can be done by forcing the channel number corresponding to the first AP, associating to this AP, and then clearing the forced channel number: # ifconfig iwm0 chan 1 wait for association to succeed, then clear the forced channel: # ifconfig iwm0 -chan now trigger a background scan as usual: # ifconfig iwm0 scan diff e5ddbb84043d48bc602408a6bf0e30fb062e3280 af878e690f6195efea7bb4639bf75fe439f3cddc blob - f1908d2923d90c7be84c010fc68342afda860d0c blob + dfccd0bd1327325b7c834d1e60f1b37d7a19dc18 --- sys/dev/pci/if_iwm.c +++ sys/dev/pci/if_iwm.c @@ -477,6 +477,9 @@ voidiwm_add_task(struct iwm_softc *, struct taskq *, void iwm_del_task(struct iwm_softc *, struct taskq *, struct task *); intiwm_scan(struct iwm_softc *); intiwm_bgscan(struct ieee80211com *); +void iwm_bgscan_done(struct ieee80211com *, + struct ieee80211_node_switch_bss_arg *, size_t); +void iwm_bgscan_done_task(void *); intiwm_umac_scan_abort(struct iwm_softc *); intiwm_lmac_scan_abort(struct iwm_softc *); intiwm_scan_abort(struct iwm_softc *); @@ -8287,6 +8290,81 @@ iwm_bgscan(struct ieee80211com *ic) return 0; } +void +iwm_bgscan_done(struct ieee80211com *ic, +struct ieee80211_node_switch_bss_arg *arg, size_t arg_size) +{ + struct iwm_softc *sc = ic->ic_softc; + + free(sc->bgscan_unref_arg, M_DEVBUF, sc->bgscan_unref_arg_size); + sc->bgscan_unref_arg = arg; + sc->bgscan_unref_arg_size = arg_size; + iwm_add_task(sc, sc->sc_nswq, >bgscan_done_task); +} + +void +iwm_bgscan_done_task(void *arg) +{ + struct iwm_softc *sc = arg; + struct ieee80211com *ic = >sc_ic; + struct iwm_node *in = (void *)ic->ic_bss; + struct ieee80211_node *ni = >in_ni; + int tid, err = 0, s = splnet(); + + if ((sc->sc_flags & IWM_FLAG_SHUTDOWN) || + (ic->ic_flags & IEEE80211_F_BGSCAN) == 0 || + ic->ic_state != IEEE80211_S_RUN) { + err = ENXIO; + goto done; + } + + for (tid = 0; tid < IWM_MAX_TID_COUNT; tid++) { + int qid = IWM_FIRST_AGG_TX_QUEUE + tid; + + if ((sc->tx_ba_queue_mask & (1 << qid)) == 0) + continue; + + err = iwm_sta_tx_agg(sc, ni, tid, 0, 0, 0); + if (err) + goto done; + err = iwm_disable_txq(sc, IWM_STATION_ID,