While debugging my 802.11n code I got tired of iwm randomly failing to associate or even getting into a state where I had to reboot to get it to work again.
In case it got hung, the newstate task was sleeping in "iwmauth" and never woke up, which made me revisit how this driver transitions to AUTH state. I couldn't figure out exactly how it got hung, but I believe I found a good reason to remove this tsleep entirely. What's going on here is that the firmware can hop channels in the background whenever it feels like doing so. To prevent this from happening during the association procedure, the driver schedules a "time event" which makes the firmware stay on the current channel for a specified amount of time. This approach is borrowed from the Linux driver. If all goes well, the following happens: 1) The driver's newstate task asks for the time event and goes into tsleep() to wait for the firmware to signal the beginning of the time event. 2) The firmware signals beginning of the time event. 3) The interrupt handler sets a flag in the softc which indicates "time event start" and wakes up the sleeping newstate task. 4) The newstate task is scheduled, checks the flag which says that the time event has started, and finishes moving to AUTH state. 5) Association continues and completes 6) The firmware signals the end of the time event. 7) The interrupt handler sets a flag in the softc which indicates "time event ended" and wakes up any task sleeping on this condition (there is none) Note that steps 5 and 6/7 can happen "concurrently" if the "time-event ended" interrupt triggers during step 5, which is no problem. Sometimes, especially with IWM_DEBUG enabled (lots of printf), this doesn't work out as planned and the order of events goes like this: 1) The driver's newstate task asks for the time event and goes into tsleep() to wait for the firmware to signal the beginning of the time event. 2) The firmware signals beginning of the time event. 3) The interrupt handler sets a flag in the softc which indicates "time event start" and wakes up the sleeping newstate task. 4) The firmware signals the end of the time event. 5) The interrupt handler sets a flag in the softc which indicates "time event ended" and wakes up the sleeping newstate task. 6) The newstate task is scheduled, checks the flag which says that the time event has started, but it is not set so association is aborted. I have also observed other variants of these chains of events where iwm does not manage to associate. Leaving precise synchronization like this up to the scheduler in our kernel is not a good idea. There is no guarantee that things will work out as intended. Moreover, the numbers controlling timing were copied from Linux which has different scheduling latency. This diff does away with the unreliable extra checking. It schedules a time event, busy-waits a short while for good measure, and life goes on. The firmware might fail to schedule the event, and if so might wander off the channel for a moment... in which case association is still more likely to succeed than with these unreliable checks in place. But generally the time event will be scheduled and things just work out. While here, also copy some related docs in comments from the Linux driver and remove an unused argument from iwm_mvm_protect_session(). OK? Below the diff is a trace which shows how events play out now. The lines saying "iwm0: TE notif status = 0x01 action = 0x01" indicate time event start, and the same with "action = 0x02" indicates the end of the time event -- well after association is complete. Index: if_iwm.c =================================================================== RCS file: /cvs/src/sys/dev/pci/if_iwm.c,v retrieving revision 1.68 diff -u -p -r1.68 if_iwm.c --- if_iwm.c 25 Nov 2015 03:09:59 -0000 1.68 +++ if_iwm.c 7 Dec 2015 15:13:22 -0000 @@ -272,7 +272,7 @@ int iwm_mvm_send_time_event_cmd(struct i int iwm_mvm_time_event_send_add(struct iwm_softc *, struct iwm_node *, void *, struct iwm_time_event_cmd_v2 *); void iwm_mvm_protect_session(struct iwm_softc *, struct iwm_node *, - uint32_t, uint32_t, uint32_t); + uint32_t, uint32_t); int iwm_nvm_read_chunk(struct iwm_softc *, uint16_t, uint16_t, uint16_t, uint8_t *, uint16_t *); int iwm_nvm_read_section(struct iwm_softc *, uint16_t, uint8_t *, @@ -2216,7 +2216,7 @@ iwm_mvm_time_event_send_add(struct iwm_s void iwm_mvm_protect_session(struct iwm_softc *sc, struct iwm_node *in, - uint32_t duration, uint32_t min_duration, uint32_t max_delay) + uint32_t duration, uint32_t max_delay) { struct iwm_time_event_cmd_v2 time_cmd; @@ -4988,7 +4988,6 @@ iwm_auth(struct iwm_softc *sc) struct ieee80211com *ic = &sc->sc_ic; struct iwm_node *in = (void *)ic->ic_bss; uint32_t duration; - uint32_t min_duration; int error; in->in_assoc = 0; @@ -5014,32 +5013,14 @@ iwm_auth(struct iwm_softc *sc) return error; } - /* a bit superfluous? */ - while (sc->sc_auth_prot) - tsleep(&sc->sc_auth_prot, 0, "iwmauth", 0); - sc->sc_auth_prot = 1; - - duration = min(IWM_MVM_TE_SESSION_PROTECTION_MAX_TIME_MS, - 200 + in->in_ni.ni_intval); - min_duration = min(IWM_MVM_TE_SESSION_PROTECTION_MIN_TIME_MS, - 100 + in->in_ni.ni_intval); - iwm_mvm_protect_session(sc, in, duration, min_duration, 500); - - while (sc->sc_auth_prot != 2) { - /* - * well, meh, but if the kernel is sleeping for half a - * second, we have bigger problems - */ - if (sc->sc_auth_prot == 0) { - DPRINTF(("%s: missed auth window!\n", DEVNAME(sc))); - return ETIMEDOUT; - } else if (sc->sc_auth_prot == -1) { - DPRINTF(("%s: no time event, denied!\n", DEVNAME(sc))); - sc->sc_auth_prot = 0; - return EAUTH; - } - tsleep(&sc->sc_auth_prot, 0, "iwmau2", 0); - } + /* + * Prevent the FW from wandering off channel during association + * by "protecting" the session with a time event. + */ + /* XXX duration is in units of TU, not MS */ + duration = IWM_MVM_TE_SESSION_PROTECTION_MAX_TIME_MS; + iwm_mvm_protect_session(sc, in, duration, 500 /* XXX magic number */); + DELAY(100); return 0; } @@ -5642,7 +5623,6 @@ iwm_stop(struct ifnet *ifp, int disable) sc->sc_flags |= IWM_FLAG_STOPPED; sc->sc_generation++; sc->sc_scanband = 0; - sc->sc_auth_prot = 0; ic->ic_scan_lock = IEEE80211_SCAN_UNLOCKED; ifp->if_flags &= ~IFF_RUNNING; ifq_clr_oactive(&ifp->if_snd); @@ -6082,16 +6062,8 @@ iwm_notif_intr(struct iwm_softc *sc) struct iwm_time_event_notif *notif; SYNC_RESP_STRUCT(notif, pkt); - if (notif->status) { - if (le32toh(notif->action) & - IWM_TE_V2_NOTIF_HOST_EVENT_START) - sc->sc_auth_prot = 2; - else - sc->sc_auth_prot = 0; - } else { - sc->sc_auth_prot = -1; - } - wakeup(&sc->sc_auth_prot); + DPRINTF(("%s: TE notif status = 0x%x action = 0x%x\n", + DEVNAME(sc), notif->status, notif->action)); break; } case IWM_MCAST_FILTER_CMD: Index: if_iwmreg.h =================================================================== RCS file: /cvs/src/sys/dev/pci/if_iwmreg.h,v retrieving revision 1.6 diff -u -p -r1.6 if_iwmreg.h --- if_iwmreg.h 15 Nov 2015 09:39:09 -0000 1.6 +++ if_iwmreg.h 7 Dec 2015 15:04:18 -0000 @@ -2008,6 +2008,43 @@ struct iwm_time_event_cmd_v1 { /* Time event - defines for command API v2 */ +/** + * DOC: Time Events - what is it? + * + * Time Events are a fw feature that allows the driver to control the presence + * of the device on the channel. Since the fw supports multiple channels + * concurrently, the fw may choose to jump to another channel at any time. + * In order to make sure that the fw is on a specific channel at a certain time + * and for a certain duration, the driver needs to issue a time event. + * + * The simplest example is for BSS association. The driver issues a time event, + * waits for it to start, and only then tells mac80211 that we can start the + * association. This way, we make sure that the association will be done + * smoothly and won't be interrupted by channel switch decided within the fw. + */ + + /** + * DOC: The flow against the fw + * + * When the driver needs to make sure we are in a certain channel, at a certain + * time and for a certain duration, it sends a Time Event. The flow against the + * fw goes like this: + * 1) Driver sends a TIME_EVENT_CMD to the fw + * 2) Driver gets the response for that command. This response contains the + * Unique ID (UID) of the event. + * 3) The fw sends notification when the event starts. + * + * Of course the API provides various options that allow to cover parameters + * of the flow. + * What is the duration of the event? + * What is the start time of the event? + * Is there an end-time for the event? + * How much can the event be delayed? + * Can the event be split? + * If yes what is the maximal number of chunks? + * etc... + */ + /* * @IWM_TE_V2_FRAG_NONE: fragmentation of the time event is NOT allowed. * @IWM_TE_V2_FRAG_SINGLE: fragmentation of the time event is allowed, but only Index: if_iwmvar.h =================================================================== RCS file: /cvs/src/sys/dev/pci/if_iwmvar.h,v retrieving revision 1.12 diff -u -p -r1.12 if_iwmvar.h --- if_iwmvar.h 22 Oct 2015 11:51:28 -0000 1.12 +++ if_iwmvar.h 7 Dec 2015 15:04:33 -0000 @@ -305,7 +305,7 @@ struct iwm_ucode_status { #define IWM_OTP_LOW_IMAGE_SIZE 2048 -#define IWM_MVM_TE_SESSION_PROTECTION_MAX_TIME_MS 500 +#define IWM_MVM_TE_SESSION_PROTECTION_MAX_TIME_MS 1000 #define IWM_MVM_TE_SESSION_PROTECTION_MIN_TIME_MS 400 enum IWM_CMD_MODE { @@ -440,8 +440,6 @@ struct iwm_softc { size_t sc_scan_cmd_len; int sc_scan_last_antenna; int sc_scanband; - - int sc_auth_prot; int sc_fixed_ridx; Dec 7 16:11:48 jessup /bsd: switching state SCAN->AUTH Dec 7 16:11:48 jessup /bsd: sending command 0xd0 qid 9, idx 25 Dec 7 16:11:48 jessup /bsd: sending command 0x8 qid 9, idx 26 Dec 7 16:11:48 jessup /bsd: sending command 0x2b qid 9, idx 27 Dec 7 16:11:48 jessup /bsd: sending command 0x18 qid 9, idx 28 Dec 7 16:11:48 jessup /bsd: Add new TE, duration 1000 TU Dec 7 16:11:48 jessup /bsd: sending command 0x29 qid 9, idx 29 Dec 7 16:11:48 jessup /bsd: iwm0: TE notif status = 0x1 action = 0x1 Dec 7 16:11:48 jessup /bsd: switching state AUTH->ASSOC Dec 7 16:11:48 jessup /bsd: sending command 0x18 qid 9, idx 30 Dec 7 16:11:48 jessup /bsd: dtim 1 0 Dec 7 16:11:48 jessup /bsd: DTIM TBTT is 0x18279706686/0x3656b4, offset 0 Dec 7 16:11:48 jessup /bsd: sending command 0x28 qid 9, idx 31 Dec 7 16:11:48 jessup /bsd: switching state ASSOC->RUN Dec 7 16:11:48 jessup /bsd: Sending power table command on mac id 0x0 for power level 1, flags = 0x0 Dec 7 16:11:48 jessup /bsd: Keep alive = 25 sec Dec 7 16:11:48 jessup /bsd: Disable power management Dec 7 16:11:48 jessup /bsd: sending command 0xa9 qid 9, idx 32 Dec 7 16:11:48 jessup /bsd: sending command 0xd2 qid 9, idx 33 Dec 7 16:11:48 jessup /bsd: ba_enable_beacon_abort is: 0 Dec 7 16:11:48 jessup /bsd: ba_escape_timer is: 6 Dec 7 16:11:48 jessup /bsd: bf_debug_flag is: 0 Dec 7 16:11:48 jessup /bsd: bf_enable_beacon_filter is: 1 Dec 7 16:11:48 jessup /bsd: bf_energy_delta is: 5 Dec 7 16:11:48 jessup /bsd: bf_escape_timer is: 50 Dec 7 16:11:48 jessup /bsd: bf_roaming_energy_delta is: 1 Dec 7 16:11:48 jessup /bsd: bf_roaming_state is: 72 Dec 7 16:11:48 jessup /bsd: bf_temp_threshold is: 112 Dec 7 16:11:48 jessup /bsd: bf_temp_fast_filter is: 1 Dec 7 16:11:48 jessup /bsd: bf_temp_slow_filter is: 5 Dec 7 16:11:48 jessup /bsd: sending command 0xd2 qid 9, idx 34 Dec 7 16:11:48 jessup /bsd: ba_enable_beacon_abort is: 0 Dec 7 16:11:48 jessup /bsd: ba_escape_timer is: 6 Dec 7 16:11:48 jessup /bsd: bf_debug_flag is: 0 Dec 7 16:11:48 jessup /bsd: bf_enable_beacon_filter is: 1 Dec 7 16:11:48 jessup /bsd: bf_energy_delta is: 5 Dec 7 16:11:48 jessup /bsd: bf_escape_timer is: 50 Dec 7 16:11:48 jessup /bsd: bf_roaming_energy_delta is: 1 Dec 7 16:11:48 jessup /bsd: bf_roaming_state is: 72 Dec 7 16:11:48 jessup /bsd: bf_temp_threshold is: 112 Dec 7 16:11:48 jessup /bsd: bf_temp_fast_filter is: 1 Dec 7 16:11:48 jessup /bsd: bf_temp_slow_filter is: 5 Dec 7 16:11:48 jessup /bsd: sending command 0x2c qid 9, idx 35 Dec 7 16:11:48 jessup /bsd: sending command 0x4e qid 9, idx 36 Dec 7 16:11:49 jessup /bsd: iwm0: TE notif status = 0x1 action = 0x2