Re: [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
On 2021-05-07 10:41:46 Fri, Oliver O'Halloran wrote: > On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar wrote: > > > > When certain PHB HW failure causes phyp to recover PHB, it marks the PE > > state as temporarily unavailable. In this case, per PAPR, rtas call > > ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5) > > and OS has to wait until that recovery is complete. During this state the > > slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR > > connector empty which leads to assumption that the device has been > > hot-removed. This results into no EEH recovery on this device and it stays > > in failed state forever. > > > > This patch fixes this issue by skipping slot presence check only if device > > PE state is temporarily unavailable(5). > > > > Signed-off-by: Mahesh Salgaonkar > > --- > > * snip* > > > > /* > > * It should be corner case that the parent PE has been > > diff --git a/arch/powerpc/kernel/eeh_driver.c > > b/arch/powerpc/kernel/eeh_driver.c > > index 3eff6a4888e79..a0913768f33de 100644 > > --- a/arch/powerpc/kernel/eeh_driver.c > > +++ b/arch/powerpc/kernel/eeh_driver.c > > @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe) > > return; > > } > > > > + /* > > +* When PE's state is temporarily unavailable, the slot > > +* presence check returns as DR connector empty. > > That sounds like a bug in either RTAS or the hotplug slot driver (or > both). The presence check is there largely to filter out events that > we can guarantee are not recoverable (i.e. surprise hot-unplug). In > every other case (especially if we can't determine the state) we > should be going down the recovery path. If the hotplug slot driver is > incorrectly reporting the card has been removed then you should be > fixing the slot driver. Thanks Oliver for the comment. So phyp fixed the issue where it was incorrectly reporting the card has been removed. After the phyp fix, the slot presence check 'get-sensor-state(dr-entity-sense)' returns extended busy error (9902) until PHB is recovered by phyp. And once PHB is recovered, the get-sensor-state() returns success with correct presence status. But now we have different problem. The Linux rtas call interface rtas_get_sensor() loops over the rtas call on extended delay return code (9902) until the return value is either success (0) or error (-1). This causes EEH handler to get stuck at presence check 'rtas_get_sensor()' for ~6 seconds before it could indicate network driver that error has been detected and stop any active operations. With no I/O traffic this doesn't cause any issue and EEH recovery works fine. However with running I/O traffic, during this 6 seconds, network driver continues its operation and hits timeout (netdev watchdog). On timeouts, network driver go into ffdc capture mode and reset path assuming PCI device is in fatal condition. This causes EEH recovery to fail and sometimes it leads to system hang or crash. [52732.244731] DEBUG: ibm_read_slot_reset_state2() [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0 [52732.244798] DEBUG: in eeh_slot_presence_check [52732.244804] DEBUG: error state check [52732.244807] DEBUG: Is slot hotpluggable [52732.244810] DEBUG: hotpluggable ops ? [52732.244953] DEBUG: Calling ops->get_adapter_status [52732.244958] DEBUG: calling rpaphp_get_sensor_state [52736.564262] [ cut here ] [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440 [...] [52736.564505] NIP [c0c32368] dev_watchdog+0x438/0x440 [52736.564513] LR [c0c32364] dev_watchdog+0x434/0x440 I am working on ways to fix this and looking at below two options. More ideas are welcome. 1. There is an alternate call rtas_get_sensor_fast() available that does not use rtas_busy_delay() and returns immediately with error code. Using rtas_get_sensor_fast() for slot presence check fixes the above issue and EEH recovery works fine. However there is no provision in hotplug_slot_ops struct to do a quick check of adapter status that can be used to call rtas_get_sensor_fast(). 2. Another option is to move the slot presence check after reporting network driver that error has been detected. This also fixes the issue. However need to verify the hotplug case where if slot is empty, inform driver to resume while skiping the recovery. Let me know what do you think about above options and if there is any other better way to fix this. Thanks, -Mahesh.
Re: [PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar wrote: > > When certain PHB HW failure causes phyp to recover PHB, it marks the PE > state as temporarily unavailable. In this case, per PAPR, rtas call > ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5) > and OS has to wait until that recovery is complete. During this state the > slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR > connector empty which leads to assumption that the device has been > hot-removed. This results into no EEH recovery on this device and it stays > in failed state forever. > > This patch fixes this issue by skipping slot presence check only if device > PE state is temporarily unavailable(5). > > Signed-off-by: Mahesh Salgaonkar > --- > * snip* > > /* > * It should be corner case that the parent PE has been > diff --git a/arch/powerpc/kernel/eeh_driver.c > b/arch/powerpc/kernel/eeh_driver.c > index 3eff6a4888e79..a0913768f33de 100644 > --- a/arch/powerpc/kernel/eeh_driver.c > +++ b/arch/powerpc/kernel/eeh_driver.c > @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe) > return; > } > > + /* > +* When PE's state is temporarily unavailable, the slot > +* presence check returns as DR connector empty. That sounds like a bug in either RTAS or the hotplug slot driver (or both). The presence check is there largely to filter out events that we can guarantee are not recoverable (i.e. surprise hot-unplug). In every other case (especially if we can't determine the state) we should be going down the recovery path. If the hotplug slot driver is incorrectly reporting the card has been removed then you should be fixing the slot driver. > +* to assumption that the device is hot-removed and causes EEH > +* recovery to stop leaving the device in failed state forever. > +* Hence skip the slot presence check if PE's state is > +* temporarily unavailable and go down EEH recovery path. > +*/ > + if (pe->state & EEH_PE_TEMP_UNAVAIL) > + goto skip_slot_presence_check; There's a time-of-check-vs-time-of-use error here. You're setting this flag at the point of detection, but there can be a significant lag time between when an EEH is initially detected and when it's handled by the recovery thread (usually due to other events being recovered). Transitioning the PE into and out of PE_TEMP_UNAVAILABLE is handled autonomously by the hypervisor so by the time we get to recovery the PE may be back into a normal state where the slot check works fine.
[PATCH] powerpc/eeh: skip slot presence check when PE is temporarily unavailable.
When certain PHB HW failure causes phyp to recover PHB, it marks the PE state as temporarily unavailable. In this case, per PAPR, rtas call ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5) and OS has to wait until that recovery is complete. During this state the slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR connector empty which leads to assumption that the device has been hot-removed. This results into no EEH recovery on this device and it stays in failed state forever. This patch fixes this issue by skipping slot presence check only if device PE state is temporarily unavailable(5). Signed-off-by: Mahesh Salgaonkar --- arch/powerpc/include/asm/eeh.h |1 + arch/powerpc/kernel/eeh.c| 14 -- arch/powerpc/kernel/eeh_driver.c | 18 ++ 3 files changed, 31 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index b1a5bba2e0b94..5dc5538e39b62 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -64,6 +64,7 @@ struct pci_dn; #define EEH_PE_RECOVERING (1 << 1)/* Recovering PE*/ #define EEH_PE_CFG_BLOCKED (1 << 2)/* Block config access */ #define EEH_PE_RESET (1 << 3)/* PE reset in progress */ +#define EEH_PE_TEMP_UNAVAIL(1 << 4)/* PE is temporarily unavailable */ #define EEH_PE_KEEP(1 << 8)/* Keep PE on hotplug */ #define EEH_PE_CFG_RESTRICTED (1 << 9)/* Block config on error */ diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index 7040e430a1249..7fcbf3df18583 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -405,7 +405,8 @@ static int eeh_phb_check_failure(struct eeh_pe *pe) (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) { ret = 0; goto out; - } + } else if (ret == EEH_STATE_UNAVAILABLE) + eeh_pe_state_mark(phb_pe, EEH_PE_TEMP_UNAVAIL); /* Isolate the PHB and send event */ eeh_pe_mark_isolated(phb_pe); @@ -519,14 +520,23 @@ int eeh_dev_check_failure(struct eeh_dev *edev) * We will punt with the following conditions: Failure to get * PE's state, EEH not support and Permanently unavailable * state, PE is in good state. +* +* Certain PHB HW failure causes phyp/hypervisor to recover PHB and +* until that recovery completes, the PE's state is temporarily +* unavailable (EEH_STATE_UNAVAILABLE). In this state the slot +* presence check must be avoided since it may not return valid +* status. Mark this PE status as temporarily unavailable so +* that we can check it later. */ + if ((ret < 0) || (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) { eeh_stats.false_positives++; pe->false_positives++; rc = 0; goto dn_unlock; - } + } else if (ret == EEH_STATE_UNAVAILABLE) + eeh_pe_state_mark(pe, EEH_PE_TEMP_UNAVAIL); /* * It should be corner case that the parent PE has been diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 3eff6a4888e79..a0913768f33de 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe) return; } + /* +* When PE's state is temporarily unavailable, the slot +* presence check returns as DR connector empty. This leads +* to assumption that the device is hot-removed and causes EEH +* recovery to stop leaving the device in failed state forever. +* Hence skip the slot presence check if PE's state is +* temporarily unavailable and go down EEH recovery path. +*/ + if (pe->state & EEH_PE_TEMP_UNAVAIL) + goto skip_slot_presence_check; + /* * When devices are hot-removed we might get an EEH due to * a driver attempting to touch the MMIO space of a removed @@ -871,6 +882,7 @@ void eeh_handle_normal_event(struct eeh_pe *pe) goto out; /* nothing to recover */ } +skip_slot_presence_check: /* Log the event */ if (pe->type & EEH_PE_PHB) { pr_err("EEH: Recovering PHB#%x, location: %s\n", @@ -953,6 +965,12 @@ void eeh_handle_normal_event(struct eeh_pe *pe) } } + /* +* Now that we finished waiting for PE state as per PAPR, +* clear the PE temporarily unavailable state. +*/ + eeh_pe_state_clear(pe, EEH_PE_TEMP_UNAVAIL, true); + /* Since rtas may enable MMIO when posting the error log, * don't post the error log until after all dev drivers * have been informed.