Re: [PATCH v7 2/2] PCI: rpaphp: Error out on busy status from get-sensor-state

2023-08-02 Thread Mahesh J Salgaonkar
On 2023-08-01 16:38:08 Tue, Bjorn Helgaas wrote:
> On Mon, Jul 24, 2023 at 02:25:19PM +0530, Mahesh Salgaonkar wrote:
> > When certain PHB HW failure causes pHyp to recover PHB, it marks the PE
> > state as temporarily unavailable until recovery is complete. This also
> > triggers an EEH handler in Linux which needs to notify drivers, and perform
> > recovery. But before notifying the driver about the PCI error it uses
> > get_adapter_state()->get-sensor-state() operation of the hotplug_slot to
> > determine if the slot contains a device or not. if the slot is empty, the
> > recovery is skipped entirely.
> 
> It's helpful to use the exact function name so it's greppable; I think
> get_adapter_status() or rpaphp_get_sensor_state()?
> 
> s/if the slot is empty,/If the slot is empty,/

Sure, will correct it in next revision.

> 
> > However on certain PHB failures, the RTAS call get-sensor-state() returns
> > extended busy error (9902) until PHB is recovered by pHyp. Once PHB is
> > recovered, the get-sensor-state() returns success with correct presence
> > status. The RTAS call interface rtas_get_sensor() loops over the RTAS call
> > on extended delay return code (9902) until the return value is either
> > success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the PCI error has been detected and
> > stop any active operations. Hence with running I/O traffic, during this 6
> > seconds, the network driver continues its operation and hits a timeout
> > (netdev watchdog).
> > 
> > 
> > [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> > [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
> > [52732.244798] DEBUG: in eeh_slot_presence_check
> > [52732.244804] DEBUG: error state check
> > [52732.244807] DEBUG: Is slot hotpluggable
> > [52732.244810] DEBUG: hotpluggable ops ?
> > [52732.244953] DEBUG: Calling ops->get_adapter_status
> > [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> > [52736.564262] [ cut here ]
> > [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
> > [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
> > [...]
> > [52736.564505] NIP [c0c32368] dev_watchdog+0x438/0x440
> > [52736.564513] LR [c0c32364] dev_watchdog+0x434/0x440
> > 
> > 
> > On timeouts, network driver starts dumping debug information to console
> > (e.g bnx2 driver calls bnx2x_panic_dump()), and go into recovery path while
> > pHyp is still recovering the PHB. As part of recovery, the driver tries to
> > reset the device and it keeps failing since every PCI read/write returns
> > ff's. And when EEH recovery kicks-in, the driver is unable to recover the
> > device. This impacts the ssh connection and leads to the system being
> > inaccessible. To get the NIC working again it needs a reboot or re-assign
> > the I/O adapter from HMC.
> > 
> > [ 9531.168587] EEH: Beginning: 'slot_reset'
> > [ 9531.168601] PCI 0013:01:00.0#1: EEH: Invoking bnx2x->slot_reset()
> > [...]
> > [ 9614.110094] bnx2x: [bnx2x_func_stop:9129(enP19p1s0f0)]FUNC_STOP ramrod 
> > failed. Running a dry transaction
> > [ 9614.110300] bnx2x: [bnx2x_igu_int_disable:902(enP19p1s0f0)]BUG! Proper 
> > val not read from IGU!
> > [ 9629.178067] bnx2x: [bnx2x_fw_command:3055(enP19p1s0f0)]FW failed to 
> > respond!
> > [ 9629.178085] bnx2x 0013:01:00.0 enP19p1s0f0: bc 7.10.4
> > [ 9629.178091] bnx2x: [bnx2x_fw_dump_lvl:789(enP19p1s0f0)]Cannot dump MCP 
> > info while in PCI error
> > [ 9644.241813] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f0)]IO slot reset 
> > --> driver unload
> > [...]
> > [ 9644.241819] PCI 0013:01:00.0#1: EEH: bnx2x driver reports: 
> > 'disconnect'
> > [ 9644.241823] PCI 0013:01:00.1#1: EEH: Invoking bnx2x->slot_reset()
> > [ 9644.241827] bnx2x: [bnx2x_io_slot_reset:14229(enP19p1s0f1)]IO slot reset 
> > initializing...
> > [ 9644.241916] bnx2x 0013:01:00.1: enabling device (0140 -> 0142)
> > [ 9644.258604] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f1)]IO slot reset 
> > --> driver unload
> > [ 9644.258612] PCI 0013:01:00.1#1: EEH: bnx2x driver reports: 
> > 'disconnect'
> > [ 9644.258615] EEH: Finished:'slot_reset' with aggregate recovery 
> > state:'disconnect'
> > [ 9644.258620] EEH: Unable to recover from failure from PHB#13-PE#1.
> > [ 9644.261811] EEH: Beginning: 'error_detected(permanent failure)'
> > [...]
> > [ 9644.261823] EEH: Finished:'error_detected(permanent failure)'
> > 
> > Hence, it becomes important to inform driver about the PCI error detection
> > as early as possible, so that driver is aware of PCI error and waits for
> > EEH handler's next action for successful recovery.
> 
> I don't really understand the connection between EEH and
> get_adapter_status(), but I guess this probably refers to
> arch/powerpc/kernel/eeh_driver.c, not the PCI core aer.c and err.c?

Yup, EEH is an I/O error recovery 

Re: [PATCH v7 2/2] PCI: rpaphp: Error out on busy status from get-sensor-state

2023-08-01 Thread Bjorn Helgaas
On Mon, Jul 24, 2023 at 02:25:19PM +0530, Mahesh Salgaonkar wrote:
> When certain PHB HW failure causes pHyp to recover PHB, it marks the PE
> state as temporarily unavailable until recovery is complete. This also
> triggers an EEH handler in Linux which needs to notify drivers, and perform
> recovery. But before notifying the driver about the PCI error it uses
> get_adapter_state()->get-sensor-state() operation of the hotplug_slot to
> determine if the slot contains a device or not. if the slot is empty, the
> recovery is skipped entirely.

It's helpful to use the exact function name so it's greppable; I think
get_adapter_status() or rpaphp_get_sensor_state()?

s/if the slot is empty,/If the slot is empty,/

> However on certain PHB failures, the RTAS call get-sensor-state() returns
> extended busy error (9902) until PHB is recovered by pHyp. Once PHB is
> recovered, the get-sensor-state() returns success with correct presence
> status. The RTAS call interface rtas_get_sensor() loops over the RTAS call
> on extended delay return code (9902) until the return value is either
> success (0) or error (-1). This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the PCI error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog).
> 
> 
> [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
> [52732.244798] DEBUG: in eeh_slot_presence_check
> [52732.244804] DEBUG: error state check
> [52732.244807] DEBUG: Is slot hotpluggable
> [52732.244810] DEBUG: hotpluggable ops ?
> [52732.244953] DEBUG: Calling ops->get_adapter_status
> [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> [52736.564262] [ cut here ]
> [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
> [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
> [...]
> [52736.564505] NIP [c0c32368] dev_watchdog+0x438/0x440
> [52736.564513] LR [c0c32364] dev_watchdog+0x434/0x440
> 
> 
> On timeouts, network driver starts dumping debug information to console
> (e.g bnx2 driver calls bnx2x_panic_dump()), and go into recovery path while
> pHyp is still recovering the PHB. As part of recovery, the driver tries to
> reset the device and it keeps failing since every PCI read/write returns
> ff's. And when EEH recovery kicks-in, the driver is unable to recover the
> device. This impacts the ssh connection and leads to the system being
> inaccessible. To get the NIC working again it needs a reboot or re-assign
> the I/O adapter from HMC.
> 
> [ 9531.168587] EEH: Beginning: 'slot_reset'
> [ 9531.168601] PCI 0013:01:00.0#1: EEH: Invoking bnx2x->slot_reset()
> [...]
> [ 9614.110094] bnx2x: [bnx2x_func_stop:9129(enP19p1s0f0)]FUNC_STOP ramrod 
> failed. Running a dry transaction
> [ 9614.110300] bnx2x: [bnx2x_igu_int_disable:902(enP19p1s0f0)]BUG! Proper val 
> not read from IGU!
> [ 9629.178067] bnx2x: [bnx2x_fw_command:3055(enP19p1s0f0)]FW failed to 
> respond!
> [ 9629.178085] bnx2x 0013:01:00.0 enP19p1s0f0: bc 7.10.4
> [ 9629.178091] bnx2x: [bnx2x_fw_dump_lvl:789(enP19p1s0f0)]Cannot dump MCP 
> info while in PCI error
> [ 9644.241813] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f0)]IO slot reset 
> --> driver unload
> [...]
> [ 9644.241819] PCI 0013:01:00.0#1: EEH: bnx2x driver reports: 'disconnect'
> [ 9644.241823] PCI 0013:01:00.1#1: EEH: Invoking bnx2x->slot_reset()
> [ 9644.241827] bnx2x: [bnx2x_io_slot_reset:14229(enP19p1s0f1)]IO slot reset 
> initializing...
> [ 9644.241916] bnx2x 0013:01:00.1: enabling device (0140 -> 0142)
> [ 9644.258604] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f1)]IO slot reset 
> --> driver unload
> [ 9644.258612] PCI 0013:01:00.1#1: EEH: bnx2x driver reports: 'disconnect'
> [ 9644.258615] EEH: Finished:'slot_reset' with aggregate recovery 
> state:'disconnect'
> [ 9644.258620] EEH: Unable to recover from failure from PHB#13-PE#1.
> [ 9644.261811] EEH: Beginning: 'error_detected(permanent failure)'
> [...]
> [ 9644.261823] EEH: Finished:'error_detected(permanent failure)'
> 
> Hence, it becomes important to inform driver about the PCI error detection
> as early as possible, so that driver is aware of PCI error and waits for
> EEH handler's next action for successful recovery.

I don't really understand the connection between EEH and
get_adapter_status(), but I guess this probably refers to
arch/powerpc/kernel/eeh_driver.c, not the PCI core aer.c and err.c?

> Current implementation uses rtas_get_sensor() API which blocks the slot
> check state until RTAS call returns success. To avoid this, fix the PCI
> hotplug driver (rpaphp) to return an error (-EBUSY) if the slot presence
> state can not be detected immediately while PE is in EEH recovery state.
> Change 

[PATCH v7 2/2] PCI: rpaphp: Error out on busy status from get-sensor-state

2023-07-24 Thread Mahesh Salgaonkar
When certain PHB HW failure causes pHyp to recover PHB, it marks the PE
state as temporarily unavailable until recovery is complete. This also
triggers an EEH handler in Linux which needs to notify drivers, and perform
recovery. But before notifying the driver about the PCI error it uses
get_adapter_state()->get-sensor-state() operation of the hotplug_slot to
determine if the slot contains a device or not. if the slot is empty, the
recovery is skipped entirely.

However on certain PHB failures, the RTAS call get-sensor-state() returns
extended busy error (9902) until PHB is recovered by pHyp. Once PHB is
recovered, the get-sensor-state() returns success with correct presence
status. The RTAS call interface rtas_get_sensor() loops over the RTAS call
on extended delay return code (9902) until the return value is either
success (0) or error (-1). This causes the EEH handler to get stuck for ~6
seconds before it could notify that the PCI error has been detected and
stop any active operations. Hence with running I/O traffic, during this 6
seconds, the network driver continues its operation and hits a timeout
(netdev watchdog).


[52732.244731] DEBUG: ibm_read_slot_reset_state2()
[52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=>
[52732.244798] DEBUG: in eeh_slot_presence_check
[52732.244804] DEBUG: error state check
[52732.244807] DEBUG: Is slot hotpluggable
[52732.244810] DEBUG: hotpluggable ops ?
[52732.244953] DEBUG: Calling ops->get_adapter_status
[52732.244958] DEBUG: calling rpaphp_get_sensor_state
[52736.564262] [ cut here ]
[52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed o>
[52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev>
[...]
[52736.564505] NIP [c0c32368] dev_watchdog+0x438/0x440
[52736.564513] LR [c0c32364] dev_watchdog+0x434/0x440


On timeouts, network driver starts dumping debug information to console
(e.g bnx2 driver calls bnx2x_panic_dump()), and go into recovery path while
pHyp is still recovering the PHB. As part of recovery, the driver tries to
reset the device and it keeps failing since every PCI read/write returns
ff's. And when EEH recovery kicks-in, the driver is unable to recover the
device. This impacts the ssh connection and leads to the system being
inaccessible. To get the NIC working again it needs a reboot or re-assign
the I/O adapter from HMC.

[ 9531.168587] EEH: Beginning: 'slot_reset'
[ 9531.168601] PCI 0013:01:00.0#1: EEH: Invoking bnx2x->slot_reset()
[...]
[ 9614.110094] bnx2x: [bnx2x_func_stop:9129(enP19p1s0f0)]FUNC_STOP ramrod 
failed. Running a dry transaction
[ 9614.110300] bnx2x: [bnx2x_igu_int_disable:902(enP19p1s0f0)]BUG! Proper val 
not read from IGU!
[ 9629.178067] bnx2x: [bnx2x_fw_command:3055(enP19p1s0f0)]FW failed to respond!
[ 9629.178085] bnx2x 0013:01:00.0 enP19p1s0f0: bc 7.10.4
[ 9629.178091] bnx2x: [bnx2x_fw_dump_lvl:789(enP19p1s0f0)]Cannot dump MCP info 
while in PCI error
[ 9644.241813] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f0)]IO slot reset --> 
driver unload
[...]
[ 9644.241819] PCI 0013:01:00.0#1: EEH: bnx2x driver reports: 'disconnect'
[ 9644.241823] PCI 0013:01:00.1#1: EEH: Invoking bnx2x->slot_reset()
[ 9644.241827] bnx2x: [bnx2x_io_slot_reset:14229(enP19p1s0f1)]IO slot reset 
initializing...
[ 9644.241916] bnx2x 0013:01:00.1: enabling device (0140 -> 0142)
[ 9644.258604] bnx2x: [bnx2x_io_slot_reset:14245(enP19p1s0f1)]IO slot reset --> 
driver unload
[ 9644.258612] PCI 0013:01:00.1#1: EEH: bnx2x driver reports: 'disconnect'
[ 9644.258615] EEH: Finished:'slot_reset' with aggregate recovery 
state:'disconnect'
[ 9644.258620] EEH: Unable to recover from failure from PHB#13-PE#1.
[ 9644.261811] EEH: Beginning: 'error_detected(permanent failure)'
[...]
[ 9644.261823] EEH: Finished:'error_detected(permanent failure)'

Hence, it becomes important to inform driver about the PCI error detection
as early as possible, so that driver is aware of PCI error and waits for
EEH handler's next action for successful recovery.

Current implementation uses rtas_get_sensor() API which blocks the slot
check state until RTAS call returns success. To avoid this, fix the PCI
hotplug driver (rpaphp) to return an error (-EBUSY) if the slot presence
state can not be detected immediately while PE is in EEH recovery state.
Change rpaphp_get_sensor_state() to invoke rtas_call(get-sensor-state)
directly only if the respective PE is in EEH recovery state, and take
actions based on RTAS return status. This way EEH handler will not be
blocked on rpaphp_get_sensor_state() and can immediately notify driver
about the PCI error and stop any active operations.

In normal cases (non-EEH case) rpaphp_get_sensor_state() will continue to
invoke rtas_get_sensor() as it was earlier with no change in existing
behavior.

Signed-off-by: Mahesh Salgaonkar 
Reviewed-by: Nathan Lynch 
---
Change in v7:
- Modified patch description to