Public bug reported:

Hi,

[Impact]
Currently in focal, devlink health reporter recovery on mlx5 devices is
performed even when the reporter state is healthy.

[Test Case]

1) Display the devlink health status:
# devlink health show pci/0000:05:00.0 reporter fw_fatal
pci/0000:05:00.0:
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 1200000 auto_recover true
2) Perform reporter recovery using devlink:
# devlink health recover pci/0000:05:00.0 reporter fw_fatal

3) See that recovery was performed even though the state was healthy:
# dmesg
[776733.438708] mlx5_core 0000:05:00.0: mlx5_health_try_recover:316:(pid 563178): handling bad device here
[776733.438717] mlx5_core 0000:05:00.0: mlx5_handle_bad_state:278:(pid 563178): Expected to see disabled NIC but it is full driver
[776735.591522] mlx5_core 0000:05:00.0: mlx5_health_try_recover:328:(pid 563178): starting health recovery flow
...
# devlink health show pci/0000:05:00.0 reporter fw_fatal
pci/0000:05:00.0:
  reporter fw_fatal
    state healthy error 0 recover 1 grace_period 1200000 auto_recover true
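
The counter check in step 3 can be scripted. This is only a minimal sketch: it assumes the `devlink health show` output keeps the `recover N` field layout shown above, and `recover_count` is a hypothetical helper name, not part of devlink.

```shell
# Hypothetical helper: read `devlink health show` output on stdin and
# print the value that follows the "recover" field.
recover_count() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "recover") { print $(i + 1); exit } }'
}

# Sample output copied from step 3 of the test case.
sample='pci/0000:05:00.0:
  reporter fw_fatal
    state healthy error 0 recover 1 grace_period 1200000 auto_recover true'

printf '%s\n' "$sample" | recover_count
```

On a live system the helper would be fed directly, e.g. `devlink health show pci/0000:05:00.0 reporter fw_fatal | recover_count`.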

[Fix]
402818205c9e devlink: don't do reporter recovery if the state is healthy
This upstream commit is from kernel v5.5-rc1 and applies cleanly to the focal
tree.
The commit prevents reporter recovery when the reporter is in a healthy state.
When it is applied, issuing
# devlink health recover pci/0000:05:00.0 reporter fw_fatal
on a healthy reporter returns successfully, but dmesg stays clean and the
recover counter does not change.
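
A possible way to verify the fixed behaviour is to compare the recover counter before and after the recover command. The sketch below assumes the `recover N` field layout from the test case output. `recover_field` and `recover_unchanged` are hypothetical helper names, and the sample string reuses the step 1 output rather than a new capture.

```shell
# Hypothetical helper: print the value following the "recover" field.
recover_field() {
    printf '%s\n' "$1" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "recover") { print $(i + 1); exit } }'
}

# Hypothetical helper: succeed only if both captures contain the same
# recover counter value.
recover_unchanged() {
    rc_before=$(recover_field "$1")
    rc_after=$(recover_field "$2")
    [ -n "$rc_before" ] && [ "$rc_before" = "$rc_after" ]
}

# Output from step 1 of the test case. With the fix applied, the capture
# taken after the recover command is expected to show the same counter.
step1='pci/0000:05:00.0:
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 1200000 auto_recover true'

recover_unchanged "$step1" "$step1" && echo "recover counter unchanged"

# On a live system the two captures would come from devlink itself:
#   before=$(devlink health show pci/0000:05:00.0 reporter fw_fatal)
#   devlink health recover pci/0000:05:00.0 reporter fw_fatal
#   after=$(devlink health show pci/0000:05:00.0 reporter fw_fatal)
#   recover_unchanged "$before" "$after"
```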

[Regression Potential]
Very small, as the change is minor.

Thanks,
Amir

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1915403

Title:
  devlink: don't do reporter recovery if the state is healthy

Status in linux package in Ubuntu:
  New
