Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
On 2023/9/21 07:02, Bjorn Helgaas wrote: > On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote: >> Hi, all folks, >> >> Error reporting and recovery are one of the important features of PCIe, and >> the kernel has been supporting them since version 2.6, 17 year

Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-18 Thread Shuai Xue
Hi, all folks, Error reporting and recovery are one of the important features of PCIe, and the kernel has been supporting them since version 2.6, 17 years ago. I am very curious about the expected behavior of the software. I first recap the error classification and then list my questions below

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
+ @Rafael for the APEI/GHES part. On 2023/9/22 05:52, Bjorn Helgaas wrote: > On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote: >> On 2023/9/21 07:02, Bjorn Helgaas wrote: >>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote: >> ...

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-26 Thread Shuai Xue
On 2023/9/27 07:02, Bjorn Helgaas wrote: > On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote: >> ... > >> Actually, this is a question from my colleague from firmware team. >> The original question is that: >> >> "Should I set CPER_SEV_

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-24 Thread Shuai Xue
On 2023/9/21 21:20, David Laight wrote: > ... > I've got a target to generate AER errors by generating read cycles > that are inside the address range that the bridge forwards but > outside of any BAR because there are 2 different sized BARs. > (Pretty easy to setup.) > On the system I was

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/21 4:05 AM, Tony Luck wrote: > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/20 1:08 AM, Tony Luck wrote: >>> If the kernel is copying a page as the result of a copy-on-write >>> fault and runs into an uncorrectable erro

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue
On 2022/10/21 12:41 PM, Luck, Tony wrote: >>> When we do return to user mode the task is going to be busy servicing >>> a SIGBUS ... so shouldn't try to touch the poison page before the >>> memory_failure() called by the worker thread cleans things up. >> >> What about an RT process on a busy

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue
On 2022/10/21 12:08 PM, Tony Luck wrote: > On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote: >> >> >> On 2022/10/21 4:05 AM, Tony Luck wrote: >>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote: >>>> >>>> >>>> On 20

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue
On 2022/10/20 1:08 AM, Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test

Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-23 Thread Shuai Xue
th the scope. > > Part 2 sets up to asynchronously take the page with the uncorrected > error offline to prevent additional machine check faults. H/t to > Miaohe Lin and Shuai Xue > for pointing me to the existing function to queue a call to > memory_failure(). > > On x

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-23 Thread Shuai Xue
On 2022/10/22 12:30 AM, Luck, Tony wrote: >>> But maybe it is some RMW instruction ... then, if all the above options >>> didn't happen ... we >>> could get another machine check from the same address. But then we just >>> follow the usual >>> recovery path. > > >> Let assume the instruction

Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-25 Thread Shuai Xue
On 2022/10/23 11:52 PM, Shuai Xue wrote: > > > On 2022/10/22 4:01 AM, Tony Luck wrote: >> Part 1 deals with the process that triggered the copy on write >> fault with a store to a shared read-only page. That process is >> sent a SIGBUS with the usual machine check decoratio