Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-26 Thread Shuai Xue



On 2023/9/27 07:02, Bjorn Helgaas wrote:
> On Fri, Sep 22, 2023 at 10:46:36AM +0800, Shuai Xue wrote:
>> ...
> 
>> Actually, this is a question from my colleague from firmware team.
>> The original question is that:
>>
>> "Should I set CPER_SEV_FATAL for Generic Error Status Block when a
>> PCIe fatal error is detected? If set, kernel will always panic.
>> Otherwise, kernel will always not panic."
>>
>> So I pull a question about desired behavior of Linux kernel first :)
>> From the perspective of the kernel, CPER_SEV_FATAL for Generic Error
>> Status Block is not reasonable. The kernel will attempt to recover
>> Fatal errors, although recovery may fail.
> 
> I don't know the semantics of CPER_SEV_FATAL or why it's there.
> With CPER, we have *two* error severities: a "native" one defined by
> the PCIe spec and another defined by the platform via CPER.
> 
> I speculate that the reason for the CPER severity could be to provide
> a severity for error sources that don't have a "native" severity like
> AER does, or for the vendor to force the OS to restart (for
> CPER_SEV_FATAL, anyway) in cases where it might not otherwise.

Agreed, it is the key point.

Per ACPI 6.5 18.1 Hardware Errors and Error Sources[1]:

- An uncorrected error is a hardware error condition that cannot be
corrected by the hardware or by the firmware. Uncorrected errors
are either fatal or non-fatal.

- A fatal hardware error is an uncorrected or uncontained error
condition that is determined to be unrecoverable by the hardware.
When a fatal uncorrected error occurs, the system is restarted to
prevent propagation of the error.

A non-fatal hardware error is an uncorrected error condition from
which OSPM can attempt recovery by trying to correct the error.
These are also referred to as correctable or recoverable errors.

Based on our discussion and the PCIe and APCI Spec:

- Native AER fatal error defined in PCIe does not indate that there's
uncontained data corruption.
- The kernel is capable of handle native AER fatal and non-fatal errors.
- When a CPER_SEV_FATAL error nofitied by firmware, it indicates the
platform wants to force the OS to restart, and the APEI/GHES driver follows
the Spec now.

(Please correct me if I misunderstand any)


> 
> In the native case, we only have the PCIe severity and don't have the
> CPER severity at all, and I suspect that unless there's uncontained
> data corruption, we would rather handle even the most severe PCIe
> fatal error by disabling the specific device(s) instead of panicking
> and restarting the whole machine.
> 
> So for PCIe errors, I'm not sure setting CPER_SEV_FATAL is beneficial
> unless the platform wants to force the OS to panic, e.g., maybe the
> platform knows about data corruption and/or the vendor wants the OS to
> panic as part of a reliability story.

So back to the original question, I think your above comments are clear enough.

> 
> Presumably the platform has already logged the error, and I assume the
> platform *could* restart without even returning to the OS, but maybe
> it wants the OS to do a crashdump or shutdown in a more orderly way.
> 

If the system is reset in platform without even returning to the OS,
it is not visible to end user. IMHO, it always a bad choice.
The OS can provide enhanced debuggability, for example:

- providing details about the runtime context through crashdump
- saving error information to persistent storage

Thank you for your patience and valuable feedback. It is greatly appreciated
and truly helpful.

Best Regards and Cheers.
Shuai




[1] 
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#hardware-errors-and-error-sources


Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-24 Thread Shuai Xue



On 2023/9/21 21:20, David Laight wrote:
> ...
> I've got a target to generate AER errors by generating read cycles
> that are inside the address range that the bridge forwards but
> outside of any BAR because there are 2 different sized BARs.
> (Pretty easy to setup.)
> On the system I was using they didn't get propagated all the way
> to the root bridge - but were visible in the lower bridge.

So how did you observe it? If the error message does not propagate
to the root bridge, I think no AER interrupt will be trigger.

> It would be nice for a driver to be able to detect/clear such
> a flag if it gets an unexpected ~0u read value.
> (I'm not sure an error callback helps.)

IMHO, a general model is that error detected at endpoint should be
routed to upstream port for example: RCiEP route error message to RCEC,
so that the AER port service could handle the error, the device driver
only have to implement error handler callback.

> 
> OTOH a 'nebs compliant' server routed any kind of PCIe link error
> through to some 'system management' logic that then raised an NMI.
> I'm not sure who thought an NMI was a good idea - they are pretty
> impossible to handle in the kernel and too late to be of use to
> the code performing the access.

I think it is the responsibility of the device to prevent the spread of
errors while reporting that errors have been detected. For example, drop
the current, (drain submit queue) and report error in completion record.
Both NMI and MSI are asynchronous interrupts.

> 
> In any case we were getting one after 'echo 1 >xxx/remove' and
> then taking the PCIe link down by reprogramming the fpga.
> So the link going down was entirely expected, but there seemed
> to be nothing we could do to stop the kernel crashing.
> 
> I'm sure 'nebs compliant' ought to contain some requirements for
> resilience to hardware failures!

How the kernel crash after a link down? Did the system detect a surprise
down error?

Best Regards,
Shuai


Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue
+ @Rafael for the APEI/GHES part.

On 2023/9/22 05:52, Bjorn Helgaas wrote:
> On Thu, Sep 21, 2023 at 08:10:19PM +0800, Shuai Xue wrote:
>> On 2023/9/21 07:02, Bjorn Helgaas wrote:
>>> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> ...
> 
>>> I guess your point is that for CPER_SEV_FATAL errors, the APEI/GHES
>>> path always panics but the native path never does, and that maybe both
>>> paths should work the same way?
>>
>> Yes, exactly. Both OS native and APEI/GHES firmware first are notifications
>> used to handles PCIe AER errors, and IMHO, they should ideally work in the
>> same way.
> 
> I agree, that would be nice, but the whole point of the APEI/GHES
> functionality is vendor value-add, so I'm not sure we can achieve that
> ideal.
> 
>> ...
>> As a result, AER driver only does recovery for non-fatal PCIe error.
> 
> This is only true for the APEI/GHES path, right?  For *native* AER
> handling, we attempt recovery for both fatal and non-fatal errors.

Yes, exactly.

> 
>>> It doesn't seem like the native path should always panic.  If we can
>>> tell that data was corrupted, we may want to panic, but otherwise I
>>> don't think we should crash the entire system even if some device is
>>> permanently broken.
>>
>> Got it. But how can we tell if the data is corrupted with OS native?
> 
> I naively expect that by PCIe protocol, corrupted DLLPs or TLPs
> detected by CRC, sequence number errors, etc, would be discarded
> before corrupting memory, so I doubt we'd get an uncorrectable error
> that means "sorry, I just corrupted your data."
> 
> But DPC is advertised as "avoiding the potential spread of any data
> corruption," so there must be some mechanisms of corruption, and since
> DPC is triggered by either ERR_FATAL or ERR_NONFATAL, I guess maybe
> the errors could tell us something.  I'm going to quit speculating
> because I obviously don't know enough about this area.
> 
>>>> However, I have changed my mind on this issue as I encounter a case where
>>>> a error propagation is detected due to fatal DLLP (Data Link Protocol
>>>> Error) error. A DLLP error occurred in the Compute node, causing the
>>>> node to panic because `struct acpi_hest_generic_status::error_severity` was
>>>> set as CPER_SEV_FATAL. However, data corruption was still detected in the
>>>> storage node by CRC.
>>>
>>> The only mention of Data Link Protocol Error that looks relevant is
>>> PCIe r6.0, sec 3.6.2.2, which basically says a DLLP with an unexpected
>>> Sequence Number should be discarded:
>>>
>>>   For Ack and Nak DLLPs, the following steps are followed (see Figure
>>>   3-21):
>>>
>>> - If the Sequence Number specified by the AckNak_Seq_Num does not
>>>   correspond to an unacknowledged TLP, or to the value in
>>>   ACKD_SEQ, the DLLP is discarded
>>>
>>>   - This is a Data Link Protocol Error, which is a reported error
>>> associated with the Port (see Section 6.2).
>>>
>>> So data from that DLLP should not have made it to memory, although of
>>> course the DMA may not have been completed.  But it sounds like you
>>> did see corrupted data written to memory?
>>
>> The storage node use RDMA to directly access remote compute node.
>> And a error detected by CRC in the storage node. So I suspect yes.
> 
> When doing the CRC, can you distinguish between corrupted data and
> data that was not written because a DMA was only partially completed?

Yes, the receiving application layer will perform length verification.
So the data length is definitely correct.

> 
>> ...
>> I tried to inject Data Link Protocol Error on some platform. The mechanism
>> behind is that rootport controls the sequence number of the specific TLPs
>> and ACK/NAK DLLPs. Data Link Protocol Error will be detected at the Rx side
>> of ACK/NAK DLLPs.
>>
>> In such case, NIC and NVMe recovered on fatal and non-fatal DLLP
>> errors.
> 
> I'm guessing this error injection directly writes the AER status bit,
> which would probably only test the reporting (sending an ERR_FATAL
> message), AER interrupt generation, firmware or OS interrupt handling,
> etc.
> 
> It probably would not actually generate a DLLP with a bad sequence
> number, so it probably does not test the hardware behavior of
> discarding the DLLP if the sequence number is bad.  Just my guess
> though.

No, we don't touch AER status bit. The Root port controller provides Error
Injection Function to trigger a real DLL

Re: Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-21 Thread Shuai Xue


On 2023/9/21 07:02, Bjorn Helgaas wrote:
> On Mon, Sep 18, 2023 at 05:39:58PM +0800, Shuai Xue wrote:
>> Hi, all folks,
>>
>> Error reporting and recovery are one of the important features of PCIe, and
>> the kernel has been supporting them since version 2.6, 17 years ago.
>> I am very curious about the expected behavior of the software.
>> I first recap the error classification and then list my questions bellow it.
>>
>> ## Recap: Error classification
>>
>> - Fatal Errors
>>
>> Fatal errors are uncorrectable error conditions which render the particular
>> Link and related hardware unreliable. For Fatal errors, a reset of the
>> components on the Link may be required to return to reliable operation.
>> Platform handling of Fatal errors, and any efforts to limit the effects of
>> these errors, is platform implementation specific. (PCIe 6.0.1, sec
>> 6.2.2.2.1 Fatal Errors).
>>
>> - Non-Fatal Errors
>>
>> Non-fatal errors are uncorrectable errors which cause a particular
>> transaction to be unreliable but the Link is otherwise fully functional.
>> Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in
>> a device or system management software the opportunity to recover from the
>> error without resetting the components on the Link and disturbing other
>> transactions in progress. Devices not associated with the transaction in
>> error are not impacted by the error.  (PCIe 6.0.1, sec 6.2.2.2.1 Non-Fatal
>> Errors).
>>
>> ## What the kernel do?
>>
>> The Linux kernel supports both the OS native and firmware first modes in
>> AER and DPC drivers. The error recovery API is defined in `struct
>> pci_error_handlers`, and the recovery process is performed in several
>> stages in pcie_do_recovery(). One main difference in handling PCIe errors
>> is that the kernel only resets the link when a fatal error is detected.
>>
>> ## Questions
>>
>> 1. Should kernel panic when fatal errors occur without AER recovery?
>>
>> IMHO, the answer is NO. The AER driver handles both fatal and
>> non-fatal errors, and I have not found any panic changes in the
>> recovery path in OS native mode.
>>
>> As far as I know, on many X86 platforms, struct
>> `acpi_hest_generic_status::error_severity` is set as CPER_SEV_FATAL
>> in firmware first mode. As a result, kernel will panic immediately
>> in ghes_proc() when fatal AER errors occur, and there is no chance
>> to handle the error and perform recovery in AER driver.
> 
> UEFI r2.10, sec N.2.1,, defines CPER_SEV_FATAL, and platform firmware
> decides which Error Severity to put in the error record.  I don't see
> anything in UEFI about how the OS should handle fatal errors.
> 
> ACPI r6.5, sec 18.1, says on fatal uncorrected error, the system
> should be restarted to prevent propagation of the error.  For
> CPER_SEV_FATAL errors, it looks like ghes_proc() panics even before
> trying AER recovery.
> 
> I guess your point is that for CPER_SEV_FATAL errors, the APEI/GHES
> path always panics but the native path never does, and that maybe both
> paths should work the same way?

Yes, exactly. Both OS native and APEI/GHES firmware first are notifications
used to handles PCIe AER errors, and IMHO, they should ideally work in the
same way.

> 
> It would be nice if they worked the same, but I suspect that vendors
> may rely on the fact that CPER_SEV_FATAL forces a restart/panic as
> part of their system integrity story.


As far I I know, vendor use CPER_SEV_FATAL to report default fatal PCIe error

- Data Link Protocol Error
- Surprise Down Error
- Flow Control Protocol Error Severity
- Receiver Overflow Severity
- Malformed TLP Severity
- Uncorrectable Internal Error Severity
- IDE Check Failed Severity
(per PCIe r6.0 7.8.4.4 Uncorrectable Error Severity Register (Offset 0Ch))

which forces a panic.

As a result, AER driver only does recovery for non-fatal PCIe error.

> 
> It doesn't seem like the native path should always panic.  If we can
> tell that data was corrupted, we may want to panic, but otherwise I
> don't think we should crash the entire system even if some device is
> permanently broken.

Got it. But how can we tell if the data is corrupted with OS native?

> 
>> For fatal and non-fatal errors, struct
>> `acpi_hest_generic_status::error_severity` should as
>> CPER_SEV_RECOVERABLE, and struct
>> `acpi_hest_generic_data::error_severity` should reflect its real
>> severity. Then, the kernel is equivalent to handling PCIe errors in
>> Firmware first mode as it does in OS native mode.  Please correct me
>> if I am wrong.
> 
> I don't know enough to co

Questions: Should kernel panic when PCIe fatal error occurs?

2023-09-18 Thread Shuai Xue
Hi, all folks,

Error reporting and recovery are one of the important features of PCIe, and
the kernel has been supporting them since version 2.6, 17 years ago.
I am very curious about the expected behavior of the software.
I first recap the error classification and then list my questions bellow it.

## Recap: Error classification

- Fatal Errors

Fatal errors are uncorrectable error conditions which render the particular
Link and related hardware unreliable. For Fatal errors, a reset of the
components on the Link may be required to return to reliable operation.
Platform handling of Fatal errors, and any efforts to limit the effects of
these errors, is platform implementation specific. (PCIe 6.0.1, sec
6.2.2.2.1 Fatal Errors).

- Non-Fatal Errors

Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in
a device or system management software the opportunity to recover from the
error without resetting the components on the Link and disturbing other
transactions in progress. Devices not associated with the transaction in
error are not impacted by the error.  (PCIe 6.0.1, sec 6.2.2.2.1 Non-Fatal
Errors).

## What the kernel do?

The Linux kernel supports both the OS native and firmware first modes in
AER and DPC drivers. The error recovery API is defined in `struct
pci_error_handlers`, and the recovery process is performed in several
stages in pcie_do_recovery(). One main difference in handling PCIe errors
is that the kernel only resets the link when a fatal error is detected.

## Questions

1. Should kernel panic when fatal errors occur without AER recovery?

IMHO, the answer is NO. The AER driver handles both fatal and non-fatal
errors, and I have not found any panic changes in the recovery path in OS
native mode.

As far as I know, on many X86 platforms, struct 
`acpi_hest_generic_status::error_severity`
is set as CPER_SEV_FATAL in firmware first mode. As a result, kernel will
panic immediately in ghes_proc() when fatal AER errors occur, and there
is no chance to handle the error and perform recovery in AER driver.

For fatal and non-fatal errors, struct 
`acpi_hest_generic_status::error_severity`
should as CPER_SEV_RECOVERABLE, and struct 
`acpi_hest_generic_data::error_severity`
should reflect its real severity. Then, the kernel is equivalent to handling
PCIe errors in Firmware first mode as it does in OS native mode.
Please correct me if I am wrong.

However, I have changed my mind on this issue as I encounter a case where
a error propagation is detected due to fatal DLLP (Data Link Protocol
Error) error. A DLLP error occurred in the Compute node, causing the
node to panic because `struct acpi_hest_generic_status::error_severity` was
set as CPER_SEV_FATAL. However, data corruption was still detected in the
storage node by CRC.

2. Should kernel panic when AER recovery failed?

This question is actually a TODO that was added when the AER driver was
first upstreamed 17 years ago, and it is still relevant today. The kernel
does not proactively panic regardless of the error types occurring in OS
native mode. The DLLP error propagation case indicates that the kernel
might should panic when recovery failed?

3. Should DPC be enabled by default to contain fatal and non-fatal error?

According to the PCIe specification, DPC halts PCIe traffic below a
Downstream Port after an unmasked uncorrectable error is detected at or
below the Port, avoiding the potential spread of any data corruption.

The kernel configures DPC to be triggered only on ERR_FATAL. Literally
speaking, only fatal error have the potential spread of any data
corruption? In addition, the AER Severity is programable by the
Uncorrectable Error Severity Register (Offset 0Ch in PCIe AER cap). If a
default fatal error, e.g. DLLP, set as non-fatal, DPC will not be
triggered.


Looking forward to any comments and reply :)

Thank you.

Best Regards,
Shuai


[1] 
https://github.com/torvalds/linux/commit/6c2b374d74857e892080ee726184ec1d15e7d4e4#diff-fea64904d30501b59d2e948189bbedc476fc270ed4c15e4ae29d7f0efd06771aR438











Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-25 Thread Shuai Xue



在 2022/10/23 PM11:52, Shuai Xue 写道:
> 
> 
> 在 2022/10/22 AM4:01, Tony Luck 写道:
>> Part 1 deals with the process that triggered the copy on write
>> fault with a store to a shared read-only page. That process is
>> send a SIGBUS with the usual machine check decoration to specify
>> the virtual address of the lost page, together with the scope.
>>
>> Part 2 sets up to asynchronously take the page with the uncorrected
>> error offline to prevent additional machine check faults. H/t to
>> Miaohe Lin  and Shuai Xue 
>> for pointing me to the existing function to queue a call to
>> memory_failure().
>>
>> On x86 there is some duplicate reporting (because the error is
>> also signalled by the memory controller as well as by the core
>> that triggered the machine check). Console logs look like this:
>>
>> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
>>  Machine check from kernel copy routine
>>
>> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory 
>> corruption fault at 7f3309503400
>>  x86 fault handler sends SIGBUS to child process
>>
>> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU 
>> page: Recovered
>>  Async call to memory_failure() from copy on write path
> 
> The recovery action might also be handled asynchronously in CMCI 
> uc_decode_notifier
> handler signaled by memory controller, right?
> 
> I have a one more memory failure log than yours.
> 
> [ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory 
> corruption fault at 7fc4bf7cf400
> [ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: 
> Recovered
>   uc_decode_notifier() processes memory controller report
> 
> [ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned
>   Workqueue: events memory_failure_work_func // queued by 
> ghes_do_memory_failure
> 
> [ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned
>   Workqueue: events memory_failure_work_func // queued by 
> __wp_page_copy_user
> 
> [ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory 
> corruption fault at 7fc4bf7cf400
> 
> Best Regards,
> Shuai

Tested-by: Shuai Xue 

Thank you.
Shuai

> 
>>
>> [ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
>>  uc_decode_notifier() processes memory controller report
>>
>> [ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory 
>> corruption fault at 7f3309503400
>>  Parent process tries to read poisoned page. Page has been unmapped, so
>>  #PF handler sends SIGBUS
>>
>>
>> Tony Luck (2):
>>   mm, hwpoison: Try to recover from copy-on write faults
>>   mm, hwpoison: When copy-on-write hits poison, take page offline
>>
>>  include/linux/highmem.h | 24 
>>  include/linux/mm.h  |  5 -
>>  mm/memory.c | 32 ++--
>>  3 files changed, 50 insertions(+), 11 deletions(-)
>>


Re: [PATCH v3 0/2] Copy-on-write poison recovery

2022-10-23 Thread Shuai Xue



在 2022/10/22 AM4:01, Tony Luck 写道:
> Part 1 deals with the process that triggered the copy on write
> fault with a store to a shared read-only page. That process is
> send a SIGBUS with the usual machine check decoration to specify
> the virtual address of the lost page, together with the scope.
> 
> Part 2 sets up to asynchronously take the page with the uncorrected
> error offline to prevent additional machine check faults. H/t to
> Miaohe Lin  and Shuai Xue 
> for pointing me to the existing function to queue a call to
> memory_failure().
> 
> On x86 there is some duplicate reporting (because the error is
> also signalled by the memory controller as well as by the core
> that triggered the machine check). Console logs look like this:
> 
> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
>   Machine check from kernel copy routine
> 
> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory 
> corruption fault at 7f3309503400
>   x86 fault handler sends SIGBUS to child process
> 
> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: 
> Recovered
>   Async call to memory_failure() from copy on write path

The recovery action might also be handled asynchronously in CMCI 
uc_decode_notifier
handler signaled by memory controller, right?

I have a one more memory failure log than yours.

[ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption 
fault at 7fc4bf7cf400
[ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: 
Recovered
uc_decode_notifier() processes memory controller report

[ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned
Workqueue: events memory_failure_work_func // queued by 
ghes_do_memory_failure

[ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned
Workqueue: events memory_failure_work_func // queued by 
__wp_page_copy_user

[ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption 
fault at 7fc4bf7cf400

Best Regards,
Shuai

> 
> [ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
>   uc_decode_notifier() processes memory controller report
> 
> [ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory 
> corruption fault at 7f3309503400
>   Parent process tries to read poisoned page. Page has been unmapped, so
>   #PF handler sends SIGBUS
> 
> 
> Tony Luck (2):
>   mm, hwpoison: Try to recover from copy-on write faults
>   mm, hwpoison: When copy-on-write hits poison, take page offline
> 
>  include/linux/highmem.h | 24 
>  include/linux/mm.h  |  5 -
>  mm/memory.c | 32 ++--
>  3 files changed, 50 insertions(+), 11 deletions(-)
> 


Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-23 Thread Shuai Xue



在 2022/10/22 AM12:30, Luck, Tony 写道:
>>> But maybe it is some RMW instruction ... then, if all the above options 
>>> didn't happen ... we
>>> could get another machine check from the same address. But then we just 
>>> follow the usual
>>> recovery path.
> 
> 
>> Let assume the instruction that cause the COW is in the 63/64 case, aka,
>> it is writing a different cache line from the poisoned one. But the new_page
>> allocated in COW is dropped right? So might page fault again?
> 
> It can, but this should be no surprise to a user that has a signal handler for
> a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
> problem, but simply returns to re-execute the same instruction that caused
> the original trap.
> 
> There may be badly written signal handlers that do this. But they just cause
> pain for themselves. Linux can keep taking the traps and fixing things up and
> sending a new signal over and over.
> 
> In this case that loop may involve taking the machine check again, so some
> extra pain for the kernel, but recoverable machine checks on Intel/x86 
> switched
> from broadcast to delivery to just the logical CPU that tried to consume the 
> poison
> a few generations back. So only a bit more painful than a repeated page fault.
> 
> -Tony
> 
> 

I see, thanks for your patient explanation :)

Best Regards,
Shuai



Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue



在 2022/10/21 PM12:41, Luck, Tony 写道:
>>> When we do return to user mode the task is going to be busy servicing
>>> a SIGBUS ... so shouldn't try to touch the poison page before the
>>> memory_failure() called by the worker thread cleans things up.
>>
>> What about an RT process on a busy system?
>> The worker threads are pretty low priority.
> 
> Most tasks don't have a SIGBUS handler ... so they just die without 
> possibility of accessing poison
> 
> If this task DOES have a SIGBUS handler, and that for some bizarre reason 
> just does a "return"
> so the task jumps back to the instruction that cause the COW then there is a 
> 63/64
> likelihood that it is touching a different cache line from the poisoned one.
> 
> In the 1/64 case ... its probably a simple store (since there was a COW, we 
> know it was trying to
> modify the page) ... so won't generate another machine check (those only 
> happen for reads).
> 
> But maybe it is some RMW instruction ... then, if all the above options 
> didn't happen ... we
> could get another machine check from the same address. But then we just 
> follow the usual
> recovery path.
> 
> -Tony


Let assume the instruction that cause the COW is in the 63/64 case, aka,
it is writing a different cache line from the poisoned one. But the new_page
allocated in COW is dropped right? So might page fault again?

Best Regards,
Shuai


Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-21 Thread Shuai Xue



在 2022/10/21 PM12:08, Tony Luck 写道:
> On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
>>
>>
>> 在 2022/10/21 AM4:05, Tony Luck 写道:
>>> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>>>
>>>>
>>>> 在 2022/10/20 AM1:08, Tony Luck 写道:
> 
>>> I'm experimenting with using sched_work() to handle the call to
>>> memory_failure() (echoing what the machine check handler does using
>>> task_work)_add() to avoid the same problem of not being able to directly
>>> call memory_failure()).
>>
>> Work queues permit work to be deferred outside of the interrupt context
>> into the kernel process context. If we return to user-space before the
>> queued memory_failure() work is processed, we will take the fault again,
>> as we discussed recently.
>>
>> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for 
>> synchronous errors
>> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread 
>> to avoid memory leak
>>
>> So, in my opinion, we should add memory failure as a task work, like
>> do_machine_check does, e.g.
>>
>> queue_task_work(, msg, kill_me_maybe);
> 
> Maybe ... but this case isn't pending back to a user instruction
> that is trying to READ the poison memory address. The task is just
> trying to WRITE to any address within the page.

Aha, I see the difference. Thank you. But I still have a question on
this. Let us discuss in your reply to David Laight.

Best Regards,
Shuai

> 
> So this is much more like a patrol scrub error found asynchronously
> by the memory controller (in this case found asynchronously by the
> Linux page copy function).  So I don't feel that it's really the
> responsibility of the current task.
> 
> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.
> 
>>> +   INIT_WORK(>work, do_sched_memory_failure);
>>> +   p->pfn = pfn;
>>> +   schedule_work(>work);
>>> +}
>>
>> I think there is already a function to do such work in mm/memory-failure.c.
>>
>>  void memory_failure_queue(unsigned long pfn, int flags)
> 
> Also pointed out by Miaohe Lin  ... this does
> exacly what I want, and is working well in tests so far. So perhaps
> a cleaner solution than making the kill_me_maybe() function globally
> visible.
> 
> -Tony


Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue



在 2022/10/21 AM4:05, Tony Luck 写道:
> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>
>>
>> 在 2022/10/20 AM1:08, Tony Luck 写道:
>>> If the kernel is copying a page as the result of a copy-on-write
>>> fault and runs into an uncorrectable error, Linux will crash because
>>> it does not have recovery code for this case where poison is consumed
>>> by the kernel.
>>>
>>> It is easy to set up a test case. Just inject an error into a private
>>> page, fork(2), and have the child process write to the page.
>>>
>>> I wrapped that neatly into a test at:
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>>>
>>> just enable ACPI error injection and run:
>>>
>>>   # ./einj_mem-uc -f copy-on-write
>>>
>>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
>>> on architectures where that is available (currently x86 and powerpc).
>>> When an error is detected during the page copy, return VM_FAULT_HWPOISON
>>> to caller of wp_page_copy(). This propagates up the call stack. Both x86
>>> and powerpc have code in their fault handler to deal with this code by
>>> sending a SIGBUS to the application.
>>
>> Does it send SIGBUS to only child process or both parent and child process?
> 
> This only sends a SIGBUS to the process that wrote the page (typically
> the child, but also possible that the parent is the one that does the
> write that causes the COW).


Thanks for your explanation.

> 
>>>
>>> Note that this patch avoids a system crash and signals the process that
>>> triggered the copy-on-write action. It does not take any action for the
>>> memory error that is still in the shared page. To handle that a call to
>>> memory_failure() is needed. 
>>
>> If the error page is not poisoned, should the return value of wp_page_copy
>> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
>> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.
> 
> The page has uncorrected data in it, but this patch doesn't mark it
> as poisoned.  Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
> that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
> also skip the:
> 
>   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"
> 
> console message. So might result in confusion and attepmts to debug a
> s/w problem with the application instead of blaming the death on a bad
> DIMM.

I see your point. Thank you.

> 
>>> But this cannot be done from wp_page_copy()
>>> because it holds mmap_lock(). Perhaps the architecture fault handlers
>>> can deal with this loose end in a subsequent patch?
> 
> I started looking at this for x86 ... but I have changed my mind
> about this being a good place for a fix. When control returns back
> to the architecture fault handler it no longer has easy access to
> the physical page frame number. It has the virtual address, so it
> could descend back into somee new mm/memory.c function to get the
> physical address ... but that seems silly.
> 
> I'm experimenting with using sched_work() to handle the call to
> memory_failure() (echoing what the machine check handler does using
> task_work)_add() to avoid the same problem of not being able to directly
> call memory_failure()).

Work queues permit work to be deferred outside of the interrupt context
into the kernel process context. If we return to user-space before the
queued memory_failure() work is processed, we will take the fault again,
as we discussed recently.

commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for 
synchronous errors
commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to 
avoid memory leak

So, in my opinion, we should add memory failure as a task work, like
do_machine_check does, e.g.

queue_task_work(, msg, kill_me_maybe);

> 
> So far it seems to be working. Patch below (goes on top of original
> patch ... well on top of the internal version with mods based on
> feedback from Dan Williams ... but should show the general idea)
> 
> With this patch applied the page does get unmapped from all users.
> Other tasks that shared the page will get a SIGBUS if they attempt
> to access it later (from the page fault handler because of
> is_hwpoison_entry() as you mention above.
> 
> -Tony
> 
> From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
> From: Tony Luck 
> Date: Th

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue



在 2022/10/20 AM1:08, Tony Luck 写道:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.

Does it send SIGBUS to only child process or both parent and child process?

> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. 

If the error page is not poisoned, should the return value of wp_page_copy
be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.

Thanks.

Best Regards,
Shuai


> But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Signed-off-by: Tony Luck 
> 
> ---
> Changes in V2:
>Naoya Horiguchi:
>   1) Use -EHWPOISON error code instead of minus one.
>   2) Poison path needs also to deal with old_page
>Tony Luck:
>   Rewrote commit message
>   Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 19 +++
>  mm/memory.c | 28 +++-
>  2 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..5967541fbf0e 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, 
> struct page *from,
>  
>  #endif
>  
> +static inline int copy_user_highpage_mc(struct page *to, struct page *from,
> + unsigned long vaddr, struct 
> vm_area_struct *vma)
> +{
> + unsigned long ret = 0;
> +#ifdef copy_mc_to_kernel
> + char *vfrom, *vto;
> +
> + vfrom = kmap_local_page(from);
> + vto = kmap_local_page(to);
> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> + kunmap_local(vto);
> + kunmap_local(vfrom);
> +#else
> + copy_user_highpage(to, from, vaddr, vma);
> +#endif
> +
> + return ret;
> +}
> +
>  #ifndef __HAVE_ARCH_COPY_HIGHPAGE
>  
>  static inline void copy_highpage(struct page *to, struct page *from)
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..a32556c9b689 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
>   return same;
>  }
>  
> -static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> -struct vm_fault *vmf)
> +/*
> + * Return:
> + *   -EHWPOISON: copy failed due to hwpoison in source page
> + *   0:  copied failed (some other reason)
> + *   1:  copied succeeded
> + */
> +static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> +   struct vm_fault *vmf)
>  {
>   bool ret;
>   void *kaddr;
> @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page 
> *dst, struct page *src,
>   unsigned long addr = vmf->address;
>  
>   if (likely(src)) {
> - copy_user_highpage(dst, src, addr, vma);
> - return true;
> + if (copy_user_highpage_mc(dst, src, addr, vma))
> + return -EHWPOISON;
> + return 1;
>   }
>  
>   /*
> @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page 
> *dst, struct page *src,
>* and update local tlb only
>*/
>   update_mmu_tlb(vma, addr,