s is based on an idea first shown here:
> https://patchwork.kernel.org/project/kvm/patch/20160301192822.gd22...@pd.tnic/
>
> CC: Borislav Petkov
> Signed-off-by: Maxim Levitsky
> ---
> arch/x86/kvm/x86.c | 3 +++
> arch/x86/kvm/x86.h | 2 ++
> 2 files changed, 5 insertions(+
) message about the nature of the header format error.
>
> Suggested-by: Borislav Petkov
> Signed-off-by: James Morse
> ---
> drivers/acpi/apei/ghes.c | 46
> 1 file changed, 32 insertions(+), 14 deletions(-)
>
> diff --git a/drive
On Tue, Jan 29, 2019 at 06:48:45PM +, James Morse wrote:
> +static int ghes_in_nmi_spool_from_list(struct list_head *rcu_list)
> +{
> + int err, ret = -ENOENT;
> + struct ghes *ghes;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(ghes, rcu_list, list) {
> +
s.c | 33 ++---
> drivers/acpi/apei/hest.c | 10 +-
> include/acpi/ghes.h | 2 ++
> 3 files changed, 17 insertions(+), 28 deletions(-)
Reviewed-by: Borislav Petkov
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-postin
On Wed, Jan 23, 2019 at 06:40:08PM +, James Morse wrote:
> My SMM comment was because the CPU must jump from user-space->SMM, which
> injects
> an NMI into the kernel. The kernel's EIP must point into user-space, so
> returning from the NMI without doing the memory_failure() work puts us back
On Wed, Jan 23, 2019 at 06:33:02PM +, James Morse wrote:
> Was the best I had, but this trips the BUILD_BUG() too early.
> With it, x86 BUILD_BUG()s. With just the -1 the path gets pruned out, and
> there
> are no 'sdei' symbols in the object file.
>
> ...at this point, I stopped caring!
On Tue, Jan 29, 2019 at 06:48:33PM +, James Morse wrote:
> If firmware has never generated CPER records, so it has never written to void
> *error_status_address, yes.
I guess this is the bit of information I was missing.
> There seem to be two ways of doing this. This zero check implies an
On Wed, Jan 23, 2019 at 06:36:38PM +, James Morse wrote:
> Do you consider ENOENT an error? We don't ack in that case as the
> memory wasn't in use.
So let's see:
if (!*buf_paddr)
return -ENOENT;
can happen when apei_read() has returned 0 but it has managed to do
On Mon, Dec 10, 2018 at 07:15:13PM +, James Morse wrote:
> What happens if we miss MF_ACTION_REQUIRED?
AFAICU, the logic is to force-send a signal to the user process, i.e.,
force_sig_info() which cannot be ignored. IOW, an "enlightened" process
would know how to do recovery action from a
On Mon, Dec 03, 2018 at 06:06:10PM +, James Morse wrote:
> memory_failure() offlines or repairs pages of memory that have been
> discovered to be corrupt. These may be detected by an external
> component, (e.g. the memory controller), and notified via an IRQ.
> In this case the work is queued
On Mon, Dec 03, 2018 at 06:06:08PM +, James Morse wrote:
> Now that ghes notification helpers provide the fixmap slots and
> take the lock themselves, multiple NMI-like notifications can
> be used on arm64.
>
> These should be named after their notification method as they can't
> all be
the new
> ghes_peek_estatus() to know how much memory to allocate from
> the ghes_estatus_pool before reading the records.
>
> Reported-by: Borislav Petkov
> Signed-off-by: James Morse
>
> Change since v6:
> * Added a comment explaining the 'ack-error, then goto
On Mon, Dec 03, 2018 at 06:06:06PM +, James Morse wrote:
> ghes_read_estatus() reads the record address, then the record's
> header, then performs some sanity checks before reading the
> records into the provided estatus buffer.
>
> To provide this estatus buffer the caller must know the size
On Mon, Dec 03, 2018 at 06:06:05PM +, James Morse wrote:
> The NMI-like notifications scribble over ghes->estatus, before
> copying it somewhere else. If this interrupts the ghes_probe() code
> calling ghes_proc() on each struct ghes, the data is corrupted.
>
> All the NMI-like notifications
On Mon, Dec 03, 2018 at 06:05:59PM +, James Morse wrote:
> The estatus-queue code is currently hidden by the NOTIFY_NMI #ifdefs.
> Once NOTIFY_SEA starts using the estatus-queue we can stop hiding
> it as each architecture has a user that can't be turned off.
>
> Split the existing
On Fri, Jan 11, 2019 at 06:09:28PM +, James Morse wrote:
> We can return on ENOENT out earlier, as nothing needs doing in that case. Its
> what the GHES_TO_CLEAR spaghetti is for, we can probably move the ack thing
> into
> ghes_clear_estatus(), that way that thing means 'I'm done with this
On Fri, Jan 11, 2019 at 06:25:21PM +, James Morse wrote:
> We ack it in the corrupt-record case too, because we are done with the
> memory.
Ok, so the only thing that we need to do unconditionally is ACK in order
to free the memory. Or is there an exception to that set of steps in
error
On Fri, Jan 11, 2019 at 10:32:23AM -0500, Tyler Baicar wrote:
> The kernel would have no way of knowing what to do here.
What do you mean, there's no way of knowing what to do? It needs to
clear registers so that the next error can get reported properly.
Or of the status read failed and it
On Thu, Jan 10, 2019 at 04:01:27PM -0500, Tyler Baicar wrote:
> On Thu, Jan 10, 2019 at 1:23 PM James Morse wrote:
> > >>
> > >> +if (is_hest_type_generic_v2(ghes) &&
> > >> ghes_ack_error(ghes->generic_v2))
> > >
> > > Since ghes_ack_error() is always prepended with this check, you could
>
On Thu, Jan 10, 2019 at 06:21:21PM +, James Morse wrote:
> Something like:
> ghes_notify_nmi() -> in_nmi_spool_from_list(list) ->
> in_nmi_queue_one_entry(ghes).
Yah, but make that
ghes_notify_nmi() -> ghes_nmi_spool_from_list(list) ->
ghes_nmi_queue_one_entry(ghes).
to denote it is the
Ok,
lemme split this out into a separate thread and add Tony.
On Thu, Jan 10, 2019 at 06:20:35PM +, James Morse wrote:
> > Grrr, what an effing mess that code is! There's hest_disable *and*
> > ghes_disable. Do we really need them both?
>
> ghes_disable lets you ignore the firmware-first
On Fri, Dec 14, 2018 at 01:56:16PM +, James Morse wrote:
> /me digs a bit,
>
> ghes_estatus_pool_init() allocates memory from hest_ghes_dev_register().
> Its caller is behind a 'if (!ghes_disable)' in acpi_hest_init(), and is after
> another 2 calls to apei_hest_parse().
>
> If ghes_disable
On Mon, Dec 03, 2018 at 06:05:58PM +, James Morse wrote:
> ACPI has a GHESv2 which is used on hardware reduced platforms to
> explicitly acknowledge that the memory for CPER records has been
> consumed. This lets an external agent know it can re-use this
> memory for something else.
>
>
On Mon, Dec 03, 2018 at 06:05:57PM +, James Morse wrote:
> Refactor the estatus queue's pool notification routine from
> NOTIFY_NMI's handlers. This will allow another notification
> method to use the estatus queue without duplicating this code.
>
> This patch adds
- continue;
> -
> __process_error(ghes);
> ghes_clear_estatus(ghes, buf_paddr);
> }
> --
Reviewed-by: Borislav Petkov
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
__
s->flags &= ~GHES_TO_CLEAR;
< newline here.
> + if (buf_paddr) {
Also, you can save yourself an indendation level:
if (!buf_paddr)
return;
ghes_copy...
> + ghes_copy_tofrom_phys(ghes->estatus, buf_paddr,
> +
and shrinks the pool.
>
> Suggested-by: Borislav Petkov
> Signed-off-by: James Morse
> ---
> drivers/acpi/apei/ghes.c | 49 +---
> drivers/acpi/apei/hest.c | 2 +-
> include/acpi/ghes.h | 2 +-
> 3 files changed, 8 insertions(+), 45
On Mon, Dec 03, 2018 at 06:05:52PM +, James Morse wrote:
> ghes.c has a memory pool it uses for the estatus cache and the estatus
> queue. The cache is initialised when registering the platform driver.
> For the queue, an NMI-like notification has to grow/shrink the pool
> as it is registered
if (ret)
> - return ret;
> - }
>
> - return 0;
> + addr = (unsigned long)vmalloc(PAGE_ALIGN(len));
> + if (!addr)
> + return -ENOMEM;
... and if we return here due to the ENOMEM, that increment above
remains.
I see you'
hanges since v6:
> * Moved earlier in the series,
> * Tinkered with the commit message.
> * switched to pr_warn_ratelimited() to shut checkpatch up
>
> shutup checkpatch
> ---
> drivers/acpi/apei/ghes.c | 15 +++
> 1 file changed, 7 insertions(+), 8 deletio
+ Peter.
On Fri, Sep 21, 2018 at 11:17:04PM +0100, James Morse wrote:
> arm64 can take an NMI-like error notification when user-space steps in
> some corrupt memory. APEI's GHES code will call memory_failure_queue()
> to schedule the recovery work. We then return to user-space, possibly
> taking
On Fri, Oct 12, 2018 at 06:17:48PM +0100, James Morse wrote:
> Ripping out the existing #ifdefs and replacing them with IS_ENABLED() would
> let
> the compiler work out the estatus stuff is unused, and saves us describing the
> what-uses-it logic in Kconfig.
>
> But this does expose the x86 nmi
he
> ghes_peek_estatus() so we know how much memory to allocate from
> the ghes_estatus_pool before we read the records.
>
> Reported-by: Borislav Petkov
> Signed-off-by: James Morse
> ---
> drivers/acpi/apei/ghes.c | 45
&
On Fri, Sep 21, 2018 at 11:17:01PM +0100, James Morse wrote:
> ghes_read_estatus() reads the record address, then the record's
> header, then performs some sanity checks before reading the
> records into the provided estatus buffer.
>
> We either need to know the size of the records before we
the struct ghes.
>
> Signed-off-by: James Morse
> ---
> drivers/acpi/apei/ghes.c | 26 --
> include/acpi/ghes.h | 1 -
> 2 files changed, 12 insertions(+), 15 deletions(-)
Nice.
Reviewed-by: Borislav Petkov
--
Regards/Gruss,
Boris.
Go
On Fri, Sep 21, 2018 at 11:16:58PM +0100, James Morse wrote:
> Subsequent patches will split up ghes_read_estatus(), at which
> point passing around the 'silent' flag gets annoying. This is to
> suppress prink() messages, which prior to 42a0bb3f7138 ("printk/nmi:
> generic solution for safe printk
Nitpick:
Subject: Re: [PATCH v6 10/18] ACPI / APEI: preparatory split of ghes->estatus
Pls have an active formulation in your Subject and start it with a capital
letter, i.e., something like:
"Split ghes->estatus in preparation for... "
On Fri, Sep 21, 2018 at 11:16:57PM +0100, James
proc(), and are called in process
> or IRQ context. Move the spin_lock_irqsave() around their ghes_proc()
> calls.
Right, should ghes_proc() be renamed to ghes_proc_irq() now, to be
absolutely clear on the processing context it is operating in?
Other than that:
Reviewed-by: Borislav Petko
On Fri, Sep 21, 2018 at 11:16:54PM +0100, James Morse wrote:
> To split up APEIs in_nmi() path, we need the nmi-like callers to always
> be in_nmi(). Add a helper to do the work and claim the notification.
>
> When KVM or the arch code takes an exception that might be a RAS
> notification, it
On Fri, Sep 21, 2018 at 11:16:53PM +0100, James Morse wrote:
> To split up APEIs in_nmi() path, we need any nmi-like callers to always
> be in_nmi(). KVM shouldn't have to know about this, pull the RAS plumbing
> out into a header file.
>
> Currently guest synchronous external aborts are claimed
On Wed, Oct 03, 2018 at 06:50:36PM +0100, James Morse wrote:
> I'm all in favour of letting the compiler work it out, but the existing ghes
> code has #ifdef/#else all over the place. This is 'keeping the style'.
Yeah, but this "style" is not the optimal one and we should
simplify/clean up and
On Wed, Oct 03, 2018 at 06:50:38PM +0100, James Morse wrote:
...
> The non-ghes HEST entries have a "number of records to pre-allocate" too, we
> could make this memory pool something hest.c looks after, but I can't see if
> the
> other error sources use those values.
Thanks for the detailed
On Fri, Sep 21, 2018 at 11:16:52PM +0100, James Morse wrote:
> Now that there are two users of the estatus queue, and likely to be more,
> make it a Kconfig symbol selected by the appropriate notification. We
> can move the ARCH_HAVE_NMI_SAFE_CMPXCHG checks in here too.
Ok, question: why do we
On Fri, Sep 21, 2018 at 11:16:51PM +0100, James Morse wrote:
> Now that the estatus queue can be used by more than one notification
> method, we can move notifications that have NMI-like behaviour over to
> it, and start abstracting GHES's single in_nmi() path.
>
> Switch NOTIFY_SEA over to use
On Fri, Sep 21, 2018 at 11:16:47PM +0100, James Morse wrote:
> Hello,
>
> The GHES driver has collected quite a few bugs:
>
> ghes_proc() at ghes_probe() time can be interrupted by an NMI that
> will clobber the ghes->estatus fields, flags, and the buffer_paddr.
>
> ghes_copy_tofrom_phys() uses
On Wed, May 16, 2018 at 11:38:16AM -0400, Tyler Baicar wrote:
> I haven't seen a deadlock from that, but it looks possible. What if
> the ghes_proc() call in ghes_probe() is moved before the second switch
> statement? That way it is before the NMI/IRQ/poll is setup. At quick
> glance I think that
On Wed, May 16, 2018 at 03:51:14PM +0100, James Morse wrote:
> The first two overload existing architectural behavior, the third improves all
> this with a third standard option. Its the standard story!
:-)
> I thought this was safe because its just ghes_copy_tofrom_phys()s access to
> the
>
On Tue, May 08, 2018 at 09:45:01AM +0100, James Morse wrote:
> NOTIFY_NMI is x86's NMI, arm doesn't have anything that behaves in the same
> way,
> so doesn't use it. The equivalent notifications with NMI-like behaviour are:
> * SEA (synchronous external abort)
> * SEI (SError Interrupt)
> * SDEI
On Fri, Apr 27, 2018 at 04:35:05PM +0100, James Morse wrote:
> Arm64 has multiple NMI-like notifications, but ghes.c only has one
> in_nmi() path, risking deadlock if one NMI-like notification can
> interrupt another.
>
> To support this we need a fixmap entry and lock for each notification
>
On Fri, Apr 27, 2018 at 04:35:00PM +0100, James Morse wrote:
> To support asynchronous NMI-like notifications on arm64 we need to use
> the estatus-queue. These patches refactor it to allow multiple APEI
> notification types to use it.
>
> Refactor the estatus queue's pool grow/shrink code and
memory) that can't be handled immediatly.
> ---
> drivers/acpi/apei/ghes.c | 265
> -------
> 1 file changed, 137 insertions(+), 128 deletions(-)
Reviewed-by: Borislav Petkov <b...@suse.de>
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-pos
>= GHES_SEV_PANIC) {
> -#ifdef CONFIG_X86
> - oops_begin();
> -#endif
> ghes_print_queued_estatus();
> __ghes_panic(ghes);
> }
> -%<-
Acked-by: Borislav Petkov <b...@suse.de>
--
Regards/Gru
Hi James,
On Mon, Mar 19, 2018 at 02:29:13PM +, James Morse wrote:
> I don't think the die_lock really helps here, do we really want to wait for a
> remote CPU to finish printing an OOPs about user-space's bad memory accesses,
> before we bring the machine down due to this system-wide fatal
On Wed, Mar 07, 2018 at 06:15:02PM +, James Morse wrote:
> Today its just x86 and arm64. arm64 doesn't have a hook to do this. I'm happy
> to
> add an empty declaration or leave it under an ifdef until someone complains
> about any behaviour I missed!
So I did some more staring at the code
On Thu, Mar 01, 2018 at 06:06:59PM +, Punit Agrawal wrote:
> You're looking at support for the 32-bit ARM systems.
I know. That's why I'm asking.
> The 64-bit support lives in arch/arm64 and the die() there doesn't
> contain an oops_begin()/oops_end(). But the lack of oops_begin() on
> arm64
On Thu, Feb 15, 2018 at 06:55:57PM +, James Morse wrote:
> Keep the oops_begin() call for x86,
That oops_begin() in generic code is such a layering violation, grrr.
> arm64 doesn't have one of these,
> and APEI is the only thing outside arch code calling this..
So looking at:
On Fri, Feb 23, 2018 at 06:02:21PM +, James Morse wrote:
> Sure. I reckon your English grammar is better than mine, is this better?:
Bah, you must be joking :-)
> | In any NMI-like handler, memory from ghes_estatus_pool is used to save
> | estatus, and added to the ghes_estatus_llist.
On Thu, Feb 15, 2018 at 06:55:56PM +, James Morse wrote:
> +#ifdef CONFIG_HAVE_ACPI_APEI_NMI
> +/*
> + * While printk() now has an in_nmi() path, the handling for CPER records
> + * does not. For example, memory_failure_queue() takes spinlocks and calls
> + * schedule_work_on().
> + *
> + * So
On Thu, Feb 15, 2018 at 06:55:55PM +, James Morse wrote:
> Hello!
>
> The aim of this series is to wire arm64's SDEI into APEI.
>
> What's SDEI? Its ARM's "Software Delegated Exception Interface" [0]. It's
> used by firmware to tell the OS about firmware-first RAS events.
>
> These Software
the error.
...
> V17:Rebase on tip
> Change trace event helper function names
> Remove unneeded prefixes from commit text
The whole series:
Reviewed-by: Borislav Petkov <b...@suse.de>
Thanks!
--
Regards/Gruss,
Boris.
ECO tip #101: Trim y
On Tue, May 16, 2017 at 10:44:43AM -0600, Baicar, Tyler wrote:
> I meant to respond to this comment after I sent the v16 patch series, but
> you beat me to it :)
>
> These prefixes are common to all the GHES/CPER printing to the kernel logs.
I don't mean that - I meant to remove them from this
On Mon, May 15, 2017 at 03:27:59PM -0600, Tyler Baicar wrote:
> Currently there are trace events for the various RAS
> errors with the exception of ARM processor type errors.
> Add a new trace event for such errors so that the user
> will know when they occur. These trace events are
> consistent
On Mon, May 15, 2017 at 03:27:58PM -0600, Tyler Baicar wrote:
> The UEFI spec includes non-standard section type support in the
> Common Platform Error Record. This is defined in section N.2.3 of
> UEFI version 2.5.
>
> Currently if the CPER section's type (UUID) does not match any
> section type
On Mon, May 15, 2017 at 03:27:57PM -0600, Tyler Baicar wrote:
> UEFI spec allows for non-standard section in Common Platform Error
> Record. This is defined in section N.2.3 of UEFI version 2.5.
>
> Currently if the CPER section's type (UUID) does not match with
> one of the section types that
On Mon, May 08, 2017 at 01:54:44PM -0600, Baicar, Tyler wrote:
> This was discussed in the v12 and v13 patch series. There is existing
> code in kvm_handle_guest_abort for injecting an abort back into the
> guest. We only want to do that if it was an abort that was not handled
> by the firmware
On Tue, Apr 25, 2017 at 11:41:39AM -0600, Baicar, Tyler wrote:
> I originally had this as a notifier, but Will requested to remove the
> notifier. That conversation is here: https://lkml.org/lkml/2017/1/18/1018
Yeah, he mentioned on IRC. I just think notifiers would be the cleaner
thing but
On Tue, Apr 18, 2017 at 05:05:18PM -0600, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be
On Tue, Apr 25, 2017 at 10:05:31AM -0600, Baicar, Tyler wrote:
> That seems like something that should be done outside of these patches (if
> added to the kernel at all). The decoding for this information would all be
> vendor specific, so I'm not sure if we want to pollute the EFI code with
>
On Fri, Apr 21, 2017 at 12:22:09PM -0600, Baicar, Tyler wrote:
> I guess it's not really needed. It just may be useful considering there can
> be numerous error info structures, numerous context info structures, and a
> variable length vendor information section. I can move this print to only in
>
On Fri, Apr 21, 2017 at 12:08:43PM -0600, Baicar, Tyler wrote:
> The timestamp may still be useful when it is imprecise. In the polling case,
> you may only poll every minute or so, so the time may be useful.
Well, what is in the timestamp when !precise? Some random time or some
timestamp from a
On Tue, Apr 18, 2017 at 05:05:16PM -0600, Tyler Baicar wrote:
> Add support for ARM Common Platform Error Record (CPER).
> UEFI 2.6 specification adds support for ARM specific
> processor error information to be reported as part of the
> CPER records. This provides more detail on for processor
On Fri, Apr 21, 2017 at 10:04:35AM -0600, Baicar, Tyler wrote:
> This is basically what I already had in v14...you asked to move it into a
> different if-statement? https://lkml.org/lkml/2017/4/12/397
Well, clearly I've been smoking some nasty potent sh*t. :-\
/me goes and looks at the spec:
On Tue, Apr 18, 2017 at 05:05:15PM -0600, Tyler Baicar wrote:
> The ACPI 6.1 spec added a timestamp to the HEST generic data
HEST?
I see the timestamp in
Table 18-343 Generic Error Data Entry
where those things are "One or more Generic Error Data Entry structures
may be recorded in the Generic
On Tue, Apr 18, 2017 at 05:05:14PM -0600, Tyler Baicar wrote:
> The ACPI 6.1 spec adds a new version of the generic data structure.
which data structure? HEST?
> Add support to handle the new structure as well as properly verify
> and iterate through the generic data entries.
>
> Signed-off-by:
On Wed, Apr 19, 2017 at 02:31:13PM -0600, Baicar, Tyler wrote:
> Will do.
You don't necessarily have to reply with "will do" if you agree with the
review.
Also, please wait until I've gone through the whole pile before sending
it again.
Thanks.
--
Regards/Gruss,
Boris.
Good mailing
On Tue, Apr 18, 2017 at 05:05:13PM -0600, Tyler Baicar wrote:
> A RAS (Reliability, Availability, Serviceability) controller
> may be a separate processor running in parallel with OS
> execution, and may generate error records for consumption by
> the OS. If the RAS controller produces multiple
On Thu, Apr 13, 2017 at 02:30:21PM -0600, Baicar, Tyler wrote:
> I do not agree with this. The struct being passed to this function is
> already named acpi_hest_generic_data in the existing code and all over this
> code is named gdata not just d.
And I'm saying they're too long - the preexisting
On Tue, Mar 28, 2017 at 01:30:31PM -0600, Tyler Baicar wrote:
> A RAS (Reliability, Availability, Serviceability) controller
> may be a separate processor running in parallel with OS
> execution, and may generate error records for consumption by
> the OS. If the RAS controller produces multiple
On Tue, Mar 28, 2017 at 01:30:33PM -0600, Tyler Baicar wrote:
> Add support for ARM Common Platform Error Record (CPER).
> UEFI 2.6 specification adds support for ARM specific
> processor error information to be reported as part of the
> CPER records. This provides more detail on for processor
79 matches
Mail list logo