Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-21 Thread Borislav Petkov
On Fri, Apr 21, 2017 at 01:27:41PM -0700, Luck, Tony wrote:
> Boris: you coded up a "static bool memory_error(struct mce *m)"
> function inside the patches for the corrected error thingy.
> 
> Perhaps when it goes upstream it should be available for other
> users too?

I don't see why not. struct mce.cpuvendor even has the vendor in there
so memory_error() wouldn't even have to look at boot_cpu_data when doing
per-vendor decision.

I guess we should rename it to something more global namespace-y like
"mce_is_memory_error() or so, though, before we expose it to wider
audience...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-21 Thread Vishal Verma
On 04/21, Luck, Tony wrote:
> On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote:
> > On 04/21, Luck, Tony wrote:
> > > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> > > 
> > >   if (!((mce->status & 0xef80) == BIT(7)))
> > 
> > Is this still right though? Anything AND'ed with 0xef80 will never equal
> > BIT(7) which is simply 0100 binary (the lowest byte of the left hand
> > side is '0')
> 
> I think so ... here it is in binary
> 
> ef80 = 1110  1000 
> BIT7 =   1000 
> 
> so the "&" will zap bits {6:0} and bit {12}  [and everything not part
> of the MCACOD field].
> 
> If mce->status had some bit above BIT(7) set, it won't be zapped, so we
> won't match the exact value BIT(7).

Ah, you're right, I was off by one, taking BIT(7) to mean 0100 

> 
> -Tony
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-21 Thread Luck, Tony
On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote:
> On 04/21, Luck, Tony wrote:
> > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> > 
> > if (!((mce->status & 0xef80) == BIT(7)))
> 
> Is this still right though? Anything AND'ed with 0xef80 will never equal
> BIT(7) which is simply 0100 binary (the lowest byte of the left hand
> side is '0')

I think so ... here it is in binary

ef80 = 1110  1000 
BIT7 =   1000 

so the "&" will zap bits {6:0} and bit {12}  [and everything not part
of the MCACOD field].

If mce->status had some bit above BIT(7) set, it won't be zapped, so we
won't match the exact value BIT(7).

-Tony
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-21 Thread Vishal Verma
On 04/21, Luck, Tony wrote:
> >> > +   if (!(mce->status & 0xef80) == BIT(7))
> >> 
> >> Can we get a define for this, or a comment explaining all the magic
> >> that's happening on that one line?
> >
> > Yes - also like lkp pointed out, the check isn't correct at all. Let me
> > figure out what really needs to be done, and I will resend with a better
> > comment. 
> 
> Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> 
>   if (!((mce->status & 0xef80) == BIT(7)))

Is this still right though? Anything AND'ed with 0xef80 will never equal
BIT(7) which is simply 0100 binary (the lowest byte of the left hand
side is '0')

> 
> The magic is shown in table 15-9 of the Intel Software Developers Manual
> (but perhaps not well explained there).
> 
> mce->status in the above code is a value plucked from a machine check
> bank status register. See figure 15-6 in the SDM.  The important bits for this
> are {15:0} which are the "MCA Error code".  Table 15-9 shows how these
> are grouped into types, where the type is defined by the most significant '1'
> bit in the field (excluding bit 12 which is the Correction Report Filtering 
> bit,
> see section 15.9.2.1).
> 
> So if BIT(3) is the most significant bit, the this is a "Generic Cache 
> Hierarchy"
> error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on.
> 
> Maybe we should have defines in mce.h for them?  It gets a bit more 
> complicated
> as all the above only applies to Intel branded X86 CPUs ... on AMD different
> decoding rules apply.
> 
> -Tony
> 
> 


___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-21 Thread Luck, Tony
On Fri, Apr 21, 2017 at 01:19:16PM -0700, Dan Williams wrote:
> On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tony  wrote:
> >>> > +   if (!(mce->status & 0xef80) == BIT(7))
> >>>
> >>> Can we get a define for this, or a comment explaining all the magic
> >>> that's happening on that one line?
> >>
> >> Yes - also like lkp pointed out, the check isn't correct at all. Let me
> >> figure out what really needs to be done, and I will resend with a better
> >> comment.
> >
> > Needs extra parentheses to make it right. Vishal, sorry I led you astray.
> >
> > if (!((mce->status & 0xef80) == BIT(7)))
> >
> > The magic is shown in table 15-9 of the Intel Software Developers Manual
> > (but perhaps not well explained there).
> >
> > mce->status in the above code is a value plucked from a machine check
> > bank status register. See figure 15-6 in the SDM.  The important bits for 
> > this
> > are {15:0} which are the "MCA Error code".  Table 15-9 shows how these
> > are grouped into types, where the type is defined by the most significant 
> > '1'
> > bit in the field (excluding bit 12 which is the Correction Report Filtering 
> > bit,
> > see section 15.9.2.1).
> >
> > So if BIT(3) is the most significant bit, the this is a "Generic Cache 
> > Hierarchy"
> > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on.
> 
> Ah, ok.
> 
> > Maybe we should have defines in mce.h for them?  It gets a bit more 
> > complicated
> > as all the above only applies to Intel branded X86 CPUs ... on AMD different
> > decoding rules apply.
> 
> Yeah, this code is x86_64 generic so should call into helpers that do
> the right thing per cpu type.

Boris: you coded up a "static bool memory_error(struct mce *m)"
function inside the patches for the corrected error thingy.

Perhaps when it goes upstream it should be available for other
users too?

-Tony
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-20 Thread kbuild test robot
Hi Vishal,

[auto build test WARNING on pm/linux-next]
[also build test WARNING on v4.11-rc7 next-20170420]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Vishal-Verma/acpi-nfit-fix-the-memory-error-check-in-nfit_handle_mce/20170421-084359
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git 
linux-next
config: x86_64-randconfig-x005-201716 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/acpi/nfit/mce.c: In function 'nfit_handle_mce':
>> drivers/acpi/nfit/mce.c:29:30: warning: comparison of constant '128ul' with 
>> boolean expression is always false [-Wbool-compare]
 if (!(mce->status & 0xef80) == BIT(7))
 ^~
>> drivers/acpi/nfit/mce.c:29:30: warning: logical not is only applied to the 
>> left hand side of comparison [-Wlogical-not-parentheses]

vim +/128ul +29 drivers/acpi/nfit/mce.c

13   * General Public License for more details.
14   */
15  #include 
16  #include 
17  #include 
18  #include 
19  #include "nfit.h"
20  
21  static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
22  void *data)
23  {
24  struct mce *mce = (struct mce *)data;
25  struct acpi_nfit_desc *acpi_desc;
26  struct nfit_spa *nfit_spa;
27  
28  /* We only care about memory errors */
  > 29  if (!(mce->status & 0xef80) == BIT(7))
30  return NOTIFY_DONE;
31  
32  /*
33   * mce->addr contains the physical addr accessed that caused the
34   * machine check. We need to walk through the list of NFITs, 
and see
35   * if any of them matches that address, and only then start a 
scrub.
36   */
37  mutex_lock(_desc_lock);

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-20 Thread Verma, Vishal L
On Thu, 2017-04-20 at 16:18 -0600, Vishal Verma wrote:
> The check for an MCE being a memory error in the NFIT mce handler was
> bogus. Fix it to check for the correct MCA status compound error code.
> 
> Reported-by: Tony Luck 
> Cc: 

Forgot to include,
Fixes: 6839a6d96f4e nfit: do an ARS scrub on hitting a latent media error

> Signed-off-by: Vishal Verma 
> ---
>  drivers/acpi/nfit/mce.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
> index 3ba1c34..23e12a0 100644
> --- a/drivers/acpi/nfit/mce.c
> +++ b/drivers/acpi/nfit/mce.c
> @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block
> *nb, unsigned long val,
>   struct nfit_spa *nfit_spa;
>  
>   /* We only care about memory errors */
> - if (!(mce->status & MCACOD))
> + if (!(mce->status & 0xef80) == BIT(7))
>   return NOTIFY_DONE;
>  
>   /*
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce

2017-04-20 Thread Vishal Verma
The check for an MCE being a memory error in the NFIT mce handler was
bogus. Fix it to check for the correct MCA status compound error code.

Reported-by: Tony Luck 
Cc: 
Signed-off-by: Vishal Verma 
---
 drivers/acpi/nfit/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index 3ba1c34..23e12a0 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block *nb, 
unsigned long val,
struct nfit_spa *nfit_spa;
 
/* We only care about memory errors */
-   if (!(mce->status & MCACOD))
+   if (!(mce->status & 0xef80) == BIT(7))
return NOTIFY_DONE;
 
/*
-- 
2.9.3

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm