Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
On Fri, Apr 21, 2017 at 01:27:41PM -0700, Luck, Tony wrote: > Boris: you coded up a "static bool memory_error(struct mce *m)" > function inside the patches for the corrected error thingy. > > Perhaps when it goes upstream it should be available for other > users too? I don't see why not. struct mce.cpuvendor even has the vendor in there so memory_error() wouldn't even have to look at boot_cpu_data when doing per-vendor decision. I guess we should rename it to something more global namespace-y like "mce_is_memory_error() or so, though, before we expose it to wider audience... -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
On 04/21, Luck, Tony wrote: > On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote: > > On 04/21, Luck, Tony wrote: > > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > > > > > if (!((mce->status & 0xef80) == BIT(7))) > > > > Is this still right though? Anything AND'ed with 0xef80 will never equal > > BIT(7) which is simply 0100 binary (the lowest byte of the left hand > > side is '0') > > I think so ... here it is in binary > > ef80 = 1110 1000 > BIT7 = 1000 > > so the "&" will zap bits {6:0} and bit {12} [and everything not part > of the MCACOD field]. > > If mce->status had some bit above BIT(7) set, it won't be zapped, so we > won't match the exact value BIT(7). Ah, you're right, I was off by one, taking BIT(7) to mean 0100 > > -Tony ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
On Fri, Apr 21, 2017 at 02:35:51PM -0600, Vishal Verma wrote: > On 04/21, Luck, Tony wrote: > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > > > if (!((mce->status & 0xef80) == BIT(7))) > > Is this still right though? Anything AND'ed with 0xef80 will never equal > BIT(7) which is simply 0100 binary (the lowest byte of the left hand > side is '0') I think so ... here it is in binary ef80 = 1110 1000 BIT7 = 1000 so the "&" will zap bits {6:0} and bit {12} [and everything not part of the MCACOD field]. If mce->status had some bit above BIT(7) set, it won't be zapped, so we won't match the exact value BIT(7). -Tony ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
On 04/21, Luck, Tony wrote: > >> > + if (!(mce->status & 0xef80) == BIT(7)) > >> > >> Can we get a define for this, or a comment explaining all the magic > >> that's happening on that one line? > > > > Yes - also like lkp pointed out, the check isn't correct at all. Let me > > figure out what really needs to be done, and I will resend with a better > > comment. > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > if (!((mce->status & 0xef80) == BIT(7))) Is this still right though? Anything AND'ed with 0xef80 will never equal BIT(7) which is simply 0100 binary (the lowest byte of the left hand side is '0') > > The magic is shown in table 15-9 of the Intel Software Developers Manual > (but perhaps not well explained there). > > mce->status in the above code is a value plucked from a machine check > bank status register. See figure 15-6 in the SDM. The important bits for this > are {15:0} which are the "MCA Error code". Table 15-9 shows how these > are grouped into types, where the type is defined by the most significant '1' > bit in the field (excluding bit 12 which is the Correction Report Filtering > bit, > see section 15.9.2.1). > > So if BIT(3) is the most significant bit, the this is a "Generic Cache > Hierarchy" > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on. > > Maybe we should have defines in mce.h for them? It gets a bit more > complicated > as all the above only applies to Intel branded X86 CPUs ... on AMD different > decoding rules apply. > > -Tony > > ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
On Fri, Apr 21, 2017 at 01:19:16PM -0700, Dan Williams wrote: > On Fri, Apr 21, 2017 at 1:16 PM, Luck, Tonywrote: > >>> > + if (!(mce->status & 0xef80) == BIT(7)) > >>> > >>> Can we get a define for this, or a comment explaining all the magic > >>> that's happening on that one line? > >> > >> Yes - also like lkp pointed out, the check isn't correct at all. Let me > >> figure out what really needs to be done, and I will resend with a better > >> comment. > > > > Needs extra parentheses to make it right. Vishal, sorry I led you astray. > > > > if (!((mce->status & 0xef80) == BIT(7))) > > > > The magic is shown in table 15-9 of the Intel Software Developers Manual > > (but perhaps not well explained there). > > > > mce->status in the above code is a value plucked from a machine check > > bank status register. See figure 15-6 in the SDM. The important bits for > > this > > are {15:0} which are the "MCA Error code". Table 15-9 shows how these > > are grouped into types, where the type is defined by the most significant > > '1' > > bit in the field (excluding bit 12 which is the Correction Report Filtering > > bit, > > see section 15.9.2.1). > > > > So if BIT(3) is the most significant bit, the this is a "Generic Cache > > Hierarchy" > > error, BIT(4) denotes a TLB error, BIT(7) a Memory error, and so on. > > Ah, ok. > > > Maybe we should have defines in mce.h for them? It gets a bit more > > complicated > > as all the above only applies to Intel branded X86 CPUs ... on AMD different > > decoding rules apply. > > Yeah, this code is x86_64 generic so should call into helpers that do > the right thing per cpu type. Boris: you coded up a "static bool memory_error(struct mce *m)" function inside the patches for the corrected error thingy. Perhaps when it goes upstream it should be available for other users too? -Tony ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
Hi Vishal, [auto build test WARNING on pm/linux-next] [also build test WARNING on v4.11-rc7 next-20170420] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Vishal-Verma/acpi-nfit-fix-the-memory-error-check-in-nfit_handle_mce/20170421-084359 base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next config: x86_64-randconfig-x005-201716 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All warnings (new ones prefixed by >>): drivers/acpi/nfit/mce.c: In function 'nfit_handle_mce': >> drivers/acpi/nfit/mce.c:29:30: warning: comparison of constant '128ul' with >> boolean expression is always false [-Wbool-compare] if (!(mce->status & 0xef80) == BIT(7)) ^~ >> drivers/acpi/nfit/mce.c:29:30: warning: logical not is only applied to the >> left hand side of comparison [-Wlogical-not-parentheses] vim +/128ul +29 drivers/acpi/nfit/mce.c 13 * General Public License for more details. 14 */ 15 #include 16 #include 17 #include 18 #include 19 #include "nfit.h" 20 21 static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, 22 void *data) 23 { 24 struct mce *mce = (struct mce *)data; 25 struct acpi_nfit_desc *acpi_desc; 26 struct nfit_spa *nfit_spa; 27 28 /* We only care about memory errors */ > 29 if (!(mce->status & 0xef80) == BIT(7)) 30 return NOTIFY_DONE; 31 32 /* 33 * mce->addr contains the physical addr accessed that caused the 34 * machine check. We need to walk through the list of NFITs, and see 35 * if any of them matches that address, and only then start a scrub. 36 */ 37 mutex_lock(_desc_lock); --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
On Thu, 2017-04-20 at 16:18 -0600, Vishal Verma wrote: > The check for an MCE being a memory error in the NFIT mce handler was > bogus. Fix it to check for the correct MCA status compound error code. > > Reported-by: Tony Luck> Cc: Forgot to include, Fixes: 6839a6d96f4e nfit: do an ARS scrub on hitting a latent media error > Signed-off-by: Vishal Verma > --- > drivers/acpi/nfit/mce.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c > index 3ba1c34..23e12a0 100644 > --- a/drivers/acpi/nfit/mce.c > +++ b/drivers/acpi/nfit/mce.c > @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block > *nb, unsigned long val, > struct nfit_spa *nfit_spa; > > /* We only care about memory errors */ > - if (!(mce->status & MCACOD)) > + if (!(mce->status & 0xef80) == BIT(7)) > return NOTIFY_DONE; > > /* ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH] acpi, nfit: fix the memory error check in nfit_handle_mce
The check for an MCE being a memory error in the NFIT mce handler was bogus. Fix it to check for the correct MCA status compound error code. Reported-by: Tony LuckCc: Signed-off-by: Vishal Verma --- drivers/acpi/nfit/mce.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c index 3ba1c34..23e12a0 100644 --- a/drivers/acpi/nfit/mce.c +++ b/drivers/acpi/nfit/mce.c @@ -26,7 +26,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val, struct nfit_spa *nfit_spa; /* We only care about memory errors */ - if (!(mce->status & MCACOD)) + if (!(mce->status & 0xef80) == BIT(7)) return NOTIFY_DONE; /* -- 2.9.3 ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm