Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 2017-07-24 at 20:30 +0200, Borislav Petkov wrote:
 :
> So I don't want to break existing users and thus make only explicitly
> known platforms load ghes_edac. In the current case, the HPE
> machines. All the rest will simply use the platform drivers and
> nothing will change for them.
>
> Later we'll probably need to revisit this decision but right now and
> with all things considered, the whitelist seems - as ugly as it is -
> the most workable solution for all the different use cases and
> machines...

Agreed. I will verify OEMID info of our other platforms, and add APEI
OSC check before calling ghes_edac_register().

Thanks,
-Toshi
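The gating agreed on above (whitelist match first, then an APEI OSC check, and only then the call to ghes_edac_register()) could be sketched roughly as follows. This is only an illustration under assumptions: `struct platform_state`, `enum ghes_edac_decision` and `ghes_edac_decide()` are hypothetical names that do not appear in the patch, and the two flags stand in for whatever the real whitelist lookup and OSC negotiation report.

```c
#include <assert.h>

/* Hypothetical summary of what the platform reports at init time. */
struct platform_state {
	int whitelisted;       /* OEM ID matched the ghes_edac whitelist */
	int osc_apei_granted;  /* firmware acknowledged APEI support via OSC */
};

/* Hypothetical outcome of the checks, kept separate for logging. */
enum ghes_edac_decision {
	GHES_EDAC_REGISTER,
	GHES_EDAC_SKIP_NOT_WHITELISTED,
	GHES_EDAC_SKIP_NO_OSC,
};

/*
 * Apply the checks in the order discussed in the thread: unknown
 * platforms keep their platform-specific EDAC driver, and whitelisted
 * platforms still need the OSC check to pass.
 */
static enum ghes_edac_decision ghes_edac_decide(const struct platform_state *p)
{
	assert(p);

	if (!p->whitelisted)
		return GHES_EDAC_SKIP_NOT_WHITELISTED;
	if (!p->osc_apei_granted)
		return GHES_EDAC_SKIP_NO_OSC;
	return GHES_EDAC_REGISTER;
}
```

Returning a reason instead of a plain boolean is a design choice of this sketch; it would make it easier to log why ghes_edac was not registered on a given machine.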
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
(Sending to your other mail address because there's some temporary
resolution issue:

msmtp: recipient address mche...@s-opensource.com not accepted by the server
msmtp: server message: 451 4.3.0: Temporary lookup failure
msmtp: could not send mail (account alien8.de from /home/boris/.msmtprc)

Maybe the problem is on my end.)

On Mon, Jul 24, 2017 at 03:10:13PM -0300, Mauro Carvalho Chehab wrote:
> Yeah, having a whitelist is a maintainership's burden, but, on
> the other hand, I suspect that there aren't many systems that
> implement FF, have a reliable BIOS mapping of MB's silkscreen
> and doesn't filters out corrected errors using some sort of
> undocumented mechanism.
>
> So, I guess it is doable.

Right, let's hope.

> Another alternative, with, IMO, is better would be to add a parameter like:
>
> edac=FF - firmware first;
> edac=hw - hardware first;
> edac=auto - honors FF if set in BIOS. Otherwise, hardware first.

Or maybe edac=try_FF or so. But yeah, I guess we'll need something to
tell the EDAC core to try FF first.

> In order to avoid regressions, and to avoid the need of a whitelist,
> I would keep "edac=hw" as default.

So I don't want to break existing users and thus make only explicitly
known platforms load ghes_edac. In the current case, the HPE machines.
All the rest will simply use the platform drivers and nothing will
change for them.

Later we'll probably need to revisit this decision but right now and
with all things considered, the whitelist seems - as ugly as it is -
the most workable solution for all the different use cases and
machines...

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, Jul 24, 2017 at 05:54:52PM +, Kani, Toshimitsu wrote:
> Umm... I was under impression that we are adding the OSC bit check in
> addition to the current GHES filtering.

Read the parallel subthread again.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 2017-07-24 at 14:56 -0300, Mauro Carvalho Chehab wrote:
> On Mon, 24 Jul 2017 15:56:27 +
 :
> That's probably too late for me as I received a new HP machine
> we bought just last week, but for the next time I would need to
> get a new hardware, what would be the non-RAS equivalent to
> a ML 350 G9 tower-mounted machine with two Xeon v4 CPUs and iLO?

Such servers are called "HPE Cloudline". But I think they are all
rack-mounted, not tower-mounted machines. HP Inc. (which is now a
separate company for consumer-oriented products) probably has such a
machine.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 24 Jul 2017 18:44:00 +0200, Borislav Petkov wrote:

> On Mon, Jul 24, 2017 at 01:04:13PM -0300, Mauro Carvalho Chehab wrote:
> > If the Kernel force those users to use ghes_edac by default,
> > they they won't see the error counts anymore, but, instead,
> > hardware reports that the memories need to be replaced.
>
> This is exactly why I'm trying to load ghes_edac only on those platforms
> which would really want it.
>
> > So, the right solution would be to keep hardware first, but
> > providing a modprobe parameter to let them switch to software
> > first.
>
> That's exactly the issue: if we make it spec-conform and adhere to FF
> setting, then it'll be clean. BUT(!), we will force ghes_edac on those
> platforms which potentially are using the platform-specific drivers
> until now. Not good.
>
> If we do the whitelisting, then we're stuck with maintaining a yucky
> whitelist and have to keep updating ghes_edac with it.

Yeah, having a whitelist is a maintenance burden but, on the other
hand, I suspect that there aren't many systems that implement FF, have
a reliable BIOS mapping of the MB's silkscreen, and don't filter out
corrected errors using some sort of undocumented mechanism.

So, I guess it is doable.

Another alternative, which, IMO, is better, would be to add a
parameter like:

	edac=FF   - firmware first;
	edac=hw   - hardware first;
	edac=auto - honors FF if set in BIOS. Otherwise, hardware first.

In order to avoid regressions, and to avoid the need for a whitelist,
I would keep "edac=hw" as default.

Thanks,
Mauro
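For illustration, the three proposed values could be handled along these lines. This is a userspace sketch under assumptions, not kernel code: no `edac=` parameter exists in the patch under discussion, and `enum edac_mode`, `edac_parse_mode()` and `edac_use_firmware_first()` are hypothetical names. The `bios_ff` flag stands in for however the firmware-first setting would actually be detected.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy values for the proposed "edac=" parameter. */
enum edac_mode {
	EDAC_MODE_HW,    /* hardware first: platform-specific EDAC driver */
	EDAC_MODE_FF,    /* firmware first: ghes_edac */
	EDAC_MODE_AUTO,  /* honor the firmware-first setting from the BIOS */
};

/*
 * Map the parameter string to a mode. A missing or unrecognized value
 * falls back to hardware first, the default suggested above.
 */
static enum edac_mode edac_parse_mode(const char *arg)
{
	if (!arg)
		return EDAC_MODE_HW;
	if (!strcmp(arg, "FF"))
		return EDAC_MODE_FF;
	if (!strcmp(arg, "auto"))
		return EDAC_MODE_AUTO;
	return EDAC_MODE_HW;
}

/*
 * Decide whether the firmware-first path should be used.
 * bios_ff is nonzero when the BIOS reports firmware-first handling.
 */
static int edac_use_firmware_first(enum edac_mode mode, int bios_ff)
{
	switch (mode) {
	case EDAC_MODE_FF:
		return 1;
	case EDAC_MODE_AUTO:
		return bios_ff != 0;
	case EDAC_MODE_HW:
		return 0;
	}
	assert(!"unknown edac_mode");
	return 0;
}
```

With `edac=hw` as the fallback, a system that never sets the parameter keeps its current behavior, which is the regression-avoidance argument made above.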
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 24 Jul 2017 15:56:27 +, "Kani, Toshimitsu" wrote:

> On Mon, 2017-07-24 at 17:37 +0200, Borislav Petkov wrote:
> > On Mon, Jul 24, 2017 at 03:25:34PM +, Kani, Toshimitsu wrote:
>  :
> > > We've been providing this model for many years now.
> >
> > Dude, relax, I'm only trying to point out to you that there are
> > customers who want to see *every* error and thus track how their
> > hardware behaves. And that for those customers it is probably worth
> > considering exposing that info and providing a switch to disable that
> > dumbing of the RAS functionality in the BIOS so that people can
> > decide for themselves. That's all.
>
> Yes, Mauro has already pointed this out. As I replied to him, we do
> have a separate series of platforms that do not have built-in RAS, and
> report all errors. Such customers can simply choose them. They do not
> need to pay for built-in RAS.

That's probably too late for me, as I received a new HP machine we
bought just last week, but the next time I need to get new hardware,
what would be the non-RAS equivalent to a ML 350 G9 tower-mounted
machine with two Xeon v4 CPUs and iLO?

Regards,
Mauro

> The model w/ built-in RAS provides warranty & full support. As I said,
> it's a different model.
>
> Thanks,
> -Toshi

Thanks,
Mauro
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 2017-07-24 at 20:50 +0300, Boris Petkov wrote:
> On July 24, 2017 8:44:03 PM GMT+03:00, "Kani, Toshimitsu" wrote:
> > I assumed our platforms w/o build-in RAS do not implement GHES,
>
> If we make it a normal module, it will be decoupled from GHES and it
> will rely only on the whitelist to load.

Umm... I was under the impression that we are adding the OSC bit check
in addition to the current GHES filtering.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On July 24, 2017 8:44:03 PM GMT+03:00, "Kani, Toshimitsu" wrote:
>I assumed our platforms w/o build-in RAS do not implement GHES,

If we make it a normal module, it will be decoupled from GHES and it
will rely only on the whitelist to load.

--
Sent from a small device: formatting sux and brevity is inevitable.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 2017-07-24 at 18:37 +0200, Borislav Petkov wrote:
> On Mon, Jul 24, 2017 at 03:56:27PM +, Kani, Toshimitsu wrote:
> > Yes, Mauro has already pointed this out. As I replied to him, we
> > do have a separate series of platforms that do not have built-in
> > RAS, and
>
> So this whitelist entry
>
> +static struct acpi_oemlist oemlist[] = {
> +	{"HPE ", "Server ", 0, ACPI_SIG_FADT, all_versions},
> +	{ } /* End */
> +};
>
> looks like it'll match every HP server platform not only the ones
> with built-in RAS.

I assumed our platforms w/o built-in RAS do not implement GHES, but I
will check for sure. Also, all our previous/current platforms have
"HP".

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, Jul 24, 2017 at 01:04:13PM -0300, Mauro Carvalho Chehab wrote:
> If the Kernel force those users to use ghes_edac by default,
> they they won't see the error counts anymore, but, instead,
> hardware reports that the memories need to be replaced.

This is exactly why I'm trying to load ghes_edac only on those
platforms which would really want it.

> So, the right solution would be to keep hardware first, but
> providing a modprobe parameter to let them switch to software
> first.

That's exactly the issue: if we make it spec-conform and adhere to FF
setting, then it'll be clean. BUT(!), we will force ghes_edac on those
platforms which potentially are using the platform-specific drivers
until now. Not good.

If we do the whitelisting, then we're stuck with maintaining a yucky
whitelist and have to keep updating ghes_edac with it.

So we're basically between a rock and a hard place. If I had to choose
*right* *now*, I'd probably lean slightly towards the whitelist as it
won't break existing users.

A big grumpfy-grumbly hmmm. :-\

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, Jul 24, 2017 at 03:56:27PM +, Kani, Toshimitsu wrote:
> Yes, Mauro has already pointed this out. As I replied to him, we do
> have a separate series of platforms that do not have built-in RAS, and

So this whitelist entry

+static struct acpi_oemlist oemlist[] = {
+	{"HPE ", "Server ", 0, ACPI_SIG_FADT, all_versions},
+	{ } /* End */
+};

looks like it'll match every HP server platform not only the ones
with built-in RAS.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
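To make the breadth of such an entry concrete, here is a simplified userspace sketch of matching a platform's OEM ID and OEM table ID against a whitelist. It is not the matching code from the patch: `struct oem_entry`, `id_matches()` and `platform_whitelisted()` are hypothetical, and treating trailing spaces as padding is an assumption made because ACPI table headers store these IDs in fixed-width fields (6 and 8 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one whitelist entry. */
struct oem_entry {
	const char *oem_id;        /* compared against the table's OEM ID */
	const char *oem_table_id;  /* compared against the OEM table ID */
};

/* Terminated by an all-NULL entry, like the empty entry in the patch. */
static const struct oem_entry ghes_edac_whitelist[] = {
	{ "HPE", "Server" },
	{ NULL, NULL },
};

/* Match id against pattern, allowing only space padding after it. */
static int id_matches(const char *id, const char *pattern)
{
	size_t n = strlen(pattern);

	if (strncmp(id, pattern, n) != 0)
		return 0;
	for (id += n; *id; id++)
		if (*id != ' ')
			return 0;
	return 1;
}

/* Return 1 if any entry matches both IDs, 0 otherwise. */
static int platform_whitelisted(const struct oem_entry *list,
				const char *oem_id, const char *oem_table_id)
{
	assert(list && oem_id && oem_table_id);

	for (; list->oem_id; list++)
		if (id_matches(oem_id, list->oem_id) &&
		    id_matches(oem_table_id, list->oem_table_id))
			return 1;
	return 0;
}
```

In this sketch every platform reporting "HPE" and "Server" matches, with or without built-in RAS, because nothing in the two IDs distinguishes the product lines. A platform reporting "HP" as its OEM ID would not match the entry at all.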
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 24 Jul 2017 17:37:16 +0200, Borislav Petkov wrote:

> > Customers do not see error counts. I do not think it's bogus.
> > I am just trying to enable OS error reporting with ghes_edac.
>
> I know, you don't have to state the obvious constantly.

The problem I see is that, currently, users that already have EDAC
enabled get the errors directly from the hardware.

If the kernel forces those users to use ghes_edac by default, they
won't see the error counts anymore but, instead, hardware reports that
the memories need to be replaced.

Well, if such users are handling thresholds themselves, they won't see
those errors anymore, as the errors will be masked. That's a
regression.

So, the right solution would be to keep hardware first, but provide a
modprobe parameter to let them switch to software first.

Thanks,
Mauro
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 2017-07-24 at 17:37 +0200, Borislav Petkov wrote:
> On Mon, Jul 24, 2017 at 03:25:34PM +, Kani, Toshimitsu wrote:
 :
> > We've been providing this model for many years now.
>
> Dude, relax, I'm only trying to point out to you that there are
> customers who want to see *every* error and thus track how their
> hardware behaves. And that for those customers it is probably worth
> considering exposing that info and providing a switch to disable that
> dumbing of the RAS functionality in the BIOS so that people can
> decide for themselves. That's all.

Yes, Mauro has already pointed this out. As I replied to him, we do
have a separate series of platforms that do not have built-in RAS, and
report all errors. Such customers can simply choose them. They do not
need to pay for built-in RAS.

The model w/ built-in RAS provides warranty & full support. As I said,
it's a different model.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, Jul 24, 2017 at 03:25:34PM +, Kani, Toshimitsu wrote:
> Customers do not see error counts. I do not think it's bogus.

Not showing the real error counts but something contrived is the
definition of bogus numbers. But you're not showing anything - only
when some thresholds are being hit.

> This model is basically the same as your car. You do not see error

Oh jeez, we're talking about cars now.

> We've been providing this model for many years now.

Dude, relax, I'm only trying to point out to you that there are
customers who want to see *every* error and thus track how their
hardware behaves. And that for those customers it is probably worth
considering exposing that info and providing a switch to disable that
dumbing of the RAS functionality in the BIOS so that people can decide
for themselves. That's all.

I'm not questioning your model - I'm just saying that it could be
improved for certain customers. Do me a favor and this time *actually*
*read* my reply.

> I am just trying to enable OS error reporting with ghes_edac.

I know, you don't have to state the obvious constantly.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, 2017-07-24 at 17:04 +0200, Borislav Petkov wrote:
> On Mon, Jul 24, 2017 at 02:49:30PM +, Kani, Toshimitsu wrote:
> > We do not tell the error counts to customers.
>
> Please read what I said: do you tell your customers that the error
> counts they're seeing (or are *not* seeing) is bogus because the BIOS
> is hiding them? Not the *actual* numbers!

Customers do not see error counts. I do not think it's bogus.

This model is basically the same as your car. You do not see error
counts or periodical normal errors from all kinds of controllers in
the car while you are driving. You get an attention lamp lit when you
need to bring it to a car dealer.

> > We tell customers when they need attention and have actionable
> > items, and we provide support for that. Support gets all info
> > necessary.
>
> Ok, good to know. I'll make sure to bounce such issues to you guys in
> the future.

We've been providing this model for many years now. I am just trying
to enable OS error reporting with ghes_edac.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Mon, Jul 24, 2017 at 02:49:30PM +, Kani, Toshimitsu wrote:
> We do not tell the error counts to customers.

Please read what I said: do you tell your customers that the error
counts they're seeing (or are *not* seeing) is bogus because the BIOS
is hiding them? Not the *actual* numbers!

> We tell customers when they need attention and have actionable items,
> and we provide support for that. Support gets all info necessary.

Ok, good to know. I'll make sure to bounce such issues to you guys in
the future.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Sat, 2017-07-22 at 08:28 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 06:38:52PM +, Kani, Toshimitsu wrote:
> > Enterprise platforms have very different model (I do not say it's
> > better for everyone from the cost perspective). Typically, such
>
> But you do tell your customers that the error counts they see are not
> really what *actually* happens, right?

We do not tell the error counts to customers. We tell customers when
they need attention and have actionable items, and we provide support
for that. Support gets all info necessary.

There are multiple models for multiple types of customers. I am not
saying one model is better than the other.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, Jul 21, 2017 at 06:38:52PM +, Kani, Toshimitsu wrote:
> Enterprise platforms have very different model (I do not say it's
> better for everyone from the cost perspective). Typically, such

But you do tell your customers that the error counts they see are not
really what *actually* happens, right?

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 2017-07-21 at 19:23 +0200, Borislav Petkov wrote:
 :
> Not only that: thresholds depend on the DIMM types which means, BIOS
> must know what DIMM types are in there which I doubt.

BIOS knows the DIMM model from the SPD data.

> So exposing that to configuration instead of "deciding" for people
> would be better.

Enterprise platforms have a very different model (I do not say it's
better for everyone from the cost perspective). Typically, such
platform vendors work with DIMM vendors directly to come up with their
supported DIMMs with their own part numbers, which are certified for
the platforms with extensive validation testing.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, Jul 21, 2017 at 02:01:31PM -0300, Mauro Carvalho Chehab wrote:
> I see the value of having a threshold in BIOS, provided that it is
> well documented, and whose value can be adjusted, if needed.
>
> One of the things I wanted to implement in ras-daemon were an
> algorithm that would be doing such threshold in software.

We have that now in the kernel:

drivers/ras/cec.c

We did it exactly for that purpose - not upsetting users unnecessarily.

> The thing with a BIOS threshold is that the user has no way to
> audit the algorithm. So, when BIOS start reporting such errors,
> it may be already too late: the systems may be in the verge of
> losing data (or some data was already lost).

Not only that: thresholds depend on the DIMM types, which means the BIOS
must know what DIMM types are in there, which I doubt.

So exposing that to configuration instead of "deciding" for people would
be better.

> That's critical on cluster systems with thousands of machines:
> while the impact of disabling a cluster node to do some maintenance
> is marginal, the impact of an uncorrected error on a single
> machine may compromise weeks of expensive processing.
>
> That's why some users prefer to monitor every single corrected
> error, and compare with the probability distribution they
> know that the risk of uncorrected errors is acceptable.

Yap, you need to have stuff like that configurable - BIOS can't predict
all possible use cases.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
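
The idea of thresholding corrected errors in software, with a threshold the user can set, can be sketched roughly as below. This is only an illustration and not the code in drivers/ras/cec.c: the table size, the halving decay, and all names (ce_collector, ce_record, ce_decay) are assumptions made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CE_SLOTS 8                 /* hypothetical table size */

struct ce_slot {
    uint64_t pfn;                  /* page frame the errors were seen on */
    unsigned int count;            /* corrected errors since the last decay */
    int used;                      /* non-zero while the slot tracks a page */
};

struct ce_collector {
    struct ce_slot slots[CE_SLOTS];
    unsigned int threshold;        /* user-configurable action threshold */
};

static void ce_init(struct ce_collector *c, unsigned int threshold)
{
    assert(threshold > 0);
    for (size_t i = 0; i < CE_SLOTS; i++)
        c->slots[i] = (struct ce_slot){0};
    c->threshold = threshold;
}

/*
 * Record one corrected error on page frame pfn.
 * Returns 1 when the page reached the threshold and should be acted on
 * (reported, retired, ...), 0 while the error is silently absorbed.
 */
static int ce_record(struct ce_collector *c, uint64_t pfn)
{
    struct ce_slot *slot = NULL;

    for (size_t i = 0; i < CE_SLOTS; i++) {
        struct ce_slot *s = &c->slots[i];

        if (s->used && s->pfn == pfn) {
            slot = s;              /* page is already tracked */
            break;
        }
        if (!s->used && !slot)
            slot = s;              /* remember the first free slot */
    }
    if (!slot)
        return 0;                  /* table full: this sketch drops the event */

    if (!slot->used) {
        slot->used = 1;
        slot->pfn = pfn;
        slot->count = 0;
    }
    if (++slot->count < c->threshold)
        return 0;

    slot->used = 0;                /* forget the page once it was acted on */
    return 1;
}

/* Periodic decay: halve the counters so old, isolated errors fade out. */
static void ce_decay(struct ce_collector *c)
{
    for (size_t i = 0; i < CE_SLOTS; i++) {
        struct ce_slot *s = &c->slots[i];

        if (!s->used)
            continue;
        s->count /= 2;
        if (s->count == 0)
            s->used = 0;
    }
}
```

A caller would invoke ce_record() for each corrected error and ce_decay() from a periodic timer. Keeping the threshold as a field that is set at init time, rather than a constant, is the point made above: the policy stays visible and adjustable instead of being decided inside the BIOS.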
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 2017-07-21 at 14:01 -0300, Mauro Carvalho Chehab wrote:
> On Fri, 21 Jul 2017 16:40:20 +
> "Kani, Toshimitsu" wrote:
>
> > On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > > On Fri, 21 Jul 2017 15:34:50 +
> > > "Kani, Toshimitsu" wrote:
> > >
> > > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > > > On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu
> > > > > wrote:
> > > > > > Yes, that is correct. Corrected errors are reported to the
> > > > > > OS when they exceeded the platform's threshold.
> > > > >
> > > > > Are those thresholds user-configurable?
> > > >
> > > > I suppose it'd depend on vendors, but I do not think users can
> > > > do it properly unless they have depth knowledge about the
> > > > hardware.
> > > >
> > > > > If not, what are you telling users who want to see *every*
> > > > > corrected error for measuring DIMM wear and so on...?
> > > >
> > > > Corrected errors are normal and expected to occur on healthy
> > > > hardware. They do not need user's attention until they
> > > > repeatedly occurred at a same place.
> > >
> > > Yes, they're expected to happen. Still, some sys admins have
> > > their own measurements about what's "normal" for their scenario,
> > > and want to monitor every single corrected error, running their
> > > own algorithm to warn if the number of corrected errors is above
> > > their "normal" rate.
> >
> > I suppose these admins had to do it because their platforms
> > reported all corrected errors. It addresses such administrators'
> > burden.
>
> I see the value of having a threshold in BIOS, provided that it is
> well documented, and whose value can be adjusted, if needed.
>
> One of the things I wanted to implement in ras-daemon were an
> algorithm that would be doing such threshold in software.
> The problem is that it would require field experience. So,
> I talked with a few vendors, to see if they could help doing
> it, but, on that time, none raised their hands :-)

I think it'd be very hard to keep it up to date.

> The thing with a BIOS threshold is that the user has no way to
> audit the algorithm. So, when BIOS start reporting such errors,
> it may be already too late: the systems may be in the verge of
> losing data (or some data was already lost).
>
> That's critical on cluster systems with thousands of machines:
> while the impact of disabling a cluster node to do some maintenance
> is marginal, the impact of an uncorrected error on a single
> machine may compromise weeks of expensive processing.
>
> That's why some users prefer to monitor every single corrected
> error, and compare with the probability distribution they
> know that the risk of uncorrected errors is acceptable.

Right, I do not think all platforms need to be firmware-first. I do
not want to talk like a salesperson, but we also offer lower-cost
platforms that do not come with built-in RAS. Users can choose the
right model for their needs.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 21 Jul 2017 16:40:20 +
"Kani, Toshimitsu" wrote:

> On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > On Fri, 21 Jul 2017 15:34:50 +
> > "Kani, Toshimitsu" wrote:
> >
> > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > > On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu
> > > > wrote:
> > > > > Yes, that is correct. Corrected errors are reported to the OS
> > > > > when they exceeded the platform's threshold.
> > > >
> > > > Are those thresholds user-configurable?
> > >
> > > I suppose it'd depend on vendors, but I do not think users can do
> > > it properly unless they have depth knowledge about the hardware.
> > >
> > > > If not, what are you telling users who want to see *every*
> > > > corrected error for measuring DIMM wear and so on...?
> > >
> > > Corrected errors are normal and expected to occur on healthy
> > > hardware. They do not need user's attention until they repeatedly
> > > occurred at a same place.
> >
> > Yes, they're expected to happen. Still, some sys admins have their
> > own measurements about what's "normal" for their scenario, and want
> > to monitor every single corrected error, running their own
> > algorithm to warn if the number of corrected errors is above their
> > "normal" rate.
>
> I suppose these admins had to do it because their platforms reported
> all corrected errors. It addresses such administrators' burden.

I see the value of having a threshold in BIOS, provided that it is
well documented and that its value can be adjusted, if needed.

One of the things I wanted to implement in ras-daemon was an
algorithm that would do such thresholding in software.
The problem is that it would require field experience. So,
I talked with a few vendors, to see if they could help doing
it, but, at that time, none raised their hands :-)

The thing with a BIOS threshold is that the user has no way to
audit the algorithm. So, when the BIOS starts reporting such errors,
it may be already too late: the systems may be on the verge of
losing data (or some data was already lost).

That's critical on cluster systems with thousands of machines:
while the impact of disabling a cluster node to do some maintenance
is marginal, the impact of an uncorrected error on a single
machine may compromise weeks of expensive processing.

That's why some users prefer to monitor every single corrected
error, and compare it with the probability distribution for which they
know that the risk of uncorrected errors is acceptable.

Thanks,
Mauro
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> On Fri, 21 Jul 2017 15:34:50 +
> "Kani, Toshimitsu" wrote:
>
> > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu
> > > wrote:
> > > > Yes, that is correct. Corrected errors are reported to the OS
> > > > when they exceeded the platform's threshold.
> > >
> > > Are those thresholds user-configurable?
> >
> > I suppose it'd depend on vendors, but I do not think users can do
> > it properly unless they have depth knowledge about the hardware.
> >
> > > If not, what are you telling users who want to see *every*
> > > corrected error for measuring DIMM wear and so on...?
> >
> > Corrected errors are normal and expected to occur on healthy
> > hardware. They do not need user's attention until they repeatedly
> > occurred at a same place.
>
> Yes, they're expected to happen. Still, some sys admins have their
> own measurements about what's "normal" for their scenario, and want
> to monitor every single corrected error, running their own
> algorithm to warn if the number of corrected errors is above their
> "normal" rate.

I suppose these admins had to do it because their platforms reported
all corrected errors. Firmware-side thresholding addresses such
administrators' burden.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 2017-07-21 at 17:53 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 03:34:50PM +, Kani, Toshimitsu wrote:
> > I suppose it'd depend on vendors, but I do not think users can do
> > it properly unless they have depth knowledge about the hardware.
>
> I'm talking about a menu in the BIOS where you can set the
> thresholding levels on the system. Does your BIOS have that?

No, we don't offer such settings.

> > Corrected errors are normal and expected to occur on healthy
> > hardware. They do not need user's attention until they repeatedly
> > occurred at a same place.
>
> Apparently, you haven't been on enough maintenance calls, trying to
> calm down the customer about the hardware error he sees in his
> logs...

Actually, that's why. Reporting all corrected errors makes users
worry, call support, and ask to replace healthy hardware...

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, Jul 21, 2017 at 03:34:50PM +, Kani, Toshimitsu wrote:
> I suppose it'd depend on vendors, but I do not think users can do it
> properly unless they have depth knowledge about the hardware.

I'm talking about a menu in the BIOS where you can set the thresholding
levels on the system. Does your BIOS have that?

> Corrected errors are normal and expected to occur on healthy hardware.
> They do not need user's attention until they repeatedly occurred at a
> same place.

Apparently, you haven't been on enough maintenance calls, trying to
calm down the customer about the hardware error he sees in his logs...

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 21 Jul 2017 15:34:50 +
"Kani, Toshimitsu" wrote:

> On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> > > Yes, that is correct. Corrected errors are reported to the OS when
> > > they exceeded the platform's threshold.
> >
> > Are those thresholds user-configurable?
>
> I suppose it'd depend on vendors, but I do not think users can do it
> properly unless they have depth knowledge about the hardware.
>
> > If not, what are you telling users who want to see *every* corrected
> > error for measuring DIMM wear and so on...?
>
> Corrected errors are normal and expected to occur on healthy hardware.
> They do not need user's attention until they repeatedly occurred at a
> same place.

Yes, they're expected to happen. Still, some sys admins have their
own measurements about what's "normal" for their scenario, and want
to monitor every single corrected error, running their own
algorithm to warn if the number of corrected errors is above their
"normal" rate.

Thanks,
Mauro
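
The kind of site-specific monitoring described above could look roughly like the following sliding-window counter, which warns once the corrected-error count over the last window exceeds what the admin considers normal. It is a sketch under assumptions: the 24 one-hour buckets, the caller-supplied hour index, and the names (ce_rate_monitor, rm_record) are invented for illustration and do not come from rasdaemon or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_BUCKETS 24              /* hypothetical: 24 one-hour buckets */

struct ce_rate_monitor {
    unsigned int buckets[WINDOW_BUCKETS];
    uint64_t current_hour;             /* hour index of the newest bucket */
    unsigned int normal_per_window;    /* what this site considers normal */
};

static void rm_init(struct ce_rate_monitor *m, unsigned int normal_per_window)
{
    assert(m != NULL);
    for (size_t i = 0; i < WINDOW_BUCKETS; i++)
        m->buckets[i] = 0;
    m->current_hour = 0;
    m->normal_per_window = normal_per_window;
}

/* Move the window forward to 'hour', clearing buckets that fell out of it. */
static void rm_advance(struct ce_rate_monitor *m, uint64_t hour)
{
    if (hour <= m->current_hour)
        return;                        /* late events land in the newest bucket */

    if (hour - m->current_hour >= WINDOW_BUCKETS) {
        /* The whole window expired: start from scratch. */
        for (size_t i = 0; i < WINDOW_BUCKETS; i++)
            m->buckets[i] = 0;
        m->current_hour = hour;
        return;
    }
    while (m->current_hour < hour) {
        m->current_hour++;
        m->buckets[m->current_hour % WINDOW_BUCKETS] = 0;
    }
}

/*
 * Record one corrected error seen during 'hour'.
 * Returns 1 if the count over the window is above the "normal" level.
 */
static int rm_record(struct ce_rate_monitor *m, uint64_t hour)
{
    unsigned int total = 0;

    rm_advance(m, hour);
    m->buckets[m->current_hour % WINDOW_BUCKETS]++;

    for (size_t i = 0; i < WINDOW_BUCKETS; i++)
        total += m->buckets[i];

    return total > m->normal_per_window;
}
```

Such a monitor only works if every corrected error reaches the OS, which is why a firmware threshold that hides errors below its own limit gets in the way of this use case.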
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> > Yes, that is correct. Corrected errors are reported to the OS when
> > they exceeded the platform's threshold.
>
> Are those thresholds user-configurable?

I suppose it'd depend on vendors, but I do not think users can do it
properly unless they have in-depth knowledge about the hardware.

> If not, what are you telling users who want to see *every* corrected
> error for measuring DIMM wear and so on...?

Corrected errors are normal and expected to occur on healthy hardware.
They do not need the user's attention until they occur repeatedly at
the same place.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, Jul 21, 2017 at 03:08:41PM +, Kani, Toshimitsu wrote:
> Yes, that is correct. Corrected errors are reported to the OS when
> they exceeded the platform's threshold.

Are those thresholds user-configurable?

If not, what are you telling users who want to see *every* corrected
error for measuring DIMM wear and so on...?

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 2017-07-21 at 15:47 +0200, Borislav Petkov wrote:
> On Fri, Jul 21, 2017 at 10:40:01AM -0300, Mauro Carvalho Chehab
> wrote:
> > What happens when the error can be corrected? Does it still report
> > it to userspace, or just silently hide the error?
> >
> > If I remember well about a past discussion with some vendor, I was
> > told that the firmware can hide some errors from being reported. Is
> > it still the case?
>
> I've heard the same thing but I have no idea what they're actually
> doing. But it would make sense because the intention is not to worry
> users unnecessarily if it can hide the error and if there are no
> adverse consequences from it.

Yes, that is correct. Corrected errors are reported to the OS when
they exceed the platform's threshold.

Thanks,
-Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, Jul 21, 2017 at 10:40:01AM -0300, Mauro Carvalho Chehab wrote:
> What happens when the error can be corrected? Does it still report it to
> userspace, or just silently hide the error?
>
> If I remember well about a past discussion with some vendor, I was told
> that the firmware can hide some errors from being reported. Is it
> still the case?

I've heard the same thing but I have no idea what they're actually
doing. But it would make sense because the intention is not to worry
users unnecessarily if it can hide the error and if there are no
adverse consequences from it.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Fri, 21 Jul 2017 15:34:41 +0200
Borislav Petkov wrote:

> On Thu, Jul 20, 2017 at 07:50:03PM +, Kani, Toshimitsu wrote:
> > GHES / firmware-first still requires OS recovery actions when an error
> > cannot be corrected by the platform. They are handled by ghes_proc(),
> > and ghes_edac remains its error-reporting wrapper.

What happens when the error can be corrected? Does it still report it to
userspace, or just silently hide the error?

If I remember well about a past discussion with some vendor, I was told
that the firmware can hide some errors from being reported. Is that
still the case?

Thanks,
Mauro
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, Jul 20, 2017 at 07:50:03PM +, Kani, Toshimitsu wrote:
> GHES / firmware-first still requires OS recovery actions when an error
> cannot be corrected by the platform. They are handled by ghes_proc(),
> and ghes_edac remains its error-reporting wrapper.

I mean all the recovery actions the firmware does because it gets to see
the error first. Otherwise, Firmware First is the dumbest repeater layer
in the history of layers.

> Firmware has better knowledge about the platform and can provide better
> RAS when implemented properly.

s/when/if/

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, 2017-07-20 at 17:15 -0300, Mauro Carvalho Chehab wrote:
> On Thu, 20 Jul 2017 19:50:03 +
> "Kani, Toshimitsu" wrote:
 :
> > Firmware has better knowledge about the platform and can provide
> > better RAS when implemented properly. I agree that user
> > experiences may vary on platforms.
>
> It may have a better knowledge, when the vendor ships different BIOS
> for platforms with different motherboard silkscreens, but a lot of
> vendors just use the same BIOS on different models, with the same
> information at "Locator" and "Bank Locator" data at DMI tables,
> that don't match what's printed at the board's silkscreen.
>
> So, GHES ends by exposing wrong data. Also, such BIOS fail
> to properly expose such knowledge to drivers/userspace.

I see. Yeah, I can see such problems could be overlooked since normal
tests run just fine even if there is a mismatch in such info...

> On the discussions I had with HP, back in 2012, the idea was to try
> to have some sort of way for the GHES driver to query the BIOS
> on a reliable way, in order to get its layout, in a way
> that tools like ras-mc-ctl would properly report the memory
> configuration (with --layout) and the motherboard silkscreen
> labels (with --print-labels). Unfortunately, at least on that
> time, the discussions with HP didn't proceed.

Thanks for the info. I hope we can enable it this time around.

-Toshi
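
The mismatch between the DMI "Locator" strings and the board's silkscreen can be illustrated with a small per-board override table, in the spirit of a labels database: look up the board and the BIOS-reported locator, and fall back to the DMI string when no correction is known. This is a hypothetical sketch; the board names, locator strings, and the dimm_label() helper are invented and are not taken from ras-mc-ctl or any real platform.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct label_override {
    const char *board;       /* DMI board/product name (made-up values) */
    const char *locator;     /* DMI "Locator" string reported by the BIOS */
    const char *silkscreen;  /* label actually printed on the motherboard */
};

/* Hypothetical corrections for a board that reuses another model's BIOS. */
static const struct label_override overrides[] = {
    { "ExampleBoard-B", "DIMM_A1", "CPU1_DIMM_1" },
    { "ExampleBoard-B", "DIMM_A2", "CPU1_DIMM_2" },
};

/*
 * Return the label to show to the user for a DIMM.
 * Without a known override, the DMI locator is the best available guess.
 */
static const char *dimm_label(const char *board, const char *locator)
{
    assert(board != NULL && locator != NULL);

    for (size_t i = 0; i < sizeof(overrides) / sizeof(overrides[0]); i++) {
        if (strcmp(overrides[i].board, board) == 0 &&
            strcmp(overrides[i].locator, locator) == 0)
            return overrides[i].silkscreen;
    }
    return locator;
}
```

The drawback is the same one raised for the platform whitelist: somebody has to maintain the table per board, which is why a reliable way to query the layout from the BIOS would be preferable.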
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, 20 Jul 2017 19:50:03 +0000, "Kani, Toshimitsu" wrote: > On Thu, 2017-07-20 at 06:33 +0200, Borislav Petkov wrote: > > On Wed, Jul 19, 2017 at 04:40:25PM +0000, Kani, Toshimitsu wrote: > > > ghes_edac allows to report errors to OS management tools like > > > rasdaemon in addition to platform-specific managements. > > > > So ghes_edac *is* a poor man's driver in the sense that it doesn't do > > anything fancy but repeat like a parrot data it has gotten from the > > firmware and shoving it into the EDAC counters. At least that's the > > intention. Nothing more. > > Right for ghes_edac. > > > All the action stuff like error detection and recovery should be done > > by the firmware. > > GHES / firmware-first still requires OS recovery actions when an error > cannot be corrected by the platform. They are handled by ghes_proc(), > and ghes_edac remains its error-reporting wrapper. > > > But considering how SNAFU'd firmware is, I wouldn't expect any great > > RAS functionality there. Of course, I'd be delighted to be proven > > wrong. > > Firmware has better knowledge about the platform and can provide better > RAS when implemented properly. I agree that user experiences may vary > on platforms. It may have better knowledge when the vendor ships a different BIOS for platforms with different motherboard silkscreens, but a lot of vendors just use the same BIOS on different models, with the same information in the "Locator" and "Bank Locator" fields of the DMI tables, which doesn't match what's printed on the board's silkscreen. So GHES ends up exposing wrong data. Also, such BIOSes fail to properly expose such knowledge to drivers/userspace. In the discussions I had with HP back in 2012, the idea was to have some way for the GHES driver to query the BIOS reliably in order to get its layout, so that tools like ras-mc-ctl would properly report the memory configuration (with --layout) and the motherboard silkscreen labels (with --print-labels). Unfortunately, at least at that time, the discussions with HP didn't proceed. Thanks, Mauro
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, 2017-07-20 at 06:33 +0200, Borislav Petkov wrote: > On Wed, Jul 19, 2017 at 04:40:25PM +0000, Kani, Toshimitsu wrote: > > ghes_edac allows to report errors to OS management tools like > > rasdaemon in addition to platform-specific managements. > > So ghes_edac *is* a poor man's driver in the sense that it doesn't do > anything fancy but repeat like a parrot data it has gotten from the > firmware and shoving it into the EDAC counters. At least that's the > intention. Nothing more. Right for ghes_edac. > All the action stuff like error detection and recovery should be done > by the firmware. GHES / firmware-first still requires OS recovery actions when an error cannot be corrected by the platform. They are handled by ghes_proc(), and ghes_edac remains its error-reporting wrapper. > But considering how SNAFU'd firmware is, I wouldn't expect any great > RAS functionality there. Of course, I'd be delighted to be proven > wrong. Firmware has better knowledge about the platform and can provide better RAS when implemented properly. I agree that user experiences may vary on platforms. Thanks, -Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, 20 Jul 2017 19:05:04 +0200, Borislav Petkov wrote: > On Thu, Jul 20, 2017 at 04:55:59PM +0000, Luck, Tony wrote: > > Add a module parameter to those edac drivers that can override the check > > and let them load anyway. I'm not paranoid, I just assume that there is a > > BIOS out there that sets the OSC/WHEA bits, but isn't generating useful GHES > > logs. > > Or add that parameter to edac_core.ko and let it control which EDAC > driver gets loaded? Something like > > edac=ignore_ghes > > or so. And then the other EDAC drivers query it. Works for me. Thanks, Mauro
RE: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
> Or add that parameter to edac_core.ko and let it control which EDAC > driver gets loaded? Something like > > edac=ignore_ghes > > or so. And then the other EDAC drivers query it. Sure ... one central place is better than adding code to each driver. -Tony
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, Jul 20, 2017 at 04:55:59PM +0000, Luck, Tony wrote: > Add a module parameter to those edac drivers that can override the check > and let them load anyway. I'm not paranoid, I just assume that there is a > BIOS out there that sets the OSC/WHEA bits, but isn't generating useful GHES logs. Or add that parameter to edac_core.ko and let it control which EDAC driver gets loaded? Something like edac=ignore_ghes or so. And then the other EDAC drivers query it. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
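[Editor's sketch] The central switch proposed above could look roughly like the following. This is a minimal, self-contained illustration and not kernel code: the function names (edac_parse_option, edac_platform_driver_may_load, edac_ghes_may_load) are hypothetical, and the only detail taken from the thread is the option value "ignore_ghes" and the idea that the other EDAC drivers query one central place.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical central flag; the thread only proposes the idea. */
static bool edac_ignore_ghes;

/*
 * Parse the value of a hypothetical "edac=" option.
 * Returns 0 if the value was recognised, -1 otherwise.
 */
int edac_parse_option(const char *val)
{
    if (val && strcmp(val, "ignore_ghes") == 0) {
        edac_ignore_ghes = true;
        return 0;
    }
    return -1;
}

/* Hardware-first platform drivers would ask this before registering. */
bool edac_platform_driver_may_load(bool firmware_first_enabled)
{
    /* No firmware-first advertised: nothing changes for existing users. */
    if (!firmware_first_enabled)
        return true;
    /* Firmware-first advertised: step aside unless the user overrides. */
    return edac_ignore_ghes;
}

/* ghes_edac would ask this before registering. */
bool edac_ghes_may_load(bool firmware_first_enabled)
{
    return firmware_first_enabled && !edac_ignore_ghes;
}
```

With this shape, exactly one of the two query functions returns true when firmware-first is advertised, so the drivers cannot both claim the memory controller, and the override lives in one place instead of in each driver.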
RE: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
>> Yes, the following message is shown on HP systems. Please note that >> WHEA is a Windows-defined interface. > > Ok, so let's couple ghes_edac loading to that and see how far we could > go. I guess we should add checks for that to the major x86 EDAC drivers > to not load and this way ghes_edac will be the only driver loading. > > Tony, how does that sound? Add a module parameter to those edac drivers that can override the check and let them load anyway. I'm not paranoid, I just assume that there is a BIOS out there that sets the OSC/WHEA bits, but isn't generating useful GHES logs. -Tony
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, Jul 20, 2017 at 02:42:25PM +0000, Kani, Toshimitsu wrote: > Yes, the following message is shown on HP systems. Please note that > WHEA is a Windows-defined interface. Ok, so let's couple ghes_edac loading to that and see how far we could go. I guess we should add checks for that to the major x86 EDAC drivers to not load and this way ghes_edac will be the only driver loading. Tony, how does that sound? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Thu, 2017-07-20 at 06:16 +0200, Borislav Petkov wrote: > On Wed, Jul 19, 2017 at 04:56:17PM +0000, Kani, Toshimitsu wrote: > > Since ghes_edac has not been used for a long time, I have a feeling > > that not so many vendors want to use it. In the case of HPE, we do > > not need to update with each platform since "HPE" "Server" will > > cover all platforms we need. > > Does the apei_osc_setup() detection with the uuid work on HP systems? Yes, the following message is shown on HP systems. Please note that WHEA is a Windows-defined interface. "GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC." Thanks, -Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, Jul 19, 2017 at 04:40:25PM +0000, Kani, Toshimitsu wrote: > ghes_edac allows to report errors to OS management tools like > rasdaemon in addition to platform-specific managements. So ghes_edac *is* a poor man's driver in the sense that it doesn't do anything fancy but repeat like a parrot data it has gotten from the firmware and shoving it into the EDAC counters. At least that's the intention. Nothing more. All the action stuff like error detection and recovery should be done by the firmware. But considering how SNAFU'd firmware is, I wouldn't expect any great RAS functionality there. Of course, I'd be delighted to be proven wrong. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, Jul 19, 2017 at 02:55:08PM -0400, Aristeu Rozanski wrote: > That would also need to keep an eye on versions. A newer version of BIOS > on a whitelisted platform might be broken. Yeah, that would be a nasty, back-stabbing SNAFU. So I'm thinking of adding a bunch of FW_ERR sanity checks to that whole ghes_edac and ghes init code to hopefully catch issues during platform validation. I.e., early enough for them to get fixed. But that's the same problem as with UEFI - vendors need to try to boot Linux on their platforms early enough. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, Jul 19, 2017 at 04:56:17PM +0000, Kani, Toshimitsu wrote: > Since ghes_edac has not been used for a long time, I have a feeling > that not so many vendors want to use it. In the case of HPE, we do not > need to update with each platform since "HPE" "Server" will cover all > platforms we need. Does the apei_osc_setup() detection with the uuid work on HP systems? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, 2017-07-19 at 14:55 -0400, Aristeu Rozanski wrote: > On Wed, Jul 19, 2017 at 06:22:04PM +0200, Borislav Petkov wrote: > > On Wed, Jul 19, 2017 at 04:10:07PM +0000, Kani, Toshimitsu wrote: > > > I do prefer to avoid any white / black listing. But I do not see > > > how it solves the buggy DMI/SMBIOS info as an example of firmware > > > bugs we may have to deal with. > > > > So how do you want to deal with this? > > > > Maintain an evergrowing whitelist of platforms which are OK and > > then the moment a new platform comes along, you send a patch to add > > it to that whitelist? > > That would also need to keep an eye on versions. A newer version of > BIOS on a whitelisted platform might be broken. Right. I think the question comes down to who broke a running system: an OS update or a BIOS update. This whitelist attempts to protect against the former case by not introducing ghes_edac on arbitrary platforms. The latter case should be the vendor's responsibility. Thanks, -Toshi
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, Jul 19, 2017 at 06:22:04PM +0200, Borislav Petkov wrote: > On Wed, Jul 19, 2017 at 04:10:07PM +0000, Kani, Toshimitsu wrote: > > I do prefer to avoid any white / black listing. But I do not see how > > it solves the buggy DMI/SMBIOS info as an example of firmware bugs we > > may have to deal with. > > So how do you want to deal with this? > > Maintain an evergrowing whitelist of platforms which are OK and then the > moment a new platform comes along, you send a patch to add it to that > whitelist? That would also need to keep an eye on versions. A newer version of BIOS on a whitelisted platform might be broken. -- Aristeu
RE: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
>> Later when GHES gives you a NODE/CARD/MODULE) in an error record. You need >> to match these up. But SMBIOS only gave you two strings "Locator" and "Bank >> Locator" which have no defined syntax. You are at the mercy of the BIOS >> writer >> to put in something parseable. > > Well, at some point it is only so much we can do, right? > > I mean, if FW says it wants to do firmware-first and we go and adhere > to that, it should be expected that said FW vendor marks the silkscreen > labels and DMI data accordingly. > > I mean, it is time for FW to put its money where its mouth is, no? > > How else would you do this? By thinking a bit more and realizing that what I wrote up above misses that at byte offset 78 in the UEFI memory error section there is "Module Handle" which tells you which SMBIOS entry to use. So this should work just fine (as long as BIOS fills out all these fields ... there's a "Validation Bits" mask at the start of the error structure that says which fields have been populated). -Tony
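[Editor's sketch] The lookup Tony describes can be illustrated by reading the handle straight out of a raw memory error section. This is a minimal sketch under stated assumptions, not the kernel's CPER parser: the byte offset 78 comes from the message above, while the little-endian layout, the 64-bit "Validation Bits" field at offset 0, and the use of bit 17 to flag "Module Handle" are assumptions that should be checked against the UEFI specification. The function name mem_err_module_handle is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of "Module Handle", as stated in the message above. */
#define MEM_ERR_MODULE_HANDLE_OFFSET 78
/* Smallest section length that still contains the 16-bit handle. */
#define MEM_ERR_MIN_LEN 80
/* Assumption: validation bit 17 marks "Module Handle" as populated. */
#define MEM_ERR_VALID_MODULE_HANDLE (UINT64_C(1) << 17)

/* Read a little-endian 64-bit value without alignment assumptions. */
static uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Read a little-endian 16-bit value without alignment assumptions. */
static uint16_t read_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Extract the SMBIOS handle of the failing module from a raw memory
 * error section. Returns false if the section is too short or if the
 * firmware did not mark the field as valid.
 */
bool mem_err_module_handle(const uint8_t *sec, size_t len, uint16_t *handle)
{
    assert(sec != NULL && handle != NULL);

    if (len < MEM_ERR_MIN_LEN)
        return false;
    if (!(read_le64(sec) & MEM_ERR_VALID_MODULE_HANDLE))
        return false;

    *handle = read_le16(sec + MEM_ERR_MODULE_HANDLE_OFFSET);
    return true;
}
```

The returned handle would then be matched against the handle of an SMBIOS memory device entry, which is what removes the need to parse the free-form "Locator" strings. Checking the validation bit first matters because a BIOS that leaves the field unpopulated would otherwise yield a bogus handle.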
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, 2017-07-19 at 18:22 +0200, Borislav Petkov wrote: > On Wed, Jul 19, 2017 at 04:10:07PM +0000, Kani, Toshimitsu wrote: > > I do prefer to avoid any white / black listing. But I do not see > > how it solves the buggy DMI/SMBIOS info as an example of firmware > > bugs we may have to deal with. > > So how do you want to deal with this? > > Maintain an evergrowing whitelist of platforms which are OK and then > the moment a new platform comes along, you send a patch to add it to > that whitelist? > > I'm sure you can see the problems with that approach. Since ghes_edac has not been used for a long time, I have a feeling that not so many vendors want to use it. In the case of HPE, we do not need to update with each platform since "HPE" "Server" will cover all platforms we need. Thanks, -Toshi
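[Editor's sketch] A vendor-level whitelist of the kind discussed here can be as small as one table entry. The following is an illustrative, self-contained sketch rather than the patch under review: the struct and function names are hypothetical, the caller is assumed to pass already-trimmed OEM strings (real ACPI table headers pad these fields with spaces), and only the "HPE" / "Server" pair comes from the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One whitelisted platform family, keyed on ACPI OEM identification. */
struct platform_id {
    const char *oem_id;       /* ACPI OEM ID, e.g. "HPE" */
    const char *oem_table_id; /* ACPI OEM table ID, e.g. "Server" */
};

/* Only the "HPE"/"Server" pair comes from the thread. */
static const struct platform_id ghes_edac_whitelist[] = {
    { "HPE", "Server" },
};

/* Return true if the platform is explicitly known to want ghes_edac. */
bool ghes_edac_platform_allowed(const char *oem_id, const char *oem_table_id)
{
    size_t n = sizeof(ghes_edac_whitelist) / sizeof(ghes_edac_whitelist[0]);

    if (!oem_id || !oem_table_id)
        return false;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(oem_id, ghes_edac_whitelist[i].oem_id) == 0 &&
            strcmp(oem_table_id, ghes_edac_whitelist[i].oem_table_id) == 0)
            return true;
    }
    return false;
}
```

Matching on the vendor-wide pair rather than on individual models is what keeps the list from growing with every new platform, which is the maintenance concern Boris raises above.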
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Tue, 2017-07-18 at 18:15 -0300, Mauro Carvalho Chehab wrote: > On Tue, 18 Jul 2017 19:58:54 +0000 : > We had a similar discussion several years ago when I wrote this > driver. At that time, I talked with Red Hat, HP, Dell, Intel people > and with some customers with large clusters. > > The way it is, ghes_edac is a poor man's driver. What it hopefully > provides is a detection that an error happened, without really telling > the user what component should be replaced. "poor man's driver" is a bit misleading, but yes, firmware-first platforms have RAS features built into the platforms, and they do not need intelligence in EDAC drivers, which may conflict with the platform's RAS features. I cannot speak for other vendors, but HPE platforms log errors and provide FRU info. ghes_edac allows reporting errors to OS management tools like rasdaemon, in addition to platform-specific management tools. > Ok, on machines with their own error reporting mechanism (like > HP servers), a sys admin can look at some proprietary software > (or bios), in order to identify what happened. > > Yet, BIOS doesn't provide any clue about what's the memory > architecture, as it maps memory as if it was a single DIMM memory: > > (from ghes_edac_register) > > layers[0].type = EDAC_MC_LAYER_ALL_MEM; > layers[0].size = num_dimm; > layers[0].is_virt_csrow = true; > > So, even on systems where the BIOS actually knows how the memory > cards are wired, it will mask the memory controller data. > > Now, the EDAC driver can also be used to identify what > channels are used. That helps the sys admin to know if the > memories are connected in a way that it will be using multiple > channels, or not, helping to set up the machine to obtain > the maximum possible performance.
>
> So, for example, on my Intel-based HP server, I can check such info with:
>
> $ ras-mc-ctl --mainboard
> ras-mc-ctl: mainboard: HP model ProLiant ML350 Gen9
> $ ras-mc-ctl --layout
>        +-----------------------------------+-----------------------------------+
>        |                mc0                |                mc1                |
>        | channel0  | channel1  | channel2  | channel0  | channel1  | channel2  |
> -------+-----------------------------------+-----------------------------------+
> slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
> slot1: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
> slot0: | 16384 MB  |     0 MB  | 16384 MB  | 16384 MB  |     0 MB  | 16384 MB  |
> -------+-----------------------------------+-----------------------------------+
>
> So, I know that both CPUs will be connected to my memories, and,
> on both, it is using 2 channels.
>
> If I was using the ghes driver, that information would be hidden.
>
> So, due to all problems with ghes, it is enabled only if there are no
> better solution, e. g. on systems where there's no way to talk
> directly to the hardware (like on E7 Xeon machines, where the memory
> controller is actually on a separate chip that are controlled only by
> the BIOS).

Thanks for the info! That's very helpful. I will check to see if ghes_edac provides enough info that we need. -Toshi
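[Editor's sketch] The difference between the single-layer mapping quoted from ghes_edac_register and a hardware-first driver's layout can be shown with a small stand-alone model. This is an illustration only and does not use the kernel's EDAC API: the enum, struct, and function names are hypothetical stand-ins, and the 3 channels by 3 slots per controller are read off the ras-mc-ctl output quoted in this message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's layer types. */
enum mc_layer_type { LAYER_CHANNEL, LAYER_SLOT, LAYER_ALL_MEM };

struct mc_layer {
    enum mc_layer_type type;
    unsigned int size; /* number of entries at this layer */
};

/* Total DIMM positions described by a layer hierarchy. */
unsigned int mc_total_dimms(const struct mc_layer *layers, size_t n)
{
    unsigned int total = 1;

    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        total *= layers[i].size;
    return total;
}

/* True if the layout lets user space tell channels apart. */
bool mc_exposes_channels(const struct mc_layer *layers, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (layers[i].type == LAYER_CHANNEL)
            return true;
    }
    return false;
}
```

A flat layout { LAYER_ALL_MEM, 9 } and a hierarchical layout { LAYER_CHANNEL, 3 }, { LAYER_SLOT, 3 } both describe nine DIMM positions, but only the second one carries the channel information that a tool needs to print a per-channel table. That is the information Mauro says is hidden when the ghes driver is used.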
Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
On Wed, Jul 19, 2017 at 04:10:07PM +0000, Kani, Toshimitsu wrote: > I do prefer to avoid any white / black listing. But I do not see how > it solves the buggy DMI/SMBIOS info as an example of firmware bugs we > may have to deal with. So how do you want to deal with this? Maintain an evergrowing whitelist of platforms which are OK and then the moment a new platform comes along, you send a patch to add it to that whitelist? I'm sure you can see the problems with that approach. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. --