Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

2018-04-19 Thread Borislav Petkov
On Thu, Apr 19, 2018 at 10:11:03AM -0500, Alex G. wrote:
> There is value in this. From my observations, fw claims it will do
> everything through FFS, yet fails to fully handle the situation. It's
> rooted in FW's assumptions about OS behavior. Because the (old) versions
> of windows, esxi, and rhel used during development crash, fw assumes
> that _all_ OSes crash. The result in a surprising majority of cases is
> that FFS doesn't properly handle recurring errors, and fw is, in fact,
> broken.

So FW being broken is a social secret. But we don't care. We have tried,
nothing happens. No one moves. The crack monkeys which program it have
long moved to the next release and you hear crap like, "we don't support
linux" and other bullshit.

What we do now is to try to make the best of it - we either can handle
an error *without* firmware's help or we panic. If we can recover from
it, let's do that without screaming about something the user can't deal
with anyway.

All those FW_ERR printks cause nothing but expensive support calls, the
outcome of which is nothing. Just a lot of money down the drain.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

2018-04-19 Thread Alex G.


On 04/18/2018 12:54 PM, Borislav Petkov wrote:
> On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote:

(snip)
>> +
>> +corrected_sev = max(corrected_sev, sec_sev);
>> +}
>> +
>> +if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
>> +pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were 
>> able to correct");
>> +pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
> 
> No, I don't want any of that crap issuing stuff in dmesg and then people
> opening bugs and running around and trying to replace hardware.
> 
> We either can handle the error and log a normal record somewhere or we
> cannot and explode.

There is value in this. From my observations, fw claims it will do
everything through FFS, yet fails to fully handle the situation. It's
rooted in FW's assumptions about OS behavior. Because the (old) versions
of windows, esxi, and rhel used during development crash, fw assumes
that _all_ OSes crash. The result in a surprising majority of cases is
that FFS doesn't properly handle recurring errors, and fw is, in fact,
broken.

> The complaining about the FW doesn't bring shit.

You are correct. It doesn't bring defecation. It brings a red flag that
helps people get closer to the root cause of problems.

That being said, I can just drop this patch.

Alex



Re: [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

2018-04-18 Thread Borislav Petkov
On Mon, Apr 16, 2018 at 04:59:03PM -0500, Alexandru Gagniuc wrote:
> There seems to be a culture amongst BIOS teams to want to crash the
> OS when an error can't be handled in firmware. Marking GHES errors as
> "fatal" is a very common way to do this.
> 
> However, a number of errors reported by GHES may be fatal in the sense
> a device or link is lost, but are not fatal to the system. When there
> is a disagreement with firmware about the handleability of an error,
> print a warning message.
> 
> Signed-off-by: Alexandru Gagniuc 
> ---
>  drivers/acpi/apei/ghes.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index e0528da4e8f8..6a117825611d 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const 
> guid_t *type)
>  static void ghes_do_proc(struct ghes *ghes,
>const struct acpi_hest_generic_status *estatus)
>  {
> - int sev, sec_sev;
> + int sev, sec_sev, corrected_sev;
>   struct acpi_hest_generic_data *gdata;
>   const struct ghes_handler *handler;
>   guid_t *sec_type;
>   guid_t *fru_id = &NULL_UUID_LE;
>   char *fru_text = "";
>  
> + corrected_sev = GHES_SEV_NO;
>   sev = ghes_severity(estatus->error_severity);
>   apei_estatus_for_each_section(estatus, gdata) {
>   sec_type = (guid_t *)gdata->section_type;
> @@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes,
>  sec_sev, err,
>  gdata->error_data_length);
>   }
> +
> + corrected_sev = max(corrected_sev, sec_sev);
> + }
> +
> + if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
> + pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were 
> able to correct");
> + pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");

No, I don't want any of that crap issuing stuff in dmesg and then people
opening bugs and running around and trying to replace hardware.

We either can handle the error and log a normal record somewhere or we
cannot and explode. The complaining about the FW doesn't bring shit.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal"

2018-04-16 Thread Alexandru Gagniuc
There seems to be a culture amongst BIOS teams to want to crash the
OS when an error can't be handled in firmware. Marking GHES errors as
"fatal" is a very common way to do this.

However, a number of errors reported by GHES may be fatal in the sense
a device or link is lost, but are not fatal to the system. When there
is a disagreement with firmware about the handleability of an error,
print a warning message.

Signed-off-by: Alexandru Gagniuc 
---
 drivers/acpi/apei/ghes.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index e0528da4e8f8..6a117825611d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -535,13 +535,14 @@ static const struct ghes_handler *get_handler(const 
guid_t *type)
 static void ghes_do_proc(struct ghes *ghes,
 const struct acpi_hest_generic_status *estatus)
 {
-   int sev, sec_sev;
+   int sev, sec_sev, corrected_sev;
struct acpi_hest_generic_data *gdata;
const struct ghes_handler *handler;
guid_t *sec_type;
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";
 
+   corrected_sev = GHES_SEV_NO;
sev = ghes_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_type = (guid_t *)gdata->section_type;
@@ -563,6 +564,13 @@ static void ghes_do_proc(struct ghes *ghes,
   sec_sev, err,
   gdata->error_data_length);
}
+
+   corrected_sev = max(corrected_sev, sec_sev);
+   }
+
+   if ((sev >= GHES_SEV_PANIC) && (corrected_sev < sev)) {
+   pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were 
able to correct");
+   pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
}
 }
 
-- 
2.14.3