For X11SCA owners who might find this thread in the future:

Test results:

I did some more exhaustive testing with memtest86 and individual
DIMMs installed. These DIMMs were all SK Hynix HMA82GU7CJR8N-VK,
purchased directly from Supermicro around 2019-2020 in one batch.

All four original DIMMs showed problems to varying degrees. Two of
them had roughly 20 errors each in one full pass of memtest86. A
third had about two errors in a full pass. In all three of those
cases, a number of "smbios 0x02" events showed up in the event log
visible from the BIOS setup screens.

The fourth original DIMM had no errors in one full pass of memtest86,
but generated a few events in the smbios log.

I got two new no-name ECC DIMMs and tested them. No errors in one
full pass of memtest86, and no smbios events logged.

Future monitoring:

I'm still dismayed that FreeBSD doesn't seem to notice/report these
ECC events. I have not yet had a chance to exhaustively search the
BIOS setup screens for a setting that might enable signaling the OS.

However, I noticed that "dmidecode" reports information about the
"System Event Log". I found the published SMBIOS specification
(see, for example, https://www.dmtf.org/standards/smbios) that
describes the format of the System Event Log.

I wrote a simple Perl script to open /dev/mem, seek to the start
address, and read the log area, and got back what looked like a
valid event log. It should be straightforward to parse the log entries
and discover ECC events, so I can build a monitoring solution for
this motherboard.
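
For anyone who wants to try the same thing, here is a minimal sketch
of the parsing step, written in Python rather than Perl. It is not the
script described above. It assumes the record layout I understand from
the SMBIOS spec's System Event Log (Type 15) section: one byte of event
type, one byte of length (low 7 bits), six BCD bytes of date/time
(year, month, day, hour, minute, second), then optional variable data,
with type 0xFF marking the end of the log. It also assumes types 0x01
and 0x02 are single-bit and multi-bit ECC errors; whether the "0x02"
on the BIOS screen is that same code is not confirmed here. The names
LogRecord, parse_event_log, read_log_area and ecc_events are made up
for illustration.

```python
from typing import List, NamedTuple

# Event type codes as read from the SMBIOS spec (assumption: verify
# against the spec version your BIOS reports).
SINGLE_BIT_ECC = 0x01
MULTI_BIT_ECC = 0x02
END_OF_LOG = 0xFF
RECORD_HEADER_LEN = 8  # type + length + 6 BCD date/time bytes


class LogRecord(NamedTuple):
    event_type: int
    timestamp: str   # "YY-MM-DD hh:mm:ss", decoded from BCD
    data: bytes      # variable data following the 8-byte header


def bcd(byte: int) -> int:
    """Decode one packed-BCD byte, e.g. 0x51 -> 51."""
    return (byte >> 4) * 10 + (byte & 0x0F)


def parse_event_log(area: bytes, data_start: int = 0) -> List[LogRecord]:
    """Walk the records in a log area, starting at data_start.

    data_start corresponds to the "Data Start Offset" that dmidecode
    prints for the System Event Log.
    """
    records = []
    pos = data_start
    while pos + RECORD_HEADER_LEN <= len(area):
        event_type = area[pos]
        if event_type == END_OF_LOG:
            break
        # The top bit of the length byte is a "record read" flag.
        length = area[pos + 1] & 0x7F
        if length < RECORD_HEADER_LEN or pos + length > len(area):
            break  # malformed or truncated record; stop rather than guess
        yy, mo, dd, hh, mi, ss = (bcd(b) for b in area[pos + 2:pos + 8])
        stamp = f"{yy:02d}-{mo:02d}-{dd:02d} {hh:02d}:{mi:02d}:{ss:02d}"
        records.append(LogRecord(event_type, stamp,
                                 bytes(area[pos + 8:pos + length])))
        pos += length
    return records


def ecc_events(records: List[LogRecord]) -> List[LogRecord]:
    """Keep only the records whose type is an ECC memory error."""
    return [r for r in records
            if r.event_type in (SINGLE_BIT_ECC, MULTI_BIT_ECC)]


def read_log_area(address: int, length: int, dev: str = "/dev/mem") -> bytes:
    """Read the raw log area; needs root, and the address and length
    must come from dmidecode ("Access Address", "Area Length")."""
    with open(dev, "rb") as f:
        f.seek(address)
        return f.read(length)
```

Usage would be along the lines of
ecc_events(parse_event_log(read_log_area(addr, area_len), data_off)),
with addr, area_len and data_off taken from the dmidecode output. Run
periodically, a change in the number of ECC records could trigger an
alert.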

"G. Paul Ziemba" writes:

>Bob,

>thanks for your suggestions.

>The motherboard is a plain X11SCA (no -F ipmi)

>I don't know of a way to read the power supply voltages in software
>while FreeBSD is running, but I did reboot into the BIOS setup and
>read voltages there, and they look normal to me:

>    VCPU:       1.136
>    VDIMM:      1.224
>    12V:       12.233
>    5VCC:       5.184
>    3.3V_DL:    3.327
>    3.3VCC:     3.424
>    VSB:        3.328
>    VBAT:       3.104
>    VCC1_8_DL_PCM:      1.816

>The BIOS versions are given as:

>    "ver 1.2 Build Date 12/5/19" near the top of the screen; and
>    "version 2.19.0045 (c) [AMI]" at the bottom of the screen

>I didn't see a setting that (apparently to me) might control how
>events might be filtered, but there WAS an event log that had
>completely filled up with messages of the form:

>    <datetime> smbios 0x02 DIMMB1

>with many for DIMMB1 and DIMMB2. I haven't found any documentation yet
>of "0x02" other than a few online posts calling it either a single-bit
>or a multi-bit ECC memory error.

>I'm still favoring a diagnosis of two bad DIMMs; I just wish there were
>a way to cause these errors to show up in FreeBSD somewhere so I could
>detect them on a running system.

-- 
G. Paul Ziemba
FreeBSD unix:
10:51AM  up 1 day, 13:34, 17 users, load averages: 0.35, 0.25, 0.26
