On 2/1/2026 15:30, Bob Bishop wrote:
Hi,

On 1 Feb 2026, at 16:35, G. Paul Ziemba <[email protected]> wrote:

OS: 14.2-STABLE as of 250403

I seem to have at least one bad ECC DIMM
Check the power supply voltages are within tolerance if you haven’t already.

and was expecting to see MCA
messages in /var/log/messages or on the console, which I have recently
redirected to /var/log/console.log via syslog.conf:

    console.info /var/log/console.log

but I can't find anything in any of my logs. Why am I not seeing them?
If you have the -F variant of the board that supports IPMI, it may be that the
BMC is capturing the errors, so check the BMC event log. Possibly there is a
setting on the BMC to control what gets passed to MCA.

Also check the BIOS event logging; I don’t see settings in the BIOS to control 
MCA events.

And check the BIOS version is up to date.

Background:

Motherboard: Supermicro X11SCA
CPU: Xeon E-2176G
Chipset: C246
Memory: 4x SK Hynix HMA82GU7CJR8N-VK (16GB ECC)

The BIOS reports ECC on its startup screen and dmidecode reports

    Total Width: 72 bits
    Data Width: 64 bits

for each of the DIMMs.
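
A minimal sketch of the width arithmetic behind that reading, in Python. The function name check_bits is hypothetical; the sample text is the two dmidecode lines quoted above. Note that 8 extra bits only shows the modules carry ECC storage; it does not by itself show that the memory controller is correcting or reporting errors.

```python
import re

def check_bits(memory_device_text):
    """Return Total Width minus Data Width, in bits, from dmidecode type 17 text."""
    total = re.search(r"Total Width:\s*(\d+) bits", memory_device_text)
    data = re.search(r"Data Width:\s*(\d+) bits", memory_device_text)
    if total is None or data is None:
        raise ValueError("width fields not found")
    return int(total.group(1)) - int(data.group(1))

# The two lines quoted in the message.
sample = """
    Total Width: 72 bits
    Data Width: 64 bits
"""
print(check_bits(sample))  # 8
```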

Amanda started reporting checksum errors on large backup files in its
holding disk. I discovered that a large file (200GB) on any of three
disks on this system yields different sha512sum values every time I
run it on the same file. SMART data looks OK on all disks.
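
The repeat-and-compare check described above can be sketched in Python as follows. This is an illustration under assumptions, not the command actually used (that was sha512sum); the function names and the command-line path argument are hypothetical.

```python
import hashlib
import sys

def sha512_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a very large file never has to fit in RAM."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def repeated_digests(path, runs=3):
    """Hash the same file several times and return every digest."""
    return [sha512_of(path) for _ in range(runs)]

def is_stable(digests):
    """True when every run produced the same digest."""
    return len(set(digests)) == 1

if __name__ == "__main__" and len(sys.argv) > 1:
    results = repeated_digests(sys.argv[1])
    for digest in results:
        print(digest)
    print("stable" if is_stable(results) else "MISMATCH between runs")
```

An unchanged file should give the same digest on every run, so a mismatch points at the read path (disk, controller, RAM or CPU) rather than at the file contents.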

memtest86+ finds three bad spots in memory, at 42G, 47G and 53G. I have
4x16GB DIMMs installed, so I think that corresponds to two bad DIMMs.
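
The arithmetic behind "two bad DIMMs" can be sketched as below. It assumes a purely linear layout in which each 16 GiB module covers one contiguous address range. That is an assumption: with channel interleaving or remapping around the PCI hole, the real address-to-module mapping may differ, so the indices are not slot labels. The name dimm_index is hypothetical.

```python
DIMM_SIZE_GIB = 16  # from the message: 4x16GB installed

def dimm_index(addr_gib, dimm_size_gib=DIMM_SIZE_GIB):
    """Zero-based module index, ASSUMING a linear, non-interleaved address map."""
    return int(addr_gib // dimm_size_gib)

# Failing addresses reported by memtest86+, in GiB.
bad_spots_gib = [42, 47, 53]

# 42 and 47 fall in the 32-48 GiB range, 53 falls in the 48-64 GiB range.
suspects = sorted({dimm_index(a) for a in bad_spots_gib})
print(suspects)  # [2, 3]
```

Under that assumption two modules are implicated; swapping modules between slots and re-running memtest86+ would show whether the failing addresses follow the modules.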

    % sysctl hw.mca
    hw.mca.cmc_throttle: 60
    hw.mca.force_scan: 0
    hw.mca.interval: 300
    hw.mca.maxcount: -1
    hw.mca.count: 0
    hw.mca.erratum383: 0
    hw.mca.intel6h_HSD131: 0
    hw.mca.amd10h_L1TP: 1
    hw.mca.log_corrected: 1
    hw.mca.enabled: 1
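
For reading that output programmatically, a small Python sketch follows. The function name parse_sysctl is hypothetical, the sample is a subset of the lines quoted above, and the interpretation of hw.mca.count as the number of machine check records logged so far is my understanding of the sysctl, not something stated in the message.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines from sysctl hw.mca output into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        # Skip the shell prompt line and anything that is not an hw.mca entry.
        if sep and name.startswith("hw.mca."):
            values[name] = int(value)
    return values

sample = """
    % sysctl hw.mca
    hw.mca.interval: 300
    hw.mca.maxcount: -1
    hw.mca.count: 0
    hw.mca.log_corrected: 1
    hw.mca.enabled: 1
"""

mca = parse_sysctl(sample)
# Enabled and set to log corrected errors, yet zero records so far.
print(mca["hw.mca.enabled"], mca["hw.mca.log_corrected"], mca["hw.mca.count"])  # 1 1 0
```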

Thanks for any insights.
--
G. Paul Ziemba
FreeBSD unix:
8:31AM  up 2 days, 14:38, 11 users, load averages: 0.71, 0.43, 0.39

I have one of these boards in a server here:

Platform Firmware Information
        Vendor: American Megatrends Inc.
        Version: 2.5
        Release Date: 06/14/2024
        Address: 0xF0000
        Runtime Size: 64 KiB
        ROM Size: 32 MiB
....

Base Board Information
        Manufacturer: Supermicro
        Product Name: X11SCA-F
        Version: 1.01A

....

Handle 0x002F, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x001F
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16 GiB
        Form Factor: DIMM
        Set: None
        Locator: DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2667 MT/s
        Manufacturer: SK Hynix
        Serial Number: 7474963A
        Asset Tag: 9876543210
        Part Number: HMA82GU7CJR8N-VK
        Rank: 2
        Configured Memory Speed: 2667 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Not Specified
        Module Manufacturer ID: Bank 1, Hex 0xAD
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 16 GiB
        Cache Size: None
        Logical Size: None

.... (and likewise for the other 3, 64 GB total)

I have a Mellanox card in this one for 10G/40G networking, along with a bunch of SAS/SATA expansion. It gets fairly heavy use as a "hot standby" Postgres database box, for general file service, as a video server, and as a build system for various distributions I use in other contexts.

Check the RAM part numbers against what Supermicro specifies as "approved". These boards are extremely picky in that regard, but typically, if the part numbers are not what they want, they will refuse to POST straight up rather than do hinky stuff.

This board has been in service here for quite a long time (many years), and I believe I am running the most recent BIOS. I've not had any trouble with ECC errors being logged, nor any sort of data corruption, crashes or other misbehavior that could be attributed to that sort of issue.

I'm extremely allergic to RAM issues due to the machine supporting a quite-large amount of storage on ZFS and the need for it to be rock-solid reliable.  It is.

The CPU I have in mine is an E-2146G.

I'm on 14.3-STABLE (compiled from source fairly recently as I keep up with security and potential driver issues that might impact me.)

--
Karl Denninger
[email protected]
/The Market Ticker/
/[S/MIME encrypted email preferred]/

