Re: NetBSD and ECC RAM?

2024-03-10 Thread Aaron J. Grier
On Thu, Feb 29, 2024 at 10:55:14AM -, Michael van Elst wrote:
> The OS could be smart, lock out bad memory regions, recover some
> errors by e.g. paging in text data again or even use mirrored RAM
> (with motherboard support).

IIRC Intel Ice Lake introduced mechanisms to enable kernels to recover
from poison data situations, but I don't know how far this has been
implemented.  Ideally an app could be given some sort of notification
about poisoned data instead of the kernel blindly panicking.

> >A lot of fragile chipset specific code to get that.
> 
> Indeed.

There's an expectation that the platform-specific bits would be abstracted
for now through ACPI, and eventually codified into a hardware RAS
controller with a standardized driver attached either as a PCIe function
or ACPI-discovered MMIO space.  Part of EDAC is not only getting
notifications of the errors, but being able to do mapping of physical
addresses back to physical components (DIMMs or CXL devices) so you know
what to replace or block.
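
As a rough illustration of the address-mapping half, here is a minimal
sketch in Python.  The table layout, the DIMM labels, and the function
name locate_dimm are all hypothetical; real memory controllers interleave
addresses across channels and ranks, so an actual driver would have to
decode chipset-specific registers rather than look up flat ranges.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Region(NamedTuple):
    start: int   # first physical address covered (inclusive)
    end: int     # last physical address covered (inclusive)
    label: str   # hypothetical silkscreen label of the component


GIB = 1 << 30

# Hypothetical, non-interleaved layout: two 8 GiB DIMMs, sorted by start.
REGIONS = [
    Region(0, 8 * GIB - 1, "DIMM_A1"),
    Region(8 * GIB, 16 * GIB - 1, "DIMM_B1"),
]
STARTS = [r.start for r in REGIONS]


def locate_dimm(paddr: int) -> Optional[str]:
    """Return the label of the component backing paddr, or None."""
    # Find the last region whose start is <= paddr, then range-check it.
    i = bisect_right(STARTS, paddr) - 1
    if i >= 0 and REGIONS[i].start <= paddr <= REGIONS[i].end:
        return REGIONS[i].label
    return None
```

With a table like this, an error report carrying a physical address can
be turned into "replace DIMM_B1" instead of a bare number.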

-- 
  Aaron J. Grier | "Not your ordinary poofy goof." | agr...@poofygoof.com
  "The price of reliability is the pursuit of the utmost simplicity.  It
   is a price which the very rich find most hard to pay."  -- Tony Hoare


Re: NetBSD and ECC RAM?

2024-02-29 Thread Michael van Elst
kevin.bowl...@kev009.com (Kevin Bowling) writes:

>Servers tend to have BMCs, so you can execute 'ipmitool sensor' and
>'ipmitool sel elist' to get the information out.

ECC information is usually not provided by sensors. ECC errors may
be listed in the SEL, but even this usually occurs only when some
undocumented limit is reached. Often the messages also do not indicate
the memory module that produced the error.


>Linux has the 'EDAC' subsystem but I don't think it gains you so much
>if you have a BMC.

It gives you the data from the ECC circuits, immediately. So data is
no longer hidden by the BMC, you get precise information and you can
apply your own policies for e.g. replacing memory modules or migrating
services to other hardware.
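
To make "apply your own policies" a bit more concrete, here is a minimal
sketch in Python: count corrected errors per module within a sliding time
window and flag a module once it crosses a threshold.  The class name,
the threshold, and the window are hypothetical and chosen only for
illustration; this assumes some driver already delivers per-module
events, which NetBSD does not do today.

```python
from collections import defaultdict, deque


class CorrectedErrorPolicy:
    """Flag a module when it logs too many corrected errors in a window."""

    def __init__(self, threshold: int = 10, window_s: float = 86400.0):
        self.threshold = threshold      # hypothetical limit per window
        self.window_s = window_s        # hypothetical window, in seconds
        self._events = defaultdict(deque)

    def record(self, dimm: str, timestamp: float) -> bool:
        """Record one corrected error; return True if dimm should be flagged."""
        q = self._events[dimm]
        q.append(timestamp)
        # Forget events that have fallen out of the sliding window.
        while q and timestamp - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.threshold
```

The point is that the limit is visible and under the administrator's
control, rather than an undocumented one inside the BMC.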

The OS could be smart, lock out bad memory regions, recover some
errors by e.g. paging in text data again or even use mirrored RAM
(with motherboard support).
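
The recovery options above can be sketched as a small decision function.
This is only an illustration in Python with hypothetical names (Page,
Action, handle_uncorrectable); a real kernel would make this decision
inside its VM system and would need far more state than shown here.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    RELOAD_FROM_BACKING_STORE = auto()  # e.g. page in text again
    USE_MIRROR = auto()                 # read the mirrored copy
    KILL_OWNER = auto()                 # data is lost; notify or kill the owner


@dataclass
class Page:
    frame: int              # physical frame number
    file_backed: bool       # contents can be re-read from disk
    dirty: bool             # modified since it was read
    mirrored: bool = False  # a hardware mirror copy exists


# Frames that must never be handed out again.
retired_frames = set()


def handle_uncorrectable(page: Page) -> Action:
    """Pick a recovery action for an uncorrectable error in page."""
    retired_frames.add(page.frame)      # lock out the bad region
    if page.mirrored:
        return Action.USE_MIRROR
    if page.file_backed and not page.dirty:
        return Action.RELOAD_FROM_BACKING_STORE
    return Action.KILL_OWNER
```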


>A lot of fragile chipset specific code to get that.

Indeed.


Greetings,



Re: NetBSD and ECC RAM?

2024-02-29 Thread Kevin Bowling
On Mon, Feb 19, 2024 at 12:19 AM Michael van Elst wrote:
>
> michael.chepo...@gmail.com (Michael Cheponis) writes:
>
> >I've been running ECC in the Windows box for years, it seems like a 'no
> >brainer' for servers. Servers usually run for years, and Stuff Happens over
> >the years [1].
> >But I'd prefer a reliable, unhackable, trustable compute fabric.  ECC is
> >part of the 'reliable' part.
>
> I agree, but the "box" will run with ECC, even when the OS doesn't
> know about it. OS support is needed to get information about errors
> and for better fault tolerance.

Servers tend to have BMCs, so you can execute 'ipmitool sensor' and
'ipmitool sel elist' to get the information out.
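
A minimal sketch in Python of pulling ECC-related entries out of the SEL.
The helper names ecc_lines and read_sel are hypothetical, and the filter
makes no assumption about the output layout beyond "one event per line
that mentions ECC", since the exact wording varies by BMC vendor.

```python
import subprocess


def ecc_lines(lines):
    """Keep only the lines that mention ECC (case-insensitive)."""
    return [ln.rstrip("\n") for ln in lines if "ecc" in ln.lower()]


def read_sel():
    """Run `ipmitool sel elist` and return its output as a list of lines."""
    out = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True,
        text=True,
        check=True,   # raise if ipmitool fails or no BMC is reachable
    )
    return out.stdout.splitlines()
```

On a machine with a BMC and ipmitool installed, ecc_lines(read_sel())
would return whatever memory events the BMC chose to log.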

Linux has the 'EDAC' subsystem but I don't think it gains you so much
if you have a BMC.  It gives you kernel printfs for some errors and
character drivers to do userspace development, and it would support
systems without BMCs.  That takes a lot of fragile chipset-specific code.

>
> >I would also like to see per /dev entry ACLs.  I would like to see better
> >security than owner-group-everybody permissions.
>
> I have rarely seen ACLs being used for "better security". Even when that
> was possible, the complexity usually outweighed any gain in control.
>
> Systems that implied access control through simple rules worked much
> better. It's still not a feature that you just enable or a switch
> you toggle; it requires constant effort, in particular on systems
> that don't just perform a fixed set of functions.
>


Re: NetBSD and ECC RAM?

2024-02-18 Thread Michael van Elst
michael.chepo...@gmail.com (Michael Cheponis) writes:

>I've been running ECC in the Windows box for years, it seems like a 'no
>brainer' for servers. Servers usually run for years, and Stuff Happens over
>the years [1].
>But I'd prefer a reliable, unhackable, trustable compute fabric.  ECC is
>part of the 'reliable' part.

I agree, but the "box" will run with ECC, even when the OS doesn't
know about it. OS support is needed to get information about errors
and for better fault tolerance.


>I would also like to see per /dev entry ACLs.  I would like to see better
>security than owner-group-everybody permissions.

I have rarely seen ACLs being used for "better security". Even when that
was possible, the complexity usually outweighed any gain in control.

Systems that implied access control through simple rules worked much
better. It's still not a feature that you just enable or a switch
you toggle; it requires constant effort, in particular on systems
that don't just perform a fixed set of functions.



Re: NetBSD and ECC RAM?

2024-02-18 Thread Michael Cheponis
I've been running ECC in the Windows box for years, it seems like a 'no
brainer' for servers. Servers usually run for years, and Stuff Happens over
the years [1].

Most of the computing industry has been hell-bent on performance, yielding
impressive gains (albeit with occasional setbacks:
https://cachewarpattack.com/ )

But I'd prefer a reliable, unhackable, trustable compute fabric.  ECC is
part of the 'reliable' part.

I would also like to see per /dev entry ACLs.  I would like to see better
security than owner-group-everybody permissions.  I would like to see almost
no normal system operations requiring root privs - and I would like to see
root privs made much more narrow and fine-grained in scope - only large
enough to do the specific job (e.g. change file permission, with a separate
capability to change file ownership; etc).

I'm certainly no computer security guru, nor do I have any valid opinions
except as a luser.

Still --- I would like to see some performance gains "wasted" in order to
gain more reliable, unhackable, trustable systems.


Thanks for tolerating my mini-soapbox.
-Mike

[1] I recently had a NetBSD server start crashing at random, until I
tried to boot it one more time and it wouldn't come up at all.  Only
after cleaning everything, making sure the disks were OK, and trying
again with no luck did I stare at the MB and see the electrolytic
caps' tops bulging out!  My rule: Never trust HW completely.  It will
fail.  Eventually.


On Fri, Feb 16, 2024 at 7:09 AM Hauke Fath (SPG) 
wrote:

> On 2024-02-16 01:14, Michael van Elst wrote:
> > We should have EDAC drivers that should at least report events,
> > but so far there is nothing...
>
> Sounds like a SoC project?
>
> Cheerio,
> Hauke
>
>
> (FreeBSD appears to be no better off:
> <
> https://forums.freebsd.org/threads/how-to-find-out-if-ecc-is-enabled.72839/
> >)
>
> --
>      The ASCII Ribbon Campaign      Hauke Fath
> ()     No HTML/RTF in email         Institut für Nachrichtentechnik
> /\     No Word docs in email        TU Darmstadt
>      Respect for open standards     Ruf +49-6151-16-21344
>


Re: NetBSD and ECC RAM?

2024-02-16 Thread Hauke Fath (SPG)

On 2024-02-16 01:14, Michael van Elst wrote:

We should have EDAC drivers that should at least report events,
but so far there is nothing...


Sounds like a SoC project?

Cheerio,
Hauke


(FreeBSD appears to be no better off:
<https://forums.freebsd.org/threads/how-to-find-out-if-ecc-is-enabled.72839/>)


--
     The ASCII Ribbon Campaign      Hauke Fath
()     No HTML/RTF in email         Institut für Nachrichtentechnik
/\     No Word docs in email        TU Darmstadt
     Respect for open standards     Ruf +49-6151-16-21344


Re: NetBSD and ECC RAM?

2024-02-16 Thread Hauke Fath

On 2024-02-16 01:14, Michael van Elst wrote:

We should have EDAC drivers that should at least report events,
but so far there is nothing...


Sounds like a SoC project?

Cheerio,
Hauke

--
     The ASCII Ribbon Campaign      Hauke Fath
()     No HTML/RTF in email         Institut für Nachrichtentechnik
/\     No Word docs in email        TU Darmstadt
     Respect for open standards     Ruf +49-6151-16-21344



Re: NetBSD and ECC RAM?

2024-02-15 Thread Michael van Elst
h...@spg.tu-darmstadt.de ("Hauke Fath (SPG)") writes:

>one of my favourite blogs is sporting a page on AMD ECC RAM support
>,

>Is this of any relevance to NetBSD, or do we just not bother?


We should have EDAC drivers that should at least report events,
but so far there is nothing...