Re: nvme controller reset failures on recent -CURRENT

2024-02-13 Thread Patrick M. Hausen
Hi all,

> On 13.02.2024 at 20:56, Pete Wright wrote:
> 1. M.2 NVMe really does need proper cooling, much more so than
> traditional SATA/SAS/SCSI drives.

I recently found a tool named "Scrutiny" that presents a nice dashboard
of all your disk devices and their SMART data, including crucial data
points like temperature.

Pros:

- Open source
- Nice web UI
- Uses smartmontools to gather the data, not reinventing the wheel
- Agents that can be called from cron jobs, for many OSes including FreeBSD
- Alerting via a variety of communication channels

Cons:

- Central hub best run on Linux plus docker compose
- No authentication whatsoever, so strictly internal use
- No grouping or other organisation of systems, so it does not scale
  beyond tens of servers

I found a couple of problematic HDDs and SSDs right after deploying it
that regular SMART tests had overlooked.

https://github.com/AnalogJ/scrutiny

Look for the Hub/Spoke deployment if you are willing to use e.g.
a Linux VM to run the tool, then point your FreeBSD systems at that.

It can probably be deployed entirely on FreeBSD, too, using the manual
installation instructions.
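
For the spoke side, the agent just needs to run periodically and report
to the hub. A minimal sketch of a crontab entry (the install path and
hub URL here are assumptions; the collector command follows the
project's documentation):

    # /etc/crontab on each FreeBSD host: report SMART data to the Scrutiny hub
    */15  *  *  *  *  root  /usr/local/bin/scrutiny-collector-metrics run --api-endpoint "http://scrutiny-hub.example.com:8080"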

HTH, kind regards,
Patrick


Re: nvme controller reset failures on recent -CURRENT

2024-02-13 Thread Craig Leres
I had issues with an NVMe drive in an Intel NUC. When I asked
freebsd-hackers, overheating was the first guess:



https://lists.freebsd.org/pipermail/freebsd-hackers/2018-May/052783.html

I blew the dust out of the fan assembly and changed the BIOS fan
settings to be more aggressive, and the system has been rock solid since.


Craig



Re: nvme controller reset failures on recent -CURRENT

2024-02-13 Thread Pete Wright

>> There's a tiny chance that this could be something more exotic,
>> but my money is on hardware gone bad after 2 years of service. I don't
>> think this is 'wear out' of the NAND (it's only 15TB written, but it
>> could be if this drive is really, really crappy NAND: first-generation
>> QLC maybe, but it seems too new). It might also be a connector problem
>> that's developed over time. There might be a few other things too, but
>> I don't think this is a U.2 drive with funky cables.
>
> The system was probably idle the majority of those two years of power-on
> time.
>
> It's one of these:
> https://www.techpowerup.com/ssd-specs/intel-660p-512-gb.d437
> I've seen comments that these generally don't need cooling.
>
> I just ordered a heatsink with some nice big fins, but it will take a
> week or more to arrive.



just wanted to add another data point to this discussion.  i had a
Crucial NVMe drive in my workstation that was recently showing similar
problems.  after much debugging i came to the same conclusion: it
was getting too hot.  i went ahead and purchased a Sabrent NVMe drive
that came with a heat sink.  i've also started making much more use of
my workstation (and the disk subsystem) and have had zero issues.


so lessons learnt:

1. M.2 NVMe really does need proper cooling, much more so than
traditional SATA/SAS/SCSI drives.


2. not all vendors do a great job reporting the health of devices
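
if you want to eyeball the raw health data yourself on FreeBSD,
something like this should do (nvmecontrol is in the base system;
smartctl comes with sysutils/smartmontools):

    # SMART/Health Information Log straight from the controller (log page 2)
    nvmecontrol logpage -p 2 nvme0
    # roughly the same data as decoded by smartmontools
    smartctl -a /dev/nvme0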

-pete

--
Pete Wright
p...@nomadlogic.org




Re: segfault in ld-elf.so.1

2024-02-13 Thread Alexander Leidinger

On 2024-02-13 01:58, Konstantin Belousov wrote:
> On Mon, Feb 12, 2024 at 11:54:02AM +0200, Konstantin Belousov wrote:
>> On Mon, Feb 12, 2024 at 10:35:56AM +0100, Alexander Leidinger wrote:
>>> Hi,
>>>
>>> dovecot (and no other program I use on this machine... at least not
>>> that I notice it) segfaults in ld-elf.so.1 after an update from
>>> 2024-01-18-092730 to 2024-02-10-144617 (and now 2024-02-11-212006 in
>>> the hope the issue would have been fixed by changes to libc/libsys
>>> since 2024-02-10-144617). The issue shows up when I try to do an IMAP
>>> login. A successful authentication starts the imap process, which
>>> immediately segfaults.
>>>
>>> I didn't recompile dovecot for the initial update, but I did now to
>>> rule out a regression in this area (and to get access via IMAP to my
>>> normal mail account).
>>>
>>> Backtrace:
>> The backtrace looks incomplete.  It might be the case of infinite
>> recursion, but I cannot claim it from the trace.
>>
>> Does the program segfault if you run it manually?

No.

>> If yes, please provide me with the tarball of the binary and all
>> required shared libs, including base system libraries, from your
>> machine.
>
> Regardless of my request, you might try the following.  Note that I did
> not test the patch; ensure that you have a way to recover ld-elf.so.1
> if something goes wrong.
>
> [inline patch]
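
(A minimal safety net for an experiment like this, assuming a stock
system where the /rescue tools are statically linked and so keep
working even when the runtime linker is broken:

    # keep a known-good copy before replacing the runtime linker
    cp -p /libexec/ld-elf.so.1 /libexec/ld-elf.so.1.good
    # if the patched rtld is broken, dynamically linked binaries stop
    # running; restore using the statically linked rescue tools
    /rescue/cp /libexec/ld-elf.so.1.good /libexec/ld-elf.so.1
)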

This did the trick, and I have IMAP access to my emails again. As this
runs in a jail, it was easy to test without fear of killing something.

I will try the patch in the review next.

Bye,
Alexander.

--
http://www.Leidinger.net alexan...@leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netch...@freebsd.org  : PGP 0x8F31830F9F2772BF




Re: nvme controller reset failures on recent -CURRENT

2024-02-13 Thread Don Lewis
On 12 Feb, Warner Losh wrote:
> On Mon, Feb 12, 2024 at 9:15 PM Don Lewis  wrote:
> 
>> On 12 Feb, Maxim Sobolev wrote:
>> > Might be overheating. Today's NVMe drives are notoriously flaky if
>> > you run them without a proper heat sink attached.
>>
>> I don't think it is a thermal problem.  According to the drive health
>> page, the device temperature has never reached Temperature 2, whatever
>> that is.  The room temperature is around 65F.  The system was stable
>> last summer when the room temperature spent a lot of time in the 80-85F
>> range.  The device temperature depends a lot on the I/O rate, and the
>> last panic happened when the I/O rate had been below 40tps for quite a
>> while.
>>
> 
> It did reach temperature 1, though. That's the 'Warning: this drive is
> too hot' temperature. It has spent 41213 minutes of your 19297 hours of
> uptime there, or an average of 2 minutes per hour. That's too much.
> Temperature 2 is the critical error: we are about to shut down
> completely due to it being too hot. It's only a couple of degrees below
> hardware power-off due to temperature in many drives. Some really cheap
> ones don't really implement it at all. On my card with the bad heat
> sink, the warning temp is 70C, critical is 75C, and IIRC thermal
> shutdown is 78C or 80C.
>
> I don't think we report these values in nvmecontrol identify. But you
> can do a raw dump with -x and look at bytes 266:267 for warning and
> 268:269 for critical.
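
For example, roughly (the byte offsets are the WCTEMP/CCTEMP fields of
the NVMe identify-controller data, little-endian values in kelvins;
smartctl, if installed, decodes the same fields):

    # raw hex dump of the identify-controller data
    nvmecontrol identify -x nvme0
    # smartmontools prints the thresholds already decoded:
    smartctl -a /dev/nvme0 | grep 'Temp. Threshold'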
> 
> In contrast, of the few dozen drives that I have, all of which have
> been abused in various ways, only one has any heat issues, and that
> one is an engineering special / sample with what I think is a damaged
> heat sink. If your card has no heat sink, this could well be what's
> going on.
> 
> This panic means "the nvme card lost its mind and stopped talking
> to the host". Its status registers read 0xff's, which means that the card
> isn't decoding bus signals. Usually this means that the firmware on the
> card has faulted and rebooted. If the card is overheating, then this could
> well be what's happening.
> 
> There's a tiny chance that this could be something more exotic,
> but my money is on hardware gone bad after 2 years of service. I don't
> think this is 'wear out' of the NAND (it's only 15TB written, but it
> could be if this drive is really, really crappy NAND: first-generation
> QLC maybe, but it seems too new). It might also be a connector problem
> that's developed over time. There might be a few other things too, but
> I don't think this is a U.2 drive with funky cables.

The system was probably idle the majority of those two years of power on
time.

It's one of these:
https://www.techpowerup.com/ssd-specs/intel-660p-512-gb.d437
I've seen comments that these generally don't need cooling.

I just ordered a heatsink with some nice big fins, but it will take a
week or more to arrive.

> 
>> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis  wrote:
>> >
>> >> I just upgraded my package build machine to:
>> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> >> from:
>> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> >> and I've had two nvme-triggered panics in the last day.
>> >>
>> >> nvme is being used for swap and L2ARC.  I'm not able to get a crash
>> >> dump, probably because the nvme device has gone away and I get an error
>> >> about not having a dump device.  It looks like a low-memory panic
>> >> because free memory is low and zfs is calling malloc().
>> >>
>> >> This shows up in the log leading up to the panic:
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>> >>
>> >> The device looks healthy to me:
>> >> SMART/Health Information Log
>> >> 
>> >> Critical Warning State: 0x00
>> >>  Available spare:   0
>> >>  Temperature:   0
>> >>  Device reliability:0
>> >>  Read only: 0
>> >>  Volatile memory backup:0
>> >> Temperature:312 K, 38.85 C, 101.93 F
>> >> Available spare:100
>> >> Available spare threshold:  10
>> >> Percentage used:3
>> >> Data units (512,000 byte) read: 5761183
>> >> Data units written: 29911502
>> >> Host read commands: 471921188
>> >>