Re: nvme timeout issues with hardware and bhyve vm's

2023-12-08 Thread Pete Wright
On Thu, Dec 07, 2023 at 04:19:12PM -0800, Chuck Tuffli wrote:
> On Thu, Dec 7, 2023 at 2:39 PM Pete Wright  wrote:
> ...
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> >   Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> 
> Does the device support the Persistent Event Log page (LID=0x0d)? If so, it
> might be interesting to dump it.
> 
unfortunately it does not.  i will probably just script this up and dump
some data into my local prometheus server so i can see what the temp looks
like over time.
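
something like this sketch is what i have in mind (untested; it assumes a
node_exporter textfile collector is in the picture, so the output path and
metric name are just placeholders, and the awk field index assumes the
"Temperature: NNN K, NN.NN C, ..." line format of the SMART/health page):

  #!/bin/sh
  # poll the composite temperature from the SMART/health log page (page 2)
  # and expose it through the node_exporter textfile collector
  DEV=nvme0
  OUT=/var/tmp/node_exporter/nvme_temp.prom

  # the health page has a line like: "Temperature: 314 K, 40.85 C, 105.53 F"
  temp_c=$(nvmecontrol logpage -p 2 "$DEV" | awk '/^Temperature:/ {print $4}')

  if [ -n "$temp_c" ]; then
      printf 'nvme_composite_temp_celsius{dev="%s"} %s\n' "$DEV" "$temp_c" \
          > "${OUT}.tmp" && mv "${OUT}.tmp" "$OUT"
  fi

run that from cron(8) every minute or so and that should be plenty to graph.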

-p

-- 
Pete Wright
p...@nomadlogic.org



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Bakul Shah
Thanks.

It may be worth checking the temp periodically and warning the user in case it
is too high (70°C+ or something). Even for devices that allow internal
throttling, a user might wish to know whether the device needs a (better)
heatsink.
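
For the warning side, an untested sketch run from cron(8) would be enough;
the 70°C threshold is arbitrary and the awk field index assumes the
"Temperature: NNN K, NN.NN C, ..." line of nvmecontrol's health page:

  #!/bin/sh
  # warn via syslog if the drive's composite temperature crosses a threshold
  DEV=nvme0
  LIMIT=70   # arbitrary; pick something below the drive's warning threshold

  temp=$(nvmecontrol logpage -p 2 "$DEV" |
         awk '/^Temperature:/ {print int($4)}')

  if [ -n "$temp" ] && [ "$temp" -ge "$LIMIT" ]; then
      logger -p daemon.warning \
          "nvme: $DEV composite temperature ${temp}C >= ${LIMIT}C"
  fi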

> On Dec 7, 2023, at 5:02 PM, Maxim Sobolev  wrote:
> 
> How quickly it heats up depends on lots of factors. Usually those devices
> burn some 3-7 watts per stick at 100% load, so maybe this would give you some
> idea. At least some of them support several toggleable performance modes,
> which use throttling internally to limit power consumption to a certain level
> (man nvmecontrol). It recently helped me make a system stable which otherwise
> would hang with timeouts after reaching 70-75°C, until I got the chance to
> take it apart and attach heatsinks to the NVMe drives. Once the temperature
> dropped to <= 50°C the drives became 100% stable.
> 
> -Max
> 
> On Thu, Dec 7, 2023, 4:07 PM Bakul Shah wrote:
>> On Dec 7, 2023, at 3:59 PM, Warner Losh wrote:
>> > 
>> > 
>> >  *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or
>> > 
>> > Yes. Most drives' firmware resets when it overheats. There might be
>> > something that the pci code can do when this happens to retrain the
>> > link, reprogram the config registers, etc.
>> 
>> How quickly can the device heat up? Can it be queried frequently
>> enough to act before it overheats by throttling I/O?
>> 
>> 
>> 
>> 



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Maxim Sobolev
How quickly it heats up depends on lots of factors. Usually those devices
burn some 3-7 watts per stick at 100% load, so maybe this would give you
some idea. At least some of them support several toggleable performance
modes, which use throttling internally to limit power consumption to a
certain level (man nvmecontrol). It recently helped me make a system stable
which otherwise would hang with timeouts after reaching 70-75°C, until I got
the chance to take it apart and attach heatsinks to the NVMe drives. Once
the temperature dropped to <= 50°C the drives became 100% stable.
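
For reference, the knobs I mean are the NVMe power states; roughly something
like this (from memory, so double-check nvmecontrol(8) on your version, and
the state number is only an example: it has to be one your drive actually
advertises):

  # list the power states the controller advertises (max power, latencies, ...)
  nvmecontrol power -l nvme0

  # select a lower-power, internally throttled state; "2" is only an example
  nvmecontrol power -p 2 nvme0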

-Max

On Thu, Dec 7, 2023, 4:07 PM Bakul Shah  wrote:

> On Dec 7, 2023, at 3:59 PM, Warner Losh  wrote:
> >
> >
> >  *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or
> >
> > Yes. Most drives' firmware resets when it overheats. There might be
> > something that the pci code can do when this happens to retrain the
> > link, reprogram the config registers, etc.
>
> How quickly can the device heat up? Can it be queried frequently
> enough to act before it overheats by throttling I/O?
>
>
>
>
>


Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Chuck Tuffli
On Thu, Dec 7, 2023 at 2:39 PM Pete Wright  wrote:
...
> Hi Warner, just resurfacing this thread because I've had a few lockups
> on my workstation running 14.0-STABLE.  I was able to capture a photo of
> the hang and this seems to be the most important line:
>
> nvme0: Resetting controller due to a timeout and possible hot unplug.
>
> When I scan the device after reboot I don't see any errors, but if there
> is a particular thing I should check via nvmecontrol please let me know.
>   Also, since it mentions possible hot unplug I wonder if this is
> hardware/firmware related to my system?

Does the device support the Persistent Event Log page (LID=0x0d)? If so, it
might be interesting to dump it.
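
If it does, something along these lines ought to dump it raw (from memory, and
assuming a plain get-log-page read of that LID works on the drive; nvmecontrol
doesn't decode it as far as I know, so hex or binary output is the safe bet):

  # dump the Persistent Event Log (LID 0x0d) as hex; use decimal 13 if your
  # nvmecontrol doesn't accept hex page numbers
  nvmecontrol logpage -p 0x0d -x nvme0

  # or save the binary form for offline decoding
  nvmecontrol logpage -p 0x0d -b nvme0 > nvme0-pel.bin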

--chuck



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Bakul Shah
On Dec 7, 2023, at 3:59 PM, Warner Losh  wrote:
> 
> 
>  *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or
> 
> Yes. Most drives' firmware resets when it overheats. There might be something
> that the pci code can do when this happens to retrain the link, reprogram the
> config registers, etc.

How quickly can the device heat up? Can it be queried frequently
enough to act before it overheats by throttling I/O?






Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Warner Losh
On Thu, Dec 7, 2023 at 4:09 PM Tomoaki AOKI 
wrote:

> On Thu, 7 Dec 2023 14:38:37 -0800
> Pete Wright  wrote:
>
> >
> >
> > On 10/13/23 7:34 PM, Warner Losh wrote:
> > >
> >
> > >
> > > the messages i posted in the start of the thread are from the VM
> itself
> > > (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed
> no
> > > such issues.
> > >
> > > Based on your comment about the improvements in 14 I'll focus my
> > > efforts
> > > on my workstation, it seemed to happen regularly so hopefully i can
> > > find
> > > a repro case.
> > >
> > >
> > > Let me know if you see similar messages in stable/14. I think I've fixed
> > > all the issues with timeouts, though you shouldn't ever see them in a vm
> > > setup unless something else weird is going on.
> > >
> >
> >
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> >   Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> >
> > Anyway, haven't found a repro case yet but it has locked up a few times
> > the past two weeks.
> >
> > -pete
> >
> >
> > --
> > Pete Wright
> > p...@nomadlogic.org
>
> If I myself encounter this kind of problem ON BARE METAL HARDWARE,
> I would usually suspect
>
>  *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or
>

Yes. Most drives' firmware resets when it overheats. There might be
something that the pci code can do when this happens to retrain the link,
reprogram the config registers, etc.


>  *Unstable physical connection (bad contact)
>

Yeah, a hot-plug controller is required to handle this, but the device will be bouncing.

Warner


Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Pete Wright




On 12/7/23 3:16 PM, Craig Leres wrote:

On 12/7/23 15:09, Tomoaki AOKI wrote:

If I myself encounter this kind of problem ON BARE METAL HARDWARE,
I would usually suspect

  *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or


This would also be my first guess.

Five years ago I had an NVMe drive in an Intel NUC that would sometimes "go to 
sleep"; here's the thread:



https://lists.freebsd.org/pipermail/freebsd-hackers/2018-May/052783.html

@imp helpfully suggested running "nvmecontrol logpage -p 2 nvme0", which 
showed mine was hot (60°C/140°F)! I adjusted the fan settings in the 
BIOS and have never had an issue since.




oh interesting, i'll run that next time it locks up.  the box is well 
ventilated, but that's not to say it's not overheating.  right now it's at:

Temperature: 314 K, 40.85 C, 105.53 F

nvmecontrol doesn't list any errors or warnings though:
Media errors:   0
No. error info log entries: 0
Warning Temp Composite Time:0
Error Temp Composite Time:  0

thanks for the tip!
-pete

--
Pete Wright
p...@nomadlogic.org



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Pete Wright




On 12/7/23 2:49 PM, Warner Losh wrote:



On Thu, Dec 7, 2023 at 3:38 PM Pete Wright wrote:




On 10/13/23 7:34 PM, Warner Losh wrote:
 >

 >
 >     the messages i posted in the start of the thread are from the
VM itself
 >     (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE)
showed no
 >     such issues.
 >
 >     Based on your comment about the improvements in 14 I'll focus my
 >     efforts
 >     on my workstation, it seemed to happen regularly so hopefully
i can
 >     find
 >     a repro case.
 >
 >
 > Let me know if you see similar messages in stable/14. I think I've fixed
 > all the issues with timeouts, though you shouldn't ever see them in a vm
 > setup unless something else weird is going on.
 >


Hi Warner, just resurfacing this thread because I've had a few lockups
on my workstation running 14.0-STABLE.  I was able to capture a
photo of
the hang and this seems to be the most important line:

nvme0: Resetting controller due to a timeout and possible hot unplug.

When I scan the device after reboot I don't see any errors, but if
there
is a particular thing I should check via nvmecontrol please let me
know.
   Also, since it mentions possible hot unplug I wonder if this is
hardware/firmware related to my system?

Anyway, haven't found a repro case yet but it has locked up a few times
the past two weeks.


What the message means is that (a) we stopped getting interrupts from 
the device and (b) when we went to check on the status of the device it 
read back like missing hardware.


So is this from inside the VM running under bhyve, or in the host that's 
hosting the VM? We have different next steps depending on where it is.




OK awesome, thanks for that context; so this is on a bare metal workstation.

-pete


--
Pete Wright
p...@nomadlogic.org



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Craig Leres

On 12/7/23 15:09, Tomoaki AOKI wrote:

If I myself encounter this kind of problem ON BARE METAL HARDWARE,
I would usually suspect

  *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or


This would also be my first guess.

Five years ago I had an NVMe drive in an Intel NUC that would sometimes "go to 
sleep"; here's the thread:



https://lists.freebsd.org/pipermail/freebsd-hackers/2018-May/052783.html

@imp helpfully suggested running "nvmecontrol logpage -p 2 nvme0", which 
showed mine was hot (60°C/140°F)! I adjusted the fan settings in the 
BIOS and have never had an issue since.


Craig



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Tomoaki AOKI
On Thu, 7 Dec 2023 14:38:37 -0800
Pete Wright  wrote:

> 
> 
> On 10/13/23 7:34 PM, Warner Losh wrote:
> > 
> 
> > 
> > the messages i posted in the start of the thread are from the VM itself
> > (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no
> > such issues.
> > 
> > Based on your comment about the improvements in 14 I'll focus my
> > efforts
> > on my workstation, it seemed to happen regularly so hopefully i can
> > find
> > a repro case.
> > 
> > 
> > Let me know if you see similar messages in stable/14. I think I've fixed
> > all the
> > issues with timeouts, though you shouldn't ever see them in a vm setup
> > unless something else weird is going on.
> > 
> 
> 
> Hi Warner, just resurfacing this thread because I've had a few lockups 
> on my workstation running 14.0-STABLE.  I was able to capture a photo of 
> the hang and this seems to be the most important line:
> 
> nvme0: Resetting controller due to a timeout and possible hot unplug.
> 
> When I scan the device after reboot I don't see any errors, but if there 
> is a particular thing I should check via nvmecontrol please let me know. 
>   Also, since it mentions possible hot unplug I wonder if this is 
> hardware/firmware related to my system?
> 
> Anyway, haven't found a repro case yet but it has locked up a few times 
> the past two weeks.
> 
> -pete
> 
> 
> -- 
> Pete Wright
> p...@nomadlogic.org

If I myself encounter this kind of problem ON BARE METAL HARDWARE,
I would usually suspect

 *Overheating-caused hang of the NVMe controller or PCI bridge on the SSD, or

 *Unstable physical connection (bad contact)

first.


-- 
Tomoaki AOKI



Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Warner Losh
On Thu, Dec 7, 2023 at 3:38 PM Pete Wright  wrote:

>
>
> On 10/13/23 7:34 PM, Warner Losh wrote:
> >
>
> >
> > the messages i posted in the start of the thread are from the VM
> itself
> > (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no
> > such issues.
> >
> > Based on your comment about the improvements in 14 I'll focus my
> > efforts
> > on my workstation, it seemed to happen regularly so hopefully i can
> > find
> > a repro case.
> >
> >
> > Let me know if you see similar messages in stable/14. I think I've fixed
> > all the
> > issues with timeouts, though you shouldn't ever see them in a vm setup
> > unless something else weird is going on.
> >
>
>
> Hi Warner, just resurfacing this thread because I've had a few lockups
> on my workstation running 14.0-STABLE.  I was able to capture a photo of
> the hang and this seems to be the most important line:
>
> nvme0: Resetting controller due to a timeout and possible hot unplug.
>
> When I scan the device after reboot I don't see any errors, but if there
> is a particular thing I should check via nvmecontrol please let me know.
>   Also, since it mentions possible hot unplug I wonder if this is
> hardware/firmware related to my system?
>
> Anyway, haven't found a repro case yet but it has locked up a few times
> the past two weeks.
>

What the message means is that (a) we stopped getting interrupts from the
device and (b) when we went to check on the status of the device it read
back like missing hardware.

So is this from inside the VM running under bhyve, or in the host that's
hosting the VM? We have different next steps depending on where it is.

Warner


Re: nvme timeout issues with hardware and bhyve vm's

2023-12-07 Thread Pete Wright




On 10/13/23 7:34 PM, Warner Losh wrote:






the messages i posted in the start of the thread are from the VM itself
(13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no
such issues.

Based on your comment about the improvements in 14 I'll focus my
efforts
on my workstation, it seemed to happen regularly so hopefully i can
find
a repro case.


Let me know if you see similar messages in stable/14. I think I've fixed
all the issues with timeouts, though you shouldn't ever see them in a vm
setup unless something else weird is going on.




Hi Warner, just resurfacing this thread because I've had a few lockups 
on my workstation running 14.0-STABLE.  I was able to capture a photo of 
the hang and this seems to be the most important line:


nvme0: Resetting controller due to a timeout and possible hot unplug.

When I scan the device after reboot I don't see any errors, but if there 
is a particular thing I should check via nvmecontrol please let me know. 
 Also, since it mentions possible hot unplug I wonder if this is 
hardware/firmware related to my system?


Anyway, haven't found a repro case yet but it has locked up a few times 
the past two weeks.


-pete


--
Pete Wright
p...@nomadlogic.org



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-16 Thread Chuck Tuffli
On Fri, Oct 13, 2023 at 7:34 PM Warner Losh  wrote:
...
> Let me know if you see similar messages in stable/14. I think I've fixed all
> the
> issues with timeouts, though you shouldn't ever see them in a vm setup
> unless something else weird is going on.

I'd be interested in a repro case too, as I haven't seen the NVMe
emulation in bhyve do this before. Were there any error messages from
bhyve?

The guest log messages seem to suggest that the backing storage for
the emulated device is timing out. If your comment

> I had similar issues on my workstation as well.  Scrubbing the NVMe
> device on my real-hardware workstation hasn't turned up any issues, but
> the system has locked up a handful of times.

means the host is seeing NVMe errors on the drives backing the
zpool/zvol used by the emulated device, this might explain it.
Although, it is curious that the emulated controller had trouble resetting
(i.e., the error message "controller ready did not become 1 within
30500 ms").

--chuck



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-15 Thread void
On Sun, 15 Oct 2023, at 15:53, Warner Losh wrote:

> The one with the uboot traceback? I can't help you there. The report is 
> confusing. I don't know the error / problem being reported to even know 
> what to look at.  Or is it a different thing? I'm so confused at this 
> point. I also think we need to recreate it on as clean a system as 
> possible. Weird problems in the boot chain prior to loader.efi, I have 
> little interest in and no time to look at...

The problem being reported on the 30th Sept is a panic on booting zroot
because the disk couldn't be found by the loader, presumably because
usb3 needed time to settle; making it wait a while resulted in a
bootable system. But all the waiting does is work around it. I don't
know how to fix it permanently. The boot gets as far as the point (i guess
it's called stage1 boot) where i can select which kernel to boot. At that
point, i thought u-boot had completed.

The reason I'm mentioning this in this thread is that the OP is reporting
an issue that, on the face of it, looks very similar to the one I saw, and i
thought mentioning it here might be helpful.



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-15 Thread Warner Losh
On Sun, Oct 15, 2023, 9:47 AM void  wrote:

> On Sun, 15 Oct 2023, at 15:35, Warner Losh wrote:
>
> > I've fixed all known nvme issues in current that aren't caused by other
> > parts of the system. If it isn't a very recent 15 or 14,  then there
> > are known issues and you'll need to try those first.
>
> The problem manifested with a source upgrade on the 30th September.
>
> > On arm it could be a lot of things. I keep seeing problems in other
> > areas that are hard to track down without the systems in hand and
> > time to look at complex problems; it can take a while to track them down.
> >
> > I'll need a simple reproducer to look at things...
>
> If you can tell me what to do, i'm happy to test.
>
> Additionally, I can provide a connected instance to you for destructive
> testing on arm64. It would take me about 24hrs to set up as I'd
> need to backup the disk. Let me know if you need it.
>

The one with the uboot traceback? I can't help you there. The report is
confusing. I don't know the error / problem being reported to even know
what to look at.  Or is it a different thing? I'm so confused at this
point. I also think we need to recreate it on as clean a system as
possible. Weird problems in the boot chain prior to loader.efi, I have
little interest in and no time to look at...

Warner

>


Re: nvme timeout issues with hardware and bhyve vm's

2023-10-15 Thread void
On Sun, 15 Oct 2023, at 15:35, Warner Losh wrote:

> I've fixed all known nvme issues in current that aren't caused by other 
> parts of the system. If it isn't a very recent 15 or 14,  then there 
> are known issues and you'll need to try those first.

The problem manifested with a source upgrade on the 30th September. 

> On arm it could be a lot of things. I keep seeing problems in other 
> areas that are hard to track down without the systems in hand and 
> time to look at complex problems; it can take a while to track them down.
>
> I'll need a simple reproducer to look at things... 

If you can tell me what to do, i'm happy to test.

Additionally, I can provide a connected instance to you for destructive 
testing on arm64. It would take me about 24hrs to set up as I'd
need to backup the disk. Let me know if you need it.
-- 



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-15 Thread Warner Losh
On Sun, Oct 15, 2023, 9:28 AM void  wrote:

> Hi,
>
> On Fri, 13 Oct 2023, at 03:40, Pete Wright wrote:
> > I had similar issues on my workstation as well.  Scrubbing the NVMe
> > device on my real-hardware workstation hasn't turned up any issues, but
> > the system has locked up a handful of times.
> >
> > Just curious if others have seen the same, or if someone could point me
> > in the right direction...
>
> I've seen similar issues (zpool timeout) in a quite different context
> (arm64, usb3-connected disk, not-vm) which i posted about here:
>
> https://lists.freebsd.org/archives/freebsd-arm/2023-September/003122.html
>
> Unsure if these have been fixed yet?
>

I've fixed all known nvme issues in current that aren't caused by other
parts of the system. If it isn't a very recent 15 or 14,  then there are
known issues and you'll need to try those first.

On arm it could be a lot of things. I keep seeing problems in other areas
that are hard to track down without the systems in hand and time to
look at complex problems; it can take a while to track them down.

I'll need a simple reproducer to look at things...

Warner

>


Re: nvme timeout issues with hardware and bhyve vm's

2023-10-15 Thread void
Hi,

On Fri, 13 Oct 2023, at 03:40, Pete Wright wrote:
> I had similar issues on my workstation as well.  Scrubbing the NVMe 
> device on my real-hardware workstation hasn't turned up any issues, but 
> the system has locked up a handful of times.
>
> Just curious if others have seen the same, or if someone could point me 
> in the right direction...

I've seen similar issues (zpool timeout) in a quite different context 
(arm64, usb3-connected disk, not-vm) which i posted about here:

https://lists.freebsd.org/archives/freebsd-arm/2023-September/003122.html

Unsure if these have been fixed yet?



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-13 Thread Warner Losh
On Fri, Oct 13, 2023 at 11:47 AM Pete Wright  wrote:

>
>
> On 10/13/23 6:24 AM, Warner Losh wrote:
> >
> >
> > On Thu, Oct 12, 2023, 10:53 PM Pete Wright wrote:
> >
> >
> >
> > On 10/12/23 8:45 PM, Warner Losh wrote:
> >  > What version is that kernel?
> >
> > oh dang i sent this to the wrong list, i'm not running current.  the
> > hypervisor and vm are both 13.2 and my workstation is a recent 14.0
> > pre-release build.  i'll do more homework tomorrow and post to
> > questions
> > or a more appropriate list.
> >
> >
> > Are the messages from the VM? Stable/14 should have the important nvme
> > changes I've made lately. The bhyve in 13.2 is lacking a number of nvme
> > fixes that have gone into current and stable/14. It's hard to say where
> > the fault is coming from.
> >
>
>
> the messages i posted in the start of the thread are from the VM itself
> (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no
> such issues.
>
> Based on your comment about the improvements in 14 I'll focus my efforts
> on my workstation, it seemed to happen regularly so hopefully i can find
> a repro case.
>

Let me know if you see similar messages in stable/14. I think I've fixed all
the
issues with timeouts, though you shouldn't ever see them in a vm setup
unless something else weird is going on.

Warner


Re: nvme timeout issues with hardware and bhyve vm's

2023-10-13 Thread Pete Wright




On 10/13/23 6:24 AM, Warner Losh wrote:



On Thu, Oct 12, 2023, 10:53 PM Pete Wright wrote:




On 10/12/23 8:45 PM, Warner Losh wrote:
 > What version is that kernel?

oh dang i sent this to the wrong list, i'm not running current.  the
hypervisor and vm are both 13.2 and my workstation is a recent 14.0
pre-release build.  i'll do more homework tomorrow and post to
questions
or a more appropriate list.


Are the messages from the VM? Stable/14 should have the important nvme 
changes I've made lately. The bhyve in 13.2 is lacking a number of nvme 
fixes that have gone into current and stable/14. It's hard to say where 
the fault is coming from.





the messages i posted in the start of the thread are from the VM itself 
(13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no 
such issues.


Based on your comment about the improvements in 14 I'll focus my efforts 
on my workstation, it seemed to happen regularly so hopefully i can find 
a repro case.


thanks warner!
-pete


--
Pete Wright
p...@nomadlogic.org



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-13 Thread Warner Losh
On Thu, Oct 12, 2023, 10:53 PM Pete Wright  wrote:

>
>
> On 10/12/23 8:45 PM, Warner Losh wrote:
> > What version is that kernel?
>
> oh dang i sent this to the wrong list, i'm not running current.  the
> hypervisor and vm are both 13.2 and my workstation is a recent 14.0
> pre-release build.  i'll do more homework tomorrow and post to questions
> or a more appropriate list.
>

Are the messages from the VM? Stable/14 should have the important nvme
changes I've made lately. The bhyve in 13.2 is lacking a number of nvme
fixes that have gone into current and stable/14. It's hard to say where the
fault is coming from.

Warner

-pete
>
> --
> Pete Wright
> p...@nomadlogic.org
>


Re: nvme timeout issues with hardware and bhyve vm's

2023-10-12 Thread Pete Wright




On 10/12/23 8:45 PM, Warner Losh wrote:

What version is that kernel?


oh dang i sent this to the wrong list, i'm not running current.  the 
hypervisor and vm are both 13.2 and my workstation is a recent 14.0 
pre-release build.  i'll do more homework tomorrow and post to questions 
or a more appropriate list.


-pete

--
Pete Wright
p...@nomadlogic.org



Re: nvme timeout issues with hardware and bhyve vm's

2023-10-12 Thread Warner Losh
What version is that kernel?

Warner

On Thu, Oct 12, 2023, 9:41 PM Pete Wright  wrote:

> hey there - i was curious if anyone has had issues with nvme devices
> recently.  i'm chasing down similar issues on my workstation which has a
> physical NVMe zroot, and on a bhyve VM which has a large pool exposed as
> an NVMe device (and is backed by a zvol).
>
> on the most recent bhyve issue the VM reported this:
>
> Oct 13 02:52:52 emby kernel: nvme1: RECOVERY_START 13737432416007567 vs
> 13737432371683671
> Oct 13 02:52:52 emby kernel: nvme1: RECOVERY_START 13737432718499597 vs
> 13737432371683671
> Oct 13 02:52:52 emby kernel: nvme1: timeout with nothing complete,
> resetting
> Oct 13 02:52:52 emby kernel: nvme1: Resetting controller due to a timeout.
> Oct 13 02:52:52 emby kernel: nvme1: RECOVERY_WAITING
> Oct 13 02:52:52 emby kernel: nvme1: resetting controller
> Oct 13 02:52:53 emby kernel: nvme1: waiting
> Oct 13 02:53:23 emby syslogd: last message repeated 114 times
> Oct 13 02:53:23 emby kernel: nvme1: controller ready did not become 1
> within 30500 ms
> Oct 13 02:53:23 emby kernel: nvme1: failing outstanding i/o
> Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:1 cid:119 nsid:1
> lba:4968850592 len:256
> Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:1 sqid:1 cid:119 cdw0:0
> Oct 13 02:53:23 emby kernel: nvme1: failing outstanding i/o
> Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:6 cid:0 nsid:1
> lba:5241952432 len:32
> Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:3 cid:123 nsid:1
> lba:4968850336 len:256
> Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:1 sqid:3 cid:123 cdw0:0
> Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:3 cid:0 nsid:1
> lba:5242495888 len:256
> Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:0 sqid:3 cid:0 cdw0:0
> Oct 13 02:53:23 emby kernel: nvme1: READ sqid:3 cid:0 nsid:1 lba:528 len:16
> Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:5 cid:0 nsid:1
> lba:4934226784 len:96
> Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:0 sqid:3 cid:0 cdw0:0
> Oct 13 02:53:23 emby kernel: nvme1: READ sqid:3 cid:0 nsid:1
> lba:6442449936 len:16
> Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:0 sqid:3 cid:0 cdw0:0
> Oct 13 02:53:25 emby kernel: nvme1: READ sqid:3 cid:0 nsid:1
> lba:6442450448 len:16
> Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:0 sqid:3 cid:0 cdw0:0
> Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:0 sqid:5 cid:0 cdw0:0
> Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0
> m:0 dnr:0 sqid:6 cid:0 cdw0:0
> Oct 13 02:53:25 emby kernel: nvd1: detached
>
>
>
> I had similar issues on my workstation as well.  Scrubbing the NVMe
> device on my real-hardware workstation hasn't turned up any issues, but
> the system has locked up a handful of times.
>
> Just curious if others have seen the same, or if someone could point me
> in the right direction...
>
> thanks!
> -pete
>
> --
> Pete Wright
> p...@nomadlogic.org
>
>


nvme timeout issues with hardware and bhyve vm's

2023-10-12 Thread Pete Wright
hey there - i was curious if anyone has had issues with nvme devices 
recently.  i'm chasing down similar issues on my workstation which has a 
physical NVMe zroot, and on a bhyve VM which has a large pool exposed as 
an NVMe device (and is backed by a zvol).


on the most recent bhyve issue the VM reported this:

Oct 13 02:52:52 emby kernel: nvme1: RECOVERY_START 13737432416007567 vs 
13737432371683671
Oct 13 02:52:52 emby kernel: nvme1: RECOVERY_START 13737432718499597 vs 
13737432371683671

Oct 13 02:52:52 emby kernel: nvme1: timeout with nothing complete, resetting
Oct 13 02:52:52 emby kernel: nvme1: Resetting controller due to a timeout.
Oct 13 02:52:52 emby kernel: nvme1: RECOVERY_WAITING
Oct 13 02:52:52 emby kernel: nvme1: resetting controller
Oct 13 02:52:53 emby kernel: nvme1: waiting
Oct 13 02:53:23 emby syslogd: last message repeated 114 times
Oct 13 02:53:23 emby kernel: nvme1: controller ready did not become 1 
within 30500 ms

Oct 13 02:53:23 emby kernel: nvme1: failing outstanding i/o
Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:1 cid:119 nsid:1 
lba:4968850592 len:256
Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:1 sqid:1 cid:119 cdw0:0

Oct 13 02:53:23 emby kernel: nvme1: failing outstanding i/o
Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:6 cid:0 nsid:1 
lba:5241952432 len:32
Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:3 cid:123 nsid:1 
lba:4968850336 len:256
Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:1 sqid:3 cid:123 cdw0:0
Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:3 cid:0 nsid:1 
lba:5242495888 len:256
Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:0 sqid:3 cid:0 cdw0:0

Oct 13 02:53:23 emby kernel: nvme1: READ sqid:3 cid:0 nsid:1 lba:528 len:16
Oct 13 02:53:23 emby kernel: nvme1: WRITE sqid:5 cid:0 nsid:1 
lba:4934226784 len:96
Oct 13 02:53:23 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:0 sqid:3 cid:0 cdw0:0
Oct 13 02:53:23 emby kernel: nvme1: READ sqid:3 cid:0 nsid:1 
lba:6442449936 len:16
Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:0 sqid:3 cid:0 cdw0:0
Oct 13 02:53:25 emby kernel: nvme1: READ sqid:3 cid:0 nsid:1 
lba:6442450448 len:16
Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:0 sqid:3 cid:0 cdw0:0
Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:0 sqid:5 cid:0 cdw0:0
Oct 13 02:53:25 emby kernel: nvme1: ABORTED - BY REQUEST (00/07) crd:0 
m:0 dnr:0 sqid:6 cid:0 cdw0:0

Oct 13 02:53:25 emby kernel: nvd1: detached



I had similar issues on my workstation as well.  Scrubbing the NVMe 
device on my real-hardware workstation hasn't turned up any issues, but 
the system has locked up a handful of times.


Just curious if others have seen the same, or if someone could point me 
in the right direction...


thanks!
-pete

--
Pete Wright
p...@nomadlogic.org