Re: nvme controller reset failures on recent -CURRENT
On Mon, Feb 12, 2024 at 9:15 PM Don Lewis wrote: > On 12 Feb, Maxim Sobolev wrote: > > Might be an overheating. Today's nvme drives are notoriously flaky if you > > run them without proper heat sink attached to it. > > I don't think it is a thermal problem. According to the drive health > page, the device temperature has never reached Temperature 2, whatever > that is. The room temperature is around 65F. The system was stable > last summer when the room temperature spent a lot of time in the 80-85F > range. The device temperature depends a lot on the I/O rate, and the > last panic happened when the I/O rate had been below 40tps for quite a > while. > It did reach temperature 1, though. That's the 'Warning this drive is too hot' temperature. It has spent 41213 minutes of your 19297 hours of up time, or an average of 2 minutes per hour. That's too much. Temperature 2 is critical error: we are about to shut down completely due to it being too hot. It's only a couple degrees below hardware power off due to temperature in many drives. Some really cheap ones don't really implement it at all. On my card with the bad heat sink, Warning temp is 70C while critical is 75C while IIRC thermal shutdown is 78C or 80C. I don't think we report these values in nvmecontrol identify. But you can do a raw dump with -x look at bytes 266:267 for warning and 268:269 for critical. In contrast, the few dozen drives that I have, all of which have been abused in various ways, And only one of them has any heat issues, and that one is an engineering special / sample with what I think is a damaged heat sink. If your card has no heat sink, this could well be what's going on. This panic means "the nvme card lost its mind and stopped talking to the host". Its status registers read 0xff's, which means that the card isn't decoding bus signals. Usually this means that the firmware on the card has faulted and rebooted. If the card is overheating, then this could well be what's happening. There's a tiny chance that this could be something more exotic, but my money is on hardware gone bad after 2 years of service. I don't think this is 'wear out' of the NAND (it's only 15TB written, but it could be if this drive is really really crappy nand: first generation QLC maybe, but it seems too new). It might also be a connector problem that's developed over time. There might be a few other things too, but I don't think this is a U.2 drive with funky cables. Warner > > On Mon, Feb 12, 2024, 4:28 PM Don Lewis wrote: > > > >> I just upgraded my package build machine to: > >> FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e > >> from: > >> FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38 > >> and I've had two nvme-triggered panics in the last day. > >> > >> nvme is being used for swap and L2ARC. I'm not able to get a crash > >> dump, probably because the nvme device has gone away and I get an error > >> about not having a dump device. It looks like a low-memory panic > >> because free memory is low and zfs is calling malloc(). > >> > >> This shows up in the log leading up to the panic: > >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a > >> timeout a > >> nd possible hot unplug. > >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times > >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller > >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a > >> timeout a > >> nd possible hot unplug. > >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times > >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete > >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times > >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o > >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping > watchdog > >> ti > >> meout. > >> > >> The device looks healthy to me: > >> SMART/Health Information Log > >> > >> Critical Warning State: 0x00 > >> Available spare: 0 > >> Temperature: 0 > >> Device reliability:0 > >> Read only: 0 > >> Volatile memory backup:0 > >> Temperature:312 K, 38.85 C, 101.93 F > >> Available spare:100 > >> Available spare threshold: 10 > >> Percentage used:3 > >> Data units (512,000 byte) read: 5761183 > >> Data units written: 29911502 > >> Host read commands: 471921188 > >> Host write commands:605394753 > >> Controller busy time (minutes): 32359 > >> Power cycles: 110 > >> Power on hours: 19297 > >> Unsafe shutdowns: 14 > >> Media errors: 0 > >> No. error info log entries: 0 > >> Warning Temp Composite Time:0 > >> Error Temp Composite Time: 0 > >> Temperature 1 Transition Count: 5231 > >> Temperature 2 Transition Count: 0 > >> Total Time For
Re: nvme controller reset failures on recent -CURRENT
On 12 Feb, Maxim Sobolev wrote: > Might be an overheating. Today's nvme drives are notoriously flaky if you > run them without proper heat sink attached to it. I don't think it is a thermal problem. According to the drive health page, the device temperature has never reached Temperature 2, whatever that is. The room temperature is around 65F. The system was stable last summer when the room temperature spent a lot of time in the 80-85F range. The device temperature depends a lot on the I/O rate, and the last panic happened when the I/O rate had been below 40tps for quite a while. > On Mon, Feb 12, 2024, 4:28 PM Don Lewis wrote: > >> I just upgraded my package build machine to: >> FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e >> from: >> FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38 >> and I've had two nvme-triggered panics in the last day. >> >> nvme is being used for swap and L2ARC. I'm not able to get a crash >> dump, probably because the nvme device has gone away and I get an error >> about not having a dump device. It looks like a low-memory panic >> because free memory is low and zfs is calling malloc(). >> >> This shows up in the log leading up to the panic: >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a >> timeout a >> nd possible hot unplug. >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a >> timeout a >> nd possible hot unplug. >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog >> ti >> meout. >> >> The device looks healthy to me: >> SMART/Health Information Log >> >> Critical Warning State: 0x00 >> Available spare: 0 >> Temperature: 0 >> Device reliability:0 >> Read only: 0 >> Volatile memory backup:0 >> Temperature:312 K, 38.85 C, 101.93 F >> Available spare:100 >> Available spare threshold: 10 >> Percentage used:3 >> Data units (512,000 byte) read: 5761183 >> Data units written: 29911502 >> Host read commands: 471921188 >> Host write commands:605394753 >> Controller busy time (minutes): 32359 >> Power cycles: 110 >> Power on hours: 19297 >> Unsafe shutdowns: 14 >> Media errors: 0 >> No. error info log entries: 0 >> Warning Temp Composite Time:0 >> Error Temp Composite Time: 0 >> Temperature 1 Transition Count: 5231 >> Temperature 2 Transition Count: 0 >> Total Time For Temperature 1: 41213 >> Total Time For Temperature 2: 0 >> >> >>
Re: nvme controller reset failures on recent -CURRENT
On 12 Feb, Mark Johnston wrote: > On Mon, Feb 12, 2024 at 04:28:10PM -0800, Don Lewis wrote: >> I just upgraded my package build machine to: >> FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e >> from: >> FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38 >> and I've had two nvme-triggered panics in the last day. >> >> nvme is being used for swap and L2ARC. I'm not able to get a crash >> dump, probably because the nvme device has gone away and I get an error >> about not having a dump device. It looks like a low-memory panic >> because free memory is low and zfs is calling malloc(). >> >> This shows up in the log leading up to the panic: >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout a >> nd possible hot unplug. >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout a >> nd possible hot unplug. >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog ti >> meout. > > Are you by chance using the drive mentioned here? > https://github.com/openzfs/zfs/discussions/14793 > > I was bitten by that and ended up replacing the drive with a different > model. The crash manifested exactly as you describe, though I didn't > have L2ARC or swap enabled on it. Nope: nda0 at nvme0 bus 0 scbus9 target 0 lun 1 nda0: nda0: Serial Number BTNH940617WE512A nda0: nvme version 1.3 nda0: 488386MB (1000215216 512 byte sectors) I'm not seeing super high I/O rates> I happened to have iostat running when the machine paniced: 0 584 88.431 2.68 65.8 112 7.18 68.2 107 7.13 80 0 20 0 0 0 565 99.132 3.06 27.974 2.01 30.570 2.08 80 0 20 0 0 0 612 92.831 2.77 18.9 148 2.74 18.9 148 2.73 86 0 14 0 0 0 618 88.613 1.17 25.059 1.44 24.261 1.44 89 0 11 0 0 0 586 45.4 5 0.22 31.455 1.70 30.857 1.70 84 0 16 0 0 0 598 12.7 3 0.03 38.164 2.40 37.166 2.40 84 0 16 0 0 0 675 36.1 6 0.21 23.7 156 3.62 22.7 164 3.63 88 0 12 0 0 0 641 6.9 6 0.04 25.7 243 6.10 25.3 246 6.08 71 0 29 0 0 0 737 20.1 9 0.18 36.4 148 5.24 37.2 144 5.24 78 0 22 0 0 0 578 44.723 1.03 25.1 164 4.01 25.5 161 3.99 86 0 14 0 0 0 608 70.315 1.06 51.164 3.19 51.364 3.19 89 0 11 0 0 0 624 38.6 9 0.35 32.3 121 3.80 32.2 121 3.79 90 0 10 0 0 0 577 80.616 1.28 37.866 2.44 36.569 2.46 90 0 10 0 0 tty nda0 ada0 ada1 cpu tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id 0 566 87.716 1.39 27.260 1.60 25.366 1.62 87 0 13 0 0 0 599 77.211 0.83 17.4 391 6.66 17.3 395 6.66 74 0 26 0 0 0 660 45.0 7 0.31 18.7 575 10.51 18.6 578 10.49 76 0 24 0 0 0 615 37.7 8 0.31 24.0 303 7.11 24.0 303 7.11 58 0 42 0 0 Fssh_packet_write_wait: ... port 22: Broken pipe ada* are old and slow spinning rust. That report does mention something else that could also be a cause. I upgraded the motherboard BIOS around the same time. When I get a chance, I'll drop back to the older FreeBSD version and see if the problem goes away.
Re: nvme controller reset failures on recent -CURRENT
On Mon, Feb 12, 2024 at 04:28:10PM -0800, Don Lewis wrote: > I just upgraded my package build machine to: > FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e > from: > FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38 > and I've had two nvme-triggered panics in the last day. > > nvme is being used for swap and L2ARC. I'm not able to get a crash > dump, probably because the nvme device has gone away and I get an error > about not having a dump device. It looks like a low-memory panic > because free memory is low and zfs is calling malloc(). > > This shows up in the log leading up to the panic: > Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout a > nd possible hot unplug. > Feb 12 10:07:41 zipper syslogd: last message repeated 1 times > Feb 12 10:07:41 zipper kernel: nvme0: resetting controller > Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout a > nd possible hot unplug. > Feb 12 10:07:41 zipper syslogd: last message repeated 1 times > Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete > Feb 12 10:07:41 zipper syslogd: last message repeated 2 times > Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o > Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog ti > meout. Are you by chance using the drive mentioned here? https://github.com/openzfs/zfs/discussions/14793 I was bitten by that and ended up replacing the drive with a different model. The crash manifested exactly as you describe, though I didn't have L2ARC or swap enabled on it. > The device looks healthy to me: > SMART/Health Information Log > > Critical Warning State: 0x00 > Available spare: 0 > Temperature: 0 > Device reliability:0 > Read only: 0 > Volatile memory backup:0 > Temperature:312 K, 38.85 C, 101.93 F > Available spare:100 > Available spare threshold: 10 > Percentage used:3 > Data units (512,000 byte) read: 5761183 > Data units written: 29911502 > Host read commands: 471921188 > Host write commands:605394753 > Controller busy time (minutes): 32359 > Power cycles: 110 > Power on hours: 19297 > Unsafe shutdowns: 14 > Media errors: 0 > No. error info log entries: 0 > Warning Temp Composite Time:0 > Error Temp Composite Time: 0 > Temperature 1 Transition Count: 5231 > Temperature 2 Transition Count: 0 > Total Time For Temperature 1: 41213 > Total Time For Temperature 2: 0 > >
Re: nvme controller reset failures on recent -CURRENT
Might be an overheating. Today's nvme drives are notoriously flaky if you run them without proper heat sink attached to it. -Max On Mon, Feb 12, 2024, 4:28 PM Don Lewis wrote: > I just upgraded my package build machine to: > FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e > from: > FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38 > and I've had two nvme-triggered panics in the last day. > > nvme is being used for swap and L2ARC. I'm not able to get a crash > dump, probably because the nvme device has gone away and I get an error > about not having a dump device. It looks like a low-memory panic > because free memory is low and zfs is calling malloc(). > > This shows up in the log leading up to the panic: > Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a > timeout a > nd possible hot unplug. > Feb 12 10:07:41 zipper syslogd: last message repeated 1 times > Feb 12 10:07:41 zipper kernel: nvme0: resetting controller > Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a > timeout a > nd possible hot unplug. > Feb 12 10:07:41 zipper syslogd: last message repeated 1 times > Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete > Feb 12 10:07:41 zipper syslogd: last message repeated 2 times > Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o > Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog > ti > meout. > > The device looks healthy to me: > SMART/Health Information Log > > Critical Warning State: 0x00 > Available spare: 0 > Temperature: 0 > Device reliability:0 > Read only: 0 > Volatile memory backup:0 > Temperature:312 K, 38.85 C, 101.93 F > Available spare:100 > Available spare threshold: 10 > Percentage used:3 > Data units (512,000 byte) read: 5761183 > Data units written: 29911502 > Host read commands: 471921188 > Host write commands:605394753 > Controller busy time (minutes): 32359 > Power cycles: 110 > Power on hours: 19297 > Unsafe shutdowns: 14 > Media errors: 0 > No. error info log entries: 0 > Warning Temp Composite Time:0 > Error Temp Composite Time: 0 > Temperature 1 Transition Count: 5231 > Temperature 2 Transition Count: 0 > Total Time For Temperature 1: 41213 > Total Time For Temperature 2: 0 > > >
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On Feb 12, 2024, at 16:36, Mark Millard wrote: > On Feb 12, 2024, at 16:10, Mark Millard wrote: > >> On Feb 12, 2024, at 12:00, Mark Millard wrote: >> >>> [Gack: I was looking at the wrong vintage of source code, predating >>> your changes: wrong system used.] >>> >>> >>> On Feb 12, 2024, at 10:41, Mark Millard wrote: >>> On Feb 12, 2024, at 09:32, John Baldwin wrote: > On 2/9/24 8:13 PM, Mark Millard wrote: >> Summary: >> pcib0: mem >> 0x7d50-0x7d50930f irq 80,81 on simplebus2 >> pcib0: parsing FDT for ECAM0: >> pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 >> . . . >> rman_manage_region: request: start 0x6, >> end 0x6000f >> panic: Failed to add resource to rman > > Hmmm, I suspect this is due to the way that bus_translate_resource works > which is > fundamentally broken. It rewrites the start address of a resource > in-situ instead > of keeping downstream resources separate from the upstream resources. > For example, > I don't see how you could ever release a resource in this design without > completely > screwing up your rman. That is, I expect trying to detach a PCI device > behind a > translating bridge that uses the current approach should corrupt the > allocated > resource ranges in an rman long before my changes. > > That said, that doesn't really explain the panic. Hmm, the panic might > be because > for PCI bridge windows the driver now passes RF_ACTIVE and the > bus_translate_resource > hack only kicks in the activate_resource method of pci_host_generic.c. > >> Detail: >> . . . >> pcib0: mem >> 0x7d50-0x7d50930f irq 80,81 on simplebus2 >> pcib0: parsing FDT for ECAM0: >> pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 > > This indicates this is a translating bus. > >> pcib1: irq 91 at device 0.0 on pci0 >> rman_manage_region: request: start 0x1, end 0x1 >> pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, >> count=0x10 >> rman_reserve_resource_bound: request: [0xc000, >> 0xc00f], length 0x10, flags 102, device pcib1 >> rman_reserve_resource_bound: trying 0x <0xc000,0xf> >> considering [0xc000, 0x] >> truncated region: [0xc000, 0xc00f]; size 0x10 (requested >> 0x10) >> candidate region: [0xc000, 0xc00f], size 0x10 >> allocating from the beginning >> rman_manage_region: request: start 0x6, >> end 0x6000f >> >> What you later typed does not match: >> >> 0x6 >> 0x6000f >> >> You later typed: >> >> 0x6000 >> 0x600fff >> >> This seems to have lead to some confusion from using the >> wrong figure(s). >> > The fact that we are trying to reserve the CPU addresses in the rman is > because > bus_translate_resource rewrote the start address in the resource after it > was allocated. > > That said, I can't see why rman_manage_region would actually fail. At > this point the > rman is empty (this is the first call to rman_manage_region for "pcib1 > memory window"), > so only the check that should be failing are the checks against rm_start > and > rm_end. For the memory window, rm_start is always 0, and rm_end is always > 0x, so both the old (0xc - 0xc00f) and new > (0x6000 - 0x600fff) > ranges are within those bounds. >> >> No: >> >> 0x >> >> .vs (actual): >> >> 0x6 >> 0x6000f >> >> Both the actuals are larger then the 0x figure you list above. >> >> I've no clue if the 0x might have its own typos. >> > I would instead expect to see some other issue later > on where we fail to allocate a resource for a child BAR, but I wouldn't > expect > rman_manage_region to fail. > > Logging the return value from rman_manage_region would be the first step > I think > to see which error value it is returning. Looking at the code in sys/kern/subr_rman.c for rman_manage_region I see the (mostly) return rv related logic only has ENONMEM (explicit return) and EBUSY as non-0 possibilities (no other returns). >>> >>> The modern code also has EINVAL via an explicit return. >>> All the rv references and all the returns are shown below: int rv = 0; . . . >>> >>> The modern code also has here: >>> >>> DPRINTF(("rman_manage_region: <%s> request: start %#jx, end %#jx\n", >>> rm->rm_descr, start, end)); >>> if (start < rm->rm_start || end > rm->rm_end) >>> return EINVAL; >>> >>> Adding rm->rm_start and rm->rm_end to the message might be >>> appropriate --and would also be enough additional information >>> to know if EINVAL would
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On Feb 12, 2024, at 16:10, Mark Millard wrote: > On Feb 12, 2024, at 12:00, Mark Millard wrote: > >> [Gack: I was looking at the wrong vintage of source code, predating >> your changes: wrong system used.] >> >> >> On Feb 12, 2024, at 10:41, Mark Millard wrote: >> >>> On Feb 12, 2024, at 09:32, John Baldwin wrote: >>> On 2/9/24 8:13 PM, Mark Millard wrote: > Summary: > pcib0: mem > 0x7d50-0x7d50930f irq 80,81 on simplebus2 > pcib0: parsing FDT for ECAM0: > pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 > . . . > rman_manage_region: request: start 0x6, end > 0x6000f > panic: Failed to add resource to rman Hmmm, I suspect this is due to the way that bus_translate_resource works which is fundamentally broken. It rewrites the start address of a resource in-situ instead of keeping downstream resources separate from the upstream resources. For example, I don't see how you could ever release a resource in this design without completely screwing up your rman. That is, I expect trying to detach a PCI device behind a translating bridge that uses the current approach should corrupt the allocated resource ranges in an rman long before my changes. That said, that doesn't really explain the panic. Hmm, the panic might be because for PCI bridge windows the driver now passes RF_ACTIVE and the bus_translate_resource hack only kicks in the activate_resource method of pci_host_generic.c. > Detail: > . . . > pcib0: mem > 0x7d50-0x7d50930f irq 80,81 on simplebus2 > pcib0: parsing FDT for ECAM0: > pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 This indicates this is a translating bus. > pcib1: irq 91 at device 0.0 on pci0 > rman_manage_region: request: start 0x1, end 0x1 > pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, > count=0x10 > rman_reserve_resource_bound: request: [0xc000, > 0xc00f], length 0x10, flags 102, device pcib1 > rman_reserve_resource_bound: trying 0x <0xc000,0xf> > considering [0xc000, 0x] > truncated region: [0xc000, 0xc00f]; size 0x10 (requested > 0x10) > candidate region: [0xc000, 0xc00f], size 0x10 > allocating from the beginning > rman_manage_region: request: start 0x6, end > 0x6000f > > What you later typed does not match: > > 0x6 > 0x6000f > > You later typed: > > 0x6000 > 0x600fff > > This seems to have lead to some confusion from using the > wrong figure(s). > The fact that we are trying to reserve the CPU addresses in the rman is because bus_translate_resource rewrote the start address in the resource after it was allocated. That said, I can't see why rman_manage_region would actually fail. At this point the rman is empty (this is the first call to rman_manage_region for "pcib1 memory window"), so only the check that should be failing are the checks against rm_start and rm_end. For the memory window, rm_start is always 0, and rm_end is always 0x, so both the old (0xc - 0xc00f) and new (0x6000 - 0x600fff) ranges are within those bounds. > > No: > > 0x > > .vs (actual): > > 0x6 > 0x6000f > > Both the actuals are larger then the 0x figure you list above. > > I've no clue if the 0x might have its own typos. > I would instead expect to see some other issue later on where we fail to allocate a resource for a child BAR, but I wouldn't expect rman_manage_region to fail. Logging the return value from rman_manage_region would be the first step I think to see which error value it is returning. >>> >>> Looking at the code in sys/kern/subr_rman.c for rman_manage_region I see >>> the (mostly) return rv related logic only has ENONMEM (explicit return) and >>> EBUSY as non-0 possibilities (no other returns). >> >> The modern code also has EINVAL via an explicit return. >> >>> All the rv references and >>> all the returns are shown below: >>> >>> int rv = 0; >>> . . . >> >> The modern code also has here: >> >> DPRINTF(("rman_manage_region: <%s> request: start %#jx, end %#jx\n", >> rm->rm_descr, start, end)); >> if (start < rm->rm_start || end > rm->rm_end) >> return EINVAL; >> >> Adding rm->rm_start and rm->rm_end to the message might be >> appropriate --and would also be enough additional information >> to know if EINVAL would be returned. > > I added such code and built, installed, and tried booting > a separate kernel. The result was: > > rman_manage_region: range 0..0x : request: > start
nvme controller reset failures on recent -CURRENT
I just upgraded my package build machine to: FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e from: FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38 and I've had two nvme-triggered panics in the last day. nvme is being used for swap and L2ARC. I'm not able to get a crash dump, probably because the nvme device has gone away and I get an error about not having a dump device. It looks like a low-memory panic because free memory is low and zfs is calling malloc(). This shows up in the log leading up to the panic: Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout a nd possible hot unplug. Feb 12 10:07:41 zipper syslogd: last message repeated 1 times Feb 12 10:07:41 zipper kernel: nvme0: resetting controller Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout a nd possible hot unplug. Feb 12 10:07:41 zipper syslogd: last message repeated 1 times Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete Feb 12 10:07:41 zipper syslogd: last message repeated 2 times Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog ti meout. The device looks healthy to me: SMART/Health Information Log Critical Warning State: 0x00 Available spare: 0 Temperature: 0 Device reliability:0 Read only: 0 Volatile memory backup:0 Temperature:312 K, 38.85 C, 101.93 F Available spare:100 Available spare threshold: 10 Percentage used:3 Data units (512,000 byte) read: 5761183 Data units written: 29911502 Host read commands: 471921188 Host write commands:605394753 Controller busy time (minutes): 32359 Power cycles: 110 Power on hours: 19297 Unsafe shutdowns: 14 Media errors: 0 No. error info log entries: 0 Warning Temp Composite Time:0 Error Temp Composite Time: 0 Temperature 1 Transition Count: 5231 Temperature 2 Transition Count: 0 Total Time For Temperature 1: 41213 Total Time For Temperature 2: 0
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On Feb 12, 2024, at 12:00, Mark Millard wrote: > [Gack: I was looking at the wrong vintage of source code, predating > your changes: wrong system used.] > > > On Feb 12, 2024, at 10:41, Mark Millard wrote: > >> On Feb 12, 2024, at 09:32, John Baldwin wrote: >> >>> On 2/9/24 8:13 PM, Mark Millard wrote: Summary: pcib0: mem 0x7d50-0x7d50930f irq 80,81 on simplebus2 pcib0: parsing FDT for ECAM0: pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 . . . rman_manage_region: request: start 0x6, end 0x6000f panic: Failed to add resource to rman >>> >>> Hmmm, I suspect this is due to the way that bus_translate_resource works >>> which is >>> fundamentally broken. It rewrites the start address of a resource in-situ >>> instead >>> of keeping downstream resources separate from the upstream resources. For >>> example, >>> I don't see how you could ever release a resource in this design without >>> completely >>> screwing up your rman. That is, I expect trying to detach a PCI device >>> behind a >>> translating bridge that uses the current approach should corrupt the >>> allocated >>> resource ranges in an rman long before my changes. >>> >>> That said, that doesn't really explain the panic. Hmm, the panic might be >>> because >>> for PCI bridge windows the driver now passes RF_ACTIVE and the >>> bus_translate_resource >>> hack only kicks in the activate_resource method of pci_host_generic.c. >>> Detail: . . . pcib0: mem 0x7d50-0x7d50930f irq 80,81 on simplebus2 pcib0: parsing FDT for ECAM0: pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 >>> >>> This indicates this is a translating bus. >>> pcib1: irq 91 at device 0.0 on pci0 rman_manage_region: request: start 0x1, end 0x1 pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, count=0x10 rman_reserve_resource_bound: request: [0xc000, 0xc00f], length 0x10, flags 102, device pcib1 rman_reserve_resource_bound: trying 0x <0xc000,0xf> considering [0xc000, 0x] truncated region: [0xc000, 0xc00f]; size 0x10 (requested 0x10) candidate region: [0xc000, 0xc00f], size 0x10 allocating from the beginning rman_manage_region: request: start 0x6, end 0x6000f What you later typed does not match: 0x6 0x6000f You later typed: 0x6000 0x600fff This seems to have lead to some confusion from using the wrong figure(s). >>> The fact that we are trying to reserve the CPU addresses in the rman is >>> because >>> bus_translate_resource rewrote the start address in the resource after it >>> was allocated. >>> >>> That said, I can't see why rman_manage_region would actually fail. At this >>> point the >>> rman is empty (this is the first call to rman_manage_region for "pcib1 >>> memory window"), >>> so only the check that should be failing are the checks against rm_start and >>> rm_end. For the memory window, rm_start is always 0, and rm_end is always >>> 0x, so both the old (0xc - 0xc00f) and new (0x6000 >>> - 0x600fff) >>> ranges are within those bounds. No: 0x vs (actual): 0x6 0x6000f Both the actuals are larger then the 0x figure you list above. I've no clue if the 0x might have its own typos. >>> I would instead expect to see some other issue later >>> on where we fail to allocate a resource for a child BAR, but I wouldn't >>> expect >>> rman_manage_region to fail. >>> >>> Logging the return value from rman_manage_region would be the first step I >>> think >>> to see which error value it is returning. >> >> Looking at the code in sys/kern/subr_rman.c for rman_manage_region I see >> the (mostly) return rv related logic only has ENONMEM (explicit return) and >> EBUSY as non-0 possibilities (no other returns). > > The modern code also has EINVAL via an explicit return. > >> All the rv references and >> all the returns are shown below: >> >> int rv = 0; >> . . . > > The modern code also has here: > >DPRINTF(("rman_manage_region: <%s> request: start %#jx, end %#jx\n", >rm->rm_descr, start, end)); >if (start < rm->rm_start || end > rm->rm_end) >return EINVAL; > > Adding rm->rm_start and rm->rm_end to the message might be > appropriate --and would also be enough additional information > to know if EINVAL would be returned. I added such code and built, installed, and tried booting a separate kernel. The result was: rman_manage_region: range 0..0x : request: start 0x6, end 0x6000f panic: Failed to add resource to rman So: 0 vs. start: 0x6 and: 0x vs. end: 0x6000f The 0x6..0x6000f range is not a range of RAM addresses [too large for that for the 8
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On Feb 12, 2024, at 10:02, John Kennedy wrote: > On Mon, Feb 12, 2024 at 09:36:46AM -0800, John Baldwin wrote: >> Without a stack trace it is pretty much impossible to debug a panic like >> this. >> Do you have KDB_TRACE enabled in your kernel config? I'm also not sure how >> the >> PCI changes can result in a panic post-boot. If you were going to have >> problems >> they would be during device attach, not after you are booted and running X. >> >> Short of a stack trace, you can at least use lldb or gdb to lookup the source >> line associated with the faulting instruction pointer (as long as it isn't in >> a kernel module), e.g. for gdb you would use 'gdb /boot/kernel/kernel' and >> then >> 'l *', e.g. from above: 'l *0x80acb962' > > I know on my RPI4 where I saw that, my USB keyboard was dead at that point > and I couldn't get a trace or continue along to get the crashdump. My serial console context still works at the panic. But: . . KDB: enter: panic [ thread pid 0 tid 10 ] Stopped at kdb_enter+0x4c: str xzr, [x19, #1280] db> dump Cannot dump: no dump device specified. db> This is despite: # grep dumpdev /boot/loader.conf dumpdev="/dev/da0p3" and: # gpart show -p => 40 468862048da0 GPT (224G) 40 32728 - free - (16M) 32768 102400 da0p1 efi (50M) 135168 451809280 da0p2 freebsd-ufs (215G) 45198 16916480 da0p3 freebsd-swap (8.1G) 468860928 1160 - free - (580K) So, the panic may be before dumping is set up, at least for USB3 media. === Mark Millard marklmi at yahoo.com
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
[Gack: I was looking at the wrong vintage of source code, predating your changes: wrong system used.] On Feb 12, 2024, at 10:41, Mark Millard wrote: > On Feb 12, 2024, at 09:32, John Baldwin wrote: > >> On 2/9/24 8:13 PM, Mark Millard wrote: >>> Summary: >>> pcib0: mem >>> 0x7d50-0x7d50930f irq 80,81 on simplebus2 >>> pcib0: parsing FDT for ECAM0: >>> pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 >>> . . . >>> rman_manage_region: request: start 0x6, end >>> 0x6000f >>> panic: Failed to add resource to rman >> >> Hmmm, I suspect this is due to the way that bus_translate_resource works >> which is >> fundamentally broken. It rewrites the start address of a resource in-situ >> instead >> of keeping downstream resources separate from the upstream resources. For >> example, >> I don't see how you could ever release a resource in this design without >> completely >> screwing up your rman. That is, I expect trying to detach a PCI device >> behind a >> translating bridge that uses the current approach should corrupt the >> allocated >> resource ranges in an rman long before my changes. >> >> That said, that doesn't really explain the panic. Hmm, the panic might be >> because >> for PCI bridge windows the driver now passes RF_ACTIVE and the >> bus_translate_resource >> hack only kicks in the activate_resource method of pci_host_generic.c. >> >>> Detail: >>> . . . >>> pcib0: mem >>> 0x7d50-0x7d50930f irq 80,81 on simplebus2 >>> pcib0: parsing FDT for ECAM0: >>> pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 >> >> This indicates this is a translating bus. >> >>> pcib1: irq 91 at device 0.0 on pci0 >>> rman_manage_region: request: start 0x1, end 0x1 >>> pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, >>> count=0x10 >>> rman_reserve_resource_bound: request: [0xc000, >>> 0xc00f], length 0x10, flags 102, device pcib1 >>> rman_reserve_resource_bound: trying 0x <0xc000,0xf> >>> considering [0xc000, 0x] >>> truncated region: [0xc000, 0xc00f]; size 0x10 (requested >>> 0x10) >>> candidate region: [0xc000, 0xc00f], size 0x10 >>> allocating from the beginning >>> rman_manage_region: request: start 0x6, end >>> 0x6000f >> >> The fact that we are trying to reserve the CPU addresses in the rman is >> because >> bus_translate_resource rewrote the start address in the resource after it >> was allocated. >> >> That said, I can't see why rman_manage_region would actually fail. At this >> point the >> rman is empty (this is the first call to rman_manage_region for "pcib1 >> memory window"), >> so only the check that should be failing are the checks against rm_start and >> rm_end. For the memory window, rm_start is always 0, and rm_end is always >> 0x, so both the old (0xc - 0xc00f) and new (0x6000 - >> 0x600fff) >> ranges are within those bounds. I would instead expect to see some other >> issue later >> on where we fail to allocate a resource for a child BAR, but I wouldn't >> expect >> rman_manage_region to fail. >> >> Logging the return value from rman_manage_region would be the first step I >> think >> to see which error value it is returning. > > Looking at the code in sys/kern/subr_rman.c for rman_manage_region I see > the (mostly) return rv related logic only has ENONMEM (explicit return) and > EBUSY as non-0 possibilities (no other returns). The modern code also has EINVAL via an explicit return. > All the rv references and > all the returns are shown below: > >int rv = 0; > . . . The modern code also has here: DPRINTF(("rman_manage_region: <%s> request: start %#jx, end %#jx\n", rm->rm_descr, start, end)); if (start < rm->rm_start || end > rm->rm_end) return EINVAL; Adding rm->rm_start and rm->rm_end to the message might be appropriate --and would also be enough additional information to know if EINVAL would be returned. >r = int_alloc_resource(M_NOWAIT); >if (r == NULL) >return ENOMEM; > . . . >/* Check for any overlap with the current region. */ >if (r->r_start <= s->r_end && r->r_end >= s->r_start) { >rv = EBUSY; >goto out; >} > >/* Check for any overlap with the next region. */ >t = TAILQ_NEXT(s, r_link); >if (t && r->r_start <= t->r_end && r->r_end >= t->r_start) { >rv = EBUSY; >goto out; >} > . . . > out: >mtx_unlock(rm->rm_mtx); >return rv; > > int_alloc_resource failure would be failure (NULL) from the: > >struct resource_i *r; > >r = malloc(sizeof *r, M_RMAN, malloc_flag | M_ZERO); > > (associated with the M_NOWAIT argument). > > The malloc
Re: kernel crash in tcp_subr.c:2386
In message <20240212193044.e089d...@slippy.cwsent.com>, Cy Schubert writes: > In message <625e0ea4-9413-45ad-b05c-500833a1d...@freebsd.org>, > tuexen@freebsd.o > rg writes: > > > On Feb 12, 2024, at 10:36, Alexander Leidinger = > > wrote: > > >=20 > > > Hi, > > >=20 > > > I got a coredump with sources from 2024-02-10-144617 (GMT+0100): > > Hi Alexander, > > > > we are aware of this problem, but haven't found a way to reproduce it. > > Do you know how to reproduce this? > > I've reproduced this by rebooting any one of my machines in my basement. > The other machines will panic as below. > > I've reverted the three tcp timer commits, expecting one of them to be the > cause. Another data point: I build on a build machine and NFS mount /usr/obj on my other machines. Another symptom of this problem is that the NFS share will appear corrupted. And df -htnfs will sometimes not display the mounted NFS share. If not a kernel page fault, random kernel memory can be overwritten resulting in bizarre behaviour prior. -- Cheers, Cy Schubert FreeBSD UNIX: Web: https://FreeBSD.org NTP: Web: https://nwtime.org e^(i*pi)+1=0
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On 2/12/24 13:44, Michael Butler wrote: On 2/12/24 12:36, John Baldwin wrote: [ .. trimmed .. ] Short of a stack trace, you can at least use lldb or gdb to lookup the source line associated with the faulting instruction pointer (as long as it isn't in a kernel module), e.g. for gdb you would use 'gdb /boot/kernel/kernel' and then 'l *', e.g. from above: 'l *0x80acb962' I still didn't manage to get a core but .. does this make any sense in htis context? I apologize .. too many crashes and I grabbed the wrong instruction pointer; this was the most recent attempt. I have no idea why this networking code and PCI configurations are seemingly related :-( (kgdb) l *0x80acbc02 0x80acbc02 is in cc_cong_signal (/usr/src/sys/netinet/tcp_input.c:465). 460 tp->t_flags &= ~TF_PREVVALID; 461 tp->t_badrxtwin = 0; 462 break; 463 } 464 465 if (CC_ALGO(tp)->cong_signal != NULL) { 466 if (th != NULL) 467 tp->t_ccv.curack = th->th_ack; 468 CC_ALGO(tp)->cong_signal(>t_ccv, type); 469 }
Re: kernel crash in tcp_subr.c:2386
In message <625e0ea4-9413-45ad-b05c-500833a1d...@freebsd.org>, tuexen@freebsd.o rg writes: > > On Feb 12, 2024, at 10:36, Alexander Leidinger = > wrote: > >=20 > > Hi, > >=20 > > I got a coredump with sources from 2024-02-10-144617 (GMT+0100): > Hi Alexander, > > we are aware of this problem, but haven't found a way to reproduce it. > Do you know how to reproduce this? I've reproduced this by rebooting any one of my machines in my basement. The other machines will panic as below. I've reverted the three tcp timer commits, expecting one of them to be the cause. > > Best regards > Michael > > ---snip--- > > __curthread () at = > /space/system/usr_src/sys/amd64/include/pcpu_aux.h:57 > > 57 __asm("movq %%gs:%P1,%0" : "=3Dr" (td) : "n" = > (offsetof(struct pcpu, > > (kgdb) #0 __curthread () at = > /space/system/usr_src/sys/amd64/include/pcpu_aux.h:57 > >td =3D > > #1 doadump (textdump=3Dtextdump@entry=3D1) > >at /space/system/usr_src/sys/kern/kern_shutdown.c:403 > >error =3D 0 > >coredump =3D > > #2 0x8052fe85 in kern_reboot (howto=3D260) > >at /space/system/usr_src/sys/kern/kern_shutdown.c:521 > >once =3D 0 > >__pc =3D > > #3 0x80530382 in vpanic ( > >fmt=3D0x808df476 "Assertion %s failed at %s:%d", > >ap=3Dap@entry=3D0xfe08a079ebf0) > >at /space/system/usr_src/sys/kern/kern_shutdownc:973 > >buf =3D "Assertion !callout_active(>t_callout) failed at = > /space/system/usr_src/sys/netinet/tcp_subr.c:2386", '\000' times> > >__pc =3D > >__pc =3D > >__pc =3D > >other_cpus =3D {__bits =3D {14680063, 0 }} > >td =3D 0xf8068ef99740 > >bootopt =3D > >newpanic =3D > > #4 0x805301d3 in panic (fmt=3D) > >at /space/system/usr_src/sys/kern/kern_shutdown.c:889 > >ap =3D {{gp_offset =3D 32, fp_offset =3D 48, > >overflow_arg_area =3D 0xfe08a079ec20, > >reg_save_area =3D 0xfe08a079ebc0}} > > #5 0x806c9d8c in tcp_discardcb = > (tp=3Dtp@entry=3D0xf80af441ba80) > >at /space/system/usr_src/sys/netinet/tcp_subr.c:2386 > >inp =3D 0xf80af441ba80 > >so =3D 0xf804d23d2780 > >m =3D > >isipv6 =3D > > #6 0x806d6291 in tcp_usr_detach (so=3D0xf804d23d2780) > >at /space/system/usr_src/sys/netinet/tcp_usrreq.c:214 > >inp =3D 0xf80af441ba80 > >tp =3D 0xf80af441ba80 > > #7 0x805dba57 in sofree (so=3D0xf804d23d2780) > >at /space/system/usr_src/sys/kern/uipc_socket.c:1205 > >pr =3D 0x80a8bd18 > > #8 sorele_locked (so=3Dso@entry=3D0xf804d23d2780) > >at /space/system/usr_src/sys/kern/uipc_socket.c:1232 > > No locals. > > #9 0x805dc8c0 in soclose (so=3D0xf804d23d2780) > >at /space/system/usr_src/sys/kern/uipc_socket.c:1302 > >lqueue =3D {tqh_first =3D 0xf8068ef99740, > > tqh_last =3D 0xfe08a079ed40} > >error =3D 0 > >saved_vnet =3D 0x0 > >last =3D > >listening =3D > > #10 0x804ccbd1 in fo_close (fp=3D0xf805f2dfc500, = > td=3D) > >at /space/system/usr_src/sys/sys/file.h:390 > > No locals. > > #11 _fdrop (fp=3Dfp@entry=3D0xf805f2dfc500, td=3D, > >td@entry=3D0xf8068ef99740) > >at /space/system/usr_src/sys/kern/kern_descrip.c:3666 > >count =3D > >error =3D > > #12 0x804d02f3 in closef (fp=3Dfp@entry=3D0xf805f2dfc500, > >td=3Dtd@entry=3D0xf8068ef99740) > >at /space/system/usr_src/sys/kern/kern_descrip.c:2839 > >_error =3D 0 > >_fp =3D 0xf805f2dfc500 > >lf =3D {l_start =3D -8791759350504, l_len =3D -8791759350528, = > l_pid =3D 0, > > l_type =3D 0, l_whence =3D 0, l_sysid =3D 0} > >vp =3D > >fdtol =3D > >fdp =3D > > #13 0x804cd50c in closefp_impl (fdp=3D0xfe07afebf860, = > fd=3D19, > >fp=3D0xf805f2dfc500, td=3D0xf8068ef99740, audit=3D out>) > >at /space/system/usr_src/sys/kern/kern_descrip.c:1315 > >error =3D > > #14 closefp (fdp=3D0xfe07afebf860, fd=3D19, fp=3D0xf805f2dfc500,= > > >td=3D0xf8068ef99740, holdleaders=3Dtrue, audit=3D out>) > >at /space/system/usr_src/sys/kern/kern_descrip.c:1372 > > No locals. > > #15 0x808597d6 in syscallenter (td=3D0xf8068ef99740) > >at = > /space/system/usr_src/sys/amd64/amd64/../../kern/subr_syscall.c:186 > >se =3D 0x80a48330 > >p =3D 0xfe07f29995c0 > >sa =3D 0xf8068ef99b30 > >error =3D > >sy_thr_static =3D > >traced =3D > > #16 amd64_syscall (td=3D0xf8068ef99740, traced=3D0) > >at /space/system/usr_src/sys/amd64/amd64/trap.c:1192 > >ksi =3D {ksi_link =3D {tqe_next =3D 0xfe08a079ef30, > >tqe_prev =3D 0x808588af }, ksi_info =3D = > { > >si_signo
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On 2/12/24 12:36, John Baldwin wrote: [ .. trimmed .. ] Short of a stack trace, you can at least use lldb or gdb to lookup the source line associated with the faulting instruction pointer (as long as it isn't in a kernel module), e.g. for gdb you would use 'gdb /boot/kernel/kernel' and then 'l *', e.g. from above: 'l *0x80acb962' I still didn't manage to get a core but .. does this make any sense in htis context? (kgdb) l *0x80acb962 0x80acb962 is in cc_conn_init (/usr/src/sys/amd64/include/counter.h:92). 87 static inline void 88 counter_u64_add(counter_u64_t c, int64_t inc) 89 { 90 91 KASSERT(IS_BSP() || c != EARLY_COUNTER, ("EARLY_COUNTER used on AP")); 92 zpcpu_add(c, inc); 93 } 94 95 #endif /* ! __MACHINE_COUNTER_H__ */ 96 Michael
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On Feb 12, 2024, at 09:32, John Baldwin wrote: > On 2/9/24 8:13 PM, Mark Millard wrote: >> Summary: >> pcib0: mem 0x7d50-0x7d50930f >> irq 80,81 on simplebus2 >> pcib0: parsing FDT for ECAM0: >> pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 >> . . . >> rman_manage_region: request: start 0x6, end >> 0x6000f >> panic: Failed to add resource to rman > > Hmmm, I suspect this is due to the way that bus_translate_resource works > which is > fundamentally broken. It rewrites the start address of a resource in-situ > instead > of keeping downstream resources separate from the upstream resources. For > example, > I don't see how you could ever release a resource in this design without > completely > screwing up your rman. That is, I expect trying to detach a PCI device > behind a > translating bridge that uses the current approach should corrupt the allocated > resource ranges in an rman long before my changes. > > That said, that doesn't really explain the panic. Hmm, the panic might be > because > for PCI bridge windows the driver now passes RF_ACTIVE and the > bus_translate_resource > hack only kicks in the activate_resource method of pci_host_generic.c. > >> Detail: >> . . . >> pcib0: mem 0x7d50-0x7d50930f >> irq 80,81 on simplebus2 >> pcib0: parsing FDT for ECAM0: >> pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 > > This indicates this is a translating bus. > >> pcib1: irq 91 at device 0.0 on pci0 >> rman_manage_region: request: start 0x1, end 0x1 >> pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, >> count=0x10 >> rman_reserve_resource_bound: request: [0xc000, >> 0xc00f], length 0x10, flags 102, device pcib1 >> rman_reserve_resource_bound: trying 0x <0xc000,0xf> >> considering [0xc000, 0x] >> truncated region: [0xc000, 0xc00f]; size 0x10 (requested >> 0x10) >> candidate region: [0xc000, 0xc00f], size 0x10 >> allocating from the beginning >> rman_manage_region: request: start 0x6, end >> 0x6000f > > The fact that we are trying to reserve the CPU addresses in the rman is > because > bus_translate_resource rewrote the start address in the resource after it was > allocated. > > That said, I can't see why rman_manage_region would actually fail. At this > point the > rman is empty (this is the first call to rman_manage_region for "pcib1 memory > window"), > so only the check that should be failing are the checks against rm_start and > rm_end. For the memory window, rm_start is always 0, and rm_end is always > 0x, so both the old (0xc - 0xc00f) and new (0x6000 - > 0x600fff) > ranges are within those bounds. I would instead expect to see some other > issue later > on where we fail to allocate a resource for a child BAR, but I wouldn't expect > rman_manage_region to fail. > > Logging the return value from rman_manage_region would be the first step I > think > to see which error value it is returning. Looking at the code in sys/kern/subr_rman.c for rman_manage_region I see the (mostly) return rv related logic only has ENONMEM (explicit return) and EBUSY as non-0 possibilities (no other returns). All the rv references and all the returns are shown below: int rv = 0; . . r = int_alloc_resource(M_NOWAIT); if (r == NULL) return ENOMEM; . . /* Check for any overlap with the current region. */ if (r->r_start <= s->r_end && r->r_end >= s->r_start) { rv = EBUSY; goto out; } /* Check for any overlap with the next region. */ t = TAILQ_NEXT(s, r_link); if (t && r->r_start <= t->r_end && r->r_end >= t->r_start) { rv = EBUSY; goto out; } . . out: mtx_unlock(rm->rm_mtx); return rv; int_alloc_resource failure would be failure (NULL) from the: struct resource_i *r; r = malloc(sizeof *r, M_RMAN, malloc_flag | M_ZERO); (associated with the M_NOWAIT argument). The malloc failure would likely go in a very different direction. Side note: looks like the EBUSY cases leak what r references. > Probably I should fix pci_host_generic.c to handle translation properly > however. > I can work on a patch for that. === Mark Millard marklmi at yahoo.com
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On 2/12/24 12:36, John Baldwin wrote: On 2/10/24 2:09 PM, Michael Butler wrote: I have stability problems with anything at or after this commit (b377ff8) on an amd64 laptop. While I see the following panic logged, no crash dump is preserved :-( It happens after ~5-6 minutes running in KDE (X). Reverting to 36efc64 seems to work reliably (after ACPI changes but before the problematic PCI one) kernel: Fatal trap 12: page fault while in kernel mode kernel: cpuid = 2; apic id = 02 kernel: fault virtual address = 0x48 kernel: fault code = supervisor read data, page not present kernel: instruction pointer = 0x20:0x80acb962 kernel: stack pointer = 0x28:0xfe00c4318d80 kernel: frame pointer = 0x28:0xfe00c4318d80 kernel: code segment = base 0x0, limit 0xf, type 0x1b kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 kernel: processor eflags = interrupt enabled, resume, IOPL = 0 kernel: current process = 2 (clock (0)) kernel: rdi: f802e460c000 rsi: rdx: 0002 kernel: rcx: r8: 001e r9: fe00c4319000 kernel: rax: 0002 rbx: f802e460c000 rbp: fe00c4318d80 kernel: r10: 1388 r11: 7ffc765d r12: 000f kernel: r13: 0002 r14: f8000193e740 r15: kernel: trap number = 12 kernel: panic: page fault kernel: cpuid = 2 kernel: time = 1707573802 kernel: Uptime: 6m19s kernel: Dumping 942 out of 16242 MB:..2%..11%..21%..31%..41%..51%..62%..72%..82%..92% kernel: Dump complete kernel: Automatic reboot in 15 seconds - press a key on the console to abort Without a stack trace it is pretty much impossible to debug a panic like this. Do you have KDB_TRACE enabled in your kernel config? I'm also not sure how the PCI changes can result in a panic post-boot. If you were going to have problems they would be during device attach, not after you are booted and running X. Short of a stack trace, you can at least use lldb or gdb to lookup the source line associated with the faulting instruction pointer (as long as it isn't in a kernel module), e.g. for gdb you would use 'gdb /boot/kernel/kernel' and then 'l *', e.g. from above: 'l *0x80acb962' I suspect the absence of a core dump was due to my use of tmpfs for /tmp and /var/tmp while still having clear_tmp enabled in rc.conf (that may touch swap on restart). Since then, I've removed tmpfs, everything under /usr/obj am rebuilding from scratch. I'll update when it finally finishes (i5-3340s are quick :-() Michael
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On 2/10/24 2:09 PM, Michael Butler wrote: I have stability problems with anything at or after this commit (b377ff8) on an amd64 laptop. While I see the following panic logged, no crash dump is preserved :-( It happens after ~5-6 minutes running in KDE (X). Reverting to 36efc64 seems to work reliably (after ACPI changes but before the problematic PCI one) kernel: Fatal trap 12: page fault while in kernel mode kernel: cpuid = 2; apic id = 02 kernel: fault virtual address = 0x48 kernel: fault code= supervisor read data, page not present kernel: instruction pointer = 0x20:0x80acb962 kernel: stack pointer = 0x28:0xfe00c4318d80 kernel: frame pointer = 0x28:0xfe00c4318d80 kernel: code segment = base 0x0, limit 0xf, type 0x1b kernel: = DPL 0, pres 1, long 1, def32 0, gran 1 kernel: processor eflags = interrupt enabled, resume, IOPL = 0 kernel: current process = 2 (clock (0)) kernel: rdi: f802e460c000 rsi: rdx: 0002 kernel: rcx: r8: 001e r9: fe00c4319000 kernel: rax: 0002 rbx: f802e460c000 rbp: fe00c4318d80 kernel: r10: 1388 r11: 7ffc765d r12: 000f kernel: r13: 0002 r14: f8000193e740 r15: kernel: trap number = 12 kernel: panic: page fault kernel: cpuid = 2 kernel: time = 1707573802 kernel: Uptime: 6m19s kernel: Dumping 942 out of 16242 MB:..2%..11%..21%..31%..41%..51%..62%..72%..82%..92% kernel: Dump complete kernel: Automatic reboot in 15 seconds - press a key on the console to abort Without a stack trace it is pretty much impossible to debug a panic like this. Do you have KDB_TRACE enabled in your kernel config? I'm also not sure how the PCI changes can result in a panic post-boot. If you were going to have problems they would be during device attach, not after you are booted and running X. Short of a stack trace, you can at least use lldb or gdb to lookup the source line associated with the faulting instruction pointer (as long as it isn't in a kernel module), e.g. for gdb you would use 'gdb /boot/kernel/kernel' and then 'l *', e.g. from above: 'l *0x80acb962' -- John Baldwin
Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic
On 2/9/24 8:13 PM, Mark Millard wrote: Summary: pcib0: mem 0x7d50-0x7d50930f irq 80,81 on simplebus2 pcib0: parsing FDT for ECAM0: pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 . . . rman_manage_region: request: start 0x6, end 0x6000f panic: Failed to add resource to rman Hmmm, I suspect this is due to the way that bus_translate_resource works which is fundamentally broken. It rewrites the start address of a resource in-situ instead of keeping downstream resources separate from the upstream resources. For example, I don't see how you could ever release a resource in this design without completely screwing up your rman. That is, I expect trying to detach a PCI device behind a translating bridge that uses the current approach should corrupt the allocated resource ranges in an rman long before my changes. That said, that doesn't really explain the panic. Hmm, the panic might be because for PCI bridge windows the driver now passes RF_ACTIVE and the bus_translate_resource hack only kicks in the activate_resource method of pci_host_generic.c. Detail: . . . pcib0: mem 0x7d50-0x7d50930f irq 80,81 on simplebus2 pcib0: parsing FDT for ECAM0: pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000 This indicates this is a translating bus. pcib1: irq 91 at device 0.0 on pci0 rman_manage_region: request: start 0x1, end 0x1 pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, count=0x10 rman_reserve_resource_bound: request: [0xc000, 0xc00f], length 0x10, flags 102, device pcib1 rman_reserve_resource_bound: trying 0x <0xc000,0xf> considering [0xc000, 0x] truncated region: [0xc000, 0xc00f]; size 0x10 (requested 0x10) candidate region: [0xc000, 0xc00f], size 0x10 allocating from the beginning rman_manage_region: request: start 0x6, end 0x6000f The fact that we are trying to reserve the CPU addresses in the rman is because bus_translate_resource rewrote the start address in the resource after it was allocated. That said, I can't see why rman_manage_region would actually fail. At this point the rman is empty (this is the first call to rman_manage_region for "pcib1 memory window"), so only the check that should be failing are the checks against rm_start and rm_end. For the memory window, rm_start is always 0, and rm_end is always 0x, so both the old (0xc - 0xc00f) and new (0x6000 - 0x600fff) ranges are within those bounds. I would instead expect to see some other issue later on where we fail to allocate a resource for a child BAR, but I wouldn't expect rman_manage_region to fail. Logging the return value from rman_manage_region would be the first step I think to see which error value it is returning. Probably I should fix pci_host_generic.c to handle translation properly however. I can work on a patch for that. -- John Baldwin
Kernel Panic sys/netinet/tcp_subr.c 2386
Hi, I just got a kernel panic on my 15.0-CURRENT machine, with an Assertion in sys/netinet/tcp_subr.c 2386 full log: https://people.freebsd.org/~wosch/tmp/kernel-panic-tcp_subr-line-2386.png OS: 15.0-CURRENT main-3e9515846f (10-Feb-2024, github.com/freebsd/freebsd-src) Should I worry? -Wolfram -- Wolfram Schneider https://wolfram.schneider.org
Re: kernel crash in tcp_subr.c:2386
> On Feb 12, 2024, at 10:36, Alexander Leidinger > wrote: > > Hi, > > I got a coredump with sources from 2024-02-10-144617 (GMT+0100): Hi Alexander, we are aware of this problem, but haven't found a way to reproduce it. Do you know how to reproduce this? Best regards Michael > ---snip--- > __curthread () at /space/system/usr_src/sys/amd64/include/pcpu_aux.h:57 > 57 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct > pcpu, > (kgdb) #0 __curthread () at > /space/system/usr_src/sys/amd64/include/pcpu_aux.h:57 >td = > #1 doadump (textdump=textdump@entry=1) >at /space/system/usr_src/sys/kern/kern_shutdown.c:403 >error = 0 >coredump = > #2 0x8052fe85 in kern_reboot (howto=260) >at /space/system/usr_src/sys/kern/kern_shutdown.c:521 >once = 0 >__pc = > #3 0x80530382 in vpanic ( >fmt=0x808df476 "Assertion %s failed at %s:%d", >ap=ap@entry=0xfe08a079ebf0) >at /space/system/usr_src/sys/kern/kern_shutdownc:973 >buf = "Assertion !callout_active(>t_callout) failed at > /space/system/usr_src/sys/netinet/tcp_subr.c:2386", '\000' >__pc = >__pc = >__pc = >other_cpus = {__bits = {14680063, 0 }} >td = 0xf8068ef99740 >bootopt = >newpanic = > #4 0x805301d3 in panic (fmt=) >at /space/system/usr_src/sys/kern/kern_shutdown.c:889 >ap = {{gp_offset = 32, fp_offset = 48, >overflow_arg_area = 0xfe08a079ec20, >reg_save_area = 0xfe08a079ebc0}} > #5 0x806c9d8c in tcp_discardcb (tp=tp@entry=0xf80af441ba80) >at /space/system/usr_src/sys/netinet/tcp_subr.c:2386 >inp = 0xf80af441ba80 >so = 0xf804d23d2780 >m = >isipv6 = > #6 0x806d6291 in tcp_usr_detach (so=0xf804d23d2780) >at /space/system/usr_src/sys/netinet/tcp_usrreq.c:214 >inp = 0xf80af441ba80 >tp = 0xf80af441ba80 > #7 0x805dba57 in sofree (so=0xf804d23d2780) >at /space/system/usr_src/sys/kern/uipc_socket.c:1205 >pr = 0x80a8bd18 > #8 sorele_locked (so=so@entry=0xf804d23d2780) >at /space/system/usr_src/sys/kern/uipc_socket.c:1232 > No locals. > #9 0x805dc8c0 in soclose (so=0xf804d23d2780) >at /space/system/usr_src/sys/kern/uipc_socket.c:1302 >lqueue = {tqh_first = 0xf8068ef99740, > tqh_last = 0xfe08a079ed40} >error = 0 >saved_vnet = 0x0 >last = >listening = > #10 0x804ccbd1 in fo_close (fp=0xf805f2dfc500, td=) >at /space/system/usr_src/sys/sys/file.h:390 > No locals. > #11 _fdrop (fp=fp@entry=0xf805f2dfc500, td=, >td@entry=0xf8068ef99740) >at /space/system/usr_src/sys/kern/kern_descrip.c:3666 >count = >error = > #12 0x804d02f3 in closef (fp=fp@entry=0xf805f2dfc500, >td=td@entry=0xf8068ef99740) >at /space/system/usr_src/sys/kern/kern_descrip.c:2839 >_error = 0 >_fp = 0xf805f2dfc500 >lf = {l_start = -8791759350504, l_len = -8791759350528, l_pid = 0, > l_type = 0, l_whence = 0, l_sysid = 0} >vp = >fdtol = >fdp = > #13 0x804cd50c in closefp_impl (fdp=0xfe07afebf860, fd=19, >fp=0xf805f2dfc500, td=0xf8068ef99740, audit=) >at /space/system/usr_src/sys/kern/kern_descrip.c:1315 >error = > #14 closefp (fdp=0xfe07afebf860, fd=19, fp=0xf805f2dfc500, >td=0xf8068ef99740, holdleaders=true, audit=) >at /space/system/usr_src/sys/kern/kern_descrip.c:1372 > No locals. > #15 0x808597d6 in syscallenter (td=0xf8068ef99740) >at /space/system/usr_src/sys/amd64/amd64/../../kern/subr_syscall.c:186 >se = 0x80a48330 >p = 0xfe07f29995c0 >sa = 0xf8068ef99b30 >error = >sy_thr_static = >traced = > #16 amd64_syscall (td=0xf8068ef99740, traced=0) >at /space/system/usr_src/sys/amd64/amd64/trap.c:1192 >ksi = {ksi_link = {tqe_next = 0xfe08a079ef30, >tqe_prev = 0x808588af }, ksi_info = { >si_signo = 1, si_errno = 0, si_code = 2015268872, si_pid = -512, >si_uid = 2398721856, si_status = -2042, >si_addr = 0xfe08a079ef40, si_value = {sival_int = -1602621824, > sival_ptr = 0xfe08a079ee80, sigval_int = -1602621824, > sigval_ptr = 0xfe08a079ee80}, _reason = {_fault = { >_trapno = 1489045984}, _timer = {_timerid = 1489045984, >_overrun = 17999}, _mesgq = {_mqd = 1489045984}, _poll = { >_band = 77306605406688}, _capsicum = {_syscall = 1489045984}, > __spare__ = {__spare1__ = 77306605406688, __spare2__ = { > 1489814048, 17999, 208, 0, 0, 0, 992191072, > ksi_flags = 975329968, ksi_sigq =
kernel crash in tcp_subr.c:2386
Hi, I got a coredump with sources from 2024-02-10-144617 (GMT+0100): ---snip--- __curthread () at /space/system/usr_src/sys/amd64/include/pcpu_aux.h:57 57 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu, (kgdb) #0 __curthread () at /space/system/usr_src/sys/amd64/include/pcpu_aux.h:57 td = #1 doadump (textdump=textdump@entry=1) at /space/system/usr_src/sys/kern/kern_shutdown.c:403 error = 0 coredump = #2 0x8052fe85 in kern_reboot (howto=260) at /space/system/usr_src/sys/kern/kern_shutdown.c:521 once = 0 __pc = #3 0x80530382 in vpanic ( fmt=0x808df476 "Assertion %s failed at %s:%d", ap=ap@entry=0xfe08a079ebf0) at /space/system/usr_src/sys/kern/kern_shutdownc:973 buf = "Assertion !callout_active(>t_callout) failed at /space/system/usr_src/sys/netinet/tcp_subr.c:2386", '\000' __pc = __pc = __pc = other_cpus = {__bits = {14680063, 0 }} td = 0xf8068ef99740 bootopt = newpanic = #4 0x805301d3 in panic (fmt=) at /space/system/usr_src/sys/kern/kern_shutdown.c:889 ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0xfe08a079ec20, reg_save_area = 0xfe08a079ebc0}} #5 0x806c9d8c in tcp_discardcb (tp=tp@entry=0xf80af441ba80) at /space/system/usr_src/sys/netinet/tcp_subr.c:2386 inp = 0xf80af441ba80 so = 0xf804d23d2780 m = isipv6 = #6 0x806d6291 in tcp_usr_detach (so=0xf804d23d2780) at /space/system/usr_src/sys/netinet/tcp_usrreq.c:214 inp = 0xf80af441ba80 tp = 0xf80af441ba80 #7 0x805dba57 in sofree (so=0xf804d23d2780) at /space/system/usr_src/sys/kern/uipc_socket.c:1205 pr = 0x80a8bd18 #8 sorele_locked (so=so@entry=0xf804d23d2780) at /space/system/usr_src/sys/kern/uipc_socket.c:1232 No locals. #9 0x805dc8c0 in soclose (so=0xf804d23d2780) at /space/system/usr_src/sys/kern/uipc_socket.c:1302 lqueue = {tqh_first = 0xf8068ef99740, tqh_last = 0xfe08a079ed40} error = 0 saved_vnet = 0x0 last = listening = #10 0x804ccbd1 in fo_close (fp=0xf805f2dfc500, td=) at /space/system/usr_src/sys/sys/file.h:390 No locals. #11 _fdrop (fp=fp@entry=0xf805f2dfc500, td=, td@entry=0xf8068ef99740) at /space/system/usr_src/sys/kern/kern_descrip.c:3666 count = error = #12 0x804d02f3 in closef (fp=fp@entry=0xf805f2dfc500, td=td@entry=0xf8068ef99740) at /space/system/usr_src/sys/kern/kern_descrip.c:2839 _error = 0 _fp = 0xf805f2dfc500 lf = {l_start = -8791759350504, l_len = -8791759350528, l_pid = 0, l_type = 0, l_whence = 0, l_sysid = 0} vp = fdtol = fdp = #13 0x804cd50c in closefp_impl (fdp=0xfe07afebf860, fd=19, fp=0xf805f2dfc500, td=0xf8068ef99740, audit=) at /space/system/usr_src/sys/kern/kern_descrip.c:1315 error = #14 closefp (fdp=0xfe07afebf860, fd=19, fp=0xf805f2dfc500, td=0xf8068ef99740, holdleaders=true, audit=) at /space/system/usr_src/sys/kern/kern_descrip.c:1372 No locals. #15 0x808597d6 in syscallenter (td=0xf8068ef99740) at /space/system/usr_src/sys/amd64/amd64/../../kern/subr_syscall.c:186 se = 0x80a48330 p = 0xfe07f29995c0 sa = 0xf8068ef99b30 error = sy_thr_static = traced = #16 amd64_syscall (td=0xf8068ef99740, traced=0) at /space/system/usr_src/sys/amd64/amd64/trap.c:1192 ksi = {ksi_link = {tqe_next = 0xfe08a079ef30, tqe_prev = 0x808588af }, ksi_info = { si_signo = 1, si_errno = 0, si_code = 2015268872, si_pid = -512, si_uid = 2398721856, si_status = -2042, si_addr = 0xfe08a079ef40, si_value = {sival_int = -1602621824, sival_ptr = 0xfe08a079ee80, sigval_int = -1602621824, sigval_ptr = 0xfe08a079ee80}, _reason = {_fault = { _trapno = 1489045984}, _timer = {_timerid = 1489045984, _overrun = 17999}, _mesgq = {_mqd = 1489045984}, _poll = { _band = 77306605406688}, _capsicum = {_syscall = 1489045984}, __spare__ = {__spare1__ = 77306605406688, __spare2__ = { 1489814048, 17999, 208, 0, 0, 0, 992191072, ksi_flags = 975329968, ksi_sigq = 0x8082f8f3 } #17 No locals. #18 0x3af13b17fc9a in ?? () No symbol table info available. Backtrace stopped: Cannot access memory at address 0x3af13a225ab8 ---snip--- Any ideas? Due to another issue in userland, I updated to 2024-02-11-212006, but I have the above mentioned version and core still in a BE if needed. Bye, Alexander.
segfault in ld-elf.so.1
Hi, dovecot (and no other program I use on this machine... at least not that I notice it) segfaults in ld-elf.so.1 after an update from 2024-01-18-092730 to 2024-02-10-144617 (and now 2024-02-11-212006 in the hope the issue would have been fixed by changes to libc/libsys since 2024-02-10-144617). The issue shows up when I try to do an IMAP login. A successful authentication starts the imap process which immediately segfaults. I didn't recompile dovecot for the initial update, but I did now to rule out a regression in this area (and to get access via imap do my normal mail account). Backtrace: ---snip--- (lldb) target create "/usr/local/libexec/dovecot/imap" --core "/var/run/dovecot/imap.core" Core file '/var/run/dovecot/imap.core' (x86_64) was loaded. * thread #1, name = 'imap', stop reason = signal SIGSEGV * frame #0: 0x4d3dfa2a4761 ld-elf.so.1`load_object [inlined] object_match_name(obj=0x49a47c203408, name="") at rtld.c:5606:6 frame #1: 0x4d3dfa2a4742 ld-elf.so.1`load_object(name="", fd_u=-1, refobj=0x49a47c228008, flags=0) at rtld.c:2704:10 frame #2: 0x4d3dfa2a3eaa ld-elf.so.1`dlopen_object(name="", fd=-1, refobj=0x49a47c228008, lo_flags=0, mode=1, lockstate=0x1ded0f98cb80) at rtld.c:3747:8 frame #3: 0x4d3dfa2a2274 ld-elf.so.1`symlook_obj [inlined] load_filtee1(obj=, needed=0x49a47c2007c8, flags=, lockstate=) at rtld.c:2576:16 frame #4: 0x4d3dfa2a2245 ld-elf.so.1`symlook_obj [inlined] load_filtees(obj=0x49a47c228008, flags=0, lockstate=0x1ded0f98cb80) at rtld.c:2589:2 frame #5: 0x4d3dfa2a223e ld-elf.so.1`symlook_obj(req=0x1ded011502e0, obj=0x49a47c228008) at rtld.c:4735:6 frame #6: 0x4d3dfa2a6992 ld-elf.so.1`symlook_list(req=0x1ded01150368, objlist=, dlp=0x1ded011504b0) at rtld.c:4637:13 frame #7: 0x4d3dfa2a680b ld-elf.so.1`symlook_global(req=0x1ded01150470, donelist=0x1ded011504b0) at rtld.c:4541:8 frame #8: 0x4d3dfa2a6673 ld-elf.so.1`get_program_var_addr(name=, lockstate=0x1ded0f98cb80) at rtld.c:4483:9 frame #9: 0x4d3dfa2a4374 ld-elf.so.1`dlopen_object [inlined] distribute_static_tls(list=0x1ded01150988, lockstate=0x1ded0f98cb80) at rtld.c:5908:6 frame #10: 0x4d3dfa2a4364 ld-elf.so.1`dlopen_object(name="", fd=-1, refobj=0x49a47c228008, lo_flags=0, mode=1, lockstate=0x1ded0f98cb80) at rtld.c:3831:6 frame #11: 0x4d3dfa2a2274 ld-elf.so.1`symlook_obj [inlined] load_filtee1(obj=, needed=0x49a47c2007c8, flags=, lockstate=) at rtld.c:2576:16 frame #12: 0x4d3dfa2a2245 ld-elf.so.1`symlook_obj [inlined] load_filtees(obj=0x49a47c228008, flags=0, lockstate=0x1ded0f98cb80) at rtld.c:2589:2 frame #13: 0x4d3dfa2a223e ld-elf.so.1`symlook_obj(req=0x1ded01150a80, obj=0x49a47c228008) at rtld.c:4735:6 frame #14: 0x4d3dfa2a6992 ld-elf.so.1`symlook_list(req=0x1ded01150b08, objlist=, dlp=0x1ded01150c50) at rtld.c:4637:13 frame #15: 0x4d3dfa2a680b ld-elf.so.1`symlook_global(req=0x1ded01150c10, donelist=0x1ded01150c50) at rtld.c:4541:8 frame #16: 0x4d3dfa2a6673 ld-elf.so.1`get_program_var_addr(name=, lockstate=0x1ded0f98cb80) at rtld.c:4483:9 frame #17: 0x4d3dfa2a4374 ld-elf.so.1`dlopen_object [inlined] distribute_static_tls(list=0x1ded01151128, lockstate=0x1ded0f98cb80) at rtld.c:5908:6 frame #18: 0x4d3dfa2a4364 ld-elf.so.1`dlopen_object(name="", fd=-1, refobj=0x49a47c228008, lo_flags=0, mode=1, lockstate=0x1ded0f98cb80) at rtld.c:3831:6 frame #19: 0x4d3dfa2a2274 ld-elf.so.1`symlook_obj [inlined] load_filtee1(obj=, needed=0x49a47c2007c8, flags=, lockstate=) at rtld.c:2576:16 frame #20: 0x4d3dfa2a2245 ld-elf.so.1`symlook_obj [inlined] load_filtees(obj=0x49a47c228008, flags=0, lockstate=0x1ded0f98cb80) at rtld.c:2589:2 frame #21: 0x4d3dfa2a223e ld-elf.so.1`symlook_obj(req=0x1ded01151220, obj=0x49a47c228008) at rtldc:4735:6 frame #22: 0x4d3dfa2a6992 ld-elf.so.1`symlook_list(req=0x1ded011512a8, objlist=, dlp=0x1ded011513f0) at rtld.c:4637:13 frame #23: 0x4d3dfa2a680b ld-elf.so.1`symlook_global(req=0x1ded011513b0, donelist=0x1ded011513f0) at rtld.c:4541:8 frame #24: 0x4d3dfa2a6673 ld-elf.so.1`get_program_var_addr(name=, lockstate=0x1ded0f98cb80) at rtld.c:4483:9 frame #25: 0x4d3dfa2a4374 ld-elf.so.1`dlopen_object [inlined] distribute_static_tls(list=0x1ded011518c8, lockstate=0x1ded0f98cb80) at rtld.c:5908:6 frame #26: 0x4d3dfa2a4364 ld-elf.so.1`dlopen_object(name="", fd=-1, refobj=0x49a47c228008, lo_flags=0, mode=1, lockstate=0x1ded0f98cb80) at rtld.c:3831:6 frame #27: 0x4d3dfa2a2274 ld-elf.so.1`symlook_obj [inlined] load_filtee1(obj=, needed=0x49a47c2007c8, flags=, lockstate=) at rtld.c:2576:16 frame #28: 0x4d3dfa2a2245 ld-elf.so.1`symlook_obj [inlined] load_filtees(obj=0x49a47c228008, flags=0,