Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Fri, Jan 29, 2021 at 03:20:32PM -0600, Bjorn Helgaas wrote:
> > For comparison the intel iwlwifi driver is very clear about firmware
> > it's trying to load, if it can't and what exact firmware you need to
> > find on the internet (filename)
>
> I guess you're referring to this in iwl_request_firmware()?
>
>   IWL_ERR(drv, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");

Yes :)

> How can we fix this in nouveau so we don't have to debug this again?
> I don't really know how firmware loading works, but "git grep -A5
> request_firmware drivers/gpu/drm/nouveau/" shows that we generally
> print something when request_firmware() fails.

Well, have a look at https://pastebin.com/dX19aCpj
Do you see any warning whatsoever?

> But I didn't notice those messages in your logs, so I'm probably
> barking up the wrong tree.

You're not.

It seems that newer kernels are a bit better:
[ 189.304662] nouveau :01:00.0: pmu: firmware unavailable
[ 189.312455] nouveau :01:00.0: disp: destroy running...
[ 189.316552] nouveau :01:00.0: disp: destroy completed in 1us
[ 189.320326] nouveau :01:00.0: disp ctor failed, -12
[ 189.324214] nouveau: probe of :01:00.0 failed with error -12

So, it probably got better, but that message got displayed after the 2mn
hang, which having the firmware prevents.

Any developer with the right hardware can probably easily reproduce this
by removing the firmware and looking at the boot messages.

At the very least, it should print something clearer, like "driver will
not function properly", and a URL to where one can get the driver would
be awesome.

> So maybe the wakeups are related to having vs not having the nouveau
> firmware? I'm still curious about that, and it smells like a bug to
> me, but probably something to do with nouveau where I have no hope of
> debugging it.

Right. Honestly, given the time I've lost with this, and now that it
seems gone with the firmware, I'm happy to leave well enough alone :)

I'm not sure how you are involved with the driver, but are you able to
help improve the dmesg output?

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
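Until the driver itself prints a clearer message, a user-side check along these lines might save some debugging time. This is only a sketch: the function name `nouveau_fw_hint` is made up for illustration, and the patterns are taken from the log lines quoted in this message. "pmu: firmware unavailable" is deliberately not used as a signal, since elsewhere in this thread it also shows up in a boot that works.

```shell
#!/bin/sh
# Hypothetical helper (not part of nouveau or any distro tool).
# Reads kernel log text on stdin and reports nouveau constructor/probe
# failures, followed by a hint about where the firmware normally lives.
nouveau_fw_hint() {
    # grep exits non-zero when nothing matches, which is the "all good" case.
    failures=$(grep -E 'nouveau.*(ctor failed|probe of .* failed)') || {
        echo "no nouveau probe failure found"
        return 0
    }
    printf '%s\n' "$failures"
    echo "hint: check that /lib/firmware/nvidia/<chip> exists, also inside the initrd"
    echo "hint: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia"
    return 1
}

# Example use (reading the kernel log may need root):
#   dmesg | nouveau_fw_hint
```

The non-zero return on failure makes it usable from a boot-time check script; whether a missing file is really the cause still has to be confirmed by hand.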
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Thu, Jan 28, 2021 at 04:56:26PM -0800, Marc MERLIN wrote:
> On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> > Hi Marc, I appreciate your persistence on this. I am frankly
> > surprised that you've put up with this so long.
>
> Well, been using linux for 27 years, but also it's not like I have much
> of a choice outside of switching to windows, as tempting as it's getting
> sometimes ;)
>
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this every 2 seconds, mostly forever.
> > >
> > > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or
> > > something else.
> >
> > IIUC there are basically two problems:
> >
> >   1) A 2 minute delay during boot
> > Another random thought: is there any chance the boot delay could be
> > related to crypto waiting for entropy?
>
> So, the 2mn hang went away after I added the nouveau firmware in initrd.
> The only problem is that the nouveau driver does not give a very good
> clue as to what's going on and what to do.
>
> For comparison the intel iwlwifi driver is very clear about firmware
> it's trying to load, if it can't and what exact firmware you need to
> find on the internet (filename)

I guess you're referring to this in iwl_request_firmware()?

  IWL_ERR(drv, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");

How can we fix this in nouveau so we don't have to debug this again?
I don't really know how firmware loading works, but "git grep -A5
request_firmware drivers/gpu/drm/nouveau/" shows that we generally
print something when request_firmware() fails.

But I didn't notice those messages in your logs, so I'm probably
barking up the wrong tree.

> >   2) Some sort of event every 2 seconds that kills your battery life
> > Your machine doesn't sound unusual, and I haven't seen a flood of
> > similar reports, so maybe there's something unusual about your config.
> > But I really don't have any guesses for either one.
>
> Honestly, there are not too many thinkpad P73 running linux out there. I
> wouldn't be surprised if it's only a handful or two.
>
> > It sounds like v5.5 worked fine and you first noticed the slow boot
> > problem in v5.8. We *could* try to bisect it, but I know that's a lot
> > of work on your part.
>
> I've done that in the past, to be honest now that it works after I added
> the firmware that nouveau started needing, and didn't need before, the
> hang at boot is gone for sure.
> The PCI PM wakeup issues on batteries happen sometimes still, but they
> are much more rare now.

So maybe the wakeups are related to having vs not having the nouveau
firmware? I'm still curious about that, and it smells like a bug to
me, but probably something to do with nouveau where I have no hope of
debugging it.

> > Grasping for any ideas for the boot delay; could you boot with
> > "initcall_debug" and collect your "lsmod" output? I notice async_tx
> > in some of your logs, but I have no idea what it is. It's from
> > crypto, so possibly somewhat unusual?
>
> Is this still needed? I think if nouveau does a better job of helping
> the user correct the issue if firmware is missing (I think intel even
> gives a URL in printk), that would probably be what's needed for the
> most part.

Nope, don't bother with this, thanks.

Bjorn
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> Hi Marc, I appreciate your persistence on this. I am frankly
> surprised that you've put up with this so long.

Well, been using linux for 27 years, but also it's not like I have much
of a choice outside of switching to windows, as tempting as it's getting
sometimes ;)

> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this every 2 seconds, mostly forever.
> >
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or
> > something else.
>
> IIUC there are basically two problems:
>
>   1) A 2 minute delay during boot
> Another random thought: is there any chance the boot delay could be
> related to crypto waiting for entropy?

So, the 2mn hang went away after I added the nouveau firmware in initrd.
The only problem is that the nouveau driver does not give a very good
clue as to what's going on and what to do.

For comparison the intel iwlwifi driver is very clear about firmware
it's trying to load, if it can't and what exact firmware you need to
find on the internet (filename)

>   2) Some sort of event every 2 seconds that kills your battery life
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.

Honestly, there are not too many thinkpad P73 running linux out there. I
wouldn't be surprised if it's only a handful or two.

> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8. We *could* try to bisect it, but I know that's a lot
> of work on your part.

I've done that in the past, to be honest now that it works after I added
the firmware that nouveau started needing, and didn't need before, the
hang at boot is gone for sure.
The PCI PM wakeup issues on batteries happen sometimes still, but they
are much more rare now.

> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output? I notice async_tx
> in some of your logs, but I have no idea what it is. It's from
> crypto, so possibly somewhat unusual?

Is this still needed? I think if nouveau does a better job of helping
the user correct the issue if firmware is missing (I think intel even
gives a URL in printk), that would probably be what's needed for the
most part.

[ 12.832547] async_tx: api initialized (async)
comes from ./crypto/async_tx/async_tx.c

Thanks for your answer, let me know if there is anything else useful I
can give, I think I'm otherwise mostly ok now.

Marc
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Wed, Jan 27, 2021 at 03:33:02PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> > This started with 5.5 and hasn't gotten better since then, despite
> > some reports I tried to send.
> >
> > As per my previous message:
> > I have a Thinkpad P70 with hybrid graphics.
> > 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> > that one works fine, I can use i915 for the main screen, and nouveau to
> > display on the external ports (external ports are only wired to nvidia
> > chip, so it's impossible to use them without turning the nvidia chip
> > on).
> >
> > I now got a newer P73 also with the same hybrid graphics (setup as such
> > in the bios). It runs fine with i915, and I don't need to use external
> > display with nouveau for now (it almost works, but I only see the mouse
> > cursor on the external screen, no window or anything else can get
> > displayed, very weird).
> > 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
> >
> >
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this every 2 seconds, mostly forever.
> >
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or
> > something else.
>
> IIUC there are basically two problems:
>
>   1) A 2 minute delay during boot
>   2) Some sort of event every 2 seconds that kills your battery life
>
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.
>
> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8. We *could* try to bisect it, but I know that's a lot
> of work on your part.
>
> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output? I notice async_tx
> in some of your logs, but I have no idea what it is. It's from
> crypto, so possibly somewhat unusual?

Another random thought: is there any chance the boot delay could be
related to crypto waiting for entropy?
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
Hi Marc, I appreciate your persistence on this. I am frankly
surprised that you've put up with this so long.

On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> This started with 5.5 and hasn't gotten better since then, despite
> some reports I tried to send.
>
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
>
>
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this every 2 seconds, mostly forever.
>
> I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or
> something else.

IIUC there are basically two problems:

  1) A 2 minute delay during boot
  2) Some sort of event every 2 seconds that kills your battery life

Your machine doesn't sound unusual, and I haven't seen a flood of
similar reports, so maybe there's something unusual about your config.
But I really don't have any guesses for either one.

It sounds like v5.5 worked fine and you first noticed the slow boot
problem in v5.8. We *could* try to bisect it, but I know that's a lot
of work on your part.

Grasping for any ideas for the boot delay; could you boot with
"initcall_debug" and collect your "lsmod" output? I notice async_tx
in some of your logs, but I have no idea what it is. It's from
crypto, so possibly somewhat unusual?

Bjorn
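For anyone following the "initcall_debug" suggestion: with that boot parameter the kernel logs one "initcall <function> returned <code> after <N> usecs" line per initcall, so a multi-minute stall should stand out once the log is filtered. A rough sketch, assuming that log format; the name `slow_initcalls` and the one-second default threshold are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log (booted with initcall_debug) on
# stdin and print the initcalls that took at least $1 microseconds
# (default 1000000, i.e. one second), slowest first.
slow_initcalls() {
    min_usecs=${1:-1000000}
    awk -v min="$min_usecs" '
        /initcall .* returned .* after [0-9]+ usecs/ {
            # Locate the function name and the duration by their keywords,
            # so the timestamp prefix does not matter.
            for (i = 1; i <= NF; i++) {
                if ($i == "initcall") name = $(i + 1)
                if ($i == "after")    usecs = $(i + 1)
            }
            if (usecs + 0 >= min + 0) print usecs, name
        }' | sort -rn
}

# Example use:
#   dmesg | slow_initcalls 500000
```

This only finds time spent inside initcalls. A stall elsewhere (for example userspace in the initrd waiting on something) would not show up here.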
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Mon, Jan 04, 2021 at 02:28:37PM +0100, Karol Herbst wrote:
> mhh, that PCI config stuff should really not happen all the time, but
> it also doesn't appear to. The other thing I really don't know is, how
> well the runpm works with tools like TLP if there isn't only an audio
> device, but also the USB stuff and all the subdevices have to be
> turned off all the time in order for the GPU to stay powered down.
>
> The firmware stuff is also just a functional problem, so you won't get
> display offloading, but it shouldn't drain your battery as long as
> nothing is connected. I'd check with "grep .
> /sys/bus/pci/devices/*/power/runtime_status" if all subdevices of the
> GPU are powered down, and check which one gets enabled regularly or
> something.

Well, all I can say is that without the firmware, my boot hung 2mn every
single time (I sent details in the logs upthread).

The battery draw issue was inconsistent. I haven't quite found what
triggers it yet.

Marc
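The runtime_status check suggested above can be narrowed to just the functions of the GPU (VGA, audio, USB and so on all share the 01:00 bus/device part of the address). A small sketch under that assumption; `gpu_runtime_status` is a made-up name, and the second argument exists only so the function can be pointed at a fake sysfs tree for testing.

```shell
#!/bin/sh
# Hypothetical helper: print the runtime PM state of every function of one
# PCI device.
#   $1: domain:bus:device prefix, e.g. 0000:01:00
#   $2: sysfs root, defaults to /sys
gpu_runtime_status() {
    prefix=$1
    sysroot=${2:-/sys}
    for f in "$sysroot"/bus/pci/devices/"$prefix".*/power/runtime_status; do
        # Skip the unexpanded glob when nothing matches.
        [ -r "$f" ] || continue
        dev=${f#"$sysroot"/bus/pci/devices/}
        dev=${dev%%/*}
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

# Example use, polling every 2 seconds to see which function wakes up:
#   while sleep 2; do date; gpu_runtime_status 0000:01:00; done
```

If one function keeps flipping back to "active" while the others stay "suspended", that function (or whatever userspace is touching it) would be the first place to look.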
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
mhh, that PCI config stuff should really not happen all the time, but
it also doesn't appear to. The other thing I really don't know is, how
well the runpm works with tools like TLP if there isn't only an audio
device, but also the USB stuff and all the subdevices have to be
turned off all the time in order for the GPU to stay powered down.

The firmware stuff is also just a functional problem, so you won't get
display offloading, but it shouldn't drain your battery as long as
nothing is connected. I'd check with "grep .
/sys/bus/pci/devices/*/power/runtime_status" if all subdevices of the
GPU are powered down, and check which one gets enabled regularly or
something.

On Mon, Jan 4, 2021 at 12:50 PM Marc MERLIN wrote:
>
> On Tue, Dec 29, 2020 at 09:47:50AM -0800, Marc MERLIN wrote:
> > > Of course now that I read your email a bit more carefully, it seems
> > > your issue is with the "saving config space" messages. I'm not sure
> > > I've seen those before. Perhaps you have some sort of debug enabled.
> > > I'd find where in the kernel they are being produced, and what the
> > > conditions for it are. But the failure to load firmware isn't great --
> > > not 100% sure if it impacts runpm or not.
> >
> > Yes, I have 'nouveau.debug=disp=trace'
> > Someone on this list asked me to add this a few months back.
> >
> > > I just double-checked, TU10x accel came in via
> > > afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
> > > Initial TU10x support came in v5.0. So that doesn't line up with your
> > > timeline.
> >
> > You know, I said 5.5, maybe it was 5.6 now, it's been a little while
> > since those issues started.
> >
> > Now we know I was missing the required firmware, it's a good place to
> > start, so I'll start there, thank you very much for the pointers.
>
> Sorry for the delay. I rebooted and everything worked great.
> No hang at boot.
> As for the PME loop I've been seeing, it hasn't happened so far.
>
> I can't comment on whether firmware should be required for the kernel to
> boot properly, but if it's at all possible, please try to make the
> driver fall back or shut down if the firmware is absent as opposed to
> hanging the boot 2mn.
>
> Also some drivers give a better clue that their firmware is missing
> and where to get it from. Adding a printk to help users could be a good
> idea.
>
> Below is the boot with firmware present.
>
> Thanks for your help
> Marc
>
> sauron:~$ grep nouveau /var/log/dmesg
> [ 11.016605] nouveau: detected PR support, will not use DSM
> [ 11.025191] nouveau :01:00.0: runtime IRQ mapping not provided by arch
> [ 11.071823] nouveau :01:00.0: enabling device ( -> 0003)
> [ 11.111588] nouveau :01:00.0: NVIDIA TU104 (164000a1)
> [ 11.203598] nouveau :01:00.0: bios: version 90.04.4d.00.2c
> [ 11.203921] nouveau :01:00.0: pmu: firmware unavailable
> [ 11.204229] nouveau :01:00.0: enabling bus mastering
> [ 11.204543] nouveau :01:00.0: fb: 8192 MiB GDDR6
> [ 11.215524] nouveau :01:00.0: DRM: VRAM: 8192 MiB
> [ 11.215525] nouveau :01:00.0: DRM: GART: 536870912 MiB
> [ 11.215527] nouveau :01:00.0: DRM: BIT table 'A' not found
> [ 11.215527] nouveau :01:00.0: DRM: BIT table 'L' not found
> [ 11.215528] nouveau :01:00.0: DRM: TMDS table version 2.0
> [ 11.215529] nouveau :01:00.0: DRM: DCB version 4.1
> [ 11.215530] nouveau :01:00.0: DRM: DCB outp 00: 02800f66 04600020
> [ 11.215531] nouveau :01:00.0: DRM: DCB outp 01: 02011f52 00020010
> [ 11.215532] nouveau :01:00.0: DRM: DCB outp 02: 01022f36 04600010
> [ 11.215532] nouveau :01:00.0: DRM: DCB outp 03: 04033f76 04600010
> [ 11.215533] nouveau :01:00.0: DRM: DCB outp 04: 04044f86 04600020
> [ 11.215533] nouveau :01:00.0: DRM: DCB conn 00: 00020047
> [ 11.215534] nouveau :01:00.0: DRM: DCB conn 01: 00010161
> [ 11.215534] nouveau :01:00.0: DRM: DCB conn 02: 1248
> [ 11.215535] nouveau :01:00.0: DRM: DCB conn 03: 01000348
> [ 11.215535] nouveau :01:00.0: DRM: DCB conn 04: 02000471
> [ 11.216166] nouveau :01:00.0: DRM: MM: using COPY for buffer copies
> [ 11.526753] nouveau :01:00.0: DRM: unknown connector type 48
> [ 11.527077] nouveau :01:00.0: DRM: unknown connector type 48
> [ 11.552051] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
> [ 11.554239] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
> [ 11.555822] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
> [ 11.556054] [drm] Initialized nouveau 1.3.1 20120801 for :01:00.0 on minor 1
> [ 11.556060] nouveau :01:00.0: DRM: Disabling PCI power management to avoid bug
> [ 18.887229] nouveau :01:00.0: saving config space at offset 0x0 (reading 0x1eb610de)
> [ 18.887231] nouveau :01:00.0: saving config space at offset 0x4 (reading 0x100407)
> [ 18.887233] nouveau :01:00.0: saving config space at offset 0x8 (reading 0x3
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Tue, Dec 29, 2020 at 09:47:50AM -0800, Marc MERLIN wrote:
> > Of course now that I read your email a bit more carefully, it seems
> > your issue is with the "saving config space" messages. I'm not sure
> > I've seen those before. Perhaps you have some sort of debug enabled.
> > I'd find where in the kernel they are being produced, and what the
> > conditions for it are. But the failure to load firmware isn't great --
> > not 100% sure if it impacts runpm or not.
>
> Yes, I have 'nouveau.debug=disp=trace'
> Someone on this list asked me to add this a few months back.
>
> > I just double-checked, TU10x accel came in via
> > afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
> > Initial TU10x support came in v5.0. So that doesn't line up with your
> > timeline.
>
> You know, I said 5.5, maybe it was 5.6 now, it's been a little while
> since those issues started.
>
> Now we know I was missing the required firmware, it's a good place to
> start, so I'll start there, thank you very much for the pointers.

Sorry for the delay. I rebooted and everything worked great.
No hang at boot.
As for the PME loop I've been seeing, it hasn't happened so far.

I can't comment on whether firmware should be required for the kernel to
boot properly, but if it's at all possible, please try to make the
driver fall back or shut down if the firmware is absent as opposed to
hanging the boot 2mn.

Also some drivers give a better clue that their firmware is missing
and where to get it from. Adding a printk to help users could be a good
idea.

Below is the boot with firmware present.

Thanks for your help
Marc

sauron:~$ grep nouveau /var/log/dmesg
[ 11.016605] nouveau: detected PR support, will not use DSM
[ 11.025191] nouveau :01:00.0: runtime IRQ mapping not provided by arch
[ 11.071823] nouveau :01:00.0: enabling device ( -> 0003)
[ 11.111588] nouveau :01:00.0: NVIDIA TU104 (164000a1)
[ 11.203598] nouveau :01:00.0: bios: version 90.04.4d.00.2c
[ 11.203921] nouveau :01:00.0: pmu: firmware unavailable
[ 11.204229] nouveau :01:00.0: enabling bus mastering
[ 11.204543] nouveau :01:00.0: fb: 8192 MiB GDDR6
[ 11.215524] nouveau :01:00.0: DRM: VRAM: 8192 MiB
[ 11.215525] nouveau :01:00.0: DRM: GART: 536870912 MiB
[ 11.215527] nouveau :01:00.0: DRM: BIT table 'A' not found
[ 11.215527] nouveau :01:00.0: DRM: BIT table 'L' not found
[ 11.215528] nouveau :01:00.0: DRM: TMDS table version 2.0
[ 11.215529] nouveau :01:00.0: DRM: DCB version 4.1
[ 11.215530] nouveau :01:00.0: DRM: DCB outp 00: 02800f66 04600020
[ 11.215531] nouveau :01:00.0: DRM: DCB outp 01: 02011f52 00020010
[ 11.215532] nouveau :01:00.0: DRM: DCB outp 02: 01022f36 04600010
[ 11.215532] nouveau :01:00.0: DRM: DCB outp 03: 04033f76 04600010
[ 11.215533] nouveau :01:00.0: DRM: DCB outp 04: 04044f86 04600020
[ 11.215533] nouveau :01:00.0: DRM: DCB conn 00: 00020047
[ 11.215534] nouveau :01:00.0: DRM: DCB conn 01: 00010161
[ 11.215534] nouveau :01:00.0: DRM: DCB conn 02: 1248
[ 11.215535] nouveau :01:00.0: DRM: DCB conn 03: 01000348
[ 11.215535] nouveau :01:00.0: DRM: DCB conn 04: 02000471
[ 11.216166] nouveau :01:00.0: DRM: MM: using COPY for buffer copies
[ 11.526753] nouveau :01:00.0: DRM: unknown connector type 48
[ 11.527077] nouveau :01:00.0: DRM: unknown connector type 48
[ 11.552051] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
[ 11.554239] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
[ 11.555822] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
[ 11.556054] [drm] Initialized nouveau 1.3.1 20120801 for :01:00.0 on minor 1
[ 11.556060] nouveau :01:00.0: DRM: Disabling PCI power management to avoid bug
[ 18.887229] nouveau :01:00.0: saving config space at offset 0x0 (reading 0x1eb610de)
[ 18.887231] nouveau :01:00.0: saving config space at offset 0x4 (reading 0x100407)
[ 18.887233] nouveau :01:00.0: saving config space at offset 0x8 (reading 0x3a1)
[ 18.887235] nouveau :01:00.0: saving config space at offset 0xc (reading 0x80)
[ 18.887237] nouveau :01:00.0: saving config space at offset 0x10 (reading 0xcd00)
[ 18.887239] nouveau :01:00.0: saving config space at offset 0x14 (reading 0xa00c)
[ 18.887241] nouveau :01:00.0: saving config space at offset 0x18 (reading 0x0)
[ 18.887243] nouveau :01:00.0: saving config space at offset 0x1c (reading 0xb00c)
[ 18.887245] nouveau :01:00.0: saving config space at offset 0x20 (reading 0x0)
[ 18.887247] nouveau :01:00.0: saving config space at offset 0x24 (reading 0x2001)
[ 18.887249] nouveau :01:00.0: saving config space at offset 0x28 (reading 0x0)
[ 18.887251] nouveau :01:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[ 18.887253] nouveau :01:00.0: saving config space at
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Tue, Dec 29, 2020 at 11:33:16AM -0500, Ilia Mirkin wrote:
> On Tue, Dec 29, 2020 at 10:52 AM Marc MERLIN wrote:
>
> I'm not extremely familiar with debian packaging, but the firmware is
> provided by NVIDIA and shipped as part of linux-firmware:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

I think it may be firmware-misc-nonfree.

ael
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
(removed other lists, since it's likely not a linux-PCI problem)

On Tue, Dec 29, 2020 at 11:33:16AM -0500, Ilia Mirkin wrote:
> > Sounds like this would be a problem with all chips if userspace is able
> > to wake them up every second or two with a probe. Now I wonder what
> > broken userspace I have that could be doing this.
>
> Well, it's a theory. Some userspace helpfully prevents the GPU from
> suspending entirely by messing with the attached audio device;
> unfortunately I don't remember its name. It's very common and meant to
> help... oh well.

Are you thinking about tlp maybe?
https://linrunner.de/tlp/
I submitted a blacklist patch so that it works ok-ish on my laptop now.
(when the nvidia chip is unhappy, it happily uses 70W on batteries with
1.3h of runtime. When everything is ok, I can go down to about 12W/9H)

> > Do you think that could be a reason why the boot would hang for 2 full
> > minutes at every boot ever since I upgraded to 5.5?
>
> I'd have to check, but I'm guessing TU104 acceleration became a thing
> in 5.5. I would also not be very surprised if the code didn't handle
> failure extremely gracefully - there definitely have been problems
> with that in the past.

Ah, then the timing checks out. That's exciting, at least now I have a
lead as to why I'm having problems. This was the same time a PCI PM
change went in, and I mistakenly thought it was to blame.

> > The kernel module is in my initrd:
> > sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
> > drwxr-xr-x 1 root root 0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> > -rw-r--r-- 1 root root 3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> > 17+1 records in
> > 17+1 records out
> > 52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s
>
> I think that gets you out of "full newbie" land...

:)
(ok, I have been using linux since 1993, but stuff changes so much all
the time, that sometimes I feel like a newbie all over again)
In my days, we didn't complain about systemd vs sysvinit, we had
rc.local and it was good enough :-D

> > Note that ultimately I only need nouveau not to hang my boot 2mn and do
> > PM so that the nvidia chip goes to sleep since I don't use it.
>
> I'm not extremely familiar with debian packaging, but the firmware is
> provided by NVIDIA and shipped as part of linux-firmware:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

Ah, it comes from outside just like intel firmware, thanks.
Also, I was looking for nouveau, not nvidia:

sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep tu104
shows no match

Good news is that debian did package it (they have multiple firmware packages)
sauron:~# dpkggrep firmware | awk '{print $1}' | xargs apt-get install -y
sauron:~# dpkg -S /lib/firmware/nvidia/tu104
firmware-misc-nonfree: /lib/firmware/nvidia/tu104

update-initramfs -v -c -k 5.9.11-amd64-preempt-sysrq-20190817

Ok, I should be in business after next reboot, thank you.

> Of course now that I read your email a bit more carefully, it seems
> your issue is with the "saving config space" messages. I'm not sure
> I've seen those before. Perhaps you have some sort of debug enabled.
> I'd find where in the kernel they are being produced, and what the
> conditions for it are. But the failure to load firmware isn't great --
> not 100% sure if it impacts runpm or not.

Yes, I have 'nouveau.debug=disp=trace'
Someone on this list asked me to add this a few months back.

> I just double-checked, TU10x accel came in via
> afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
> Initial TU10x support came in v5.0. So that doesn't line up with your
> timeline.

You know, I said 5.5, maybe it was 5.6 now, it's been a little while
since those issues started.

Now we know I was missing the required firmware, it's a good place to
start, so I'll start there, thank you very much for the pointers.

Marc
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Tue, Dec 29, 2020 at 10:52 AM Marc MERLIN wrote:
>
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this every 2 seconds, mostly forever.
> >
> > The gpu suspends with runtime pm. And then gets woken up for some
> > reason (could be something quite silly, like lspci, or could be
> > something explicitly checking connectors, etc). Repeat.
>
> Ah, fair point. Could it be powertop even?
> How would I go towards tracing that?
> Sounds like this would be a problem with all chips if userspace is able
> to wake them up every second or two with a probe. Now I wonder what
> broken userspace I have that could be doing this.

Well, it's a theory. Some userspace helpfully prevents the GPU from
suspending entirely by messing with the attached audio device;
unfortunately I don't remember its name. It's very common and meant to
help... oh well.

>
> > Display offload usually requires acceleration -- the copies are done
> > using the DMA engine. Please make sure that you have firmware
> > available (and a new enough mesa). The errors suggest that you don't
> > have firmware available at the time that nouveau loads. Depending on
> > your setup, that might mean the firmware has to be built into the
> > kernel, or available in initramfs. (Or just regular filesystem if you
> > don't use a complicated boot sequence. But many people go with distro
> > defaults, which do have this complexity.)
>
> Hi Ilia, thanks for your answer.
>
> Do you think that could be a reason why the boot would hang for 2 full
> minutes at every boot ever since I upgraded to 5.5?

I'd have to check, but I'm guessing TU104 acceleration became a thing
in 5.5. I would also not be very surprised if the code didn't handle
failure extremely gracefully - there definitely have been problems
with that in the past.

>
> Also, without wanting to sound like a full newbie, where is that
> firmware you're talking about? In my kernel source?
>
> Here's what I do have:
> sauron:/usr/local/bin# dpkggrep nouveau
> libdrm-nouveau2:amd64 install
> xserver-xorg-video-nouveau install
>
> no nouveau-firmware package in debian:
> sauron:/usr/local/bin# apt-cache search nouveau
> bumblebee - NVIDIA Optimus support for Linux
> libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
> xfonts-jmk - Jim Knoble's character-cell fonts for X
> xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver
>
> No firmware file on my disk:
> sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> sauron:/usr/local/bin#
>
> The kernel module is in my initrd:
> sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
> drwxr-xr-x 1 root root 0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> -rw-r--r-- 1 root root 3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> 17+1 records in
> 17+1 records out
> 52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

I think that gets you out of "full newbie" land...

>
> What am I supposed to do/check next?
>
> Note that ultimately I only need nouveau not to hang my boot 2mn and do
> PM so that the nvidia chip goes to sleep since I don't use it.

I'm not extremely familiar with Debian packaging, but the firmware is
provided by NVIDIA and shipped as part of linux-firmware:

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

This needs to be available at /lib/firmware/nvidia when nouveau loads.
Based on your email above, it's most likely that it would load from
the initrd - so make sure it's in there.

Of course now that I read your email a bit more carefully, it seems
your issue is with the "saving config space" messages. I'm not sure
I've seen those before. Perhaps you have some sort of debug enabled.
I'd find where in the kernel they are being produced, and what the
conditions for it are. But the failure to load firmware isn't great --
not 100% sure if it impacts runpm or not.

I just double-checked, TU10x accel came in via
afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
Initial TU10x support came in v5.0. So that doesn't line up with your
timeline.

Anyways, I'd definitely sort the firmware situation out, but it may
not be the cause of your problem.

Cheers,

-ilia
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
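For readers following along, the firmware check described above can be sketched in a few lines of Python. This is only an illustration, not anything from the thread: the function name list_nvidia_firmware is hypothetical, and the chip directory name "tu104" is an assumption based on the TU104 GPU mentioned in this thread and the layout of the nvidia/ tree in linux-firmware. Note that the check looks under nvidia/, since grepping /lib/firmware for "nouveau" would not match these files even if they were installed.

```python
from pathlib import Path


def list_nvidia_firmware(chip, firmware_root="/lib/firmware"):
    """Return firmware files under <firmware_root>/nvidia/<chip>.

    Paths are returned relative to firmware_root, sorted. An empty list
    means the chip directory is missing or contains no files.
    The directory name (e.g. "tu104") is an assumption, not taken from
    the thread.
    """
    root = Path(firmware_root)
    chip_dir = root / "nvidia" / chip
    if not chip_dir.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in chip_dir.rglob("*")
        if p.is_file()  # symlinks to regular files count as files
    )


if __name__ == "__main__":
    found = list_nvidia_firmware("tu104")
    if found:
        print("\n".join(found))
    else:
        print("no firmware found under /lib/firmware/nvidia/tu104")
```

Running it against the root filesystem only covers half of the advice above: if nouveau is loaded from the initrd, the same files would also have to be present in the unpacked initrd, which can be checked by pointing firmware_root at an extracted copy.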
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this every 2 seconds, mostly forever.
>
> The gpu suspends with runtime pm. And then gets woken up for some
> reason (could be something quite silly, like lspci, or could be
> something explicitly checking connectors, etc). Repeat.

Ah, fair point. Could it be powertop even?
How would I go towards tracing that?
Sounds like this would be a problem with all chips if userspace is able
to wake them up every second or two with a probe. Now I wonder what
broken userspace I have that could be doing this.

> Display offload usually requires acceleration -- the copies are done
> using the DMA engine. Please make sure that you have firmware
> available (and a new enough mesa). The errors suggest that you don't
> have firmware available at the time that nouveau loads. Depending on
> your setup, that might mean the firmware has to be built into the
> kernel, or available in initramfs. (Or just regular filesystem if you
> don't use a complicated boot sequence. But many people go with distro
> defaults, which do have this complexity.)

Hi Ilia, thanks for your answer.

Do you think that could be a reason why the boot would hang for 2 full
minutes at every boot ever since I upgraded to 5.5?

Also, without wanting to sound like a full newbie, where is that
firmware you're talking about? In my kernel source?

Here's what I do have:
sauron:/usr/local/bin# dpkggrep nouveau
libdrm-nouveau2:amd64 install
xserver-xorg-video-nouveau install

no nouveau-firmware package in debian:
sauron:/usr/local/bin# apt-cache search nouveau
bumblebee - NVIDIA Optimus support for Linux
libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
xfonts-jmk - Jim Knoble's character-cell fonts for X
xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver

No firmware file on my disk:
sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
sauron:/usr/local/bin#

The kernel module is in my initrd:
sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
drwxr-xr-x 1 root root 0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
-rw-r--r-- 1 root root 3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
17+1 records in
17+1 records out
52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

What am I supposed to do/check next?

Note that ultimately I only need nouveau not to hang my boot 2mn and do
PM so that the nvidia chip goes to sleep since I don't use it.

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08
Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
On Sun, Dec 27, 2020 at 12:03 PM Marc MERLIN wrote:
>
> This started with 5.5 and hasn't gotten better since then, despite some
> reports I tried to send.
>
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

Display offload usually requires acceleration -- the copies are done
using the DMA engine. Please make sure that you have firmware
available (and a new enough mesa). The errors suggest that you don't
have firmware available at the time that nouveau loads. Depending on
your setup, that might mean the firmware has to be built into the
kernel, or available in initramfs. (Or just regular filesystem if you
don't use a complicated boot sequence. But many people go with distro
defaults, which do have this complexity.)

>
>
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this every 2 seconds, mostly forever.

The gpu suspends with runtime pm. And then gets woken up for some
reason (could be something quite silly, like lspci, or could be
something explicitly checking connectors, etc). Repeat.

Cheers,

-ilia
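One way to observe the suspend/wake cycle described above is to read the runtime PM attributes the kernel exposes in sysfs for each PCI function. The sketch below is an illustration only: the function name read_runtime_pm is hypothetical, the device addresses are the ones that appear in this thread, and it assumes the standard power/ attributes (control, runtime_status, runtime_active_time, runtime_suspended_time) are present on the running kernel. Reading these attributes should not by itself resume the device, unlike a config-space read such as lspci, but treat that as an assumption to verify.

```python
from pathlib import Path

# Runtime PM attributes read from <device>/power/ in sysfs.
PM_ATTRS = (
    "control",
    "runtime_status",
    "runtime_active_time",
    "runtime_suspended_time",
)


def read_runtime_pm(device, sysfs_root="/sys/bus/pci/devices"):
    """Return a dict of runtime PM attributes for one PCI device.

    device is a full PCI address such as "0000:01:00.3". Attributes
    that cannot be read are reported as "unavailable" instead of
    raising, so the function also works on kernels lacking some files.
    """
    power_dir = Path(sysfs_root) / device / "power"
    result = {}
    for name in PM_ATTRS:
        try:
            result[name] = (power_dir / name).read_text().strip()
        except OSError:
            result[name] = "unavailable"
    return result


if __name__ == "__main__":
    # Addresses taken from this thread: the GPU and its nvidia-gpu function.
    for dev in ("0000:01:00.0", "0000:01:00.3"):
        print(dev, read_runtime_pm(dev))
```

Calling this periodically and comparing runtime_status and the two time counters between calls would show whether the device really alternates between suspended and active every couple of seconds; it does not identify which process triggers the wakeup.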
[Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)
This started with 5.5 and hasn't gotten better since then, despite some
reports I tried to send.

As per my previous message:
I have a Thinkpad P70 with hybrid graphics.
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
that one works fine, I can use i915 for the main screen, and nouveau to
display on the external ports (external ports are only wired to nvidia
chip, so it's impossible to use them without turning the nvidia chip
on).

I now got a newer P73 also with the same hybrid graphics (setup as such
in the bios). It runs fine with i915, and I don't need to use external
display with nouveau for now (it almost works, but I only see the mouse
cursor on the external screen, no window or anything else can get
displayed, very weird).
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

after boot, when it gets the right trigger (not sure which ones), it
loops on this every 2 seconds, mostly forever.

I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or
something else.
Boot hangs look like this:
[ 10.659209] Console: switching to colour frame buffer device 240x67
[ 10.732353] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[ 12.101203] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
[ 12.101212] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
[ 12.101217] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
[ 12.101223] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x80)
[ 12.101228] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
[ 12.101234] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
[ 12.101239] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
[ 12.101244] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
[ 12.101249] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
[ 12.101254] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
[ 12.101259] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
[ 12.101265] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
[ 12.101270] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
[ 12.101275] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
[ 12.101280] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
[ 12.101285] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
[ 12.101333] nvidia-gpu 0000:01:00.3: PME# enabled
[ 25.151246] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
[ 25.151260] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
[ 25.151265] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x886)
[ 25.151270] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
[ 25.151276] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc10)
[ 25.151281] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc14)
[ 25.151286] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
[ 25.151291] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
[ 25.151296] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
[ 25.151301] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
[ 25.151306] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
[ 25.151311] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[ 25.151316] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
[ 25.151322] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
[ 25.151327] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
[ 25.151332] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
[ 25.151416] thunderbolt 0000:06:00.0: PME# enabled
[ 25.169204] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[ 25.169214] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
[ 25.169219] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
[ 25.169224] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
[ 25.169229] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
[ 25.169233] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[ 25.169238] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
[ 25.1692