Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window
Hi all,

we are still affected by this issue from 2019 on 5.10. For example, when setting "cma=256M" on a phycore imx6q with 2G RAM we get:

[ 12.573276] etnaviv etnaviv: command buffer outside valid memory window
[ 12.616460] etnaviv etnaviv: command buffer outside valid memory window
[ 12.662517] etnaviv etnaviv: command buffer outside valid memory window
[ 12.714859] etnaviv etnaviv: command buffer outside valid memory window

On the other hand, when we set "cma=128M" this doesn't happen.

For now, we were able to get around this issue by applying Lucas' patches:
"[PATCH 1/2] mm: cma: export functions to get CMA base and size"
"[PATCH 2/2] drm/etnaviv: use CMA area to compute linear window offset if possible"

However, those didn't get accepted into mainline. Has there been any progress on this? Any tips on how to properly fix this in mainline?

BR,
Primoz

On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> While updating my various systems for the TCP SACK issue, I notice
> that while most platforms are happy, the Cubox-i4 is not. During
> boot, we get:
>
> [0.00] cma: Reserved 256 MiB at 0x3000
> ...
> [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> ...
> [ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
> [ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window

Yes, that's a regression due to different default CMA area placement and etnaviv not being smart enough to move the linear window to the right offset.

Patches to fix this (but have rightfully been shot down, due to layering violations) are "[PATCH 1/2] mm: cma: export functions to get CMA base and size" and "[PATCH 2/2] drm/etnaviv: use CMA area to compute linear window offset if possible".

> and shortly after the login prompt appears, the entire SoC appears to
> lock up - it becomes unresponsive on the network, or via serial console
> to sysrq requests.
>
> I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> as a result of the above two etnaviv errors when Xorg attempts to start
> using the GPU.

This should not be possible. The driver notices that the command buffer isn't accessible to the GPU, which aborts the GPU init. While the etnaviv DRM device is still accessible, it will not expose any enumerable GPU cores to userspace. So there is no way for userspace to actually submit GPU commands.

Regards,
Lucas

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window
On Thu, Jun 27, 2019 at 04:49:30PM +0200, Lucas Stach wrote:
> On Thursday, 27.06.2019 at 15:32 +0100, Russell King - ARM Linux admin wrote:
> > On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin wrote:
> > > On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > > > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > > > While updating my various systems for the TCP SACK issue, I notice
> > > > > that while most platforms are happy, the Cubox-i4 is not. During
> > > > > boot, we get:
> > > > >
> > > > > [0.00] cma: Reserved 256 MiB at 0x3000
> > > > > ...
> > > > > [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> > > > > [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> > > > > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> > > > > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> > > > > ...
> > > > > [ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
> > > > > [ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window
> > > >
> > > > Yes, that's a regression due to different default CMA area placement
> > > > and etnaviv not being smart enough to move the linear window to the
> > > > right offset.
> > >
> > > As it's a user visible regression, it needs fixing, either by reverting
> > > the changes that caused it or by some other issue. In the kernel, the
> > > policy is "if a bug fix causes a regression, the bug fix was itself
> > > wrong". We don't fix one person's bug if it causes a regression for
> > > someone else.
> > >
> > > Please resolve the acknowledged regression.
>
> The regression is caused due to a different CMA placement, which is
> outside of the control of etnaviv. If you can point to the commit
> causing this change in placement we could work with the
> authors/maintainers of this code to get rid of the regression.
> Currently I don't have the bandwidth to pinpoint the offending code
> change.

Ok, thanks for the explanation.

Well, the problem has become weirder. I'm unable to reproduce the hang now - the only change has been to add your patch for the unload issue, as well as temporarily disabling lightdm's startup at boot (which is now back as it was.)

Odd.

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up
Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window
On Thursday, 27.06.2019 at 15:32 +0100, Russell King - ARM Linux admin wrote:
> On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin wrote:
> > On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > > While updating my various systems for the TCP SACK issue, I notice
> > > > that while most platforms are happy, the Cubox-i4 is not. During
> > > > boot, we get:
> > > >
> > > > [0.00] cma: Reserved 256 MiB at 0x3000
> > > > ...
> > > > [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> > > > [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> > > > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> > > > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> > > > ...
> > > > [ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
> > > > [ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window
> > >
> > > Yes, that's a regression due to different default CMA area placement
> > > and etnaviv not being smart enough to move the linear window to the
> > > right offset.
> >
> > As it's a user visible regression, it needs fixing, either by reverting
> > the changes that caused it or by some other issue. In the kernel, the
> > policy is "if a bug fix causes a regression, the bug fix was itself
> > wrong". We don't fix one person's bug if it causes a regression for
> > someone else.
> >
> > Please resolve the acknowledged regression.

The regression is caused due to a different CMA placement, which is outside of the control of etnaviv. If you can point to the commit causing this change in placement we could work with the authors/maintainers of this code to get rid of the regression. Currently I don't have the bandwidth to pinpoint the offending code change.

> > > > and shortly after the login prompt appears, the entire SoC appears to
> > > > lock up - it becomes unresponsive on the network, or via serial console
> > > > to sysrq requests.
> > > >
> > > > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > > > as a result of the above two etnaviv errors when Xorg attempts to start
> > > > using the GPU.
> > >
> > > This should not be possible. The driver notices that the command buffer
> > > isn't accessible to the GPU, which aborts the GPU init. While the
> > > etnaviv DRM device is still accessible, it will not expose any
> > > enumerable GPU cores to userspace. So there is no way for userspace to
> > > actually submit GPU commands.
> >
> > Yep, I came to that conclusion. Nevertheless, if I allow Xorg to start
> > with 5.1, the system totally hangs shortly thereafter. I need to try
> > without etnaviv loaded at all.
>
> Well, it seems to get worse. I just tried to unload etnaviv, and was
> greeted by this oops. It's another regression; etnaviv used to unload
> perfectly fine. Please can you add module unload testing to your
> workflow?

As you can see from the patch I've just sent, this is a missing error cleanup. So it's really the same regression. A module unload after successful init of all GPU cores doesn't show this crash. The issue is only unmasked due to the CMA placement regression.

Regards,
Lucas
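The "missing error cleanup" class of bug mentioned in this message can be illustrated with a small stand-alone model. This is only a sketch: it is not the etnaviv code and not the patch referred to above, and every name in it (struct suballoc, struct gpu, gpu_init, gpu_unbind) is hypothetical. It shows an init path that fails after a resource was created, releases it, and leaves the state such that a later teardown (as on module unload) is safe.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not the real etnaviv structures. */
struct suballoc { void *mem; };
struct gpu { struct suballoc *cmdbuf; bool initialised; };

struct suballoc *suballoc_new(void)
{
	struct suballoc *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->mem = malloc(64);
	if (!s->mem) {
		free(s);
		return NULL;
	}
	return s;
}

/* Safe to call with NULL, so teardown after a failed init is harmless. */
void suballoc_destroy(struct suballoc *s)
{
	if (!s)
		return;
	free(s->mem);
	free(s);
}

/* buffer_in_window models the outcome of the memory window check. */
int gpu_init(struct gpu *gpu, bool buffer_in_window)
{
	gpu->cmdbuf = suballoc_new();
	if (!gpu->cmdbuf)
		return -1;

	if (!buffer_in_window)
		goto err_destroy;

	gpu->initialised = true;
	return 0;

err_destroy:
	/* Release what init created and clear the pointer, so that a
	 * later unbind neither double-frees nor dereferences stale state. */
	suballoc_destroy(gpu->cmdbuf);
	gpu->cmdbuf = NULL;
	return -1;
}

void gpu_unbind(struct gpu *gpu)
{
	suballoc_destroy(gpu->cmdbuf);
	gpu->cmdbuf = NULL;
	gpu->initialised = false;
}
```

In this model, calling gpu_unbind() after a failed gpu_init() is a no-op rather than a crash, which is the property the unload path needs when init was aborted.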
Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window
On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin wrote:
> On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > While updating my various systems for the TCP SACK issue, I notice
> > > that while most platforms are happy, the Cubox-i4 is not. During
> > > boot, we get:
> > >
> > > [0.00] cma: Reserved 256 MiB at 0x3000
> > > ...
> > > [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> > > [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> > > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> > > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> > > ...
> > > [ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
> > > [ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window
> >
> > Yes, that's a regression due to different default CMA area placement
> > and etnaviv not being smart enough to move the linear window to the
> > right offset.
>
> As it's a user visible regression, it needs fixing, either by reverting
> the changes that caused it or by some other issue. In the kernel, the
> policy is "if a bug fix causes a regression, the bug fix was itself
> wrong". We don't fix one person's bug if it causes a regression for
> someone else.
>
> Please resolve the acknowledged regression.
>
> > > and shortly after the login prompt appears, the entire SoC appears to
> > > lock up - it becomes unresponsive on the network, or via serial console
> > > to sysrq requests.
> > >
> > > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > > as a result of the above two etnaviv errors when Xorg attempts to start
> > > using the GPU.
> >
> > This should not be possible. The driver notices that the command buffer
> > isn't accessible to the GPU, which aborts the GPU init. While the
> > etnaviv DRM device is still accessible, it will not expose any
> > enumerable GPU cores to userspace. So there is no way for userspace to
> > actually submit GPU commands.
>
> Yep, I came to that conclusion. Nevertheless, if I allow Xorg to start
> with 5.1, the system totally hangs shortly thereafter. I need to try
> without etnaviv loaded at all.

Well, it seems to get worse. I just tried to unload etnaviv, and was greeted by this oops. It's another regression; etnaviv used to unload perfectly fine. Please can you add module unload testing to your workflow?

Unable to handle kernel NULL pointer dereference at virtual address 0008
pgd = da59c000
[0008] *pgd=8fc0f831
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in: ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_owner xt_multiport iptable_filter ip_tables x_tables bnep rfcomm bluetooth ecdh_generic nfsd rc_cec snd_soc_fsl_spdif nvmem_imx_ocotp imx_pcm_dma imx_sdma virt_dma coda v4l2_mem2mem imx_vdoa dw_hdmi_ahb_audio dw_hdmi_cec videobuf2_dma_contig etnaviv(-) gpu_sched imx_thermal snd_soc_imx_spdif imx6q_cpufreq caamrng caam_jr caam error
CPU: 1 PID: 2898 Comm: rmmod Not tainted 5.1.0+ #319
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
PC is at etnaviv_iommu_put_suballoc_va+0x10/0x68 [etnaviv]
LR is at etnaviv_cmdbuf_suballoc_destroy+0x20/0x48 [etnaviv]
pc : []lr : []psr: a00f0013
sp : d9f2be40 ip : 01b0 fp :
r10: 0081 r9 : d9f2a000 r8 : c00091c4
r7 : dc993800 r6 : r5 : dd4c6810 r4 :
r3 : b00c r2 : 0004 r1 : dd4c6810 r0 : dc991840
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 2a59c04a DAC: 0051
Process rmmod (pid: 2898, stack limit = 0xd9f2a218)
Stack: (0xd9f2be40 to 0xd9f2c000)
be40: dd4c6800 dd5e9b40 bf04a664 dd5e9b40
be60: dc991840 bf04e4d0 bf04e458 dd5e93c0 dd5e9b40 c04aa2e0 0018 dc993800
be80: c00091c4 dd5e9b40 0001 c04aa3b4 dc993800 dd0f9410 dd5a4000
bea0: bf04a97c dd5e9b40 dd0f9410 bf05295c c04aa9bc dd5e9b40 c04aaf6c
bec0: dd0f9410 bf055260 bf04a950 bf04a93c c04b1f00 c04b1edc dd0f9410
bee0: c04b0798 c0c493a8 de8af44c dd0f9410 c0c493a8 c0c49408 c04af450
bf00: dd0f9444 dd0f9410 000120a8 c04ac02c c0bf5f44 bec80600 d9f2bf30 c142e46c
bf20: dd0f9400 dd0f9400 000120a8 0081 c00091c4 c04b2718 bf058390 dd0f9400
bf40: bec80600 c04b2790 bf056140 bf0528c4 bf0528b4 c00d6710 d9f2bf80 616e7465
bf60: 00766976 ddf7b4d8 b6ef5000 0001 c0196490 0001
bf80: d9f2bf80 d9f2bf80 0095d008 005b bec805f4 0880
bfa0: bec80600 c0009000 0880
Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window
On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > While updating my various systems for the TCP SACK issue, I notice
> > that while most platforms are happy, the Cubox-i4 is not. During
> > boot, we get:
> >
> > [0.00] cma: Reserved 256 MiB at 0x3000
> > ...
> > [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> > [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> > ...
> > [ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
> > [ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window
>
> Yes, that's a regression due to different default CMA area placement
> and etnaviv not being smart enough to move the linear window to the
> right offset.

As it's a user visible regression, it needs fixing, either by reverting the changes that caused it or by some other issue. In the kernel, the policy is "if a bug fix causes a regression, the bug fix was itself wrong". We don't fix one person's bug if it causes a regression for someone else.

Please resolve the acknowledged regression.

> > and shortly after the login prompt appears, the entire SoC appears to
> > lock up - it becomes unresponsive on the network, or via serial console
> > to sysrq requests.
> >
> > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > as a result of the above two etnaviv errors when Xorg attempts to start
> > using the GPU.
>
> This should not be possible. The driver notices that the command buffer
> isn't accessible to the GPU, which aborts the GPU init. While the
> etnaviv DRM device is still accessible, it will not expose any
> enumerable GPU cores to userspace. So there is no way for userspace to
> actually submit GPU commands.

Yep, I came to that conclusion. Nevertheless, if I allow Xorg to start with 5.1, the system totally hangs shortly thereafter. I need to try without etnaviv loaded at all.
Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window
On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> While updating my various systems for the TCP SACK issue, I notice
> that while most platforms are happy, the Cubox-i4 is not. During
> boot, we get:
>
> [0.00] cma: Reserved 256 MiB at 0x3000
> ...
> [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
> [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
> ...
> [ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
> [ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window

Yes, that's a regression due to different default CMA area placement and etnaviv not being smart enough to move the linear window to the right offset.

Patches to fix this (which have rightfully been shot down due to layering violations) are "[PATCH 1/2] mm: cma: export functions to get CMA base and size" and "[PATCH 2/2] drm/etnaviv: use CMA area to compute linear window offset if possible".

> and shortly after the login prompt appears, the entire SoC appears to
> lock up - it becomes unresponsive on the network, or via serial console
> to sysrq requests.
>
> I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> as a result of the above two etnaviv errors when Xorg attempts to start
> using the GPU.

This should not be possible. The driver notices that the command buffer isn't accessible to the GPU, which aborts the GPU init. While the etnaviv DRM device is still accessible, it will not expose any enumerable GPU cores to userspace. So there is no way for userspace to actually submit GPU commands.

Regards,
Lucas
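To make the "command buffer outside valid memory window" condition concrete, here is a minimal stand-alone sketch of a window check. It is not the driver's implementation. It assumes a 2 GiB linear window that starts at a programmable base, and the addresses in the self-check are purely illustrative (the boot log in this thread is truncated, so they are not taken from it). The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the GPU's linear window spans 2 GiB
 * starting at a programmable base address. */
#define LINEAR_WINDOW_SIZE UINT64_C(0x80000000)

/* Returns true if [addr, addr + size) lies entirely inside the window
 * that starts at window_base. */
bool in_linear_window(uint64_t window_base, uint64_t addr, uint64_t size)
{
	if (addr < window_base)
		return false;

	uint64_t offset = addr - window_base;

	return offset < LINEAR_WINDOW_SIZE &&
	       size <= LINEAR_WINDOW_SIZE - offset;
}

/* Illustrative values only: the same buffer address is reachable with
 * one window base and unreachable with another. */
void linear_window_selfcheck(void)
{
	const uint64_t buf = UINT64_C(0x30000000);	/* hypothetical CMA buffer */

	assert(in_linear_window(UINT64_C(0x10000000), buf, 0x1000));
	assert(!in_linear_window(UINT64_C(0x80000000), buf, 0x1000));
}
```

The point of the sketch is that nothing about the buffer itself changes between the two cases. Whether the check passes depends only on where the CMA area was placed relative to the window base, which is why a change in default CMA placement can surface as a driver init failure.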
[REGRESSION] drm/etnaviv: command buffer outside valid memory window
While updating my various systems for the TCP SACK issue, I notice that while most platforms are happy, the Cubox-i4 is not. During boot, we get:

[0.00] cma: Reserved 256 MiB at 0x3000
...
[0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash resume=/dev/sda1
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K cma-reserved, 1310720K highmem)
...
[ 13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory window
[ 13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory window

and shortly after the login prompt appears, the entire SoC appears to lock up - it becomes unresponsive on the network, or via serial console to sysrq requests.

I suspect the GPU ends up scribbling over the CPU's vector page/kernel as a result of the above two etnaviv errors when Xorg attempts to start using the GPU.

This used to work, so it's a regression.