Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window

2021-04-26 Thread Primoz Fiser

Hi all,

We are still affected by this 2019 issue on kernel 5.10.

For example, when setting "cma=256M" on a phycore imx6q with 2G of RAM, we get:


[   12.573276] etnaviv etnaviv: command buffer outside valid memory window
[   12.616460] etnaviv etnaviv: command buffer outside valid memory window
[   12.662517] etnaviv etnaviv: command buffer outside valid memory window
[   12.714859] etnaviv etnaviv: command buffer outside valid memory window


On the other hand, when we set "cma=128M" this doesn't happen.

For now, we were able to get around this issue by applying Lucas' patches:


"[PATCH 1/2] mm: cma: export functions to get CMA base and size"
"[PATCH 2/2] drm/etnaviv: use CMA area to compute linear window offset if possible"


However, it seems those didn't get accepted into mainline?

Has there been any progress on this?

Any tips on how to properly fix this in mainline?

BR,

Primoz



On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> While updating my various systems for the TCP SACK issue, I notice
> that while most platforms are happy, the Cubox-i4 is not.  During
> boot, we get:
> 
> [0.00] cma: Reserved 256 MiB at 0x3000
> ...
> [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 
> video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash 
> resume=/dev/sda1
> [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
> [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K 
> rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K 
> cma-reserved, 1310720K highmem)
> ...
> [   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory 
> window
> [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory 
> window

Yes, that's a regression due to different default CMA area placement
and etnaviv not being smart enough to move the linear window to the
right offset.

Patches to fix this (which have rightfully been shot down, due to
layering violations) are "[PATCH 1/2] mm: cma: export functions to get
CMA base and size" and "[PATCH 2/2] drm/etnaviv: use CMA area to
compute linear window offset if possible".

> and shortly after the login prompt appears, the entire SoC appears to
> lock up - it becomes unresponsive on the network, or via serial console
> to sysrq requests.
> 
> I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> as a result of the above two etnaviv errors when Xorg attempts to start
> using the GPU.

This should not be possible. The driver notices that the command buffer
isn't accessible to the GPU, which aborts the GPU init. While the
etnaviv DRM device is still accessible, it will not expose any
enumerable GPU cores to userspace. So there is no way for userspace to
actually submit GPU commands.

Regards,
Lucas


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window

2019-06-27 Thread Russell King - ARM Linux admin
On Thu, Jun 27, 2019 at 04:49:30PM +0200, Lucas Stach wrote:
> On Thursday, 27.06.2019 at 15:32 +0100, Russell King - ARM Linux admin wrote:
> > On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin 
> > wrote:
> > > On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > > > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > > > While updating my various systems for the TCP SACK issue, I notice
> > > > > that while most platforms are happy, the Cubox-i4 is not.  During
> > > > > boot, we get:
> > > > > 
> > > > > [0.00] cma: Reserved 256 MiB at 0x3000
> > > > > ...
> > > > > [0.00] Kernel command line: console=ttymxc0,115200n8 
> > > > > console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M 
> > > > > ahci_imx.hotplug=1 splash resume=/dev/sda1
> > > > > [0.00] Dentry cache hash table entries: 131072 (order: 7, 
> > > > > 524288 bytes)
> > > > > [0.00] Inode-cache hash table entries: 65536 (order: 6, 
> > > > > 262144 bytes)
> > > > > [0.00] Memory: 1790972K/2097152K available (8471K kernel 
> > > > > code, 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K 
> > > > > reserved, 262144K cma-reserved, 1310720K highmem)
> > > > > ...
> > > > > [   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid 
> > > > > memory window
> > > > > [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid 
> > > > > memory window
> > > > 
> > > > Yes, that's a regression due to different default CMA area placement
> > > > and etnaviv not being smart enough to move the linear window to the
> > > > right offset.
> > > 
> > > As it's a user visible regression, it needs fixing, either by reverting
> > > the changes that caused it or by some other issue.  In the kernel, the
> > > policy is "if a bug fix causes a regression, the bug fix was itself
> > > wrong".  We don't fix one person's bug if it causes a regression for
> > > someone else.
> > > 
> > > Please resolve the acknowledged regression.
> 
> The regression is caused due to a different CMA placement, which is
> outside of the control of etnaviv. If you can point to the commit
> causing this change in placement we could work with the
> authors/maintainers of this code to get rid of the regression.
> Currently I don't have the bandwidth to pinpoint the offending code
> change.

Ok, thanks for the explanation.

Well, the problem has become weirder.  I'm unable to reproduce the hang
now - the only change has been to add your patch for the unload issue,
as well as temporarily disabling lightdm's startup at boot (which is
now back as it was.)  Odd.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window

2019-06-27 Thread Lucas Stach
On Thursday, 27.06.2019 at 15:32 +0100, Russell King - ARM Linux admin wrote:
> On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin 
> wrote:
> > On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > > While updating my various systems for the TCP SACK issue, I notice
> > > > that while most platforms are happy, the Cubox-i4 is not.  During
> > > > boot, we get:
> > > > 
> > > > [0.00] cma: Reserved 256 MiB at 0x3000
> > > > ...
> > > > [0.00] Kernel command line: console=ttymxc0,115200n8 
> > > > console=tty1 video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M 
> > > > ahci_imx.hotplug=1 splash resume=/dev/sda1
> > > > [0.00] Dentry cache hash table entries: 131072 (order: 7, 
> > > > 524288 bytes)
> > > > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 
> > > > bytes)
> > > > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 
> > > > 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 
> > > > 262144K cma-reserved, 1310720K highmem)
> > > > ...
> > > > [   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid 
> > > > memory window
> > > > [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid 
> > > > memory window
> > > 
> > > Yes, that's a regression due to different default CMA area placement
> > > and etnaviv not being smart enough to move the linear window to the
> > > right offset.
> > 
> > As it's a user visible regression, it needs fixing, either by reverting
> > the changes that caused it or by some other issue.  In the kernel, the
> > policy is "if a bug fix causes a regression, the bug fix was itself
> > wrong".  We don't fix one person's bug if it causes a regression for
> > someone else.
> > 
> > Please resolve the acknowledged regression.

The regression is caused by a different CMA placement, which is
outside of the control of etnaviv. If you can point to the commit
causing this change in placement we could work with the
authors/maintainers of this code to get rid of the regression.
Currently I don't have the bandwidth to pinpoint the offending code
change.

> > > > and shortly after the login prompt appears, the entire SoC appears to
> > > > lock up - it becomes unresponsive on the network, or via serial console
> > > > to sysrq requests.
> > > > 
> > > > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > > > as a result of the above two etnaviv errors when Xorg attempts to start
> > > > using the GPU.
> > > 
> > > This should not be possible. The driver notices that the command buffer
> > > isn't accessible to the GPU, which aborts the GPU init. While the
> > > etnaviv DRM device is still accessible, it will not expose any
> > > enumerable GPU cores to userspace. So there is no way for userspace to
> > > actually submit GPU commands.
> > 
> > Yep, I came to that conclusion.  Nevertheless, if I allow Xorg to start
> > with 5.1, the system totally hangs shortly thereafter.  I need to try
> > without etnaviv loaded at all.
> 
> Well, it seems to get worse.  I just tried to unload etnaviv, and was
> greeted by this oops.  It's another regression; etnaviv used to unload
> perfectly fine.  Please can you add module unload testing to your
> workflow?

As you can see from the patch I've just sent, this is a missing error
cleanup. So it's really the same regression. A module unload after
successful init of all GPU cores doesn't show this crash. The issue is
only unmasked due to the CMA placement regression.
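
To illustrate what "missing error cleanup" can mean in general terms, here is a minimal Python sketch, not the actual patch or driver code. The class CmdbufSuballoc and its map/destroy methods are hypothetical stand-ins; the point is only that a teardown path has to tolerate state that was never set up because init failed early.

```python
class CmdbufSuballoc:
    """Hypothetical stand-in for a sub-allocator that may be only partly set up."""

    def __init__(self):
        self.mapping = None    # stays None if GPU init aborted before mapping
        self.released = False

    def map(self):
        # Arbitrary illustrative value, not a real address from the thread.
        self.mapping = {"iova": 0x1000}

    def destroy(self):
        # Guard: after a failed init there is no mapping to release,
        # so skip the unmap step instead of touching missing state.
        if self.mapping is not None:
            self.mapping = None
        self.released = True
```

Calling destroy() on an instance whose map() was never called completes without error in this model, which mirrors the unload-after-failed-init case discussed above.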

Regards,
Lucas

Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window

2019-06-27 Thread Russell King - ARM Linux admin
On Thu, Jun 27, 2019 at 11:04:17AM +0100, Russell King - ARM Linux admin wrote:
> On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> > On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > > While updating my various systems for the TCP SACK issue, I notice
> > > that while most platforms are happy, the Cubox-i4 is not.  During
> > > boot, we get:
> > > 
> > > [0.00] cma: Reserved 256 MiB at 0x3000
> > > ...
> > > [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 
> > > video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash 
> > > resume=/dev/sda1
> > > [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 
> > > bytes)
> > > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 
> > > bytes)
> > > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 
> > > 693K rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K 
> > > cma-reserved, 1310720K highmem)
> > > ...
> > > [   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid 
> > > memory window
> > > [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid 
> > > memory window
> > 
> > Yes, that's a regression due to different default CMA area placement
> > and etnaviv not being smart enough to move the linear window to the
> > right offset.
> 
> As it's a user visible regression, it needs fixing, either by reverting
> the changes that caused it or by some other issue.  In the kernel, the
> policy is "if a bug fix causes a regression, the bug fix was itself
> wrong".  We don't fix one person's bug if it causes a regression for
> someone else.
> 
> Please resolve the acknowledged regression.
> 
> > > and shortly after the login prompt appears, the entire SoC appears to
> > > lock up - it becomes unresponsive on the network, or via serial console
> > > to sysrq requests.
> > > 
> > > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > > as a result of the above two etnaviv errors when Xorg attempts to start
> > > using the GPU.
> > 
> > This should not be possible. The driver notices that the command buffer
> > isn't accessible to the GPU, which aborts the GPU init. While the
> > etnaviv DRM device is still accessible, it will not expose any
> > enumerable GPU cores to userspace. So there is no way for userspace to
> > actually submit GPU commands.
> 
> Yep, I came to that conclusion.  Nevertheless, if I allow Xorg to start
> with 5.1, the system totally hangs shortly thereafter.  I need to try
> without etnaviv loaded at all.

Well, it seems to get worse.  I just tried to unload etnaviv, and was
greeted by this oops.  It's another regression; etnaviv used to unload
perfectly fine.  Please can you add module unload testing to your
workflow?

Unable to handle kernel NULL pointer dereference at virtual address 0008
pgd = da59c000
[0008] *pgd=8fc0f831
Internal error: Oops: 17 [#1] SMP ARM
Modules linked in: ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 
xt_tcpudp xt_owner xt_multiport iptable_filter ip_tables x_tables bnep rfcomm 
bluetooth
ecdh_generic nfsd rc_cec snd_soc_fsl_spdif nvmem_imx_ocotp imx_pcm_dma imx_sdma
virt_dma coda v4l2_mem2mem imx_vdoa dw_hdmi_ahb_audio dw_hdmi_cec 
videobuf2_dma_contig etnaviv(-) gpu_sched imx_thermal snd_soc_imx_spdif 
imx6q_cpufreq caamrng
caam_jr caam error
CPU: 1 PID: 2898 Comm: rmmod Not tainted 5.1.0+ #319
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
PC is at etnaviv_iommu_put_suballoc_va+0x10/0x68 [etnaviv]
LR is at etnaviv_cmdbuf_suballoc_destroy+0x20/0x48 [etnaviv]
pc : []lr : []psr: a00f0013
sp : d9f2be40  ip : 01b0  fp : 
r10: 0081  r9 : d9f2a000  r8 : c00091c4
r7 : dc993800  r6 :   r5 : dd4c6810  r4 : 
r3 : b00c  r2 : 0004  r1 : dd4c6810  r0 : dc991840
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 2a59c04a  DAC: 0051
Process rmmod (pid: 2898, stack limit = 0xd9f2a218)
Stack: (0xd9f2be40 to 0xd9f2c000)
be40:   dd4c6800 dd5e9b40  bf04a664 dd5e9b40 
be60: dc991840 bf04e4d0 bf04e458 dd5e93c0 dd5e9b40 c04aa2e0 0018 dc993800
be80: c00091c4 dd5e9b40 0001 c04aa3b4  dc993800 dd0f9410 dd5a4000
bea0:  bf04a97c dd5e9b40 dd0f9410 bf05295c c04aa9bc dd5e9b40 c04aaf6c
bec0: dd0f9410  bf055260 bf04a950 bf04a93c c04b1f00 c04b1edc dd0f9410
bee0:  c04b0798 c0c493a8 de8af44c dd0f9410 c0c493a8 c0c49408 c04af450
bf00: dd0f9444 dd0f9410 000120a8 c04ac02c c0bf5f44 bec80600 d9f2bf30 c142e46c
bf20: dd0f9400 dd0f9400 000120a8 0081 c00091c4 c04b2718 bf058390 dd0f9400
bf40: bec80600 c04b2790 bf056140 bf0528c4 bf0528b4 c00d6710 d9f2bf80 616e7465
bf60: 00766976 ddf7b4d8 b6ef5000  0001 c0196490 0001 
bf80: d9f2bf80 d9f2bf80 0095d008   005b bec805f4 0880
bfa0: bec80600 c0009000 0880 

Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window

2019-06-27 Thread Russell King - ARM Linux admin
On Thu, Jun 27, 2019 at 11:20:15AM +0200, Lucas Stach wrote:
> On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> > While updating my various systems for the TCP SACK issue, I notice
> > that while most platforms are happy, the Cubox-i4 is not.  During
> > boot, we get:
> > 
> > [0.00] cma: Reserved 256 MiB at 0x3000
> > ...
> > [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 
> > video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash 
> > resume=/dev/sda1
> > [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 
> > bytes)
> > [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 
> > bytes)
> > [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K 
> > rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K 
> > cma-reserved, 1310720K highmem)
> > ...
> > [   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory 
> > window
> > [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory 
> > window
> 
> Yes, that's a regression due to different default CMA area placement
> and etnaviv not being smart enough to move the linear window to the
> right offset.

As it's a user visible regression, it needs fixing, either by reverting
the changes that caused it or by some other means.  In the kernel, the
policy is "if a bug fix causes a regression, the bug fix was itself
wrong".  We don't fix one person's bug if it causes a regression for
someone else.

Please resolve the acknowledged regression.

> > and shortly after the login prompt appears, the entire SoC appears to
> > lock up - it becomes unresponsive on the network, or via serial console
> > to sysrq requests.
> > 
> > I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> > as a result of the above two etnaviv errors when Xorg attempts to start
> > using the GPU.
> 
> This should not be possible. The driver notices that the command buffer
> isn't accessible to the GPU, which aborts the GPU init. While the
> etnaviv DRM device is still accessible, it will not expose any
> enumerable GPU cores to userspace. So there is no way for userspace to
> actually submit GPU commands.

Yep, I came to that conclusion.  Nevertheless, if I allow Xorg to start
with 5.1, the system totally hangs shortly thereafter.  I need to try
without etnaviv loaded at all.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

Re: [REGRESSION] drm/etnaviv: command buffer outside valid memory window

2019-06-27 Thread Lucas Stach
On Saturday, 22.06.2019 at 17:16 +0100, Russell King - ARM Linux admin wrote:
> While updating my various systems for the TCP SACK issue, I notice
> that while most platforms are happy, the Cubox-i4 is not.  During
> boot, we get:
> 
> [0.00] cma: Reserved 256 MiB at 0x3000
> ...
> [0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 
> video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash 
> resume=/dev/sda1
> [0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 
> bytes)
> [0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
> [0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K 
> rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K 
> cma-reserved, 1310720K highmem)
> ...
> [   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory 
> window
> [   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory 
> window

Yes, that's a regression due to different default CMA area placement
and etnaviv not being smart enough to move the linear window to the
right offset.
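
A rough way to picture this: the GPU reaches memory through a fixed-size linear window, and the command buffer has to land inside it. The sketch below is a minimal Python model, not driver code. The 2 GiB window size, the helper names in_linear_window and window_base_for, and every address used with them are assumptions made for illustration.

```python
# Illustrative model only: the window size is an assumption for this
# example, not a value taken from this thread.
LINEAR_WINDOW_SIZE = 2 * 1024**3  # assumed 2 GiB linear window


def in_linear_window(addr: int, size: int, window_base: int) -> bool:
    """Return True if [addr, addr + size) lies inside the linear window."""
    window_end = window_base + LINEAR_WINDOW_SIZE
    return window_base <= addr and addr + size <= window_end


def window_base_for(region_base: int, region_size: int, ram_base: int) -> int:
    """Pick a window base that covers the given region.

    Hypothetical helper: keeps the window at the RAM base when the region
    already fits, otherwise moves it up so the region ends at the window end.
    """
    if in_linear_window(region_base, region_size, ram_base):
        return ram_base
    return region_base + region_size - LINEAR_WINDOW_SIZE
```

With an assumed window base of 0x80000000, in_linear_window(0x30000000, 4096, 0x80000000) is false, while a buffer at 0x88000000 passes. Moving the window base, as window_base_for does, is the kind of adjustment "move the linear window to the right offset" refers to.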

Patches to fix this (which have rightfully been shot down, due to
layering violations) are "[PATCH 1/2] mm: cma: export functions to get
CMA base and size" and "[PATCH 2/2] drm/etnaviv: use CMA area to
compute linear window offset if possible".

> and shortly after the login prompt appears, the entire SoC appears to
> lock up - it becomes unresponsive on the network, or via serial console
> to sysrq requests.
> 
> I suspect the GPU ends up scribbling over the CPU's vector page/kernel
> as a result of the above two etnaviv errors when Xorg attempts to start
> using the GPU.

This should not be possible. The driver notices that the command buffer
isn't accessible to the GPU, which aborts the GPU init. While the
etnaviv DRM device is still accessible, it will not expose any
enumerable GPU cores to userspace. So there is no way for userspace to
actually submit GPU commands.
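
As a simplified model of that behaviour (a sketch under assumptions, not the driver's logic): a core whose command buffer is unreachable has its init aborted and is never added to the set userspace can enumerate. The function usable_cores, the core names and the addresses are all hypothetical.

```python
def usable_cores(cores, window_base, window_size):
    """Return names of cores whose command buffer the GPU can reach.

    `cores` maps a core name to its (hypothetical) command buffer address.
    """
    exposed = []
    for name, cmdbuf_addr in cores.items():
        if not (window_base <= cmdbuf_addr < window_base + window_size):
            # Init is aborted for this core, so it is never exposed.
            continue
        exposed.append(name)
    return exposed
```

If every core fails the check, the returned list is empty, so in this model there is nothing for userspace to submit commands to.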

Regards,
Lucas

[REGRESSION] drm/etnaviv: command buffer outside valid memory window

2019-06-22 Thread Russell King - ARM Linux admin
While updating my various systems for the TCP SACK issue, I notice
that while most platforms are happy, the Cubox-i4 is not.  During
boot, we get:

[0.00] cma: Reserved 256 MiB at 0x3000
...
[0.00] Kernel command line: console=ttymxc0,115200n8 console=tty1 
video=mxcfb0:dev=hdmi root=/dev/nfs rw cma=256M ahci_imx.hotplug=1 splash 
resume=/dev/sda1
[0.00] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.00] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.00] Memory: 1790972K/2097152K available (8471K kernel code, 693K 
rwdata, 2844K rodata, 500K init, 8062K bss, 44036K reserved, 262144K 
cma-reserved, 1310720K highmem)
...
[   13.101098] etnaviv-gpu 13.gpu: command buffer outside valid memory 
window
[   13.171963] etnaviv-gpu 134000.gpu: command buffer outside valid memory 
window

and shortly after the login prompt appears, the entire SoC appears to
lock up - it becomes unresponsive on the network, or via serial console
to sysrq requests.

I suspect the GPU ends up scribbling over the CPU's vector page/kernel
as a result of the above two etnaviv errors when Xorg attempts to start
using the GPU.

This used to work, so it's a regression.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up