Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
We haven't had a chance to look yet.

Alex

From: Luís Mendes
Sent: Wednesday, February 7, 2018 10:50:48 AM
To: Koenig, Christian
Cc: Alex Deucher; Deucher, Alexander; Zhou, David(ChunMing); Michel Dänzer; amd-gfx@lists.freedesktop.org
Subject: Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

Hi Christian, Alexander,

Kmemleak reported leaked data structures and the GPU hung a bit after. Could this be caused by DC? Info in attachments.

I'm not sure if my previous email got overlooked, or if there are simply no suggestions at this moment. Sorry for more or less re-sending the email.

Regards,
Luís
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Christian, Alexander,

Kmemleak reported leaked data structures and the GPU hung a bit after. Could this be caused by DC? Info in attachments.

I'm not sure if my previous email got overlooked, or if there are simply no suggestions at this moment. Sorry for more or less re-sending the email.

Regards,
Luís

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi everyone,

I have some updates. I left the system idle most of the time during the weekend, and from time to time I played a video on YouTube and turned off the screen. Last night I did the same, and when I checked the system this morning it had hung during the night. This time it took a lot longer to hang, but I think it was related to a Flash animation ad that was only present on the YouTube page the last time I switched off the screen. From all the crash attempts I have made, amdgpu always seems to hang when that Flash animation is present.

There is a memory leak according to kmemleak; I attach its report along with the crash dmesg log.

The kernel and patches are the same as in my previous email. I ended up changing neither the Mesa version nor the kernel version and patches.

Regards,
Luís

ubuntu@linux:~$ sudo cat /sys/kernel/debug/kmemleak
[sudo] password for ubuntu:
unreferenced object 0xb0fac380 (size 128):
  comm "Xorg", pid 3750, jiffies 5608934 (age 178088.970s)
  hex dump (first 32 bytes):
    00 4e 9f b9 00 f0 33 bb 80 1a 15 97 00 00 00 00  .N3.
    fa 00 00 00 82 01 00 00 80 00 00 00 80 00 00 00
  backtrace:
    [<400a53a4>] kmem_cache_
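A kmemleak report like the one above can grow long, so it may help to group the leaked objects by the task that allocated them. The following is a minimal Python sketch and not part of the original thread. It assumes the record layout visible in the excerpt (an `unreferenced object ... (size N):` header followed by a `comm "...", pid ...` line); the function name `summarize_kmemleak` is made up for illustration.

```python
import re
from collections import defaultdict

# Record header, e.g.: unreferenced object 0xb0fac380 (size 128):
OBJ_RE = re.compile(r"unreferenced object (0x[0-9a-fA-F]+) \(size (\d+)\):")
# Owner line, e.g.: comm "Xorg", pid 3750, jiffies 5608934 (age 178088.970s)
COMM_RE = re.compile(r'comm "([^"]+)", pid (\d+)')


def summarize_kmemleak(report):
    """Return {comm: (object_count, total_bytes)} for a kmemleak report."""
    totals = defaultdict(lambda: [0, 0])
    pending_size = None  # size of the record whose comm line we still expect
    for line in report.splitlines():
        obj = OBJ_RE.search(line)
        if obj:
            pending_size = int(obj.group(2))
            continue
        comm = COMM_RE.search(line)
        if comm and pending_size is not None:
            entry = totals[comm.group(1)]
            entry[0] += 1
            entry[1] += pending_size
            pending_size = None
    return {name: tuple(values) for name, values in totals.items()}


SAMPLE = (
    "unreferenced object 0xb0fac380 (size 128):\n"
    '  comm "Xorg", pid 3750, jiffies 5608934 (age 178088.970s)\n'
)
print(summarize_kmemleak(SAMPLE))
```

Feeding it the full contents of `/sys/kernel/debug/kmemleak` gives a quick view of which process accounts for most of the suspected leaks, which is a starting point rather than a diagnosis.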
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Christian, Alexander,

I have enabled kmemleak, but it didn't detect anything special. In fact, this time, I don't know why, I didn't get any allocation failure at all, but the GPU did hang after around 4h 6m of uptime with Xorg. The log is attached. I will try again to see if the allocation failure reappears, or if it has become less apparent due to kmemleak scans.

The kernel stack trace is similar to the GPU hangs I was getting on earlier kernel versions with Kodi or Firefox when watching videos with either one, but if I left Xorg idle, it would remain up and available without hanging for more than one day. This stack trace also looks quite similar to what Daniel Andersson reported in "[BUG] Intermittent hang/deadlock when opening browser tab with Vega gpu"; it looks like another demonstration of the same bug on different architectures.

Regards,
Luís

Feb 2 16:29:29 localhost kernel: [14801.740467] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=831006, last emitted seq=831008
Feb 2 16:29:29 localhost kernel: [14801.751557] [drm] IP block:gmc_v8_0 is hung!
Feb 2 16:29:29 localhost kernel: [14801.751563] [drm] IP block:gfx_v8_0 is hung!
Feb 2 16:29:29 localhost kernel: [14801.751611] [drm] GPU recovery disabled.
Feb 2 16:44:53 localhost kernel: [15725.856181] INFO: task amdgpu_cs:0:3803 blocked for more than 120 seconds.
Feb 2 16:44:53 localhost kernel: [15725.863085] Not tainted 4.15.0-rc8-next2g-g9ab2894-dirty #3
Feb 2 16:44:53 localhost kernel: [15725.869213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 2 16:44:53 localhost kernel: [15725.877078] amdgpu_cs:0 D0 3803 3091 0x
Feb 2 16:44:53 localhost kernel: [15725.877084] Backtrace:
Feb 2 16:44:53 localhost kernel: [15725.877096] [<80b571c8>] (__schedule) from [<80b578cc>] (schedule+0x44/0xa4)
Feb 2 16:44:53 localhost kernel: [15725.877102] r10:600f0013 r9:b45b6000 r8:b45b7bd4 r7: r6:7fff r5:81004c48
Feb 2 16:44:53 localhost kernel: [15725.877104] r4:e000
Feb 2 16:44:53 localhost kernel: [15725.877110] [<80b57888>] (schedule) from [<80b5b4f0>] (schedule_timeout+0x1e0/0x2e8)
Feb 2 16:44:53 localhost kernel: [15725.877112] r5:81004c48 r4:7fff
F
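The `amdgpu_job_timedout` message in the log above carries two fence sequence numbers. Their difference is the number of submissions the ring had emitted but not yet signaled when the timeout fired (2 in this log). Below is a small Python sketch for extracting these numbers from a saved dmesg or syslog file. The regular expression is based only on the message format shown in this thread, it assumes each message sits unwrapped on one line, and the name `find_ring_timeouts` is hypothetical.

```python
import re

# Matches e.g.: ring gfx timeout, last signaled seq=831006, last emitted seq=831008
TIMEOUT_RE = re.compile(
    r"ring (\S+) timeout, last signaled seq=(\d+), last emitted seq=(\d+)"
)


def find_ring_timeouts(log_text):
    """Return a list of (ring, signaled, emitted, outstanding) tuples."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        ring = match.group(1)
        signaled = int(match.group(2))
        emitted = int(match.group(3))
        # Fences emitted to the ring that had not signaled at timeout.
        results.append((ring, signaled, emitted, emitted - signaled))
    return results
```

Comparing the outstanding count and the ring name across several hangs can show whether the failures share the same pattern.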
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Luis,

Please enable kmemleak in your build and watch out for any suspicious messages in the system log.

Regards,
Christian.
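Enabling kmemleak means building the kernel with `CONFIG_DEBUG_KMEMLEAK=y`; the report is then readable from `/sys/kernel/debug/kmemleak` once debugfs is mounted. As a rough Python sketch (the helper name is an assumption, and it expects you to pass in the text of the kernel `.config`), this checks whether a given configuration enables it:

```python
def kmemleak_enabled(config_text):
    """Return True if the kernel config text sets CONFIG_DEBUG_KMEMLEAK=y."""
    for raw_line in config_text.splitlines():
        # A disabled option appears as '# CONFIG_DEBUG_KMEMLEAK is not set',
        # so only an exact '=y' assignment counts as enabled.
        if raw_line.strip() == "CONFIG_DEBUG_KMEMLEAK=y":
            return True
    return False
```

Where the config text comes from depends on the system; for a self-built kernel it is the `.config` in the build tree. With kmemleak active, writing `scan` to `/sys/kernel/debug/kmemleak` as root triggers an immediate scan instead of waiting for the periodic one.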
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Alexander,

I didn't notice improvements on this issue with that particular patch applied. It still ends up failing to allocate kernel memory after a few hours of uptime with Xorg.

I will try to upgrade to mesa 18.0.0-rc3 and to amd-staging-drm-next head, to see if the issue still occurs with those versions.

If you have additional suggestions I'll be happy to try them.

Regards,
Luís Mendes
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
On Wed, Jan 31, 2018 at 6:57 PM, Luís Mendes wrote:
> Hi everyone,
>
> I am getting a new issue with amdgpu with RX460, that is, now I can
> play any videos with Kodi or play web videos with firefox and run
> OpenGL applications without running into any issues, however after
> some uptime with XOrg even when almost inactive I get a kmalloc
> allocation failure, normally followed by a GPU hang a while after the
> allocation failure.
> I had a terminal window under Ubuntu Mate 17.10 and I was compiling
> code when I got the kernel messages that can be found in attachment.
>
> I am using the kernel as identified on my previous email, which can be
> found below.

Does this patch help?
https://patchwork.freedesktop.org/patch/198258/

Alex
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi everyone,

I am getting a new issue with amdgpu with RX460, that is, now I can play any videos with Kodi or play web videos with firefox and run OpenGL applications without running into any issues, however after some uptime with XOrg even when almost inactive I get a kmalloc allocation failure, normally followed by a GPU hang a while after the allocation failure.
I had a terminal window under Ubuntu Mate 17.10 and I was compiling code when I got the kernel messages that can be found in attachment.

I am using the kernel as identified on my previous email, which can be found below.

Regards,
Luís Mendes

On Wed, Jan 31, 2018 at 12:47 PM, Luís Mendes wrote:
> Hi Alexander,
>
> I've cherry picked the patch you pointed out into kernel from
> amd-drm-next-4.17-wip at commit
> 9ab2894122275a6d636bb2654a157e88a0f7b9e2 ( drm/amdgpu: set
> DRIVER_ATOMIC flag early) and tested it on ARMv7l and the problem has
> gone indeed.
>
> Working great on ARMv7l with AMD RX460.
>
> Thanks,
> Luís Mendes
>
> On Tue, Jan 30, 2018 at 6:44 PM, Deucher, Alexander wrote:
>> Fixed with this patch:
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2018-January/018472.html
>>
>> Alex

Jan 31 21:56:11 localhost kernel: [ 4091.449841] Xorg: page allocation failure: order:5, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
Jan 31 21:56:11 localhost kernel: [ 4091.449845] Xorg cpuset=/ mems_allowed=0
Jan 31 21:56:11 localhost kernel: [ 4091.449855] CPU: 0 PID: 3810 Comm: Xorg Not tainted 4.15.0-rc8-next2g-g9ab2894-dirty #1
Jan 31 21:56:11 localhost kernel: [ 4091.449857] Hardware name: Marvell Armada 380/385 (Device Tree)
Jan 31 21:56:11 localhost kernel: [ 4091.449859] Backtrace:
Jan 31 21:56:11 localhost kernel: [ 4091.449870] [<8010dca8>] (dump_backtrace) from [<8010dfa4>] (show_stack+0x18/0x1c)
Jan 31 21:56:11 localhost kernel: [ 4091.449875] r7:e000 r6:60070013 r5: r4:8108d150
Jan 31 21:56:11 localhost kernel: [ 4091.449883] [<8010df8c>] (show_stack) from [<80b3ef04>] (dump_stack+0x94/0xa8)
Jan 31 21:56:11 localhost kernel: [ 4091.449891] [<80b3ee70>] (dump_stack) from [<8021fbe8>] (warn_alloc+0xc4/0x15c)
Jan 31 21:56:11 localhost kernel: [ 4091.449895] r7:e000 r6:80d53610 r5: r4:81004c48
Jan 31 21:56:11 localhost kernel: [ 4091.449900] [<8021fb28>] (warn_alloc) from [<80220b0c>] (__alloc_pages_nodemask+0xde4/0xf54)
Jan 31 21:56:11 localhost kernel: [ 4091.449902] r3:0005 r2:80d53610
Jan 31 21:56:11 localhost kernel: [ 4091.449905] r7:0032 r6:0140c0c0 r5:0040 r4:
Jan 31 21:56:11 localhost kernel: [ 4091.449911] [<8021fd28>] (__alloc_pages_nodemask) from [<80242164>] (kmalloc_order+0x20/0x38)
Jan 31 21:56:11 localhost kernel: [ 4091.449915] r10:bcb48000 r9:a766ca00 r8:0005 r7:7f2f37f4 r6:014080c0 r5:00018018
Jan 31 21:56:11 localhost kernel: [ 4091.449917] r4:bc83d02c
Jan 31 21:56:11 localhost kernel: [ 4091.449922] [<80242144>] (kmalloc_order) from [<802421a0>] (kmalloc_order_trace+0x24/0xc8)
Jan 31 21:56:11 localhost kernel: [ 4091.450184] [<8024217c>] (kmalloc_order_trace) from [<7f2f37f4>] (dc_create_gamma+0x24/0x34 [amdgpu])
Jan 31 21:56:11 localhost kernel: [ 4091.450189] r10:bcb48000 r9:a766ca00 r8: r7:0001 r6:be4f0c00 r5:b9de1448
Jan 31 21:56:11 localhost kernel: [ 4091.450191] r4:bc83d02c
Jan 31 21:56:11 localhost kernel: [ 4091.450474] [<7f2f37d0>] (dc_create_gamma [amdgpu]) from [<7f29d8a8>] (amdgpu_dm_atomic_check+0x67c/0xc6c [amdgpu])
Jan 31 21:56:11 localhost kernel: [ 4091.450658] [<7f29d22c>] (amdgpu_dm_atomic_check [amdgpu]) from [<7f0af238>] (drm_atomic_check_only+0x3bc/0x5c4 [drm])
Jan 31 21:56:11 localhost kernel: [ 4091.450663] r10:0800 r9:7fff r8:81004c48 r7:b4718e80 r6:b9cd1380 r5:0001
Jan 31 21:56:11 localhost kernel: [ 4091.450664] r4:
Jan 31 21:56:11 localhost kernel: [ 4091.450711] [<7f0aee7c>] (drm_atomic_check_only [drm]) from [<7f0af458>] (drm_atomic_commit+0x18/0x60 [drm])
Jan 31 21:56:11 localhost kernel: [ 4091.450715] r10:0800 r9:bbc4c000 r8:bc83d000 r7:b9cd1380 r6:be1b5000 r5:b9cd1380
Jan 31 21:56:11 localhost kernel: [ 4091.450717] r4:0001
Jan 31 21:56:11 localhost kernel: [ 4091.450763] [<7f0af440>] (drm_atomic_commit [drm]) from [<7f1260e8>] (drm_atomic_helper_legacy_gamma_set+0x110/0x160 [drm_kms_helper])
Jan 31 21:56:11 localhost kernel: [ 4091.450767] r7:b9cd1380 r6:bbc4b9fe r5:a766d200 r4:0001
Jan 31 21:56:11 localhost kernel: [ 4091.450799] [<7f125fd8>] (drm_atomic_helper_legacy_gamma_set [drm_kms_helper]) from [<7f0b951c>] (drm_mode_gamma_set_ioctl+0x1c4/0x2c0 [drm])
Jan 31 21:56:11 localhost kernel: [ 4091.450804] r10:e000 r9:bbc4ba00 r8:bbc4b800 r7:bbc4c034 r6:b1911e2c r5:b1911d80
Jan 31 21:56:11 localhost kernel: [ 4091.450806] r4:7f125fd8 r3:bbc4bc00
Jan 31 21:56:11 localhost kernel: [ 4091.450848] [<7f0b9358>] (drm_mode_gamma_set_ioctl [drm]) from [<7f09d920>] (drm_ioctl_kernel+0x68/0xb4 [drm
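The `order:5` in the first line of this log means the page allocator was asked for 2^5 physically contiguous pages. Assuming 4 KiB pages (common on ARMv7, but an assumption here because the log does not state the page size), that is 128 KiB in one piece, which can become hard to find once memory has fragmented after some uptime. The arithmetic as a short Python sketch, with illustrative helper names:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; the log does not state this


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of a contiguous block of 2**order pages."""
    return page_size << order


def bytes_to_order(size, page_size=PAGE_SIZE):
    """Smallest order whose block is large enough to hold `size` bytes."""
    order = 0
    while (page_size << order) < size:
        order += 1
    return order


print(order_to_bytes(5))  # 32 pages of 4096 bytes
```

Any request larger than 64 KiB and up to 128 KiB rounds up to order 5 under this page size, so a single large structure allocated with kmalloc is enough to produce a failure like the one logged.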
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Fixed with this patch:
https://lists.freedesktop.org/archives/amd-gfx/2018-January/018472.html

Alex

From: Luís Mendes
Sent: Tuesday, January 30, 2018 1:30 PM
To: Michel Dänzer; Koenig, Christian
Cc: Deucher, Alexander; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

Hi everyone,

I've tested the kernel from amd-drm-next-4.17-wip at commit 9ab2894122275a6d636bb2654a157e88a0f7b9e2 (drm/amdgpu: set DRIVER_ATOMIC flag early) on ARMv7l, and the reported issues now seem to be gone. I haven't checked which commit fixed this, but it is now fixed! I also noticed a performance improvement in one of the glmark2 tests.

There seem to be some other small issues, possibly unrelated: sometimes, while playing a video, the screen goes black and the sound stops for a second or less, and then normal playback recovers. This happens rarely, at most once per power cycle, while using X and Kodi, even though I have played many individual videos and power cycled the machine a few times.

I've also observed what was already reported, when watching non-VP9 videos:

[ 591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.740255] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.750968] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.761628] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.782672] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.793172] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.803681] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.814129] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.824560] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.835054] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.845437] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.855860] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.866415] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.876945] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.887454] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!

Regards,
Luís Mendes

On Wed, Jan 3, 2018 at 11:08 PM, Luís Mendes wrote:
> Hi Michel, Christian,
>
> Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
> only init the apertures used by KGD (v2)" -
> 0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
> on ARMv7 and on x86 amd64.
>
> Christian, in fact if I replay the apitraces obtained on the ARMv7
> platform on the AMD64 I am also able to reproduce the GPU hang! So it
> is not ARM platform specific. Should I send/upload the apitraces? I
> have two of them, typically when one doesn't hang the gpu the other
> hangs. One takes about 1GB of disk space while the other takes 2.3GB.
> ...
> [ 69.019381] ISO 9660 Extensions: RRIP_1991A
> [ 213.292094] DMAR: DRHD: handling fault status reg 2
> [ 213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index
> 1c [fault reason 38] Blocked an interrupt request due to source-id
> verification failure
> [ 223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
> timeout, last signaled seq=25158, last emitted seq=25160
> [ 223.406926] [drm] IP block:tonga_ih is hung!
> [ 223.407167] [drm] GPU recovery disabled.
>
> Regards,
> Luís
>
> On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes wrote:
>> Hi Michel, Christian,
>>
>> Christian, I have followed your suggestion and I have just submitted a
>> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
>> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
>> platforms while playing video.
>>
>> Michel, amdgpu.dc=0 seems to make no difference. I will try
>> amd-staging-drm-next and report back.
>>
>> Regards,
>> Luís
>>
>> On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi everyone,

I've tested the kernel from amd-drm-next-4.17-wip at commit 9ab2894122275a6d636bb2654a157e88a0f7b9e2 (drm/amdgpu: set DRIVER_ATOMIC flag early) on ARMv7l, and the reported issues now seem to be gone. I haven't checked which commit fixed them, but they are fixed! I also noticed a performance improvement in one of the glmark2 tests.

There seem to be some other small issues, possibly unrelated: sometimes the screen goes black and the sound stops for a second or less while playing a video, and then normal playback recovers. This happens rarely, at most once per power cycle, while using X and Kodi, even though I have played many individual videos and power cycled the machine a few times.

I've also observed what was already reported, when watching non-VP9 videos:
[ 591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.740255] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.750968] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.761628] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.782672] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.793172] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.803681] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.814129] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.824560] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.835054] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.845437] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.855860] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.866415] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.876945] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.887454] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!

Regards,
Luís Mendes

On Wed, Jan 3, 2018 at 11:08 PM, Luís Mendes wrote:
> Hi Michel, Christian,
>
> Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
> only init the apertures used by KGD (v2)" -
> 0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
> on ARMv7 and on x86 amd64.
>
> Christian, in fact if I replay the apitraces obtained on the ARMv7
> platform on the AMD64 I am also able to reproduce the GPU hang! So it
> is not ARM platform specific. Should I send/upload the apitraces? I
> have two of them, typically when one doesn't hang the gpu the other
> hangs. One takes about 1GB of disk space while the other takes 2.3GB.
>
> ...
> [ 69.019381] ISO 9660 Extensions: RRIP_1991A
> [ 213.292094] DMAR: DRHD: handling fault status reg 2
> [ 213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index 1c [fault reason 38] Blocked an interrupt request due to source-id verification failure
> [ 223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=25158, last emitted seq=25160
> [ 223.406926] [drm] IP block:tonga_ih is hung!
> [ 223.407167] [drm] GPU recovery disabled.
>
> Regards,
> Luís
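A long run of identical ring errors like the one in this message is easier to compare between kernels when it is reduced to a count per reporting function. The following is only a minimal sketch, assuming the dmesg text has been saved as a string; the function name `summarize_ring_errors` and the shortened sample are illustrative and not part of amdgpu or any existing tool.

```python
import re
from collections import Counter

# Matches dmesg lines of the form quoted in the message, for example:
# [ 591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
RING_ERROR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* "
    r"amdgpu: writing more dwords to the ring than expected!"
)


def summarize_ring_errors(dmesg_text):
    """Return (count per reporting function, first timestamp, last timestamp)."""
    counts = Counter()
    stamps = []
    for match in RING_ERROR.finditer(dmesg_text):
        counts[match.group("func")] += 1
        stamps.append(float(match.group("ts")))
    if not stamps:
        return counts, None, None
    return counts, min(stamps), max(stamps)


# Shortened sample taken from the log lines quoted in the message.
sample = """\
[ 591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[ 591.782672] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
"""

counts, first, last = summarize_ring_errors(sample)
print(dict(counts), first, last)
```

Running the same summary on logs from two kernel builds would show whether the error burst changes in size or moves to a different function.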
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Michel, Christian,

Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9: only init the apertures used by KGD (v2)" - 0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both on ARMv7 and on x86 amd64.

Christian, in fact if I replay the apitraces obtained on the ARMv7 platform on the AMD64 I am also able to reproduce the GPU hang! So it is not ARM platform specific. Should I send/upload the apitraces? I have two of them, typically when one doesn't hang the gpu the other hangs. One takes about 1GB of disk space while the other takes 2.3GB.

...
[ 69.019381] ISO 9660 Extensions: RRIP_1991A
[ 213.292094] DMAR: DRHD: handling fault status reg 2
[ 213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index 1c [fault reason 38] Blocked an interrupt request due to source-id verification failure
[ 223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=25158, last emitted seq=25160
[ 223.406926] [drm] IP block:tonga_ih is hung!
[ 223.407167] [drm] GPU recovery disabled.

Regards,
Luís

On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes wrote:
> Hi Michel, Christian,
>
> Christian, I have followed your suggestion and I have just submitted a
> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
> platforms while playing video.
>
> Michel, amdgpu.dc=0 seems to make no difference. I will try
> amd-staging-drm-next and report back.
>
> Regards,
> Luís
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
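The ring timeout line in this message carries two sequence numbers, and the gap between them gives a rough idea of how much work was still outstanding when the job timed out. The sketch below is an illustration only: it assumes the two values are fence sequence numbers that advance by one per submission, and the helper name `pending_fences` is made up for this example.

```python
import re

# Matches the timeout report quoted in the message:
# "... ring gfx timeout, last signaled seq=25158, last emitted seq=25160"
TIMEOUT = re.compile(
    r"ring (?P<ring>\w+) timeout, last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_fences(line):
    """Return (ring name, emitted minus signaled), or None if the line does not match."""
    match = TIMEOUT.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), emitted - signaled


line = (
    "[ 223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=25158, last emitted seq=25160"
)
result = pending_fences(line)
print(result)
```

Under the stated assumption, the sample line would mean two submissions on the gfx ring had been emitted but not yet signaled at the time of the timeout.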
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Michel, Christian,

Christian, I have followed your suggestion and I have just submitted a bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 - GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7 platforms while playing video.

Michel, amdgpu.dc=0 seems to make no difference. I will try amd-staging-drm-next and report back.

Regards,
Luís

On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer wrote:
> Does the problem also occur if you disable DC with amdgpu.dc=0 on the
> kernel command line?
>
> Does it also happen with a kernel built from the amd-staging-drm-next
> branch instead of drm-next-4.16?
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
On 2018-01-03 12:02 PM, Luís Mendes wrote:
>
> What I believe it seems to be the case is that the GPU lock up only
> happens when doing a page flip, since the kernel locks with:
> [ 243.693200] kworker/u4:3D089 2 0x
> [ 243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
> [ 243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
> [ 243.693259] [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
> [ 243.693270] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
> [ 243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
> [ 243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
> [ 243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
> [ 243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
> ...

Does the problem also occur if you disable DC with amdgpu.dc=0 on the kernel command line?

Does it also happen with a kernel built from the amd-staging-drm-next branch instead of drm-next-4.16?

--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer
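When testing with amdgpu.dc=0 as asked above, it helps to confirm that the running kernel actually received the option before drawing conclusions from the result. This is a small sketch, assuming the command line is available as a string (on Linux it can be read from /proc/cmdline); the function name `dc_disabled` and the sample command line are hypothetical.

```python
def dc_disabled(cmdline):
    """Return True if the exact token amdgpu.dc=0 appears on the kernel command line."""
    return "amdgpu.dc=0" in cmdline.split()


# Hypothetical command line, for illustration only.
example = "console=ttyS0,115200 root=/dev/sda1 rw amdgpu.dc=0"
print(dc_disabled(example))
```

On the test machine the input could come from `open("/proc/cmdline").read()` instead of the sample string.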
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
In this case please open a bug report on fdo and describe exactly how to reproduce it.

Marek should be able to take a look then.

Thanks,
Christian.

On 03.01.2018 at 12:56, Luís Mendes wrote:
> Hi Christian, David,
>
> David, replying to your question... The issue is indeed reproducible on x86,
> I just did it with kodi and the same VP9 video. So it is not arm specific.
>
> Regards,
> Luís
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Christian, David,

David, replying to your question... The issue is indeed reproducible on x86, I just did it with kodi and the same VP9 video. So it is not arm specific.

Regards,
Luís

On Wed, Jan 3, 2018 at 11:02 AM, Luís Mendes wrote:
> Hi Christian,
>
> Replies follow in between.
>
> Regards,
> Luís
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Christian,

Replies follow in between.

Regards,
Luís

On Wed, Jan 3, 2018 at 9:37 AM, Christian König wrote:
> Hi Luis,
>
> In general please add information like /proc/iomem and dmesg as attachment
> and not mangled inside the mail.

Ok, I'll take that into account next time. Sorry for the inconvenience.

> The good news is that your ARM board at least has a memory layout which
> should work in theory. So at least one problem rules out.

Ok, nice.

> I don't think that apitrace would be much helpful in this case as long as no
> developer has access to one of those ARM boards. But it is interesting that
> the apitrace reliable reproduces the issue. This means that it isn't
> something random, but rather a specific timing of things.

I'm afraid I don't currently have boards that I can send. I am developing one, but it will still take some time before I have one ready.

I've checked the apitrace and there is a common call glXSwapBuffers(dpy=0x1389f00, drawable=52428803) that I believe will trigger the page flip. I suspect there is a race condition with glXSwapBuffers in mesa or amdgpu that corrupts some of the data sent to the GPU, causing a hang.

What seems to be the case, I believe, is that the GPU lockup only happens when doing a page flip, since the kernel locks up with:
[ 243.693200] kworker/u4:3D089 2 0x
[ 243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[ 243.693259] [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
[ 243.693270] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
[ 243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
[ 243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[ 243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[ 243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
...

I will try to reproduce this on x86 with a similar software stack and the apitrace traces I got.
What do you think, does this make sense? Do you have further suggestions that may help pin down the problem?

Another strange thing: the traces that were consistently causing hangs yesterday are having a bit more difficulty causing them today, but if I play the video with kodi it hangs easily again. Both kodi and glretrace always hang with similar kernel backtraces, like the one above.
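The claim that kodi and glretrace hang with similar backtraces can be checked mechanically by reducing each backtrace to its chain of function names and comparing the chains. The following is a rough sketch for the ARM-style "(callee) from (caller+offset/size)" lines quoted above; the helper name `call_chain` is an assumption made for this example and is not part of any kernel tooling.

```python
import re

# Matches one frame of the quoted ARM backtrace, for example:
# [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+\((?P<callee>\w+)(?:\s+\[\w+\])?\)\s+from\s+"
    r"\[<[0-9a-f]+>\]\s+\((?P<caller>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[\w+\])?\)"
)


def call_chain(backtrace_text):
    """Return function names from the innermost frame outwards."""
    chain = []
    for match in FRAME.finditer(backtrace_text):
        if not chain:
            # The first frame also names the innermost function.
            chain.append(match.group("callee"))
        chain.append(match.group("caller"))
    return chain


# Last three frames of the backtrace quoted in the message.
sample = """\
[ 243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[ 243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[ 243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
"""

chain = call_chain(sample)
print(" <- ".join(chain))
```

Two hangs could then be treated as the same failure when their chains are equal, ignoring timestamps and addresses.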
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Luis,

In general please add information like /proc/iomem and dmesg as attachments and not mangled inside the mail.

The good news is that your ARM board at least has a memory layout which should work in theory. So at least one problem is ruled out.

I don't think that apitrace would be much help in this case as long as no developer has access to one of those ARM boards. But it is interesting that the apitrace reliably reproduces the issue. This means that it isn't something random, but rather a specific timing of things.

Regards,
Christian.

On 03.01.2018 at 01:36, Luís Mendes wrote:
> Just a small update, regarding to what I have posted...
>
> I've made additional tests with mesa-17.4 at commit "radv: Implement
> binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8 and I was
> able to gather a smaller apitrace of kodi playing a video with about 1GB
> that hangs the GPU, almost always, when replayed with glretrace if
> without the option --singlethread. If option --singlethread is used,
> when doing glretrace, no gpu hang occurs, ever, it seems.
>
> For some reason now I am getting past the lightdm login screen without
> issues, maybe some of the suggested changes improved the behaviour with
> mesa-17.4, however with mesa-17.3.1 I didn't have those issue anyway.
> Now both mesa-17.3.1 and mesa-17.4 behave similarly, blocking while
> playing video with kodi, but is also possible to cause the gpu hang with
> other applications. On the other hand pure openGL application seem to
> work fine... I am able to run glmark2 tests without issues.
>
> How can I send these apitraces?
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Just a small update regarding what I have posted...

I've made additional tests with mesa-17.4 at commit "radv: Implement binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8 and I was able to gather a smaller apitrace, about 1GB, of kodi playing a video. It hangs the GPU almost always when replayed with glretrace without the option --singlethread. If the option --singlethread is used when running glretrace, no GPU hang ever occurs, it seems.

For some reason I am now getting past the lightdm login screen without issues. Maybe some of the suggested changes improved the behaviour with mesa-17.4; however, with mesa-17.3.1 I didn't have those issues anyway. Now both mesa-17.3.1 and mesa-17.4 behave similarly, blocking while playing video with kodi, but it is also possible to cause the GPU hang with other applications. On the other hand, pure OpenGL applications seem to work fine... I am able to run glmark2 tests without issues.

How can I send these apitraces?

On Tue, Jan 2, 2018 at 10:29 PM, Luís Mendes wrote:
> Ok... I've done some of the suggested tests.
>
> I still haven't tested on x86, but I'll get to that.
>
> I've recompiled the kernel to disable Power Management as much as
> possible at all levels, including the PCIe, I've also modified
> /include/drm/drm_cache.h - static inline bool
> drm_arch_can_wc_memory(void) to always return false, but neither
> solved the issue.
>
> When I run kodi under apitrace with mesa 17.3.1 it becomes much more
> difficult to reproduce the crash, there are a lot of missed frames due
> to the CPU overload of apitrace, but I was to able to crash the GPU
> once. The apitrace log has 2.3GB, how should I send it?
> It happened while playing a VP9 encoded webm video file, which is
> decoded by software, as RX 460 is unable to hardware decode this codec
> AFAIK. In fact software decoded videos are more prone to produce the
> GPU hang, while a H265 4K hardware decoded video never causes a GPU
> hang.
I'm affraid I forgot to have kodi to log the execution data when > I did the apitrace. > > > The full dmesg is presented below as well as the /proc/iomem > information and lspci output. > I just want to note that I'm having EDID DDC errors with my TV > screen, because at some point in kernel 4.14 onwards, both the RX460 > as well as the RX550 cards started to corrupt the I2C TV screen EDID > memory, so that I have to reflash the correct EDID data to get the > screen back to its own configuration. This is a rare problem that only > occurs with this TV. All other TVs and monitors that I've tested don't > show this EDID corruption issue. I currently have stopped to reflash > the I2C EDID configuration memory of my TV to avoid exceeding the > memory write cycles endurance, instead I now modify gpu/drm/drm_edid.c > in function drm_do_get_edid() to allow the corrupted EDID to pass and > enter X. So please ignore the EDID error warnings on my dmesg log. The > GPU hangs occur just the same, even when I have the correct EDID, as > it is an unrelated issue. 
> > Regards, > Luís > > iomem shows this: > -3fff : System RAM > 8000-00ef : Kernel code > 0100-010e3913 : Kernel data > d000-efff : PCI MEM > d000-e7ff : PCI Bus :01 > d000-dfff : :01:00.0 > e000-e01f : :01:00.0 > e020-e023 : :01:00.0 > e024-e025 : :01:00.0 > e026-e0263fff : :01:00.1 > e026-e0263fff : ICH HD audio > f1010680-f10106cf : spi@10680 > f1011000-f101101f : i2c@11000 > f1011100-f10f : i2c@11100 > f1012000-f101201f : serial > f1012100-f101211f : serial > f1018000-f101801f : pinctrl@18000 > f1018100-f101813f : gpio > f1018140-f101817f : gpio > f1018454-f1018457 : conf-sdio3 > f10184a0-f10184ab : rtc-soc > f1020704-f1020707 : watchdog@20300 > f1020800-f102080f : cpurst@20800 > f1020a00-f1020ccf : interrupt-controller@20a00 > f1021070-f10210c7 : interrupt-controller@20a00 > f1022000-f1022fff : pmsu@22000 > f103-f1033fff : ethernet@3 > f1034000-f1037fff : ethernet@34000 > f104-f1041fff : pcie@2,0 > f1044000-f1045fff : pcie@3,0 > f1058000-f10584ff : usb@58000 > f107-f1073fff : ethernet@7 > f10a3800-f10a381f : rtc > f10a8000-f10a9fff : sata@a8000 > f10d8000-f10d8fff : sdhci > f10e-f10e1fff : sata@e > f10e4074-f10e4077 : thermal@e8078 > f10e4078-f10e407b : thermal@e8078 > f10f-f10f3fff : usb3@f > f10f8000-f10fbfff : usb3@f8000 > f110-f11007ff : f110.sa-sram0 > f111-f11107ff : f111.sa-sram1 > f120-f12f : f120.bm-bppi > > lspci is like this: > 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. > [AMD/ATI] Baffin [Radeon RX 460] (rev cf) (prog-if 00) > Subsystem: PC Partner Limited / Sapphire Technology Baffin > [Radeon RX 460] > Flags: bus master, fast devsel, latency 0, IRQ 57 > Memory at d000 (64-bit, prefetchable) [size=256M] > Memory at e000 (64-bit, prefetchable) [siz
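The comparison described in the update above, replaying the same trace with and without --singlethread, can be scripted so that both runs use identical arguments apart from that one option. This is a minimal sketch that only builds the two command lines; the trace file name kodi.trace is a placeholder, and the helper `glretrace_command` is invented for this example.

```python
import shlex


def glretrace_command(trace_path, single_thread=False):
    """Build the argument list for replaying an apitrace capture with glretrace."""
    cmd = ["glretrace"]
    if single_thread:
        # The option named in the message; with it, no hang was observed.
        cmd.append("--singlethread")
    cmd.append(trace_path)
    return cmd


# "kodi.trace" is a placeholder file name, not the actual capture.
default_cmd = glretrace_command("kodi.trace")
single_cmd = glretrace_command("kodi.trace", single_thread=True)
for cmd in (default_cmd, single_cmd):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each list could be passed to `subprocess.run` on the test machine, saving dmesg after every replay to record which variant triggers the hang.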
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Ok... I've done some of the suggested tests.

I still haven't tested on x86, but I'll get to that.

I've recompiled the kernel to disable Power Management as much as possible at all levels, including PCIe. I've also modified /include/drm/drm_cache.h - static inline bool drm_arch_can_wc_memory(void) to always return false, but neither change solved the issue.

When I run kodi under apitrace with mesa 17.3.1 it becomes much more difficult to reproduce the crash. There are a lot of missed frames due to the CPU overhead of apitrace, but I was able to crash the GPU once. The apitrace log is 2.3GB, how should I send it?
It happened while playing a VP9 encoded webm video file, which is decoded in software, as the RX 460 is unable to hardware decode this codec AFAIK. In fact software decoded videos are more prone to produce the GPU hang, while an H265 4K hardware decoded video never causes a GPU hang. I'm afraid I forgot to have kodi log the execution data when I did the apitrace.

The full dmesg is presented below as well as the /proc/iomem information and lspci output.
I just want to note that I'm having EDID DDC errors with my TV screen, because at some point from kernel 4.14 onwards, both the RX460 and the RX550 cards started to corrupt the I2C EDID memory of the TV screen, so that I have to reflash the correct EDID data to get the screen back to its own configuration. This is a rare problem that only occurs with this TV. All other TVs and monitors that I've tested don't show this EDID corruption issue. I have currently stopped reflashing the I2C EDID configuration memory of my TV to avoid exceeding its write cycle endurance. Instead I now modify gpu/drm/drm_edid.c in function drm_do_get_edid() to allow the corrupted EDID to pass and enter X. So please ignore the EDID error warnings in my dmesg log. The GPU hangs occur just the same even when I have the correct EDID, as it is an unrelated issue.
Regards,
Luís

iomem shows this:

-3fff : System RAM
8000-00ef : Kernel code
0100-010e3913 : Kernel data
d000-efff : PCI MEM
d000-e7ff : PCI Bus :01
d000-dfff : :01:00.0
e000-e01f : :01:00.0
e020-e023 : :01:00.0
e024-e025 : :01:00.0
e026-e0263fff : :01:00.1
e026-e0263fff : ICH HD audio
f1010680-f10106cf : spi@10680
f1011000-f101101f : i2c@11000
f1011100-f10f : i2c@11100
f1012000-f101201f : serial
f1012100-f101211f : serial
f1018000-f101801f : pinctrl@18000
f1018100-f101813f : gpio
f1018140-f101817f : gpio
f1018454-f1018457 : conf-sdio3
f10184a0-f10184ab : rtc-soc
f1020704-f1020707 : watchdog@20300
f1020800-f102080f : cpurst@20800
f1020a00-f1020ccf : interrupt-controller@20a00
f1021070-f10210c7 : interrupt-controller@20a00
f1022000-f1022fff : pmsu@22000
f103-f1033fff : ethernet@3
f1034000-f1037fff : ethernet@34000
f104-f1041fff : pcie@2,0
f1044000-f1045fff : pcie@3,0
f1058000-f10584ff : usb@58000
f107-f1073fff : ethernet@7
f10a3800-f10a381f : rtc
f10a8000-f10a9fff : sata@a8000
f10d8000-f10d8fff : sdhci
f10e-f10e1fff : sata@e
f10e4074-f10e4077 : thermal@e8078
f10e4078-f10e407b : thermal@e8078
f10f-f10f3fff : usb3@f
f10f8000-f10fbfff : usb3@f8000
f110-f11007ff : f110.sa-sram0
f111-f11107ff : f111.sa-sram1
f120-f12f : f120.bm-bppi

lspci is like this:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460] (rev cf) (prog-if 00)
    Subsystem: PC Partner Limited / Sapphire Technology Baffin [Radeon RX 460]
    Flags: bus master, fast devsel, latency 0, IRQ 57
    Memory at d000 (64-bit, prefetchable) [size=256M]
    Memory at e000 (64-bit, prefetchable) [size=2M]
    I/O ports at 1 [size=256]
    Memory at e020 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at e024 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Legacy Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [200] #15
    Capabilities: [270] #19
    Capabilities: [2b0] Address Translation Service (ATS)
    Capabilities: [2c0] Page Request Interface (PRI)
    Capabilities: [2d0] Process Address Space ID (PASID)
    Capabilities: [320] Latency Tolerance Reporting
    Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [370] L1 PM Substates
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0
    Subsystem: PC Partner Limited / Sapphire Technology Device aae0
    Flags: bus master, fast devsel, lat
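For anyone wanting to cross-check a /proc/iomem dump against the PCI BARs that lspci reports, the following is a minimal Python sketch, not something used in this thread. It parses iomem-style text into ranges and sums the "System RAM" regions. The helper names (parse_iomem, ranges_named) and the SAMPLE text are hypothetical; the sample uses made-up full-width addresses, because the addresses in the archived listing above are truncated and cannot be reconstructed.

```python
import re

# One /proc/iomem line: optional indent, "start-end : name" with hex addresses.
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text):
    """Parse iomem-style text into (depth, start, end, name) tuples.

    depth is the nesting level, assuming two spaces of indent per level.
    """
    entries = []
    for line in text.splitlines():
        match = IOMEM_LINE.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        indent, start, end, name = match.groups()
        entries.append((len(indent) // 2, int(start, 16), int(end, 16), name.strip()))
    return entries


def ranges_named(entries, name):
    """Return the (start, end) address ranges whose label equals name."""
    return [(start, end) for _, start, end, label in entries if label == name]


# Hypothetical sample with illustrative addresses (NOT the values from this thread).
SAMPLE = """\
00000000-3fffffff : System RAM
  00008000-00efffff : Kernel code
d0000000-efffffff : PCI MEM
  d0000000-e7ffffff : PCI Bus 0000:01
"""

entries = parse_iomem(SAMPLE)
ram = ranges_named(entries, "System RAM")
# Ranges are inclusive, hence the +1 when computing the size.
ram_mib = sum(end - start + 1 for start, end in ram) // (1024 * 1024)
print(ram_mib)
```

On a live system the input would be the text of /proc/iomem, read as root so that the addresses are not zeroed out.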
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
> when you refer to API traces, you're suggesting to strace kodi, or what do you mean?

What I meant was apitrace (https://github.com/apitrace/apitrace), but when even the lightdm login screen crashes then this won't be of much help.

That strongly sounds like an ARM-specific problem; maybe USWC doesn't work as it should? See the function drm_arch_can_wc_memory() in the kernel source and check whether it helps if you always return false.

Apart from that, the only other explanation I have is that some system memory isn't accessible to the GPU while other memory is working fine. Please provide the output of "sudo cat /proc/iomem" to double-check that.

Regards,
Christian.

On 02.01.2018 at 14:09, Luís Mendes wrote:

Dear Mr. David, Mr. Christian,

First of all, thanks for your replies!

David, I will try the same software versions on x86 to see if I am able to replicate the problem there, but I suspect it is ARM specific... I'll report back when I have more details.

Christian, I'll collect the data you've referred to and will disable the power management. Regarding the mesa master version, I've tried it, and the problem just gets worse. With the latest mesa, it easily locks up in the lightdm login screen, or when navigating through the Ubuntu Mate menus, or with Kodi. I've tested with mesa commit "radv: Implement binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8.

Just one question... when you refer to API traces, you're suggesting to strace kodi, or what do you mean?

Regards,
Luís

On Tue, Jan 2, 2018 at 2:51 AM, Chunming Zhou wrote:

Did you try it on an x86 board? Is there the same issue?

We should identify whether it is an ARM-specific or a general issue for the amdgpu driver.

Thanks,
David Zhou

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Dear Mr. David, Mr. Christian,

First of all, thanks for your replies!

David, I will try the same software versions on x86 to see if I am able to replicate the problem there, but I suspect it is ARM specific... I'll report back when I have more details.

Christian, I'll collect the data you've referred to and will disable the power management. Regarding the mesa master version, I've tried it, and the problem just gets worse. With the latest mesa, it easily locks up in the lightdm login screen, or when navigating through the Ubuntu Mate menus, or with Kodi. I've tested with mesa commit "radv: Implement binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8.

Just one question... when you refer to API traces, you're suggesting to strace kodi, or what do you mean?

Regards,
Luís

On Tue, Jan 2, 2018 at 2:51 AM, Chunming Zhou wrote:
> Did you try it on an x86 board? Is there the same issue?
>
> We should identify whether it is an ARM-specific or a general issue for the amdgpu driver.
>
> Thanks,
> David Zhou
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Hi Luis,

well, first of all, that isn't a deadlock but just a hardware lockup, so the traces and logs you tried to attach are not really useful.

What we need instead is a full dmesg and/or some kodi logs, an API trace, etc., showing what exactly is happening when the problem occurs.

Please also try Mesa master and try to disable power management as much as possible.

Regards,
Christian.

On 02.01.2018 at 03:51, Chunming Zhou wrote:

Did you try it on an x86 board? Is there the same issue?

We should identify whether it is an ARM-specific or a general issue for the amdgpu driver.

Thanks,
David Zhou

On 2018-01-02 00:32, Luís Mendes wrote:

I am currently testing the amdgpu driver with AMD RX460 and RX550 graphics cards on an ARM Cortex-A9 with 1GB RAM, and I am consistently getting deadlocks when playing videos with Kodi or other applications.

I'm using the Linux kernel from https://cgit.freedesktop.org/~agd5f/linux/, branch drm-next-4.16 at commit "drm/amdgpu: Correct the IB size of bo update mapping" - 104bd2ca1124dfd9aa904d5f5a96253ef2b580f6, along with libdrm-2.4.89 and mesa-17.3.1, on Ubuntu 17.10 with the Mate desktop and the Lightdm session manager over X11.

The deadlocks are sometimes almost immediate, but sometimes they take about half an hour to occur. Some of the video files that I am using for testing are more likely to cause a deadlock than others. I got some kernel crash dumps, kodi process backtraces for the offending thread and the deadlocked process tree listing, which I attach here.

The kernel seems to deadlock during a page flip, indefinitely waiting for the DMA fence to complete; however, the fence never completes and the timeout doesn't expire either, so this may be a GPU lockup.

I can provide more details, if needed, if there is interest or time to look into this.
Regards,
Luís Mendes
Software and Hardware engineer

[ 253.904103] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=43831, last emitted seq=43833
[ 253.915041] [drm] IP block:gmc_v8_0 is hung!
[ 253.915047] [drm] IP block:gfx_v8_0 is hung!
[ 253.915162] [drm] GPU recovery disabled.
[ 366.541614] INFO: task kworker/u4:4:90 blocked for more than 120 seconds.
[ 366.548436] Not tainted 4.15.0-rc4-drmnext2g #1
[ 366.554300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 366.562162] kworker/u4:4 D 0 90 2 0x
[ 366.562196] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 366.562215] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[ 366.562223] [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
[ 366.562233] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
[ 366.562241] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
[ 366.562248] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[ 366.562476] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f2d3988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[ 366.562754] [<7f2d3988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f2d509c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
[ 366.562908] [<7f2d509c>] (amdgpu_dm_atomic_commit_tail [amdgpu]) from [<7f13e58c>] (commit_tail+0x50/0x94 [drm_kms_helper])
[ 366.562931] [<7f13e58c>] (commit_tail [drm_kms_helper]) from [<7f13e5ec>] (commit_work+0x1c/0x20 [drm_kms_helper])
[ 366.562948] [<7f13e5ec>] (commit_work [drm_kms_helper]) from [<8016f4c8>] (process_one_work+0x1a8/0x4ac)
[ 366.562955] [<8016f4c8>] (process_one_work) from [<8017050c>] (worker_thread+0x68/0x598)
[ 366.562962] [<8017050c>] (worker_thread) from [<80175e50>] (kthread+0x16c/0x174)
[ 366.562970] [<80175e50>] (kthread) from [<80109de8>] (ret_from_fork+0x14/0x2c)

From userland side:
(gdb) info thread
  Id   Target Id         Frame
* 1    Thread 0x6eb17c70 (LWP 2071) "kodi.bin" 0x748b2246 in ioctl () at ../sysdeps/unix/syscall-template.S:84
  2    Thread 0x6eb14170 (LWP 2072) "Announce" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    Thread 0x6e1ff170 (LWP 2075) "ActiveAE" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    Thread 0x6d9ff170 (LWP 2076) "AESink" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  5    Thread 0x6b7c9170 (LWP 2081) "amdgpu_cs:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  6    Thread 0x6ae3c170 (LWP 2082) "disk_cache:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  7    Thread 0x571df170 (LWP 2083) "si_shader:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  8    Thread 0x569df170 (LWP 2084) "si_shader_low:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  9    Thread 0x561df170 (LWP 208
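As a reading aid for the "ring gfx timeout" line in the dmesg above, here is a small Python sketch (not part of the original report) that extracts the two fence sequence numbers and computes their difference. The interpretation is an assumption: the gap between "last emitted" and "last signaled" presumably counts submissions that were queued on the ring but never reported complete. The function name pending_fences and the regular expression are hypothetical helpers written for this note; only the log line itself comes from the thread.

```python
import re

# Matches the amdgpu job-timeout message format seen in the dmesg above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), last emitted seq=(?P<emitted>\d+)"
)


def pending_fences(dmesg_line):
    """Return (ring_name, emitted - signaled), or None if the line does not match."""
    match = RING_TIMEOUT.search(dmesg_line)
    if match is None:
        return None
    emitted = int(match.group("emitted"))
    signaled = int(match.group("signaled"))
    return match.group("ring"), emitted - signaled


# The line is copied from the report above.
line = (
    "[ 253.904103] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=43831, last emitted seq=43833"
)
print(pending_fences(line))
```

For the reported line this yields the gfx ring with a gap of 2, which fits the picture of the page-flip worker waiting on a fence that is never signaled.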
Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2
Did you try it on an x86 board? Is there the same issue?

We should identify whether it is an ARM-specific or a general issue for the amdgpu driver.

Thanks,
David Zhou

On 2018-01-02 00:32, Luís Mendes wrote:

I am currently testing the amdgpu driver with AMD RX460 and RX550 graphics cards on an ARM Cortex-A9 with 1GB RAM, and I am consistently getting deadlocks when playing videos with Kodi or other applications.

I'm using the Linux kernel from https://cgit.freedesktop.org/~agd5f/linux/, branch drm-next-4.16 at commit "drm/amdgpu: Correct the IB size of bo update mapping" - 104bd2ca1124dfd9aa904d5f5a96253ef2b580f6, along with libdrm-2.4.89 and mesa-17.3.1, on Ubuntu 17.10 with the Mate desktop and the Lightdm session manager over X11.

The deadlocks are sometimes almost immediate, but sometimes they take about half an hour to occur. Some of the video files that I am using for testing are more likely to cause a deadlock than others. I got some kernel crash dumps, kodi process backtraces for the offending thread and the deadlocked process tree listing, which I attach here.

The kernel seems to deadlock during a page flip, indefinitely waiting for the DMA fence to complete; however, the fence never completes and the timeout doesn't expire either, so this may be a GPU lockup.

I can provide more details, if needed, if there is interest or time to look into this.

Regards,
Luís Mendes
Software and Hardware engineer

[ 253.904103] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=43831, last emitted seq=43833
[ 253.915041] [drm] IP block:gmc_v8_0 is hung!
[ 253.915047] [drm] IP block:gfx_v8_0 is hung!
[ 253.915162] [drm] GPU recovery disabled.
[ 366.541614] INFO: task kworker/u4:4:90 blocked for more than 120 seconds.
[ 366.548436] Not tainted 4.15.0-rc4-drmnext2g #1
[ 366.554300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 366.562162] kworker/u4:4 D 0 90 2 0x
[ 366.562196] Workqueue: events_unbound commit_work [drm_kms_helper]
[ 366.562215] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[ 366.562223] [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
[ 366.562233] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
[ 366.562241] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
[ 366.562248] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[ 366.562476] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f2d3988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[ 366.562754] [<7f2d3988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f2d509c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
[ 366.562908] [<7f2d509c>] (amdgpu_dm_atomic_commit_tail [amdgpu]) from [<7f13e58c>] (commit_tail+0x50/0x94 [drm_kms_helper])
[ 366.562931] [<7f13e58c>] (commit_tail [drm_kms_helper]) from [<7f13e5ec>] (commit_work+0x1c/0x20 [drm_kms_helper])
[ 366.562948] [<7f13e5ec>] (commit_work [drm_kms_helper]) from [<8016f4c8>] (process_one_work+0x1a8/0x4ac)
[ 366.562955] [<8016f4c8>] (process_one_work) from [<8017050c>] (worker_thread+0x68/0x598)
[ 366.562962] [<8017050c>] (worker_thread) from [<80175e50>] (kthread+0x16c/0x174)
[ 366.562970] [<80175e50>] (kthread) from [<80109de8>] (ret_from_fork+0x14/0x2c)

From userland side:

(gdb) info thread
  Id   Target Id         Frame
* 1    Thread 0x6eb17c70 (LWP 2071) "kodi.bin" 0x748b2246 in ioctl () at ../sysdeps/unix/syscall-template.S:84
  2    Thread 0x6eb14170 (LWP 2072) "Announce" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  3    Thread 0x6e1ff170 (LWP 2075) "ActiveAE" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  4    Thread 0x6d9ff170 (LWP 2076) "AESink" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  5    Thread 0x6b7c9170 (LWP 2081) "amdgpu_cs:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  6    Thread 0x6ae3c170 (LWP 2082) "disk_cache:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  7    Thread 0x571df170 (LWP 2083) "si_shader:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  8    Thread 0x569df170 (LWP 2084) "si_shader_low:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  9    Thread 0x561df170 (LWP 2085) "gallium_drv:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  10   Thread 0x551f6170 (LWP 2086) "kodi.bin" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
  11   Thread 0x549f6170 (LWP 2087) "PeripBusUSBUdev" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
---Type <return> to continue, or q to quit---
  12   Thread 0x541f6170 (LWP