Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-02-07 Thread Deucher, Alexander
We haven't had a chance to look yet.


Alex


From: Luís Mendes 
Sent: Wednesday, February 7, 2018 10:50:48 AM
To: Koenig, Christian
Cc: Alex Deucher; Deucher, Alexander; Zhou, David(ChunMing); Michel Dänzer; 
amd-gfx@lists.freedesktop.org
Subject: Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - 
Update 2

Hi Christian, Alexander,

Kmemleak reported leaked data structures, and the GPU hung shortly afterwards.
Could this be caused by DC?
The details are in the attachments.


I'm not sure whether my previous email was overlooked or whether there
are simply no suggestions at the moment. Sorry for more or less
re-sending the email.


Regards,
Luís


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-02-07 Thread Luís Mendes
Hi Christian, Alexander,

Kmemleak reported leaked data structures, and the GPU hung shortly afterwards.
Could this be caused by DC?
The details are in the attachments.


I'm not sure whether my previous email was overlooked or whether there
are simply no suggestions at the moment. Sorry for more or less
re-sending the email.


Regards,
Luís


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-02-05 Thread Luís Mendes
Hi everyone,

I have some updates. I left the system idle for most of the weekend;
from time to time I played a video on YouTube and turned off the screen.
Last night I did the same, and when I checked the system this morning it
had hung during the night. This time it took a lot longer to hang, but I
think the hang was related to a Flash animation ad that was present on the
YouTube page only the last time I switched off the screen. In all the
crash attempts I have made, amdgpu always seems to hang when that Flash
animation is present.
Kmemleak reports a memory leak; I attach its output along with the crash
dmesg log.

The kernel and patches are the same as in my previous email. In the end I
changed neither the Mesa version nor the kernel version and patches.

Regards,
Luís


ubuntu@linux:~$ sudo cat /sys/kernel/debug/kmemleak
[sudo] password for ubuntu:
unreferenced object 0xb0fac380 (size 128):
  comm "Xorg", pid 3750, jiffies 5608934 (age 178088.970s)
  hex dump (first 32 bytes):
00 4e 9f b9 00 f0 33 bb 80 1a 15 97 00 00 00 00  .N3.
fa 00 00 00 82 01 00 00 80 00 00 00 80 00 00 00  
  backtrace:
[<400a53a4>] kmem_cache_
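A truncated report like the one above is easier to compare between runs once it is summarised. The following is a minimal Python sketch, not part of the original attachment, that totals leaked bytes per process name and object size. It assumes only the two header-line shapes visible above ("unreferenced object ... (size N):" and the "comm" line); the function name summarize_kmemleak is made up for illustration.

```python
import re
from collections import Counter

# Header and "comm" lines as they appear in the kmemleak report above.
OBJECT_RE = re.compile(r"unreferenced object (0x[0-9a-f]+) \(size (\d+)\):")
COMM_RE = re.compile(r'comm "([^"]+)", pid (\d+), jiffies (\d+)')


def summarize_kmemleak(report: str) -> Counter:
    """Return leaked bytes keyed by (comm, object size)."""
    totals = Counter()
    size = None
    for line in report.splitlines():
        obj = OBJECT_RE.search(line)
        if obj:
            # Remember the size until the matching "comm" line shows up.
            size = int(obj.group(2))
            continue
        comm = COMM_RE.search(line)
        if comm and size is not None:
            totals[(comm.group(1), size)] += size
            size = None
    return totals


# The two complete header lines from the report quoted in this message.
SAMPLE = '''unreferenced object 0xb0fac380 (size 128):
  comm "Xorg", pid 3750, jiffies 5608934 (age 178088.970s)
'''

if __name__ == "__main__":
    for (comm, size), total in summarize_kmemleak(SAMPLE).most_common():
        print(f"{comm}: {total} bytes in {size}-byte objects")
```

Run against a full dump, a total that keeps growing for one (comm, size) pair between two scans would point at a repeating leak rather than a one-off allocation.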

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-02-02 Thread Luís Mendes
Hi Christian, Alexander,

I have enabled kmemleak, but it didn't detect anything unusual. In fact,
this time, for reasons I don't understand, I didn't get any allocation
failure at all, but the GPU did hang after about 4h 6m of uptime with
Xorg.
The log is attached. I will try again to see whether the allocation
failure reappears or whether it has become less apparent because of the
kmemleak scans.

The kernel stack trace is similar to the one from the GPU hangs I was
getting on earlier kernel versions when watching videos with Kodi or
Firefox; back then, if I left Xorg idle, it would stay up without
hanging for more than a day.
This stack trace also looks quite similar to what Daniel Andersson
reported in "[BUG] Intermittent hang/deadlock when opening browser tab
with Vega gpu"; it looks like another manifestation of the same bug on a
different architecture.

Regards,
Luís

Feb  2 16:29:29 localhost kernel: [14801.740467] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=831006, last emitted seq=831008
Feb  2 16:29:29 localhost kernel: [14801.751557] [drm] IP block:gmc_v8_0 is hung!
Feb  2 16:29:29 localhost kernel: [14801.751563] [drm] IP block:gfx_v8_0 is hung!
Feb  2 16:29:29 localhost kernel: [14801.751611] [drm] GPU recovery disabled.
Feb  2 16:44:53 localhost kernel: [15725.856181] INFO: task amdgpu_cs:0:3803 blocked for more than 120 seconds.
Feb  2 16:44:53 localhost kernel: [15725.863085]   Not tainted 4.15.0-rc8-next2g-g9ab2894-dirty #3
Feb  2 16:44:53 localhost kernel: [15725.869213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  2 16:44:53 localhost kernel: [15725.877078] amdgpu_cs:0 D0  3803   3091 0x
Feb  2 16:44:53 localhost kernel: [15725.877084] Backtrace: 
Feb  2 16:44:53 localhost kernel: [15725.877096] [<80b571c8>] (__schedule) from [<80b578cc>] (schedule+0x44/0xa4)
Feb  2 16:44:53 localhost kernel: [15725.877102]  r10:600f0013 r9:b45b6000 r8:b45b7bd4 r7: r6:7fff r5:81004c48
Feb  2 16:44:53 localhost kernel: [15725.877104]  r4:e000
Feb  2 16:44:53 localhost kernel: [15725.877110] [<80b57888>] (schedule) from [<80b5b4f0>] (schedule_timeout+0x1e0/0x2e8)
Feb  2 16:44:53 localhost kernel: [15725.877112]  r5:81004c48 r4:7fff
F
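The timeout line at the top of this log carries two sequence numbers, and the gap between them is, as far as I understand the message, the number of fences that had been emitted on the ring but had not yet signaled when the job timed out. A small Python sketch for pulling that out of a log follows; the helper name pending_fences is hypothetical, and the regular expression assumes only the wording shown in the log above.

```python
import re

# Matches the amdgpu_job_timedout message format seen in the log above.
TIMEOUT_RE = re.compile(
    r"ring (\w+) timeout, last signaled seq=(\d+), last emitted seq=(\d+)"
)


def pending_fences(log_line: str):
    """Return (ring name, emitted - signaled), or None if no match."""
    match = TIMEOUT_RE.search(log_line)
    if match is None:
        return None
    ring = match.group(1)
    signaled = int(match.group(2))
    emitted = int(match.group(3))
    return ring, emitted - signaled


# First line of the log in this message, joined onto one line.
LINE = (
    "Feb  2 16:29:29 localhost kernel: [14801.740467] "
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, "
    "last signaled seq=831006, last emitted seq=831008"
)

if __name__ == "__main__":
    print(pending_fences(LINE))
```

Comparing this gap across several hangs is a cheap way to check whether the ring always stalls with the same number of jobs in flight.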

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-02-01 Thread Christian König

Hi Luis,

Please enable kmemleak in your build and watch for any suspicious
messages in the system log.


Regards,
Christian.


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-02-01 Thread Luís Mendes
Hi Alexander,

I didn't notice improvements on this issue with that particular patch
applied. It still ends up failing to allocate kernel memory after a
few hours of uptime with Xorg.

I will try to upgrade to mesa 18.0.0-rc3 and to amd-staging-drm-next
head, to see if the issue still occurs with those versions.

If you have additional suggestions I'll be happy to try them.

Regards,
Luís Mendes



Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-31 Thread Alex Deucher
On Wed, Jan 31, 2018 at 6:57 PM, Luís Mendes  wrote:
> Hi everyone,
>
> I am getting a new issue with amdgpu on the RX 460. I can now play any
> video with Kodi, play web videos with Firefox, and run OpenGL
> applications without any problems. However, after some uptime with Xorg,
> even when it is almost inactive, I get a kmalloc allocation failure,
> normally followed a while later by a GPU hang.
> I had a terminal window open under Ubuntu MATE 17.10 and was compiling
> code when I got the attached kernel messages.
>
> I am using the kernel identified in my previous email, which can be
> found below.

Does this patch help?
https://patchwork.freedesktop.org/patch/198258/

Alex

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-31 Thread Luís Mendes
Hi everyone,

I am getting a new issue with amdgpu on the RX 460. I can now play any
video with Kodi, play web videos with Firefox, and run OpenGL
applications without any problems. However, after some uptime with Xorg,
even when it is almost inactive, I get a kmalloc allocation failure,
normally followed a while later by a GPU hang.
I had a terminal window open under Ubuntu MATE 17.10 and was compiling
code when I got the attached kernel messages.

I am using the kernel identified in my previous email, which can be
found below.

Regards,
Luís Mendes

On Wed, Jan 31, 2018 at 12:47 PM, Luís Mendes  wrote:
> Hi Alexander,
>
> I've cherry-picked the patch you pointed out into the kernel from
> amd-drm-next-4.17-wip at commit
> 9ab2894122275a6d636bb2654a157e88a0f7b9e2 (drm/amdgpu: set
> DRIVER_ATOMIC flag early) and tested it on ARMv7l, and the problem is
> indeed gone.
>
>
> Working great on ARMv7l with AMD RX460.
>
> Thanks,
> Luís Mendes
>
>
> On Tue, Jan 30, 2018 at 6:44 PM, Deucher, Alexander
>  wrote:
>> Fixed with this patch:
>>
>> https://lists.freedesktop.org/archives/amd-gfx/2018-January/018472.html
>>
>>
>> Alex
Jan 31 21:56:11 localhost kernel: [ 4091.449841] Xorg: page allocation failure: order:5, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
Jan 31 21:56:11 localhost kernel: [ 4091.449845] Xorg cpuset=/ mems_allowed=0
Jan 31 21:56:11 localhost kernel: [ 4091.449855] CPU: 0 PID: 3810 Comm: Xorg Not tainted 4.15.0-rc8-next2g-g9ab2894-dirty #1
Jan 31 21:56:11 localhost kernel: [ 4091.449857] Hardware name: Marvell Armada 380/385 (Device Tree)
Jan 31 21:56:11 localhost kernel: [ 4091.449859] Backtrace: 
Jan 31 21:56:11 localhost kernel: [ 4091.449870] [<8010dca8>] (dump_backtrace) from [<8010dfa4>] (show_stack+0x18/0x1c)
Jan 31 21:56:11 localhost kernel: [ 4091.449875]  r7:e000 r6:60070013 r5: r4:8108d150
Jan 31 21:56:11 localhost kernel: [ 4091.449883] [<8010df8c>] (show_stack) from [<80b3ef04>] (dump_stack+0x94/0xa8)
Jan 31 21:56:11 localhost kernel: [ 4091.449891] [<80b3ee70>] (dump_stack) from [<8021fbe8>] (warn_alloc+0xc4/0x15c)
Jan 31 21:56:11 localhost kernel: [ 4091.449895]  r7:e000 r6:80d53610 r5: r4:81004c48
Jan 31 21:56:11 localhost kernel: [ 4091.449900] [<8021fb28>] (warn_alloc) from [<80220b0c>] (__alloc_pages_nodemask+0xde4/0xf54)
Jan 31 21:56:11 localhost kernel: [ 4091.449902]  r3:0005 r2:80d53610
Jan 31 21:56:11 localhost kernel: [ 4091.449905]  r7:0032 r6:0140c0c0 r5:0040 r4:
Jan 31 21:56:11 localhost kernel: [ 4091.449911] [<8021fd28>] (__alloc_pages_nodemask) from [<80242164>] (kmalloc_order+0x20/0x38)
Jan 31 21:56:11 localhost kernel: [ 4091.449915]  r10:bcb48000 r9:a766ca00 r8:0005 r7:7f2f37f4 r6:014080c0 r5:00018018
Jan 31 21:56:11 localhost kernel: [ 4091.449917]  r4:bc83d02c
Jan 31 21:56:11 localhost kernel: [ 4091.449922] [<80242144>] (kmalloc_order) from [<802421a0>] (kmalloc_order_trace+0x24/0xc8)
Jan 31 21:56:11 localhost kernel: [ 4091.450184] [<8024217c>] (kmalloc_order_trace) from [<7f2f37f4>] (dc_create_gamma+0x24/0x34 [amdgpu])
Jan 31 21:56:11 localhost kernel: [ 4091.450189]  r10:bcb48000 r9:a766ca00 r8: r7:0001 r6:be4f0c00 r5:b9de1448
Jan 31 21:56:11 localhost kernel: [ 4091.450191]  r4:bc83d02c
Jan 31 21:56:11 localhost kernel: [ 4091.450474] [<7f2f37d0>] (dc_create_gamma [amdgpu]) from [<7f29d8a8>] (amdgpu_dm_atomic_check+0x67c/0xc6c [amdgpu])
Jan 31 21:56:11 localhost kernel: [ 4091.450658] [<7f29d22c>] (amdgpu_dm_atomic_check [amdgpu]) from [<7f0af238>] (drm_atomic_check_only+0x3bc/0x5c4 [drm])
Jan 31 21:56:11 localhost kernel: [ 4091.450663]  r10:0800 r9:7fff r8:81004c48 r7:b4718e80 r6:b9cd1380 r5:0001
Jan 31 21:56:11 localhost kernel: [ 4091.450664]  r4:
Jan 31 21:56:11 localhost kernel: [ 4091.450711] [<7f0aee7c>] (drm_atomic_check_only [drm]) from [<7f0af458>] (drm_atomic_commit+0x18/0x60 [drm])
Jan 31 21:56:11 localhost kernel: [ 4091.450715]  r10:0800 r9:bbc4c000 r8:bc83d000 r7:b9cd1380 r6:be1b5000 r5:b9cd1380
Jan 31 21:56:11 localhost kernel: [ 4091.450717]  r4:0001
Jan 31 21:56:11 localhost kernel: [ 4091.450763] [<7f0af440>] (drm_atomic_commit [drm]) from [<7f1260e8>] (drm_atomic_helper_legacy_gamma_set+0x110/0x160 [drm_kms_helper])
Jan 31 21:56:11 localhost kernel: [ 4091.450767]  r7:b9cd1380 r6:bbc4b9fe r5:a766d200 r4:0001
Jan 31 21:56:11 localhost kernel: [ 4091.450799] [<7f125fd8>] (drm_atomic_helper_legacy_gamma_set [drm_kms_helper]) from [<7f0b951c>] (drm_mode_gamma_set_ioctl+0x1c4/0x2c0 [drm])
Jan 31 21:56:11 localhost kernel: [ 4091.450804]  r10:e000 r9:bbc4ba00 r8:bbc4b800 r7:bbc4c034 r6:b1911e2c r5:b1911d80
Jan 31 21:56:11 localhost kernel: [ 4091.450806]  r4:7f125fd8 r3:bbc4bc00
Jan 31 21:56:11 localhost kernel: [ 4091.450848] [<7f0b9358>] (drm_mode_gamma_set_ioctl [drm]) from [<7f09d920>] (drm_ioctl_kernel+0x68/0xb4 [drm
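The "order:5" in the first line of this trace means the allocator was asked for 2^5 physically contiguous pages in one block, which is the kind of request that tends to fail once memory has fragmented after some uptime. The Python sketch below just does that arithmetic. The 4 KiB page size is an assumption (it is the usual value on ARMv7 but is not stated in the log), and treating 0x18018 from the kmalloc_order register dump as the requested size is only a guess used for illustration; both function names are made up.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the log


def order_to_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size in bytes of one contiguous block of the given page order."""
    return (1 << order) * page_size


def min_order_for(size: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest page order whose block can hold `size` bytes."""
    pages = -(-size // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


if __name__ == "__main__":
    # order:5 as reported by the page allocation failure above.
    print(order_to_bytes(5))
    # 0x18018 appears as r5 in the kmalloc_order frame; using it as the
    # request size here is a guess, not something the log confirms.
    print(min_order_for(0x18018))
```

Under these assumptions a single gamma allocation needs 128 KiB of contiguous memory, which would explain why the failure shows up only after the system has been running for a while.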

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-30 Thread Deucher, Alexander
Fixed with this patch:

https://lists.freedesktop.org/archives/amd-gfx/2018-January/018472.html


Alex


From: Luís Mendes 
Sent: Tuesday, January 30, 2018 1:30 PM
To: Michel Dänzer; Koenig, Christian
Cc: Deucher, Alexander; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - 
Update 2

Hi everyone,

I've tested the kernel from amd-drm-next-4.17-wip at commit
9ab2894122275a6d636bb2654a157e88a0f7b9e2 (drm/amdgpu: set DRIVER_ATOMIC
flag early) on ARMv7l, and the reported issues now seem to be gone. I
haven't checked which commit fixed them, but they are fixed! I also
noticed a performance improvement in one of the glmark2 tests.

There seem to be some other small issues, possibly unrelated: sometimes,
while playing a video, the screen goes black and the sound stops for a
second or less, and then normal playback resumes. This happens rarely,
at most once per power cycle, while using X and Kodi, even though I have
played many individual videos and power cycled the machine several times.

I've also observed what was already reported, when watching non-VP9 videos:
[  591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.740255] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.750968] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.761628] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.782672] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.793172] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.803681] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.814129] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.824560] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.835054] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.845437] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.855860] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.866415] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.876945] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!
[  591.887454] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu: writing more dwords to the ring than expected!

Regards,
Luís Mendes

On Wed, Jan 3, 2018 at 11:08 PM, Luís Mendes  wrote:
> Hi Michel, Christian,
>
> Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
> only init the apertures used by KGD (v2)" -
> 0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
> on ARMv7 and on x86 amd64.
>
> Christian, in fact if I replay the apitraces obtained on the ARMv7
> platform on the AMD64 I am also able to reproduce the GPU hang! So it
> is not ARM platform specific. Should I send/upload the apitraces? I
> have two of them, typically when one doesn't hang the gpu the other
> hangs. One takes about 1GB of disk space while the other takes 2.3GB.
> ...
> [   69.019381] ISO 9660 Extensions: RRIP_1991A
> [  213.292094] DMAR: DRHD: handling fault status reg 2
> [  213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index
> 1c [fault reason 38] Blocked an interrupt request due to source-id
> verification failure
> [  223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
> timeout, last signaled seq=25158, last emitted seq=25160
> [  223.406926] [drm] IP block:tonga_ih is hung!
> [  223.407167] [drm] GPU recovery disabled.
>
> Regards,
> Luís
>
>
> On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes  wrote:
>> Hi Michel, Christian,
>>
>> Christian, I have followed your suggestion and I have just submitted a
>> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
>> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
>> platforms while playing video.
>>
>> Michel, amdgpu.dc=0 seems to make no difference. I will try
>> amd-staging-drm-next and report back.
>>
>> Regards,
>> Luís
>>
>> On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-30 Thread Luís Mendes
Hi everyone,

I've tested the kernel from amd-drm-next-4.17-wip at commit
9ab2894122275a6d636bb2654a157e88a0f7b9e2
(drm/amdgpu: set DRIVER_ATOMIC flag early) on ARMv7l, and the reported
issues now seem to be gone. I haven't checked which commit fixed
this, but it is now fixed! I also noticed a performance
improvement in one of the glmark2 tests.

There seem to be some other small issues, possibly unrelated: sometimes
the screen goes black and the sound stops for a second or less while
playing a video, and then normal playback recovers. This happens rarely,
at most once per power cycle, while using X and Kodi, even though I have
played many individual videos and power cycled the machine a number of
times.

I've also observed what was already reported, when watching non-VP9 videos:
[  591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.740255] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.750968] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.761628] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.782672] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.793172] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.803681] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.814129] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.824560] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.835054] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.845437] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.855860] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.866415] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.876945] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.887454] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
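
To see at a glance which functions emit the error, the excerpt above can be
tallied with a short script. This is only a sketch and not part of the
original report: the name count_ring_errors and the SAMPLE text (two entries
copied from the log above) are illustrative, and it assumes each entry may be
wrapped across two lines, as it is in this mail.

```python
import re
from collections import Counter

# Matches one "writing more dwords" entry and captures the reporting function.
RING_ERROR = re.compile(
    r"\[drm:(\w+) \[amdgpu\]\] \*ERROR\* amdgpu: "
    r"writing more dwords to the ring than expected!"
)

# Two entries copied from the log above, wrapped the same way as in the mail.
SAMPLE = """\
[  591.729558] [drm:uvd_v6_0_ring_emit_fence [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
[  591.772248] [drm:amdgpu_ring_insert_nop [amdgpu]] *ERROR* amdgpu:
writing more dwords to the ring than expected!
"""


def count_ring_errors(log_text):
    """Count ring overflow errors per reporting function."""
    # Collapse all whitespace so entries wrapped by the mail client match too.
    flat = " ".join(log_text.split())
    return Counter(RING_ERROR.findall(flat))


if __name__ == "__main__":
    for func, hits in count_ring_errors(SAMPLE).most_common():
        print(f"{func}: {hits}")
```

Feeding the full dmesg text to count_ring_errors() instead of SAMPLE would
give the split between uvd_v6_0_ring_emit_fence and amdgpu_ring_insert_nop.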

Regards,
Luís Mendes

On Wed, Jan 3, 2018 at 11:08 PM, Luís Mendes  wrote:
> Hi Michel, Christian,
>
> Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
> only init the apertures used by KGD (v2)" -
> 0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
> on ARMv7 and on x86 amd64.
>
> Christian, in fact if I replay the apitraces obtained on the ARMv7
> platform on the AMD64 I am also able to reproduce the GPU hang! So it
> is not ARM platform specific. Should I send/upload the apitraces? I
> have two of them, typically when one doesn't hang the gpu the other
> hangs. One takes about 1GB of disk space while the other takes 2.3GB.
> ...
> [   69.019381] ISO 9660 Extensions: RRIP_1991A
> [  213.292094] DMAR: DRHD: handling fault status reg 2
> [  213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index
> 1c [fault reason 38] Blocked an interrupt request due to source-id
> verification failure
> [  223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
> timeout, last signaled seq=25158, last emitted seq=25160
> [  223.406926] [drm] IP block:tonga_ih is hung!
> [  223.407167] [drm] GPU recovery disabled.
>
> Regards,
> Luís
>
>
> On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes  wrote:
>> Hi Michel, Christian,
>>
>> Christian, I have followed your suggestion and I have just submitted a
>> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
>> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
>> platforms while playing video.
>>
>> Michel, amdgpu.dc=0 seems to make no difference. I will try
>> amd-staging-drm-next and report back.
>>
>> Regards,
>> Luís
>>
>> On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer  wrote:
>>> On 2018-01-03 12:02 PM, Luís Mendes wrote:

 What seems to be the case is that the GPU lockup only
 happens when doing a page flip, since the kernel locks with:
 [  243.693200] kworker/u4:3D089  2 0x
 [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
 [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] 
 (schedule+0x4c/0xac)
 [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
 (schedule_timeout+0x228/0x444)
 [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
 (dma

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Luís Mendes
Hi Michel, Christian,

Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
only init the apertures used by KGD (v2)" -
0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
on ARMv7 and on x86 amd64.

Christian, in fact if I replay the apitraces obtained on the ARMv7
platform on the AMD64 I am also able to reproduce the GPU hang! So it
is not ARM platform specific. Should I send/upload the apitraces? I
have two of them, typically when one doesn't hang the gpu the other
hangs. One takes about 1GB of disk space while the other takes 2.3GB.
...
[   69.019381] ISO 9660 Extensions: RRIP_1991A
[  213.292094] DMAR: DRHD: handling fault status reg 2
[  213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index
1c [fault reason 38] Blocked an interrupt request due to source-id
verification failure
[  223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, last signaled seq=25158, last emitted seq=25160
[  223.406926] [drm] IP block:tonga_ih is hung!
[  223.407167] [drm] GPU recovery disabled.

Regards,
Luís


On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes  wrote:
> Hi Michel, Christian,
>
> Christian, I have followed your suggestion and I have just submitted a
> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
> platforms while playing video.
>
> Michel, amdgpu.dc=0 seems to make no difference. I will try
> amd-staging-drm-next and report back.
>
> Regards,
> Luís
>
> On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer  wrote:
>> On 2018-01-03 12:02 PM, Luís Mendes wrote:
>>>
>>> What seems to be the case is that the GPU lockup only
>>> happens when doing a page flip, since the kernel locks with:
>>> [  243.693200] kworker/u4:3D089  2 0x
>>> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
>>> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] 
>>> (schedule+0x4c/0xac)
>>> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
>>> (schedule_timeout+0x228/0x444)
>>> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
>>> (dma_fence_default_wait+0x2b4/0x2d8)
>>> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
>>> (dma_fence_wait_timeout+0x40/0x150)
>>> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
>>> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
>>> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
>>> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
>>> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
>>> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
>>> ...
>>
>> Does the problem also occur if you disable DC with amdgpu.dc=0 on the
>> kernel command line?
>>
>> Does it also happen with a kernel built from the amd-staging-drm-next
>> branch instead of drm-next-4.16?
>>
>>
>> --
>> Earthling Michel Dänzer   |   http://www.amd.com
>> Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Luís Mendes
Hi Michel, Christian,

Christian, I have followed your suggestion and I have just submitted a
bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
platforms while playing video.

Michel, amdgpu.dc=0 seems to make no difference. I will try
amd-staging-drm-next and report back.

Regards,
Luís

On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer  wrote:
> On 2018-01-03 12:02 PM, Luís Mendes wrote:
>>
>> What seems to be the case is that the GPU lockup only
>> happens when doing a page flip, since the kernel locks with:
>> [  243.693200] kworker/u4:3D089  2 0x
>> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
>> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] 
>> (schedule+0x4c/0xac)
>> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
>> (schedule_timeout+0x228/0x444)
>> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
>> (dma_fence_default_wait+0x2b4/0x2d8)
>> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
>> (dma_fence_wait_timeout+0x40/0x150)
>> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
>> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
>> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
>> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
>> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
>> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
>> ...
>
> Does the problem also occur if you disable DC with amdgpu.dc=0 on the
> kernel command line?
>
> Does it also happen with a kernel built from the amd-staging-drm-next
> branch instead of drm-next-4.16?
>
>
> --
> Earthling Michel Dänzer   |   http://www.amd.com
> Libre software enthusiast | Mesa and X developer


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Michel Dänzer
On 2018-01-03 12:02 PM, Luís Mendes wrote:
> 
> What seems to be the case is that the GPU lockup only
> happens when doing a page flip, since the kernel locks with:
> [  243.693200] kworker/u4:3D089  2 0x
> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] 
> (schedule+0x4c/0xac)
> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
> (schedule_timeout+0x228/0x444)
> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
> (dma_fence_default_wait+0x2b4/0x2d8)
> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
> (dma_fence_wait_timeout+0x40/0x150)
> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
> ...

Does the problem also occur if you disable DC with amdgpu.dc=0 on the
kernel command line?

Does it also happen with a kernel built from the amd-staging-drm-next
branch instead of drm-next-4.16?
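
When testing with amdgpu.dc=0, it can help to confirm that the running kernel
was really booted with that parameter. A minimal sketch, assuming the command
line is read from /proc/cmdline; the helper names kernel_params and
dc_disabled and the example command line in the comment are made up for
illustration, and values containing quoted spaces are not handled.

```python
def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' map to None. Quoted values that contain
    spaces are not handled by this simple split.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def dc_disabled(cmdline):
    """Return True if amdgpu.dc=0 is present on the command line."""
    return kernel_params(cmdline).get("amdgpu.dc") == "0"


# Hypothetical example input:
#   dc_disabled("console=ttyS0,115200 root=/dev/sda1 amdgpu.dc=0")
```

On the test machine this would be called as
dc_disabled(open("/proc/cmdline").read()).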


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Christian König
In this case please open a bug report on fdo and describe exactly how to 
reproduce it.


Marek should be able to take a look then.

Thanks,
Christian.

On 03.01.2018 at 12:56, Luís Mendes wrote:

Hi Christian, David,

David, replying to your question... The issue is indeed reproducible
on x86, I just did it with kodi and the same VP9 video. So it is not
arm specific.

Regards,
Luís

On Wed, Jan 3, 2018 at 11:02 AM, Luís Mendes  wrote:

Hi Christian,

Replies follow in between.

Regards,
Luís

On Wed, Jan 3, 2018 at 9:37 AM, Christian König
 wrote:

Hi Luis,

In general please add information like /proc/iomem and dmesg as attachment
and not mangled inside the mail.

Ok, I'll take that into account next time. Sorry for the inconvenience.


The good news is that your ARM board at least has a memory layout which
should work in theory. So at least one problem is ruled out.

Ok, nice.


I don't think that apitrace would be of much help in this case as long as no
developer has access to one of those ARM boards. But it is interesting that
the apitrace reliably reproduces the issue. This means that it isn't
something random, but rather a specific timing of things.

I am afraid I don't have any boards that I can send yet. I am
developing one, but it will still take some time before I have one
ready.

I've checked the apitrace and there is a common call,
glXSwapBuffers(dpy=0x1389f00, drawable=52428803), that I believe
triggers the page flip. I suspect there is a race condition with
glXSwapBuffers in mesa or amdgpu that corrupts some of the data sent
to the GPU, causing a hang.
What seems to be the case is that the GPU lockup only
happens when doing a page flip, since the kernel locks with:
[  243.693200] kworker/u4:3D089  2 0x
[  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
[  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
(schedule_timeout+0x228/0x444)
[  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
(dma_fence_default_wait+0x2b4/0x2d8)
[  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
(dma_fence_wait_timeout+0x40/0x150)
[  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
(reservation_object_wait_timeout_rcu+0xfc/0x34c)
[  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
[<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
[<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
...

I will try to reproduce this on x86 with a similar software stack
and the apitrace traces I got.
What do you think, does this make sense? Do you have further
suggestions that may help pin down the problem?

Another strange thing: the traces that were consistently causing
hangs yesterday are having a bit more difficulty causing them today,
but if I play the video with kodi it hangs easily again. Both kodi and
glretrace always hang with similar kernel backtraces, like the one
above.



Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Luís Mendes
Hi Christian, David,

David, replying to your question... The issue is indeed reproducible
on x86, I just did it with kodi and the same VP9 video. So it is not
arm specific.

Regards,
Luís

On Wed, Jan 3, 2018 at 11:02 AM, Luís Mendes  wrote:
> Hi Christian,
>
> Replies follow in between.
>
> Regards,
> Luís
>
> On Wed, Jan 3, 2018 at 9:37 AM, Christian König
>  wrote:
>> Hi Luis,
>>
>> In general please add information like /proc/iomem and dmesg as attachment
>> and not mangled inside the mail.
>
> Ok, I'll take that into account next time. Sorry for the inconvenience.
>
>>
>> The good news is that your ARM board at least has a memory layout which
>> should work in theory. So at least one problem is ruled out.
>
> Ok, nice.
>
>>
>> I don't think that apitrace would be of much help in this case as long as no
>> developer has access to one of those ARM boards. But it is interesting that
>> the apitrace reliably reproduces the issue. This means that it isn't
>> something random, but rather a specific timing of things.
>
> I am afraid I don't have any boards that I can send yet. I am
> developing one, but it will still take some time before I have one
> ready.
>
> I've checked the apitrace and there is a common call,
> glXSwapBuffers(dpy=0x1389f00, drawable=52428803), that I believe
> triggers the page flip. I suspect there is a race condition with
> glXSwapBuffers in mesa or amdgpu that corrupts some of the data sent
> to the GPU, causing a hang.
> What seems to be the case is that the GPU lockup only
> happens when doing a page flip, since the kernel locks with:
> [  243.693200] kworker/u4:3D089  2 0x
> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] 
> (schedule+0x4c/0xac)
> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
> (schedule_timeout+0x228/0x444)
> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
> (dma_fence_default_wait+0x2b4/0x2d8)
> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
> (dma_fence_wait_timeout+0x40/0x150)
> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
> ...
>
> I will try to reproduce this on x86 with a similar software stack
> and the apitrace traces I got.
> What do you think, does this make sense? Do you have further
> suggestions that may help pin down the problem?
>
> Another strange thing: the traces that were consistently causing
> hangs yesterday are having a bit more difficulty causing them today,
> but if I play the video with kodi it hangs easily again. Both kodi and
> glretrace always hang with similar kernel backtraces, like the one
> above.


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Luís Mendes
Hi Christian,

Replies follow in between.

Regards,
Luís

On Wed, Jan 3, 2018 at 9:37 AM, Christian König
 wrote:
> Hi Luis,
>
> In general please add information like /proc/iomem and dmesg as attachment
> and not mangled inside the mail.

Ok, I'll take that into account next time. Sorry for the inconvenience.

>
> The good news is that your ARM board at least has a memory layout which
> should work in theory. So at least one problem is ruled out.

Ok, nice.

>
> I don't think that apitrace would be of much help in this case as long as no
> developer has access to one of those ARM boards. But it is interesting that
> the apitrace reliably reproduces the issue. This means that it isn't
> something random, but rather a specific timing of things.

I am afraid I don't have any boards that I can send yet. I am
developing one, but it will still take some time before I have one
ready.

I've checked the apitrace and there is a common call,
glXSwapBuffers(dpy=0x1389f00, drawable=52428803), that I believe
triggers the page flip. I suspect there is a race condition with
glXSwapBuffers in mesa or amdgpu that corrupts some of the data sent
to the GPU, causing a hang.
What seems to be the case is that the GPU lockup only
happens when doing a page flip, since the kernel locks with:
[  243.693200] kworker/u4:3D089  2 0x
[  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
[  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
(schedule_timeout+0x228/0x444)
[  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
(dma_fence_default_wait+0x2b4/0x2d8)
[  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
(dma_fence_wait_timeout+0x40/0x150)
[  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
(reservation_object_wait_timeout_rcu+0xfc/0x34c)
[  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
[<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
[<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
...
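
The call chain in a backtrace like the one above can be pulled out
mechanically, which makes it easier to compare hangs from different runs.
This is only a rough sketch: the name call_chain and the SAMPLE text (the
last three frames copied from the trace above) are illustrative, and it
assumes the ARM "(callee) from [<addr>] (caller+off/size)" format with
entries possibly wrapped across lines.

```python
import re

# One frame: "(callee [module]) from [<addr>] (caller+0xoff/0xsize".
# The "[module]" part is optional and ignored.
FRAME = re.compile(
    r"\((\w+)(?: \[\w+\])?\) from \[<[0-9a-f]+>\] "
    r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Last three frames of the trace above, wrapped as in the mail.
SAMPLE = """\
[  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
(reservation_object_wait_timeout_rcu+0xfc/0x34c)
[  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
[<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
[<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
"""


def call_chain(log_text):
    """Return function names from innermost callee to outermost caller."""
    # Collapse whitespace so frames wrapped over two lines still match.
    flat = " ".join(log_text.split())
    chain = []
    for callee, caller in FRAME.findall(flat):
        if not chain:
            chain.append(callee)
        chain.append(caller)
    return chain


if __name__ == "__main__":
    print(" <- ".join(call_chain(SAMPLE)))
```

The sketch does not check that consecutive frames actually connect; it simply
trusts the order in which the kernel printed them.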

I will try to reproduce this on x86 with a similar software stack
and the apitrace traces I got.
What do you think, does this make sense? Do you have further
suggestions that may help pin down the problem?

Another strange thing: the traces that were consistently causing
hangs yesterday are having a bit more difficulty causing them today,
but if I play the video with kodi it hangs easily again. Both kodi and
glretrace always hang with similar kernel backtraces, like the one
above.


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-03 Thread Christian König

Hi Luis,

In general, please add information like /proc/iomem and dmesg as
attachments and not mangled inside the mail.


The good news is that your ARM board at least has a memory layout which
should work in theory. So at least one problem is ruled out.


I don't think that apitrace would be of much help in this case as long
as no developer has access to one of those ARM boards. But it is
interesting that the apitrace reliably reproduces the issue. This means
that it isn't something random, but rather a specific timing of things.


Regards,
Christian.

On 03.01.2018 at 01:36, Luís Mendes wrote:

Just a small update regarding what I posted earlier...

I've made additional tests with mesa-17.4 at commit "radv: Implement
binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8 and I was
able to gather a smaller apitrace, about 1GB, of kodi playing a video.
It hangs the GPU almost always when replayed with glretrace without
the option --singlethread. If the option --singlethread is used when
running glretrace, no GPU hang ever occurs, it seems.

For some reason I am now getting past the lightdm login screen without
issues. Maybe some of the suggested changes improved the behaviour
with mesa-17.4; however, with mesa-17.3.1 I didn't have those issues
anyway.

Now both mesa-17.3.1 and mesa-17.4 behave similarly, blocking while
playing video with kodi, but it is also possible to cause the GPU hang
with other applications.
On the other hand, pure OpenGL applications seem to work fine: I am
able to run the glmark2 tests without issues.

How can I send these apitraces?

On Tue, Jan 2, 2018 at 10:29 PM, Luís Mendes  wrote:

Ok... I've done some of the suggested tests.

I still haven't tested on x86, but I'll get to that.

I've recompiled the kernel to disable Power Management as much as
possible at all levels, including PCIe. I've also modified
/include/drm/drm_cache.h so that static inline bool
drm_arch_can_wc_memory(void) always returns false, but neither
change solved the issue.

When I run kodi under apitrace with mesa 17.3.1 it becomes much more
difficult to reproduce the crash, as there are a lot of missed frames due
to the CPU overload of apitrace, but I was able to crash the GPU
once. The apitrace log is 2.3GB; how should I send it?
It happened while playing a VP9 encoded webm video file, which is
decoded in software, as the RX 460 is unable to hardware decode this codec
AFAIK. In fact, software decoded videos are more prone to produce the
GPU hang, while an H265 4K hardware decoded video never causes a GPU
hang. I'm afraid I forgot to have kodi log the execution data when
I did the apitrace.


The full dmesg is presented below, as well as the /proc/iomem
information and lspci output.
I just want to note that I'm having EDID DDC errors with my TV
screen because, at some point from kernel 4.14 onwards, both the RX460
and the RX550 cards started to corrupt the TV screen's I2C EDID
memory, so that I have to reflash the correct EDID data to get the
screen back to its own configuration. This is a rare problem that only
occurs with this TV. All other TVs and monitors that I've tested don't
show this EDID corruption issue. I have currently stopped reflashing
the I2C EDID configuration memory of my TV to avoid exceeding its
write cycle endurance; instead I now modify gpu/drm/drm_edid.c
in function drm_do_get_edid() to allow the corrupted EDID to pass and
enter X. So please ignore the EDID error warnings in my dmesg log. The
GPU hangs occur just the same even when I have the correct EDID, as
it is an unrelated issue.

Regards,
Luís

iomem shows this:
-3fff : System RAM
   8000-00ef : Kernel code
   0100-010e3913 : Kernel data
d000-efff : PCI MEM
   d000-e7ff : PCI Bus :01
 d000-dfff : :01:00.0
 e000-e01f : :01:00.0
 e020-e023 : :01:00.0
 e024-e025 : :01:00.0
 e026-e0263fff : :01:00.1
   e026-e0263fff : ICH HD audio
f1010680-f10106cf : spi@10680
f1011000-f101101f : i2c@11000
f1011100-f10f : i2c@11100
f1012000-f101201f : serial
f1012100-f101211f : serial
f1018000-f101801f : pinctrl@18000
f1018100-f101813f : gpio
f1018140-f101817f : gpio
f1018454-f1018457 : conf-sdio3
f10184a0-f10184ab : rtc-soc
f1020704-f1020707 : watchdog@20300
f1020800-f102080f : cpurst@20800
f1020a00-f1020ccf : interrupt-controller@20a00
f1021070-f10210c7 : interrupt-controller@20a00
f1022000-f1022fff : pmsu@22000
f103-f1033fff : ethernet@3
f1034000-f1037fff : ethernet@34000
f104-f1041fff : pcie@2,0
f1044000-f1045fff : pcie@3,0
f1058000-f10584ff : usb@58000
f107-f1073fff : ethernet@7
f10a3800-f10a381f : rtc
f10a8000-f10a9fff : sata@a8000
f10d8000-f10d8fff : sdhci
f10e-f10e1fff : sata@e
f10e4074-f10e4077 : thermal@e8078
f10e4078-f10e407b : thermal@e8078
f10f-f10f3fff : usb3@f
f10f8000-f10fbfff : usb3@f8000
f110-f11007ff : f110.sa-sram0
f111-f11107ff : f111.sa-sram

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-02 Thread Luís Mendes
Just a small update regarding what I posted earlier...

I've made additional tests with mesa-17.4 at commit "radv: Implement
binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8 and I was
able to gather a smaller apitrace, about 1GB, of kodi playing a video.
It hangs the GPU almost always when replayed with glretrace without
the option --singlethread. If the option --singlethread is used when
running glretrace, no GPU hang ever occurs, it seems.
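
The comparison between the two replay modes could be scripted roughly as
follows. This is a sketch under assumptions, not the procedure actually used:
the helper names, the trace file name in the usage note and the timeout are
made up, and a replay that does not finish within the timeout is only taken
as a hint of a hang. Only the glretrace command and its --singlethread option
come from the description above.

```python
import subprocess


def glretrace_argv(trace_path, single_thread=False):
    """Build the glretrace command line for one replay."""
    argv = ["glretrace"]
    if single_thread:
        argv.append("--singlethread")
    argv.append(trace_path)
    return argv


def replay_finished(trace_path, single_thread=False, timeout_s=600):
    """Replay a trace once; return False if it did not finish in time."""
    try:
        subprocess.run(
            glretrace_argv(trace_path, single_thread),
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a replay that never returns points to a GPU hang.
        return False
    return True
```

For example, replay_finished("kodi.trace") and
replay_finished("kodi.trace", single_thread=True) could be run a few times
each, with "kodi.trace" standing in for the real trace file.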

For some reason I am now getting past the lightdm login screen without
issues. Maybe some of the suggested changes improved the behaviour
with mesa-17.4; however, with mesa-17.3.1 I didn't have those issues
anyway.

Now both mesa-17.3.1 and mesa-17.4 behave similarly, blocking while
playing video with kodi, but it is also possible to cause the GPU hang
with other applications.
On the other hand, pure OpenGL applications seem to work fine: I am
able to run the glmark2 tests without issues.

How can I send these apitraces?

On Tue, Jan 2, 2018 at 10:29 PM, Luís Mendes  wrote:
> Ok... I've done some of the suggested tests.
>
> I still haven't tested on x86, but I'll get to that.
>
> I've recompiled the kernel to disable Power Management as much as
> possible at all levels, including PCIe. I've also modified
> /include/drm/drm_cache.h so that static inline bool
> drm_arch_can_wc_memory(void) always returns false, but neither
> change solved the issue.
>
> When I run kodi under apitrace with mesa 17.3.1 it becomes much more
> difficult to reproduce the crash, as there are a lot of missed frames due
> to the CPU overload of apitrace, but I was able to crash the GPU
> once. The apitrace log is 2.3GB; how should I send it?
> It happened while playing a VP9 encoded webm video file, which is
> decoded in software, as the RX 460 is unable to hardware decode this codec
> AFAIK. In fact, software decoded videos are more prone to produce the
> GPU hang, while an H265 4K hardware decoded video never causes a GPU
> hang. I'm afraid I forgot to have kodi log the execution data when
> I did the apitrace.
>
>
> The full dmesg is presented below, as well as the /proc/iomem
> information and lspci output.
> I just want to note that I'm having EDID DDC errors with my TV
> screen because, at some point from kernel 4.14 onwards, both the RX460
> and the RX550 cards started to corrupt the TV screen's I2C EDID
> memory, so that I have to reflash the correct EDID data to get the
> screen back to its own configuration. This is a rare problem that only
> occurs with this TV. All other TVs and monitors that I've tested don't
> show this EDID corruption issue. I have currently stopped reflashing
> the I2C EDID configuration memory of my TV to avoid exceeding its
> write cycle endurance; instead I now modify gpu/drm/drm_edid.c
> in function drm_do_get_edid() to allow the corrupted EDID to pass and
> enter X. So please ignore the EDID error warnings in my dmesg log. The
> GPU hangs occur just the same even when I have the correct EDID, as
> it is an unrelated issue.
>
> Regards,
> Luís
>
> iomem shows this:
> -3fff : System RAM
>   8000-00ef : Kernel code
>   0100-010e3913 : Kernel data
> d000-efff : PCI MEM
>   d000-e7ff : PCI Bus :01
> d000-dfff : :01:00.0
> e000-e01f : :01:00.0
> e020-e023 : :01:00.0
> e024-e025 : :01:00.0
> e026-e0263fff : :01:00.1
>   e026-e0263fff : ICH HD audio
> f1010680-f10106cf : spi@10680
> f1011000-f101101f : i2c@11000
> f1011100-f10f : i2c@11100
> f1012000-f101201f : serial
> f1012100-f101211f : serial
> f1018000-f101801f : pinctrl@18000
> f1018100-f101813f : gpio
> f1018140-f101817f : gpio
> f1018454-f1018457 : conf-sdio3
> f10184a0-f10184ab : rtc-soc
> f1020704-f1020707 : watchdog@20300
> f1020800-f102080f : cpurst@20800
> f1020a00-f1020ccf : interrupt-controller@20a00
> f1021070-f10210c7 : interrupt-controller@20a00
> f1022000-f1022fff : pmsu@22000
> f103-f1033fff : ethernet@3
> f1034000-f1037fff : ethernet@34000
> f104-f1041fff : pcie@2,0
> f1044000-f1045fff : pcie@3,0
> f1058000-f10584ff : usb@58000
> f107-f1073fff : ethernet@7
> f10a3800-f10a381f : rtc
> f10a8000-f10a9fff : sata@a8000
> f10d8000-f10d8fff : sdhci
> f10e-f10e1fff : sata@e
> f10e4074-f10e4077 : thermal@e8078
> f10e4078-f10e407b : thermal@e8078
> f10f-f10f3fff : usb3@f
> f10f8000-f10fbfff : usb3@f8000
> f110-f11007ff : f110.sa-sram0
> f111-f11107ff : f111.sa-sram1
> f120-f12f : f120.bm-bppi
>
> lspci is like this:
> 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> [AMD/ATI] Baffin [Radeon RX 460] (rev cf) (prog-if 00)
> Subsystem: PC Partner Limited / Sapphire Technology Baffin
> [Radeon RX 460]
> Flags: bus master, fast devsel, latency 0, IRQ 57
> Memory at d000 (64-bit, prefetchable) [size=256M]
> Memory at e000 (64-bit, prefetchable) [siz

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-02 Thread Luís Mendes
Ok... I've done some of the suggested tests.

I still haven't tested on x86, but I'll get to that.

I've recompiled the kernel with Power Management disabled as much as
possible at all levels, including PCIe. I've also modified
drm_arch_can_wc_memory() in include/drm/drm_cache.h to always return
false. Neither change solved the issue.

When I run kodi under apitrace with mesa 17.3.1 it becomes much more
difficult to reproduce the crash. There are a lot of missed frames due
to the CPU overhead of apitrace, but I was able to crash the GPU
once. The apitrace log is 2.3GB; how should I send it?
It happened while playing a VP9-encoded webm video file, which is
decoded in software, as the RX 460 is unable to hardware-decode this
codec AFAIK. In fact, software-decoded videos are more prone to
producing the GPU hang, while an H265 4K hardware-decoded video never
causes a GPU hang. I'm afraid I forgot to have kodi log the execution
data when I did the apitrace.


The full dmesg is presented below, as well as the /proc/iomem
information and lspci output.
I just want to note that I'm having EDID DDC errors with my TV
screen because, at some point from kernel 4.14 onwards, both the RX460
and the RX550 cards started to corrupt the TV's I2C EDID memory, so I
have to reflash the correct EDID data to get the screen back to its
own configuration. This is a rare problem that only occurs with this
TV. All other TVs and monitors that I've tested don't show this EDID
corruption issue. I have stopped reflashing the I2C EDID memory of my
TV to avoid exceeding its write-cycle endurance; instead, I now modify
drm_do_get_edid() in gpu/drm/drm_edid.c to let the corrupted EDID pass
and enter X. So please ignore the EDID error warnings in my dmesg log.
The GPU hangs occur just the same even when I have the correct EDID,
so it is an unrelated issue.
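
For anyone wanting to check a dumped EDID before reflashing, here is a minimal sketch in Python. It assumes the raw EDID has already been read into memory by some means, and it looks only at the 128-byte base block. It checks the two properties that are generic to EDID: the fixed 8-byte header, and the rule that all 128 bytes sum to 0 modulo 256. The function names and the synthetic test block are made up for illustration; this is not the kernel's validation code.

```python
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])
EDID_BLOCK_SIZE = 128


def edid_block_problems(block):
    """Return a list of problems found in one 128-byte base EDID block."""
    problems = []
    if len(block) != EDID_BLOCK_SIZE:
        problems.append(
            "block is %d bytes, expected %d" % (len(block), EDID_BLOCK_SIZE)
        )
        return problems
    if block[:8] != EDID_HEADER:
        problems.append("bad header")
    # All 128 bytes, including the final checksum byte, must sum to 0 mod 256.
    if sum(block) % 256 != 0:
        problems.append("bad checksum")
    return problems


def make_sample_block():
    """Build a synthetic block (header + zero payload + checksum) for testing.

    This is NOT a real monitor's EDID; it only satisfies the two checks above.
    """
    body = EDID_HEADER + bytes(EDID_BLOCK_SIZE - len(EDID_HEADER) - 1)
    checksum = (-sum(body)) % 256
    return body + bytes([checksum])
```

An empty list means the block passes both checks; a single flipped bit anywhere in the block shows up as a checksum failure.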

Regards,
Luís

iomem shows this:
-3fff : System RAM
  8000-00ef : Kernel code
  0100-010e3913 : Kernel data
d000-efff : PCI MEM
  d000-e7ff : PCI Bus :01
d000-dfff : :01:00.0
e000-e01f : :01:00.0
e020-e023 : :01:00.0
e024-e025 : :01:00.0
e026-e0263fff : :01:00.1
  e026-e0263fff : ICH HD audio
f1010680-f10106cf : spi@10680
f1011000-f101101f : i2c@11000
f1011100-f10f : i2c@11100
f1012000-f101201f : serial
f1012100-f101211f : serial
f1018000-f101801f : pinctrl@18000
f1018100-f101813f : gpio
f1018140-f101817f : gpio
f1018454-f1018457 : conf-sdio3
f10184a0-f10184ab : rtc-soc
f1020704-f1020707 : watchdog@20300
f1020800-f102080f : cpurst@20800
f1020a00-f1020ccf : interrupt-controller@20a00
f1021070-f10210c7 : interrupt-controller@20a00
f1022000-f1022fff : pmsu@22000
f103-f1033fff : ethernet@3
f1034000-f1037fff : ethernet@34000
f104-f1041fff : pcie@2,0
f1044000-f1045fff : pcie@3,0
f1058000-f10584ff : usb@58000
f107-f1073fff : ethernet@7
f10a3800-f10a381f : rtc
f10a8000-f10a9fff : sata@a8000
f10d8000-f10d8fff : sdhci
f10e-f10e1fff : sata@e
f10e4074-f10e4077 : thermal@e8078
f10e4078-f10e407b : thermal@e8078
f10f-f10f3fff : usb3@f
f10f8000-f10fbfff : usb3@f8000
f110-f11007ff : f110.sa-sram0
f111-f11107ff : f111.sa-sram1
f120-f12f : f120.bm-bppi

lspci is like this:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460] (rev cf) (prog-if 00)
    Subsystem: PC Partner Limited / Sapphire Technology Baffin [Radeon RX 460]
    Flags: bus master, fast devsel, latency 0, IRQ 57
    Memory at d000 (64-bit, prefetchable) [size=256M]
    Memory at e000 (64-bit, prefetchable) [size=2M]
    I/O ports at 1 [size=256]
    Memory at e020 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at e024 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Legacy Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [200] #15
    Capabilities: [270] #19
    Capabilities: [2b0] Address Translation Service (ATS)
    Capabilities: [2c0] Page Request Interface (PRI)
    Capabilities: [2d0] Process Address Space ID (PASID)
    Capabilities: [320] Latency Tolerance Reporting
    Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [370] L1 PM Substates
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0
    Subsystem: PC Partner Limited / Sapphire Technology Device aae0
    Flags: bus master, fast devsel, lat

Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-02 Thread Christian König

> when you refer to API traces, you're suggesting to strace
> kodi, or what do you mean?
What I meant was apitrace (https://github.com/apitrace/apitrace), but
if even the lightdm login screen crashes then this won't be of much help.


That strongly sounds like an ARM-specific problem; maybe USWC doesn't
work as it should? See the function drm_arch_can_wc_memory() in the kernel
source and check whether it helps to always return false.


Apart from that, the only other explanation I have is that some system
memory isn't accessible to the GPU while other memory works fine.


Please provide the output of "sudo cat /proc/iomem" to double check that.
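
To make that output easier to compare across machines, a rough Python sketch of a parser follows. It assumes the usual /proc/iomem layout of `start-end : name` with hexadecimal addresses and two spaces of indentation per nesting level. The helper names and the sample text are hypothetical; the addresses are invented for illustration and are not taken from the board discussed here.

```python
import re

# One /proc/iomem line: optional indentation, then "start-end : name" (hex).
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def parse_iomem(text):
    """Parse /proc/iomem text into (depth, start, end, name) tuples."""
    entries = []
    for line in text.splitlines():
        match = IOMEM_LINE.match(line)
        if not match:
            continue  # skip blank or malformed lines
        indent, start, end, name = match.groups()
        # Nested resources are assumed to be indented two spaces per level.
        depth = len(indent) // 2
        entries.append((depth, int(start, 16), int(end, 16), name.strip()))
    return entries


def total_size(entries, name):
    """Sum the sizes in bytes of all top-level ranges with the given name."""
    return sum(
        end - start + 1
        for depth, start, end, entry_name in entries
        if depth == 0 and entry_name == name
    )


# Invented sample input, for illustration only.
SAMPLE_IOMEM = """\
00000000-0fffffff : System RAM
  00008000-000fffff : Kernel code
40000000-4fffffff : PCI MEM
  40000000-47ffffff : example device
"""
```

With the real file, `total_size(parse_iomem(open("/proc/iomem").read()), "System RAM")` would give the amount of RAM the kernel has mapped, which can then be compared against the ranges the GPU is expected to reach.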

Regards,
Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-02 Thread Luís Mendes
Dear Mr. David, Mr. Christian,

First of all, thanks for your replies!

David, I will try the same software versions on x86 to see if I am
able to replicate the problem there, but I suspect it is ARM-specific...
I'll report back when I have more details.

Christian, I'll collect the data you've referred to and will disable
power management. Regarding the mesa master version, I've tried it,
and the problem just gets worse. With the latest mesa, it easily locks up
at the lightdm login screen, when navigating through the Ubuntu Mate
menus, or with Kodi. I've tested with mesa commit "radv: Implement
binning on GFX9" - 6a36bfc64d2096aa338958c4605f5fc6372c07b8. Just one
question... when you refer to API traces, you're suggesting to strace
kodi, or what do you mean?

Regards,
Luís



Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-02 Thread Christian König

Hi Luis,

Well, first of all, that isn't a deadlock, just a hardware lockup. So
the traces and logs you tried to attach are not really useful.


What we need instead is a full dmesg and/or some kodi logs, an API trace,
etc., showing what exactly is happening when the problem occurs.


Please also try Mesa master and try to disable power management as much 
as possible.


Regards,
Christian.


Re: Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

2018-01-01 Thread Chunming Zhou

Did you try it on an x86 board? Is there the same issue?

We should identify whether it is an ARM-specific or a general issue for
the amdgpu driver.


Thanks,

David Zhou


On 2018-01-02 00:32, Luís Mendes wrote:

I am currently testing the amdgpu driver with AMD RX460 and RX550
graphics cards on an ARM Cortex-A9 with 1GB RAM and I am consistently
getting deadlocks when playing videos with Kodi or other applications.

I'm using the Linux kernel from
https://cgit.freedesktop.org/~agd5f/linux/, branch drm-next-4.16 at
commit "drm/amdgpu: Correct the IB size of bo update mapping" -
104bd2ca1124dfd9aa904d5f5a96253ef2b580f6, along with libdrm-2.4.89 and
mesa-17.3.1 on Ubuntu 17.10 with the Mate desktop and the Lightdm
session manager over X11.


I am consistently getting deadlocks, which are sometimes almost
immediate but sometimes take about half an hour to occur. Some of the
video files that I am using for testing are more likely to cause a
deadlock than others.

I got some kernel crash dumps, kodi process backtraces for the
offending thread and the deadlocked process tree listing, which I
attach here. The kernel seems to deadlock during a page flip,
waiting indefinitely for the DMA fence to complete; however, it
doesn't, and the timeout doesn't expire either... as such this may be a
GPU lockup.

I can provide more details, if needed, if there is interest or time to
look into this.

Regards,
Luís Mendes
Software and Hardware engineer

[  253.904103] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=43831, last emitted seq=43833
[  253.915041] [drm] IP block:gmc_v8_0 is hung!
[  253.915047] [drm] IP block:gfx_v8_0 is hung!
[  253.915162] [drm] GPU recovery disabled.
[  366.541614] INFO: task kworker/u4:4:90 blocked for more than 120 seconds.
[  366.548436]   Not tainted 4.15.0-rc4-drmnext2g #1
[  366.554300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  366.562162] kworker/u4:4    D    0    90  2 0x
[  366.562196] Workqueue: events_unbound commit_work [drm_kms_helper]
[  366.562215] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[  366.562223] [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
[  366.562233] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
[  366.562241] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
[  366.562248] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[  366.562476] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f2d3988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[  366.562754] [<7f2d3988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f2d509c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
[  366.562908] [<7f2d509c>] (amdgpu_dm_atomic_commit_tail [amdgpu]) from [<7f13e58c>] (commit_tail+0x50/0x94 [drm_kms_helper])
[  366.562931] [<7f13e58c>] (commit_tail [drm_kms_helper]) from [<7f13e5ec>] (commit_work+0x1c/0x20 [drm_kms_helper])
[  366.562948] [<7f13e5ec>] (commit_work [drm_kms_helper]) from [<8016f4c8>] (process_one_work+0x1a8/0x4ac)
[  366.562955] [<8016f4c8>] (process_one_work) from [<8017050c>] (worker_thread+0x68/0x598)
[  366.562962] [<8017050c>] (worker_thread) from [<80175e50>] (kthread+0x16c/0x174)
[  366.562970] [<80175e50>] (kthread) from [<80109de8>] (ret_from_fork+0x14/0x2c)
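
The ring timeout message at the top of this log carries two sequence numbers. The gap between them is, as far as I understand it, the number of submissions that were emitted to the ring but never signaled as complete. Below is a small Python sketch that pulls those numbers out of a pasted log. The regular expression is written against the exact wording of the message in this report and is only an assumption for other kernel versions; the function name is made up for illustration.

```python
import re

# Matches the amdgpu job-timeout message as worded in this report; other
# kernel versions may phrase it differently.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+)\s+timeout, last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def pending_fences(log_text):
    """Return (ring, signaled, emitted, pending) for each timeout message."""
    # Mail clients wrap long lines, so collapse all whitespace runs first.
    flat = " ".join(log_text.split())
    results = []
    for m in RING_TIMEOUT.finditer(flat):
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        results.append((m.group("ring"), signaled, emitted, emitted - signaled))
    return results


# The message from this report, wrapped the way it arrived by mail.
SAMPLE_LOG = (
    "[  253.904103] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx\n"
    "timeout, last signaled seq=43831, last emitted seq=43833\n"
)
```

For the log above this reports the gfx ring with two submissions outstanding (43833 emitted, 43831 signaled).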


From the userland side:
(gdb) info thread
   Id   Target Id         Frame
* 1    Thread 0x6eb17c70 (LWP 2071) "kodi.bin" 0x748b2246 in ioctl () at ../sysdeps/unix/syscall-template.S:84
   2    Thread 0x6eb14170 (LWP 2072) "Announce" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   3    Thread 0x6e1ff170 (LWP 2075) "ActiveAE" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   4    Thread 0x6d9ff170 (LWP 2076) "AESink" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   5    Thread 0x6b7c9170 (LWP 2081) "amdgpu_cs:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   6    Thread 0x6ae3c170 (LWP 2082) "disk_cache:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   7    Thread 0x571df170 (LWP 2083) "si_shader:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   8    Thread 0x569df170 (LWP 2084) "si_shader_low:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   9    Thread 0x561df170 (LWP 2085) "gallium_drv:0" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   10   Thread 0x551f6170 (LWP 2086) "kodi.bin" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
   11   Thread 0x549f6170 (LWP 2087) "PeripBusUSBUdev" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
---Type <return> to continue, or q <return> to quit---
   12   Thread 0x541f6170 (LWP