Reworking of GPU reset logic + dumping

2012-04-25 Thread j.gli...@gmail.com
Patches also available at:
http://people.freedesktop.org/~glisse/debug/

So it's the Christian series minus all the debugfs patches related to
ring/ib/mc. The last patch adds a new blob dumping facility that dumps
everything (pm4, relocs table, bo content). It's just a proof of concept
to show what I meant, because code speaks more clearly on this kind of
topic.

The blob format we dump could be different; I went with a simple
binary dword format:
type, id, size, [data (present if size > 0)]
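A minimal user-space sketch of emitting and parsing such a record stream, assuming only the dword layout above (the function names and any concrete type/id values are hypothetical, not from the patch):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One record: three header dwords, then `size` dwords of payload. */
struct blob_rec {
    uint32_t type;  /* e.g. pm4, relocs, bo content -- values hypothetical */
    uint32_t id;
    uint32_t size;  /* payload length in dwords; 0 means no data follows */
};

/* Append one record to `out`; returns the number of dwords written. */
static size_t blob_emit(uint32_t *out, uint32_t type, uint32_t id,
                        const uint32_t *data, uint32_t size)
{
    out[0] = type;
    out[1] = id;
    out[2] = size;
    if (size)
        memcpy(&out[3], data, size * sizeof(uint32_t));
    return 3 + size;
}

/* Read back the record at `in`; returns the number of dwords consumed. */
static size_t blob_parse(const uint32_t *in, struct blob_rec *rec,
                         const uint32_t **data)
{
    rec->type = in[0];
    rec->id = in[1];
    rec->size = in[2];
    *data = rec->size ? &in[3] : NULL;
    return 3 + rec->size;
}
```

A reader just loops `blob_parse` until it runs out of dwords; records with `size == 0` carry no payload.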

Note that the benefit (simpler code, less code) of dropping the current
debugfs files seems to me greater than their usefulness.

Cheers,
Jerome



Reworking of GPU reset logic

2012-04-25 Thread Christian König
On 21.04.2012 16:14, Jerome Glisse wrote:
> 2012/4/21 Christian König:
>> On 20.04.2012 01:47, Jerome Glisse wrote:
>>> 2012/4/19 Christian König:
 This includes mostly fixes for multi ring lockups and GPU resets, but it
 should generally improve the behavior of the kernel mode driver in case
 something goes badly wrong.

 On the other hand it completely rewrites the IB pool and semaphore
 handling, so I think there are still a couple of problems in it.

 The first four patches were already sent to the list, but the current set
 depends on them, so I'm resending them.

 Cheers,
 Christian.
>>> I did a quick review, it looks mostly good, but as it's sensitive code
>>> I would like to spend some time on it, probably next week. Note that I
>>> had some work on this area too; I mostly want to drop all the debugfs
>>> files related to this and add some new, more useful ones (basically
>>> something that allows you to read all the data needed to replay a
>>> locking-up IB). I also was looking into Dave's reset thread, and your
>>> solution of moving reset into the ioctl return path sounds good too,
>>> but I need to convince myself that it encompasses all possible cases.
>>>
>>> Cheers,
>>> Jerome
>>>
>> After sleeping a night over it I already reworked the patch for improving
>> the SA performance, so please wait at least for v2 before taking a look at
>> it :)
>>
>> Regarding the debugging of lockups I had the following on my "in mind todo"
>> list:
>> 1. Rework the chip specific lockup detection code a bit more and probably
>> clean it up a bit.
>> 2. Make the timeout a module parameter, because compute tasks sometimes
>> block a ring for more than 10 seconds.
>> 3. Keep track of the actual RPTR offset a fence is emitted to.
>> 4. Keep track of all the BOs an IB is touching.
>> 5. Now if a lockup happens, start with the last successfully signaled fence
>> and dump the ring content after that RPTR offset up to the first unsignaled
>> fence.
>> 6. Then, if this fence references an IB, dump its content and the BOs it
>> is touching.
>> 7. Dump everything on the ring after that fence until you reach the RPTR of
>> the next fence or the WPTR of the ring.
>> 8. If there is a next fence, repeat the whole thing at number 6.
>>
>> If I'm not completely wrong that should give you practically all the
>> information available, and we probably should put that behind another module
>> option, because we are going to spam syslog pretty badly here. Feel free to
>> add/modify the ideas on this list.
>>
>> Christian.
> What I have is similar; I am assuming only IBs trigger lockups. Before each
> IB, emit to a scratch reg the IB offset in the SA and the IB size. For each
> IB keep the BO list. On lockup, allocate a big chunk of memory and copy the
> whole IB and all the BOs referenced by the IB (I am using my bof format as
> I already have userspace tools).
>
> Remove all the debugfs files. Just add a new one that gives you the first
> faulty IB. On read of this file the kernel frees the memory. The kernel
> should also free the memory after a while, or, better, enable the lockup
> copy only if some radeon kernel option is enabled.

Just resent my current patchset to the mailing list; it's not as
complete as your solution, but it seems to be a step in the right
direction. So please take a look at them.

Being able to generate something like a "GPU crash dump" on lockup
sounds very valuable to me, but I'm not sure if debugfs files are the
right direction to go. Maybe something more like a module parameter
containing a directory, and if set we dump all the available
information (including bo content) in binary form (instead of the
current human-readable form of the debugfs files).

Anyway, the just-sent patchset solves the problem I'm currently looking
into, and I'm running a bit out of time (again), so I don't know if I
can complete that solution.

Cheers,
Christian.
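The dump walk in the quoted list (start at the last signaled fence's RPTR offset and dump segment by segment up to the ring's WPTR) could look roughly like this. All names and the mocked fence bookkeeping are hypothetical; the real code would live in the radeon kernel driver:

```c
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 16
#define RING_MASK (RING_SIZE - 1)

/* Mocked fence bookkeeping: where on the ring each fence was emitted,
 * and whether it has already signaled. */
struct fence_info {
    uint32_t rptr_offset;  /* ring offset the fence was emitted at */
    int signaled;
};

/* Dump ring dwords in (start, end], wrapping around the ring.
 * Returns the number of dwords dumped so the walk can be checked. */
static unsigned dump_range(const uint32_t *ring, uint32_t start, uint32_t end)
{
    unsigned n = 0;
    if (start == end)
        return 0;
    for (uint32_t i = (start + 1) & RING_MASK; ; i = (i + 1) & RING_MASK) {
        printf("ring[%02u] = 0x%08x\n", i, ring[i]);
        n++;
        if (i == end)
            break;
    }
    return n;
}

/* Walk from the last signaled fence to the WPTR, dumping the ring
 * segment belonging to each unsignaled fence. */
static unsigned dump_after_lockup(const uint32_t *ring,
                                  const struct fence_info *fences,
                                  unsigned nfences, uint32_t wptr)
{
    unsigned i = 0, dumped = 0;

    /* Find the last successfully signaled fence. */
    while (i + 1 < nfences && fences[i + 1].signaled)
        i++;

    uint32_t pos = fences[i].rptr_offset;
    for (unsigned f = i + 1; f < nfences; f++) {
        dumped += dump_range(ring, pos, fences[f].rptr_offset);
        pos = fences[f].rptr_offset;
        /* A real driver would also dump the IB + BOs for fences[f] here. */
    }
    dumped += dump_range(ring, pos, wptr); /* tail up to the WPTR */
    return dumped;
}
```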


Reworking of GPU reset logic

2012-04-25 Thread Christian König
On 23.04.2012 09:40, Michel Dänzer wrote:
> On Sat, 2012-04-21 at 11:42 +0200, Christian König wrote:
>> Regarding the debugging of lockups I had the following on my "in mind
>> todo" list:
>> 1. Rework the chip specific lockup detection code a bit more and
>> probably clean it up a bit.
>> 2. Make the timeout a module parameter, cause compute task sometimes
>> block a ring for more than 10 seconds.
> A better solution for that would be to improve the detection of the GPU
> making progress, also for graphics operations. We should try to reduce
> the timeout rather than making it even larger.

Well, let's call it a more complete solution.

Making the parameter configurable doesn't necessarily mean we are going
to increase it. I usually set it to zero now, since that disables
lockup detection entirely and lets me dig into the reason for something
getting stuck.

Christian.
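The timeout-as-module-parameter idea, with zero disabling detection entirely as described above, can be sketched in isolation. `lockup_timeout_ms` and `ring_is_locked_up` are hypothetical names; in the kernel the variable would be exposed via `module_param`:

```c
#include <stdbool.h>
#include <stdint.h>

/* In the kernel this would be a hypothetical radeon option, e.g.
 *   module_param(lockup_timeout, int, 0444);
 * A value of 0 disables lockup detection entirely. */
static unsigned int lockup_timeout_ms = 10000;

/* Returns true if the ring should be considered locked up: no forward
 * progress for longer than the configured timeout. */
static bool ring_is_locked_up(uint64_t last_progress_ms, uint64_t now_ms)
{
    if (lockup_timeout_ms == 0)
        return false; /* detection disabled, e.g. for long compute jobs */
    return now_ms - last_progress_ms > lockup_timeout_ms;
}
```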


Reworking of GPU reset logic

2012-04-25 Thread Christian König
Second round of the patchset.

Thanks for all the comments and/or bug reports; a lot of patches are now v2/v3
and should get another look. Every regression known so far should be fixed
now. In addition to the patches that were already included in the last set,
there are 8 new ones, which are also reset, lockup and debugging related.

As always comments and bug-reports are very welcome,
Christian.



Reworking of GPU reset logic

2012-04-25 Thread Dave Airlie
2012/4/25 Christian König :
> [...]
> Being able to generate something like a "GPU crash dump" on lockup
> sounds very valuable to me, but I'm not sure if debugfs files are the
> right direction to go. Maybe something more like a module parameter
> containing a directory, and if set we dump all the available information
> (including bo content) in binary form (instead of the current
> human-readable form of the debugfs files).

Do what the intel driver does: create a versioned binary debugfs file
with all the error state in it for a lockup, store only one of these at
a time, and run a userspace tool to dump it out into something you can
upload, or just cat the file and upload it.

You don't want the kernel writing to dirs on disk under any
circumstances.

Dave.
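The i915-style scheme Dave describes (one retained, versioned binary error state that a userspace tool decodes) could be sketched like this; the header layout, magic value, and all names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

/* Versioned header so the userspace decoder can reject dumps it does
 * not understand (layout is hypothetical). */
struct error_state_hdr {
    uint32_t magic;    /* identifies the blob */
    uint32_t version;  /* bumped whenever the layout changes */
    uint32_t payload;  /* payload size in bytes */
};

#define ERR_MAGIC   0x52414445u /* hypothetical */
#define ERR_VERSION 1u

/* Serialize one error state into `buf`; only one dump is kept at a
 * time, so a new lockup simply overwrites the previous blob. */
static size_t error_state_write(uint8_t *buf, const void *payload, uint32_t len)
{
    struct error_state_hdr hdr = { ERR_MAGIC, ERR_VERSION, len };
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), payload, len);
    return sizeof(hdr) + len;
}

/* Returns payload length, or -1 if the blob is not one we can decode. */
static long error_state_read(const uint8_t *buf, const void **payload)
{
    struct error_state_hdr hdr;
    memcpy(&hdr, buf, sizeof(hdr));
    if (hdr.magic != ERR_MAGIC || hdr.version != ERR_VERSION)
        return -1;
    *payload = buf + sizeof(hdr);
    return hdr.payload;
}
```

In the kernel, the write side would fill the blob on lockup and expose it as a single debugfs file; `cat` of that file is then enough to collect a report.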


Reworking of GPU reset logic

2012-04-25 Thread Jerome Glisse
On Wed, Apr 25, 2012 at 9:46 AM, Alex Deucher  wrote:
> 2012/4/25 Dave Airlie :
>> [...]
>>
>> Do what the intel driver does: create a versioned binary debugfs file
>> with all the error state in it for a lockup, store only one of these at
>> a time, and run a userspace tool to dump it out into something you can
>> upload, or just cat the file and upload it.
>>
>> You don't want the kernel writing to dirs on disk under any circumstances.
>>
>
> We have an internal binary format for dumping command streams and
> associated buffers; we should probably use that so that we can better
> take advantage of existing internal tools.
>
> Alex
>

I really would like to drop all the debugfs files related to ib/ring with this
patchset. Note that I also have a binary format to replay command streams,
the blob format. It has all the information needed to replay on the
open driver and tools are 

Reworking of GPU reset logic

2012-04-25 Thread Alex Deucher
2012/4/25 Dave Airlie :
> 2012/4/25 Christian König :
>> [...]
>
> Do what the intel driver does: create a versioned binary debugfs file
> with all the error state in it for a lockup, store only one of these at
> a time, and run a userspace tool to dump it out into something you can
> upload, or just cat the file and upload it.
>
> You don't want the kernel writing to dirs on disk under any circumstances.
>

We have an internal binary format for dumping command streams and
associated buffers; we should probably use that so that we can better
take advantage of existing internal tools.

Alex

> Dave.
> ___
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Reworking of GPU reset logic

2012-04-25 Thread Christian König
Second round of patchset.

Thanks for all the comments and/or bug reports, allot of patches are now v2/v3 
and should get another look. Every regression known so far should be fixed with 
them now.
Additionally to the patches that where already included in the last set there 
are 8 new ones which are also reset, lockup and debugging related.

As always comments and bug-reports are very welcome,
Christian.

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Reworking of GPU reset logic

2012-04-25 Thread Christian König

On 23.04.2012 09:40, Michel Dänzer wrote:

On Sam, 2012-04-21 at 11:42 +0200, Christian König wrote:

Regarding the debugging of lockups I had the following on my in mind
todo list:
1. Rework the chip specific lockup detection code a bit more and
probably clean it up a bit.
2. Make the timeout a module parameter, cause compute task sometimes
block a ring for more than 10 seconds.

A better solution for that would be to improve the detection of the GPU
making progress, also for graphics operations. We should try to reduce
the timeout rather than making it even larger.


Well, let's call it a more complete solution.

Since making the parameter configurable don't necessary means we are 
going to increase it. I usually set it to zero now, since that disables 
lockup detection at all and enables me to dig into the reason for 
something getting stuck.


Christian.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Reworking of GPU reset logic

2012-04-25 Thread Christian König

On 21.04.2012 16:14, Jerome Glisse wrote:

2012/4/21 Christian Königdeathsim...@vodafone.de:

On 20.04.2012 01:47, Jerome Glisse wrote:

2012/4/19 Christian Königdeathsim...@vodafone.de:

This includes mostly fixes for multi ring lockups and GPU resets, but it
should general improve the behavior of the kernel mode driver in case
something goes badly wrong.

On the other hand it completely rewrites the IB pool and semaphore
handling, so I think there are still a couple of problems in it.

The first four patches were already send to the list, but the current set
depends on them so I resend them again.

Cheers,
Christian.

I did a quick review, it looks mostly good, but as it's sensitive code
i would like to spend sometime on
it. Probably next week. Note that i had some work on this area too, i
mostly want to drop all the debugfs
related to this and add some new more usefull (basicly something that
allow you to read all the data
needed to replay a locking up ib). I also was looking into Dave reset
thread and your solution of moving
reset in ioctl return path sounds good too but i need to convince my
self that it encompass all possible
case.

Cheers,
Jerome


After sleeping a night over it I already reworked the patch for improving
the SA performance, so please wait at least for v2 before taking a look at
it :)

Regarding the debugging of lockups I had the following on my in mind todo
list:
1. Rework the chip specific lockup detection code a bit more and probably
clean it up a bit.
2. Make the timeout a module parameter, cause compute task sometimes block a
ring for more than 10 seconds.
3. Keep track of the actually RPTR offset a fence is emitted to
3. Keep track of all the BOs a IB is touching.
4. Now if a lockup happens start with the last successfully signaled fence
and dump the ring content after that RPTR offset till the first not signaled
fence.
5. Then if this fence references to an IB dump it's content and the BOs it
is touching.
6. Dump everything on the ring after that fence until you reach the RPTR of
the next fence or the WPTR of the ring.
7. If there is a next fence repeat the whole thing at number 5.

If I'm not completely wrong that should give you practically every
information available, and we probably should put that behind another module
option, cause we are going to spam syslog pretty much here. Feel free to
add/modify the ideas on this list.

Christian.

What i have is similar, i am assuming only ib trigger lockup, before each ib
emit to scratch reg ib offset in sa and ib size. For each ib keep bo list. On
lockup allocate big memory to copy the whole ib and all the bo referenced
by the ib (i am using my bof format as i already have userspace tools).

Remove all the debugfs file. Just add a new one that gave you the first faulty
ib. On read of this file kernel free the memory. Kernel should also free the
memory after a while or better would be to enable the lockup copy only if
some kernel radeon option is enabled.


Just resent my current patchset to the mailing list, it's not as 
complete as your solution, but seems to be a step into the right 
direction. So please take a look at them.


Being able to generate something like a GPU crash dump on lockup 
sounds like something very valuable to me, but I'm not sure if debugfs 
files are the right direction to go. Maybe something more like a module 
parameter containing a directory, and if set we dump all informations 
(including bo content) available in binary form (instead of the current 
human readable form of the debugfs files).


Anyway, the just send patchset solves the problem I'm currently looking 
into, and I'm running a bit out of time (again). So I don't know if I 
can complete that solution


Cheers,
Christian.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Reworking of GPU reset logic

2012-04-25 Thread Dave Airlie
2012/4/25 Christian König deathsim...@vodafone.de:
 On 21.04.2012 16:14, Jerome Glisse wrote:

 2012/4/21 Christian Königdeathsim...@vodafone.de:

 On 20.04.2012 01:47, Jerome Glisse wrote:

 2012/4/19 Christian Königdeathsim...@vodafone.de:

 This includes mostly fixes for multi ring lockups and GPU resets, but
 it
 should general improve the behavior of the kernel mode driver in case
 something goes badly wrong.

 On the other hand it completely rewrites the IB pool and semaphore
 handling, so I think there are still a couple of problems in it.

 The first four patches were already send to the list, but the current
 set
 depends on them so I resend them again.

 Cheers,
 Christian.

 I did a quick review, it looks mostly good, but as it's sensitive code
 i would like to spend sometime on
 it. Probably next week. Note that i had some work on this area too, i
 mostly want to drop all the debugfs
 related to this and add some new more usefull (basicly something that
 allow you to read all the data
 needed to replay a locking up ib). I also was looking into Dave reset
 thread and your solution of moving
 reset in ioctl return path sounds good too but i need to convince my
 self that it encompass all possible
 case.

 Cheers,
 Jerome

 After sleeping a night over it I already reworked the patch for improving
 the SA performance, so please wait at least for v2 before taking a look
 at
 it :)

 Regarding the debugging of lockups I had the following on my in mind
 todo
 list:
 1. Rework the chip specific lockup detection code a bit more and probably
 clean it up a bit.
 2. Make the timeout a module parameter, cause compute task sometimes
 block a
 ring for more than 10 seconds.
 3. Keep track of the actually RPTR offset a fence is emitted to
 3. Keep track of all the BOs a IB is touching.
 4. Now if a lockup happens start with the last successfully signaled
 fence
 and dump the ring content after that RPTR offset till the first not
 signaled
 fence.
 5. Then if this fence references to an IB dump it's content and the BOs
 it
 is touching.
 6. Dump everything on the ring after that fence until you reach the RPTR
 of
 the next fence or the WPTR of the ring.
 7. If there is a next fence repeat the whole thing at number 5.

 If I'm not completely wrong that should give you practically every
 information available, and we probably should put that behind another
 module
 option, cause we are going to spam syslog pretty much here. Feel free to
 add/modify the ideas on this list.

 Christian.

 What i have is similar, i am assuming only ib trigger lockup, before each
 ib
 emit to scratch reg ib offset in sa and ib size. For each ib keep bo list.
 On
 lockup allocate big memory to copy the whole ib and all the bo referenced
 by the ib (i am using my bof format as i already have userspace tools).

 Remove all the debugfs file. Just add a new one that gave you the first
 faulty
 ib. On read of this file kernel free the memory. Kernel should also free
 the
 memory after a while or better would be to enable the lockup copy only if
 some kernel radeon option is enabled.


 Just resent my current patchset to the mailing list, it's not as complete as
 your solution, but seems to be a step into the right direction. So please
 take a look at them.

 Being able to generate something like a GPU crash dump on lockup sounds
 like something very valuable to me, but I'm not sure if debugfs files are
 the right direction to go. Maybe something more like a module parameter
 containing a directory, and if set we dump all informations (including bo
 content) available in binary form (instead of the current human readable
 form of the debugfs files).

Do what intel driver does, create a versioned binary debugfs file with
all the error state in it for a lockup,
store only one of these at a time, run a userspace tool to dump it out
into something you can
upload or just cat the file and upload it.

You don't want the kernel writing to dirs on disk under any circumstances

Dave.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Reworking of GPU reset logic

2012-04-25 Thread Alex Deucher
2012/4/25 Dave Airlie airl...@gmail.com:
 2012/4/25 Christian König deathsim...@vodafone.de:
 On 21.04.2012 16:14, Jerome Glisse wrote:

 2012/4/21 Christian Königdeathsim...@vodafone.de:

 On 20.04.2012 01:47, Jerome Glisse wrote:

 2012/4/19 Christian Königdeathsim...@vodafone.de:

 This includes mostly fixes for multi ring lockups and GPU resets, but
 it
 should general improve the behavior of the kernel mode driver in case
 something goes badly wrong.

 On the other hand it completely rewrites the IB pool and semaphore
 handling, so I think there are still a couple of problems in it.

 The first four patches were already send to the list, but the current
 set
 depends on them so I resend them again.

 Cheers,
 Christian.

 I did a quick review, it looks mostly good, but as it's sensitive code
 i would like to spend sometime on
 it. Probably next week. Note that i had some work on this area too, i
 mostly want to drop all the debugfs
 related to this and add some new more usefull (basicly something that
 allow you to read all the data
 needed to replay a locking up ib). I also was looking into Dave reset
 thread and your solution of moving
 reset in ioctl return path sounds good too but i need to convince my
 self that it encompass all possible
 case.

 Cheers,
 Jerome

 After sleeping a night over it I already reworked the patch for improving
 the SA performance, so please wait at least for v2 before taking a look
 at
 it :)

 Regarding the debugging of lockups I had the following on my in mind
 todo
 list:
 1. Rework the chip specific lockup detection code a bit more and probably
 clean it up a bit.
 2. Make the timeout a module parameter, cause compute task sometimes
 block a
 ring for more than 10 seconds.
 3. Keep track of the actually RPTR offset a fence is emitted to
 3. Keep track of all the BOs a IB is touching.
 4. Now if a lockup happens start with the last successfully signaled
 fence
 and dump the ring content after that RPTR offset till the first not
 signaled
 fence.
 5. Then if this fence references to an IB dump it's content and the BOs
 it
 is touching.
 6. Dump everything on the ring after that fence until you reach the RPTR
 of
 the next fence or the WPTR of the ring.
 7. If there is a next fence repeat the whole thing at number 5.

 If I'm not completely wrong that should give you practically every
 information available, and we probably should put that behind another
 module
 option, cause we are going to spam syslog pretty much here. Feel free to
 add/modify the ideas on this list.

 Christian.

 What I have is similar. I am assuming only IBs trigger lockups; before each
 IB, I emit the IB offset in the SA and the IB size to a scratch reg. For
 each IB I keep the BO list. On lockup I allocate a big chunk of memory to
 copy the whole IB and all the BOs referenced by the IB (I am using my bof
 format, as I already have userspace tools).

 Remove all the debugfs files. Just add a new one that gives you the first
 faulty IB. On read of this file the kernel frees the memory. The kernel
 should also free the memory after a while, or better, enable the lockup
 copy only if some kernel radeon option is enabled.
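[Editor's note: the scratch-reg bookkeeping could look roughly like this. The bit split is a made-up assumption, chosen only so that an SA offset and an IB size fit together in one dword.]

```c
#include <stdint.h>

/* Hypothetical packing: 20 bits of SA offset leaves 12 bits for the
 * IB size in dwords.  Written to a scratch reg before each IB, read
 * back on lockup to locate the faulty IB inside the SA. */
#define IB_OFS_BITS 20u
#define IB_OFS_MASK ((1u << IB_OFS_BITS) - 1)

static uint32_t scratch_encode(uint32_t sa_ofs, uint32_t size_dw)
{
    return (size_dw << IB_OFS_BITS) | (sa_ofs & IB_OFS_MASK);
}

static void scratch_decode(uint32_t v, uint32_t *sa_ofs, uint32_t *size_dw)
{
    *sa_ofs  = v & IB_OFS_MASK;
    *size_dw = v >> IB_OFS_BITS;
}
```

A single-dword encoding keeps the emit cheap (one register write per IB), at the cost of capping SA size and IB length.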


 Just resent my current patchset to the mailing list; it's not as complete
 as your solution, but it seems to be a step in the right direction, so
 please take a look.

 Being able to generate something like a GPU crash dump on lockup sounds
 very valuable to me, but I'm not sure debugfs files are the right direction
 to go. Maybe something more like a module parameter containing a directory,
 and if set we dump all information available (including BO content) in
 binary form (instead of the current human readable form of the debugfs
 files).

 Do what the Intel driver does: create a versioned binary debugfs file with
 all the error state in it for a lockup, store only one of these at a time,
 and run a userspace tool to dump it out into something you can upload, or
 just cat the file and upload it.

 You don't want the kernel writing to dirs on disk under any circumstances.
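[Editor's note: "versioned binary file" here usually just means a small header in front of the records, so the userspace tool can refuse layouts it does not understand. A sketch — the magic value and field layout are assumptions, not the Intel driver's actual format.]

```c
#include <stdint.h>
#include <string.h>

#define DUMP_MAGIC   0x52444d50u  /* arbitrary marker for this file type */
#define DUMP_VERSION 1u           /* bumped when the record layout changes */

struct dump_header {
    uint32_t magic;
    uint32_t version;
    uint32_t nrecords;  /* number of records that follow the header */
};

/* Validate a dump buffer before parsing; returns the record count,
 * or -1 if the tool does not understand this layout. */
static int dump_check(const void *buf, size_t len)
{
    struct dump_header h;

    if (len < sizeof(h))
        return -1;
    memcpy(&h, buf, sizeof(h));
    if (h.magic != DUMP_MAGIC || h.version != DUMP_VERSION)
        return -1;
    return (int)h.nrecords;
}
```

The version check is what lets the kernel-side format evolve without silently corrupting old userspace tools.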


We have an internal binary format for dumping command streams and
associated buffers; we should probably use that so that we can better
take advantage of existing internal tools.

Alex

 Dave.
 ___
 dri-devel mailing list
 dri-devel@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Reworking of GPU reset logic

2012-04-25 Thread Jerome Glisse
On Wed, Apr 25, 2012 at 9:46 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 We have an internal binary format for dumping command streams and
 associated buffers; we should probably use that so that we can better
 take advantage of existing internal tools.

 Alex


I really would like to drop all the debugfs files related to IB/ring with
this patchset. Note that I also have a binary format to replay command
streams, the blob format. It has all the information needed to replay on
the open driver, and the tools are there (my joujou repo on fdo).

Cheers,
Jerome


Reworking of GPU reset logic + dumping

2012-04-25 Thread j.glisse
Patches also available at:
http://people.freedesktop.org/~glisse/debug/

So it's Christian's series minus all the debugfs patches related to
ring/IB/MC. The last patch adds a new blob dumping facility that dumps
everything (PM4, relocs table, BO content). It's just a proof of concept
to show what I meant, because code speaks more clearly on this kind of
topic.

The blob format we dump could be different; I went with a simple binary
dword format:
type, id, size, [data (present if size > 0)]
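[Editor's note: a writer for that record layout is only a few lines; this sketch assumes records are tightly packed dwords, and the function name is mine.]

```c
#include <stdint.h>
#include <string.h>

/* One record of the dump stream: type, id, size, then `size` dwords
 * of payload when size > 0. */
static size_t blob_put(uint32_t *out, uint32_t type, uint32_t id,
                       const uint32_t *data, uint32_t size)
{
    out[0] = type;
    out[1] = id;
    out[2] = size;
    if (size)
        memcpy(&out[3], data, size * sizeof(uint32_t));
    return 3 + size;  /* dwords consumed, i.e. offset of the next record */
}
```

Because every record is self-describing, a reader can skip unknown types by advancing `3 + size` dwords, which keeps the format forward compatible.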

Note that the benefit (simpler code, less code) of dropping the current
debugfs files seems to me greater than their usefulness.

Cheers,
Jerome



Reworking of GPU reset logic

2012-04-23 Thread Michel Dänzer
On Sam, 2012-04-21 at 11:42 +0200, Christian König wrote:
> 
> Regarding the debugging of lockups I had the following on my "in mind 
> todo" list:
> 1. Rework the chip specific lockup detection code a bit more and 
> probably clean it up a bit.
> 2. Make the timeout a module parameter, because compute tasks sometimes
> block a ring for more than 10 seconds.

A better solution for that would be to improve the detection of the GPU
making progress, also for graphics operations. We should try to reduce
the timeout rather than making it even larger.
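[Editor's note: the progress-based detection Michel suggests reduces to remembering when the RPTR last moved. A user-space sketch with invented names; the real driver would sample the ring's RPTR register instead.]

```c
#include <stdbool.h>
#include <stdint.h>

struct lockup_tracker {
    uint32_t last_rptr;       /* RPTR value at the last observed change */
    uint64_t last_change_ms;  /* timestamp of that change */
};

/* Declare a lockup only when the ring made no progress for
 * `timeout_ms`.  Any RPTR movement resets the clock, so a long
 * running but advancing compute job never trips the detection. */
static bool ring_is_locked_up(struct lockup_tracker *t, uint32_t rptr,
                              uint64_t now_ms, uint64_t timeout_ms)
{
    if (rptr != t->last_rptr) {
        t->last_rptr = rptr;
        t->last_change_ms = now_ms;
        return false;
    }
    return now_ms - t->last_change_ms >= timeout_ms;
}
```

With this, the timeout can stay short: it bounds how long the GPU may stall, not how long a submission may run.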


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer




Reworking of GPU reset logic

2012-04-21 Thread Christian König
On 20.04.2012 01:47, Jerome Glisse wrote:
> 2012/4/19 Christian König:
>> This includes mostly fixes for multi ring lockups and GPU resets, but it
>> should generally improve the behavior of the kernel mode driver in case
>> something goes badly wrong.
>>
>> On the other hand it completely rewrites the IB pool and semaphore handling,
>> so I think there are still a couple of problems in it.
>>
>> The first four patches were already sent to the list, but the current set
>> depends on them, so I am resending them.
>>
>> Cheers,
>> Christian.
> I did a quick review and it looks mostly good, but as it's sensitive code
> I would like to spend some time on it, probably next week. Note that I had
> some work in this area too; I mostly want to drop all the debugfs files
> related to this and add some new, more useful ones (basically something
> that allows you to read all the data needed to replay a locked-up IB). I
> was also looking into Dave's reset thread, and your solution of moving the
> reset into the ioctl return path sounds good too, but I need to convince
> myself that it encompasses all possible cases.
>
> Cheers,
> Jerome
>
After sleeping a night over it I already reworked the patch for improving
the SA performance, so please wait at least for v2 before taking a look
at it :)

Regarding the debugging of lockups I had the following on my "in mind
todo" list:
1. Rework the chip specific lockup detection code a bit more and probably
clean it up a bit.
2. Make the timeout a module parameter, because compute tasks sometimes
block a ring for more than 10 seconds.
3. Keep track of the actual RPTR offset a fence is emitted to.
4. Keep track of all the BOs an IB is touching.
5. Now if a lockup happens, start with the last successfully signaled
fence and dump the ring content after that RPTR offset until the first
unsignaled fence.
6. Then if this fence references an IB, dump its content and the BOs it
is touching.
7. Dump everything on the ring after that fence until you reach the RPTR
of the next fence or the WPTR of the ring.
8. If there is a next fence, repeat the whole thing at step 6.

If I'm not completely wrong that should give you practically every piece
of information available, and we should probably put that behind another
module option, because we are going to spam syslog pretty heavily here.
Feel free to add/modify the ideas on this list.

Christian.


Reworking of GPU reset logic

2012-04-21 Thread Jerome Glisse
2012/4/21 Christian König:
> On 20.04.2012 01:47, Jerome Glisse wrote:
>>
>> 2012/4/19 Christian König:
>>>
>>> This includes mostly fixes for multi ring lockups and GPU resets, but it
>>> should generally improve the behavior of the kernel mode driver in case
>>> something goes badly wrong.
>>>
>>> On the other hand it completely rewrites the IB pool and semaphore
>>> handling, so I think there are still a couple of problems in it.
>>>
>>> The first four patches were already sent to the list, but the current set
>>> depends on them, so I am resending them.
>>>
>>> Cheers,
>>> Christian.
>>
>> I did a quick review and it looks mostly good, but as it's sensitive code
>> I would like to spend some time on it, probably next week. Note that I had
>> some work in this area too; I mostly want to drop all the debugfs files
>> related to this and add some new, more useful ones (basically something
>> that allows you to read all the data needed to replay a locked-up IB). I
>> was also looking into Dave's reset thread, and your solution of moving the
>> reset into the ioctl return path sounds good too, but I need to convince
>> myself that it encompasses all possible cases.
>>
>> Cheers,
>> Jerome
>>
> After sleeping a night over it I already reworked the patch for improving
> the SA performance, so please wait at least for v2 before taking a look
> at it :)
>
> Regarding the debugging of lockups I had the following on my "in mind todo"
> list:
> 1. Rework the chip specific lockup detection code a bit more and probably
> clean it up a bit.
> 2. Make the timeout a module parameter, because compute tasks sometimes
> block a ring for more than 10 seconds.
> 3. Keep track of the actual RPTR offset a fence is emitted to.
> 4. Keep track of all the BOs an IB is touching.
> 5. Now if a lockup happens, start with the last successfully signaled fence
> and dump the ring content after that RPTR offset until the first unsignaled
> fence.
> 6. Then if this fence references an IB, dump its content and the BOs it
> is touching.
> 7. Dump everything on the ring after that fence until you reach the RPTR of
> the next fence or the WPTR of the ring.
> 8. If there is a next fence, repeat the whole thing at step 6.
>
> If I'm not completely wrong that should give you practically every piece of
> information available, and we should probably put that behind another module
> option, because we are going to spam syslog pretty heavily here. Feel free to
> add/modify the ideas on this list.
>
> Christian.

What I have is similar. I am assuming only IBs trigger lockups; before each
IB, I emit the IB offset in the SA and the IB size to a scratch reg. For each
IB I keep the BO list. On lockup I allocate a big chunk of memory to copy the
whole IB and all the BOs referenced by the IB (I am using my bof format, as I
already have userspace tools).

Remove all the debugfs files. Just add a new one that gives you the first
faulty IB. On read of this file the kernel frees the memory. The kernel should
also free the memory after a while, or better, enable the lockup copy only if
some kernel radeon option is enabled.

Cheers,
Jerome






Reworking of GPU reset logic

2012-04-20 Thread Christian König
This includes mostly fixes for multi ring lockups and GPU resets, but it should
generally improve the behavior of the kernel mode driver in case something goes
badly wrong.

On the other hand it completely rewrites the IB pool and semaphore handling, so
I think there are still a couple of problems in it.

The first four patches were already sent to the list, but the current set
depends on them, so I am resending them.

Cheers,
Christian.



Reworking of GPU reset logic

2012-04-19 Thread Jerome Glisse
2012/4/19 Christian König:
> This includes mostly fixes for multi ring lockups and GPU resets, but it
> should generally improve the behavior of the kernel mode driver in case
> something goes badly wrong.
>
> On the other hand it completely rewrites the IB pool and semaphore handling,
> so I think there are still a couple of problems in it.
>
> The first four patches were already sent to the list, but the current set
> depends on them, so I am resending them.
>
> Cheers,
> Christian.

I did a quick review and it looks mostly good, but as it's sensitive code
I would like to spend some time on it, probably next week. Note that I had
some work in this area too; I mostly want to drop all the debugfs files
related to this and add some new, more useful ones (basically something
that allows you to read all the data needed to replay a locked-up IB). I
was also looking into Dave's reset thread, and your solution of moving the
reset into the ioctl return path sounds good too, but I need to convince
myself that it encompasses all possible cases.

Cheers,
Jerome


Reworking of GPU reset logic

2012-04-19 Thread Christian König
This includes mostly fixes for multi ring lockups and GPU resets, but it should
generally improve the behavior of the kernel mode driver in case something goes
badly wrong.

On the other hand it completely rewrites the IB pool and semaphore handling, so
I think there are still a couple of problems in it.

The first four patches were already sent to the list, but the current set
depends on them, so I am resending them.

Cheers,
Christian.


