Re: Reworking of GPU reset logic
Second round of the patchset. Thanks for all the comments and bug reports; a lot of the patches are now at v2/v3 and should get another look. Every regression known so far should be fixed now.

In addition to the patches already included in the last set, there are 8 new ones which are also reset, lockup and debugging related.

As always, comments and bug reports are very welcome,
Christian.

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: Reworking of GPU reset logic
On 23.04.2012 09:40, Michel Dänzer wrote:
> On Sam, 2012-04-21 at 11:42 +0200, Christian König wrote:
>> Regarding the debugging of lockups I had the following on my todo list:
>> 1. Rework the chip specific lockup detection code a bit more and probably clean it up a bit.
>> 2. Make the timeout a module parameter, because compute tasks sometimes block a ring for more than 10 seconds.
> A better solution for that would be to improve the detection of the GPU making progress, also for graphics operations. We should try to reduce the timeout rather than making it even larger.

Well, let's call it a more complete solution, since making the parameter configurable doesn't necessarily mean we are going to increase it. I usually set it to zero now, since that disables lockup detection altogether and enables me to dig into the reason for something getting stuck.

Christian.
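The configurable-timeout idea discussed above can be sketched as follows. This is a minimal userspace model, not the radeon driver's actual code; the class and parameter names are hypothetical, and the only behavior taken from the thread is that progress is measured by the ring's read pointer moving and that a timeout of zero disables detection entirely.

```python
import time

# Hypothetical default; the thread mentions compute tasks blocking a ring
# for more than 10 seconds, so 10s is used as the illustrative baseline.
DEFAULT_LOCKUP_TIMEOUT_MS = 10_000

class RingLockupDetector:
    """Per-ring lockup detection with a configurable timeout
    (timeout_ms=0 disables detection, as described in the thread)."""

    def __init__(self, timeout_ms=DEFAULT_LOCKUP_TIMEOUT_MS, clock=time.monotonic):
        self.timeout_ms = timeout_ms
        self.clock = clock
        self.last_rptr = None
        self.last_change = clock()

    def check_lockup(self, rptr):
        """Return True if the ring looks locked up: the read pointer has
        not moved for longer than the timeout. Any rptr movement counts
        as progress and resets the window."""
        now = self.clock()
        if rptr != self.last_rptr:
            self.last_rptr = rptr
            self.last_change = now
            return False
        if self.timeout_ms == 0:
            return False  # detection disabled
        return (now - self.last_change) * 1000.0 > self.timeout_ms
```

A fake clock makes the behavior easy to exercise: feed the same rptr value repeatedly and the detector fires only once the stall exceeds the timeout.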
Re: Reworking of GPU reset logic
On 21.04.2012 16:14, Jerome Glisse wrote:
> 2012/4/21 Christian König <deathsim...@vodafone.de>:
>> On 20.04.2012 01:47, Jerome Glisse wrote:
>>> 2012/4/19 Christian König <deathsim...@vodafone.de>:
>>>> This includes mostly fixes for multi ring lockups and GPU resets, but it should generally improve the behavior of the kernel mode driver in case something goes badly wrong. On the other hand it completely rewrites the IB pool and semaphore handling, so I think there are still a couple of problems in it. The first four patches were already sent to the list, but the current set depends on them so I resend them again.
>>>> Cheers, Christian.
>>> I did a quick review, it looks mostly good, but as it's sensitive code I would like to spend some time on it. Probably next week. Note that I had some work in this area too; I mostly want to drop all the debugfs files related to this and add some new, more useful ones (basically something that allows you to read all the data needed to replay a locking-up IB). I was also looking into Dave's reset thread, and your solution of moving the reset into the ioctl return path sounds good too, but I need to convince myself that it encompasses all possible cases.
>>> Cheers, Jerome
>> After sleeping a night over it I already reworked the patch for improving the SA performance, so please wait at least for v2 before taking a look at it :)
>> Regarding the debugging of lockups I had the following on my todo list:
>> 1. Rework the chip specific lockup detection code a bit more and probably clean it up a bit.
>> 2. Make the timeout a module parameter, because compute tasks sometimes block a ring for more than 10 seconds.
>> 3. Keep track of the actual RPTR offset a fence is emitted to.
>> 4. Keep track of all the BOs an IB is touching.
>> 5. Now if a lockup happens, start with the last successfully signaled fence and dump the ring content after that RPTR offset until the first not-signaled fence.
>> 6. Then, if this fence references an IB, dump its content and the BOs it is touching.
>> 7. Dump everything on the ring after that fence until you reach the RPTR of the next fence or the WPTR of the ring.
>> 8. If there is a next fence, repeat the whole thing at step 6.
>> If I'm not completely wrong, that should give you practically all the information available, and we probably should put that behind another module option, because we are going to spam syslog quite a bit here.
>> Feel free to add/modify the ideas on this list.
>> Christian.
> What I have is similar; I am assuming only an IB triggers the lockup. Before each IB, emit to a scratch reg the IB offset in the SA and the IB size. For each IB keep the BO list. On lockup, allocate a big chunk of memory to copy the whole IB and all the BOs referenced by the IB (I am using my bof format as I already have userspace tools). Remove all the debugfs files; just add a new one that gives you the first faulty IB. On read of this file the kernel frees the memory. The kernel should also free the memory after a while, or better would be to enable the lockup copy only if some kernel radeon option is enabled.

Just resent my current patchset to the mailing list; it's not as complete as your solution, but it seems to be a step in the right direction, so please take a look.

Being able to generate something like a GPU crash dump on lockup sounds very valuable to me, but I'm not sure if debugfs files are the right direction to go. Maybe something more like a module parameter containing a directory, and if set we dump all information (including BO content) available in binary form (instead of the current human readable form of the debugfs files).

Anyway, the just-sent patchset solves the problem I'm currently looking into, and I'm running a bit out of time (again), so I don't know if I can complete that solution.

Cheers, Christian.
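The dump walk Christian outlines (start at the last signaled fence, dump each ring segment up to the next fence, and for every unsignaled fence dump its IB and BOs) can be sketched like this. The data model here is entirely hypothetical, just a plain-data stand-in for rings, fences and IBs, but the control flow follows the numbered steps from the list above.

```python
def dump_lockup_state(ring, wptr, fences, ibs):
    """Sketch of the lockup dump walk described in the thread.

    Hypothetical data model:
      ring   -- list of dwords (the ring buffer contents)
      wptr   -- ring write pointer
      fences -- ordered list of (rptr_offset, signaled, ib_id) tuples,
                oldest first; ib_id is None if the fence has no IB
      ibs    -- dict mapping ib_id -> {'cmds': [...], 'bos': [...]}
    """
    report = []
    # Find the last successfully signaled fence; the dump starts there.
    last_signaled = -1
    for i, (offset, signaled, ib_id) in enumerate(fences):
        if signaled:
            last_signaled = i
    start = fences[last_signaled][0] if last_signaled >= 0 else 0
    # Walk every fence after it: dump the ring segment leading up to the
    # fence, and the fence's IB content plus touched BOs if it has one.
    for i in range(last_signaled + 1, len(fences)):
        offset, signaled, ib_id = fences[i]
        report.append(('ring', ring[start:offset]))
        if ib_id is not None:
            report.append(('ib', ibs[ib_id]['cmds'], ibs[ib_id]['bos']))
        start = offset
    # Finally, everything after the last fence up to the WPTR.
    report.append(('ring', ring[start:wptr]))
    return report
```

Running it against a toy ring with one signaled and two pending fences yields one record per ring segment, with the IB record interleaved where its fence sits.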
Re: Reworking of GPU reset logic
2012/4/25 Christian König <deathsim...@vodafone.de>:
> [snip — full thread quoted; see previous messages]
> Being able to generate something like a GPU crash dump on lockup sounds very valuable to me, but I'm not sure if debugfs files are the right direction to go. Maybe something more like a module parameter containing a directory, and if set we dump all information (including BO content) available in binary form (instead of the current human readable form of the debugfs files).

Do what the intel driver does: create a versioned binary debugfs file with all the error state in it for a lockup, store only one of these at a time, and run a userspace tool to dump it out into something you can upload, or just cat the file and upload it. You don't want the kernel writing to dirs on disk under any circumstances.

Dave.
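Dave's suggestion, a single versioned binary error-state capture that is held until userspace reads it out, can be modelled as below. This is a userspace sketch with hypothetical names and a made-up two-word header format; the real i915 mechanism is a debugfs file, not a Python object, and only the one-slot, versioned, consumed-on-read behavior is taken from the suggestion.

```python
import struct

# Hypothetical format version for the binary error-state blob.
ERROR_STATE_VERSION = 1

class ErrorStateStore:
    """One-slot error-state store: keep a single versioned binary capture
    at a time; reading it out frees the slot for the next lockup."""

    def __init__(self):
        self._blob = None

    def capture(self, ib_bytes):
        """Record a capture of the faulty IB, unless one is already
        pending (only one error state is stored at a time)."""
        if self._blob is not None:
            return False
        header = struct.pack('<II', ERROR_STATE_VERSION, len(ib_bytes))
        self._blob = header + ib_bytes
        return True

    def read(self):
        """Return the pending blob, as a userspace tool would via the
        debugfs file, and free the slot; None if nothing is pending."""
        blob, self._blob = self._blob, None
        return blob
```

The version word at the front is what lets a userspace decoder reject captures from an incompatible driver revision instead of misparsing them.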
Re: Reworking of GPU reset logic
2012/4/25 Dave Airlie <airl...@gmail.com>:
> 2012/4/25 Christian König <deathsim...@vodafone.de>:
>> [snip — full thread quoted; see previous messages]
> Do what the intel driver does: create a versioned binary debugfs file with all the error state in it for a lockup, store only one of these at a time, and run a userspace tool to dump it out into something you can upload, or just cat the file and upload it. You don't want the kernel writing to dirs on disk under any circumstances.
> Dave.

We have an internal binary format for dumping command streams and associated buffers; we should probably use that so that we can better take advantage of existing internal tools.

Alex
Re: Reworking of GPU reset logic
On Wed, Apr 25, 2012 at 9:46 AM, Alex Deucher <alexdeuc...@gmail.com> wrote:
> 2012/4/25 Dave Airlie <airl...@gmail.com>:
>> [snip — full thread quoted; see previous messages]
>> Do what the intel driver does: create a versioned binary debugfs file with all the error state in it for a lockup, store only one of these at a time, and run a userspace tool to dump it out into something you can upload, or just cat the file and upload it. You don't want the kernel writing to dirs on disk under any circumstances.
> We have an internal binary format for dumping command streams and associated buffers; we should probably use that so that we can better take advantage of existing internal tools.
> Alex

I really would like to drop all the debugfs files related to ib/ring with this patchset. Note that I also have a binary format to replay command streams, the blob format. It has all the information needed to replay on the open driver, and the tools are there (my joujou repo on fdo).

Cheers, Jerome
Re: Reworking of GPU reset logic
On Sam, 2012-04-21 at 11:42 +0200, Christian König wrote:
> Regarding the debugging of lockups I had the following on my todo list:
> 1. Rework the chip specific lockup detection code a bit more and probably clean it up a bit.
> 2. Make the timeout a module parameter, because compute tasks sometimes block a ring for more than 10 seconds.

A better solution for that would be to improve the detection of the GPU making progress, also for graphics operations. We should try to reduce the timeout rather than making it even larger.

-- 
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Debian, X and DRI developer
Re: Reworking of GPU reset logic
On 20.04.2012 01:47, Jerome Glisse wrote:
> 2012/4/19 Christian König <deathsim...@vodafone.de>:
>> This includes mostly fixes for multi ring lockups and GPU resets, but it should generally improve the behavior of the kernel mode driver in case something goes badly wrong. On the other hand it completely rewrites the IB pool and semaphore handling, so I think there are still a couple of problems in it. The first four patches were already sent to the list, but the current set depends on them so I resend them again.
>> Cheers, Christian.
> I did a quick review, it looks mostly good, but as it's sensitive code I would like to spend some time on it. Probably next week. Note that I had some work in this area too; I mostly want to drop all the debugfs files related to this and add some new, more useful ones (basically something that allows you to read all the data needed to replay a locking-up IB). I was also looking into Dave's reset thread, and your solution of moving the reset into the ioctl return path sounds good too, but I need to convince myself that it encompasses all possible cases.
> Cheers, Jerome

After sleeping a night over it I already reworked the patch for improving the SA performance, so please wait at least for v2 before taking a look at it :)

Regarding the debugging of lockups I had the following on my todo list:
1. Rework the chip specific lockup detection code a bit more and probably clean it up a bit.
2. Make the timeout a module parameter, because compute tasks sometimes block a ring for more than 10 seconds.
3. Keep track of the actual RPTR offset a fence is emitted to.
4. Keep track of all the BOs an IB is touching.
5. Now if a lockup happens, start with the last successfully signaled fence and dump the ring content after that RPTR offset until the first not-signaled fence.
6. Then, if this fence references an IB, dump its content and the BOs it is touching.
7. Dump everything on the ring after that fence until you reach the RPTR of the next fence or the WPTR of the ring.
8. If there is a next fence, repeat the whole thing at step 6.

If I'm not completely wrong, that should give you practically all the information available, and we probably should put that behind another module option, because we are going to spam syslog quite a bit here.

Feel free to add/modify the ideas on this list.

Christian.
Re: Reworking of GPU reset logic
2012/4/21 Christian König <deathsim...@vodafone.de>:
> [snip — full thread quoted; see previous message]
> Feel free to add/modify the ideas on this list.
> Christian.

What I have is similar; I am assuming only an IB triggers the lockup. Before each IB, emit to a scratch reg the IB offset in the SA and the IB size. For each IB keep the BO list. On lockup, allocate a big chunk of memory to copy the whole IB and all the BOs referenced by the IB (I am using my bof format as I already have userspace tools). Remove all the debugfs files; just add a new one that gives you the first faulty IB. On read of this file the kernel frees the memory. The kernel should also free the memory after a while, or better would be to enable the lockup copy only if some kernel radeon option is enabled.

Cheers, Jerome
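Jerome's scratch-register approach above can be sketched as follows. Again this is a hypothetical userspace model: a bytearray stands in for the SA pool, a tuple for the scratch register, and the names are invented; the only behavior taken from the message is writing (IB offset, IB size) before each IB so that, on lockup, the faulty IB and its BO list can be copied out.

```python
class ScratchRegTracker:
    """Sketch of the scratch-register idea: before each IB is emitted,
    write its offset inside the SA pool and its size to a scratch
    register; on lockup, the register identifies the IB in flight."""

    def __init__(self, sa_pool):
        self.sa_pool = sa_pool          # bytearray standing in for the SA
        self.scratch = (0, 0)           # (ib_offset, ib_size)
        self.bo_lists = {}              # ib_offset -> list of BO names

    def emit_ib(self, offset, size, bos):
        # Written just before the IB itself, so the register always
        # points at the most recent (possibly faulty) IB.
        self.scratch = (offset, size)
        self.bo_lists[offset] = list(bos)

    def capture_on_lockup(self):
        """Copy the whole faulty IB plus its referenced BO list."""
        offset, size = self.scratch
        ib_copy = bytes(self.sa_pool[offset:offset + size])
        return {'ib': ib_copy, 'bos': self.bo_lists.get(offset, [])}
```

The point of the scheme is that the scratch register survives the hang: no per-submission bookkeeping in host memory is strictly needed to find the faulty IB, only to enumerate its BOs.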
Re: Reworking of GPU reset logic
2012/4/19 Christian König <deathsim...@vodafone.de>:
> This includes mostly fixes for multi ring lockups and GPU resets, but it should generally improve the behavior of the kernel mode driver in case something goes badly wrong. On the other hand it completely rewrites the IB pool and semaphore handling, so I think there are still a couple of problems in it. The first four patches were already sent to the list, but the current set depends on them so I resend them again.
> Cheers, Christian.

I did a quick review, it looks mostly good, but as it's sensitive code I would like to spend some time on it. Probably next week. Note that I had some work in this area too; I mostly want to drop all the debugfs files related to this and add some new, more useful ones (basically something that allows you to read all the data needed to replay a locking-up IB). I was also looking into Dave's reset thread, and your solution of moving the reset into the ioctl return path sounds good too, but I need to convince myself that it encompasses all possible cases.

Cheers, Jerome