Re: [Mesa-dev] [ANNOUNCE] Mesa 17.3.4 release candidate
On 21 February 2018 at 19:14, Kenneth Graunke wrote: > On Thursday, February 8, 2018 8:47:00 PM PST Emil Velikov wrote: >> Rejected (9) >> >> Jason Ekstrand (2): >> e52a9f18d69c94b7cb7f81361cdb9e2582c3d742 i965: Replace >> draw_aux_buffer_disabled with draw_aux_usage >> 20f70ae3858bc213e052a8434f0e637eb36203c4 i965/draw: Set >> NEW_AUX_STATE when draw aux changes >> Reason: Introduce multiple regressions in the piglit compute shader tests. > > Hi Emil, > Hi Ken, > These are absolutely critical fixes. These patches fix GPU hangs and > crashes in Glamor which cause people's X session to die when doing > exciting things like using their text editor, IDE, or desktop panel. > It's responsible for a huge swath of our GPU hang bugs on i965. > > Did Jason or I miss an email from you about these being rejected, > other than at the bottom of a large changelog in an RC announcement? > Which Piglit tests are regressing? My guess is that we just need to > nominate another patch, as they aren't broken in master. > You're right, I should have included more specifics. The commits cause approx. 1700 regressions in the following: spec/amd_shader_trinary_minmax/execution/built-in-functions/cs-* spec/arb_compute_shader/execution/* spec/arb_gpu_shader_int64/execution/built-in-functions/cs-* spec/glsl-4.30/execution/built-in-functions/cs-* I've a dozen of attempts trying to find the missing patch(es). I _really_ want the patches to land, see [1]. As a rule the author of a rejected patch or one with merge conflicts is explicitly CCed in the RC email. Additionally, there is also a reply[2] to the patch itself with request for a)information and/or b) backport. Yes, we can remind developers more frequently. Yet at some point it only gets annoying and ultimately - ignored. Suggestions are more than welcome. > At this point, we've done 5 point releases in the 17.3.x series, which > have had DRI3 crashes when pageflipping (in all drivers), and X server > hangs and crashes galore in i965/Gen9+. Worse, we fixed those hangs a > month ago and haven't managed to ship them yet. We also managed to > ship a radv that broke completely. > > At this point, 17.3.x is looking like the worst Mesa release in recent > memory, and I'm about on the verge of advising people to just go back > to 17.2 until 18.0 comes out. It's pretty frustrating, and I feel bad > for our users, who depend on our software for their computer to work. > According to the results from the Jenkins setup, there are no regressions in 17.3.x wrt the 17.2.x series. Perhaps we lack test coverage? Additionally I would not call for 17.2 since I did notice some glitches with it and Tomb Rider and Dota2. Latter triggered by a Dota2 update. > We have to do better, somehow - myself included. Ideally, we'd find a > way to avoid major bugs like this in the first place. Barring that, > do we need to have developers take a more active role in backporting > fixes again? It seems like our nomination process works for simple > things, but for more complex series, it doesn't work as well. Maybe > we need to proactively put together (tested) pull requests for stable? > Hear, hear (aka yes please) for more developer backports. Should be a good idea to also cross review for the conflicts that myself or the Igalia team resolve. Obviously that should not substitute testing _and_ reporting from the different teams. Currently the _only_ information that we have is from the Jenkins CI. Thanks, Emil [1] https://lists.freedesktop.org/archives/mesa-dev/2018-February/185822.html [2] Must admit the last one, isn't at 100% quite yet. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [ANNOUNCE] Mesa 17.3.4 release candidate
Kenneth Graunke writes: > On Thursday, February 8, 2018 8:47:00 PM PST Emil Velikov wrote: >> Rejected (9) >> >> Jason Ekstrand (2): >> e52a9f18d69c94b7cb7f81361cdb9e2582c3d742 i965: Replace >> draw_aux_buffer_disabled with draw_aux_usage >> 20f70ae3858bc213e052a8434f0e637eb36203c4 i965/draw: Set >> NEW_AUX_STATE when draw aux changes >> Reason: Introduce multiple regressions in the piglit compute shader tests. > > Hi Emil, > > These are absolutely critical fixes. These patches fix GPU hangs and > crashes in Glamor which cause people's X session to die when doing > exciting things like using their text editor, IDE, or desktop panel. > It's responsible for a huge swath of our GPU hang bugs on i965. > > Did Jason or I miss an email from you about these being rejected, > other than at the bottom of a large changelog in an RC announcement? > Which Piglit tests are regressing? My guess is that we just need to > nominate another patch, as they aren't broken in master. > > At this point, we've done 5 point releases in the 17.3.x series, which > have had DRI3 crashes when pageflipping (in all drivers), and X server > hangs and crashes galore in i965/Gen9+. Worse, we fixed those hangs a > month ago and haven't managed to ship them yet. We also managed to > ship a radv that broke completely. > > At this point, 17.3.x is looking like the worst Mesa release in recent > memory, and I'm about on the verge of advising people to just go back > to 17.2 until 18.0 comes out. It's pretty frustrating, and I feel bad > for our users, who depend on our software for their computer to work. > > We have to do better, somehow - myself included. Ideally, we'd find a > way to avoid major bugs like this in the first place. Barring that, > do we need to have developers take a more active role in backporting > fixes again? It seems like our nomination process works for simple > things, but for more complex series, it doesn't work as well. Maybe > we need to proactively put together (tested) pull requests for stable? It seems to me that patches CCing stable or containing 'Fixes:' could be automatically be applied to a branch for incremental test. Anything that doesn't apply (or generates regressions) could be bounced to the committer for a manual backport. The recent issues all stem from patches that do not apply cleanly to the release branch. Developers should be on the hook for resolving the conflicts *and* testing the result. Finding conflicts shortly before release means the developers have no chance to fix them. It also generates pressure on the release manager to fix up other people's patches. Personally, I wouldn't want to be in a position where I could easily impact large numbers of users with a simple typo. > --Ken > ___ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [ANNOUNCE] Mesa 17.3.4 release candidate
On Thursday, February 8, 2018 8:47:00 PM PST Emil Velikov wrote: > Rejected (9) > > Jason Ekstrand (2): > e52a9f18d69c94b7cb7f81361cdb9e2582c3d742 i965: Replace > draw_aux_buffer_disabled with draw_aux_usage > 20f70ae3858bc213e052a8434f0e637eb36203c4 i965/draw: Set > NEW_AUX_STATE when draw aux changes > Reason: Introduce multiple regressions in the piglit compute shader tests. Hi Emil, These are absolutely critical fixes. These patches fix GPU hangs and crashes in Glamor which cause people's X session to die when doing exciting things like using their text editor, IDE, or desktop panel. It's responsible for a huge swath of our GPU hang bugs on i965. Did Jason or I miss an email from you about these being rejected, other than at the bottom of a large changelog in an RC announcement? Which Piglit tests are regressing? My guess is that we just need to nominate another patch, as they aren't broken in master. At this point, we've done 5 point releases in the 17.3.x series, which have had DRI3 crashes when pageflipping (in all drivers), and X server hangs and crashes galore in i965/Gen9+. Worse, we fixed those hangs a month ago and haven't managed to ship them yet. We also managed to ship a radv that broke completely. At this point, 17.3.x is looking like the worst Mesa release in recent memory, and I'm about on the verge of advising people to just go back to 17.2 until 18.0 comes out. It's pretty frustrating, and I feel bad for our users, who depend on our software for their computer to work. We have to do better, somehow - myself included. Ideally, we'd find a way to avoid major bugs like this in the first place. Barring that, do we need to have developers take a more active role in backporting fixes again? It seems like our nomination process works for simple things, but for more complex series, it doesn't work as well. Maybe we need to proactively put together (tested) pull requests for stable? --Ken signature.asc Description: This is a digitally signed message part. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4
On Mon, Feb 19, 2018 at 6:28 PM, Emil Velikov wrote: > On 17 February 2018 at 15:04, Bas Nieuwenhuizen > wrote: >> (-mesa-announce + Mark, Dave and James) >> >> Hi Emil, >> >> radv is broken for nearly all commercial games in 17.3.4. The cause is >> >> commit ad764e365beb8a119369b97f5cb95fc7ea8c >> Author: Bas Nieuwenhuizen >> Date: Mon Jan 22 09:01:29 2018 +0100 >> >> ac/nir: Use instance_rate_inputs per attribute, not per variable. >> >> This did the wrong thing if we had e.g. an array for which only some >> of the attributes use the instance index. Tripped up some new CTS >> tests. >> >> CC: >> Reviewed-by: Samuel Pitoiset >> Reviewed-by: Dave Airlie >> (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142) >> >> Conflicts: >> src/amd/common/ac_nir_to_llvm.c >> >> A typo was introduced during the conflict resolution while >> cherrypicking to 17.3. >> >> First things first, can we mitigate this? Would it be possible to get >> a 17.3.5 with this fixed ASAP? This can be fixed by either rolling >> back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying >> https://patchwork.freedesktop.org/patch/204260/ (obviously not applied >> to master as the issue did not occur there). >> > I'll do a 17.3.5 momentarily, including only the above fix. > >> Secondly, I'd like to talk about process and how to prevent this in >> the future. Bugfix releases are supposed to be stable so downstream >> maintainers don't have to to deal with this kind of stuff happening, >> so I think that breaking radv pretty much completely is particularly >> egregious and we should look at how to prevent this happnening another >> time. >> > Sadly James' fix went below the radar since: > - it wasn't merged in master - understandably so > - I mistook his reply as some lovely emails that mesa-announce is > throwing at me > > A few things come to mind: > - on my end - resolve the mesa-announce noise, ensure replies are highlighted > - having regular reply [to the RC email] with negative and positive > testing results will be great Thanks, I'll work on getting the release candidates tested in a timely manner and making sure we send these replies from the radv side. > - wire up some Vulkan games into the LunarG*/other test rig > > I've been going through * since it's inception, as the only means of > checking non Intel hardware. > > -Emil ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4
On 17 February 2018 at 15:04, Bas Nieuwenhuizen wrote: > (-mesa-announce + Mark, Dave and James) > > Hi Emil, > > radv is broken for nearly all commercial games in 17.3.4. The cause is > > commit ad764e365beb8a119369b97f5cb95fc7ea8c > Author: Bas Nieuwenhuizen > Date: Mon Jan 22 09:01:29 2018 +0100 > > ac/nir: Use instance_rate_inputs per attribute, not per variable. > > This did the wrong thing if we had e.g. an array for which only some > of the attributes use the instance index. Tripped up some new CTS > tests. > > CC: > Reviewed-by: Samuel Pitoiset > Reviewed-by: Dave Airlie > (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142) > > Conflicts: > src/amd/common/ac_nir_to_llvm.c > > A typo was introduced during the conflict resolution while > cherrypicking to 17.3. > > First things first, can we mitigate this? Would it be possible to get > a 17.3.5 with this fixed ASAP? This can be fixed by either rolling > back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying > https://patchwork.freedesktop.org/patch/204260/ (obviously not applied > to master as the issue did not occur there). > I'll do a 17.3.5 momentarily, including only the above fix. > Secondly, I'd like to talk about process and how to prevent this in > the future. Bugfix releases are supposed to be stable so downstream > maintainers don't have to to deal with this kind of stuff happening, > so I think that breaking radv pretty much completely is particularly > egregious and we should look at how to prevent this happnening another > time. > Sadly James' fix went below the radar since: - it wasn't merged in master - understandably so - I mistook his reply as some lovely emails that mesa-announce is throwing at me A few things come to mind: - on my end - resolve the mesa-announce noise, ensure replies are highlighted - having regular reply [to the RC email] with negative and positive testing results will be great - wire up some Vulkan games into the LunarG*/other test rig I've been going through * since it's inception, as the only means of checking non Intel hardware. -Emil ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4
Bas Nieuwenhuizen writes: > (-mesa-announce + Mark, Dave and James) > > Hi Emil, > > radv is broken for nearly all commercial games in 17.3.4. The cause is > > commit ad764e365beb8a119369b97f5cb95fc7ea8c > Author: Bas Nieuwenhuizen > Date: Mon Jan 22 09:01:29 2018 +0100 > > ac/nir: Use instance_rate_inputs per attribute, not per variable. > > This did the wrong thing if we had e.g. an array for which only some > of the attributes use the instance index. Tripped up some new CTS > tests. > > CC: > Reviewed-by: Samuel Pitoiset > Reviewed-by: Dave Airlie > (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142) > > Conflicts: > src/amd/common/ac_nir_to_llvm.c > > A typo was introduced during the conflict resolution while > cherrypicking to 17.3. > > First things first, can we mitigate this? Would it be possible to get > a 17.3.5 with this fixed ASAP? This can be fixed by either rolling > back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying > https://patchwork.freedesktop.org/patch/204260/ (obviously not applied > to master as the issue did not occur there). > > Secondly, I'd like to talk about process and how to prevent this in > the future. Bugfix releases are supposed to be stable so downstream > maintainers don't have to to deal with this kind of stuff happening, > so I think that breaking radv pretty much completely is particularly > egregious and we should look at how to prevent this happnening another > time. > > First a short summary of what happened and when (all times in UTC): > > 2018-02-09 4:47: Emil sends out the pre-release announcement, git > branch contains the faulty commit > 2018-02-13 16:05: James Legg replies to the pre-release announcement: > https://lists.freedesktop.org/archives/mesa-dev/2018-February/185355.html > 2018-02-13 16:06: James Legg sends a patch to fix it: > https://lists.freedesktop.org/archives/mesa-dev/2018-February/185356.html > 2018-02-13 16:33: Bas reviews the patch. > 2018-02-15 12:56: Emil releases 17.3.4 without the fix: > https://lists.freedesktop.org/archives/mesa-dev/2018-February/185691.html > 2018-02-16 21:09: Someone else replies with a fix to the pre-release > announcement: > https://lists.freedesktop.org/archives/mesa-dev/2018-February/185908.html > 2018-02-17 ~13:00: Bas notices the attached fix has "mismerge" in the > name and decides to investigate the issue > > > So I think there have been a couple of issues in communication here: > > 1) Essentially none of the communication talked about this being an > issue on the branch only. This made me underestimate the severity of > the issue, since the games kept working on master. This also meant no > manual pings from my side to the release manager for not applying this > to master first. > 2) The reply to the pre-release announcement only said there was an > issue, not that there was a fix sent out, nor any details like > severity. > 3) As far as I can tell no action had been taken by the release > manager. James' reply did not get a response nor was the fix included > in the release. > > My question would be how to improve the communication here. Could you > elaborate the reason for (3)? Was it because you thought it would have > to be picked up by the radv developers first? Would it help if I > replied to James' reply in the announcement to link to the patch? > > The remaining issue is mainly about testing. The initial detection of > this issue by James was already more than the 24-48 hours after the > pre-release announcement that is recommended between the announcement > and release, so a faster release would have prevented timely detection > in the first place. I suspect radv does not get tested at all on > releases except maybe a build test? > > Since manually testing all games on every release is not scalable for > either the radv developers or the release manager I'm thinking of > getting the automated tests from the vulkan CTS and crucible running > before a release. That said I don't think keeping track of what tests > are supposed to pass is something we should be pushing on the release > manager, so I suppose we should be setting up our own CI and surfacing > results of the stable branches to the release manager? Tracking results is at best a significant investment which attenuates as the driver becomes more stable and automation improves. It takes us perhaps an hour per day on average to track i965 status in CI. This investment pays bigger dividends for larger developer teams. There is constant demand to expand the coverage of CI. Tracking results requires developer time when the driver has been broken, but this cost is less than any other alternative. Unfortunately, this cost is often borne by whoever runs the CI, and can become burdensome. > I'm assuming there is already a similar arrangement with intel? Is > this something were the release manager triggers their CI specifically > when doing a pre-release or ar
Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4
(-mesa-announce + Mark, Dave and James) Hi Emil, radv is broken for nearly all commercial games in 17.3.4. The cause is commit ad764e365beb8a119369b97f5cb95fc7ea8c Author: Bas Nieuwenhuizen Date: Mon Jan 22 09:01:29 2018 +0100 ac/nir: Use instance_rate_inputs per attribute, not per variable. This did the wrong thing if we had e.g. an array for which only some of the attributes use the instance index. Tripped up some new CTS tests. CC: Reviewed-by: Samuel Pitoiset Reviewed-by: Dave Airlie (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142) Conflicts: src/amd/common/ac_nir_to_llvm.c A typo was introduced during the conflict resolution while cherrypicking to 17.3. First things first, can we mitigate this? Would it be possible to get a 17.3.5 with this fixed ASAP? This can be fixed by either rolling back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying https://patchwork.freedesktop.org/patch/204260/ (obviously not applied to master as the issue did not occur there). Secondly, I'd like to talk about process and how to prevent this in the future. Bugfix releases are supposed to be stable so downstream maintainers don't have to to deal with this kind of stuff happening, so I think that breaking radv pretty much completely is particularly egregious and we should look at how to prevent this happnening another time. First a short summary of what happened and when (all times in UTC): 2018-02-09 4:47: Emil sends out the pre-release announcement, git branch contains the faulty commit 2018-02-13 16:05: James Legg replies to the pre-release announcement: https://lists.freedesktop.org/archives/mesa-dev/2018-February/185355.html 2018-02-13 16:06: James Legg sends a patch to fix it: https://lists.freedesktop.org/archives/mesa-dev/2018-February/185356.html 2018-02-13 16:33: Bas reviews the patch. 2018-02-15 12:56: Emil releases 17.3.4 without the fix: https://lists.freedesktop.org/archives/mesa-dev/2018-February/185691.html 2018-02-16 21:09: Someone else replies with a fix to the pre-release announcement: https://lists.freedesktop.org/archives/mesa-dev/2018-February/185908.html 2018-02-17 ~13:00: Bas notices the attached fix has "mismerge" in the name and decides to investigate the issue So I think there have been a couple of issues in communication here: 1) Essentially none of the communication talked about this being an issue on the branch only. This made me underestimate the severity of the issue, since the games kept working on master. This also meant no manual pings from my side to the release manager for not applying this to master first. 2) The reply to the pre-release announcement only said there was an issue, not that there was a fix sent out, nor any details like severity. 3) As far as I can tell no action had been taken by the release manager. James' reply did not get a response nor was the fix included in the release. My question would be how to improve the communication here. Could you elaborate the reason for (3)? Was it because you thought it would have to be picked up by the radv developers first? Would it help if I replied to James' reply in the announcement to link to the patch? The remaining issue is mainly about testing. The initial detection of this issue by James was already more than the 24-48 hours after the pre-release announcement that is recommended between the announcement and release, so a faster release would have prevented timely detection in the first place. I suspect radv does not get tested at all on releases except maybe a build test? Since manually testing all games on every release is not scalable for either the radv developers or the release manager I'm thinking of getting the automated tests from the vulkan CTS and crucible running before a release. That said I don't think keeping track of what tests are supposed to pass is something we should be pushing on the release manager, so I suppose we should be setting up our own CI and surfacing results of the stable branches to the release manager? I'm assuming there is already a similar arrangement with intel? Is this something were the release manager triggers their CI specifically when doing a pre-release or are they just continuously watching stable branches? Also how do the results get surfaced back to the release manager? Apologies for the barrage of questions! Thanks, Bas Nieuwenhuizen On Thu, Feb 15, 2018 at 1:56 PM, Emil Velikov wrote: > Mesa 17.3.4 is now available. > > In this release we have: > > Dozens of fixes in the i965, ANV and RADV drivers. Additionally > the r600, virgl, etnaviv and renderonly drivers have also seen some love. > > The experimental Vulkan extension VK_KHX_multiview was disabled. > > On the video decoding drivers side: > r600/radeonsi correctly handle new UVD/VCN firmware. The VA and OMX > state-trackers have some MPEG2 glitches resolved, while locking is correctly > handled in the error paths. > > To top