Re: [Mesa-dev] [ANNOUNCE] Mesa 17.3.4 release candidate

2018-02-22 Thread Emil Velikov
On 21 February 2018 at 19:14, Kenneth Graunke  wrote:
> On Thursday, February 8, 2018 8:47:00 PM PST Emil Velikov wrote:
>> Rejected (9)
>> 
>> Jason Ekstrand (2):
>>   e52a9f18d69c94b7cb7f81361cdb9e2582c3d742 i965: Replace
>> draw_aux_buffer_disabled with draw_aux_usage
>>   20f70ae3858bc213e052a8434f0e637eb36203c4 i965/draw: Set
>> NEW_AUX_STATE when draw aux changes
>> Reason: Introduce multiple regressions in the piglit compute shader tests.
>
> Hi Emil,
>
Hi Ken,

> These are absolutely critical fixes.  These patches fix GPU hangs and
> crashes in Glamor which cause people's X session to die when doing
> exciting things like using their text editor, IDE, or desktop panel.
> It's responsible for a huge swath of our GPU hang bugs on i965.
>
> Did Jason or I miss an email from you about these being rejected,
> other than at the bottom of a large changelog in an RC announcement?
> Which Piglit tests are regressing?  My guess is that we just need to
> nominate another patch, as they aren't broken in master.
>
You're right, I should have included more specifics.

The commits cause approx. 1700 regressions in the following:
spec/amd_shader_trinary_minmax/execution/built-in-functions/cs-*
spec/arb_compute_shader/execution/*
spec/arb_gpu_shader_int64/execution/built-in-functions/cs-*
spec/glsl-4.30/execution/built-in-functions/cs-*


I've a dozen of attempts trying to find the missing patch(es).
I _really_ want the patches to land, see [1].

As a rule the author of a rejected patch or one with merge conflicts
is explicitly CCed in the RC email.
Additionally, there is also a reply[2] to the patch itself with
request for a)information and/or b) backport.

Yes, we can remind developers more frequently. Yet at some point it
only gets annoying and ultimately - ignored.
Suggestions are more than welcome.


> At this point, we've done 5 point releases in the 17.3.x series, which
> have had DRI3 crashes when pageflipping (in all drivers), and X server
> hangs and crashes galore in i965/Gen9+.  Worse, we fixed those hangs a
> month ago and haven't managed to ship them yet.  We also managed to
> ship a radv that broke completely.
>
> At this point, 17.3.x is looking like the worst Mesa release in recent
> memory, and I'm about on the verge of advising people to just go back
> to 17.2 until 18.0 comes out.  It's pretty frustrating, and I feel bad
> for our users, who depend on our software for their computer to work.
>
According to the results from the Jenkins setup, there are no
regressions in 17.3.x wrt the 17.2.x series.
Perhaps we lack test coverage?

Additionally I would not call for 17.2 since I did notice some
glitches with it and Tomb Rider and Dota2.
Latter triggered by a Dota2 update.

> We have to do better, somehow - myself included.  Ideally, we'd find a
> way to avoid major bugs like this in the first place.  Barring that,
> do we need to have developers take a more active role in backporting
> fixes again?  It seems like our nomination process works for simple
> things, but for more complex series, it doesn't work as well.  Maybe
> we need to proactively put together (tested) pull requests for stable?
>
Hear, hear (aka yes please) for more developer backports.

Should be a good idea to also cross review for the conflicts that
myself or the Igalia team resolve.

Obviously that should not substitute testing _and_ reporting from the
different teams.
Currently the _only_ information that we have is from the Jenkins CI.

Thanks,
Emil

[1] https://lists.freedesktop.org/archives/mesa-dev/2018-February/185822.html
[2] Must admit the last one, isn't at 100% quite yet.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [ANNOUNCE] Mesa 17.3.4 release candidate

2018-02-21 Thread Mark Janes
Kenneth Graunke  writes:

> On Thursday, February 8, 2018 8:47:00 PM PST Emil Velikov wrote:
>> Rejected (9)
>> 
>> Jason Ekstrand (2):
>>   e52a9f18d69c94b7cb7f81361cdb9e2582c3d742 i965: Replace
>> draw_aux_buffer_disabled with draw_aux_usage
>>   20f70ae3858bc213e052a8434f0e637eb36203c4 i965/draw: Set
>> NEW_AUX_STATE when draw aux changes
>> Reason: Introduce multiple regressions in the piglit compute shader tests.
>
> Hi Emil,
>
> These are absolutely critical fixes.  These patches fix GPU hangs and
> crashes in Glamor which cause people's X session to die when doing
> exciting things like using their text editor, IDE, or desktop panel.
> It's responsible for a huge swath of our GPU hang bugs on i965.
>
> Did Jason or I miss an email from you about these being rejected,
> other than at the bottom of a large changelog in an RC announcement?
> Which Piglit tests are regressing?  My guess is that we just need to
> nominate another patch, as they aren't broken in master.
>
> At this point, we've done 5 point releases in the 17.3.x series, which
> have had DRI3 crashes when pageflipping (in all drivers), and X server
> hangs and crashes galore in i965/Gen9+.  Worse, we fixed those hangs a
> month ago and haven't managed to ship them yet.  We also managed to
> ship a radv that broke completely.
>
> At this point, 17.3.x is looking like the worst Mesa release in recent
> memory, and I'm about on the verge of advising people to just go back
> to 17.2 until 18.0 comes out.  It's pretty frustrating, and I feel bad
> for our users, who depend on our software for their computer to work.
>
> We have to do better, somehow - myself included.  Ideally, we'd find a
> way to avoid major bugs like this in the first place.  Barring that,
> do we need to have developers take a more active role in backporting
> fixes again?  It seems like our nomination process works for simple
> things, but for more complex series, it doesn't work as well.  Maybe
> we need to proactively put together (tested) pull requests for stable?

It seems to me that patches CCing stable or containing 'Fixes:' could be
automatically be applied to a branch for incremental test.  Anything
that doesn't apply (or generates regressions) could be bounced to the
committer for a manual backport.

The recent issues all stem from patches that do not apply cleanly to the
release branch.  Developers should be on the hook for resolving the
conflicts *and* testing the result.

Finding conflicts shortly before release means the developers have no
chance to fix them.  It also generates pressure on the release manager
to fix up other people's patches.  Personally, I wouldn't want to be in
a position where I could easily impact large numbers of users with a
simple typo.

> --Ken
> ___
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [ANNOUNCE] Mesa 17.3.4 release candidate

2018-02-21 Thread Kenneth Graunke
On Thursday, February 8, 2018 8:47:00 PM PST Emil Velikov wrote:
> Rejected (9)
> 
> Jason Ekstrand (2):
>   e52a9f18d69c94b7cb7f81361cdb9e2582c3d742 i965: Replace
> draw_aux_buffer_disabled with draw_aux_usage
>   20f70ae3858bc213e052a8434f0e637eb36203c4 i965/draw: Set
> NEW_AUX_STATE when draw aux changes
> Reason: Introduce multiple regressions in the piglit compute shader tests.

Hi Emil,

These are absolutely critical fixes.  These patches fix GPU hangs and
crashes in Glamor which cause people's X session to die when doing
exciting things like using their text editor, IDE, or desktop panel.
It's responsible for a huge swath of our GPU hang bugs on i965.

Did Jason or I miss an email from you about these being rejected,
other than at the bottom of a large changelog in an RC announcement?
Which Piglit tests are regressing?  My guess is that we just need to
nominate another patch, as they aren't broken in master.

At this point, we've done 5 point releases in the 17.3.x series, which
have had DRI3 crashes when pageflipping (in all drivers), and X server
hangs and crashes galore in i965/Gen9+.  Worse, we fixed those hangs a
month ago and haven't managed to ship them yet.  We also managed to
ship a radv that broke completely.

At this point, 17.3.x is looking like the worst Mesa release in recent
memory, and I'm about on the verge of advising people to just go back
to 17.2 until 18.0 comes out.  It's pretty frustrating, and I feel bad
for our users, who depend on our software for their computer to work.

We have to do better, somehow - myself included.  Ideally, we'd find a
way to avoid major bugs like this in the first place.  Barring that,
do we need to have developers take a more active role in backporting
fixes again?  It seems like our nomination process works for simple
things, but for more complex series, it doesn't work as well.  Maybe
we need to proactively put together (tested) pull requests for stable?

--Ken


signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4

2018-02-19 Thread Bas Nieuwenhuizen
On Mon, Feb 19, 2018 at 6:28 PM, Emil Velikov  wrote:
> On 17 February 2018 at 15:04, Bas Nieuwenhuizen  
> wrote:
>> (-mesa-announce + Mark, Dave and James)
>>
>> Hi Emil,
>>
>> radv is broken for nearly all commercial games in 17.3.4. The cause is
>>
>> commit ad764e365beb8a119369b97f5cb95fc7ea8c
>> Author: Bas Nieuwenhuizen 
>> Date:   Mon Jan 22 09:01:29 2018 +0100
>>
>> ac/nir: Use instance_rate_inputs per attribute, not per variable.
>>
>> This did the wrong thing if we had e.g. an array for which only some
>> of the attributes use the instance index. Tripped up some new CTS
>> tests.
>>
>> CC: 
>> Reviewed-by: Samuel Pitoiset 
>> Reviewed-by: Dave Airlie 
>> (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142)
>>
>> Conflicts:
>> src/amd/common/ac_nir_to_llvm.c
>>
>> A typo was introduced during the conflict resolution while
>> cherrypicking to 17.3.
>>
>> First things first, can we mitigate this? Would it be possible to get
>> a 17.3.5 with this fixed ASAP? This can be fixed by either rolling
>> back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying
>> https://patchwork.freedesktop.org/patch/204260/ (obviously not applied
>> to master as the issue did not occur there).
>>
> I'll do a 17.3.5 momentarily, including only the above fix.
>
>> Secondly, I'd like to talk about process and how to prevent this in
>> the future. Bugfix releases are supposed to be stable so downstream
>> maintainers don't have to to deal with this kind of stuff happening,
>> so I think that breaking radv pretty much completely is particularly
>> egregious and we should look at how to prevent this happnening another
>> time.
>>
> Sadly James' fix went below the radar since:
>  - it wasn't merged in master - understandably so
>  - I mistook his reply as some lovely emails that mesa-announce is
> throwing at me
>
> A few things come to mind:
>  - on my end - resolve the mesa-announce noise, ensure replies are highlighted
>  - having regular reply [to the RC email] with negative and positive
> testing results will be great

Thanks, I'll work on getting the release candidates tested in a timely
manner and making sure we send these replies from the radv side.

>  - wire up some Vulkan games into the LunarG*/other test rig
>
> I've been going through * since it's inception, as the only means of
> checking non Intel hardware.
>
> -Emil
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4

2018-02-19 Thread Emil Velikov
On 17 February 2018 at 15:04, Bas Nieuwenhuizen  
wrote:
> (-mesa-announce + Mark, Dave and James)
>
> Hi Emil,
>
> radv is broken for nearly all commercial games in 17.3.4. The cause is
>
> commit ad764e365beb8a119369b97f5cb95fc7ea8c
> Author: Bas Nieuwenhuizen 
> Date:   Mon Jan 22 09:01:29 2018 +0100
>
> ac/nir: Use instance_rate_inputs per attribute, not per variable.
>
> This did the wrong thing if we had e.g. an array for which only some
> of the attributes use the instance index. Tripped up some new CTS
> tests.
>
> CC: 
> Reviewed-by: Samuel Pitoiset 
> Reviewed-by: Dave Airlie 
> (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142)
>
> Conflicts:
> src/amd/common/ac_nir_to_llvm.c
>
> A typo was introduced during the conflict resolution while
> cherrypicking to 17.3.
>
> First things first, can we mitigate this? Would it be possible to get
> a 17.3.5 with this fixed ASAP? This can be fixed by either rolling
> back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying
> https://patchwork.freedesktop.org/patch/204260/ (obviously not applied
> to master as the issue did not occur there).
>
I'll do a 17.3.5 momentarily, including only the above fix.

> Secondly, I'd like to talk about process and how to prevent this in
> the future. Bugfix releases are supposed to be stable so downstream
> maintainers don't have to to deal with this kind of stuff happening,
> so I think that breaking radv pretty much completely is particularly
> egregious and we should look at how to prevent this happnening another
> time.
>
Sadly James' fix went below the radar since:
 - it wasn't merged in master - understandably so
 - I mistook his reply as some lovely emails that mesa-announce is
throwing at me

A few things come to mind:
 - on my end - resolve the mesa-announce noise, ensure replies are highlighted
 - having regular reply [to the RC email] with negative and positive
testing results will be great
 - wire up some Vulkan games into the LunarG*/other test rig

I've been going through * since it's inception, as the only means of
checking non Intel hardware.

-Emil
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4

2018-02-17 Thread Mark Janes
Bas Nieuwenhuizen  writes:

> (-mesa-announce + Mark, Dave and James)
>
> Hi Emil,
>
> radv is broken for nearly all commercial games in 17.3.4. The cause is
>
> commit ad764e365beb8a119369b97f5cb95fc7ea8c
> Author: Bas Nieuwenhuizen 
> Date:   Mon Jan 22 09:01:29 2018 +0100
>
> ac/nir: Use instance_rate_inputs per attribute, not per variable.
>
> This did the wrong thing if we had e.g. an array for which only some
> of the attributes use the instance index. Tripped up some new CTS
> tests.
>
> CC: 
> Reviewed-by: Samuel Pitoiset 
> Reviewed-by: Dave Airlie 
> (cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142)
>
> Conflicts:
> src/amd/common/ac_nir_to_llvm.c
>
> A typo was introduced during the conflict resolution while
> cherrypicking to 17.3.
>
> First things first, can we mitigate this? Would it be possible to get
> a 17.3.5 with this fixed ASAP? This can be fixed by either rolling
> back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying
> https://patchwork.freedesktop.org/patch/204260/ (obviously not applied
> to master as the issue did not occur there).
>
> Secondly, I'd like to talk about process and how to prevent this in
> the future. Bugfix releases are supposed to be stable so downstream
> maintainers don't have to to deal with this kind of stuff happening,
> so I think that breaking radv pretty much completely is particularly
> egregious and we should look at how to prevent this happnening another
> time.
>
> First a short summary of what happened and when (all times in UTC):
>
> 2018-02-09 4:47: Emil sends out the pre-release announcement, git
> branch contains the faulty commit
> 2018-02-13 16:05: James Legg replies to the pre-release announcement:
> https://lists.freedesktop.org/archives/mesa-dev/2018-February/185355.html
> 2018-02-13 16:06: James Legg sends a patch to fix it:
> https://lists.freedesktop.org/archives/mesa-dev/2018-February/185356.html
> 2018-02-13 16:33: Bas reviews the patch.
> 2018-02-15 12:56: Emil releases 17.3.4 without the fix:
> https://lists.freedesktop.org/archives/mesa-dev/2018-February/185691.html
> 2018-02-16 21:09: Someone else replies with a fix to the pre-release
> announcement: 
> https://lists.freedesktop.org/archives/mesa-dev/2018-February/185908.html
> 2018-02-17 ~13:00: Bas notices the attached fix has "mismerge" in the
> name and decides to investigate the issue
>
>
> So I think there have been a couple of issues in communication here:
>
> 1) Essentially none of the communication talked about this being an
> issue on the branch only. This made me underestimate the severity of
> the issue, since the games kept working on master. This also meant no
> manual pings from my side to the release manager for not applying this
> to master first.
> 2) The reply to the pre-release announcement only said there was an
> issue, not that there was a fix sent out, nor any details like
> severity.
> 3) As far as I can tell no action had been taken by the release
> manager. James' reply did not get a response nor was the fix included
> in the release.
>
> My question would be how to improve the communication here. Could you
> elaborate the reason for (3)? Was it because you thought it would have
> to be picked up by the radv developers first? Would it help if I
> replied to James' reply in the announcement to link to the patch?
>
> The remaining issue is mainly about testing. The initial detection of
> this issue by James was already more than the 24-48 hours after the
> pre-release announcement that is recommended between the announcement
> and release, so a faster release would have prevented timely detection
> in the first place. I suspect radv does not get tested at all on
> releases except maybe a build test?
>
> Since manually testing all games on every release is not scalable for
> either the radv developers or the release manager I'm thinking of
> getting the automated tests from the vulkan CTS and crucible running
> before a release. That said I don't think keeping track of what tests
> are supposed to pass is something we should be pushing on the release
> manager, so I suppose we should be setting up our own CI and surfacing
> results of the stable branches to the release manager?

Tracking results is at best a significant investment which attenuates as
the driver becomes more stable and automation improves.  It takes us
perhaps an hour per day on average to track i965 status in CI.  This
investment pays bigger dividends for larger developer teams.  There is
constant demand to expand the coverage of CI.

Tracking results requires developer time when the driver has been
broken, but this cost is less than any other alternative.
Unfortunately, this cost is often borne by whoever runs the CI, and can
become burdensome.

> I'm assuming there is already a similar arrangement with intel? Is
> this something were the release manager triggers their CI specifically
> when doing a pre-release or ar

Re: [Mesa-dev] [ANNOUNCE] mesa 17.3.4

2018-02-17 Thread Bas Nieuwenhuizen
(-mesa-announce + Mark, Dave and James)

Hi Emil,

radv is broken for nearly all commercial games in 17.3.4. The cause is

commit ad764e365beb8a119369b97f5cb95fc7ea8c
Author: Bas Nieuwenhuizen 
Date:   Mon Jan 22 09:01:29 2018 +0100

ac/nir: Use instance_rate_inputs per attribute, not per variable.

This did the wrong thing if we had e.g. an array for which only some
of the attributes use the instance index. Tripped up some new CTS
tests.

CC: 
Reviewed-by: Samuel Pitoiset 
Reviewed-by: Dave Airlie 
(cherry picked from commit 5a4dc285002e1924dbc8c72d17481a3dbc4c0142)

Conflicts:
src/amd/common/ac_nir_to_llvm.c

A typo was introduced during the conflict resolution while
cherrypicking to 17.3.

First things first, can we mitigate this? Would it be possible to get
a 17.3.5 with this fixed ASAP? This can be fixed by either rolling
back ad764e365beb8a119369b97f5cb95fc7ea8c or by applying
https://patchwork.freedesktop.org/patch/204260/ (obviously not applied
to master as the issue did not occur there).

Secondly, I'd like to talk about process and how to prevent this in
the future. Bugfix releases are supposed to be stable so downstream
maintainers don't have to to deal with this kind of stuff happening,
so I think that breaking radv pretty much completely is particularly
egregious and we should look at how to prevent this happnening another
time.

First a short summary of what happened and when (all times in UTC):

2018-02-09 4:47: Emil sends out the pre-release announcement, git
branch contains the faulty commit
2018-02-13 16:05: James Legg replies to the pre-release announcement:
https://lists.freedesktop.org/archives/mesa-dev/2018-February/185355.html
2018-02-13 16:06: James Legg sends a patch to fix it:
https://lists.freedesktop.org/archives/mesa-dev/2018-February/185356.html
2018-02-13 16:33: Bas reviews the patch.
2018-02-15 12:56: Emil releases 17.3.4 without the fix:
https://lists.freedesktop.org/archives/mesa-dev/2018-February/185691.html
2018-02-16 21:09: Someone else replies with a fix to the pre-release
announcement: 
https://lists.freedesktop.org/archives/mesa-dev/2018-February/185908.html
2018-02-17 ~13:00: Bas notices the attached fix has "mismerge" in the
name and decides to investigate the issue


So I think there have been a couple of issues in communication here:

1) Essentially none of the communication talked about this being an
issue on the branch only. This made me underestimate the severity of
the issue, since the games kept working on master. This also meant no
manual pings from my side to the release manager for not applying this
to master first.
2) The reply to the pre-release announcement only said there was an
issue, not that there was a fix sent out, nor any details like
severity.
3) As far as I can tell no action had been taken by the release
manager. James' reply did not get a response nor was the fix included
in the release.

My question would be how to improve the communication here. Could you
elaborate the reason for (3)? Was it because you thought it would have
to be picked up by the radv developers first? Would it help if I
replied to James' reply in the announcement to link to the patch?

The remaining issue is mainly about testing. The initial detection of
this issue by James was already more than the 24-48 hours after the
pre-release announcement that is recommended between the announcement
and release, so a faster release would have prevented timely detection
in the first place. I suspect radv does not get tested at all on
releases except maybe a build test?

Since manually testing all games on every release is not scalable for
either the radv developers or the release manager I'm thinking of
getting the automated tests from the vulkan CTS and crucible running
before a release. That said I don't think keeping track of what tests
are supposed to pass is something we should be pushing on the release
manager, so I suppose we should be setting up our own CI and surfacing
results of the stable branches to the release manager?

I'm assuming there is already a similar arrangement with intel? Is
this something were the release manager triggers their CI specifically
when doing a pre-release or are they just continuously watching stable
branches? Also how do the results get surfaced back to the release
manager?

Apologies for the barrage of questions!

Thanks,
Bas Nieuwenhuizen

On Thu, Feb 15, 2018 at 1:56 PM, Emil Velikov  wrote:
> Mesa 17.3.4 is now available.
>
> In this release we have:
>
> Dozens of fixes in the i965, ANV and RADV drivers. Additionally
> the r600, virgl, etnaviv and renderonly drivers have also seen some love.
>
> The experimental Vulkan extension VK_KHX_multiview was disabled.
>
> On the video decoding drivers side:
> r600/radeonsi correctly handle new UVD/VCN firmware. The VA and OMX
> state-trackers have some MPEG2 glitches resolved, while locking is correctly
> handled in the error paths.
>
> To top