Re: [Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

2017-08-08 Thread Daniel Vetter
On Tue, Aug 8, 2017 at 11:25 AM, Chris Wilson  wrote:
> Quoting Daniel Vetter (2017-08-08 10:01:59)
>> On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson  
>> wrote:
>> > Quoting Daniel Vetter (2017-08-07 17:26:56)
>> >> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
>> >> > Quoting Daniel Vetter (2017-08-04 17:07:22)
>> >> > > We now have full (or a lot at least) igt running in beta CI, and snb
>> >> > > blt hangs are really unhappy:
>> >> > >
>> >> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
>> >> > >   reliably result in insta-machine death when we try to reset the gpu,
>> >> > >   both on the CI snb and the one I have here.
>> >> > >
>> >> > > - Other testcases also randomly (and sometimes rather rarely) die on
>> >> > >   snb.
>> >> > >
>> >> > > We can't use the endless batch because that results in a reset failure
>> >> > > and wedged gpu, so also not really better.
>> >> >
>> >> > It shouldn't be the recursion, but the invalid instruction we use to try
>> >> > and trigger the hang quicker (otherwise hangcheck may see the advancing
>> >> > ACTHD and give us longer to escape the loop).
>> >> >
>> >> > In gem_exec_capture we shouldn't even need that invalid instruction, we
>> >> > just need the busy batch as we pull the trigger ourselves, and if that
>> >> > fails to reset a simple recursive batch we have some issues to resolve.
>> >>
>> >> Endless loop for haning results in a reset failure on blt as described in
>> >> the commit message. We end up with a permanent and unrecoverable -EIO,
>> >> which is as deadly to CI as outright killing the machine.
>> >
>> > No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
>> > instruction as reported, after fixes:
>>
>> Well my gt2 disagreed, but I guess we can push your patches to igt and
>> then ask CI whether we need more.
>
> Fine, dug out the snb-gt2,
>
> [ickle@huronriver tests]$ sudo ./drv_hangman
> IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
> Subtest error-state-sysfs-entry: SUCCESS (0.000s)
> Subtest error-state-basic: SUCCESS (0.004s)
> Subtest error-state-capture-render: SUCCESS (13.711s)
> Subtest error-state-capture-bsd: SUCCESS (8.006s)
> Test requirement not met in function test_error_state_capture, file 
> drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd1: SKIP (0.000s)
> Test requirement not met in function test_error_state_capture, file 
> drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd2: SKIP (0.000s)
> Subtest error-state-capture-blt: SUCCESS (6.049s)
> Test requirement not met in function test_error_state_capture, file 
> drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-vebox: SKIP (0.000s)
> Test requirement not met in function hangcheck_unterminated, file 
> drv_hangman.c:213:
> Test requirement: gem_uses_full_ppgtt(device)
> Subtest hangcheck-unterminated: SKIP (0.000s)
> [ickle@huronriver tests]$ sudo ./gem_exec_capture
> IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
> Subtest capture-render: SUCCESS (0.009s)
> Test requirement not met in function __real_main175, file 
> gem_exec_capture.c:202:
> Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
> Subtest capture-bsd: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file 
> ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd1: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file 
> ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd2: SKIP (0.000s)
> Subtest capture-blt: SUCCESS (0.005s)
> Test requirement not met in function gem_require_ring, file 
> ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-vebox: SKIP (0.000s)
>
> Seems solid to me.

Ok I'll retest once your patches have landed, could very well be that
I screwed up something with my looping batch.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

2017-08-08 Thread Chris Wilson
Quoting Daniel Vetter (2017-08-08 10:01:59)
> On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson  wrote:
> > Quoting Daniel Vetter (2017-08-07 17:26:56)
> >> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
> >> > Quoting Daniel Vetter (2017-08-04 17:07:22)
> >> > > We now have full (or a lot at least) igt running in beta CI, and snb
> >> > > blt hangs are really unhappy:
> >> > >
> >> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
> >> > >   reliably result in insta-machine death when we try to reset the gpu,
> >> > >   both on the CI snb and the one I have here.
> >> > >
> >> > > - Other testcases also randomly (and sometimes rather rarely) die on
> >> > >   snb.
> >> > >
> >> > > We can't use the endless batch because that results in a reset failure
> >> > > and wedged gpu, so also not really better.
> >> >
> >> > It shouldn't be the recursion, but the invalid instruction we use to try
> >> > and trigger the hang quicker (otherwise hangcheck may see the advancing
> >> > ACTHD and give us longer to escape the loop).
> >> >
> >> > In gem_exec_capture we shouldn't even need that invalid instruction, we
> >> > just need the busy batch as we pull the trigger ourselves, and if that
> >> > fails to reset a simple recursive batch we have some issues to resolve.
> >>
> >> Endless loop for haning results in a reset failure on blt as described in
> >> the commit message. We end up with a permanent and unrecoverable -EIO,
> >> which is as deadly to CI as outright killing the machine.
> >
> > No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
> > instruction as reported, after fixes:
> 
> Well my gt2 disagreed, but I guess we can push your patches to igt and
> then ask CI whether we need more.

Fine, dug out the snb-gt2,

[ickle@huronriver tests]$ sudo ./drv_hangman 
IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
Subtest error-state-sysfs-entry: SUCCESS (0.000s)
Subtest error-state-basic: SUCCESS (0.004s)
Subtest error-state-capture-render: SUCCESS (13.711s)
Subtest error-state-capture-bsd: SUCCESS (8.006s)
Test requirement not met in function test_error_state_capture, file 
drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd1: SKIP (0.000s)
Test requirement not met in function test_error_state_capture, file 
drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd2: SKIP (0.000s)
Subtest error-state-capture-blt: SUCCESS (6.049s)
Test requirement not met in function test_error_state_capture, file 
drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-vebox: SKIP (0.000s)
Test requirement not met in function hangcheck_unterminated, file 
drv_hangman.c:213:
Test requirement: gem_uses_full_ppgtt(device)
Subtest hangcheck-unterminated: SKIP (0.000s)
[ickle@huronriver tests]$ sudo ./gem_exec_capture 
IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
Subtest capture-render: SUCCESS (0.009s)
Test requirement not met in function __real_main175, file 
gem_exec_capture.c:202:
Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
Subtest capture-bsd: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file 
ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd1: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file 
ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd2: SKIP (0.000s)
Subtest capture-blt: SUCCESS (0.005s)
Test requirement not met in function gem_require_ring, file 
ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-vebox: SKIP (0.000s)

Seems solid to me.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

2017-08-08 Thread Daniel Vetter
On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson  wrote:
> Quoting Daniel Vetter (2017-08-07 17:26:56)
>> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
>> > Quoting Daniel Vetter (2017-08-04 17:07:22)
>> > > We now have full (or a lot at least) igt running in beta CI, and snb
>> > > blt hangs are really unhappy:
>> > >
>> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
>> > >   reliably result in insta-machine death when we try to reset the gpu,
>> > >   both on the CI snb and the one I have here.
>> > >
>> > > - Other testcases also randomly (and sometimes rather rarely) die on
>> > >   snb.
>> > >
>> > > We can't use the endless batch because that results in a reset failure
>> > > and wedged gpu, so also not really better.
>> >
>> > It shouldn't be the recursion, but the invalid instruction we use to try
>> > and trigger the hang quicker (otherwise hangcheck may see the advancing
>> > ACTHD and give us longer to escape the loop).
>> >
>> > In gem_exec_capture we shouldn't even need that invalid instruction, we
>> > just need the busy batch as we pull the trigger ourselves, and if that
>> > fails to reset a simple recursive batch we have some issues to resolve.
>>
>> Endless loop for haning results in a reset failure on blt as described in
>> the commit message. We end up with a permanent and unrecoverable -EIO,
>> which is as deadly to CI as outright killing the machine.
>
> No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
> instruction as reported, after fixes:

Well my gt2 disagreed, but I guess we can push your patches to igt and
then ask CI whether we need more.
-Daniel

>
> Subtest error-state-basic: SUCCESS (0.001s)
> Subtest error-state-capture-render: SUCCESS (7.740s)
> Subtest error-state-capture-bsd: SUCCESS (6.024s)
> Test requirement not met in function test_error_state_capture, file 
> drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd1: SKIP (0.000s)
> Test requirement not met in function test_error_state_capture, file 
> drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd2: SKIP (0.000s)
> Subtest error-state-capture-blt: SUCCESS (13.965s)
> Test requirement not met in function test_error_state_capture, file 
> drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-vebox: SKIP (0.000s)
>
> Subtest capture-render: SUCCESS (0.003s)
> Test requirement not met in function __real_main175, file 
> gem_exec_capture.c:202:
> Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
> Subtest capture-bsd: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file 
> ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd1: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file 
> ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd2: SKIP (0.000s)
> Subtest capture-blt: SUCCESS (0.002s)
> Test requirement not met in function gem_require_ring, file 
> ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-vebox: SKIP (0.000s)
>
> -Chris



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

2017-08-07 Thread Chris Wilson
Quoting Daniel Vetter (2017-08-07 17:26:56)
> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2017-08-04 17:07:22)
> > > We now have full (or a lot at least) igt running in beta CI, and snb
> > > blt hangs are really unhappy:
> > > 
> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
> > >   reliably result in insta-machine death when we try to reset the gpu,
> > >   both on the CI snb and the one I have here.
> > > 
> > > - Other testcases also randomly (and sometimes rather rarely) die on
> > >   snb.
> > > 
> > > We can't use the endless batch because that results in a reset failure
> > > and wedged gpu, so also not really better.
> > 
> > It shouldn't be the recursion, but the invalid instruction we use to try
> > and trigger the hang quicker (otherwise hangcheck may see the advancing
> > ACTHD and give us longer to escape the loop).
> > 
> > In gem_exec_capture we shouldn't even need that invalid instruction, we
> > just need the busy batch as we pull the trigger ourselves, and if that
> > fails to reset a simple recursive batch we have some issues to resolve.
> 
> Endless loop for haning results in a reset failure on blt as described in
> the commit message. We end up with a permanent and unrecoverable -EIO,
> which is as deadly to CI as outright killing the machine.

No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
instruction as reported, after fixes:

Subtest error-state-basic: SUCCESS (0.001s)
Subtest error-state-capture-render: SUCCESS (7.740s)
Subtest error-state-capture-bsd: SUCCESS (6.024s)
Test requirement not met in function test_error_state_capture, file 
drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd1: SKIP (0.000s)
Test requirement not met in function test_error_state_capture, file 
drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd2: SKIP (0.000s)
Subtest error-state-capture-blt: SUCCESS (13.965s)
Test requirement not met in function test_error_state_capture, file 
drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-vebox: SKIP (0.000s)

Subtest capture-render: SUCCESS (0.003s)
Test requirement not met in function __real_main175, file 
gem_exec_capture.c:202:
Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
Subtest capture-bsd: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file 
ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd1: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file 
ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd2: SKIP (0.000s)
Subtest capture-blt: SUCCESS (0.002s)
Test requirement not met in function gem_require_ring, file 
ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-vebox: SKIP (0.000s)

-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

2017-08-07 Thread Daniel Vetter
On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-08-04 17:07:22)
> > We now have full (or a lot at least) igt running in beta CI, and snb
> > blt hangs are really unhappy:
> > 
> > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
> >   reliably result in insta-machine death when we try to reset the gpu,
> >   both on the CI snb and the one I have here.
> > 
> > - Other testcases also randomly (and sometimes rather rarely) die on
> >   snb.
> > 
> > We can't use the endless batch because that results in a reset failure
> > and wedged gpu, so also not really better.
> 
> It shouldn't be the recursion, but the invalid instruction we use to try
> and trigger the hang quicker (otherwise hangcheck may see the advancing
> ACTHD and give us longer to escape the loop).
> 
> In gem_exec_capture we shouldn't even need that invalid instruction, we
> just need the busy batch as we pull the trigger ourselves, and if that
> fails to reset a simple recursive batch we have some issues to resolve.

Endless loop for haning results in a reset failure on blt as described in
the commit message. We end up with a permanent and unrecoverable -EIO,
which is as deadly to CI as outright killing the machine.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH i-g-t] lib: don't hang on blt on snb

2017-08-04 Thread Chris Wilson
Quoting Daniel Vetter (2017-08-04 17:07:22)
> We now have full (or a lot at least) igt running in beta CI, and snb
> blt hangs are really unhappy:
> 
> - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
>   reliably result in insta-machine death when we try to reset the gpu,
>   both on the CI snb and the one I have here.
> 
> - Other testcases also randomly (and sometimes rather rarely) die on
>   snb.
> 
> We can't use the endless batch because that results in a reset failure
> and wedged gpu, so also not really better.

It shouldn't be the recursion, but the invalid instruction we use to try
and trigger the hang quicker (otherwise hangcheck may see the advancing
ACTHD and give us longer to escape the loop).

In gem_exec_capture we shouldn't even need that invalid instruction, we
just need the busy batch as we pull the trigger ourselves, and if that
fails to reset a simple recursive batch we have some issues to resolve.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx