Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-20 Thread Andi Shyti
Hi Janusz,

> > > > > > +
> > > > > > cond_resched();
> > > > > >  
> > > > > > -   if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > > +   if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > > 
> > > > > where is this 500 coming from?
> > > > 
> > > > / 1000 would convert it to seconds as needed, and / 500 used instead
> > > > was supposed to mean that we are willing to wait for
> > > > preempt_timeout_ms * 2.  Sorry for that shortcut.  Would you like me
> > > > to provide a clarifying comment, or maybe better use explicit
> > > > 2 * preempt_timeout / 1000 ?
> > > 
> > > It was clear that you were doubling it, but what's more
> > > interesting to know (perhaps in a comment) is why you are
> > > choosing to use the double of the timeout_ms instead of other
> > > values.
> > > 
> > > Makes sense?
> > 
> > Yes, good question.
> > 
> > Is it possible for more than one bb to hang?  If yes then should we wait
> > longer than the longest preemption timeout?  Before I assumed that maybe
> > we should, just in case, but now, having that revisited and reconsidered,
> > I tend to agree that the longest preempt timeout, perhaps with a small
> > margin (let's say +100ms) should be enough to recover from a single
> > failing test case.  Let me verify if that works for the linked case.
> 
> I've done some testing and got a confirmation that the issue I'm trying to 
> address in the first place requires a timeout almost twice as long as the 
> longest preemption timeout.
> 
> I propose the following correction:
> 
> - if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> + /* 2 x longest preempt timeout, experimentally determined */
> + if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {

With this change applied, I'm merging your patch to drm-intel-next.

Thanks,
Andi


Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-20 Thread Andi Shyti
Hi Janusz,

> > > > > +
> > > > >   cond_resched();
> > > > >  
> > > > > - if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > + if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > 
> > > > where is this 500 coming from?
> > > 
> > > / 1000 would convert it to seconds as needed, and / 500 used instead
> > > was supposed to mean that we are willing to wait for
> > > preempt_timeout_ms * 2.  Sorry for that shortcut.  Would you like me
> > > to provide a clarifying comment, or maybe better use explicit
> > > 2 * preempt_timeout / 1000 ?
> > 
> > It was clear that you were doubling it, but what's more
> > interesting to know (perhaps in a comment) is why you are
> > choosing to use the double of the timeout_ms instead of other
> > values.
> > 
> > Makes sense?
> 
> Yes, good question.
> 
> Is it possible for more than one bb to hang?  If yes then should we wait 
> longer than the longest preemption timeout?  Before I assumed that maybe we 
> should, just in case, but now, having that revisited and reconsidered, I tend 
> to agree that the longest preempt timeout, perhaps with a small margin (let's 
> say +100ms) should be enough to recover from a single failing test case.  Let 
> me verify if that works for the linked case.

As we agreed offline, I'm going to add this comment you suggested
to your change as a justification for the "/ 500":

/* 2x longest preempt timeout, experimentally determined */

With this:

Reviewed-by: Andi Shyti 

Thanks,
Andi


Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-20 Thread Janusz Krzysztofik
Hi Andi,

On Tuesday, 17 December 2024 19:00:40 CET Janusz Krzysztofik wrote:
> Hi Andi,
> 
> On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> > Hi Janusz,
> > 
> > ...
> > 
> > > > > +
> > > > >   cond_resched();
> > > > >  
> > > > > - if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > > + if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > > 
> > > > where is this 500 coming from?
> > > 
> > > / 1000 would convert it to seconds as needed, and / 500 used instead
> > > was supposed to mean that we are willing to wait for
> > > preempt_timeout_ms * 2.  Sorry for that shortcut.  Would you like me
> > > to provide a clarifying comment, or maybe better use explicit
> > > 2 * preempt_timeout / 1000 ?
> > 
> > It was clear that you were doubling it, but what's more
> > interesting to know (perhaps in a comment) is why you are
> > choosing to use the double of the timeout_ms instead of other
> > values.
> > 
> > Makes sense?
> 
> Yes, good question.
> 
> Is it possible for more than one bb to hang?  If yes then should we wait 
> longer than the longest preemption timeout?  Before I assumed that maybe we 
> should, just in case, but now, having that revisited and reconsidered, I tend 
> to agree that the longest preempt timeout, perhaps with a small margin (let's 
> say +100ms) should be enough to recover from a single failing test case.  Let 
> me verify if that works for the linked case.

I've done some testing and got a confirmation that the issue I'm trying to 
address in the first place requires a timeout almost twice as long as the 
longest preemption timeout.

I propose the following correction:

-   if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
+   /* 2 x longest preempt timeout, experimentally determined */
+   if (intel_gt_wait_for_idle(gt, 2 * timeout_ms * HZ / 1000) == -ETIME) {

Thanks,
Janusz

> 
> Thanks,
> Janusz
> 
> > 
> > Thanks,
> > Andi
> > 

Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-17 Thread Janusz Krzysztofik
Hi Andi,

On Tuesday, 17 December 2024 18:12:08 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> > > > +
> > > > cond_resched();
> > > >  
> > > > -   if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > > +   if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > > 
> > > where is this 500 coming from?
> > 
> > / 1000 would convert it to seconds as needed, and / 500 used instead
> > was supposed to mean that we are willing to wait for
> > preempt_timeout_ms * 2.  Sorry for that shortcut.  Would you like me
> > to provide a clarifying comment, or maybe better use explicit
> > 2 * preempt_timeout / 1000 ?
> 
> It was clear that you were doubling it, but what's more
> interesting to know (perhaps in a comment) is why you are
> choosing to use the double of the timeout_ms instead of other
> values.
> 
> Makes sense?

Yes, good question.

Is it possible for more than one bb to hang?  If yes then should we wait 
longer than the longest preemption timeout?  Before I assumed that maybe we 
should, just in case, but now, having that revisited and reconsidered, I tend 
to agree that the longest preempt timeout, perhaps with a small margin (let's 
say +100ms) should be enough to recover from a single failing test case.  Let 
me verify if that works for the linked case.

Thanks,
Janusz

> 
> Thanks,
> Andi
> 

Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-17 Thread Andi Shyti
Hi Janusz,

...

> > > +
> > >   cond_resched();
> > >  
> > > - if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > > + if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> > 
> > where is this 500 coming from?
> 
> / 1000 would convert it to seconds as needed, and / 500 used instead was
> supposed to mean that we are willing to wait for preempt_timeout_ms * 2.
> Sorry for that shortcut.  Would you like me to provide a clarifying comment,
> or maybe better use explicit 2 * preempt_timeout / 1000 ?

It was clear that you were doubling it, but what's more
interesting to know (perhaps in a comment) is why you are
choosing to use the double of the timeout_ms instead of other
values.

Makes sense?

Thanks,
Andi


Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-16 Thread Janusz Krzysztofik
Hi Andi,

Thanks for review.

On Monday, 16 December 2024 14:26:58 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> > for_each_gt(gt, i915, i) {
> > +   struct intel_engine_cs *engine;
> > +   unsigned long timeout_ms = 0;
> > +   unsigned int id;
> > +
> > if (intel_gt_is_wedged(gt))
> > ret = -EIO;
> >  
> > +   for_each_engine(engine, gt, id) {
> > +   if (engine->props.preempt_timeout_ms > timeout_ms)
> > +   timeout_ms = engine->props.preempt_timeout_ms;
> > +   }
> 
> 
> the brackets are not really required here.

OK, I was not sure whether for_each_if, used inside for_each_engine, already 
resolves the potential issue of confusing if nesting, but from your comment I 
conclude it does.  I'll fix it.

> 
> > +
> > cond_resched();
> >  
> > -   if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> > +   if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {
> 
> where is this 500 coming from?

/ 1000 would convert it to seconds as needed, and / 500 used instead was 
supposed to mean that we are willing to wait for preempt_timeout_ms * 2.  
Sorry for that shortcut.  Would you like me to provide a clarifying comment, 
or maybe better use explicit 2 * preempt_timeout / 1000 ?

Thanks,
Janusz

> 
> Thanks,
> Andi
> 
> > pr_err("%pS timed out, cancelling all further testing.\n",
> >__builtin_return_address(0));
> 

Re: [PATCH] drm/i915/selftests: Use preemption timeout on cleanup

2024-12-16 Thread Andi Shyti
Hi Janusz,

...

>   for_each_gt(gt, i915, i) {
> + struct intel_engine_cs *engine;
> + unsigned long timeout_ms = 0;
> + unsigned int id;
> +
>   if (intel_gt_is_wedged(gt))
>   ret = -EIO;
>  
> + for_each_engine(engine, gt, id) {
> + if (engine->props.preempt_timeout_ms > timeout_ms)
> + timeout_ms = engine->props.preempt_timeout_ms;
> + }


the brackets are not really required here.

> +
>   cond_resched();
>  
> - if (intel_gt_wait_for_idle(gt, HZ * 3) == -ETIME) {
> + if (intel_gt_wait_for_idle(gt, HZ * timeout_ms / 500) == -ETIME) {

where is this 500 coming from?

Thanks,
Andi

>   pr_err("%pS timed out, cancelling all further testing.\n",
>  __builtin_return_address(0));
>  
> -- 
> 2.47.1