Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-07-10 Thread Dmitry Osipenko
08.07.2020 13:06, Mikko Perttunen wrote:
> On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
>> 02.07.2020 15:10, Mikko Perttunen wrote:
>>> Ok, so we would have two kinds of syncpoints for the job; one
>>> for kernel job tracking; and one that userspace can
>>> manipulate as it wants to.
>>>
>>> Could we handle the job tracking syncpoint completely inside the kernel,
>>> i.e. allocate it in kernel during job submission, and add an increment
>>> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
>>> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
>>> MLOCK_RELEASE sequence at the end of each job.
>>
>> If the sync point is allocated within the kernel, then we'll always need
>> to patch all of the job's sync point increments with the ID of the
>> allocated sync point, regardless of whether the firewall is enabled or not.
> 
> The idea was that the job tracking increment would also be added to the
> pushbuffer in the kernel, so gathers would only have increments for the
> "user syncpoints", if any. I think this should work for THI-based
> engines (e.g. VIC); you probably have better information about
> GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to
> prevent the gathers from incrementing these job tracking syncpoints.

Could you please clarify what THI is?

>> Secondly, I now recall that only one sync point can be assigned to a
>> channel at a time on newer Tegras which support sync point protection.
>> So it sounds like we have no option other than to allocate one sync
>> point per channel for job usage if we want to be able to push multiple
>> jobs into a channel's pushbuffer, correct?
>>
> 
> The other way around; each syncpoint can be assigned to one channel. One
> channel may have multiple syncpoints.

Okay! So we actually could use a single sync point per job for both user
increments and job tracking, correct?

>> ...
 Hmm, we actually should be able to have one sync point per channel
 for the job submission, similarly to what the current driver does!

 I keep forgetting that waitbases exist!
>>>
>>> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
>>> anyway, can't we just recalculate wait thresholds at that time?
>>
>> Yes, thresholds could be recalculated and the job re-formed at push
>> time then.
>>
>> It also means that jobs should always be formed only at push time if a
>> wait command is used by the cmdstream, since the waits always need to
>> be patched; we won't know the thresholds until the scheduler actually
>> runs the job.
> 
> Could we keep the job tracking syncpoints entirely within the kernel,
> and have all wait commands and other stuff that userspace does use the
> user syncpoints? Then kernel job tracking and userspace activity would
> be separate from each other.

I think this should work, but it's not clear to me what benefits this
brings if we can re-use the same user-allocated sync point for both user
increments and kernel job tracking.

Is the idea to protect against userspace incrementing the sync point too
many times, i.e. so that a job isn't treated as completed before CDMA is
actually done with it?

> Alternatively, if we know that jobs can only be removed from the middle
> of pushbuffers, and not added, we could replace any removed jobs with
> syncpoint increments in the pushbuffer and any thresholds would stay
> intact.

It's a bit unlikely that jobs could ever be removed from the middle of
the hardware queue by the DRM scheduler.

Anyway, it should be nicer to shoot down everything and restart with a
clean slate when necessary, which is what the scheduler is already
supposed to do.


Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-07-08 Thread Mikko Perttunen

On 7/7/20 2:06 PM, Dmitry Osipenko wrote:

02.07.2020 15:10, Mikko Perttunen wrote:

Ok, so we would have two kinds of syncpoints for the job; one
for kernel job tracking; and one that userspace can
manipulate as it wants to.

Could we handle the job tracking syncpoint completely inside the kernel,
i.e. allocate it in kernel during job submission, and add an increment
for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
MLOCK_RELEASE sequence at the end of each job.


If the sync point is allocated within the kernel, then we'll always need
to patch all of the job's sync point increments with the ID of the
allocated sync point, regardless of whether the firewall is enabled or not.


The idea was that the job tracking increment would also be added to the
pushbuffer in the kernel, so gathers would only have increments for the
"user syncpoints", if any. I think this should work for THI-based
engines (e.g. VIC); you probably have better information about
GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to
prevent the gathers from incrementing these job tracking syncpoints.
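
A minimal sketch of the kind of kernel-built pushbuffer tail meant here;
every helper and word encoding below is a placeholder for illustration,
not the real host1x opcode API:

#include <linux/types.h>

#define OP_DONE 1			/* placeholder condition code */

/* append one word to the kernel-owned pushbuffer (placeholder helper) */
static void push_word(u32 **cur, u32 word)
{
	*(*cur)++ = word;
}

/*
 * After copying the user's gathers, the kernel appends its own
 * job-tracking increment (and, when MLOCKed, the release sequence), so
 * user gathers never have to reference the tracking syncpoint.
 */
static void append_job_tracking_tail(u32 **cur, u32 job_syncpt_id, bool mlocked)
{
	/* SYNCPT_INCR with condition OP_DONE on the kernel-owned syncpoint */
	push_word(cur, (OP_DONE << 8) | job_syncpt_id);

	if (mlocked) {
		/* wait for that increment to land, then release the MLOCK */
		push_word(cur, 0 /* WAIT(job_syncpt_id, threshold) */);
		push_word(cur, 0 /* MLOCK_RELEASE */);
	}
}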




Secondly, I now recall that only one sync point can be assigned to a
channel at a time on newer Tegras which support sync point protection.
So it sounds like we have no option other than to allocate one sync
point per channel for job usage if we want to be able to push multiple
jobs into a channel's pushbuffer, correct?



The other way around; each syncpoint can be assigned to one channel. One 
channel may have multiple syncpoints.



...

Hmm, we actually should be able to have one sync point per channel for
the job submission, similarly to what the current driver does!

I keep forgetting that waitbases exist!


Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
anyway, can't we just recalculate wait thresholds at that time?


Yes, thresholds could be recalculated and the job re-formed at push time
then.

It also means that jobs should always be formed only at push time if a
wait command is used by the cmdstream, since the waits always need to be
patched; we won't know the thresholds until the scheduler actually
runs the job.


Could we keep the job tracking syncpoints entirely within the kernel, 
and have all wait commands and other stuff that userspace does use the 
user syncpoints? Then kernel job tracking and userspace activity would 
be separate from each other.


Alternatively, if we know that jobs can only be removed from the middle 
of pushbuffers, and not added, we could replace any removed jobs with 
syncpoint increments in the pushbuffer and any thresholds would stay intact.





Maybe a more detailed sequence list or diagram of what happens during
submission and recovery would be useful.


The textual form + code is already good enough for me. A diagram would be
nice to have, although it may take a bit too much effort to create and
maintain it. But I don't mind at all if you'd like to make one :)

...

* We should be able to keep the syncpoint refcounting based on fences.


The fence doesn't need the sync point itself; it only needs to get a
signal when the threshold is reached or when the sync point ceases to exist.

Imagine:

    - Process A creates sync point
    - Process A creates dma-fence from this sync point
    - Process A exports dma-fence to process B
    - Process A dies

What should happen to process B?

    - Should process B's dma-fence get an error signal when process A
dies?
    - Should process B get stuck waiting endlessly for the dma-fence?

This is one example of why I'm proposing that the fence shouldn't be
tightly coupled to a sync point.


As a baseline, we should expect process B to get stuck (until a timeout
of its choosing) waiting for the fence. In this case it is
avoidable, but if the ID/threshold pair is exported out of the fence and
is waited for otherwise, it is unavoidable. I.e. once the ID/threshold
are exported out of a fence, the waiter can only see the fence being
signaled by the threshold being reached, not by the syncpoint getting
freed.
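
For reference, a bounded wait on an exported fence uses the standard
dma_fence_wait_timeout() helper in the kernel; a minimal sketch (the
surrounding function is hypothetical):

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

static int wait_for_imported_fence(struct dma_fence *fence)
{
	long ret;

	/* interruptible wait, bounded to a timeout of the waiter's choosing */
	ret = dma_fence_wait_timeout(fence, true, msecs_to_jiffies(2000));
	if (ret == 0)
		return -ETIMEDOUT;	/* producer stalled or died */
	if (ret < 0)
		return ret;		/* interrupted by a signal */

	return 0;			/* fence signaled */
}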


This is correct. If the sync point's FD is exported, or once a sync point
is resolved from a dma-fence, then the sync point will stay alive until the
last reference to it is dropped. I.e. if process A dies *after* process B
has started to wait on the sync point, then process B will get stuck.




Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-07-08 Thread Dmitry Osipenko
02.07.2020 15:10, Mikko Perttunen wrote:
> Ok, so we would have two kinds of syncpoints for the job; one
> for kernel job tracking; and one that userspace can
> manipulate as it wants to.
> 
> Could we handle the job tracking syncpoint completely inside the kernel,
> i.e. allocate it in kernel during job submission, and add an increment
> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
> MLOCK_RELEASE sequence at the end of each job.

If the sync point is allocated within the kernel, then we'll always need
to patch all of the job's sync point increments with the ID of the
allocated sync point, regardless of whether the firewall is enabled or not.

Secondly, I now recall that only one sync point can be assigned to a
channel at a time on newer Tegras which support sync point protection.
So it sounds like we have no option other than to allocate one sync
point per channel for job usage if we want to be able to push multiple
jobs into a channel's pushbuffer, correct?

...
>> Hmm, we actually should be able to have one sync point per channel for
>> the job submission, similarly to what the current driver does!
>>
>> I keep forgetting that waitbases exist!
> 
> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
> anyway, can't we just recalculate wait thresholds at that time?

Yes, thresholds could be recalculated and the job re-formed at push time
then.

It also means that jobs should always be formed only at push time if a
wait command is used by the cmdstream, since the waits always need to be
patched; we won't know the thresholds until the scheduler actually
runs the job.
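
A sketch of what that push-time patching could look like; the structures
and names are illustrative only:

#include <linux/types.h>

/* position of a WAIT threshold operand in the pushbuffer, recorded at
 * job-creation time together with its job-relative increment count */
struct wait_patch {
	u32 word_offset;
	u32 rel_threshold;
};

/* called when the scheduler actually pushes (or re-pushes) the job and
 * the sync point base value is finally known */
static void patch_wait_thresholds(u32 *pushbuf, u32 syncpt_base,
				  const struct wait_patch *patches,
				  unsigned int count)
{
	unsigned int i;

	for (i = 0; i < count; i++)
		pushbuf[patches[i].word_offset] =
			syncpt_base + patches[i].rel_threshold;
}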

> Maybe a more detailed sequence list or diagram of what happens during
> submission and recovery would be useful.

The textual form + code is already good enough for me. A diagram would be
nice to have, although it may take a bit too much effort to create and
maintain it. But I don't mind at all if you'd like to make one :)

...
>>> * We should be able to keep the syncpoint refcounting based on fences.
>>
>> The fence doesn't need the sync point itself; it only needs to get a
>> signal when the threshold is reached or when the sync point ceases to exist.
>>
>> Imagine:
>>
>>    - Process A creates sync point
>>    - Process A creates dma-fence from this sync point
>>    - Process A exports dma-fence to process B
>>    - Process A dies
>>
>> What should happen to process B?
>>
>>    - Should process B's dma-fence get an error signal when process A
>> dies?
>>    - Should process B get stuck waiting endlessly for the dma-fence?
>>
>> This is one example of why I'm proposing that the fence shouldn't be
>> tightly coupled to a sync point.
> 
> As a baseline, we should expect process B to get stuck (until a timeout
> of its choosing) waiting for the fence. In this case it is
> avoidable, but if the ID/threshold pair is exported out of the fence and
> is waited for otherwise, it is unavoidable. I.e. once the ID/threshold
> are exported out of a fence, the waiter can only see the fence being
> signaled by the threshold being reached, not by the syncpoint getting
> freed.

This is correct. If the sync point's FD is exported, or once a sync point
is resolved from a dma-fence, then the sync point will stay alive until the
last reference to it is dropped. I.e. if process A dies *after* process B
has started to wait on the sync point, then process B will get stuck.


Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-07-02 Thread Mikko Perttunen

On 7/1/20 3:22 AM, Dmitry Osipenko wrote:

30.06.2020 13:26, Mikko Perttunen wrote:

On 6/29/20 10:42 PM, Dmitry Osipenko wrote:


Secondly, I suppose neither the GPU nor the DLA could wait on a host1x
sync point, correct? Or are they integrated with the Host1x HW?



They can access syncpoints directly. (That's what I alluded to in the
"Introduction to the hardware" section :) all those things have hardware
access to syncpoints)


Should we CC all the Nouveau developers then, or is it a bit too early? :)


I think we have a few other issues still to resolve before that :)





.. rest ..



Let me try to summarize once more for my own understanding:

* When submitting a job, you would allocate new syncpoints for the job


- Yes


* After submitting the job, those syncpoints are not usable anymore


- Yes

Although, thinking a bit more about it, this needs to be relaxed.

It should be a userspace agreement/policy how to utilize sync points.

For example, if we know that userspace will have multiple application
instances all using the Tegra DRM UAPI, like the Mesa or VDPAU drivers,
then that userspace should consider returning sync points to the pool to
share them with others, while something like the Opentegra Xorg driver,
which usually has a single instance, could keep sync points pre-allocated.

The job's sync point counter will be reset to 0 by the kernel driver
during the submission process for each job, so we won't have the sync
point recovery problem.


* Postfences of that job would keep references to those syncpoints so
they aren't freed and cleared before the fences have been released


- No

I suggested that the fence shouldn't refcount the sync point and should
*only* have a reference to it; this reference will be invalidated once the
fence is signaled by the sync point reaching the threshold, or once the
sync point is released.

The sync point will have a reference to every active fence (waiting for
the signal) that is using this sync point until the threshold is reached.

So a fence can detach itself from the sync point, and the sync point can
detach all the fences from itself.

There is more about this below; please see the example with a dead
process at the end of this email.


* Once postfences have been released, syncpoints would be returned to
the pool and reset to zero


- No

I'm suggesting that the sync point should be returned to the pool once its
usage refcount reaches 0. This means that both the userspace that created
the sync point and the executed job will keep the sync point alive
until it is closed by userspace and the job is completed.


The advantage of this would be that at any point in time, there would be
a 1:1 correspondence between allocated syncpoints and jobs; so you could
  shuffle the jobs around channels or reorder them.


- Yes


Please correct if I got that wrong :)

---

I have two concerns:

* A lot of churn on syncpoints - any time you submit a job you might not
get a syncpoint for an indefinite time. If we allocate syncpoints
up-front at least you know beforehand, and then you have the syncpoint
as long as you need it.


If you have a lot of active application instances all allocating sync
points, then inevitably the sync point pool will be exhausted.

But my proposal doesn't differ from yours in this regard, correct?

And maybe there is a nice solution; please see more below!


* Plumbing the dma-fence/sync_file everywhere, and keeping it alive
until waits on it have completed, is more work than just having the
ID/threshold. This is probably mainly a problem for downstream, where
updating code for this would be difficult. I know that's not a proper
argument but I hope we can reach something that works for both worlds.


You could have an ID/threshold! :)

But you can't use the *job's* ID/threshold because you won't know them
until the kernel driver's scheduler has *completed(!)* the job's execution!
The job may be re-pushed multiple times by the scheduler to a recovered
channel if a previous job hangs!

Now, you could allocate *two* sync points:

   1. For the job itself (the job's sync point).

   2. For userspace to wait on (the user's sync point).

The job will have to increment both of these sync points (an example of
multiple sync point usage), and you know the user's sync point
ID/threshold!

If the job times out, you *could* increment the user's sync point on the
CPU from userspace!

The counter of the user's sync point won't be touched by the kernel
driver if the job hangs!


Ok, so we would have two kinds of syncpoints for the job; one for kernel 
job tracking; and one that userspace can manipulate as it wants to.


Could we handle the job tracking syncpoint completely inside the kernel, 
i.e. allocate it in kernel during job submission, and add an increment 
for it at the end of the job (with condition OP_DONE)? For MLOCKing, the 
kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT + 
MLOCK_RELEASE sequence at the end of each job.





Here's a proposal in between:

* Keep syncpoint allocation and submission in jobs as in my original proposal

Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-07-01 Thread Dmitry Osipenko
30.06.2020 13:26, Mikko Perttunen wrote:
> On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
>>
>> Secondly, I suppose neither the GPU nor the DLA could wait on a host1x
>> sync point, correct? Or are they integrated with the Host1x HW?
>>
> 
> They can access syncpoints directly. (That's what I alluded to in the
> "Introduction to the hardware" section :) all those things have hardware
> access to syncpoints)

Should we CC all the Nouveau developers then, or is it a bit too early? :)

>>
>> .. rest ..
>>
> 
> Let me try to summarize once more for my own understanding:
> 
> * When submitting a job, you would allocate new syncpoints for the job

- Yes

> * After submitting the job, those syncpoints are not usable anymore

- Yes

Although, thinking a bit more about it, this needs to be relaxed.

It should be a userspace agreement/policy how to utilize sync points.

For example, if we know that userspace will have multiple application
instances all using the Tegra DRM UAPI, like the Mesa or VDPAU drivers,
then that userspace should consider returning sync points to the pool to
share them with others, while something like the Opentegra Xorg driver,
which usually has a single instance, could keep sync points pre-allocated.

The job's sync point counter will be reset to 0 by the kernel driver
during the submission process for each job, so we won't have the sync
point recovery problem.

> * Postfences of that job would keep references to those syncpoints so
> they aren't freed and cleared before the fences have been released

- No

I suggested that the fence shouldn't refcount the sync point and should
*only* have a reference to it; this reference will be invalidated once the
fence is signaled by the sync point reaching the threshold, or once the
sync point is released.

The sync point will have a reference to every active fence (waiting for
the signal) that is using this sync point until the threshold is reached.

So a fence can detach itself from the sync point, and the sync point can
detach all the fences from itself.

There is more about this below; please see the example with a dead
process at the end of this email.

> * Once postfences have been released, syncpoints would be returned to
> the pool and reset to zero

- No

I'm suggesting that the sync point should be returned to the pool once its
usage refcount reaches 0. This means that both the userspace that created
the sync point and the executed job will keep the sync point alive
until it is closed by userspace and the job is completed.

> The advantage of this would be that at any point in time, there would be
> a 1:1 correspondence between allocated syncpoints and jobs; so you could
>  shuffle the jobs around channels or reorder them.

- Yes

> Please correct if I got that wrong :)
> 
> ---
> 
> I have two concerns:
> 
> * A lot of churn on syncpoints - any time you submit a job you might not
> get a syncpoint for an indefinite time. If we allocate syncpoints
> up-front at least you know beforehand, and then you have the syncpoint
> as long as you need it.

If you have a lot of active application instances all allocating sync
points, then inevitably the sync point pool will be exhausted.

But my proposal doesn't differ from yours in this regard, correct?

And maybe there is a nice solution; please see more below!

> * Plumbing the dma-fence/sync_file everywhere, and keeping it alive
> until waits on it have completed, is more work than just having the
> ID/threshold. This is probably mainly a problem for downstream, where
> updating code for this would be difficult. I know that's not a proper
> argument but I hope we can reach something that works for both worlds.

You could have an ID/threshold! :)

But you can't use the *job's* ID/threshold because you won't know them
until the kernel driver's scheduler has *completed(!)* the job's execution!
The job may be re-pushed multiple times by the scheduler to a recovered
channel if a previous job hangs!

Now, you could allocate *two* sync points:

  1. For the job itself (the job's sync point).

  2. For userspace to wait on (the user's sync point).

The job will have to increment both of these sync points (an example of
multiple sync point usage), and you know the user's sync point ID/threshold!

If the job times out, you *could* increment the user's sync point on the
CPU from userspace!

The counter of the user's sync point won't be touched by the kernel
driver if the job hangs!
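
A rough userspace-side sketch of this two-sync-point split; the wrappers
and struct are hypothetical, only the idea (a kernel-tracked job SP plus
a user SP whose ID/threshold is known up front) comes from the discussion
above:

/* hypothetical wrappers around the proposed UAPI, for illustration only */
struct syncpt { int fd; unsigned int id; unsigned int value; };
int alloc_syncpoint(int drm_fd, struct syncpt *sp);
int submit_job(int drm_fd, const struct syncpt *job_sp,
	       const struct syncpt *user_sp, unsigned int user_incrs);

static unsigned int submit_with_two_syncpoints(int drm_fd)
{
	struct syncpt job_sp, user_sp;

	/* one sync point tracks the job for the kernel... */
	alloc_syncpoint(drm_fd, &job_sp);
	/* ...and one is incremented for userspace/other engines to wait on */
	alloc_syncpoint(drm_fd, &user_sp);

	submit_job(drm_fd, &job_sp, &user_sp, 1);

	/*
	 * The user SP threshold is known before the scheduler ever runs (or
	 * re-runs) the job; if the job times out, userspace can increment
	 * user_sp from the CPU so waiters on (id, threshold) still unblock.
	 */
	return user_sp.value + 1;
}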

> Here's a proposal in between:
> 
> * Keep syncpoint allocation and submission in jobs as in my original
> proposal

Yes, we could keep it.

But, as I suggested in my other email, we may want to extend the
allocation IOCTL to support allocating multiple sync points at once.

Secondly, if we want to have multi-syncpoint support for the job,
then we may want to improve the SUBMIT IOCTL like this:

struct drm_tegra_channel_submit {
        /* optional sync points incremented on behalf of userspace */
        __u32 num_usr_syncpt_incrs;
        __u64 usr_sync_points_ptr;

        /* sync point used by the kernel for job tracking */
        __u32 num_job_syncpt_incrs;
        __u32 job_syncpt_handle;
};

If job doesn't need to increment user's sync

Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-30 Thread Mikko Perttunen

On 6/29/20 10:42 PM, Dmitry Osipenko wrote:


Secondly, I suppose neither the GPU nor the DLA could wait on a host1x
sync point, correct? Or are they integrated with the Host1x HW?



They can access syncpoints directly. (That's what I alluded to in the 
"Introduction to the hardware" section :) all those things have hardware 
access to syncpoints)


>
> .. rest ..
>

Let me try to summarize once more for my own understanding:

* When submitting a job, you would allocate new syncpoints for the job
* After submitting the job, those syncpoints are not usable anymore
* Postfences of that job would keep references to those syncpoints so 
they aren't freed and cleared before the fences have been released
* Once postfences have been released, syncpoints would be returned to 
the pool and reset to zero


The advantage of this would be that at any point in time, there would be 
a 1:1 correspondence between allocated syncpoints and jobs; so you could 
 shuffle the jobs around channels or reorder them.


Please correct if I got that wrong :)

---

I have two concerns:

* A lot of churn on syncpoints - any time you submit a job you might not 
get a syncpoint for an indefinite time. If we allocate syncpoints 
up-front at least you know beforehand, and then you have the syncpoint 
as long as you need it.
* Plumbing the dma-fence/sync_file everywhere, and keeping it alive 
until waits on it have completed, is more work than just having the 
ID/threshold. This is probably mainly a problem for downstream, where 
updating code for this would be difficult. I know that's not a proper 
argument but I hope we can reach something that works for both worlds.


Here's a proposal in between:

* Keep syncpoint allocation and submission in jobs as in my original 
proposal

* Don't attempt to recover user channel contexts. What this means:
  * If we have a hardware channel per context (MLOCKing), just tear 
down the channel
  * Otherwise, we can just remove (either by patching or by full 
teardown/resubmit of the channel) all jobs submitted by the user channel 
context that submitted the hanging job. Jobs of other contexts would be 
undisturbed (though potentially delayed, which could be taken into 
account and timeouts adjusted)
* If this happens, we can set removed jobs' post-fences to an error status
and the user will have to resubmit them (see the sketch after this list).

* We should be able to keep the syncpoint refcounting based on fences.
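
For reference, the usual kernel pattern for putting a removed job's
post-fence into an error state is to attach the error before signaling;
a minimal sketch:

#include <linux/dma-fence.h>
#include <linux/errno.h>

/* waiters then observe an error status instead of a normal completion
 * and know they have to resubmit */
static void fail_job_fence(struct dma_fence *fence)
{
	dma_fence_set_error(fence, -ECANCELED);
	dma_fence_signal(fence);
}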

This can be made more fine-grained by not caring about the user channel 
context, but tearing down all jobs with the same syncpoint. I think the 
result would be that we can get either what you described (or how I 
understood it in the summary in the beginning of the message), or a more 
traditional syncpoint-per-userctx workflow, depending on how the 
userspace decides to allocate syncpoints.


If needed, the kernel can still do e.g. reordering (you mentioned job 
priorities) at syncpoint granularity, which, if the userspace followed 
the model you described, would be the same thing as job granularity.


(Maybe it would be more difficult with the current drm_scheduler; sorry,
I haven't had the time yet to read up on that. I'm dealing with clearing
up work stuff before summer vacation.)


Mikko


Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-30 Thread Dmitry Osipenko
29.06.2020 13:27, Mikko Perttunen wrote:
...
 3. Sync points should be properly refcounted. A job's sync points
 shouldn't be re-used while the job is alive.

 4. The job's sync point can't be re-used after the job's submission (a
 UAPI constraint!). Userspace must free the sync point and allocate a new
 one for the next job submission. And now we:

     - Know that the job's sync point is always in a healthy state!

     - Are not limited by the number of physically available hardware
 sync points! Allocation should block until a free sync point is available.

It also occurred to me that if the allocation is blocking and there is a
need to allocate multiple sync points for a job, then the IOCTL should
be able to allocate multiple sync points at once. This will prevent a
deadlock situation where two contexts could block on each other.
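
A hypothetical shape for such a batched allocation IOCTL argument, purely
for illustration (not part of the proposed UAPI):

#include <linux/types.h>

/* allocate every sync point a job needs in one blocking call, so two
 * contexts can never each hold one sync point while waiting for the
 * other's second one */
struct host1x_allocate_syncpoints {
	__u64 fds_ptr;	/* [in] pointer to an array of __s32; [out] FDs */
	__u32 count;	/* [in] number of sync points to allocate */
	__u32 flags;
};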


Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-30 Thread Dmitry Osipenko
29.06.2020 13:27, Mikko Perttunen wrote:
...
 4. The job's sync point can't be re-used after the job's submission (a
 UAPI constraint!). Userspace must free the sync point and allocate a new
 one for the next job submission. And now we:

     - Know that the job's sync point is always in a healthy state!

     - Are not limited by the number of physically available hardware
 sync points! Allocation should block until a free sync point is available.

     - Know that the logical number of the job's sync point increments
 matches the SP hardware state, which is handy for debugging a job.

 Optionally, the job's sync point could be auto-removed from the DRM
 context after the job's submission, avoiding the need for an extra
 SYNCPT_PUT IOCTL invocation by userspace after the job's submission.
 This could be a job flag.
>>>
>>> I think this would cause problems where, after a job completes but
>>> before the fence has been waited on, the syncpoint is already recycled
>>> (especially if the syncpoint is reset into some clean state).
>>
>> Exactly, good point! The dma-fence shouldn't be hardwired to the sync
>> point in order to avoid this situation :)
>>
>> Please take a look at the fence implementation that I made for the
>> grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
>> to a sync point by host1x_fence_create(). Once the job is completed, the
>> host1x-fence is detached from the sync point [5][6] and the sync point
>> can be recycled safely!
> 
> What if the fence has been programmed as a prefence to another channel
> (that is getting delayed), or to the GPU, or some other accelerator like
> DLA, or maybe some other VM? Those don't know the dma_fence has been
> signaled; they can only rely on the syncpoint ID/threshold pair.

The explicit job fence is always just a dma-fence; it's not tied to a
host1x-fence and it should be waited on for a signal on the CPU.

If you want to make a job wait for a sync point in hardware, then you
should use the drm_tegra_submit_command wait command.

Again, please note that the DRM scheduler supports a job-submitted
fence! This dma-fence signals once the job is pushed to hardware for
execution, so it shouldn't be a problem to maintain job ordering for
complex jobs without much hassle. We'll need to write some userspace to
check how it works in practice :) For now I've only had experience with
simple jobs.
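
For reference, a minimal sketch of pulling that job-submitted
("scheduled") fence out of a DRM scheduler job; the "finished" fence is
the one that signals on completion:

#include <drm/gpu_scheduler.h>

/* the scheduler exposes two fences per job: "scheduled" signals when the
 * job is pushed to the hardware, "finished" when it completes */
static struct dma_fence *job_submitted_fence(struct drm_sched_job *job)
{
	return &job->s_fence->scheduled;
}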

Secondly, I suppose neither the GPU nor the DLA could wait on a host1x
sync point, correct? Or are they integrated with the Host1x HW?

Anyway, it shouldn't be difficult to resolve a dma-fence into a
host1x-fence, get the SP ID and maintain the SP's liveness. Please see
more below.

In the grate-driver I made all sync points refcounted, so a sync point
won't be recycled while it has active users [1][2][3] in the kernel (or
userspace).

[1]
https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L428
[2]
https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L1206
[3]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L163

Now, the grate-kernel isn't a 100% complete implementation, as I already
mentioned before. The host1x-fence doesn't have a reference to a sync
point, as you may see in the code; this is because userspace sync points
are not implemented in the grate-driver.

But nothing stops us from adding that SP reference, and then we could
simply do the following in the code:

struct dma_fence *host1x_fence_create(syncpt, ...)
{
	...
	/* the fence holds a back-pointer to its sync point */
	fence->sp = syncpt;
	...
	return &fence->base;
}

void host1x_syncpt_signal_fence(struct host1x_fence *fence)
{
	...
	/* detach the fence from the sync point once it is signaled */
	fence->sp = NULL;
}

irqreturn_t host1x_hw_syncpt_isr()
{
	spin_lock(&host1x_syncpts_lock);
	...
	/* threshold reached: signal and detach the attached fence */
	host1x_syncpt_signal_fence(sp->fence);
	...
	spin_unlock(&host1x_syncpts_lock);
}

void host1x_submit_job(job)
{
	...
	/* take a sync point reference while the fence is still attached */
	spin_lock_irqsave(&host1x_syncpts_lock);
	sp = host1x_syncpt_get(host1x_fence->sp);
	spin_unlock_irqrestore(&host1x_syncpts_lock);
	...
	if (sp) {
		/* hardware wait on the resolved ID/threshold pair */
		push(WAIT(sp->id, host1x_fence->threshold));
		job->sync_points = sp;
	}
}

void host1x_free_job(job)
{
	/* drop the job's reference; the sync point may now be recycled
	 * if no other users remain */
	host1x_syncpt_put(job->sync_points);
	...
}

For example: if you share a host1x-fence (dma-fence) with a GPU context,
then the fence's SP won't be released until the GPU's context is done
with the SP. The GPU's job will time out if the shared SP doesn't get
incremented, and that is a totally okay situation.

Does this answer your question?

===

I'm not familiar with the Host1x VMs, so please let me use my
imagination here:

In the case of a VM we could have a special VM-shared sync point type.
Userspace will allocate this special VM SP using ALLOCATE_SYNCPOINT and
this SP won't be used for a job(!). This is the case where a job will need
to increment multiple sync points, its own SP + the VM's SP. If job h

Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-30 Thread Dmitry Osipenko
29.06.2020 13:27, Mikko Perttunen wrote:
...
>> We don't need a dedicated sync point FD for all kinds of jobs, do we?
>> For example, I don't see why a sync point FD would be needed in the case
>> of Opentegra jobs.
> 
> I think it's cleaner if we have just one way to allocate syncpoints, and
> then those syncpoints can be passed to different things depending on the
> situation.
> 
> If we want to protect direct incrementing while a job is submitted, we
> could have a locking API where an ongoing job can take a lock refcount
> in the syncpoint, and incrementing would return -EBUSY.

Okay, let's go with this for now.


Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-29 Thread Mikko Perttunen

On 6/29/20 5:36 AM, Dmitry Osipenko wrote:

28.06.2020 12:44, Mikko Perttunen wrote:

On 6/28/20 2:27 AM, Dmitry Osipenko wrote:

23.06.2020 15:09, Mikko Perttunen wrote:


### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)

Allocates a free syncpoint, returning a file descriptor representing it.
Only the owner of the file descriptor is allowed to mutate the value of
the syncpoint.

```
struct host1x_ctrl_allocate_syncpoint {
        /**
         * @fd:
         *
         * [out] New file descriptor representing the allocated
         * syncpoint.
         */
        __s32 fd;

        __u32 reserved[3];
};
```
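
A userspace usage sketch, assuming the UAPI header from this RFC is
available; the wrapper itself is hypothetical:

#include <sys/ioctl.h>

/* allocate a syncpoint on /dev/host1x and return the FD that grants
 * increment rights to its owner */
static int allocate_syncpoint(int host1x_fd)
{
	struct host1x_ctrl_allocate_syncpoint args = { 0 };

	if (ioctl(host1x_fd, HOST1X_ALLOCATE_SYNCPOINT, &args) < 0)
		return -1;

	return args.fd;
}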


We should need at least these basic things from the sync points API:
- Execution context shouldn't be able to tamper with the sync points of
the other contexts.


This is covered by this UAPI - when submitting, as part of the
syncpt_incr struct you pass the syncpoint FD. This way the driver can
check the syncpoints used are correct, or program HW protection.



- Sync point could be shared with other contexts for explicit fencing.


Not sure what you specifically mean; you can get the ID out of the
syncpoint fd and share the ID for read-only access. (Or the FD for
read-write access)


I enumerated the overall points that the UAPI should provide to us, just
for clarity. It's not that you haven't covered any of them; sorry for the
confusion! :)

Please see more comments below!


Ok, good :)





- Sync points should work reliably.

Some problems of the current Host1x driver, like where it falls over if a
sync point value is out of sync, plus all the hang-job recovery labor,
could be easily reduced if sync point health is protected by extra UAPI
constraints.

So I think we may want the following:

1. We still need to assign a sync point ID to a DRM channel's
context. This sync point ID will be used for command stream forming,
like it is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but
improve it.


My point here is that the UAPI shouldn't allow incrementing the job's
sync point using the SYNCPOINT_INCREMENT IOCTL, which is another UAPI
constraint.

I'm suggesting that we should have two methods of sync point allocation:

1) A sync point that can be used only by a submitted job.

2) A sync point that can be incremented by the CPU.

The first method would allocate a raw sync point ID that is assigned to
the channel's context. This ID will be used for the job's completion
tracking. Perhaps this method could also optionally return a sync point
FD if you want another job to wait on this sync point.

We don't need a dedicated sync point FD for all kinds of jobs, do we?
For example, I don't see why a sync point FD would be needed in the case
of Opentegra jobs.


I think it's cleaner if we have just one way to allocate syncpoints, and 
then those syncpoints can be passed to different things depending on the 
situation.


If we want to protect direct incrementing while a job is submitted, we 
could have a locking API where an ongoing job can take a lock refcount 
in the syncpoint, and incrementing would return -EBUSY.
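
A rough sketch of that locking idea (hypothetical names; a real
implementation would also need to close the race between the check and
the increment):

#include <linux/atomic.h>
#include <linux/errno.h>

struct ksyncpt {
	atomic_t job_locks;	/* submitted jobs currently using this SP */
};

/* CPU-side increment path: refused while any submitted job holds a lock */
static int ksyncpt_cpu_increment(struct ksyncpt *sp)
{
	if (atomic_read(&sp->job_locks) > 0)
		return -EBUSY;

	/* ... perform the increment ... */
	return 0;
}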





2. Allocated sync point must have a clean hardware state.


What do you mean by clean hardware state?


I mean that the sync point should have a predictable state [1]; it
shouldn't accidentally fire during hardware programming, for example.

[1]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L132

For a submitted job, the job's sync point state could be reset at
submission time, for example like I did in the grate-kernel's
experimental driver [2].

[2]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel.c#L145



3. Sync points should be properly refcounted. A job's sync points
shouldn't be re-used while the job is alive.

4. The job's sync point can't be re-used after the job's submission (a
UAPI constraint!). Userspace must free the sync point and allocate a new
one for the next job submission. And now we:

    - Know that the job's sync point is always in a healthy state!

    - Are not limited by the number of physically available hardware sync
points! Allocation should block until a free sync point is available.

    - Know that the logical number of the job's sync point increments
matches the SP hardware state, which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM
context after the job's submission, avoiding the need for an extra
SYNCPT_PUT IOCTL invocation by userspace after the job's submission.
This could be a job flag.


I think this would cause problems where, after a job completes but before
the fence has been waited on, the syncpoint is already recycled (especially
if the syncpoint is reset into some clean state).


Exactly, good point! The dma-fence shouldn't be hardwired to the sync
point in order to avoid this situation :)

Please take a look at the fence implementation that I made for the
grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached

Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-28 Thread Dmitry Osipenko
28.06.2020 12:44, Mikko Perttunen wrote:
> On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>
>>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>>
>>> Allocates a free syncpoint, returning a file descriptor representing it.
>>> Only the owner of the file descriptor is allowed to mutate the value of
>>> the syncpoint.
>>>
>>> ```
>>> struct host1x_ctrl_allocate_syncpoint {
>>>     /**
>>>  * @fd:
>>>  *
>>>  * [out] New file descriptor representing the allocated
>>> syncpoint.
>>>  */
>>>     __s32 fd;
>>>
>>>     __u32 reserved[3];
>>> };
>>
>> We should need at least these basic things from the sync points API:
>> - Execution context shouldn't be able to tamper with the sync points of
>> the other contexts.
> 
> This is covered by this UAPI - when submitting, as part of the
> syncpt_incr struct you pass the syncpoint FD. This way the driver can
> check the syncpoints used are correct, or program HW protection.
> 
>>
>> - Sync point could be shared with other contexts for explicit fencing.
> 
> Not sure what you specifically mean; you can get the ID out of the
> syncpoint fd and share the ID for read-only access. (Or the FD for
> read-write access)

I enumerated the overall points that the UAPI should provide to us, just
for clarity. It's not that you haven't covered any of them; sorry for the
confusion! :)

Please see more comments below!

>>
>> - Sync points should work reliably.
>>
>> Some problems of the current Host1x driver, like where it falls over if a
>> sync point value is out of sync, plus all the hang-job recovery labor,
>> could be easily reduced if sync point health is protected by extra UAPI
>> constraints.
>>
>> So I think we may want the following:
>>
>> 1. We still need to assign a sync point ID to a DRM channel's
>> context. This sync point ID will be used for command stream forming,
>> like it is done by the current staging UAPI.
>>
>> So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but
>> improve it.

My point here is that the UAPI shouldn't allow incrementing the job's
sync point using the SYNCPOINT_INCREMENT IOCTL, which is another UAPI
constraint.

I'm suggesting that we should have two methods of sync point allocation:

1) A sync point that can be used only by a submitted job.

2) A sync point that can be incremented by the CPU.

The first method would allocate a raw sync point ID that is assigned to
the channel's context. This ID will be used for the job's completion
tracking. Perhaps this method could also optionally return a sync point
FD if you want another job to wait on this sync point.

We don't need a dedicated sync point FD for all kinds of jobs, do we?
For example, I don't see why a sync point FD would be needed in the case
of Opentegra jobs.

>> 2. Allocated sync point must have a clean hardware state.
> 
> What do you mean by clean hardware state?

I mean that the sync point should have a predictable state [1]; it
shouldn't accidentally fire during hardware programming, for example.

[1]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L132

For a submitted job, the job's sync point state could be reset at
submission time, for example like I did in the grate-kernel's
experimental driver [2].

[2]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel.c#L145

>>
>> 3. Sync points should be properly refcounted. A job's sync points
>> shouldn't be re-used while the job is alive.
>>
>> 4. The job's sync point can't be re-used after the job's submission (a
>> UAPI constraint!). Userspace must free the sync point and allocate a new
>> one for the next job submission. And now we:
>>
>>    - Know that the job's sync point is always in a healthy state!
>>
>>    - Are not limited by the number of physically available hardware sync
>> points! Allocation should block until a free sync point is available.
>>
>>    - Know that the logical number of the job's sync point increments
>> matches the SP hardware state, which is handy for debugging a job.
>>
>> Optionally, the job's sync point could be auto-removed from the DRM
>> context after the job's submission, avoiding the need for an extra
>> SYNCPT_PUT IOCTL invocation by userspace after the job's submission.
>> This could be a job flag.
> 
> I think this would cause problems where, after a job completes but before
> the fence has been waited on, the syncpoint is already recycled (especially
> if the syncpoint is reset into some clean state).

Exactly, good point! The dma-fence shouldn't be hardwired to the sync
point in order to avoid this situation :)

Please take a look at the fence implementation that I made for the
grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
to a sync point by host1x_fence_create(). Once the job is completed, the
host1x-fence is detached from the sync point [5][6] and the sync point
can be recycled safely!

[3]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/hos

Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-28 Thread Dmitry Osipenko
23.06.2020 15:09, Mikko Perttunen wrote:
> 
> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
> 
> Allocates a free syncpoint, returning a file descriptor representing it.
> Only the owner of the file descriptor is allowed to mutate the value of
> the syncpoint.
> 
> ```
> struct host1x_ctrl_allocate_syncpoint {
>    /**
>     * @fd:
>     *
>     * [out] New file descriptor representing the allocated syncpoint.
>     */
>    __s32 fd;
> 
>    __u32 reserved[3];
> };

We should need at least these basic things from the sync points API:

- Execution context shouldn't be able to tamper with the sync points of
the other contexts.

- Sync point could be shared with other contexts for explicit fencing.

- Sync points should work reliably.

Some problems of the current Host1x driver, like where it falls over if a
sync point value is out of sync, plus all the hang-job recovery labor,
could be easily reduced if sync point health is protected by extra UAPI
constraints.

So I think we may want the following:

1. We still need to assign a sync point ID to a DRM channel's
context. This sync point ID will be used for command stream forming,
like it is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.

2. Allocated sync point must have a clean hardware state.

3. Sync points should be properly refcounted. A job's sync points
shouldn't be re-used while the job is alive.

4. The job's sync point can't be re-used after the job's submission (a UAPI
constraint!). Userspace must free the sync point and allocate a new one for
the next job submission. And now we:

  - Know that the job's sync point is always in a healthy state!

  - Are not limited by the number of physically available hardware sync
points! Allocation should block until a free sync point is available.

  - Know that the logical number of the job's sync point increments matches
the SP hardware state, which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM
context after the job's submission, avoiding the need for an extra
SYNCPT_PUT IOCTL invocation by userspace after the job's submission. This
could be a job flag.

We could avoid the need for statically-allocated sync points entirely for
patched cmdstreams! The sync point could be dynamically allocated at the
job's submission time by the kernel driver and then the cmdstream would be
patched with this sync point.
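
A sketch of such submit-time patching; the placeholder ID and the bit
layout are purely illustrative, real host1x opcode encodings differ per
SoC generation:

#include <linux/types.h>

#define PLACEHOLDER_SYNCPT_ID	0xffU

/* replace the placeholder sync point ID in the command stream with the
 * one the kernel allocated for this submission */
static void patch_cmdstream_syncpt(u32 *words, unsigned int num_words,
				   u32 real_syncpt_id)
{
	unsigned int i;

	for (i = 0; i < num_words; i++)
		if ((words[i] & 0xffU) == PLACEHOLDER_SYNCPT_ID)
			words[i] = (words[i] & ~0xffU) | real_syncpt_id;
}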


Re: [RFC] Host1x/TegraDRM UAPI (sync points)

2020-06-28 Thread Mikko Perttunen

On 6/28/20 2:27 AM, Dmitry Osipenko wrote:

23.06.2020 15:09, Mikko Perttunen wrote:


### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)

Allocates a free syncpoint, returning a file descriptor representing it.
Only the owner of the file descriptor is allowed to mutate the value of
the syncpoint.

```
struct host1x_ctrl_allocate_syncpoint {
    /**
     * @fd:
     *
     * [out] New file descriptor representing the allocated syncpoint.
     */
    __s32 fd;

    __u32 reserved[3];
};


We should need at least these basic things from the sync points API:
- Execution context shouldn't be able to tamper with the sync points of
the other contexts.


This is covered by this UAPI - when submitting, as part of the 
syncpt_incr struct you pass the syncpoint FD. This way the driver can 
check the syncpoints used are correct, or program HW protection.




- Sync point could be shared with other contexts for explicit fencing.


Not sure what you specifically mean; you can get the ID out of the 
syncpoint fd and share the ID for read-only access. (Or the FD for 
read-write access)




- Sync points should work reliably.

Some problems of the current Host1x driver, like where it falls over if a
sync point value is out of sync, plus all the hang-job recovery labor,
could be easily reduced if sync point health is protected by extra UAPI
constraints.

So I think we may want the following:

1. We still need to assign a sync point ID to a DRM channel's
context. This sync point ID will be used for command stream forming,
like it is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.

2. Allocated sync point must have a clean hardware state.


What do you mean by clean hardware state?



3. Sync points should be properly refcounted. A job's sync points
shouldn't be re-used while the job is alive.

4. The job's sync point can't be re-used after the job's submission (a UAPI
constraint!). Userspace must free the sync point and allocate a new one for
the next job submission. And now we:

   - Know that the job's sync point is always in a healthy state!

   - Are not limited by the number of physically available hardware sync
points! Allocation should block until a free sync point is available.

   - Know that the logical number of the job's sync point increments matches
the SP hardware state, which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM
context after the job's submission, avoiding the need for an extra
SYNCPT_PUT IOCTL invocation by userspace after the job's submission. This
could be a job flag.


I think this would cause problems where, after a job completes but before
the fence has been waited on, the syncpoint is already recycled (especially
if the syncpoint is reset into some clean state).


I would prefer having a syncpoint for each userspace channel context 
(several of which could share a hardware channel if MLOCKing is not used).


In my experience it's then not difficult to pinpoint which job has
failed, and if each userspace channel context uses a separate syncpoint,
a hanging job wouldn't mess with other applications' jobs, either.


Mikko



We could avoid the need for statically-allocated sync points entirely for
patched cmdstreams! The sync point could be dynamically allocated at the
job's submission time by the kernel driver and then the cmdstream would be
patched with this sync point.

