Re: [RFC] Host1x/TegraDRM UAPI (sync points)
08.07.2020 13:06, Mikko Perttunen wrote:
> On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
>> 02.07.2020 15:10, Mikko Perttunen wrote:
>>> Ok, so we would have two kinds of syncpoints for the job; one
>>> for kernel job tracking; and one that userspace can
>>> manipulate as it wants to.
>>>
>>> Could we handle the job tracking syncpoint completely inside the kernel,
>>> i.e. allocate it in kernel during job submission, and add an increment
>>> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
>>> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
>>> MLOCK_RELEASE sequence at the end of each job.
>>
>> If the sync point is allocated within the kernel, then we'll need to
>> always patch all of the job's sync point increments with the ID of the
>> allocated sync point, regardless of whether the firewall is enabled or
>> not.
>
> The idea was that the job tracking increment would also be added to the
> pushbuffer in the kernel, so gathers would only have increments for the
> "user syncpoints", if any. I think this should work for THI-based
> engines (e.g. VIC), you probably have better information about
> GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to
> prevent the gathers from incrementing these job tracking syncpoints.

Could you please clarify what THI is?

>> Secondly, I'm now recalling that only one sync point can be assigned
>> to a channel at a time on newer Tegras which support sync point
>> protection. So it sounds like we don't really have any options other
>> than to allocate one sync point per channel for job usage if we want to
>> be able to push multiple jobs into a channel's pushbuffer, correct?
>
> The other way around; each syncpoint can be assigned to one channel. One
> channel may have multiple syncpoints.

Okay! So we could actually use a single sync point per job for user
increments + job tracking, correct?

>> ...
>>>> Hmm, we actually should be able to have one sync point per channel
>>>> for the job submission, similarly to what the current driver does!
>>>>
>>>> I keep forgetting about the existence of waitbases!
>>>
>>> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
>>> anyway, can't we just recalculate wait thresholds at that time?
>>
>> Yes, thresholds could be recalculated + the job should be re-formed at
>> push time then.
>>
>> It also means that jobs should always be formed only at push time if a
>> wait-command is utilized by the cmdstream, since the waits always need
>> to be patched because we won't know the thresholds until the scheduler
>> actually runs the job.
>
> Could we keep the job tracking syncpoints entirely within the kernel,
> and have all wait commands and other stuff that userspace does use the
> user syncpoints? Then kernel job tracking and userspace activity would
> be separate from each other.

I think this should work, but it's not clear to me what benefits this
brings us if we could re-use the same user-allocated sync point for both
user increments + kernel job tracking.

Is the idea to protect against userspace incrementing the sync point too
much? I.e. the job could be treated as completed before CDMA is actually
done with it.

> Alternatively, if we know that jobs can only be removed from the middle
> of pushbuffers, and not added, we could replace any removed jobs with
> syncpoint increments in the pushbuffer and any thresholds would stay
> intact.

It's a bit unlikely that jobs could ever be removed from the middle of
the hardware queue by the DRM scheduler. Anyway, it should be nicer to
shoot down everything and restart with a clean slate when necessary,
like is already supposed to happen with the scheduler.

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
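The push-time threshold recalculation discussed above can be modeled with a small standalone sketch. This is illustrative only (names are made up, not from the real driver): on resubmit, each job's completion threshold is recomputed as a running sum of increments starting from the syncpoint's current value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of push-time wait threshold recalculation: the thresholds are
 * unknown until the scheduler actually pushes the jobs, so they are
 * recomputed from the syncpoint's current value. Illustrative names. */
struct model_job {
	uint32_t num_incrs;  /* syncpoint increments this job performs */
	uint32_t threshold;  /* completion threshold, computed at push time */
};

static void recalc_thresholds(struct model_job *jobs, size_t count,
			      uint32_t syncpt_value)
{
	size_t i;

	for (i = 0; i < count; i++) {
		syncpt_value += jobs[i].num_incrs;
		jobs[i].threshold = syncpt_value;
	}
}
```

With this shape, dropping or reordering jobs before the push is harmless, since nothing depends on thresholds computed earlier.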
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
> 02.07.2020 15:10, Mikko Perttunen wrote:
>> Ok, so we would have two kinds of syncpoints for the job; one
>> for kernel job tracking; and one that userspace can
>> manipulate as it wants to.
>>
>> Could we handle the job tracking syncpoint completely inside the kernel,
>> i.e. allocate it in kernel during job submission, and add an increment
>> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
>> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
>> MLOCK_RELEASE sequence at the end of each job.
>
> If the sync point is allocated within the kernel, then we'll need to
> always patch all of the job's sync point increments with the ID of the
> allocated sync point, regardless of whether the firewall is enabled or
> not.

The idea was that the job tracking increment would also be added to the
pushbuffer in the kernel, so gathers would only have increments for the
"user syncpoints", if any. I think this should work for THI-based
engines (e.g. VIC), you probably have better information about
GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to
prevent the gathers from incrementing these job tracking syncpoints.

> Secondly, I'm now recalling that only one sync point can be assigned
> to a channel at a time on newer Tegras which support sync point
> protection. So it sounds like we don't really have any options other
> than to allocate one sync point per channel for job usage if we want to
> be able to push multiple jobs into a channel's pushbuffer, correct?

The other way around; each syncpoint can be assigned to one channel. One
channel may have multiple syncpoints.

> ...
>
>>> Hmm, we actually should be able to have one sync point per channel
>>> for the job submission, similarly to what the current driver does!
>>>
>>> I keep forgetting about the existence of waitbases!
>>
>> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
>> anyway, can't we just recalculate wait thresholds at that time?
>
> Yes, thresholds could be recalculated + the job should be re-formed at
> push time then.
>
> It also means that jobs should always be formed only at push time if a
> wait-command is utilized by the cmdstream, since the waits always need
> to be patched because we won't know the thresholds until the scheduler
> actually runs the job.

Could we keep the job tracking syncpoints entirely within the kernel,
and have all wait commands and other stuff that userspace does use the
user syncpoints? Then kernel job tracking and userspace activity would
be separate from each other.

Alternatively, if we know that jobs can only be removed from the middle
of pushbuffers, and not added, we could replace any removed jobs with
syncpoint increments in the pushbuffer and any thresholds would stay
intact.

>> Maybe a more detailed sequence list or diagram of what happens during
>> submission and recovery would be useful.
>
> The textual form + code is already good enough for me. A diagram could
> be nice to have, although it may take a bit too much effort to create +
> maintain it. But I don't mind at all if you'd want to make one :)

...

>>>> * We should be able to keep the syncpoint refcounting based on fences.
>>>
>>> The fence doesn't need the sync point itself, it only needs to get a
>>> signal when the threshold is reached or when the sync point is ceased.
>>>
>>> Imagine:
>>>
>>> - Process A creates a sync point
>>> - Process A creates a dma-fence from this sync point
>>> - Process A exports the dma-fence to process B
>>> - Process A dies
>>>
>>> What should happen to process B?
>>>
>>> - Should the dma-fence of process B get an error signal when process A
>>>   dies?
>>> - Should process B get stuck waiting endlessly for the dma-fence?
>>>
>>> This is one example of why I'm proposing that the fence shouldn't be
>>> coupled tightly to a sync point.
>>
>> As a baseline, we should consider process B to get stuck endlessly
>> (until a timeout of its choosing) waiting for the fence. In this case
>> it is avoidable, but if the ID/threshold pair is exported out of the
>> fence and is waited for otherwise, it is unavoidable. I.e. once the
>> ID/threshold are exported out of a fence, the waiter can only see the
>> fence being signaled by the threshold being reached, not by the
>> syncpoint getting freed.
>
> This is correct. If the sync point's FD is exported, or once the sync
> point is resolved from a dma-fence, then the sync point will stay alive
> until the last reference to it is dropped. I.e. if process A dies
> *after* process B started to wait on the sync point, then B will get
> stuck.
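The lifetime rule described here can be sketched as a toy refcounting model: a sync point resolved from a dma-fence (or an exported FD) holds a reference, so the creator's death alone doesn't free it. All names below are hypothetical, not from the real driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a sync point stays alive while any holder (creator, job,
 * or a waiter that resolved it from a fence) keeps a reference. */
struct model_syncpt {
	int refcount;
	bool freed;
};

static void model_syncpt_get(struct model_syncpt *sp)
{
	sp->refcount++;
}

static void model_syncpt_put(struct model_syncpt *sp)
{
	/* Returning the sync point to the pool is modeled as 'freed'. */
	if (--sp->refcount == 0)
		sp->freed = true;
}
```

In this model, process A dying drops only A's reference; B's waiter reference keeps the sync point alive (and B stays stuck until its own timeout, as described above).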
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
02.07.2020 15:10, Mikko Perttunen wrote:
> Ok, so we would have two kinds of syncpoints for the job; one
> for kernel job tracking; and one that userspace can
> manipulate as it wants to.
>
> Could we handle the job tracking syncpoint completely inside the kernel,
> i.e. allocate it in kernel during job submission, and add an increment
> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
> MLOCK_RELEASE sequence at the end of each job.

If the sync point is allocated within the kernel, then we'll need to
always patch all of the job's sync point increments with the ID of the
allocated sync point, regardless of whether the firewall is enabled or
not.

Secondly, I'm now recalling that only one sync point can be assigned to
a channel at a time on newer Tegras which support sync point protection.
So it sounds like we don't really have any options other than to
allocate one sync point per channel for job usage if we want to be able
to push multiple jobs into a channel's pushbuffer, correct?

...

>> Hmm, we actually should be able to have one sync point per channel for
>> the job submission, similarly to what the current driver does!
>>
>> I keep forgetting about the existence of waitbases!
>
> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
> anyway, can't we just recalculate wait thresholds at that time?

Yes, thresholds could be recalculated + the job should be re-formed at
push time then.

It also means that jobs should always be formed only at push time if a
wait-command is utilized by the cmdstream, since the waits always need
to be patched because we won't know the thresholds until the scheduler
actually runs the job.

> Maybe a more detailed sequence list or diagram of what happens during
> submission and recovery would be useful.

The textual form + code is already good enough for me. A diagram could
be nice to have, although it may take a bit too much effort to create +
maintain it.

But I don't mind at all if you'd want to make one :)

...

>>> * We should be able to keep the syncpoint refcounting based on fences.
>>
>> The fence doesn't need the sync point itself, it only needs to get a
>> signal when the threshold is reached or when the sync point is ceased.
>>
>> Imagine:
>>
>> - Process A creates a sync point
>> - Process A creates a dma-fence from this sync point
>> - Process A exports the dma-fence to process B
>> - Process A dies
>>
>> What should happen to process B?
>>
>> - Should the dma-fence of process B get an error signal when process A
>>   dies?
>> - Should process B get stuck waiting endlessly for the dma-fence?
>>
>> This is one example of why I'm proposing that the fence shouldn't be
>> coupled tightly to a sync point.
>
> As a baseline, we should consider process B to get stuck endlessly
> (until a timeout of its choosing) waiting for the fence. In this case
> it is avoidable, but if the ID/threshold pair is exported out of the
> fence and is waited for otherwise, it is unavoidable. I.e. once the
> ID/threshold are exported out of a fence, the waiter can only see the
> fence being signaled by the threshold being reached, not by the
> syncpoint getting freed.

This is correct. If the sync point's FD is exported, or once the sync
point is resolved from a dma-fence, then the sync point will stay alive
until the last reference to it is dropped. I.e. if process A dies
*after* process B started to wait on the sync point, then B will get
stuck.
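The point about a waiter that only has the raw ID/threshold pair can be illustrated with a wrap-safe comparison. This is only a sketch of the idea (the real host1x expiry logic is more involved): such a waiter decides completion purely by comparing the counter against the threshold, so it can never observe "the syncpoint was freed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: completion check for a waiter holding only ID/threshold.
 * Treating the distance as signed handles 32-bit counter wraparound. */
static bool threshold_reached(uint32_t value, uint32_t threshold)
{
	return (int32_t)(value - threshold) >= 0;
}
```

This is why, once ID/threshold leave the fence, only an actual increment (by hardware or by the CPU) can unblock the waiter.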
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
On 7/1/20 3:22 AM, Dmitry Osipenko wrote:
> 30.06.2020 13:26, Mikko Perttunen wrote:
>> On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
>>>
>>> Secondly, I suppose neither the GPU nor DLA could wait on a host1x
>>> sync point, correct? Or are they integrated with Host1x HW?
>>
>> They can access syncpoints directly. (That's what I alluded to in the
>> "Introduction to the hardware" section :) all those things have hardware
>> access to syncpoints)
>
> Should we CC all the Nouveau developers then, or is it a bit too early? :)

I think we have a few other issues still to resolve before that :)

>> .. rest ..
>>
>> Let me try to summarize once more for my own understanding:
>>
>> * When submitting a job, you would allocate new syncpoints for the job
>
> - Yes
>
>> * After submitting the job, those syncpoints are not usable anymore
>
> - Yes
>
> Although, thinking a bit more about it, this needs to be relaxed. It
> should be a userspace agreement/policy how to utilize sync points.
>
> For example, if we know that userspace will have multiple application
> instances all using the Tegra DRM UAPI, like the Mesa or VDPAU drivers,
> then this userspace should consider returning sync points into the pool
> for sharing with others. While something like the Opentegra Xorg
> driver, which usually has a single instance, could keep sync points
> pre-allocated.
>
> The job's sync point counter will be reset to 0 by the kernel driver
> during the submission process for each job, so we won't have the sync
> point recovery problem.
>
>> * Postfences of that job would keep references to those syncpoints so
>> they aren't freed and cleared before the fences have been released
>
> - No
>
> I suggested that the fence shouldn't refcount the sync point and should
> *only* have a reference to it; this reference will be invalidated once
> the fence is signaled by the sync point reaching the threshold or once
> the sync point is released.
>
> The sync point will have a reference to every active fence (waiting for
> the signal) that is using this sync point until the threshold is
> reached. So a fence could detach itself from the sync point + the sync
> point could detach all the fences from itself.
>
> There will be more about this below; please see the example with a dead
> process at the end of this email.
>
>> * Once postfences have been released, syncpoints would be returned to
>> the pool and reset to zero
>
> - No
>
> I'm suggesting that the sync point should be returned to the pool once
> its usage refcount reaches 0. This means that both the userspace that
> created this sync point + the executed job will keep the sync point
> alive until it is closed by userspace + the job is completed.
>
>> The advantage of this would be that at any point in time, there would
>> be a 1:1 correspondence between allocated syncpoints and jobs; so you
>> could shuffle the jobs around channels or reorder them.
>
> - Yes
>
>> Please correct if I got that wrong :)
>>
>> ---
>>
>> I have two concerns:
>>
>> * A lot of churn on syncpoints - any time you submit a job you might
>> not get a syncpoint for an indefinite time. If we allocate syncpoints
>> up-front at least you know beforehand, and then you have the syncpoint
>> as long as you need it.
>
> If you have a lot of active application instances all allocating sync
> points, then inevitably the sync point pool will be exhausted. But my
> proposal doesn't differ from yours in this regard, correct? And maybe
> there is a nice solution, please see more below!
>
>> * Plumbing the dma-fence/sync_file everywhere, and keeping it alive
>> until waits on it have completed, is more work than just having the
>> ID/threshold. This is probably mainly a problem for downstream, where
>> updating code for this would be difficult. I know that's not a proper
>> argument but I hope we can reach something that works for both worlds.
>
> You could have the ID/threshold! :) But you can't use the *job's*
> ID/threshold because you won't know them until the kernel driver's
> scheduler *completes(!)* the job's execution! The job may be re-pushed
> multiple times by the scheduler to a recovered channel if a previous
> job hangs!
>
> Now, you could allocate *two* sync points:
>
> 1. For the job itself (the job's sync point).
> 2. For the userspace to wait on (the user's sync point).
>
> The job will have to increment both these sync points (an example of
> multiple sync point usage) and you know the user's sync point
> ID/threshold!
>
> If the job times out, you *could* increment the user's sync point on
> the CPU from userspace! The counter of the user's sync point won't be
> touched by the kernel driver if the job hangs!

Ok, so we would have two kinds of syncpoints for the job; one for kernel
job tracking; and one that userspace can manipulate as it wants to.

Could we handle the job tracking syncpoint completely inside the kernel,
i.e. allocate it in kernel during job submission, and add an increment
for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
MLOCK_RELEASE sequence at the end of each job.

>> Here's a proposal in between:
>>
>> * Keep syncpoint allocation and submission in jobs
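The two-syncpoint scheme proposed above can be modeled in a few lines. This is purely illustrative (the names are made up): on normal completion both counters advance; on a hang, the kernel leaves the job syncpoint alone and userspace may advance its own syncpoint from the CPU to unblock waiters.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the two-syncpoint proposal: one syncpoint for kernel job
 * tracking, one that userspace owns and may increment from the CPU. */
struct model_submit {
	uint32_t job_sp;   /* kernel job-tracking syncpoint counter */
	uint32_t user_sp;  /* user-visible syncpoint counter */
};

static void job_complete(struct model_submit *s)
{
	s->job_sp++;   /* hardware increment at end of job (OP_DONE) */
	s->user_sp++;  /* user-visible increment from the command stream */
}

static void job_timed_out(struct model_submit *s)
{
	/* Kernel does not touch job_sp; userspace unblocks its waiters. */
	s->user_sp++;
}
```

The user's ID/threshold is therefore known at submit time, which is the whole point of splitting the two roles.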
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
30.06.2020 13:26, Mikko Perttunen wrote:
> On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
>>
>> Secondly, I suppose neither the GPU nor DLA could wait on a host1x
>> sync point, correct? Or are they integrated with Host1x HW?
>
> They can access syncpoints directly. (That's what I alluded to in the
> "Introduction to the hardware" section :) all those things have hardware
> access to syncpoints)

Should we CC all the Nouveau developers then, or is it a bit too early? :)

>> .. rest ..
>
> Let me try to summarize once more for my own understanding:
>
> * When submitting a job, you would allocate new syncpoints for the job

- Yes

> * After submitting the job, those syncpoints are not usable anymore

- Yes

Although, thinking a bit more about it, this needs to be relaxed. It
should be a userspace agreement/policy how to utilize sync points.

For example, if we know that userspace will have multiple application
instances all using the Tegra DRM UAPI, like the Mesa or VDPAU drivers,
then this userspace should consider returning sync points into the pool
for sharing with others. While something like the Opentegra Xorg driver,
which usually has a single instance, could keep sync points
pre-allocated.

The job's sync point counter will be reset to 0 by the kernel driver
during the submission process for each job, so we won't have the sync
point recovery problem.

> * Postfences of that job would keep references to those syncpoints so
> they aren't freed and cleared before the fences have been released

- No

I suggested that the fence shouldn't refcount the sync point and should
*only* have a reference to it; this reference will be invalidated once
the fence is signaled by the sync point reaching the threshold or once
the sync point is released.

The sync point will have a reference to every active fence (waiting for
the signal) that is using this sync point until the threshold is
reached. So a fence could detach itself from the sync point + the sync
point could detach all the fences from itself.

There will be more about this below; please see the example with a dead
process at the end of this email.

> * Once postfences have been released, syncpoints would be returned to
> the pool and reset to zero

- No

I'm suggesting that the sync point should be returned to the pool once
its usage refcount reaches 0. This means that both the userspace that
created this sync point + the executed job will keep the sync point
alive until it is closed by userspace + the job is completed.

> The advantage of this would be that at any point in time, there would
> be a 1:1 correspondence between allocated syncpoints and jobs; so you
> could shuffle the jobs around channels or reorder them.

- Yes

> Please correct if I got that wrong :)
>
> ---
>
> I have two concerns:
>
> * A lot of churn on syncpoints - any time you submit a job you might
> not get a syncpoint for an indefinite time. If we allocate syncpoints
> up-front at least you know beforehand, and then you have the syncpoint
> as long as you need it.

If you have a lot of active application instances all allocating sync
points, then inevitably the sync point pool will be exhausted. But my
proposal doesn't differ from yours in this regard, correct? And maybe
there is a nice solution, please see more below!

> * Plumbing the dma-fence/sync_file everywhere, and keeping it alive
> until waits on it have completed, is more work than just having the
> ID/threshold. This is probably mainly a problem for downstream, where
> updating code for this would be difficult. I know that's not a proper
> argument but I hope we can reach something that works for both worlds.

You could have the ID/threshold! :) But you can't use the *job's*
ID/threshold because you won't know them until the kernel driver's
scheduler *completes(!)* the job's execution! The job may be re-pushed
multiple times by the scheduler to a recovered channel if a previous
job hangs!

Now, you could allocate *two* sync points:

1. For the job itself (the job's sync point).
2. For the userspace to wait on (the user's sync point).

The job will have to increment both these sync points (an example of
multiple sync point usage) and you know the user's sync point
ID/threshold!

If the job times out, you *could* increment the user's sync point on
the CPU from userspace! The counter of the user's sync point won't be
touched by the kernel driver if the job hangs!

> Here's a proposal in between:
>
> * Keep syncpoint allocation and submission in jobs as in my original
> proposal

Yes, we could keep it. But, as I suggested in my other email, we may
want to extend the allocation IOCTL to support multi-syncpoint
allocation.

Secondly, if we'll want to have multi-syncpoint support for the job,
then we may want to improve the SUBMIT IOCTL like this:

	struct drm_tegra_channel_submit {
		__u32 num_usr_syncpt_incrs;
		__u64 usr_sync_points_ptr;

		__u32 num_job_syncpt_incrs;
		__u32 job_syncpt_handle;
	};

If job doesn't need to increment user's sync
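To make the proposed layout concrete, here is a sketch of how userspace might fill it for a job with no user syncpoints. The struct mirrors the proposal above (with plain fixed-width types standing in for `__u32`/`__u64`); the handle and increment counts are made-up example values, not real UAPI.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace-side mirror of the proposed submit struct (illustrative). */
struct drm_tegra_channel_submit {
	uint32_t num_usr_syncpt_incrs;
	uint64_t usr_sync_points_ptr;

	uint32_t num_job_syncpt_incrs;
	uint32_t job_syncpt_handle;
};

static void fill_submit(struct drm_tegra_channel_submit *s,
			uint32_t job_handle, uint32_t job_incrs)
{
	s->num_usr_syncpt_incrs = 0;  /* no user syncpoints in this job */
	s->usr_sync_points_ptr = 0;
	s->num_job_syncpt_incrs = job_incrs;
	s->job_syncpt_handle = job_handle;
}
```

A job that also increments user syncpoints would instead point `usr_sync_points_ptr` at an array of per-syncpoint descriptors and set the count accordingly.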
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
On 6/29/20 10:42 PM, Dmitry Osipenko wrote: Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync point, correct? Or are they integrated with Host1x HW? They can access syncpoints directly. (That's what I alluded to in the "Introduction to the hardware" section :) all those things have hardware access to syncpoints) > > .. rest .. > Let me try to summarize once more for my own understanding: * When submitting a job, you would allocate new syncpoints for the job * After submitting the job, those syncpoints are not usable anymore * Postfences of that job would keep references to those syncpoints so they aren't freed and cleared before the fences have been released * Once postfences have been released, syncpoints would be returned to the pool and reset to zero The advantage of this would be that at any point in time, there would be a 1:1 correspondence between allocated syncpoints and jobs; so you could shuffle the jobs around channels or reorder them. Please correct if I got that wrong :) --- I have two concerns: * A lot of churn on syncpoints - any time you submit a job you might not get a syncpoint for an indefinite time. If we allocate syncpoints up-front at least you know beforehand, and then you have the syncpoint as long as you need it. * Plumbing the dma-fence/sync_file everywhere, and keeping it alive until waits on it have completed, is more work than just having the ID/threshold. This is probably mainly a problem for downstream, where updating code for this would be difficult. I know that's not a proper argument but I hope we can reach something that works for both worlds. Here's a proposal in between: * Keep syncpoint allocation and submission in jobs as in my original proposal * Don't attempt to recover user channel contexts. 
What this means: * If we have a hardware channel per context (MLOCKing), just tear down the channel * Otherwise, we can just remove (either by patching or by full teardown/resubmit of the channel) all jobs submitted by the user channel context that submitted the hanging job. Jobs of other contexts would be undisturbed (though potentially delayed, which could be taken into account and timeouts adjusted) * If this happens, we can set removed jobs' post-fences to error status and user will have to resubmit them. * We should be able to keep the syncpoint refcounting based on fences. This can be made more fine-grained by not caring about the user channel context, but tearing down all jobs with the same syncpoint. I think the result would be that we can get either what you described (or how I understood it in the summary in the beginning of the message), or a more traditional syncpoint-per-userctx workflow, depending on how the userspace decides to allocate syncpoints. If needed, the kernel can still do e.g. reordering (you mentioned job priorities) at syncpoint granularity, which, if the userspace followed the model you described, would be the same thing as job granularity. (Maybe it would be more difficult with current drm_scheduler, sorry, haven't had the time yet to read up on that. Dealing with clearing work stuff up before summer vacation) Mikko ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
29.06.2020 13:27, Mikko Perttunen wrote:

...

>> 3. Sync points should be properly refcounted. A job's sync points
>> shouldn't be re-used while the job is alive.
>>
>> 4. The job's sync point can't be re-used after the job's submission
>> (UAPI constraint!). Userspace must free the sync point and allocate a
>> new one for the next job submission.
>>
>> And now we:
>>
>> - Know that the job's sync point is always in a healthy state!
>> - Are not limited by the number of physically available hardware sync
>>   points! Allocation should block until a free sync point is available.

It also occurred to me that if allocation is blocking and there is a
need to allocate multiple sync points for a job, then the IOCTL should
be able to allocate multiple sync points at once. This will prevent an
interlock situation where two contexts could block on each other.
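The interlock being avoided is the classic partial-hold deadlock: two contexts each grab some syncpoints and then block waiting for the rest. An all-or-nothing allocation, sketched below as a toy userspace model (illustrative names, not real driver code), sidesteps it because a failed request holds nothing.

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE 4  /* tiny pool for the model */

/* All-or-nothing allocation: either every requested sync point is
 * taken from the pool, or none are marked busy at all. */
static bool alloc_many(bool *busy, int want, int *out_ids)
{
	int found = 0, i;

	for (i = 0; i < POOL_SIZE && found < want; i++)
		if (!busy[i])
			out_ids[found++] = i;

	if (found < want)
		return false;  /* nothing was reserved: no partial hold */

	for (i = 0; i < want; i++)
		busy[out_ids[i]] = true;
	return true;
}
```

A blocking IOCTL built on this rule would sleep until the whole request can be satisfied at once, rather than holding a partial set while sleeping.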
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
29.06.2020 13:27, Mikko Perttunen wrote:

...

>>>> 4. The job's sync point can't be re-used after the job's submission
>>>> (UAPI constraint!). Userspace must free the sync point and allocate
>>>> a new one for the next job submission.
>>>>
>>>> And now we:
>>>>
>>>> - Know that the job's sync point is always in a healthy state!
>>>> - Are not limited by the number of physically available hardware
>>>>   sync points! Allocation should block until a free sync point is
>>>>   available.
>>>> - The logical number of the job's sync point increments matches the
>>>>   SP hardware state! Which is handy for debugging a job.
>>>>
>>>> Optionally, the job's sync point could be auto-removed from the DRM
>>>> context after the job's submission, avoiding the need for an extra
>>>> SYNCPT_PUT IOCTL invocation by userspace after the job's submission.
>>>> Could be a job's flag.
>>>
>>> I think this would cause problems where after a job completes but
>>> before the fence has been waited, the syncpoint is already recycled
>>> (especially if the syncpoint is reset into some clean state).
>>
>> Exactly, good point! The dma-fence shouldn't be hardwired to the sync
>> point in order to avoid this situation :)
>>
>> Please take a look at the fence implementation that I made for the
>> grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
>> to a sync point by host1x_fence_create(). Once the job is completed,
>> the host1x-fence is detached from the sync point [5][6] and the sync
>> point could be recycled safely!
>
> What if the fence has been programmed as a prefence to another channel
> (that is getting delayed), or to the GPU, or some other accelerator like
> DLA, or maybe some other VM? Those don't know the dma_fence has been
> signaled, they can only rely on the syncpoint ID/threshold pair.

The explicit job's fence is always just a dma-fence; it's not tied to a
host1x-fence and it should be waited on for a signal on the CPU. If you
want to make a job wait for a sync point on hardware, then you should
use the drm_tegra_submit_command wait-command.

Again, please notice that the DRM scheduler supports the job-submitted
fence! This dma-fence will signal once the job is pushed to hardware for
execution, so it shouldn't be a problem to maintain job ordering for
complex jobs without much hassle.

We'll need to write some userspace to check how it works in practice :)
For now I've really only had experience with simple jobs.

Secondly, I suppose neither the GPU nor DLA could wait on a host1x sync
point, correct? Or are they integrated with Host1x HW?

Anyway, it shouldn't be difficult to resolve a dma-fence into a
host1x-fence, get the SP ID and maintain the SP's liveness. Please see
more below.

In the grate-driver I made all sync points refcounted, so a sync point
won't be recycled while it has active users [1][2][3] in the kernel (or
userspace).

[1] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L428
[2] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L1206
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L163

Now, grate-kernel isn't a 100% complete implementation, as I already
mentioned before. The host1x-fence doesn't have a reference to a sync
point as you may see in the code; this is because the userspace sync
points are not implemented in the grate-driver. But nothing stops us
from adding that SP reference, and then we could simply do the following
in the code:

	struct dma_fence *host1x_fence_create(syncpt, ...)
	{
		...
		fence->sp = syncpt;
		...
		return &fence->base;
	}

	void host1x_syncpt_signal_fence(struct host1x_fence *fence)
	{
		...
		fence->sp = NULL;
	}

	irqreturn_t host1x_hw_syncpt_isr()
	{
		spin_lock(&host1x_syncpts_lock);
		...
		host1x_syncpt_signal_fence(sp->fence);
		...
		spin_unlock(&host1x_syncpts_lock);
	}

	void host1x_submit_job(job)
	{
		...
		spin_lock_irqsave(&host1x_syncpts_lock);
		sp = host1x_syncpt_get(host1x_fence->sp);
		spin_unlock_irqrestore(&host1x_syncpts_lock);
		...
		if (sp) {
			push(WAIT(sp->id, host1x_fence->threshold));
			job->sync_points = sp;
		}
	}

	void host1x_free_job(job)
	{
		host1x_syncpt_put(job->sync_points);
		...
	}

For example: if you share a host1x-fence (dma-fence) with a GPU context,
then the fence's SP won't be released until the GPU's context is done
with the SP. The GPU's job will be timed out if the shared SP doesn't
get incremented, and this is a totally okay situation.

Does this answer your question?

===

I'm not familiar with the Host1x VMs, so please let me use my
imagination here:

In the case of a VM we could have a special VM-shared sync point type.
The userspace will allocate this special VM SP using ALLOCATE_SYNCPOINT
and this SP won't be used for a job(!). This is the case where a job
will need to increment multiple sync points: its own SP + the VM's SP.

If job h
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
29.06.2020 13:27, Mikko Perttunen wrote:

...

>> We don't need a dedicated sync point FD for all kinds of jobs, do we?
>> For example, I don't see why a sync point FD may be needed in the case
>> of Opentegra jobs.
>
> I think it's cleaner if we have just one way to allocate syncpoints, and
> then those syncpoints can be passed to different things depending on the
> situation.
>
> If we want to protect direct incrementing while a job is submitted, we
> could have a locking API where an ongoing job can take a lock refcount
> in the syncpoint, and incrementing would return -EBUSY.

Okay, let's go with this for now.
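The agreed locking idea can be sketched as a small model: in-flight jobs take a lock refcount on the syncpoint, and direct CPU increments fail with -EBUSY while any lock is held. Names and structure are hypothetical, for illustration only.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the -EBUSY locking scheme for CPU increments. */
struct locked_syncpt {
	unsigned int value;  /* syncpoint counter */
	unsigned int locks;  /* number of in-flight jobs holding the lock */
};

static int locked_syncpt_incr(struct locked_syncpt *sp)
{
	if (sp->locks > 0)
		return -EBUSY;  /* a submitted job owns the syncpoint */
	sp->value++;
	return 0;
}
```

A submit would bump `locks` before pushing the job and drop it on completion, so the window where userspace cannot race the hardware is exactly the job's lifetime.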
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
On 6/29/20 5:36 AM, Dmitry Osipenko wrote:
> 28.06.2020 12:44, Mikko Perttunen wrote:
>> On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>>
>>>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>>>
>>>> Allocates a free syncpoint, returning a file descriptor representing
>>>> it. Only the owner of the file descriptor is allowed to mutate the
>>>> value of the syncpoint.
>>>>
>>>> ```
>>>> struct host1x_ctrl_allocate_syncpoint {
>>>>         /**
>>>>          * @fd:
>>>>          *
>>>>          * [out] New file descriptor representing the allocated
>>>>          * syncpoint.
>>>>          */
>>>>         __s32 fd;
>>>>
>>>>         __u32 reserved[3];
>>>> };
>>>> ```
>>>
>>> We need at least these basic things from the sync points API:
>>>
>>> - Execution context shouldn't be able to tamper with sync points of
>>> the other contexts.
>>
>> This is covered by this UAPI - when submitting, as part of the
>> syncpt_incr struct you pass the syncpoint FD. This way the driver can
>> check the syncpoints used are correct, or program HW protection.
>>
>>> - Sync point could be shared with other contexts for explicit fencing.
>>
>> Not sure what you specifically mean; you can get the ID out of the
>> syncpoint fd and share the ID for read-only access. (Or the FD for
>> read-write access)
>
> I enumerated the overall points that the UAPI should provide to us,
> just for clarity. Not like you haven't covered any of them, sorry for
> the confusion! :)
>
> Please see more comments below!

Ok, good :)

>>> - Sync points should work reliably.
>>>
>>> Some problems of the current Host1x driver, like where it falls over
>>> if the sync point value is out-of-sync + all the hang-job recovery
>>> labor, could be easily reduced if sync point health is protected by
>>> extra UAPI constraints.
>>>
>>> So I think we may want the following:
>>>
>>> 1. We still need to assign a sync point ID to a DRM channel's
>>> context. This sync point ID will be used for forming the command
>>> stream, like it is done by the current staging UAPI.
>>>
>>> So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.
>
> My point here is that the UAPI shouldn't be able to increment the
> job's sync point using the SYNCPOINT_INCREMENT IOCTL, which is another
> UAPI constraint.
>
> I'm suggesting that we should have two methods of sync point
> allocation:
>
> 1) Sync point that could be used only by a submitted job.
> 2) Sync point that could be incremented by CPU.
>
> The first method will allocate a raw sync point ID that is assigned to
> the channel's context. This ID will be used for the job's completion
> tracking. Perhaps this method could also optionally return a sync
> point FD if you'd want to wait on this sync point by another job.
>
> We don't need a dedicated sync point FD for all kinds of jobs, do we?
> For example, I don't see why a sync point FD may be needed in the case
> of Opentegra jobs.

I think it's cleaner if we have just one way to allocate syncpoints, and
then those syncpoints can be passed to different things depending on the
situation.

If we want to protect direct incrementing while a job is submitted, we
could have a locking API where an ongoing job can take a lock refcount
in the syncpoint, and incrementing would return -EBUSY.

>>> 2. Allocated sync point must have a clean hardware state.
>>
>> What do you mean by clean hardware state?
>
> I mean that the sync point should have a predictable state [1]; it
> shouldn't accidentally fire during hardware programming, for example.
>
> [1] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L132
>
> For a submitted job, the job's sync point state could be reset at
> submission time, for example like I did it in the grate-kernel's
> experimental driver [2].
>
> [2] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel.c#L145
>
>>> 3. Sync points should be properly refcounted. A job's sync points
>>> shouldn't be re-used while the job is alive.
>>>
>>> 4. The job's sync point can't be re-used after the job's submission
>>> (UAPI constraint!). Userspace must free the sync point and allocate
>>> a new one for the next job submission.
>>>
>>> And now we:
>>>
>>> - Know that the job's sync point is always in a healthy state!
>>>
>>> - We're not limited by the number of physically available hardware
>>> sync points! Allocation should block until a free sync point is
>>> available.
>>>
>>> - The logical number of the job's sync point increments matches the
>>> SP hardware state! Which is handy for a job's debugging.
>>>
>>> Optionally, the job's sync point could be auto-removed from the DRM
>>> context after the job's submission, avoiding the need for an extra
>>> SYNCPT_PUT IOCTL invocation to be done by userspace after the job's
>>> submission. Could be a job's flag.
>>
>> I think this would cause problems where after a job completes but
>> before the fence has been waited, the syncpoint is already recycled
>> (especially if the syncpoint is reset into some clean state).
>
> Exactly, good point!
>
> The dma-fence shouldn't be hardwired to the sync point in order to
> avoid this situation :)
>
> Please take a look at the fence implementation that I made for the
> grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
28.06.2020 12:44, Mikko Perttunen wrote:
> On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>
>>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>>
>>> Allocates a free syncpoint, returning a file descriptor representing
>>> it. Only the owner of the file descriptor is allowed to mutate the
>>> value of the syncpoint.
>>>
>>> ```
>>> struct host1x_ctrl_allocate_syncpoint {
>>>         /**
>>>          * @fd:
>>>          *
>>>          * [out] New file descriptor representing the allocated
>>>          * syncpoint.
>>>          */
>>>         __s32 fd;
>>>
>>>         __u32 reserved[3];
>>> };
>>> ```
>>
>> We need at least these basic things from the sync points API:
>>
>> - Execution context shouldn't be able to tamper with sync points of
>> the other contexts.
>
> This is covered by this UAPI - when submitting, as part of the
> syncpt_incr struct you pass the syncpoint FD. This way the driver can
> check the syncpoints used are correct, or program HW protection.
>
>> - Sync point could be shared with other contexts for explicit fencing.
>
> Not sure what you specifically mean; you can get the ID out of the
> syncpoint fd and share the ID for read-only access. (Or the FD for
> read-write access)

I enumerated the overall points that the UAPI should provide to us, just
for clarity. Not like you haven't covered any of them, sorry for the
confusion! :)

Please see more comments below!

>> - Sync points should work reliably.
>>
>> Some problems of the current Host1x driver, like where it falls over
>> if the sync point value is out-of-sync + all the hang-job recovery
>> labor, could be easily reduced if sync point health is protected by
>> extra UAPI constraints.
>
>> So I think we may want the following:
>>
>> 1. We still need to assign a sync point ID to a DRM channel's
>> context. This sync point ID will be used for forming the command
>> stream, like it is done by the current staging UAPI.
>>
>> So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.

My point here is that the UAPI shouldn't be able to increment the job's
sync point using the SYNCPOINT_INCREMENT IOCTL, which is another UAPI
constraint.

I'm suggesting that we should have two methods of sync point allocation:

1) Sync point that could be used only by a submitted job.
2) Sync point that could be incremented by CPU.

The first method will allocate a raw sync point ID that is assigned to
the channel's context. This ID will be used for the job's completion
tracking. Perhaps this method could also optionally return a sync point
FD if you'd want to wait on this sync point by another job.

We don't need a dedicated sync point FD for all kinds of jobs, do we?
For example, I don't see why a sync point FD may be needed in the case
of Opentegra jobs.

>> 2. Allocated sync point must have a clean hardware state.
>
> What do you mean by clean hardware state?

I mean that the sync point should have a predictable state [1]; it
shouldn't accidentally fire during hardware programming, for example.

[1] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L132

For a submitted job, the job's sync point state could be reset at
submission time, for example like I did it in the grate-kernel's
experimental driver [2].

[2] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel.c#L145

>>
>> 3. Sync points should be properly refcounted. A job's sync points
>> shouldn't be re-used while the job is alive.
>>
>> 4. The job's sync point can't be re-used after the job's submission
>> (UAPI constraint!). Userspace must free the sync point and allocate a
>> new one for the next job submission.
>>
>> And now we:
>>
>> - Know that the job's sync point is always in a healthy state!
>>
>> - We're not limited by the number of physically available hardware
>> sync points! Allocation should block until a free sync point is
>> available.
>>
>> - The logical number of the job's sync point increments matches the
>> SP hardware state! Which is handy for a job's debugging.
>>
>> Optionally, the job's sync point could be auto-removed from the DRM
>> context after the job's submission, avoiding the need for an extra
>> SYNCPT_PUT IOCTL invocation to be done by userspace after the job's
>> submission. Could be a job's flag.
>
> I think this would cause problems where after a job completes but
> before the fence has been waited, the syncpoint is already recycled
> (especially if the syncpoint is reset into some clean state).

Exactly, good point!

The dma-fence shouldn't be hardwired to the sync point in order to
avoid this situation :)

Please take a look at the fence implementation that I made for the
grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
to a sync point by host1x_fence_create(). Once the job is completed,
the host1x-fence is detached from the sync point [5][6] and the sync
point could be recycled safely!

[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/hos
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
23.06.2020 15:09, Mikko Perttunen wrote:
>
> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>
> Allocates a free syncpoint, returning a file descriptor representing
> it. Only the owner of the file descriptor is allowed to mutate the
> value of the syncpoint.
>
> ```
> struct host1x_ctrl_allocate_syncpoint {
>         /**
>          * @fd:
>          *
>          * [out] New file descriptor representing the allocated
>          * syncpoint.
>          */
>         __s32 fd;
>
>         __u32 reserved[3];
> };
> ```

We need at least these basic things from the sync points API:

- Execution context shouldn't be able to tamper with sync points of the
other contexts.

- Sync point could be shared with other contexts for explicit fencing.

- Sync points should work reliably.

Some problems of the current Host1x driver, like where it falls over if
the sync point value is out-of-sync + all the hang-job recovery labor,
could be easily reduced if sync point health is protected by extra UAPI
constraints.

So I think we may want the following:

1. We still need to assign a sync point ID to a DRM channel's context.
This sync point ID will be used for forming the command stream, like it
is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.

2. Allocated sync point must have a clean hardware state.

3. Sync points should be properly refcounted. A job's sync points
shouldn't be re-used while the job is alive.

4. The job's sync point can't be re-used after the job's submission
(UAPI constraint!). Userspace must free the sync point and allocate a
new one for the next job submission.

And now we:

- Know that the job's sync point is always in a healthy state!

- We're not limited by the number of physically available hardware sync
points! Allocation should block until a free sync point is available.

- The logical number of the job's sync point increments matches the SP
hardware state! Which is handy for a job's debugging.
Optionally, the job's sync point could be auto-removed from the DRM
context after the job's submission, avoiding the need for an extra
SYNCPT_PUT IOCTL invocation to be done by userspace after the job's
submission. Could be a job's flag.

We could avoid the need for statically-allocated sync points entirely
for patched cmdstreams! The sync point could be dynamically allocated
at the job's submission time by the kernel driver, and the cmdstream
would then be patched with this sync point.
Re: [RFC] Host1x/TegraDRM UAPI (sync points)
On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
>>
>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>
>> Allocates a free syncpoint, returning a file descriptor representing
>> it. Only the owner of the file descriptor is allowed to mutate the
>> value of the syncpoint.
>>
>> ```
>> struct host1x_ctrl_allocate_syncpoint {
>>         /**
>>          * @fd:
>>          *
>>          * [out] New file descriptor representing the allocated
>>          * syncpoint.
>>          */
>>         __s32 fd;
>>
>>         __u32 reserved[3];
>> };
>> ```
>
> We need at least these basic things from the sync points API:
>
> - Execution context shouldn't be able to tamper with sync points of
> the other contexts.

This is covered by this UAPI - when submitting, as part of the
syncpt_incr struct you pass the syncpoint FD. This way the driver can
check the syncpoints used are correct, or program HW protection.

> - Sync point could be shared with other contexts for explicit fencing.

Not sure what you specifically mean; you can get the ID out of the
syncpoint fd and share the ID for read-only access. (Or the FD for
read-write access)

> - Sync points should work reliably.
>
> Some problems of the current Host1x driver, like where it falls over
> if the sync point value is out-of-sync + all the hang-job recovery
> labor, could be easily reduced if sync point health is protected by
> extra UAPI constraints.
>
> So I think we may want the following:
>
> 1. We still need to assign a sync point ID to a DRM channel's
> context. This sync point ID will be used for forming the command
> stream, like it is done by the current staging UAPI.
>
> So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.
>
> 2. Allocated sync point must have a clean hardware state.

What do you mean by clean hardware state?

> 3. Sync points should be properly refcounted. A job's sync points
> shouldn't be re-used while the job is alive.
>
> 4. The job's sync point can't be re-used after the job's submission
> (UAPI constraint!). Userspace must free the sync point and allocate a
> new one for the next job submission.
>
> And now we:
>
> - Know that the job's sync point is always in a healthy state!
>
> - We're not limited by the number of physically available hardware
> sync points! Allocation should block until a free sync point is
> available.
>
> - The logical number of the job's sync point increments matches the
> SP hardware state! Which is handy for a job's debugging.
>
> Optionally, the job's sync point could be auto-removed from the DRM
> context after the job's submission, avoiding the need for an extra
> SYNCPT_PUT IOCTL invocation to be done by userspace after the job's
> submission. Could be a job's flag.

I think this would cause problems where after a job completes but
before the fence has been waited, the syncpoint is already recycled
(especially if the syncpoint is reset into some clean state).

I would prefer having a syncpoint for each userspace channel context
(several of which could share a hardware channel if MLOCKing is not
used). In my experience it's then not difficult to pinpoint which job
has failed, and if each userspace channel context uses a separate
syncpoint, a hanging job wouldn't mess with other applications' jobs,
either.

Mikko

> We could avoid the need for statically-allocated sync points entirely
> for patched cmdstreams! The sync point could be dynamically allocated
> at the job's submission time by the kernel driver, and the cmdstream
> would then be patched with this sync point.