Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Felix Kuehling

On 2018-02-14 02:01 PM, Felix Kuehling wrote:
> On 2018-02-14 01:33 PM, Christian König wrote:
>> On 14.02.2018 at 19:24, Felix Kuehling wrote:
>>> On 2018-02-14 01:15 PM, Christian König wrote:
>>>
 As I said that concept is incompatible with the requirements on A+A
 systems, so we need to find another solution to provide the
 functionality.
>>> Do you mean you need to find another solution for A+A buffer sharing
>>> specifically? Or is this a more general statement that includes the
>>> mapping of BOs to multiple VMs on different devices?
>> A more general statement. We need to find a solution that works for
>> everybody, not one that works like this only in KFD but breaks A+A
>> buffer sharing and so needs to be disabled there.
> Well, KFD sharing system memory BOs between GPUs doesn't break A+A.
> Implementing a solution for A+A that involves DMABufs will not affect
> KFD. And KFD isn't actually broken as far as I know. Once you have a
> solution for A+A, maybe it will help me understand the problem and I can
> evaluate whether the solution is applicable to KFD and worth adopting.
> But for now I have neither a good understanding of the problem, nor
> evidence that there is a problem affecting KFD, nor a way towards a
> solution.

Let me add, I'm definitely interested in your solution for P2P, because
we want to enable that for KFD for large-BAR systems. For now I'm not
upstreaming any P2P support, because I know that our current hack is
going to be superseded by the solution you're working on.

Thanks,
  Felix

>
 What's on my TODO list anyway is to extend DMA-buf to not require
 pinning and to be able to deal with P2P.
>>> Sounds good. That said, KFD is not using DMABufs here.
>>>
 The former is actually rather easy and already mostly done by sharing
 the reservation object between exporter and importer.

 The latter is a bit more tricky because I need to create the necessary
 P2P infrastructure, but even that is doable in the mid term.
>>> The sooner you can share your plans, the better. Right now I'm in a bit
>>> of limbo. I feel you're blocking KFD upstreaming based on AMDGPU plans
>>> and changes that no one has seen yet.
>> Well as far as I understand it that is not blocking for the current
>> upstreaming because you didn't plan to upstream this use case
>> anyway, did you?
> Which use case? The current patch series enables multi-GPU buffer
> sharing of system memory BOs. If it is actually broken, I can reduce the
> scope to single-GPU support. But I have no evidence that multi-GPU is
> actually broken.
>
> Regards,
>   Felix
>
>> Regards,
>> Christian.
>>
>>> Thanks,
>>>    Felix
>>>
 Regards,
 Christian.



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Felix Kuehling
On 2018-02-14 01:33 PM, Christian König wrote:
> On 14.02.2018 at 19:24, Felix Kuehling wrote:
>> On 2018-02-14 01:15 PM, Christian König wrote:
>>> On 14.02.2018 at 17:35, Felix Kuehling wrote:
 [SNIP]
> I don't see how you can have separate TTM objects referring to
> the
> same memory.
 Well that is trivial, we do this all the time with prime and I+A
 laptops.
>>> As I understand it, you use DMABuf to export/import buffers on
>>> multiple
>>> devices. I believe all devices share a single amdgpu_bo, which
>>> contains
>>> the ttm_buffer_object.
> That's incorrect as well. Normally multiple devices have multiple
> ttm_buffer_object, one for each device.
> Going a bit higher that actually makes sense because the status of
> each BO is different for each device. E.g. one device could have
> the BO
> in use while it is idle on another device.
 Can you point me where this is done? I'm looking at
 amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported
 into
 a different AMDGPU device. It creates a new GEM object, with a
 reference
 to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
 very much like the same amdgpu_bo, and consequently the same TTM BO
 being
 shared by two GEM objects and two devices.
>>> As Michel pointed out as well that stuff isn't upstream and judging
>>> from the recent requirements it will never go upstream.
>>>
> If this is enabled by any changes that break existing buffer sharing
> for
> A+A or A+I systems, please point it out to me. I'm not aware that
> this
> patch series does anything to that effect.
> As I said it completely breaks scanout with A+I systems.
 Please tell me what "it" is. What in the changes I have posted is
 breaking A+I systems. I don't see it.
>>> Using the same amdgpu_bo structure with multiple devices is what "it"
>>> means here.
>> That statement seems to contradict this previous statement by you:
>    What we do is map the same
> BO in VMs on other devices. It has no effect on GART mappings.
>>> Correct VM mapping is unaffected here.
>> Can you clarify that contradiction? Is it OK for us to map the same BO
>> in multiple VMs or not?
>
> By the current requirements I have, I think the answer is no.
>
>>> As I said that concept is incompatible with the requirements on A+A
>>> systems, so we need to find another solution to provide the
>>> functionality.
>> Do you mean you need to find another solution for A+A buffer sharing
>> specifically? Or is this a more general statement that includes the
>> mapping of BOs to multiple VMs on different devices?
>
> A more general statement. We need to find a solution that works for
> everybody, not one that works like this only in KFD but breaks A+A
> buffer sharing and so needs to be disabled there.

Well, KFD sharing system memory BOs between GPUs doesn't break A+A.
Implementing a solution for A+A that involves DMABufs will not affect
KFD. And KFD isn't actually broken as far as I know. Once you have a
solution for A+A, maybe it will help me understand the problem and I can
evaluate whether the solution is applicable to KFD and worth adopting.
But for now I have neither a good understanding of the problem, nor
evidence that there is a problem affecting KFD, nor a way towards a
solution.

>
>>
>>> What's on my TODO list anyway is to extend DMA-buf to not require
>>> pinning and to be able to deal with P2P.
>> Sounds good. That said, KFD is not using DMABufs here.
>>
>>> The former is actually rather easy and already mostly done by sharing
>>> the reservation object between exporter and importer.
>>>
>>> The latter is a bit more tricky because I need to create the necessary
>>> P2P infrastructure, but even that is doable in the mid term.
>> The sooner you can share your plans, the better. Right now I'm in a bit
>> of limbo. I feel you're blocking KFD upstreaming based on AMDGPU plans
>> and changes that no one has seen yet.
>
> Well as far as I understand it that is not blocking for the current
> upstreaming because you didn't plan to upstream this use case
> anyway, did you?

Which use case? The current patch series enables multi-GPU buffer
sharing of system memory BOs. If it is actually broken, I can reduce the
scope to single-GPU support. But I have no evidence that multi-GPU is
actually broken.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Thanks,
>>    Felix
>>
>>> Regards,
>>> Christian.
>



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Christian König

On 14.02.2018 at 19:24, Felix Kuehling wrote:

On 2018-02-14 01:15 PM, Christian König wrote:

On 14.02.2018 at 17:35, Felix Kuehling wrote:

[SNIP]

I don't see how you can have separate TTM objects referring to the
same memory.

Well that is trivial, we do this all the time with prime and I+A
laptops.

As I understand it, you use DMABuf to export/import buffers on
multiple
devices. I believe all devices share a single amdgpu_bo, which
contains
the ttm_buffer_object.

That's incorrect as well. Normally multiple devices have multiple
ttm_buffer_object, one for each device.
Going a bit higher that actually makes sense because the status of
each BO is different for each device. E.g. one device could have the BO
in use while it is idle on another device.

Can you point me where this is done? I'm looking at
amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
a different AMDGPU device. It creates a new GEM object, with a reference
to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
very much like the same amdgpu_bo, and consequently the same TTM BO being
shared by two GEM objects and two devices.

As Michel pointed out as well that stuff isn't upstream and judging
from the recent requirements it will never go upstream.


If this is enabled by any changes that break existing buffer sharing
for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.
As I said it completely breaks scanout with A+I systems.

Please tell me what "it" is. What in the changes I have posted is
breaking A+I systems. I don't see it.

Using the same amdgpu_bo structure with multiple devices is what "it"
means here.

That statement seems to contradict this previous statement by you:

   What we do is map the same
BO in VMs on other devices. It has no effect on GART mappings.

Correct VM mapping is unaffected here.

Can you clarify that contradiction? Is it OK for us to map the same BO
in multiple VMs or not?


By the current requirements I have, I think the answer is no.


As I said that concept is incompatible with the requirements on A+A
systems, so we need to find another solution to provide the
functionality.

Do you mean you need to find another solution for A+A buffer sharing
specifically? Or is this a more general statement that includes the
mapping of BOs to multiple VMs on different devices?


A more general statement. We need to find a solution that works for
everybody, not one that works like this only in KFD but breaks A+A buffer
sharing and so needs to be disabled there.





What's on my TODO list anyway is to extend DMA-buf to not require
pinning and to be able to deal with P2P.

Sounds good. That said, KFD is not using DMABufs here.


The former is actually rather easy and already mostly done by sharing
the reservation object between exporter and importer.

The latter is a bit more tricky because I need to create the necessary
P2P infrastructure, but even that is doable in the mid term.

The sooner you can share your plans, the better. Right now I'm in a bit
of limbo. I feel you're blocking KFD upstreaming based on AMDGPU plans
and changes that no one has seen yet.


Well as far as I understand it that is not blocking for the current
upstreaming because you didn't plan to upstream this use case anyway,
did you?


Regards,
Christian.



Thanks,
   Felix


Regards,
Christian.




Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Felix Kuehling
On 2018-02-14 01:15 PM, Christian König wrote:
> On 14.02.2018 at 17:35, Felix Kuehling wrote:
>> [SNIP]
>>> I don't see how you can have separate TTM objects referring to the
>>> same memory.
>> Well that is trivial, we do this all the time with prime and I+A
>> laptops.
> As I understand it, you use DMABuf to export/import buffers on
> multiple
> devices. I believe all devices share a single amdgpu_bo, which
> contains
> the ttm_buffer_object.
>>> That's incorrect as well. Normally multiple devices have multiple
>>> ttm_buffer_object, one for each device.
>>> Going a bit higher that actually makes sense because the status of
>>> each BO is different for each device. E.g. one device could have the BO
>>> in use while it is idle on another device.
>> Can you point me where this is done? I'm looking at
>> amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
>> a different AMDGPU device. It creates a new GEM object, with a reference
>> to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
>> very much like the same amdgpu_bo, and consequently the same TTM BO being
>> shared by two GEM objects and two devices.
>
> As Michel pointed out as well that stuff isn't upstream and judging
> from the recent requirements it will never go upstream.
>
>>> If this is enabled by any changes that break existing buffer sharing
>>> for
>>> A+A or A+I systems, please point it out to me. I'm not aware that this
>>> patch series does anything to that effect.
>>> As I said it completely breaks scanout with A+I systems.
>> Please tell me what "it" is. What in the changes I have posted is
>> breaking A+I systems. I don't see it.
>
> Using the same amdgpu_bo structure with multiple devices is what "it"
> means here.

That statement seems to contradict this previous statement by you:
>>>   What we do is map the same
>>> BO in VMs on other devices. It has no effect on GART mappings.
>
> Correct VM mapping is unaffected here. 
Can you clarify that contradiction? Is it OK for us to map the same BO
in multiple VMs or not?

> As I said that concept is incompatible with the requirements on A+A
> systems, so we need to find another solution to provide the
> functionality.

Do you mean you need to find another solution for A+A buffer sharing
specifically? Or is this a more general statement that includes the
mapping of BOs to multiple VMs on different devices?

> What's on my TODO list anyway is to extend DMA-buf to not require
> pinning and to be able to deal with P2P.

Sounds good. That said, KFD is not using DMABufs here.

>
> The former is actually rather easy and already mostly done by sharing
> the reservation object between exporter and importer.
>
> The latter is a bit more tricky because I need to create the necessary
> P2P infrastructure, but even that is doable in the mid term.

The sooner you can share your plans, the better. Right now I'm in a bit
of limbo. I feel you're blocking KFD upstreaming based on AMDGPU plans
and changes that no one has seen yet.

Thanks,
  Felix

>
> Regards,
> Christian.



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Christian König

On 14.02.2018 at 17:35, Felix Kuehling wrote:

[SNIP]

I don't see how you can have separate TTM objects referring to the
same memory.

Well that is trivial, we do this all the time with prime and I+A
laptops.

As I understand it, you use DMABuf to export/import buffers on multiple
devices. I believe all devices share a single amdgpu_bo, which contains
the ttm_buffer_object.

That's incorrect as well. Normally multiple devices have multiple
ttm_buffer_object, one for each device.
Going a bit higher that actually makes sense because the status of
each BO is different for each device. E.g. one device could have the BO
in use while it is idle on another device.

Can you point me where this is done? I'm looking at
amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
a different AMDGPU device. It creates a new GEM object, with a reference
to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
very much like the same amdgpu_bo, and consequently the same TTM BO being
shared by two GEM objects and two devices.


As Michel pointed out as well that stuff isn't upstream and judging from 
the recent requirements it will never go upstream.



If this is enabled by any changes that break existing buffer sharing for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.
As I said it completely breaks scanout with A+I systems.

Please tell me what "it" is. What in the changes I have posted is
breaking A+I systems. I don't see it.


Using the same amdgpu_bo structure with multiple devices is what "it" 
means here.


As I said that concept is incompatible with the requirements on A+A 
systems, so we need to find another solution to provide the functionality.


What's on my TODO list anyway is to extend DMA-buf to not require 
pinning and to be able to deal with P2P.


The former is actually rather easy and already mostly done by sharing 
the reservation object between exporter and importer.
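
As a rough illustration of the first point, here is a sketch of how an
importer-side BO could be created on top of the exporter's reservation
object, so that both devices fence against the same object and nothing has
to be pinned. The names and the amdgpu_bo_param usage below are illustrative
assumptions, not existing code:

/*
 * Sketch only: share the exporter's reservation object with the
 * importer-side BO instead of pinning the pages.
 */
static struct amdgpu_bo *import_with_shared_resv(struct amdgpu_device *adev,
                                                 struct dma_buf *dbuf)
{
        struct amdgpu_bo_param bp = {
                .size = dbuf->size,
                .byte_align = PAGE_SIZE,
                .domain = AMDGPU_GEM_DOMAIN_CPU,
                .type = ttm_bo_type_sg,
                .resv = dbuf->resv,     /* shared reservation object */
        };
        struct amdgpu_bo *bo;

        if (amdgpu_bo_create(adev, &bp, &bo))
                return NULL;

        return bo;
}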


The latter is a bit more tricky because I need to create the necessary
P2P infrastructure, but even that is doable in the mid term.


Regards,
Christian.


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Felix Kuehling
On 2018-02-14 11:50 AM, Michel Dänzer wrote:
> On 2018-02-14 05:35 PM, Felix Kuehling wrote:
>> On 2018-02-14 02:42 AM, Christian König wrote:
>>> On 14.02.2018 at 00:17, Felix Kuehling wrote:
 On 2018-02-13 02:18 PM, Felix Kuehling wrote:
> On 2018-02-13 01:15 PM, Christian König wrote:
>> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>>> I don't see how you can have separate TTM objects referring to the
>>> same memory.
>> Well that is trivial, we do this all the time with prime and I+A
>> laptops.
> As I understand it, you use DMABuf to export/import buffers on multiple
> devices. I believe all devices share a single amdgpu_bo, which contains
> the ttm_buffer_object.
>>> That's incorrect as well. Normally multiple devices have multiple
>>> ttm_buffer_object, one for each device.
>>> Going a bit higher that actually makes sense because the status of
>>> each BO is different for each device. E.g. one device could have the BO
>>> in use while it is idle on another device.
>> Can you point me where this is done? I'm looking at
>> amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
>> a different AMDGPU device. It creates a new GEM object, with a reference
>> to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
>> very much like the same amdgpu_bo, and consequently the same TTM BO being
>> shared by two GEM objects and two devices.
> amdgpu_gem_prime_foreign_bo doesn't exist in amd-staging-drm-next, let
> alone upstream. Even on current internal branches, it's no longer used
> for dma-buf import AFAICT.

You're right. It has existed on the KFD branch for so long that I was taking
it for granted. I see that on amd-staging-drm-next, importing a BO from
a different device, even with the same driver, results in pinning and
using SG tables.

Either way, the patch series discussed here doesn't touch any of that
code. All we do is map system memory BOs into multiple VMs on different
devices. And according to Christian that is OK. So I'm a bit perplexed
about the opposition I'm facing from him.
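
To make that concrete, here is a minimal sketch (not the actual KFD code;
BO reservation, page table updates and error unwinding are omitted, and the
address and flags are placeholders) of mapping one shared system memory BO
into the VMs of two different devices:

/* Sketch only: one amdgpu_bo_va per (BO, VM) pair, same shared amdgpu_bo.
 * NOTE: in real code the BO has to be reserved around these calls.
 */
static int map_bo_into_two_vms(struct amdgpu_device *adev1, struct amdgpu_vm *vm1,
                               struct amdgpu_device *adev2, struct amdgpu_vm *vm2,
                               struct amdgpu_bo *bo, uint64_t gpu_va)
{
        uint64_t flags = AMDGPU_PTE_READABLE | AMDGPU_PTE_WRITEABLE;
        struct amdgpu_bo_va *bo_va1, *bo_va2;
        int r;

        bo_va1 = amdgpu_vm_bo_add(adev1, vm1, bo);
        bo_va2 = amdgpu_vm_bo_add(adev2, vm2, bo);
        if (!bo_va1 || !bo_va2)
                return -ENOMEM;

        r = amdgpu_vm_bo_map(adev1, bo_va1, gpu_va, 0, amdgpu_bo_size(bo), flags);
        if (r)
                return r;

        return amdgpu_vm_bo_map(adev2, bo_va2, gpu_va, 0, amdgpu_bo_size(bo), flags);
}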

Maybe it's time to take a step back from discussing details that are
irrelevant to the patch series being reviewed here. I understand and I
agree that the hacks we have in the KFD branch for enabling P2P
(including amdgpu_gem_prime_foreign_bo) will not be accepted upstream.
We discussed this a few months ago with a patch series that was rejected.
Maybe your understanding has evolved beyond that since then and I'm just
catching up on that.

I also want to make it clear that none of those hacks are included in
this patch series. They are not needed for enabling multi-GPU support
with system memory for KFD. If anything slipped into this patch series
that is objectionable from your point of view, please point it out to
me, and I will gladly address it.

The constructive feedback I've gathered so far concerns:

  * Use of VMs from DRM file descriptors for KFD instead of creating our own
  o I agree with this one because it's a more efficient use of resources
  * Creating GEM objects for KFD buffers
  o I disagree with this one because we don't need GEM objects and I
haven't seen a good reason to convince me otherwise
  * Using GEM buffer handles (and IDR) instead of our own in KFD
  o Related to the previous one. I think the overhead of KFD having
its own per-process IDR is small, compared to the overhead of
creating a GEM object for each KFD BO. Again, I haven't seen a
good argument to convince me otherwise

Thanks,
  Felix

>
>



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Michel Dänzer
On 2018-02-14 05:35 PM, Felix Kuehling wrote:
> On 2018-02-14 02:42 AM, Christian König wrote:
>> On 14.02.2018 at 00:17, Felix Kuehling wrote:
>>> On 2018-02-13 02:18 PM, Felix Kuehling wrote:
 On 2018-02-13 01:15 PM, Christian König wrote:
> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>>
>> I don't see how you can have separate TTM objects referring to the
>> same memory.
> Well that is trivial, we do this all the time with prime and I+A
> laptops.
 As I understand it, you use DMABuf to export/import buffers on multiple
 devices. I believe all devices share a single amdgpu_bo, which contains
 the ttm_buffer_object.
>>
>> That's incorrect as well. Normally multiple devices have multiple
>> ttm_buffer_object, one for each device.
>> Going a bit higher that actually makes sense because the status of
>> each BO is different for each device. E.g. one device could have the BO
>> in use while it is idle on another device.
> 
> Can you point me where this is done? I'm looking at
> amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
> a different AMDGPU device. It creates a new GEM object, with a reference
> to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
> very much like the same amdgpu_bo, and consequently the same TTM BO being
> shared by two GEM objects and two devices.

amdgpu_gem_prime_foreign_bo doesn't exist in amd-staging-drm-next, let
alone upstream. Even on current internal branches, it's no longer used
for dma-buf import AFAICT.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Felix Kuehling

On 2018-02-14 03:50 AM, Michel Dänzer wrote:
> On 2018-02-13 08:18 PM, Felix Kuehling wrote:
>> On 2018-02-13 01:15 PM, Christian König wrote:
>>> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>>>
 I don't see how you can have separate TTM objects referring to the
 same memory.
>>> Well that is trivial, we do this all the time with prime and I+A laptops.
>> As I understand it, you use DMABuf to export/import buffers on multiple
>> devices. I believe all devices share a single amdgpu_bo, which contains
>> the ttm_buffer_object.
> The dma-buf exporter and importer can be different drivers, so this is
> not possible.

Yes. In the general case this is handled by SG tables. However, GEM and
AMDGPU have some special cases for importing buffers from the same driver.

The discussion here is about sharing BOs between different AMDGPU
devices (with or without involving DMABufs). I'm not talking about
sharing BOs with different drivers.

Regards
  Felix

>
>



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Felix Kuehling


On 2018-02-14 02:42 AM, Christian König wrote:
> On 14.02.2018 at 00:17, Felix Kuehling wrote:
>> On 2018-02-13 02:18 PM, Felix Kuehling wrote:
>>> On 2018-02-13 01:15 PM, Christian König wrote:
 On 13.02.2018 at 18:18, Felix Kuehling wrote:
> On 2018-02-13 12:06 PM, Christian König wrote:
>> [SNIP]
>> Ah, yeah that is also a point I wanted to talk about with you.
>>
>> The approach of using the same buffer object with multiple amdgpu
>> devices doesn't work in general.
>>
 We need a separate TTM object for each BO in each device or
>> otherwise we
>> break A+A laptops.
> I think it broke for VRAM BOs because we enabled P2P on systems that
> didn't support it properly. But at least system memory BOs can be
> shared
> quite well between devices and we do it all the time.
 Sharing VRAM BOs is one issue, but the problem goes deeper than just
 that.

 Starting with Carrizo we can scan out from system memory to avoid the
 extra copy on A+A laptops. For this to work we need the BO mapped to
 GART (and I mean a real VMID0 mapping, not just in the GTT domain).
 And for this to work in turn we need a TTM object per device and not a
 global one.
>>> I still don't understand. I think what you're talking about applies
>>> only
>>> to BOs used for scan-out. Every BO is allocated from a specific device
>>> and can only be GART-mapped on that device.
>
> Exactly that assumption is incorrect. BOs can be GART mapped into any
> device.

Fine. My point is, we're not doing that.

>
>>>   What we do is map the same
>>> BO in VMs on other devices. It has no effect on GART mappings.
>
> Correct VM mapping is unaffected here.

Great.

>
> I don't see how you can have separate TTM objects referring to the
> same memory.
 Well that is trivial, we do this all the time with prime and I+A
 laptops.
>>> As I understand it, you use DMABuf to export/import buffers on multiple
>>> devices. I believe all devices share a single amdgpu_bo, which contains
>>> the ttm_buffer_object.
>
> That's incorrect as well. Normally multiple devices have multiple
> ttm_buffer_object, one for each device.
> Going a bit higher that actually makes sense because the status of
> each BO is different for each device. E.g. one device could have the BO
> in use while it is idle on another device.

Can you point me where this is done? I'm looking at
amdgpu_gem_prime_foreign_bo. It is used if an AMDGPU BO is imported into
a different AMDGPU device. It creates a new GEM object, with a reference
to the same amdgpu BO (gobj->bo = amdgpu_bo_ref(bo)). To me this looks
very much like the same amdgpu_bo, and consequently the same TTM BO being
shared by two GEM objects and two devices.
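
For reference, a schematic sketch of the import path described above. This
is not the actual amdgpu_gem_prime_foreign_bo code (which is not upstream);
the wrapper struct and the init call are illustrative only:

/* Sketch only: wrap the exporter's amdgpu_bo in a new GEM object on the
 * importing device, so both devices end up using the same TTM BO.
 */
struct foreign_gobj {
        struct drm_gem_object base;
        struct amdgpu_bo *bo;           /* shared amdgpu_bo / TTM BO */
};

static struct drm_gem_object *
foreign_bo_import_sketch(struct drm_device *importer, struct amdgpu_bo *bo)
{
        struct foreign_gobj *gobj = kzalloc(sizeof(*gobj), GFP_KERNEL);

        if (!gobj)
                return ERR_PTR(-ENOMEM);

        /* Private init: no shmem backing file is created for the wrapper. */
        drm_gem_private_object_init(importer, &gobj->base, amdgpu_bo_size(bo));
        gobj->bo = amdgpu_bo_ref(bo);   /* same BO, second reference */

        return &gobj->base;
}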

>
> Same is true for placement of the BO. E.g. a VRAM placement of one
> device is actually a GTT placement for another.
>
>>>   The only way you can have a new TTM buffer object
>>> per device is by using SG tables and pinning the BOs. But I think we
>>> want to avoid pinning BOs.
>>>
>>> What we do does not involve pinning of BOs, even when they're shared
>>> between multiple devices' VMs.
>>>
>> That is also the reason we had to disable this feature again in the
>> hybrid branches.
> What you disabled on hybrid branches was P2P, which only affects
> large-BAR systems. Sharing of system memory BOs between devices still
> works fine.
 No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
 and Raven. In addition to that we found that it breaks some aspects of
 the user space interface.
>>> Let me check that with my current patch series. The patches I submitted
>>> here shouldn't include anything that breaks the use cases you describe.
>>> But I'm quite sure it will support sharing BOs between multiple
>>> devices'
>>> VMs.
>> Confirmed with the current patch series. I can allocate buffers on GPU 1
>> and map them into a VM on GPU 2.
>
> As I said VM mapping is not the problem here.

Great.

>
>> BTW, this is the normal mode of operation for system memory BOs on
>> multi-GPU systems. Logically system memory BOs belong to the CPU node in
>> the KFD topology. So the Thunk API isn't told which GPU to allocate the
>> system memory BO from. It just picks the first one. Then we can map
>> those BOs on any GPUs we want. If we want to use them on GPU2, that's
>> where they get mapped.
>
> Well keeping NUMA in mind that actually sounds like a design problem
> to me.

We're not doing NUMA with TTM because TTM is not NUMA aware. We have
some prototype NUMA code in the Thunk that uses userptr to map memory
allocated with NUMA awareness to the GPU.

>
> On NUMA systems some parts of the system memory can be "closer" to a
> GPU than other parts.

Yes, I understand that.

>
>> So I just put two GPUs in a system and ran a test on GPU2. The system
>> memory buffers are allocated from the GPU1 device, but mapped into the
GPU2 VM. The tests work normally.

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-14 Thread Michel Dänzer
On 2018-02-13 08:18 PM, Felix Kuehling wrote:
> On 2018-02-13 01:15 PM, Christian König wrote:
>> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>>
>>> I don't see how you can have separate TTM objects referring to the
>>> same memory.
>>
>> Well that is trivial, we do this all the time with prime and I+A laptops.
> 
> As I understand it, you use DMABuf to export/import buffers on multiple
> devices. I believe all devices share a single amdgpu_bo, which contains
> the ttm_buffer_object.

The dma-buf exporter and importer can be different drivers, so this is
not possible.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 13.02.2018 at 20:22, Felix Kuehling wrote:

On 2018-02-13 01:45 PM, Christian König wrote:

On 13.02.2018 at 17:56, Felix Kuehling wrote:

[SNIP]
Each process gets a whole page of the doorbell aperture assigned to it.
The assumption is that amdgpu only uses the first page of the doorbell
aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
used as the offset into the doorbell page. On GFX9 the hardware does
some engine-specific doorbell routing, so we added another layer of
doorbell management that's decoupled from the queue ID.

Either way, an entire doorbell page gets mapped into user mode and user
mode knows the offset of the doorbells for specific queues. The mapping
is currently handled by kfd_mmap in kfd_chardev.c.

Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost
looks like you map different doorbells with the same offset depending
on which process is calling this.

Is that correct? If yes then that would be illegal and a problem if
I'm not completely mistaken.

Why is that a problem? Each process has its own file descriptor. The
mapping is done using io_remap_pfn_range in kfd_doorbell_mmap.


Yeah, but they all share the same file address space from the inode, don't
they?


E.g. imagine that you map an offset from a normal file into process A 
and the same offset from the same file into process B, what do you get? 
Exactly, the same page mapped into both processes.


Now take a look at the KFD interface: you map the same offset from the
same file (or rather inode) into two different processes and get two 
different things.



This is nothing new. It's been done like this forever even on Kaveri and 
Carrizo.


It most likely doesn't cause any problems off hand or otherwise we would
have noticed already, but at a bare minimum that totally confuses the
reverse mapping code.


I need to think about whether that is really an issue or not and, if yes, how
to fix/mitigate it.


Regards,
Christian.




Do you simply assume that after evicting a process it always needs to
be restarted without checking if it actually does something? Or how
does that work?

Exactly.

Ok, understood. Well that limits the usefulness of the whole eviction
drastically.


With the later addition of GPU self-dispatch a page-fault based
mechanism wouldn't work any more. We have to restart the queues blindly
with a timer. See evict_process_worker, which schedules the restore with
a delayed worker.
The user mode queue ABI specifies that user mode updates both the
doorbell and a WPTR in memory. When we restart queues we (or the CP
firmware) use the WPTR to make sure we catch up with any work that was
submitted while the queues were unmapped.

Putting cross-process work dispatch aside for a moment, GPU
self-dispatch works only when there is work running on the GPU.

So you can still check if there is some work pending after you
unmapped everything and only restart the queues when there is new work
based on the page fault.

In other words either there is work pending and it doesn't matter if
it was sent by the GPU or by the CPU or there is no work pending and
we can delay restarting everything until there is.

That sounds like a useful optimization.

Regards,
   Felix


Regards,
Christian.


Regards,
    Felix


Regards,
Christian.



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 14.02.2018 at 00:17, Felix Kuehling wrote:

On 2018-02-13 02:18 PM, Felix Kuehling wrote:

On 2018-02-13 01:15 PM, Christian König wrote:

On 13.02.2018 at 18:18, Felix Kuehling wrote:

On 2018-02-13 12:06 PM, Christian König wrote:

[SNIP]
Ah, yeah that is also a point I wanted to talk about with you.

The approach of using the same buffer object with multiple amdgpu
devices doesn't work in general.

We need a separate TTM object for each BO in each device or otherwise we
break A+A laptops.

I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time.

Sharing VRAM BOs is one issue, but the problem goes deeper than just
that.

Starting with Carrizo we can scan out from system memory to avoid the
extra copy on A+A laptops. For this to work we need the BO mapped to
GART (and I mean a real VMID0 mapping, not just in the GTT domain).
And for this to work in turn we need a TTM object per device and not a
global one.

I still don't understand. I think what you're talking about applies only
to BOs used for scan-out. Every BO is allocated from a specific device
and can only be GART-mapped on that device.


Exactly that assumption is incorrect. BOs can be GART mapped into any 
device.



  What we do is map the same
BO in VMs on other devices. It has no effect on GART mappings.


Correct VM mapping is unaffected here.


I don't see how you can have separate TTM objects referring to the
same memory.

Well that is trivial, we do this all the time with prime and I+A laptops.

As I understand it, you use DMABuf to export/import buffers on multiple
devices. I believe all devices share a single amdgpu_bo, which contains
the ttm_buffer_object.


That's incorrect as well. Normally multiple devices have multiple 
ttm_buffer_object, one for each device.


Going a bit higher that actually makes sense because the status of each 
BO is different for each device. E.g. one device could have the BO in
use while it is idle on another device.


Same is true for placement of the BO. E.g. a VRAM placement of one 
device is actually a GTT placement for another.



  The only way you can have a new TTM buffer object
per device is by using SG tables and pinning the BOs. But I think we
want to avoid pinning BOs.

What we do does not involve pinning of BOs, even when they're shared
between multiple devices' VMs.


That is also the reason we had to disable this feature again in the
hybrid branches.

What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.

No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
and Raven. In addition to that we found that it breaks some aspects of
the user space interface.

Let me check that with my current patch series. The patches I submitted
here shouldn't include anything that breaks the use cases you describe.
But I'm quite sure it will support sharing BOs between multiple devices'
VMs.

Confirmed with the current patch series. I can allocate buffers on GPU 1
and map them into a VM on GPU 2.


As I said VM mapping is not the problem here.


BTW, this is the normal mode of operation for system memory BOs on
multi-GPU systems. Logically system memory BOs belong to the CPU node in
the KFD topology. So the Thunk API isn't told which GPU to allocate the
system memory BO from. It just picks the first one. Then we can map
those BOs on any GPUs we want. If we want to use them on GPU2, that's
where they get mapped.


Well keeping NUMA in mind that actually sounds like a design problem to me.

On NUMA systems some parts of the system memory can be "closer" to a GPU
than other parts.



So I just put two GPUs in a system and ran a test on GPU2. The system
memory buffers are allocated from the GPU1 device, but mapped into the
GPU2 VM. The tests work normally.

If this is enabled by any changes that break existing buffer sharing for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.


As I said it completely breaks scanout with A+I systems.

Overall it looks like the change causes more problems than it solves.
So I'm going to block upstreaming it until we have found a way to make 
it work for everybody.


Regards,
Christian.



Regards,
   Felix


Regards,
   Felix


 So the end result is that we probably need to revert it and find a
 different solution. I've been working on this for a couple of weeks
 now and should have something ready after I'm done with the PASID
handling.

Regards,
Christian.




Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 02:18 PM, Felix Kuehling wrote:
> On 2018-02-13 01:15 PM, Christian König wrote:
>> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>>> On 2018-02-13 12:06 PM, Christian König wrote:
 [SNIP]
 Ah, yeah that is also a point I wanted to talk about with you.

 The approach of using the same buffer object with multiple amdgpu
 devices doesn't work in general.

 We need a separate TTM object for each BO in each device or otherwise we
 break A+A laptops.
>>> I think it broke for VRAM BOs because we enabled P2P on systems that
>>> didn't support it properly. But at least system memory BOs can be shared
>>> quite well between devices and we do it all the time.
>> Sharing VRAM BOs is one issue, but the problem goes deeper than just
>> that.
>>
>> Starting with Carrizo we can scan out from system memory to avoid the
>> extra copy on A+A laptops. For this to work we need the BO mapped to
>> GART (and I mean a real VMID0 mapping, not just in the GTT domain).
>> And for this to work in turn we need a TTM object per device and not a
>> global one.
> I still don't understand. I think what you're talking about applies only
> to BOs used for scan-out. Every BO is allocated from a specific device
> and can only be GART-mapped on that device. What we do is map the same
> BO in VMs on other devices. It has no effect on GART mappings.
>
>>> I don't see how you can have separate TTM objects referring to the
>>> same memory.
>> Well that is trivial, we do this all the time with prime and I+A laptops.
> As I understand it, you use DMABuf to export/import buffers on multiple
> devices. I believe all devices share a single amdgpu_bo, which contains
> the ttm_buffer_object. The only way you can have a new TTM buffer object
> per device is by using SG tables and pinning the BOs. But I think we
> want to avoid pinning BOs.
>
> What we do does not involve pinning of BOs, even when they're shared
> between multiple devices' VMs.
>
 That is also the reason we had to disable this feature again in the
 hybrid branches.
>>> What you disabled on hybrid branches was P2P, which only affects
>>> large-BAR systems. Sharing of system memory BOs between devices still
>>> works fine.
>> No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
>> and Raven. In addition to that we found that it breaks some aspects of
>> the user space interface.
> Let me check that with my current patch series. The patches I submitted
> here shouldn't include anything that breaks the use cases you describe.
> But I'm quite sure it will support sharing BOs between multiple devices'
> VMs.

Confirmed with the current patch series. I can allocate buffers on GPU 1
and map them into a VM on GPU 2.

BTW, this is the normal mode of operation for system memory BOs on
multi-GPU systems. Logically system memory BOs belong to the CPU node in
the KFD topology. So the Thunk API isn't told which GPU to allocate the
system memory BO from. It just picks the first one. Then we can map
those BOs on any GPUs we want. If we want to use them on GPU2, that's
where they get mapped.

So I just put two GPUs in a system and ran a test on GPU2. The system
memory buffers are allocated from the GPU1 device, but mapped into the
GPU2 VM. The tests work normally.

If this is enabled by any changes that break existing buffer sharing for
A+A or A+I systems, please point it out to me. I'm not aware that this
patch series does anything to that effect.

Regards,
  Felix

>
> Regards,
>   Felix
>
>> So the end result is that we probably need to revert it and find a
>> different solution. I've been working on this for a couple of weeks
>> now and should have something ready after I'm done with the PASID
>> handling.
>>
>> Regards,
>> Christian.



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 01:45 PM, Christian König wrote:
> On 13.02.2018 at 17:56, Felix Kuehling wrote:
>> [SNIP]
>> Each process gets a whole page of the doorbell aperture assigned to it.
>> The assumption is that amdgpu only uses the first page of the doorbell
>> aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
>> used as the offset into the doorbell page. On GFX9 the hardware does
>> some engine-specific doorbell routing, so we added another layer of
>> doorbell management that's decoupled from the queue ID.
>>
>> Either way, an entire doorbell page gets mapped into user mode and user
>> mode knows the offset of the doorbells for specific queues. The mapping
>> is currently handled by kfd_mmap in kfd_chardev.c.
>
> Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost
> looks like you map different doorbells with the same offset depending
> on which process is calling this.
>
> Is that correct? If yes then that would be illegal and a problem if
> I'm not completely mistaken.

Why is that a problem? Each process has its own file descriptor. The
mapping is done using io_remap_pfn_range in kfd_doorbell_mmap. This is
nothing new. It's been done like this forever even on Kaveri and Carrizo.
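
For illustration, a simplified sketch of that kind of per-process doorbell
mapping. This is not the actual kfd_doorbell_mmap(); the struct fields used
for the aperture base and the per-process page index are placeholders:

/* Sketch only: same file offset for every process, but the physical
 * doorbell page selected depends on which process calls mmap.
 */
static int doorbell_mmap_sketch(struct kfd_dev *dev,
                                struct kfd_process *process,
                                struct vm_area_struct *vma)
{
        phys_addr_t address;

        if (vma->vm_end - vma->vm_start > dev->doorbell_page_size)
                return -EINVAL;

        address = dev->doorbell_base +
                  process->doorbell_index * dev->doorbell_page_size;

        vma->vm_flags |= VM_IO | VM_DONTCOPY | VM_DONTEXPAND |
                         VM_NORESERVE | VM_DONTDUMP | VM_PFNMAP;
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        return io_remap_pfn_range(vma, vma->vm_start,
                                  address >> PAGE_SHIFT,
                                  dev->doorbell_page_size,
                                  vma->vm_page_prot);
}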

>
>>> Do you simply assume that after evicting a process it always needs to
>>> be restarted without checking if it actually does something? Or how
>>> does that work?
>> Exactly.
>
> Ok, understood. Well that limits the usefulness of the whole eviction
> drastically.
>
>> With the later addition of GPU self-dispatch a page-fault based
>> mechanism wouldn't work any more. We have to restart the queues blindly
>> with a timer. See evict_process_worker, which schedules the restore with
>> a delayed worker.
>> The user mode queue ABI specifies that user mode updates both the
>> doorbell and a WPTR in memory. When we restart queues we (or the CP
>> firmware) use the WPTR to make sure we catch up with any work that was
>> submitted while the queues were unmapped.
>
> Putting cross-process work dispatch aside for a moment, GPU
> self-dispatch works only when there is work running on the GPU.
>
> So you can still check if there is some work pending after you
> unmapped everything and only restart the queues when there is new work
> based on the page fault.
>
> In other words either there is work pending and it doesn't matter if
> it was sent by the GPU or by the CPU or there is no work pending and
> we can delay restarting everything until there is.

That sounds like a useful optimization.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>> Regards,
>>> Christian.
>



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 01:15 PM, Christian König wrote:
> On 13.02.2018 at 18:18, Felix Kuehling wrote:
>> On 2018-02-13 12:06 PM, Christian König wrote:
>>> [SNIP]
>>> Ah, yeah that is also a point I wanted to talk about with you.
>>>
>>> The approach of using the same buffer object with multiple amdgpu
>>> devices doesn't work in general.
>>>
>>> We need a separate TTM object for each BO in each device or otherwise we
>>> break A+A laptops.
>> I think it broke for VRAM BOs because we enabled P2P on systems that
>> didn't support it properly. But at least system memory BOs can be shared
>> quite well between devices and we do it all the time.
>
> Sharing VRAM BOs is one issue, but the problem goes deeper than just
> that.
>
> Starting with Carrizo we can scan out from system memory to avoid the
> extra copy on A+A laptops. For this to work we need the BO mapped to
> GART (and I mean a real VMID0 mapping, not just in the GTT domain).
> And for this to work in turn we need a TTM object per device and not a
> global one.

I still don't understand. I think what you're talking about applies only
to BOs used for scan-out. Every BO is allocated from a specific device
and can only be GART-mapped on that device. What we do is map the same
BO in VMs on other devices. It has no effect on GART mappings.

>
>> I don't see how you can have separate TTM objects referring to the
>> same memory.
>
> Well that is trivial, we do this all the time with prime and I+A laptops.

As I understand it, you use DMABuf to export/import buffers on multiple
devices. I believe all devices share a single amdgpu_bo, which contains
the ttm_buffer_object. The only way you can have a new TTM buffer object
per device is by using SG tables and pinning the BOs. But I think we
want to avoid pinning BOs.

What we do does not involve pinning of BOs, even when they're shared
between multiple devices' VMs.

>
>>> That is also the reason we had to disable this feature again in the
>>> hybrid branches.
>> What you disabled on hybrid branches was P2P, which only affects
>> large-BAR systems. Sharing of system memory BOs between devices still
>> works fine.
>
> No, it doesn't. It completely breaks any scanout on Carrizo, Stoney
> and Raven. In addition to that we found that it breaks some aspects of
> the user space interface.

Let me check that with my current patch series. The patches I submitted
here shouldn't include anything that breaks the use cases you describe.
But I'm quite sure it will support sharing BOs between multiple devices'
VMs.

Regards,
  Felix

>
> So the end result is that we probably need to revert it and find a
> different solution. I've been working on this for a couple of weeks
> now and should have something ready after I'm done with the PASID
> handling.
>
> Regards,
> Christian.



Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 13.02.2018 at 17:56, Felix Kuehling wrote:

[SNIP]
Each process gets a whole page of the doorbell aperture assigned to it.
The assumption is that amdgpu only uses the first page of the doorbell
aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
used as the offset into the doorbell page. On GFX9 the hardware does
some engine-specific doorbell routing, so we added another layer of
doorbell management that's decoupled from the queue ID.

Either way, an entire doorbell page gets mapped into user mode and user
mode knows the offset of the doorbells for specific queues. The mapping
is currently handled by kfd_mmap in kfd_chardev.c.


Ok, wait a second. Taking a look at kfd_doorbell_mmap() it almost looks 
like you map different doorbells with the same offset depending on which 
process is calling this.


Is that correct? If yes then that would be illegal and a problem if I'm 
not completely mistaken.



Do you simply assume that after evicting a process it always needs to
be restarted without checking if it actually does something? Or how
does that work?

Exactly.


Ok, understood. Well that limits the usefulness of the whole eviction 
drastically.



With the later addition of GPU self-dispatch a page-fault based
mechanism wouldn't work any more. We have to restart the queues blindly
with a timer. See evict_process_worker, which schedules the restore with
a delayed worker.
The user mode queue ABI specifies that user mode updates both the
doorbell and a WPTR in memory. When we restart queues we (or the CP
firmware) use the WPTR to make sure we catch up with any work that was
submitted while the queues were unmapped.
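
As a concrete illustration of the timer-based restore described above, a
sketch with placeholder names (the work fields, the helper and the delay are
not the real KFD symbols):

#define PROCESS_RESTORE_TIME_MS 100     /* arbitrary delay for the sketch */

static void evict_process_worker_sketch(struct work_struct *work)
{
        struct kfd_process *p = container_of(to_delayed_work(work),
                                             struct kfd_process,
                                             eviction_work);

        /* Unmap all user mode queues of this process ... */
        unmap_process_queues(p);

        /* ... and blindly schedule the restore; the CP firmware later
         * catches up with anything submitted in between via the WPTR
         * that user mode keeps in memory.
         */
        schedule_delayed_work(&p->restore_work,
                              msecs_to_jiffies(PROCESS_RESTORE_TIME_MS));
}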


Putting cross-process work dispatch aside for a moment, GPU self-dispatch
works only when there is work running on the GPU.


So you can still check if there is some work pending after you unmapped
everything and only restart the queues when there is new work based on 
the page fault.


In other words either there is work pending and it doesn't matter if it 
was sent by the GPU or by the CPU or there is no work pending and we can
delay restarting everything until there is.


Regards,
Christian.



Regards,
   Felix


Regards,
Christian.




Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

On 13.02.2018 at 18:18, Felix Kuehling wrote:

On 2018-02-13 12:06 PM, Christian König wrote:

[SNIP]
Ah, yeah that is also a point I wanted to talk about with you.

The approach of using the same buffer object with multiple amdgpu
devices doesn't work in general.

We need a separate TTM object for each BO in each device or otherwise we
break A+A laptops.

I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time.


Sharing VRAM BOs is one issue, but the problem goes deeper than just that.

Starting with Carrizo we can scan out from system memory to avoid the
extra copy on A+A laptops. For this to work we need the BO mapped to 
GART (and I mean a real VMID0 mapping, not just in the GTT domain). And 
for this to work in turn we need a TTM object per device and not a 
global one.



I don't see how you can have separate TTM objects referring to the same memory.


Well that is trivial, we do this all the time with prime and I+A laptops.


That is also the reason we had to disable this feature again in the
hybrid branches.

What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.


No, it doesn't. It completely breaks any scanout on Carrizo, Stoney and
Raven. In addition to that we found that it breaks some aspects of the
user space interface.


So the end result is that we probably need to revert it and find a different
solution. I've been working on this for a couple of weeks now and
should have something ready after I'm done with the PASID handling.


Regards,
Christian.


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling
On 2018-02-13 12:06 PM, Christian König wrote:
> On 13.02.2018 at 17:42, Felix Kuehling wrote:
>> On 2018-02-13 05:25 AM, Christian König wrote:
>>> On 13.02.2018 at 00:23, Felix Kuehling wrote:
 On 2018-02-12 11:57 AM, Felix Kuehling wrote:
>>> I just thought of a slightly different approach I would consider
>>> more
>>> realistic, without having thought through all the details: Adding
>>> KFD-specific memory management ioctls to the amdgpu device.
>>> Basically
>>> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions
>>> instead
>>> of KFD ioctl functions. But we'd still have KFD ioctls for other
>>> things,
>>> and the new amdgpu ioctls would be KFD-specific and not useful for
>>> graphics applications. It's probably still several weeks of
>>> work, but
>>> shouldn't have major teething issues because the internal logic and
>>> functionality would be basically unchanged. It would just move the
>>> ioctls from one device to another.
>> My thinking went into a similar direction. But instead of
>> exposing the
>> KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.
>>
>> And then use the DRM FD in the KFD for things like the buffer
>> provider
>> of a device, e.g. no separate IDR for BOs in the KFD but rather a
>> reference to the DRM FD.
> I'm not sure I understand this. With "DRM FD" I think you mean the
> device file descriptor. Then have KFD call file operation on the
> FD to
> call AMDGPU ioctls? Is there any precedent for this that I can
> look at
> for an example?
 OK, I think this could work for finding a VM from a DRM file
 descriptor:

   struct file *filp = fget(fd);
   struct drm_file *drm_priv = filp->private_data;
   struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
   struct amdgpu_vm *vm = &drv_priv->vm;

 That would let us use the DRM VM instead of creating our own and would
 avoid wasting a perfectly good VM that gets created by opening the DRM
 device. We'd need a new ioctl to import VMs into KFD. But that's about
 as far as I would take it.

 We'd still need to add KFD-specific data to the VM. If we use an
 existing VM, we can't wrap it in our own structure any more. We'd need
 to come up with a different solution or extend struct amdgpu_vm with
 what we need.
>>> Well feel free to extend the VM structure with a pointer for KFD data.
>> Some more thoughts about that: Currently the lifetime of the VM is tied
>> to the file descriptor. If we import it into KFD, we either have to make
>> struct amdgpu_vm reference counted, or we have to keep a reference to
>> the file descriptor in KFD just to keep the VM alive until we drop our
>> reference to it.
>>
 And we still need our own BO creation ioctl. Importing DMABufs adds
 extra overhead (more system calls, file descriptors, GEM objects) and
 doesn't cover some of our use cases (userptr). We also need our own
 idr
 for managing buffer handles that are used with all our memory and VM
 management functions.
>>> Yeah, well that is the second part of that idea.
>>>
>>> When you have the drm_file for the VM, you can also use it to call
>>> drm_gem_object_lookup().
>> KFD only has one device, so we need a buffer ID that's globally unique,
>> not per-device. Or we'd need to identify buffers by a pair of a GEM
>> handle and a device ID. Our BO handles are 64-bit, so we could pack both
>> into a single handle.
>
> Ah, yeah that is also a point I wanted to talk about with you.
>
> The approach of using the same buffer object with multiple amdgpu
> devices doesn't work in general.
>
> We need a separate TTM object for each BO in each device or otherwise we
> break A+A laptops.

I think it broke for VRAM BOs because we enabled P2P on systems that
didn't support it properly. But at least system memory BOs can be shared
quite well between devices and we do it all the time. I don't see how
you can have separate TTM objects referring to the same memory.

>
> That is also the reason we had to disable this feature again in the
> hybrid branches.

What you disabled on hybrid branches was P2P, which only affects
large-BAR systems. Sharing of system memory BOs between devices still
works fine.

>
> So we need BO handles per device here or some other solution.
>
>> Either way, that still requires a GEM object, which adds more than just a
>> minor bit of overhead. It includes a struct file * which points to a
>> shared memory file that gets created for each BO in drm_gem_object_init.
>
> HUI WHAT? Why the heck do we still do this?
>
> Sorry something went wrong here and yes that shmem file is completely
> superfluous even for the gfx side.
>
>>> And when you need to iterate over all the BOs you can just use the
>>> object_idr or the VM structure as well.
object_idr only gives us BOs allocated from a particular device.

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

Am 13.02.2018 um 17:42 schrieb Felix Kuehling:

On 2018-02-13 05:25 AM, Christian König wrote:

Am 13.02.2018 um 00:23 schrieb Felix Kuehling:

On 2018-02-12 11:57 AM, Felix Kuehling wrote:

I just thought of a slightly different approach I would consider more
realistic, without having thought through all the details: Adding
KFD-specific memory management ioctls to the amdgpu device. Basically
call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions
instead
of KFD ioctl functions. But we'd still have KFD ioctls for other
things,
and the new amdgpu ioctls would be KFD-specific and not useful for
graphics applications. It's probably still several weeks of work, but
shouldn't have major teething issues because the internal logic and
functionality would be basically unchanged. It would just move the
ioctls from one device to another.

My thinking went into a similar direction. But instead of exposing the
KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.

And then use the DRM FD in the KFD for things like the buffer provider
of a device, e.g. no separate IDR for BOs in the KFD but rather a
reference to the DRM FD.

I'm not sure I understand this. With "DRM FD" I think you mean the
device file descriptor. Then have KFD call file operation on the FD to
call AMDGPU ioctls? Is there any precedent for this that I can look at
for an example?

OK, I think this could work for finding a VM from a DRM file descriptor:

  struct file *filp = fget(fd);
  struct drm_file *drm_priv = filp->private_data;
  struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
  struct amdgpu_vm *vm = &drv_priv->vm;

That would let us use the DRM VM instead of creating our own and would
avoid wasting a perfectly good VM that gets created by opening the DRM
device. We'd need a new ioctl to import VMs into KFD. But that's about
as far as I would take it.

We'd still need to add KFD-specific data to the VM. If we use an
existing VM, we can't wrap it in our own structure any more. We'd need
to come up with a different solution or extend struct amdgpu_vm with
what we need.

Well feel free to extend the VM structure with a pointer for KFD data.

Some more thoughts about that: Currently the lifetime of the VM is tied
to the file descriptor. If we import it into KFD, we either have to make
struct amdgpu_vm reference counted, or we have to keep a reference to
the file descriptor in KFD just to keep the VM alive until we drop our
reference to it.


And we still need our own BO creation ioctl. Importing DMABufs adds
extra overhead (more system calls, file descriptors, GEM objects) and
doesn't cover some of our use cases (userptr). We also need our own idr
for managing buffer handles that are used with all our memory and VM
management functions.

Yeah, well that is the second part of that idea.

When you have the drm_file for the VM, you can also use it to call
drm_gem_object_lookup().

KFD only has one device, so we need a buffer ID that's globally unique,
not per-device. Or we'd need to identify buffers by a pair of a GEM
handle and a device ID. Our BO handles are 64-bit, so we could pack both
into a single handle.


Ah, yeah that is also a point I wanted to talk about with you.

The approach of using the same buffer object with multiple amdgpu 
devices doesn't work in general.


We need a separate TTM object for each BO in each device or otherwise we 
break A+A laptops.


That is also the reason we had to disable this feature again in the 
hybrid branches.


So we need BO handles per device here or some other solution.


Either way, that still requires a GEM object, which adds not just a
minor bit of overhead. It includes a struct file * which points to a
shared memory file that gets created for each BO in drm_gem_object_init.


HUI WHAT? Why the heck do we still do this?

Sorry, something went wrong here and yes that shmem file is completely 
superfluous even for the gfx side.
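
(For what it's worth, a hedged sketch of the obvious fix, assuming the driver keeps managing its own backing storage through TTM:)

    /* drm_gem_private_object_init() skips the shmem file that
     * drm_gem_object_init() would otherwise create for the object
     */
    drm_gem_private_object_init(adev->ddev, &bo->gem_base, size);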



And when you need to iterate over all the BOs you can just use the
object_idr or the VM structure as well.

object_idr only gives us BOs allocated from a particular device. The VM
structure has various BO lists, but they only contain BOs that were
added to the VM. The same BO can be added multiple times or to multiple
VMs, or it may not be in any VM at all.


Ok, yeah that probably isn't sufficient for the BO handling like you 
need it for eviction.



In later patch series we also need to track additional information for
KFD buffers in KFD itself. At that point we change the KFD IDR to point
to our own structure, which in turn points to the BO. For example our
BOs get their VA assigned at allocation time, and we later do reverse
lookups from the address to the BO using an interval tree.


You can do this with the standard VM structure as well, that is needed 
for UVD and VCE anyway. See amdgpu_vm_bo_lookup_mapping.
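
A hedged sketch of that reverse lookup through the VM's interval tree (assuming addr is a GPU VA in bytes; field names as in current amdgpu, may differ):

    struct amdgpu_bo_va_mapping *mapping;
    struct amdgpu_bo *bo = NULL;

    mapping = amdgpu_vm_bo_lookup_mapping(vm, addr / AMDGPU_GPU_PAGE_SIZE);
    if (mapping)
            bo = mapping->bo_va->base.bo;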


Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling

On 2018-02-13 05:46 AM, Christian König wrote:
> Am 12.02.2018 um 17:57 schrieb Felix Kuehling:
>>> [SNIP]
>>> I see that as an advantage rather than a problem, cause it fixes a
>>> couple of problems with the KFD where the address space of the inode
>>> is not managed correctly as far as I can see.
>> I don't think GEM is involved in the management of address space. That's
>> handled inside TTM and drm_vma_manager, both of which we are using. We
>> have tested CPU mappings with evictions and this is working correctly
>> now.
>
> Ok, correct. I was thinking about that as one large functionality.
> When you see it like this, GEM basically only implements looking up BOs
> and reference counting them, but both are still useful features.
>
>> We had problems with this before, when we were CPU-mapping our buffers
>> through the KFD device FD.
>>
>>> Those implementation issues never caused problems so far because you
>>> never tried to unmap doorbells. But with the new eviction code that is
>>> about to change, isn't it?
>> I don't understand this comment. What does this have to do with
>> doorbells?
>
> I was assuming that doorbells are now unmapped to figure out when to
> start a process again.

Restart is done based on a timer.

>
> See, what I'm missing is how userspace figures out which address
> to use to map the doorbell? E.g. I don't see a call to
> drm_vma_offset_manager_init() or something like that.

Each process gets a whole page of the doorbell aperture assigned to it.
The assumption is that amdgpu only uses the first page of the doorbell
aperture, so KFD uses all the rest. On GFX8 and before, the queue ID is
used as the offset into the doorbell page. On GFX9 the hardware does
some engine-specific doorbell routing, so we added another layer of
doorbell management that's decoupled from the queue ID.

Either way, an entire doorbell page gets mapped into user mode and user
mode knows the offset of the doorbells for specific queues. The mapping
is currently handled by kfd_mmap in kfd_chardev.c. A later patch will
add the ability to map doorbells into GPUVM address space so that GPUs
can dispatch work to themselves (or each other) by creating a special
doorbell BO that can be mapped both to the GPU and the CPU. It works by
creating an amdgpu_bo with an SG describing the doorbell page of the
process.
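
To make that a bit more concrete, a hedged sketch of the CPU-side mapping, simplified from what kfd_mmap does (the offset math, error handling and exact struct fields are illustrative):

    /* one doorbell page per process, mapped uncached into user space */
    phys_addr_t address = kfd->doorbell_base +
                          process_doorbell_index(p) * PAGE_SIZE;

    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    return io_remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
                              PAGE_SIZE, vma->vm_page_prot);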

>
>> Doorbells are never unmapped. When queues are evicted doorbells remain
>> mapped to the process. User mode can continue writing to doorbells,
>> though they won't have any immediate effect while the corresponding
>> queues are unmapped from the HQDs.
>
> Do you simply assume that after evicting a process it always needs to
> be restarted without checking if it actually does something? Or how
> does that work?

Exactly. With later addition of GPU self-dispatch a page-fault based
mechanism wouldn't work any more. We have to restart the queues blindly
with a timer. See evict_process_worker, which schedules the restore with
a delayed worker.

The user mode queue ABI specifies that user mode updates both the
doorbell and a WPTR in memory. When we restart queues we (or the CP
firmware) use the WPTR to make sure we catch up with any work that was
submitted while the queues were unmapped.
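
A hedged sketch of that flow with a delayed worker; the constant and worker names are illustrative:

    INIT_DELAYED_WORK(&p->restore_work, restore_process_worker);

    /* in the eviction path, instead of waiting for a page fault */
    schedule_delayed_work(&p->restore_work,
                          msecs_to_jiffies(PROCESS_RESTORE_TIME_MS));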

Regards,
  Felix

>
> Regards,
> Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Felix Kuehling

On 2018-02-13 05:25 AM, Christian König wrote:
> Am 13.02.2018 um 00:23 schrieb Felix Kuehling:
>> On 2018-02-12 11:57 AM, Felix Kuehling wrote:
> I just thought of a slightly different approach I would consider more
> realistic, without having thought through all the details: Adding
> KFD-specific memory management ioctls to the amdgpu device. Basically
> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions
> instead
> of KFD ioctl functions. But we'd still have KFD ioctls for other
> things,
> and the new amdgpu ioctls would be KFD-specific and not useful for
> graphics applications. It's probably still several weeks of work, but
> shouldn't have major teething issues because the internal logic and
> functionality would be basically unchanged. It would just move the
> ioctls from one device to another.
 My thinking went into a similar direction. But instead of exposing the
 KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.

 And then use the DRM FD in the KFD for things like the buffer provider
 of a device, e.g. no separate IDR for BOs in the KFD but rather a
 reference to the DRM FD.
>>> I'm not sure I understand this. With "DRM FD" I think you mean the
>>> device file descriptor. Then have KFD call file operation on the FD to
>>> call AMDGPU ioctls? Is there any precedent for this that I can look at
>>> for an example?
>> OK, I think this could work for finding a VM from a DRM file descriptor:
>>
>>  struct file *filp = fget(fd);
>>  struct drm_file *drm_priv = filp->private_data;
>>  struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
>>  struct amdgpu_vm *vm = &drv_priv->vm;
>>
>> That would let us use the DRM VM instead of creating our own and would
>> avoid wasting a perfectly good VM that gets created by opening the DRM
>> device. We'd need a new ioctl to import VMs into KFD. But that's about
>> as far as I would take it.
>>
>> We'd still need to add KFD-specific data to the VM. If we use an
>> existing VM, we can't wrap it in our own structure any more. We'd need
>> to come up with a different solution or extend struct amdgpu_vm with
>> what we need.
>
> Well feel free to extend the VM structure with a pointer for KFD data.

Some more thoughts about that: Currently the lifetime of the VM is tied
to the file descriptor. If we import it into KFD, we either have to make
struct amdgpu_vm reference counted, or we have to keep a reference to
the file descriptor in KFD just to keep the VM alive until we drop our
reference to it.

>
>> And we still need our own BO creation ioctl. Importing DMABufs adds
>> extra overhead (more system calls, file descriptors, GEM objects) and
>> doesn't cover some of our use cases (userptr). We also need our own idr
>> for managing buffer handles that are used with all our memory and VM
>> management functions.
>
> Yeah, well that is the second part of that idea.
>
> When you have the drm_file for the VM, you can also use it to call
> drm_gem_object_lookup().

KFD only has one device, so we need a buffer ID that's globally unique,
not per-device. Or we'd need to identify buffers by a pair of a GEM
handle and a device ID. Our BO handles are 64-bit, so we could pack both
into a single handle.

Either way, that still requires a GEM object, which adds not just a
minor bit of overhead. It includes a struct file * which points to a
shared memory file that gets created for each BO in drm_gem_object_init.

>
> And when you need to iterate over all the BOs you can just use the
> object_idr or the VM structure as well.

object_idr only gives us BOs allocated from a particular device. The VM
structure has various BO lists, but they only contain BOs that were
added to the VM. The same BO can be added multiple times or to multiple
VMs, or it may not be in any VM at all.

In later patch series we also need to track additional information for
KFD buffers in KFD itself. At that point we change the KFD IDR to point
to our own structure, which in turn points to the BO. For example our
BOs get their VA assigned at allocation time, and we later do reverse
lookups from the address to the BO using an interval tree.

Regards,
  Felix

>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Or do you mean a DMABuf FD exported by GEM? After importing the buffer
>>> and closing the FD, we still need a way to refer to the buffer for
>>> managing the GPU mappings and for releasing the reference. So we still
>>> need handles that are currently provided by our own IDR. I think we'll
>>> still need that.
>>>
>>> Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for
>>> creating userptr BOs. Those can't be exported as DMABufs, so we still
>>> need our own ioctl for creating that type of BO. That was going to
>>> be my
>>> next patch series.
>>>
>>> Regards,
>>>    Felix
>>>
>

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

Am 12.02.2018 um 17:57 schrieb Felix Kuehling:

[SNIP]
I see that as an advantage rather than a problem, cause it fixes a
couple of problems with the KFD where the address space of the inode
is not managed correctly as far as I can see.

I don't think GEM is involved in the management of address space. That's
handled inside TTM and drm_vma_manager, both of which we are using. We
have tested CPU mappings with evictions and this is working correctly now.


Ok, correct. I was thinking about that as one large functionality. When 
you see it like this, GEM basically only implements looking up BOs and 
reference counting them, but both are still useful features.



We had problems with this before, when we were CPU-mapping our buffers
through the KFD device FD.


Those implementation issues never caused problems so far because you
never tried to unmap doorbells. But with the new eviction code that is
about to change, isn't it?

I don't understand this comment. What does this have to do with doorbells?


I was assuming that doorbells are now unmapped to figure out when to 
start a process again.


See, what I'm missing is how userspace figures out which address to 
use to map the doorbell? E.g. I don't see a call to 
drm_vma_offset_manager_init() or something like that.



Doorbells are never unmapped. When queues are evicted doorbells remain
mapped to the process. User mode can continue writing to doorbells,
though they won't have any immediate effect while the corresponding
queues are unmapped from the HQDs.


Do you simply assume that after evicting a process it always needs to be 
restarted without checking if it actually does something? Or how does 
that work?


Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-13 Thread Christian König

Am 13.02.2018 um 00:23 schrieb Felix Kuehling:

On 2018-02-12 11:57 AM, Felix Kuehling wrote:

I just thought of a slightly different approach I would consider more
realistic, without having thought through all the details: Adding
KFD-specific memory management ioctls to the amdgpu device. Basically
call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead
of KFD ioctl functions. But we'd still have KFD ioctls for other things,
and the new amdgpu ioctls would be KFD-specific and not useful for
graphics applications. It's probably still several weeks of work, but
shouldn't have major teething issues because the internal logic and
functionality would be basically unchanged. It would just move the
ioctls from one device to another.

My thinking went into a similar direction. But instead of exposing the
KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.

And then use the DRM FD in the KFD for things like the buffer provider
of a device, e.g. no separate IDR for BOs in the KFD but rather a
reference to the DRM FD.

I'm not sure I understand this. With "DRM FD" I think you mean the
device file descriptor. Then have KFD call file operation on the FD to
call AMDGPU ioctls? Is there any precedent for this that I can look at
for an example?

OK, I think this could work for finding a VM from a DRM file descriptor:

 struct file *filp = fget(fd);
 struct drm_file *drm_priv = filp->private_data;
 struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
 struct amdgpu_vm *vm = &drv_priv->vm;

That would let us use the DRM VM instead of creating our own and would
avoid wasting a perfectly good VM that gets created by opening the DRM
device. We'd need a new ioctl to import VMs into KFD. But that's about
as far as I would take it.

We'd still need to add KFD-specific data to the VM. If we use an
existing VM, we can't wrap it in our own structure any more. We'd need
to come up with a different solution or extend struct amdgpu_vm with
what we need.


Well feel free to extend the VM structure with a pointer for KFD data.


And we still need our own BO creation ioctl. Importing DMABufs adds
extra overhead (more system calls, file descriptors, GEM objects) and
doesn't cover some of our use cases (userptr). We also need our own idr
for managing buffer handles that are used with all our memory and VM
management functions.


Yeah, well that is the second part of that idea.

When you have the drm_file for the VM, you can also use it to call 
drm_gem_object_lookup().


And when you need to iterate over all the BOs you can just use the 
object_idr or the VM structure as well.


Regards,
Christian.



Regards,
   Felix



Or do you mean a DMABuf FD exported by GEM? After importing the buffer
and closing the FD, we still need a way to refer to the buffer for
managing the GPU mappings and for releasing the reference. So we still
need handles that are currently provided by our own IDR. I think we'll
still need that.

Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for
creating userptr BOs. Those can't be exported as DMABufs, so we still
need our own ioctl for creating that type of BO. That was going to be my
next patch series.

Regards,
   Felix



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-12 Thread Felix Kuehling

On 2018-02-12 11:57 AM, Felix Kuehling wrote:
>>> I just thought of a slightly different approach I would consider more
>>> realistic, without having thought through all the details: Adding
>>> KFD-specific memory management ioctls to the amdgpu device. Basically
>>> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead
>>> of KFD ioctl functions. But we'd still have KFD ioctls for other things,
>>> and the new amdgpu ioctls would be KFD-specific and not useful for
>>> graphics applications. It's probably still several weeks of work, but
>>> shouldn't have major teething issues because the internal logic and
>>> functionality would be basically unchanged. It would just move the
>>> ioctls from one device to another.
>> My thinking went into a similar direction. But instead of exposing the
>> KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.
>>
>> And then use the DRM FD in the KFD for things like the buffer provider
>> of a device, e.g. no separate IDR for BOs in the KFD but rather a
>> reference to the DRM FD.
> I'm not sure I understand this. With "DRM FD" I think you mean the
> device file descriptor. Then have KFD call file operation on the FD to
> call AMDGPU ioctls? Is there any precedent for this that I can look at
> for an example?

OK, I think this could work for finding a VM from a DRM file descriptor:

struct file *filp = fget(fd);
struct drm_file *drm_priv = filp->private_data;
struct amdgpu_fpriv *drv_priv = drm_priv->driver_priv;
struct amdgpu_vm *vm = &drv_priv->vm;

That would let us use the DRM VM instead of creating our own and would
avoid wasting a perfectly good VM that gets created by opening the DRM
device. We'd need a new ioctl to import VMs into KFD. But that's about
as far as I would take it.

We'd still need to add KFD-specific data to the VM. If we use an
existing VM, we can't wrap it in our own structure any more. We'd need
to come up with a different solution or extend struct amdgpu_vm with
what we need.

And we still need our own BO creation ioctl. Importing DMABufs adds
extra overhead (more system calls, file descriptors, GEM objects) and
doesn't cover some of our use cases (userptr). We also need our own idr
for managing buffer handles that are used with all our memory and VM
management functions.

Regards,
  Felix


>
> Or do you mean a DMABuf FD exported by GEM? After importing the buffer
> and closing the FD, we still need a way to refer to the buffer for
> managing the GPU mappings and for releasing the reference. So we still
> need handles that are currently provided by our own IDR. I think we'll
> still need that.
>
> Finally, the kfd_ioctl_alloc_memory_of_gpu will later also be used for
> creating userptr BOs. Those can't be exported as DMABufs, so we still
> need our own ioctl for creating that type of BO. That was going to be my
> next patch series.
>
> Regards,
>   Felix
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-12 Thread Felix Kuehling
On 2018-02-11 04:55 AM, Christian König wrote:
> Am 09.02.2018 um 21:31 schrieb Felix Kuehling:
>> On 2018-02-09 07:34 AM, Christian König wrote:
>>> I really wonder if sharing the GPUVM instance from a render node file
>>> descriptor wouldn't be easier.
>>>
>>> You could just use the existing IOCTL for allocating and mapping
>>> memory on a render node and then give the prepared fd to kfd to use.
>> In amd-kfd-staging we have an ioctl to import a DMABuf into the KFD VM
>> for interoperability between graphics and compute. I'm planning to
>> upstream that later.
>
> Could you move that around and upstream it first?

Sure. I assume with "first" you mean before userptr and Vega10 support.
I was going to leave it for later because graphics interoperability not
an essential feature. Userptr is more important to get existing ROCm
user mode to work properly.

>
>> It still depends on the KFD calls for managing the
>> GPU mappings. The KFD GPU mapping calls need to interact with the
>> eviction logic. TLB flushing involves the HIQ/KIQ for figuring out the
>> VMID/PASID mapping, which is managed asynchronously by the HWS, not
>> amdgpu_ids.c. User pointers need a different MMU notifier. AQL queue
>> buffers on some GPUs need a double-mapping workaround for a HW
>> wraparound bug. I could go on. So this is not easy to retrofit into
>> amdgpu_gem and amdgpu_cs. Also, KFD VMs are created differently
>> (AMDGPU_VM_CONTEXT_COMPUTE, amdkfd_vm wrapper structure).
>
> Well, that is actually the reason why I'm asking about it.
>
> First of all, kernel development is not use case driven, e.g. we should
> not implement something because userspace needs it in a specific way,
> but rather because the hardware supports it in a specific way.
>
> This means that in theory we should have the fixes for the HW problems
> in both interfaces. That doesn't make sense at the moment because we
> don't support user space queues through the render node file
> descriptor, but people are starting to ask for that as well.
>> What's more, buffers allocated through amdgpu_gem calls create GEM
>> objects that we don't need.
>
> I see that as an advantage rather than a problem, cause it fixes a
> couple of problems with the KFD where the address space of the inode
> is not managed correctly as far as I can see.

I don't think GEM is involved in the management of address space. That's
handled inside TTM and drm_vma_manager, both of which we are using. We
have tested CPU mappings with evictions and this is working correctly now.

We had problems with this before, when we were CPU-mapping our buffers
through the KFD device FD.

>
> Those implementation issues never caused problems so far because you
> never tried to unmap doorbells. But with the new eviction code that is
> about to change, isn't it?

I don't understand this comment. What does this have to do with doorbells?

Doorbells are never unmapped. When queues are evicted doorbells remain
mapped to the process. User mode can continue writing to doorbells,
though they won't have any immediate effect while the corresponding
queues are unmapped from the HQDs.

>
>> And exporting and importing DMABufs adds
>> more overhead and a potential to run into the process file-descriptor
>> limit (maybe the FD could be discarded after importing).
>
> Closing the file descriptor is a must have after importing, so that
> isn't an issue.
>
> But I agree that exporting and reimporting the file descriptor adds
> some additional overhead.
>
>> I honestly thought about whether this would be feasible when we
>> implemented the CPU mapping through the DRM FD. But it would be nothing
>> short of a complete redesign of the KFD memory management code. It would
>> be months of work, more instability, for questionable benefits. I don't
>> think it would be in the interest of end users and customers.
>
> Yeah, agree on that.
>
>> I just thought of a slightly different approach I would consider more
>> realistic, without having thought through all the details: Adding
>> KFD-specific memory management ioctls to the amdgpu device. Basically
>> call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead
>> of KFD ioctl functions. But we'd still have KFD ioctls for other things,
>> and the new amdgpu ioctls would be KFD-specific and not useful for
>> graphics applications. It's probably still several weeks of work, but
>> shouldn't have major teething issues because the internal logic and
>> functionality would be basically unchanged. It would just move the
>> ioctls from one device to another.
>
> My thinking went into a similar direction. But instead of exposing the
> KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.
>
> And then use the DRM FD in the KFD for things like the buffer provider
> of a device, e.g. no separate IDR for BOs in the KFD but rather a
> reference to the DRM FD.

I'm not sure I understand this. With "DRM FD" I think you mean the
device file descriptor. Then have KFD call file operations on the FD to call AMDGPU ioctls?

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-11 Thread Christian König

Am 09.02.2018 um 21:31 schrieb Felix Kuehling:

On 2018-02-09 07:34 AM, Christian König wrote:

I really wonder if sharing the GPUVM instance from a render node file
descriptor wouldn't be easier.

You could just use the existing IOCTL for allocating and mapping
memory on a render node and then give the prepared fd to kfd to use.

In amd-kfd-staging we have an ioctl to import a DMABuf into the KFD VM
for interoperability between graphics and compute. I'm planning to
upstream that later.


Could you move that around and upstream it first?


It still depends on the KFD calls for managing the
GPU mappings. The KFD GPU mapping calls need to interact with the
eviction logic. TLB flushing involves the HIQ/KIQ for figuring out the
VMID/PASID mapping, which is managed asynchronously by the HWS, not
amdgpu_ids.c. User pointers need a different MMU notifier. AQL queue
buffers on some GPUs need a double-mapping workaround for a HW
wraparound bug. I could go on. So this is not easy to retrofit into
amdgpu_gem and amdgpu_cs. Also, KFD VMs are created differently
(AMDGPU_VM_CONTEXT_COMPUTE, amdkfd_vm wrapper structure).


Well, that is actually the reason why I'm asking about it.

First of all, kernel development is not use case driven, e.g. we should 
not implement something because userspace needs it in a specific way, 
but rather because the hardware supports it in a specific way.


This means that in theory we should have the fixes for the HW problems 
in both interfaces. That doesn't make sense at the moment because we 
don't support user space queues through the render node file descriptor, 
but people are starting to ask for that as well.



What's more, buffers allocated through amdgpu_gem calls create GEM
objects that we don't need.


I see that as an advantage rather than a problem, cause it fixes a 
couple of problems with the KFD where the address space of the inode is 
not managed correctly as far as I can see.


Those implementation issues never caused problems so far because you 
never tried to unmap doorbells. But with the new eviction code that is 
about to change, isn't it?



And exporting and importing DMABufs adds
more overhead and a potential to run into the process file-descriptor
limit (maybe the FD could be discarded after importing).


Closing the file descriptor is a must have after importing, so that 
isn't an issue.


But I agree that exporting and reimporting the file descriptor adds some 
additional overhead.
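
For reference, a hedged user-space sketch of that round trip; fd_a, fd_b and gem_handle are illustrative and error checking is omitted:

    struct drm_prime_handle exp = { .handle = gem_handle, .flags = DRM_CLOEXEC };
    struct drm_prime_handle imp = { 0 };

    ioctl(fd_a, DRM_IOCTL_PRIME_HANDLE_TO_FD, &exp);
    imp.fd = exp.fd;
    ioctl(fd_b, DRM_IOCTL_PRIME_FD_TO_HANDLE, &imp);
    close(exp.fd);      /* must be closed after import, imp.handle keeps the BO */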



I honestly thought about whether this would be feasible when we
implemented the CPU mapping through the DRM FD. But it would be nothing
short of a complete redesign of the KFD memory management code. It would
be months of work, more instability, for questionable benefits. I don't
think it would be in the interest of end users and customers.


Yeah, agree on that.


I just thought of a slightly different approach I would consider more
realistic, without having thought through all the details: Adding
KFD-specific memory management ioctls to the amdgpu device. Basically
call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead
of KFD ioctl functions. But we'd still have KFD ioctls for other things,
and the new amdgpu ioctls would be KFD-specific and not useful for
graphics applications. It's probably still several weeks of work, but
shouldn't have major teething issues because the internal logic and
functionality would be basically unchanged. It would just move the
ioctls from one device to another.


My thinking went into a similar direction. But instead of exposing the 
KFD IOCTLs through the DRM FD, I would let the KFD import a DRM FD.


And then use the DRM FD in the KFD for things like the buffer provider 
of a device, e.g. no separate IDR for BOs in the KFD but rather a 
reference to the DRM FD.


We can still manage the VM through KFD IOCTL, but the BOs and the VM are 
actually provided by the DRM FD.


Regards,
Christian.


Regards,
   Felix


Regards,
Christian.

Am 07.02.2018 um 02:32 schrieb Felix Kuehling:

From: Oak Zeng 

Populate DRM render device minor in kfd topology

Signed-off-by: Oak Zeng 
Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 4 ++++
   drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 1 +
   2 files changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 2506155..ac28abc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -441,6 +441,8 @@ static ssize_t node_show(struct kobject *kobj,
struct attribute *attr,
   dev->node_props.device_id);
   sysfs_show_32bit_prop(buffer, "location_id",
   dev->node_props.location_id);
+    sysfs_show_32bit_prop(buffer, "drm_render_minor",
+    dev->node_props.drm_render_minor);
     if (dev->gpu) {
   log_max_watch_addr =
@@ -1214,6 +1216,8 @@ 

Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-09 Thread Felix Kuehling
On 2018-02-09 07:34 AM, Christian König wrote:
> I really wonder if sharing the GPUVM instance from a render node file
> descriptor wouldn't be easier.
>
> You could just use the existing IOCTL for allocating and mapping
> memory on a render node and then give the prepared fd to kfd to use.

In amd-kfd-staging we have an ioctl to import a DMABuf into the KFD VM
for interoperability between graphics and compute. I'm planning to
upstream that later. It still depends on the KFD calls for managing the
GPU mappings. The KFD GPU mapping calls need to interact with the
eviction logic. TLB flushing involves the HIQ/KIQ for figuring out the
VMID/PASID mapping, which is managed asynchronously by the HWS, not
amdgpu_ids.c. User pointers need a different MMU notifier. AQL queue
buffers on some GPUs need a double-mapping workaround for a HW
wraparound bug. I could go on. So this is not easy to retrofit into
amdgpu_gem and amdgpu_cs. Also, KFD VMs are created differently
(AMDGPU_VM_CONTEXT_COMPUTE, amdkfd_vm wrapper structure).

What's more, buffers allocated through amdgpu_gem calls create GEM
objects that we don't need. And exporting and importing DMABufs adds
more overhead and a potential to run into the process file-descriptor
limit (maybe the FD could be discarded after importing).

I honestly thought about whether this would be feasible when we
implemented the CPU mapping through the DRM FD. But it would be nothing
short of a complete redesign of the KFD memory management code. It would
be months of work, more instability, for questionable benefits. I don't
think it would be in the interest of end users and customers.

I just thought of a slightly different approach I would consider more
realistic, without having thought through all the details: Adding
KFD-specific memory management ioctls to the amdgpu device. Basically
call amdgpu_amdkfd_gpuvm functions from amdgpu ioctl functions instead
of KFD ioctl functions. But we'd still have KFD ioctls for other things,
and the new amdgpu ioctls would be KFD-specific and not useful for
graphics applications. It's probably still several weeks of work, but
shouldn't have major teething issues because the internal logic and
functionality would be basically unchanged. It would just move the
ioctls from one device to another.

Regards,
  Felix

>
> Regards,
> Christian.
>
> Am 07.02.2018 um 02:32 schrieb Felix Kuehling:
>> From: Oak Zeng 
>>
>> Populate DRM render device minor in kfd topology
>>
>> Signed-off-by: Oak Zeng 
>> Signed-off-by: Felix Kuehling 
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 4 ++++
>>   drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 1 +
>>   2 files changed, 5 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> index 2506155..ac28abc 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
>> @@ -441,6 +441,8 @@ static ssize_t node_show(struct kobject *kobj,
>> struct attribute *attr,
>>   dev->node_props.device_id);
>>   sysfs_show_32bit_prop(buffer, "location_id",
>>   dev->node_props.location_id);
>> +    sysfs_show_32bit_prop(buffer, "drm_render_minor",
>> +    dev->node_props.drm_render_minor);
>>     if (dev->gpu) {
>>   log_max_watch_addr =
>> @@ -1214,6 +1216,8 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
>>   dev->gpu->kfd2kgd->get_max_engine_clock_in_mhz(dev->gpu->kgd);
>>   dev->node_props.max_engine_clk_ccompute =
>>   cpufreq_quick_get_max(0) / 1000;
>> +    dev->node_props.drm_render_minor =
>> +    gpu->shared_resources.drm_render_minor;
>>     kfd_fill_mem_clk_max_info(dev);
>>   kfd_fill_iolink_non_crat_info(dev);
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
>> b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
>> index c0be2be..eb54cfc 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
>> @@ -71,6 +71,7 @@ struct kfd_node_properties {
>>   uint32_t location_id;
>>   uint32_t max_engine_clk_fcompute;
>>   uint32_t max_engine_clk_ccompute;
>> +    int32_t  drm_render_minor;
>>   uint16_t marketing_name[KFD_TOPOLOGY_PUBLIC_NAME_SIZE];
>>   };
>>   
>

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 14/25] drm/amdkfd: Populate DRM render device minor

2018-02-09 Thread Christian König
I really wonder if sharing the GPUVM instance from a render node file 
descriptor wouldn't be easier.


You could just use the existing IOCTL for allocating and mapping memory 
on a render node and then give the prepared fd to kfd to use.


Regards,
Christian.

Am 07.02.2018 um 02:32 schrieb Felix Kuehling:

From: Oak Zeng 

Populate DRM render device minor in kfd topology

Signed-off-by: Oak Zeng 
Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 4 ++++
  drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 1 +
  2 files changed, 5 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
index 2506155..ac28abc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c
@@ -441,6 +441,8 @@ static ssize_t node_show(struct kobject *kobj, struct attribute *attr,
 			dev->node_props.device_id);
 	sysfs_show_32bit_prop(buffer, "location_id",
 			dev->node_props.location_id);
+	sysfs_show_32bit_prop(buffer, "drm_render_minor",
+			dev->node_props.drm_render_minor);
 
 	if (dev->gpu) {
 		log_max_watch_addr =
@@ -1214,6 +1216,8 @@ int kfd_topology_add_device(struct kfd_dev *gpu)
 		dev->gpu->kfd2kgd->get_max_engine_clock_in_mhz(dev->gpu->kgd);
 	dev->node_props.max_engine_clk_ccompute =
 		cpufreq_quick_get_max(0) / 1000;
+	dev->node_props.drm_render_minor =
+		gpu->shared_resources.drm_render_minor;
 
 	kfd_fill_mem_clk_max_info(dev);
 	kfd_fill_iolink_non_crat_info(dev);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
index c0be2be..eb54cfc 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h
@@ -71,6 +71,7 @@ struct kfd_node_properties {
 	uint32_t location_id;
 	uint32_t max_engine_clk_fcompute;
 	uint32_t max_engine_clk_ccompute;
+	int32_t  drm_render_minor;
 	uint16_t marketing_name[KFD_TOPOLOGY_PUBLIC_NAME_SIZE];
 };
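
For completeness, a hedged user-space sketch of how this new property would be consumed; the node index and the minimal parsing are illustrative only:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    /* find the render node that belongs to KFD topology node 1 */
    static int open_render_node(void)
    {
            char path[128], name[64];
            long long val;
            int minor = -1;
            FILE *f;

            f = fopen("/sys/class/kfd/kfd/topology/nodes/1/properties", "r");
            if (!f)
                    return -1;
            while (fscanf(f, "%63s %lld", name, &val) == 2)
                    if (!strcmp(name, "drm_render_minor"))
                            minor = (int)val;
            fclose(f);
            if (minor < 128)        /* render minors start at 128 */
                    return -1;
            snprintf(path, sizeof(path), "/dev/dri/renderD%d", minor);
            return open(path, O_RDWR);
    }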
  


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx