Re: Plan: BO move throttling for visible VRAM evictions

2017-05-19 Thread Michel Dänzer
On 18/05/17 07:22 PM, Marek Olšák wrote:
> On May 18, 2017 10:17 AM, "Michel Dänzer" wrote:
> 
> On 17/05/17 09:35 PM, Marek Olšák wrote:
> > On May 16, 2017 3:57 AM, "Michel Dänzer" wrote:
> > On 15/05/17 07:11 PM, Marek Olšák wrote:
> > > On May 15, 2017 4:29 AM, "Michel Dänzer" wrote:
> > > I think the next step should be to make radeonsi keep track of
> > > how much VRAM it's trying to use that's expected to be accessed
> > > by the CPU, and to use GTT instead when that exceeds a threshold
> > > (probably derived from vram_vis_size).
> > >
> > > That's difficult to estimate. There are apps with 600MB of mapped
> > > VRAM that don't experience any performance issues, and some apps
> > > with 300MB of mapped VRAM that do. It only depends on the CPU
> > > access pattern, not what radeonsi sees.
> >
> > What I mean is keeping track of the total size of resources which
> > have RADEON_DOMAIN_VRAM and RADEON_FLAG_CPU_ACCESS set, and if it
> > exceeds a threshold, create new ones having those flags in GTT
> > instead. Even though this might not be strictly necessary with
> > amdgpu in the long run, it probably is for radeon anyway, and in
> > the short term it might help even with amdgpu.
> >
> >
> > That might hurt us more than it can help.
> 
> You may be right, but I think I'll play with that idea a little anyway
> to see how it goes. :)
> 
> > All mappable buffers have the CPU access flag set, but many of them
> > are immutable.
> 
> You mean they're only written to once by the CPU? We shouldn't set the
> RADEON_FLAG_CPU_ACCESS flag for BOs where we expect that, because it
> will currently prevent them from being in the CPU invisible part of
> VRAM.
> 
> 
> The only thing I can do is set the CPU access flag for persistently
> mapped buffers only.

Something like that might make sense for now.


> We certainly want buffers to go to the invisible part of VRAM if there
> is no CPU access for a certain timeframe. So maybe we shouldn't set the
> flag at all. What do you think?

https://patchwork.freedesktop.org/patch/156991/ allows
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED BOs to be evicted from CPU visible
to invisible VRAM, but I'm not sure yet that's a good idea.
"CPU_ACCESS_REQUIRED" kind of implies CPU access should always be possible.


> > The only place where this can be handled is the kernel.
> 
> Ideally, the placement of a BO should be determined based on how it's
> actually being used by the GPU vs CPU. But I'm not sure how to determine
> that in a useful way.
> 
> CPU page faults are the only way to determine that CPU access is happening.

A page fault only happens the first time (since the BO was last moved)
the CPU tries to access a page. Currently we're not even differentiating
reads vs writes, and we have no idea how much CPU access happens to a
page after it's faulted in.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-18 Thread Marek Olšák
On May 18, 2017 10:17 AM, "Michel Dänzer"  wrote:

On 17/05/17 09:35 PM, Marek Olšák wrote:
> On May 16, 2017 3:57 AM, "Michel Dänzer" wrote:
> On 15/05/17 07:11 PM, Marek Olšák wrote:
> > On May 15, 2017 4:29 AM, "Michel Dänzer" wrote:
> >
> > I think the next step should be to make radeonsi keep track of
> how much
> > VRAM it's trying to use that's expected to be accessed by the
> CPU, and
> > to use GTT instead when that exceeds a threshold (probably
> derived from
> > vram_vis_size).
> >
> > That's difficult to estimate. There are apps with 600MB of mapped
> > VRAM that don't experience any performance issues, and some apps
> > with 300MB of mapped VRAM that do. It only depends on the CPU access
> > pattern, not what radeonsi sees.
>
> What I mean is keeping track of the total size of resources which have
> RADEON_DOMAIN_VRAM and RADEON_FLAG_CPU_ACCESS set, and if it exceeds a
> threshold, create new ones having those flags in GTT instead. Even
> though this might not be strictly necessary with amdgpu in the long run,
> it probably is for radeon anyway, and in the short term it might help
> even with amdgpu.
>
>
> That might hurt us more than it can help.

You may be right, but I think I'll play with that idea a little anyway
to see how it goes. :)

> All mappable buffers have the CPU access flag set, but many of them are
> immutable.

You mean they're only written to once by the CPU? We shouldn't set the
RADEON_FLAG_CPU_ACCESS flag for BOs where we expect that, because it
will currently prevent them from being in the CPU invisible part of VRAM.


The only thing I can do is set the CPU access flag for persistently mapped
buffers only. We certainly want buffers to go to the invisible part of VRAM
if there is no CPU access for a certain timeframe. So maybe we shouldn't
set the flag at all. What do you think?

The truth is we have no way to know what apps intend to do with any buffers.
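
A minimal sketch of that "persistently mapped only" option, assuming we
key off Gallium's persistent-mapping flag in r600_init_resource_fields()
(the condition is an assumption for illustration, not existing radeonsi
code):

/* Only request CPU-accessible VRAM for buffers the app asked to map
 * persistently; everything else may end up in invisible VRAM. */
if (res->domains == RADEON_DOMAIN_VRAM) {
	res->flags &= ~RADEON_FLAG_CPU_ACCESS;
	if (res->b.b.flags & PIPE_RESOURCE_FLAG_MAP_PERSISTENT)
		res->flags |= RADEON_FLAG_CPU_ACCESS;
}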



> The only place where this can be handled is the kernel.

Ideally, the placement of a BO should be determined based on how it's
actually being used by the GPU vs CPU. But I'm not sure how to determine
that in a useful way.


CPU page faults are the only way to determine that CPU access is happening.


> Even if it's as simple as: if (bo->numcpufaults > 10) domain = GTT_WC;

I'm skeptical about the number of CPU page faults per se being a useful
metric. It doesn't tell us much about how the BO is used even by the
CPU, let alone the GPU. But let's see where this leads you.


It tells us more than what Mesa can ever know, which is nothing.

Marek



One thing that might help would be if we could swap individual memory
nodes between visible and invisible VRAM for CPU page faults, instead of
moving/evicting whole BOs. Christian, do you think something like that
would be possible?


Another idea (to avoid issues such as the recent one with Rocket League)
was to make VRAM CPU mappings write-only, and move the BO to GTT if
there's a read fault. But not sure if this is possible at all, or how
much effort it would be.


--
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-18 Thread Christian König

On 18.05.2017 10:17, Michel Dänzer wrote:

[SNIP]
One thing that might help would be if we could swap individual memory
nodes between visible and invisible VRAM for CPU page faults, instead of
moving/evicting whole BOs. Christian, do you think something like that
would be possible?
I've looked into this while working on splitting VRAM allocations into 
smaller nodes and the answer is nope, not at all.


We would pretty much need to rewrite the whole of TTM from BO based to
page based. Otherwise you just have way too much overhead with all those
ttm objects around.


And when we really want this step I would rather wait for HMM to land 
and use that instead.


Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-18 Thread Michel Dänzer
On 17/05/17 09:35 PM, Marek Olšák wrote:
> On May 16, 2017 3:57 AM, "Michel Dänzer" wrote:
> On 15/05/17 07:11 PM, Marek Olšák wrote:
> > On May 15, 2017 4:29 AM, "Michel Dänzer" wrote:
> >
> > I think the next step should be to make radeonsi keep track of
> how much
> > VRAM it's trying to use that's expected to be accessed by the
> CPU, and
> > to use GTT instead when that exceeds a threshold (probably
> derived from
> > vram_vis_size).
> >
> > That's difficult to estimate. There are apps with 600MB of mapped VRAM
> > that don't experience any performance issues, and some apps with 300MB
> > of mapped VRAM that do. It only depends on the CPU access pattern, not
> > what radeonsi sees.
> 
> What I mean is keeping track of the total size of resources which have
> RADEON_DOMAIN_VRAM and RADEON_FLAG_CPU_ACCESS set, and if it exceeds a
> threshold, create new ones having those flags in GTT instead. Even
> though this might not be strictly necessary with amdgpu in the long run,
> it probably is for radeon anyway, and in the short term it might help
> even with amdgpu.
> 
> 
> That might hurt us more than it can help.

You may be right, but I think I'll play with that idea a little anyway
to see how it goes. :)

> All mappable buffers have the CPU access flag set, but many of them are
> immutable.

You mean they're only written to once by the CPU? We shouldn't set the
RADEON_FLAG_CPU_ACCESS flag for BOs where we expect that, because it
will currently prevent them from being in the CPU invisible part of VRAM.


> The only place where this can be handled is the kernel.

Ideally, the placement of a BO should be determined based on how it's
actually being used by the GPU vs CPU. But I'm not sure how to determine
that in a useful way.

> Even if it's as simple as: if (bo->numcpufaults > 10) domain = GTT_WC;

I'm skeptical about the number of CPU page faults per se being a useful
metric. It doesn't tell us much about how the BO is used even by the
CPU, let alone the GPU. But let's see where this leads you.


One thing that might help would be if we could swap individual memory
nodes between visible and invisible VRAM for CPU page faults, instead of
moving/evicting whole BOs. Christian, do you think something like that
would be possible?


Another idea (to avoid issues such as the recent one with Rocket League)
was to make VRAM CPU mappings write-only, and move the BO to GTT if
there's a read fault. But not sure if this is possible at all, or how
much effort it would be.
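
If it can be done at all, the fault-handler side might look roughly like
this sketch (big assumptions: a driver-level CPU fault hook, reservation
locking elided, and x86 page tables cannot actually express write-only
mappings, which is part of the doubt):

/* Sketch only: on a *read* fault against a VRAM BO, migrate it to GTT
 * instead of pulling it into CPU-visible VRAM. FAULT_FLAG_WRITE is how
 * the core mm reports whether the access was a write. */
static int amdgpu_bo_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct ttm_buffer_object *bo = vma->vm_private_data;
	struct amdgpu_bo *abo = container_of(bo, struct amdgpu_bo, tbo);

	if (bo->mem.mem_type == TTM_PL_VRAM &&
	    !(vmf->flags & FAULT_FLAG_WRITE)) {
		/* CPU reads from write-combined VRAM are what really hurt. */
		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
		ttm_bo_validate(bo, &abo->placement, false, false);
	}

	return ttm_bo_vm_fault(vma, vmf); /* continue with the normal path */
}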


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-17 Thread Marek Olšák
On May 16, 2017 3:57 AM, "Michel Dänzer"  wrote:

On 15/05/17 07:11 PM, Marek Olšák wrote:
> On May 15, 2017 4:29 AM, "Michel Dänzer" wrote:
>
> I think the next step should be to make radeonsi keep track of how much
> VRAM it's trying to use that's expected to be accessed by the CPU, and
> to use GTT instead when that exceeds a threshold (probably derived from
> vram_vis_size).
>
> That's difficult to estimate. There are apps with 600MB of mapped VRAM
> that don't experience any performance issues, and some apps with 300MB
> of mapped VRAM that do. It only depends on the CPU access pattern, not
> what radeonsi sees.

What I mean is keeping track of the total size of resources which have
RADEON_DOMAIN_VRAM and RADEON_FLAG_CPU_ACCESS set, and if it exceeds a
threshold, create new ones having those flags in GTT instead. Even
though this might not be strictly necessary with amdgpu in the long run,
it probably is for radeon anyway, and in the short term it might help
even with amdgpu.


That might hurt us more than it can help. All mappable buffers have the CPU
access flag set, but many of them are immutable.

The only place where this can be handled is the kernel. Even if it's as
simple as: if (bo->numcpufaults > 10) domain = GTT_WC;
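
Something like this sketch at the top of the fault path, presumably
(num_cpu_faults would be a new field in struct amdgpu_bo; the threshold
and the early return are assumptions, not existing code):

/* At the top of amdgpu_bo_fault_reserve_notify() (sketch): */
struct amdgpu_bo *abo = container_of(bo, struct amdgpu_bo, tbo);

if (++abo->num_cpu_faults > 10) {
	/* Stop bouncing this BO into visible VRAM; let it live in
	 * (write-combined) GTT from now on. */
	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
	return ttm_bo_validate(bo, &abo->placement, false, false);
}
/* ... otherwise continue with the existing move into visible VRAM ... */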

Marek



--
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-14 Thread zhoucm1



On 2017-05-14 05:31, Marek Olšák wrote:

On Mon, Apr 17, 2017 at 11:55 AM, Michel Dänzer  wrote:

On 17/04/17 07:58 AM, Marek Olšák wrote:

On Fri, Apr 14, 2017 at 12:14 PM, Michel Dänzer  wrote:

On 04/04/17 05:11 AM, Marek Olšák wrote:

On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:

On 30/03/17 07:03 PM, Michel Dänzer wrote:

On 25/03/17 01:33 AM, Marek Olšák wrote:

Hi,

I'm sharing this idea here, because it's something that has been
decreasing our performance a lot recently, for example:
http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

The attached proof-of-concept patch (on top of Christian's "CPU mapping
of split VRAM buffers" series, ported from radeon) results in 145.05 fps
on my Tonga.

I get the same result without my or Christian's patches though, with
4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
problem with this test. Are there any other tests for it?

It's random. Sometimes the benchmark runs OK, other times it's slow.
You can easily see the difference by observing how smooth it is. The
visible VRAM evictions result in constant 100-200ms stalls but not
every frame, which feels like the frame rate is much lower than it
actually is.

Make sure your graphics details are maxed out. The best score I can
get with my rig is 70 fps. (Fiji & Core i5 3570)

I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
and Git.

Have you tried if Christian's patches for CPU access to split VRAM
buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
access could cause lots of other BOs to be unnecessarily evicted from
VRAM, if at least one of their fragments happens to be in the CPU
visible part of VRAM.

I've finally tested latest amd-staging-4.9 and I'm very pleased. For
the first time, the Deus Ex benchmark has almost no hiccups. I've
never seen it so smooth. At one point, the MB/s BO move rate increased
to 200MB/s, stayed there for a couple of seconds, and then it dropped
to 0 again. The frame rate was OK-ish, so I guess the moves didn't
happen all at once. I also tested DiRT Rally and I haven't been able
to reproduce the low FPS with the consistently-high BO move rate that
I saw several months ago.

We could do some move throttling there for sure, but it's much better
than it ever was.

That's great to hear. If you get a chance, it would be interesting if
the attached updated patch improves things even more for you. (The patch
I attached previously couldn't work as intended, this one at least might :)

Frogging101 on IRC noticed that we get a ton of TTM BO moves due to
visible VRAM thrashing and Michel's patch doesn't help. His kernel is
up to date with amd-staging. It looks like the only option left is my
original plan: BO move throttling for visible VRAM by redirecting
mapped buffers to GTT and not allowing them to go back to VRAM if some
counter is too high.
I agree with this opinion. From our performance tuning experience, this
case indeed happens often, especially under VRAM memory pressure.
Redirecting to GTT is better than heavy eviction between VRAM and GTT.
But we should have a condition for redirecting (an eviction counter?),
otherwise BOs have no chance to get back to their preferred domain.


Regards,
David Zhou


Opinions?

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-14 Thread Michel Dänzer
On 14/05/17 06:31 AM, Marek Olšák wrote:
> On Mon, Apr 17, 2017 at 11:55 AM, Michel Dänzer  wrote:
>> On 17/04/17 07:58 AM, Marek Olšák wrote:
>>> On Fri, Apr 14, 2017 at 12:14 PM, Michel Dänzer  wrote:
 On 04/04/17 05:11 AM, Marek Olšák wrote:
> On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
>> On 30/03/17 07:03 PM, Michel Dänzer wrote:
>>> On 25/03/17 01:33 AM, Marek Olšák wrote:
 Hi,

 I'm sharing this idea here, because it's something that has been
 decreasing our performance a lot recently, for example:
 http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>>
>>> The attached proof-of-concept patch (on top of Christian's "CPU mapping
>>> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
>>> on my Tonga.
>>
>> I get the same result without my or Christian's patches though, with
>> 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
>> problem with this test. Are there any other tests for it?
>
> It's random. Sometimes the benchmark runs OK, other times it's slow.
> You can easily see the difference by observing how smooth it is. The
> visible VRAM evictions result in constant 100-200ms stalls but not
> every frame, which feels like the frame rate is much lower than it
> actually is.
>
> Make sure your graphics details are maxed out. The best score I can
> get with my rig is 70 fps. (Fiji & Core i5 3570)

 I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
 and Git.

 Have you tried if Christian's patches for CPU access to split VRAM
 buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
 access could cause lots of other BOs to be unnecessarily evicted from
 VRAM, if at least one of their fragments happens to be in the CPU
 visible part of VRAM.
>>>
>>> I've finally tested latest amd-staging-4.9 and I'm very pleased. For
>>> the first time, the Deus Ex benchmark has almost no hiccups. I've
>>> never seen it so smooth. At one point, the MB/s BO move rate increased
>>> to 200MB/s, stayed there for a couple of seconds, and then it dropped
>>> to 0 again. The frame rate was OK-ish, so I guess the moves didn't
>>> happen all at once. I also tested DiRT Rally and I haven't been able
>>> to reproduce the low FPS with the consistently-high BO move rate that
>>> I saw several months ago.
>>>
>>> We could do some move throttling there for sure, but it's much better
>>> than it ever was.
>>
>> That's great to hear. If you get a chance, it would be interesting if
>> the attached updated patch improves things even more for you. (The patch
>> I attached previously couldn't work as intended, this one at least might :)
> 
> Frogging101 on IRC noticed that we get a ton of TTM BO moves due to
> visible VRAM thrashing and Michel's patch doesn't help. His kernel is
> up to date with amd-staging. It looks like the only option left is my
> original plan: BO move throttling for visible VRAM by redirecting
> mapped buffers to GTT and not allowing them to go back to VRAM if some
> counter is too high.
> 
> Opinions?

I think the next step should be to make radeonsi keep track of how much
VRAM it's trying to use that's expected to be accessed by the CPU, and
to use GTT instead when that exceeds a threshold (probably derived from
vram_vis_size).
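
Concretely, the tracking could be as simple as this sketch in
r600_init_resource_fields() (the screen-wide counter and the 3/4
threshold are assumptions for illustration; vram_vis_size comes from the
winsys info):

/* Fall back to GTT once the running total of CPU-visible VRAM
 * requests exceeds a threshold derived from vram_vis_size. */
if (res->domains == RADEON_DOMAIN_VRAM &&
    (res->flags & RADEON_FLAG_CPU_ACCESS) &&
    rscreen->cpu_access_vram_bytes + res->b.b.width0 >
    rscreen->info.vram_vis_size * 3 / 4) {
	res->domains = RADEON_DOMAIN_GTT;
	res->flags |= RADEON_FLAG_GTT_WC;
} else if (res->flags & RADEON_FLAG_CPU_ACCESS) {
	/* cpu_access_vram_bytes is an assumed new field, decremented
	 * again when the resource is destroyed. */
	p_atomic_add(&rscreen->cpu_access_vram_bytes, res->b.b.width0);
}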


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-05-13 Thread Marek Olšák
On Mon, Apr 17, 2017 at 11:55 AM, Michel Dänzer  wrote:
> On 17/04/17 07:58 AM, Marek Olšák wrote:
>> On Fri, Apr 14, 2017 at 12:14 PM, Michel Dänzer  wrote:
>>> On 04/04/17 05:11 AM, Marek Olšák wrote:
 On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
> On 30/03/17 07:03 PM, Michel Dänzer wrote:
>> On 25/03/17 01:33 AM, Marek Olšák wrote:
>>> Hi,
>>>
>>> I'm sharing this idea here, because it's something that has been
>>> decreasing our performance a lot recently, for example:
>>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>
>> The attached proof-of-concept patch (on top of Christian's "CPU mapping
>> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
>> on my Tonga.
>
> I get the same result without my or Christian's patches though, with
> 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
> problem with this test. Are there any other tests for it?

 It's random. Sometimes the benchmark runs OK, other times it's slow.
 You can easily see the difference by observing how smooth it is. The
 visible VRAM evictions result in constant 100-200ms stalls but not
 every frame, which feels like the frame rate is much lower than it
 actually is.

 Make sure your graphics details are maxed out. The best score I can
 get with my rig is 70 fps. (Fiji & Core i5 3570)
>>>
>>> I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
>>> and Git.
>>>
>>> Have you tried if Christian's patches for CPU access to split VRAM
>>> buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
>>> access could cause lots of other BOs to be unnecessarily evicted from
>>> VRAM, if at least one of their fragments happens to be in the CPU
>>> visible part of VRAM.
>>
>> I've finally tested latest amd-staging-4.9 and I'm very pleased. For
>> the first time, the Deus Ex benchmark has almost no hiccups. I've
>> never seen it so smooth. At one point, the MB/s BO move rate increased
>> to 200MB/s, stayed there for a couple of seconds, and then it dropped
>> to 0 again. The frame rate was OK-ish, so I guess the moves didn't
>> happen all at once. I also tested DiRT Rally and I haven't been able
>> to reproduce the low FPS with the consistently-high BO move rate that
>> I saw several months ago.
>>
>> We could do some move throttling there for sure, but it's much better
>> than it ever was.
>
> That's great to hear. If you get a chance, it would be interesting if
> the attached updated patch improves things even more for you. (The patch
> I attached previously couldn't work as intended, this one at least might :)

Frogging101 on IRC noticed that we get a ton of TTM BO moves due to
visible VRAM thrashing and Michel's patch doesn't help. His kernel is
up to date with amd-staging. It looks like the only option left is my
original plan: BO move throttling for visible VRAM by redirecting
mapped buffers to GTT and not allowing them to go back to VRAM if some
counter is too high.
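
In rough kernel terms, that would amount to something like this sketch
(the counter, the limit and had_cpu_page_fault are all hypothetical
names, not existing amdgpu code):

/* In amdgpu_bo_fault_reserve_notify(): */
abo->had_cpu_page_fault = true;
if (atomic64_read(&adev->vis_vram_moved_bytes) > VIS_VRAM_MOVE_LIMIT) {
	/* Over budget: redirect the BO to GTT instead of visible VRAM. */
	amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
} else {
	atomic64_add(amdgpu_bo_size(abo), &adev->vis_vram_moved_bytes);
	/* ... existing path: move into CPU-visible VRAM ... */
}

/* In CS ioctl validation: don't let such a BO back into VRAM while the
 * recent visible VRAM move rate is still too high. */
if (abo->had_cpu_page_fault &&
    atomic64_read(&adev->vis_vram_moved_bytes) > VIS_VRAM_MOVE_LIMIT)
	domain &= ~AMDGPU_GEM_DOMAIN_VRAM;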

Opinions?

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-04-17 Thread Michel Dänzer
On 17/04/17 07:58 AM, Marek Olšák wrote:
> On Fri, Apr 14, 2017 at 12:14 PM, Michel Dänzer  wrote:
>> On 04/04/17 05:11 AM, Marek Olšák wrote:
>>> On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
 On 30/03/17 07:03 PM, Michel Dänzer wrote:
> On 25/03/17 01:33 AM, Marek Olšák wrote:
>> Hi,
>>
>> I'm sharing this idea here, because it's something that has been
>> decreasing our performance a lot recently, for example:
>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>
> The attached proof-of-concept patch (on top of Christian's "CPU mapping
> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
> on my Tonga.

 I get the same result without my or Christian's patches though, with
 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
 problem with this test. Are there any other tests for it?
>>>
>>> It's random. Sometimes the benchmark runs OK, other times it's slow.
>>> You can easily see the difference by observing how smooth it is. The
>>> visible VRAM evictions result in constant 100-200ms stalls but not
>>> every frame, which feels like the frame rate is much lower than it
>>> actually is.
>>>
>>> Make sure your graphics details are maxed out. The best score I can
>>> get with my rig is 70 fps. (Fiji & Core i5 3570)
>>
>> I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
>> and Git.
>>
>> Have you tried if Christian's patches for CPU access to split VRAM
>> buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
>> access could cause lots of other BOs to be unnecessarily evicted from
>> VRAM, if at least one of their fragments happens to be in the CPU
>> visible part of VRAM.
> 
> I've finally tested latest amd-staging-4.9 and I'm very pleased. For
> the first time, the Deus Ex benchmark has almost no hiccups. I've
> never seen it so smooth. At one point, the MB/s BO move rate increased
> to 200MB/s, stayed there for a couple of seconds, and then it dropped
> to 0 again. The frame rate was OK-ish, so I guess the moves didn't
> happen all at once. I also tested DiRT Rally and I haven't been able
> to reproduce the low FPS with the consistently-high BO move rate that
> I saw several months ago.
> 
> We could do some move throttling there for sure, but it's much better
> than it ever was.

That's great to hear. If you get a chance, it would be interesting if
the attached updated patch improves things even more for you. (The patch
I attached previously couldn't work as intended, this one at least might :)


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 7f9710502bcc..78362e09cc51 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -205,7 +205,44 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
 	case TTM_PL_VRAM:
 		if (adev->mman.buffer_funcs_ring->ready == false) {
 			amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_CPU);
+		} else if (adev->mc.visible_vram_size < adev->mc.real_vram_size) {
+			unsigned fpfn = adev->mc.visible_vram_size >> PAGE_SHIFT;
+			int i;
+
+			if (bo->mem.start >= fpfn) {
+				struct drm_mm_node *node = bo->mem.mm_node;
+				unsigned long pages_left;
+
+				for (pages_left = bo->mem.num_pages; pages_left;
+				     pages_left -= node->size, node++) {
+					if (node->start < fpfn)
+						break;
+				}
+
+				if (!pages_left)
+					goto gtt;
+			}
+
+			/* Try evicting to the CPU inaccessible part of VRAM
+			 * first, but only set GTT as busy placement, so this
+			 * BO will be evicted to GTT rather than causing other
+			 * BOs to be evicted from VRAM
+			 */
+			amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
+							 AMDGPU_GEM_DOMAIN_GTT);
+			abo->placement.num_busy_placement = 0;
+			for (i = 0; i < abo->placement.num_placement; i++) {
+				if (abo->placements[i].flags & TTM_PL_FLAG_VRAM) {
+					if (abo->placements[i].fpfn < fpfn)
+						abo->placements[i].fpfn = fpfn;
+				} else {
+					abo->placement.busy_placement =
+						&abo->placements[i];
+					abo->placement.num_busy_placement = 1;
+				}
+			}
 		} else {
+gtt:
 			amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
 			for (i = 0; i < abo->placement.num_placement; ++i) {
 if (!(abo->placements[i].flags &
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-04-16 Thread Marek Olšák
On Fri, Apr 14, 2017 at 12:14 PM, Michel Dänzer  wrote:
> On 04/04/17 05:11 AM, Marek Olšák wrote:
>> On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
>>> On 30/03/17 07:03 PM, Michel Dänzer wrote:
 On 25/03/17 01:33 AM, Marek Olšák wrote:
> Hi,
>
> I'm sharing this idea here, because it's something that has been
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

 The attached proof-of-concept patch (on top of Christian's "CPU mapping
 of split VRAM buffers" series, ported from radeon) results in 145.05 fps
 on my Tonga.
>>>
>>> I get the same result without my or Christian's patches though, with
>>> 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
>>> problem with this test. Are there any other tests for it?
>>
>> It's random. Sometimes the benchmark runs OK, other times it's slow.
>> You can easily see the difference by observing how smooth it is. The
>> visible VRAM evictions result in constant 100-200ms stalls but not
>> every frame, which feels like the frame rate is much lower than it
>> actually is.
>>
>> Make sure your graphics details are maxed out. The best score I can
>> get with my rig is 70 fps. (Fiji & Core i5 3570)
>
> I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
> and Git.
>
> Have you tried if Christian's patches for CPU access to split VRAM
> buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
> access could cause lots of other BOs to be unnecessarily evicted from
> VRAM, if at least one of their fragments happens to be in the CPU
> visible part of VRAM.

I've finally tested latest amd-staging-4.9 and I'm very pleased. For
the first time, the Deus Ex benchmark has almost no hiccups. I've
never seen it so smooth. At one point, the MB/s BO move rate increased
to 200MB/s, stayed there for a couple of seconds, and then it dropped
to 0 again. The frame rate was OK-ish, so I guess the moves didn't
happen all at once. I also tested DiRT Rally and I haven't been able
to reproduce the low FPS with the consistently-high BO move rate that
I saw several months ago.

We could do some move throttling there for sure, but it's much better
than it ever was.

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-04-14 Thread Michel Dänzer
On 04/04/17 05:11 AM, Marek Olšák wrote:
> On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
>> On 30/03/17 07:03 PM, Michel Dänzer wrote:
>>> On 25/03/17 01:33 AM, Marek Olšák wrote:
 Hi,

 I'm sharing this idea here, because it's something that has been
 decreasing our performance a lot recently, for example:
 http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>>
>>> The attached proof-of-concept patch (on top of Christian's "CPU mapping
>>> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
>>> on my Tonga.
>>
>> I get the same result without my or Christian's patches though, with
>> 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
>> problem with this test. Are there any other tests for it?
> 
> It's random. Sometimes the benchmark runs OK, other times it's slow.
> You can easily see the difference by observing how smooth it is. The
> visible VRAM evictions result in constant 100-200ms stalls but not
> every frame, which feels like the frame rate is much lower than it
> actually is.
> 
> Make sure your graphics details are maxed out. The best score I can
> get with my rig is 70 fps. (Fiji & Core i5 3570)

I'm getting around 53-54 fps at Ultra with Tonga, both with Mesa 13.0.6
and Git.

Have you tried if Christian's patches for CPU access to split VRAM
buffers help? I can imagine that forcing contiguous VRAM buffers for CPU
access could cause lots of other BOs to be unnecessarily evicted from
VRAM, if at least one of their fragments happens to be in the CPU
visible part of VRAM.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-04-03 Thread Marek Olšák
On Fri, Mar 31, 2017 at 5:24 AM, Michel Dänzer  wrote:
> On 30/03/17 07:03 PM, Michel Dänzer wrote:
>> On 25/03/17 01:33 AM, Marek Olšák wrote:
>>> Hi,
>>>
>>> I'm sharing this idea here, because it's something that has been
>>> decreasing our performance a lot recently, for example:
>>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>
>> The attached proof-of-concept patch (on top of Christian's "CPU mapping
>> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
>> on my Tonga.
>
> I get the same result without my or Christian's patches though, with
> 4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
> problem with this test. Are there any other tests for it?

It's random. Sometimes the benchmark runs OK, other times it's slow.
You can easily see the difference by observing how smooth it is. The
visible VRAM evictions result in constant 100-200ms stalls but not
every frame, which feels like the frame rate is much lower than it
actually is.

Make sure your graphics details are maxed out. The best score I can
get with my rig is 70 fps. (Fiji & Core i5 3570)

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-30 Thread Michel Dänzer
On 30/03/17 07:03 PM, Michel Dänzer wrote:
> On 25/03/17 01:33 AM, Marek Olšák wrote:
>> Hi,
>>
>> I'm sharing this idea here, because it's something that has been
>> decreasing our performance a lot recently, for example:
>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
> 
> The attached proof-of-concept patch (on top of Christian's "CPU mapping
> of split VRAM buffers" series, ported from radeon) results in 145.05 fps
> on my Tonga.

I get the same result without my or Christian's patches though, with
4.11 based DRM or amd-staging-4.9. So I guess I just can't reproduce the
problem with this test. Are there any other tests for it?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-30 Thread Michel Dänzer
On 25/03/17 01:33 AM, Marek Olšák wrote:
> Hi,
> 
> I'm sharing this idea here, because it's something that has been
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

The attached proof-of-concept patch (on top of Christian's "CPU mapping
of split VRAM buffers" series, ported from radeon) results in 145.05 fps
on my Tonga.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 7f9710502bcc..0bb9c0059497 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -205,6 +205,29 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo,
 	case TTM_PL_VRAM:
 		if (adev->mman.buffer_funcs_ring->ready == false) {
 			amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_CPU);
+		} else if (adev->mc.visible_vram_size < adev->mc.real_vram_size &&
+			   bo->mem.start < (adev->mc.visible_vram_size >> PAGE_SHIFT)) {
+			unsigned fpfn = adev->mc.visible_vram_size >> PAGE_SHIFT;
+			int i;
+
+			/* Try evicting to the CPU inaccessible part of VRAM
+			 * first, but only set GTT as busy placement, so this
+			 * BO will be evicted to GTT rather than causing other
+			 * BOs to be evicted from VRAM
+			 */
+			amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM |
+							 AMDGPU_GEM_DOMAIN_GTT);
+			abo->placement.num_busy_placement = 0;
+			for (i = 0; i < abo->placement.num_placement; i++) {
+				if (abo->placements[i].flags & TTM_PL_FLAG_VRAM) {
+					if (abo->placements[i].fpfn < fpfn)
+						abo->placements[i].fpfn = fpfn;
+				} else {
+					abo->placement.busy_placement =
+						&abo->placements[i];
+					abo->placement.num_busy_placement = 1;
+				}
+			}
 		} else {
 			amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
 			for (i = 0; i < abo->placement.num_placement; ++i) {
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Christian König

On 28.03.2017 15:58, Alex Deucher wrote:

On Tue, Mar 28, 2017 at 4:29 AM, Christian König wrote:

On 28.03.2017 08:00, Michel Dänzer wrote:

On 28/03/17 12:50 PM, zhoucm1 wrote:

On 2017-03-28 10:40, Michel Dänzer wrote:

On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:

For APU special case, can we prevent eviction happening between VRAM
<> GTT?

We can, if we can close the performance gap between VRAM and GTT. We
measured around 30% gap a while ago, though right now I'm only measuring
~5%, but the test system has slower RAM now (still dual channel though).

My impression is that VRAM and GTT don't have much difference in the APU
case; if I'm wrong, please correct me.

The Mesa patch below makes radeonsi use mostly GTT instead of mostly
VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
You can try it for yourself.


In addition to that, you still need the stolen VRAM on APUs for page tables
and DCE.

Actually DCE on CZ/BR/ST can use GTT as well, we just don't currently
allow it.  Older APUs did require "vram" however.


I'm not so deep into that, but as far as I understand it at least one 
DCE fetch unit needs to be contiguous.


So you need 64K pages (or at minimum 8/16K pages) for your GTT 
implementation to support that if I understood it correctly.


Christian.



Alex


So we need to keep the eviction from VRAM to GTT enabled, but what we don't
do is swapping them back in because Marek added the GTT flags on APUs as
extra domain to look into.

In other words, once BOs are removed from VRAM, they are only swapped back
in if the hardware needs it.

Regards,
Christian.



diff --git a/src/gallium/drivers/radeon/r600_buffer_common.c b/src/gallium/drivers/radeon/r600_buffer_common.c
index da6f0206d7..47661cab76 100644
--- a/src/gallium/drivers/radeon/r600_buffer_common.c
+++ b/src/gallium/drivers/radeon/r600_buffer_common.c
@@ -175,9 +175,11 @@ void r600_init_resource_fields(struct r600_common_screen *rscreen,
 	 * placements even with a low amount of stolen VRAM.
 	 */
 	if (!rscreen->info.has_dedicated_vram &&
-	    (rscreen->info.drm_major < 3 || rscreen->info.drm_minor < 6) &&
-	    res->domains == RADEON_DOMAIN_VRAM)
-		res->domains = RADEON_DOMAIN_VRAM_GTT;
+	    res->domains == RADEON_DOMAIN_VRAM &&
+	    !(res->b.b.bind & PIPE_BIND_SCANOUT)) {
+		res->domains = RADEON_DOMAIN_GTT;
+		res->flags |= RADEON_FLAG_GTT_WC;
+	}
 
 	if (rscreen->debug_flags & DBG_NO_WC)
 		res->flags &= ~RADEON_FLAG_GTT_WC;




___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Alex Deucher
On Tue, Mar 28, 2017 at 4:29 AM, Christian König wrote:
> On 28.03.2017 08:00, Michel Dänzer wrote:
>>
>> On 28/03/17 12:50 PM, zhoucm1 wrote:
>>>
>>> On 2017-03-28 10:40, Michel Dänzer wrote:

 On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:
>
> For APU special case, can we prevent eviction happening between VRAM
> <> GTT?

 We can, if we can close the performance gap between VRAM and GTT. We
 measured around 30% gap a while ago, though right now I'm only measuring
 ~5%, but the test system has slower RAM now (still dual channel though).
>>>
>>> My impression is that VRAM and GTT don't have much difference in the APU
>>> case; if I'm wrong, please correct me.
>>
>> The Mesa patch below makes radeonsi use mostly GTT instead of mostly
>> VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
>> You can try it for yourself.
>
>
> In addition to that, you still need the stolen VRAM on APUs for page tables
> and DCE.

Actually DCE on CZ/BR/ST can use GTT as well, we just don't currently
allow it.  Older APUs did require "vram" however.

Alex

>
> So we need to keep the eviction from VRAM to GTT enabled, but what we don't
> do is swapping them back in because Marek added the GTT flags on APUs as
> extra domain to look into.
>
> In other words, once BOs are removed from VRAM, they are only swapped back
> in if the hardware needs it.
>
> Regards,
> Christian.
>
>>
>>
>> diff --git a/src/gallium/drivers/radeon/r600_buffer_common.c b/src/gallium/drivers/radeon/r600_buffer_common.c
>> index da6f0206d7..47661cab76 100644
>> --- a/src/gallium/drivers/radeon/r600_buffer_common.c
>> +++ b/src/gallium/drivers/radeon/r600_buffer_common.c
>> @@ -175,9 +175,11 @@ void r600_init_resource_fields(struct r600_common_screen *rscreen,
>>  	 * placements even with a low amount of stolen VRAM.
>>  	 */
>>  	if (!rscreen->info.has_dedicated_vram &&
>> -	    (rscreen->info.drm_major < 3 || rscreen->info.drm_minor < 6) &&
>> -	    res->domains == RADEON_DOMAIN_VRAM)
>> -		res->domains = RADEON_DOMAIN_VRAM_GTT;
>> +	    res->domains == RADEON_DOMAIN_VRAM &&
>> +	    !(res->b.b.bind & PIPE_BIND_SCANOUT)) {
>> +		res->domains = RADEON_DOMAIN_GTT;
>> +		res->flags |= RADEON_FLAG_GTT_WC;
>> +	}
>> 
>>  	if (rscreen->debug_flags & DBG_NO_WC)
>>  		res->flags &= ~RADEON_FLAG_GTT_WC;
>>
>>
>>
>
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Marek Olšák
On Mar 28, 2017 3:07 AM, "Michel Dänzer"  wrote:

On 27/03/17 07:29 PM, Marek Olšák wrote:
> On Mar 27, 2017 9:35 AM, "Michel Dänzer" wrote:
>
> On 25/03/17 01:33 AM, Marek Olšák wrote:
> > Hi,
> >
> > I'm sharing this idea here, because it's something that has been
> > decreasing our performance a lot recently, for example:
> >
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
> 
> >
> > I think the problem there is that Mesa git started uploading
> > descriptors and uniforms to VRAM, which helps when TC L2 has a low
> > hit/miss ratio, but the performance can randomly drop by an order of
> > magnitude. I've heard rumours that kernel 4.11 has an improved
> > allocator that should perform better, but the situation is still far
> > from ideal.
> >
> > AMD CPUs and APUs will hopefully suffer less, because we can resize
> > the visible VRAM with the help of our CPU hw specs, but Intel CPUs
> > will remain limited to 256 MB. The following plan describes how to do
> > throttling for visible VRAM evictions.
> >
> >
> > 1) Theory
> >
> > Initially, the driver doesn't care about where buffers are in VRAM,
> > because VRAM buffers are only moved to visible VRAM on CPU page faults
> > (when the CPU touches the buffer memory but the memory is in the
> > invisible part of VRAM). When it happens,
> > amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
> > visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
> > also marks the buffer as contiguous, which makes memory fragmentation
> > worse.
> >
> > I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
> > was much higher in a CPU profiler than anything else in the kernel.
> >
> >
> > 2) Monitoring via Gallium HUD
> >
> > We need to expose 2 kernel counters via the INFO ioctl and display
> > those via Gallium HUD:
> > - The number of VRAM CPU page faults. (the number of calls to
> > amdgpu_bo_fault_reserve_notify).
> > - The number of bytes moved by ttm_bo_validate inside
> > amdgpu_bo_fault_reserve_notify.
> >
> > This will help us observe what exactly is happening and fine-tune the
> > throttling when it's done.
> >
> >
> > 3) Solution
> >
> > a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
> > (amdgpu_bo::had_cpu_page_fault = true)
> >
> > b) Monitor the MB/s rate at which buffers are moved by
> > amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
> > don't move the buffer to visible VRAM. Move it to GTT instead. Note
> > that moving to GTT can be cheaper, because moving to visible VRAM is
> > likely to evict a lot of buffers there and unmap them from the CPU,
>
> FWIW, this can be avoided by only setting GTT in busy_placement. Then
> TTM will only move the BO to visible VRAM if that can be done without
> evicting anything from there.
>
>
> > but moving to GTT shouldn't evict or unmap anything.
> >
> > c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
> > it can be moved to VRAM if:
> > - the GTT->VRAM move rate is low enough to allow it (this is the
> > existing throttling mechanism)
> > - the visible VRAM move rate is low enough that we will be OK with
> > another CPU page fault if it happens.
>
> Some other ideas that might be worth trying:
>
> Evicting BOs to GTT instead of moving them to CPU accessible VRAM in
> principle in some cases (e.g. for all BOs except those with
> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.
>
>
> I've tried this and it made things even worse.

What exactly did you try?


I only set the placement to GTT, but I think I kept the contiguous flag.

Marek



--
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Marek Olšák
On Mar 28, 2017 10:41 AM, "Christian König"  wrote:

On 28.03.2017 10:35, Michel Dänzer wrote:

> On 28/03/17 05:29 PM, Christian König wrote:
>
>> On 28.03.2017 08:00, Michel Dänzer wrote:
>>
>>> On 28/03/17 12:50 PM, zhoucm1 wrote:
>>>
 On 2017-03-28 10:40, Michel Dänzer wrote:

> On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:
>
>> For APU special case, can we prevent eviction happening between VRAM
>> <> GTT?
>>
> We can, if we can close the performance gap between VRAM and GTT. We
> measured around 30% gap a while ago, though right now I'm only
> measuring
> ~5%, but the test system has slower RAM now (still dual channel
> though).
>
 My impression is that VRAM and GTT don't have much difference in the APU
 case; if I'm wrong, please correct me.

>>> The Mesa patch below makes radeonsi use mostly GTT instead of mostly
>>> VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
>>> You can try it for yourself.
>>>
>> In addition to that, you still need the stolen VRAM on APUs for page
>> tables and DCE.
>>
>> So we need to keep the eviction from VRAM to GTT enabled, but what we
>> don't do is swapping them back in because Marek added the GTT flags on
>> APUs as extra domain to look into.
>>
> As long as there's a performance gap between VRAM and GTT, this means
> that performance of long-running apps (e.g. Xorg or the compositor) will
> degrade over time, or after e.g. a suspend-resume cycle.
>
> OTOH, if we can close the gap, we can stop trying to put most BOs in
> VRAM in the first place with APUs.
>

Yeah, John and I are already working on this (but mostly for GFX9).

The difference is that VRAM allocations are mostly contiguous, while GTT
allocations are scattered. So you get more TLB pressure with GTT.


Another aspect is that GART has smaller pages, so the translation cache has
to fetch more of the page directory and also the cache is finite, meaning
that it can be thrashed more easily with small pages.

Marek



Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Christian König

On 28.03.2017 10:35, Michel Dänzer wrote:

On 28/03/17 05:29 PM, Christian König wrote:

On 28.03.2017 08:00, Michel Dänzer wrote:

On 28/03/17 12:50 PM, zhoucm1 wrote:

On 2017-03-28 10:40, Michel Dänzer wrote:

On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:

For APU special case, can we prevent eviction happening between VRAM
<> GTT?

We can, if we can close the performance gap between VRAM and GTT. We
measured around 30% gap a while ago, though right now I'm only
measuring
~5%, but the test system has slower RAM now (still dual channel
though).

My impression is that VRAM and GTT don't have much difference in the APU
case; if I'm wrong, please correct me.

The Mesa patch below makes radeonsi use mostly GTT instead of mostly
VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
You can try it for yourself.

In addition to that, you still need the stolen VRAM on APUs for page
tables and DCE.

So we need to keep the eviction from VRAM to GTT enabled, but what we
don't do is swapping them back in because Marek added the GTT flags on
APUs as extra domain to look into.

As long as there's a performance gap between VRAM and GTT, this means
that performance of long-running apps (e.g. Xorg or the compositor) will
degrade over time, or after e.g. a suspend-resume cycle.

OTOH, if we can close the gap, we can stop trying to put most BOs in
VRAM in the first place with APUs.


Yeah, John and I are already working on this (but mostly for GFX9).

The difference is that VRAM allocations are mostly contiguous, while
GTT allocations are scattered. So you get more TLB pressure with GTT.


Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Michel Dänzer
On 28/03/17 05:29 PM, Christian König wrote:
> On 28.03.2017 08:00, Michel Dänzer wrote:
>> On 28/03/17 12:50 PM, zhoucm1 wrote:
>>> On 2017-03-28 10:40, Michel Dänzer wrote:
 On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:
> For APU special case, can we prevent eviction happening between VRAM
> <> GTT?
 We can, if we can close the performance gap between VRAM and GTT. We
 measured around 30% gap a while ago, though right now I'm only
 measuring
 ~5%, but the test system has slower RAM now (still dual channel
 though).
>>> My impression is that VRAM and GTT don't have much difference in the APU
>>> case; if I'm wrong, please correct me.
>> The Mesa patch below makes radeonsi use mostly GTT instead of mostly
>> VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
>> You can try it for yourself.
> 
> In addition to that, you still need the stolen VRAM on APUs for page
> tables and DCE.
> 
> So we need to keep the eviction from VRAM to GTT enabled, but what we
> don't do is swapping them back in because Marek added the GTT flags on
> APUs as extra domain to look into.

As long as there's a performance gap between VRAM and GTT, this means
that performance of long-running apps (e.g. Xorg or the compositor) will
degrade over time, or after e.g. a suspend-resume cycle.

OTOH, if we can close the gap, we can stop trying to put most BOs in
VRAM in the first place with APUs.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Christian König

On 28.03.2017 08:00, Michel Dänzer wrote:

On 28/03/17 12:50 PM, zhoucm1 wrote:

On 2017-03-28 10:40, Michel Dänzer wrote:

On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:

For APU special case, can we prevent eviction happening between VRAM
<> GTT?

We can, if we can close the performance gap between VRAM and GTT. We
measured around 30% gap a while ago, though right now I'm only measuring
~5%, but the test system has slower RAM now (still dual channel though).

My impression is that VRAM and GTT don't have much difference in the APU
case; if I'm wrong, please correct me.

The Mesa patch below makes radeonsi use mostly GTT instead of mostly
VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
You can try it for yourself.


In addition to that, you still need the stolen VRAM on APUs for page
tables and DCE.


So we need to keep the eviction from VRAM to GTT enabled, but what we 
don't do is swapping them back in because Marek added the GTT flags on 
APUs as extra domain to look into.


In other words, once BOs are removed from VRAM, they are only swapped back
in if the hardware needs it.


Regards,
Christian.




diff --git a/src/gallium/drivers/radeon/r600_buffer_common.c b/src/gallium/drivers/radeon/r600_buffer_common.c
index da6f0206d7..47661cab76 100644
--- a/src/gallium/drivers/radeon/r600_buffer_common.c
+++ b/src/gallium/drivers/radeon/r600_buffer_common.c
@@ -175,9 +175,11 @@ void r600_init_resource_fields(struct r600_common_screen *rscreen,
 	 * placements even with a low amount of stolen VRAM.
 	 */
 	if (!rscreen->info.has_dedicated_vram &&
-	    (rscreen->info.drm_major < 3 || rscreen->info.drm_minor < 6) &&
-	    res->domains == RADEON_DOMAIN_VRAM)
-		res->domains = RADEON_DOMAIN_VRAM_GTT;
+	    res->domains == RADEON_DOMAIN_VRAM &&
+	    !(res->b.b.bind & PIPE_BIND_SCANOUT)) {
+		res->domains = RADEON_DOMAIN_GTT;
+		res->flags |= RADEON_FLAG_GTT_WC;
+	}
 
 	if (rscreen->debug_flags & DBG_NO_WC)
 		res->flags &= ~RADEON_FLAG_GTT_WC;





___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-28 Thread Michel Dänzer
On 28/03/17 12:50 PM, zhoucm1 wrote:
> On 2017-03-28 10:40, Michel Dänzer wrote:
>> On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:
>>> For APU special case, can we prevent eviction happening between VRAM
>>> <> GTT?
>> We can, if we can close the performance gap between VRAM and GTT. We
>> measured around 30% gap a while ago, though right now I'm only measuring
>> ~5%, but the test system has slower RAM now (still dual channel though).
> My impression is that VRAM and GTT don't have much difference in the APU
> case; if I'm wrong, please correct me.

The Mesa patch below makes radeonsi use mostly GTT instead of mostly
VRAM, and slows down Unigine Valley by about 5% on my desktop Kaveri.
You can try it for yourself.


diff --git a/src/gallium/drivers/radeon/r600_buffer_common.c b/src/gallium/drivers/radeon/r600_buffer_common.c
index da6f0206d7..47661cab76 100644
--- a/src/gallium/drivers/radeon/r600_buffer_common.c
+++ b/src/gallium/drivers/radeon/r600_buffer_common.c
@@ -175,9 +175,11 @@ void r600_init_resource_fields(struct r600_common_screen *rscreen,
 	 * placements even with a low amount of stolen VRAM.
 	 */
 	if (!rscreen->info.has_dedicated_vram &&
-	    (rscreen->info.drm_major < 3 || rscreen->info.drm_minor < 6) &&
-	    res->domains == RADEON_DOMAIN_VRAM)
-		res->domains = RADEON_DOMAIN_VRAM_GTT;
+	    res->domains == RADEON_DOMAIN_VRAM &&
+	    !(res->b.b.bind & PIPE_BIND_SCANOUT)) {
+		res->domains = RADEON_DOMAIN_GTT;
+		res->flags |= RADEON_FLAG_GTT_WC;
+	}
 
 	if (rscreen->debug_flags & DBG_NO_WC)
 		res->flags &= ~RADEON_FLAG_GTT_WC;



-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread zhoucm1



On 2017-03-27 17:55, Christian König wrote:

On 27.03.2017 11:36, zhoucm1 wrote:



On 2017-03-27 17:29, Christian König wrote:
On APUs I've already enabled using direct access to the stolen parts 
of system memory.

Thanks, could you point me to where this is done?


See here gmc_v7_0_mc_init():

	/* Could aper size report 0 ? */
	adev->mc.aper_base = pci_resource_start(adev->pdev, 0);
	adev->mc.aper_size = pci_resource_len(adev->pdev, 0);
	/* size in MB on si */
	adev->mc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
	adev->mc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;

#ifdef CONFIG_X86_64
	if (adev->flags & AMD_IS_APU) {
		adev->mc.aper_base = ((u64)RREG32(mmMC_VM_FB_OFFSET)) << 22;
		adev->mc.aper_size = adev->mc.real_vram_size;
	}
#endif


We use the real physical address and size as aperture on APUs.

Is this setting only for invisible VRAM access?

But the real VRAM size still comes from CONFIG_MEMSIZE, which means the
VRAM size is still what is carved out by the BIOS, right?
If VRAM ends up being allocated by userspace, then TTM could still
trigger eviction, right?


Regards,
David Zhou


Similar code is in gmc_v8_0_mc_init().

Regards,
Christian.



Regards,
David Zhou


So there won't be any eviction any more because of page faults on APUs.

Regards,
Christian.

Am 27.03.2017 um 09:53 schrieb Zhou, David(ChunMing):
For APU special case, can we prevent eviction happening between 
VRAM <> GTT?


Regards,
David Zhou

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Michel Dänzer
Sent: Monday, March 27, 2017 3:36 PM
To: Marek Olšák <mar...@gmail.com>
Cc: amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
Subject: Re: Plan: BO move throttling for visible VRAM evictions

On 25/03/17 01:33 AM, Marek Olšák wrote:

Hi,

I'm sharing this idea here, because it's something that has been
decreasing our performance a lot recently, for example:
http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

I think the problem there is that Mesa git started uploading
descriptors and uniforms to VRAM, which helps when TC L2 has a low
hit/miss ratio, but the performance can randomly drop by an order of
magnitude. I've heard rumours that kernel 4.11 has an improved
allocator that should perform better, but the situation is still far
from ideal.

AMD CPUs and APUs will hopefully suffer less, because we can resize
the visible VRAM with the help of our CPU hw specs, but Intel CPUs
will remain limited to 256 MB. The following plan describes how to do
throttling for visible VRAM evictions.


1) Theory

Initially, the driver doesn't care about where buffers are in VRAM,
because VRAM buffers are only moved to visible VRAM on CPU page 
faults

(when the CPU touches the buffer memory but the memory is in the
invisible part of VRAM). When it happens,
amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
also marks the buffer as contiguous, which makes memory fragmentation
worse.

I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
was much higher in a CPU profiler than anything else in the kernel.


2) Monitoring via Gallium HUD

We need to expose 2 kernel counters via the INFO ioctl and display
those via Gallium HUD:
- The number of VRAM CPU page faults. (the number of calls to
amdgpu_bo_fault_reserve_notify).
- The number of bytes moved by ttm_bo_validate inside
amdgpu_bo_fault_reserve_notify.

This will help us observe what exactly is happening and fine-tune the
throttling when it's done.


3) Solution

a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
(amdgpu_bo::had_cpu_page_fault = true)

b) Monitor the MB/s rate at which buffers are moved by
amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
don't move the buffer to visible VRAM. Move it to GTT instead. Note
that moving to GTT can be cheaper, because moving to visible VRAM is
likely to evict a lot of buffers there and unmap them from the CPU,
FWIW, this can be avoided by only setting GTT in busy_placement. 
Then TTM will only move the BO to visible VRAM if that can be done 
without evicting anything from there.




but moving to GTT shouldn't evict or unmap anything.

c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
it can be moved to VRAM if:
- the GTT->VRAM move rate is low enough to allow it (this is the
existing throttling mechanism)
- the visible VRAM move rate is low enough that we will be OK with
another CPU page fault if it happens.

Some other ideas that might be worth trying:

Evicting BOs to GTT instead of moving them to CPU accessible VRAM 
in principle in some cases (e.g. for all BOs except those with

AMDGPU_GEM_CREATE_CPU_ACC

Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread zhoucm1



On 2017-03-28 10:40, Michel Dänzer wrote:

On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:

For the APU special case, can we prevent eviction between VRAM <-> GTT?

We can, if we can close the performance gap between VRAM and GTT. We
measured around 30% gap a while ago, though right now I'm only measuring
~5%, but the test system has slower RAM now (still dual channel though).
My impression is that VRAM and GTT make little difference in the APU
case; if I'm wrong, please correct me.


Thanks,
David Zhou





___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread Michel Dänzer
On 27/03/17 04:53 PM, Zhou, David(ChunMing) wrote:
> For the APU special case, can we prevent eviction between VRAM <->
> GTT?

We can, if we can close the performance gap between VRAM and GTT. We
measured around 30% gap a while ago, though right now I'm only measuring
~5%, but the test system has slower RAM now (still dual channel though).


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread Michel Dänzer
On 27/03/17 07:29 PM, Marek Olšák wrote:
> On Mar 27, 2017 9:35 AM, "Michel Dänzer" wrote:
> 
> On 25/03/17 01:33 AM, Marek Olšák wrote:
> > Hi,
> >
> > I'm sharing this idea here, because it's something that has been
> > decreasing our performance a lot recently, for example:
> >
> 
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
> 
> 
> >
> > I think the problem there is that Mesa git started uploading
> > descriptors and uniforms to VRAM, which helps when TC L2 has a low
> > hit/miss ratio, but the performance can randomly drop by an order of
> > magnitude. I've heard rumours that kernel 4.11 has an improved
> > allocator that should perform better, but the situation is still far
> > from ideal.
> >
> > AMD CPUs and APUs will hopefully suffer less, because we can resize
> > the visible VRAM with the help of our CPU hw specs, but Intel CPUs
> > will remain limited to 256 MB. The following plan describes how to do
> > throttling for visible VRAM evictions.
> >
> >
> > 1) Theory
> >
> > Initially, the driver doesn't care about where buffers are in VRAM,
> > because VRAM buffers are only moved to visible VRAM on CPU page faults
> > (when the CPU touches the buffer memory but the memory is in the
> > invisible part of VRAM). When it happens,
> > amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
> > visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
> > also marks the buffer as contiguous, which makes memory fragmentation
> > worse.
> >
> > I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
> > was much higher in a CPU profiler than anything else in the kernel.
> >
> >
> > 2) Monitoring via Gallium HUD
> >
> > We need to expose 2 kernel counters via the INFO ioctl and display
> > those via Gallium HUD:
> > - The number of VRAM CPU page faults. (the number of calls to
> > amdgpu_bo_fault_reserve_notify).
> > - The number of bytes moved by ttm_bo_validate inside
> > amdgpu_bo_fault_reserve_notify.
> >
> > This will help us observe what exactly is happening and fine-tune the
> > throttling when it's done.
> >
> >
> > 3) Solution
> >
> > a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
> > (amdgpu_bo::had_cpu_page_fault = true)
> >
> > b) Monitor the MB/s rate at which buffers are moved by
> > amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
> > don't move the buffer to visible VRAM. Move it to GTT instead. Note
> > that moving to GTT can be cheaper, because moving to visible VRAM is
> > likely to evict a lot of buffers there and unmap them from the CPU,
> 
> FWIW, this can be avoided by only setting GTT in busy_placement. Then
> TTM will only move the BO to visible VRAM if that can be done without
> evicting anything from there.
> 
> 
> > but moving to GTT shouldn't evict or unmap anything.
> >
> > c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
> > it can be moved to VRAM if:
> > - the GTT->VRAM move rate is low enough to allow it (this is the
> > existing throttling mechanism)
> > - the visible VRAM move rate is low enough that we will be OK with
> > another CPU page fault if it happens.
> 
> Some other ideas that might be worth trying:
> 
> Evicting BOs to GTT instead of moving them to CPU accessible VRAM in
> principle in some cases (e.g. for all BOs except those with
> AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.
> 
> 
> I've tried this and it made things even worse.

What exactly did you try?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread Marek Olšák
On Mar 27, 2017 9:35 AM, "Michel Dänzer"  wrote:

On 25/03/17 01:33 AM, Marek Olšák wrote:
> Hi,
>
> I'm sharing this idea here, because it's something that has been
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>
> I think the problem there is that Mesa git started uploading
> descriptors and uniforms to VRAM, which helps when TC L2 has a low
> hit/miss ratio, but the performance can randomly drop by an order of
> magnitude. I've heard rumours that kernel 4.11 has an improved
> allocator that should perform better, but the situation is still far
> from ideal.
>
> AMD CPUs and APUs will hopefully suffer less, because we can resize
> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
> will remain limited to 256 MB. The following plan describes how to do
> throttling for visible VRAM evictions.
>
>
> 1) Theory
>
> Initially, the driver doesn't care about where buffers are in VRAM,
> because VRAM buffers are only moved to visible VRAM on CPU page faults
> (when the CPU touches the buffer memory but the memory is in the
> invisible part of VRAM). When it happens,
> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
> also marks the buffer as contiguous, which makes memory fragmentation
> worse.
>
> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
> was much higher in a CPU profiler than anything else in the kernel.
>
>
> 2) Monitoring via Gallium HUD
>
> We need to expose 2 kernel counters via the INFO ioctl and display
> those via Gallium HUD:
> - The number of VRAM CPU page faults. (the number of calls to
> amdgpu_bo_fault_reserve_notify).
> - The number of bytes moved by ttm_bo_validate inside
> amdgpu_bo_fault_reserve_notify.
>
> This will help us observe what exactly is happening and fine-tune the
> throttling when it's done.
>
>
> 3) Solution
>
> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
> (amdgpu_bo::had_cpu_page_fault = true)
>
> b) Monitor the MB/s rate at which buffers are moved by
> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
> don't move the buffer to visible VRAM. Move it to GTT instead. Note
> that moving to GTT can be cheaper, because moving to visible VRAM is
> likely to evict a lot of buffers there and unmap them from the CPU,

FWIW, this can be avoided by only setting GTT in busy_placement. Then
TTM will only move the BO to visible VRAM if that can be done without
evicting anything from there.


> but moving to GTT shouldn't evict or unmap anything.
>
> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
> it can be moved to VRAM if:
> - the GTT->VRAM move rate is low enough to allow it (this is the
> existing throttling mechanism)
> - the visible VRAM move rate is low enough that we will be OK with
> another CPU page fault if it happens.

Some other ideas that might be worth trying:

Evicting BOs to GTT instead of moving them to CPU accessible VRAM in
principle in some cases (e.g. for all BOs except those with
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.


I've tried this and it made things even worse.

Marek


Implementing eviction from CPU visible to CPU invisible VRAM, similar to
how it's done in radeon. Note that there's potential for userspace
triggering an infinite loop in the kernel in cases where BOs are moved
back from invisible to visible VRAM on page faults.


--
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread Christian König

On 27.03.2017 11:36, zhoucm1 wrote:



On 2017-03-27 17:29, Christian König wrote:
On APUs I've already enabled using direct access to the stolen parts 
of system memory.

Thanks, could you point me to where this is done?


See here gmc_v7_0_mc_init():

	/* Could aper size report 0 ? */
	adev->mc.aper_base = pci_resource_start(adev->pdev, 0);
	adev->mc.aper_size = pci_resource_len(adev->pdev, 0);
	/* size in MB on si */
	adev->mc.mc_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;
	adev->mc.real_vram_size = RREG32(mmCONFIG_MEMSIZE) * 1024ULL * 1024ULL;

#ifdef CONFIG_X86_64
	if (adev->flags & AMD_IS_APU) {
		adev->mc.aper_base = ((u64)RREG32(mmMC_VM_FB_OFFSET)) << 22;
		adev->mc.aper_size = adev->mc.real_vram_size;
	}
#endif


We use the real physical address and size as aperture on APUs.

Similar code is in gmc_v8_0_mc_init().

Regards,
Christian.



Regards,
David Zhou


So there won't be any eviction any more because of page faults on APUs.

Regards,
Christian.

On 27.03.2017 09:53, Zhou, David(ChunMing) wrote:
For the APU special case, can we prevent eviction between VRAM 
<-> GTT?


Regards,
David Zhou

-Original Message-
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On 
Behalf Of Michel Dänzer

Sent: Monday, March 27, 2017 3:36 PM
To: Marek Olšák <mar...@gmail.com>
Cc: amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
Subject: Re: Plan: BO move throttling for visible VRAM evictions

On 25/03/17 01:33 AM, Marek Olšák wrote:

Hi,

I'm sharing this idea here, because it's something that has been
decreasing our performance a lot recently, for example:
http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

I think the problem there is that Mesa git started uploading
descriptors and uniforms to VRAM, which helps when TC L2 has a low
hit/miss ratio, but the performance can randomly drop by an order of
magnitude. I've heard rumours that kernel 4.11 has an improved
allocator that should perform better, but the situation is still far
from ideal.

AMD CPUs and APUs will hopefully suffer less, because we can resize
the visible VRAM with the help of our CPU hw specs, but Intel CPUs
will remain limited to 256 MB. The following plan describes how to do
throttling for visible VRAM evictions.


1) Theory

Initially, the driver doesn't care about where buffers are in VRAM,
because VRAM buffers are only moved to visible VRAM on CPU page faults
(when the CPU touches the buffer memory but the memory is in the
invisible part of VRAM). When it happens,
amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
also marks the buffer as contiguous, which makes memory fragmentation
worse.

I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
was much higher in a CPU profiler than anything else in the kernel.


2) Monitoring via Gallium HUD

We need to expose 2 kernel counters via the INFO ioctl and display
those via Gallium HUD:
- The number of VRAM CPU page faults. (the number of calls to
amdgpu_bo_fault_reserve_notify).
- The number of bytes moved by ttm_bo_validate inside
amdgpu_bo_fault_reserve_notify.

This will help us observe what exactly is happening and fine-tune the
throttling when it's done.


3) Solution

a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
(amdgpu_bo::had_cpu_page_fault = true)

b) Monitor the MB/s rate at which buffers are moved by
amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
don't move the buffer to visible VRAM. Move it to GTT instead. Note
that moving to GTT can be cheaper, because moving to visible VRAM is
likely to evict a lot of buffers there and unmap them from the CPU,
FWIW, this can be avoided by only setting GTT in busy_placement. 
Then TTM will only move the BO to visible VRAM if that can be done 
without evicting anything from there.




but moving to GTT shouldn't evict or unmap anything.

c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
it can be moved to VRAM if:
- the GTT->VRAM move rate is low enough to allow it (this is the
existing throttling mechanism)
- the visible VRAM move rate is low enough that we will be OK with
another CPU page fault if it happens.

Some other ideas that might be worth trying:

Evicting BOs to GTT instead of moving them to CPU accessible VRAM in 
principle in some cases (e.g. for all BOs except those with

AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.

Implementing eviction from CPU visible to CPU invisible VRAM, 
similar to how it's done in radeon. Note that there's potential for 
userspace triggering an infinite loop in the kernel in cases where 
BOs are moved back from invisible to visible VRAM on page faults.







___

Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread zhoucm1



On 2017-03-27 17:29, Christian König wrote:
On APUs I've already enabled using direct access to the stolen parts 
of system memory.

Thanks, could you point me to where this is done?

Regards,
David Zhou


So there won't be any eviction any more because of page faults on APUs.

Regards,
Christian.

On 27.03.2017 09:53, Zhou, David(ChunMing) wrote:
For the APU special case, can we prevent eviction between VRAM 
<-> GTT?


Regards,
David Zhou

-Original Message-
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On 
Behalf Of Michel Dänzer

Sent: Monday, March 27, 2017 3:36 PM
To: Marek Olšák <mar...@gmail.com>
Cc: amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
Subject: Re: Plan: BO move throttling for visible VRAM evictions

On 25/03/17 01:33 AM, Marek Olšák wrote:

Hi,

I'm sharing this idea here, because it's something that has been
decreasing our performance a lot recently, for example:
http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

I think the problem there is that Mesa git started uploading
descriptors and uniforms to VRAM, which helps when TC L2 has a low
hit/miss ratio, but the performance can randomly drop by an order of
magnitude. I've heard rumours that kernel 4.11 has an improved
allocator that should perform better, but the situation is still far
from ideal.

AMD CPUs and APUs will hopefully suffer less, because we can resize
the visible VRAM with the help of our CPU hw specs, but Intel CPUs
will remain limited to 256 MB. The following plan describes how to do
throttling for visible VRAM evictions.


1) Theory

Initially, the driver doesn't care about where buffers are in VRAM,
because VRAM buffers are only moved to visible VRAM on CPU page faults
(when the CPU touches the buffer memory but the memory is in the
invisible part of VRAM). When it happens,
amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
also marks the buffer as contiguous, which makes memory fragmentation
worse.

I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
was much higher in a CPU profiler than anything else in the kernel.


2) Monitoring via Gallium HUD

We need to expose 2 kernel counters via the INFO ioctl and display
those via Gallium HUD:
- The number of VRAM CPU page faults. (the number of calls to
amdgpu_bo_fault_reserve_notify).
- The number of bytes moved by ttm_bo_validate inside
amdgpu_bo_fault_reserve_notify.

This will help us observe what exactly is happening and fine-tune the
throttling when it's done.


3) Solution

a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
(amdgpu_bo::had_cpu_page_fault = true)

b) Monitor the MB/s rate at which buffers are moved by
amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
don't move the buffer to visible VRAM. Move it to GTT instead. Note
that moving to GTT can be cheaper, because moving to visible VRAM is
likely to evict a lot of buffers there and unmap them from the CPU,
FWIW, this can be avoided by only setting GTT in busy_placement. Then 
TTM will only move the BO to visible VRAM if that can be done without 
evicting anything from there.




but moving to GTT shouldn't evict or unmap anything.

c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
it can be moved to VRAM if:
- the GTT->VRAM move rate is low enough to allow it (this is the
existing throttling mechanism)
- the visible VRAM move rate is low enough that we will be OK with
another CPU page fault if it happens.

Some other ideas that might be worth trying:

Evicting BOs to GTT instead of moving them to CPU accessible VRAM in 
principle in some cases (e.g. for all BOs except those with

AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.

Implementing eviction from CPU visible to CPU invisible VRAM, similar 
to how it's done in radeon. Note that there's potential for userspace 
triggering an infinite loop in the kernel in cases where BOs are 
moved back from invisible to visible VRAM on page faults.







___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread Zhou, David(ChunMing)
For the APU special case, can we prevent eviction between VRAM <-> GTT?

Regards,
David Zhou

-Original Message-
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of 
Michel Dänzer
Sent: Monday, March 27, 2017 3:36 PM
To: Marek Olšák <mar...@gmail.com>
Cc: amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
Subject: Re: Plan: BO move throttling for visible VRAM evictions

On 25/03/17 01:33 AM, Marek Olšák wrote:
> Hi,
> 
> I'm sharing this idea here, because it's something that has been 
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
> 
> I think the problem there is that Mesa git started uploading 
> descriptors and uniforms to VRAM, which helps when TC L2 has a low 
> hit/miss ratio, but the performance can randomly drop by an order of 
> magnitude. I've heard rumours that kernel 4.11 has an improved 
> allocator that should perform better, but the situation is still far 
> from ideal.
> 
> AMD CPUs and APUs will hopefully suffer less, because we can resize 
> the visible VRAM with the help of our CPU hw specs, but Intel CPUs 
> will remain limited to 256 MB. The following plan describes how to do 
> throttling for visible VRAM evictions.
> 
> 
> 1) Theory
> 
> Initially, the driver doesn't care about where buffers are in VRAM, 
> because VRAM buffers are only moved to visible VRAM on CPU page faults 
> (when the CPU touches the buffer memory but the memory is in the 
> invisible part of VRAM). When it happens, 
> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to 
> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify 
> also marks the buffer as contiguous, which makes memory fragmentation 
> worse.
> 
> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify 
> was much higher in a CPU profiler than anything else in the kernel.
> 
> 
> 2) Monitoring via Gallium HUD
> 
> We need to expose 2 kernel counters via the INFO ioctl and display 
> those via Gallium HUD:
> - The number of VRAM CPU page faults. (the number of calls to 
> amdgpu_bo_fault_reserve_notify).
> - The number of bytes moved by ttm_bo_validate inside 
> amdgpu_bo_fault_reserve_notify.
> 
> This will help us observe what exactly is happening and fine-tune the 
> throttling when it's done.
> 
> 
> 3) Solution
> 
> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
> (amdgpu_bo::had_cpu_page_fault = true)
> 
> b) Monitor the MB/s rate at which buffers are moved by 
> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold, 
> don't move the buffer to visible VRAM. Move it to GTT instead. Note 
> that moving to GTT can be cheaper, because moving to visible VRAM is 
> likely to evict a lot of buffers there and unmap them from the CPU,

FWIW, this can be avoided by only setting GTT in busy_placement. Then TTM will 
only move the BO to visible VRAM if that can be done without evicting anything 
from there.


> but moving to GTT shouldn't evict or unmap anything.
> 
> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault, 
> it can be moved to VRAM if:
> - the GTT->VRAM move rate is low enough to allow it (this is the 
> existing throttling mechanism)
> - the visible VRAM move rate is low enough that we will be OK with 
> another CPU page fault if it happens.

Some other ideas that might be worth trying:

Evicting BOs to GTT instead of moving them to CPU accessible VRAM in principle 
in some cases (e.g. for all BOs except those with
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.

Implementing eviction from CPU visible to CPU invisible VRAM, similar to how 
it's done in radeon. Note that there's potential for userspace triggering an 
infinite loop in the kernel in cases where BOs are moved back from invisible to 
visible VRAM on page faults.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-27 Thread Michel Dänzer
On 25/03/17 01:33 AM, Marek Olšák wrote:
> Hi,
> 
> I'm sharing this idea here, because it's something that has been
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
> 
> I think the problem there is that Mesa git started uploading
> descriptors and uniforms to VRAM, which helps when TC L2 has a low
> hit/miss ratio, but the performance can randomly drop by an order of
> magnitude. I've heard rumours that kernel 4.11 has an improved
> allocator that should perform better, but the situation is still far
> from ideal.
> 
> AMD CPUs and APUs will hopefully suffer less, because we can resize
> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
> will remain limited to 256 MB. The following plan describes how to do
> throttling for visible VRAM evictions.
> 
> 
> 1) Theory
> 
> Initially, the driver doesn't care about where buffers are in VRAM,
> because VRAM buffers are only moved to visible VRAM on CPU page faults
> (when the CPU touches the buffer memory but the memory is in the
> invisible part of VRAM). When it happens,
> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
> also marks the buffer as contiguous, which makes memory fragmentation
> worse.
> 
> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
> was much higher in a CPU profiler than anything else in the kernel.
> 
> 
> 2) Monitoring via Gallium HUD
> 
> We need to expose 2 kernel counters via the INFO ioctl and display
> those via Gallium HUD:
> - The number of VRAM CPU page faults. (the number of calls to
> amdgpu_bo_fault_reserve_notify).
> - The number of bytes moved by ttm_bo_validate inside
> amdgpu_bo_fault_reserve_notify.
> 
> This will help us observe what exactly is happening and fine-tune the
> throttling when it's done.
> 
> 
> 3) Solution
> 
> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
> (amdgpu_bo::had_cpu_page_fault = true)
> 
> b) Monitor the MB/s rate at which buffers are moved by
> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
> don't move the buffer to visible VRAM. Move it to GTT instead. Note
> that moving to GTT can be cheaper, because moving to visible VRAM is
> likely to evict a lot of buffers there and unmap them from the CPU,

FWIW, this can be avoided by only setting GTT in busy_placement. Then
TTM will only move the BO to visible VRAM if that can be done without
evicting anything from there.
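
For illustration, the fault handler could set the placements up roughly
like this (just a sketch against 4.11-era TTM; clamping lpfn to the
visible size is my assumption of how the preferred range would be
expressed):

	struct ttm_place places[2];
	struct ttm_placement placement;
	int r;

	/* Preferred placement: the CPU visible part of VRAM only. */
	places[0].fpfn = 0;
	places[0].lpfn = adev->mc.visible_vram_size >> PAGE_SHIFT;
	places[0].flags = TTM_PL_FLAG_WC | TTM_PL_FLAG_UNCACHED |
			  TTM_PL_FLAG_VRAM;

	/* Busy fallback: GTT, which shouldn't evict or unmap anything. */
	places[1].fpfn = 0;
	places[1].lpfn = 0;
	places[1].flags = TTM_PL_FLAG_CACHED | TTM_PL_FLAG_TT;

	placement.num_placement = 1;
	placement.placement = &places[0];
	/* Only GTT in busy_placement: TTM moves the BO to visible VRAM
	 * only when that succeeds without evicting anything there. */
	placement.num_busy_placement = 1;
	placement.busy_placement = &places[1];

	r = ttm_bo_validate(&abo->tbo, &placement, false, false);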


> but moving to GTT shouldn't evict or unmap anything.
> 
> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
> it can be moved to VRAM if:
> - the GTT->VRAM move rate is low enough to allow it (this is the
> existing throttling mechanism)
> - the visible VRAM move rate is low enough that we will be OK with
> another CPU page fault if it happens.

Some other ideas that might be worth trying:

Evicting BOs to GTT instead of moving them to CPU accessible VRAM in
principle in some cases (e.g. for all BOs except those with
AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.

Implementing eviction from CPU visible to CPU invisible VRAM, similar to
how it's done in radeon. Note that there's potential for userspace
triggering an infinite loop in the kernel in cases where BOs are moved
back from invisible to visible VRAM on page faults.
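
One cheap guard against that loop might be to remember recent faults,
along the lines of the had_cpu_page_fault flag proposed above (sketch
only; last_cpu_fault_jiffies is a made-up field):

	/* Eviction path: don't push a BO from visible to invisible VRAM
	 * if the CPU faulted on it within the last second, otherwise it
	 * would likely bounce right back on the next CPU access. */
	if (abo->had_cpu_page_fault &&
	    time_before(jiffies, abo->last_cpu_fault_jiffies + HZ))
		return false;	/* keep it in CPU visible VRAM */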


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-24 Thread Samuel Pitoiset



On 03/24/2017 05:33 PM, Marek Olšák wrote:

Hi,

I'm sharing this idea here, because it's something that has been
decreasing our performance a lot recently, for example:
http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

I think the problem there is that Mesa git started uploading
descriptors and uniforms to VRAM, which helps when TC L2 has a low
hit/miss ratio, but the performance can randomly drop by an order of
magnitude. I've heard rumours that kernel 4.11 has an improved
allocator that should perform better, but the situation is still far
from ideal.


I have just tried 4.11-rc3 from Torvalds's GitHub; nothing changed with 
Civ6, still 23 FPS.




AMD CPUs and APUs will hopefully suffer less, because we can resize
the visible VRAM with the help of our CPU hw specs, but Intel CPUs
will remain limited to 256 MB. The following plan describes how to do
throttling for visible VRAM evictions.


1) Theory

Initially, the driver doesn't care about where buffers are in VRAM,
because VRAM buffers are only moved to visible VRAM on CPU page faults
(when the CPU touches the buffer memory but the memory is in the
invisible part of VRAM). When it happens,
amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
also marks the buffer as contiguous, which makes memory fragmentation
worse.

I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
was much higher in a CPU profiler than anything else in the kernel.


Looks like I see similar things with Civ6.




2) Monitoring via Gallium HUD

We need to expose 2 kernel counters via the INFO ioctl and display
those via Gallium HUD:
- The number of VRAM CPU page faults. (the number of calls to
amdgpu_bo_fault_reserve_notify).
- The number of bytes moved by ttm_bo_validate inside
amdgpu_bo_fault_reserve_notify.

This will help us observe what exactly is happening and fine-tune the
throttling when it's done.



Should really be useful.
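
FWIW, the INFO ioctl already exposes the total number of bytes moved by
TTM; until the new counters land, a trivial reader along these lines
(assuming current libdrm headers) can be used to watch it:

/* build: cc query.c $(pkg-config --cflags --libs libdrm) */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

int main(void)
{
	/* First render node; adjust for multi-GPU systems. */
	int fd = open("/dev/dri/renderD128", O_RDWR);
	struct drm_amdgpu_info request;
	uint64_t bytes_moved = 0;

	if (fd < 0)
		return 1;

	memset(&request, 0, sizeof(request));
	request.return_pointer = (uintptr_t)&bytes_moved;
	request.return_size = sizeof(bytes_moved);
	request.query = AMDGPU_INFO_NUM_BYTES_MOVED;

	/* Same path libdrm's amdgpu_query_info() takes internally. */
	if (drmCommandWrite(fd, DRM_AMDGPU_INFO, &request, sizeof(request)))
		return 1;

	printf("TTM bytes moved: %" PRIu64 "\n", bytes_moved);
	return 0;
}

The proposed page fault counter could then be polled the same way once
it exists.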

Samuel.



3) Solution

a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
(amdgpu_bo::had_cpu_page_fault = true)

b) Monitor the MB/s rate at which buffers are moved by
amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
don't move the buffer to visible VRAM. Move it to GTT instead. Note
that moving to GTT can be cheaper, because moving to visible VRAM is
likely to evict a lot of buffers there and unmap them from the CPU,
but moving to GTT shouldn't evict or unmap anything.

c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
it can be moved to VRAM if:
- the GTT->VRAM move rate is low enough to allow it (this is the
existing throttling mechanism)
- the visible VRAM move rate is low enough that we will be OK with
another CPU page fault if it happens.

d) The solution can be fine-tuned with the help of Gallium HUD to get
the best performance under various scenarios. The current throttling
mechanism can serve as an inspiration.


That's it. Feel free to comment. I think this is our biggest
performance bottleneck at the moment.

Marek


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: Plan: BO move throttling for visible VRAM evictions

2017-03-24 Thread Deucher, Alexander
> -Original Message-
> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
> Of Marek Olšák
> Sent: Friday, March 24, 2017 12:34 PM
> To: amd-gfx mailing list
> Subject: Plan: BO move throttling for visible VRAM evictions
> 
> Hi,
> 
> I'm sharing this idea here, because it's something that has been
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
> 
> I think the problem there is that Mesa git started uploading
> descriptors and uniforms to VRAM, which helps when TC L2 has a low
> hit/miss ratio, but the performance can randomly drop by an order of
> magnitude. I've heard rumours that kernel 4.11 has an improved
> allocator that should perform better, but the situation is still far
> from ideal.
> 
> AMD CPUs and APUs will hopefully suffer less, because we can resize
> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
> will remain limited to 256 MB. The following plan describes how to do
> throttling for visible VRAM evictions.

Has anyone checked the Intel chipset docs?  Maybe they document the interface?  
There's also ACPI _SRS which should be the vendor independent way to handle 
this.

Alex

> 
> 
> 1) Theory
> 
> Initially, the driver doesn't care about where buffers are in VRAM,
> because VRAM buffers are only moved to visible VRAM on CPU page faults
> (when the CPU touches the buffer memory but the memory is in the
> invisible part of VRAM). When it happens,
> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
> also marks the buffer as contiguous, which makes memory fragmentation
> worse.
> 
> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
> was much higher in a CPU profiler than anything else in the kernel.
> 
> 
> 2) Monitoring via Gallium HUD
> 
> We need to expose 2 kernel counters via the INFO ioctl and display
> those via Gallium HUD:
> - The number of VRAM CPU page faults. (the number of calls to
> amdgpu_bo_fault_reserve_notify).
> - The number of bytes moved by ttm_bo_validate inside
> amdgpu_bo_fault_reserve_notify.
> 
> This will help us observe what exactly is happening and fine-tune the
> throttling when it's done.
> 
> 
> 3) Solution
> 
> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
> (amdgpu_bo::had_cpu_page_fault = true)
> 
> b) Monitor the MB/s rate at which buffers are moved by
> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
> don't move the buffer to visible VRAM. Move it to GTT instead. Note
> that moving to GTT can be cheaper, because moving to visible VRAM is
> likely to evict a lot of buffers there and unmap them from the CPU,
> but moving to GTT shouldn't evict or unmap anything.
> 
> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
> it can be moved to VRAM if:
> - the GTT->VRAM move rate is low enough to allow it (this is the
> existing throttling mechanism)
> - the visible VRAM move rate is low enough that we will be OK with
> another CPU page fault if it happens.
> 
> d) The solution can be fine-tuned with the help of Gallium HUD to get
> the best performance under various scenarios. The current throttling
> mechanism can serve as an inspiration.
> 
> 
> That's it. Feel free to comment. I think this is our biggest
> performance bottleneck at the moment.
> 
> Marek
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-24 Thread Marek Olšák
On Fri, Mar 24, 2017 at 5:45 PM, Christian König
 wrote:
> On 24.03.2017 17:33, Marek Olšák wrote:
>>
>> Hi,
>>
>> I'm sharing this idea here, because it's something that has been
>> decreasing our performance a lot recently, for example:
>>
>> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>>
>> I think the problem there is that Mesa git started uploading
>> descriptors and uniforms to VRAM, which helps when TC L2 has a low
>> hit/miss ratio, but the performance can randomly drop by an order of
>> magnitude. I've heard rumours that kernel 4.11 has an improved
>> allocator that should perform better, but the situation is still far
>> from ideal.
>>
>> AMD CPUs and APUs will hopefully suffer less, because we can resize
>> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
>> will remain limited to 256 MB. The following plan describes how to do
>> throttling for visible VRAM evictions.
>>
>>
>> 1) Theory
>>
>> Initially, the driver doesn't care about where buffers are in VRAM,
>> because VRAM buffers are only moved to visible VRAM on CPU page faults
>> (when the CPU touches the buffer memory but the memory is in the
>> invisible part of VRAM). When it happens,
>> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
>> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
>> also marks the buffer as contiguous, which makes memory fragmentation
>> worse.
>>
>> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
>> was much higher in a CPU profiler than anything else in the kernel.
>
>
> Good to know that my expectations on this are correct.
>
> How about fixing the need for contiguous buffers when CPU mapping them?
>
> That should actually be pretty easy to do.
>
>> 2) Monitoring via Gallium HUD
>>
>> We need to expose 2 kernel counters via the INFO ioctl and display
>> those via Gallium HUD:
>> - The number of VRAM CPU page faults. (the number of calls to
>> amdgpu_bo_fault_reserve_notify).
>> - The number of bytes moved by ttm_bo_validate inside
>> amdgpu_bo_fault_reserve_notify.
>>
>> This will help us observe what exactly is happening and fine-tune the
>> throttling when it's done.
>>
>>
>> 3) Solution
>>
>> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
>> (amdgpu_bo::had_cpu_page_fault = true)
>
>
> What is that good for?
>
>> b) Monitor the MB/s rate at which buffers are moved by
>> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
>> don't move the buffer to visible VRAM. Move it to GTT instead. Note
>> that moving to GTT can be cheaper, because moving to visible VRAM is
>> likely to evict a lot of buffers there and unmap them from the CPU,
>> but moving to GTT shouldn't evict or unmap anything.
>
>
> Yeah, had that idea as well. I've been working on adding a context to TTMs
> BO validation call chain.
>
> This way we could add a byte limit on how much TTM will try to evict before
> returning -ENOMEM (or better ENOSPC).
>
>> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
>> it can be moved to VRAM if:
>> - the GTT->VRAM move rate is low enough to allow it (this is the
>> existing throttling mechanism)
>> - the visible VRAM move rate is low enough that we will be OK with
>> another CPU page fault if it happens.
>
>
> Interesting idea, need to think a bit about it.
>
> But I would say this has second priority, fixing the contiguous buffer
> requirement should be first. Going to work on that next.

Interesting. I didn't know the contiguous setting wasn't required.

Marek
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: Plan: BO move throttling for visible VRAM evictions

2017-03-24 Thread Christian König

On 24.03.2017 17:33, Marek Olšák wrote:

Hi,

I'm sharing this idea here, because it's something that has been
decreasing our performance a lot recently, for example:
http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa

I think the problem there is that Mesa git started uploading
descriptors and uniforms to VRAM, which helps when TC L2 has a low
hit/miss ratio, but the performance can randomly drop by an order of
magnitude. I've heard rumours that kernel 4.11 has an improved
allocator that should perform better, but the situation is still far
from ideal.

AMD CPUs and APUs will hopefully suffer less, because we can resize
the visible VRAM with the help of our CPU hw specs, but Intel CPUs
will remain limited to 256 MB. The following plan describes how to do
throttling for visible VRAM evictions.


1) Theory

Initially, the driver doesn't care about where buffers are in VRAM,
because VRAM buffers are only moved to visible VRAM on CPU page faults
(when the CPU touches the buffer memory but the memory is in the
invisible part of VRAM). When it happens,
amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
also marks the buffer as contiguous, which makes memory fragmentation
worse.

I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
was much higher in a CPU profiler than anything else in the kernel.


Good to know that my expectations on this are correct.

How about fixing the need for contiguous buffers when CPU mapping them?

That should actually be pretty easy to do.


2) Monitoring via Gallium HUD

We need to expose 2 kernel counters via the INFO ioctl and display
those via Gallium HUD:
- The number of VRAM CPU page faults. (the number of calls to
amdgpu_bo_fault_reserve_notify).
- The number of bytes moved by ttm_bo_validate inside
amdgpu_bo_fault_reserve_notify.

This will help us observe what exactly is happening and fine-tune the
throttling when it's done.


3) Solution

a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
(amdgpu_bo::had_cpu_page_fault = true)


What is that good for?


b) Monitor the MB/s rate at which buffers are moved by
amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
don't move the buffer to visible VRAM. Move it to GTT instead. Note
that moving to GTT can be cheaper, because moving to visible VRAM is
likely to evict a lot of buffers there and unmap them from the CPU,
but moving to GTT shouldn't evict or unmap anything.


Yeah, had that idea as well. I've been working on adding a context to 
TTMs BO validation call chain.


This way we could add a byte limit on how much TTM will try to evict 
before returning -ENOMEM (or better ENOSPC).
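
Very roughly, and purely as a sketch since none of these names exist
yet:

	/* Hypothetical per-operation context threaded through
	 * ttm_bo_validate() and the eviction helpers. */
	struct ttm_operation_ctx {
		uint64_t bytes_moved;       /* moved so far for this request */
		uint64_t bytes_moved_limit; /* budget before giving up */
	};

	/* In the eviction loop: stop once the budget is exhausted. */
	if (ctx->bytes_moved + (uint64_t)bo->num_pages * PAGE_SIZE >
	    ctx->bytes_moved_limit)
		return -ENOSPC;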



c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
it can be moved to VRAM if:
- the GTT->VRAM move rate is low enough to allow it (this is the
existing throttling mechanism)
- the visible VRAM move rate is low enough that we will be OK with
another CPU page fault if it happens.


Interesting idea, need to think a bit about it.
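
If I understand it right, the CS side would boil down to something like
this (the rate trackers are made-up names for the sketch):

	/* CS ioctl: only promote a previously faulted BO back to VRAM
	 * while both move rates are below their budgets. */
	if (abo->had_cpu_page_fault &&
	    gtt_vram_move_rate_mbps() < gtt_vram_threshold &&
	    vis_vram_move_rate_mbps() < vis_vram_threshold) {
		amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_VRAM);
		abo->had_cpu_page_fault = false;
	}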

But I would say this has second priority, fixing the contiguous buffer 
requirement should be first. Going to work on that next.
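
The direction I have in mind, sketched rather than the actual patch:
restrict the VRAM placements to the CPU visible range on a fault
instead of requesting a contiguous buffer, so the allocator stays free
to split the BO:

	unsigned i;

	for (i = 0; i < abo->placement.num_placement; i++) {
		/* Clamp to CPU visible VRAM, but don't mark contiguous. */
		if (abo->placements[i].flags & TTM_PL_FLAG_VRAM)
			abo->placements[i].lpfn =
				adev->mc.visible_vram_size >> PAGE_SHIFT;
	}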



d) The solution can be fine-tuned with the help of Gallium HUD to get
the best performance under various scenarios. The current throttling
mechanism can serve as an inspiration.


That's it. Feel free to comment. I think this is our biggest
performance bottleneck at the moment.


Yeah, completely agree.

Regards,
Christian.



Marek



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx