Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-11-12 Thread Michel Dänzer
On Sam, 2012-11-10 at 16:52 +0100, Marek Olšák wrote: 
 On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse j.gli...@gmail.com wrote:
  [snip: patch review and TTM eviction experiment results]
 
 I've come up with the following solution, which I think would help
 improve the situation a lot.
 
 We should prepare a list of command streams and one list of
 relocations for an entire frame, do buffer validation/placements for
 the entire frame at the beginning and then just render the whole frame
 (schedule all the command streams at once). That would minimize the
 buffer evictions and give us the ideal buffer placements for the whole
 frame and then the GPU would run the commands uninterrupted by other
 processes (and we don't have to flush caches so much).
 
 The only downsides are:
 - Buffers would be marked as busy for the entire frame, because the
 fence would only be at the end of the frame. We definitely need more
 fine-grained distribution of fences for apps which map buffers during
 rendering. One possible solution is to let userspace emit fences by
 itself and associate the fences with the buffers in the relocation
 list. The bo-wait mechanism would then use the fence from the (buffer,
 fence) pair, while TTM would use the end-of-frame fence (we can't
 trust the userspace giving us the right fences).
 - We should find out how to offload flushing and SwapBuffers to
 another thread, because the final CS ioctl will be really big.
 Currently, the radeon winsys doesn't offload the CS ioctl if it's in
 the SwapBuffers call.

- Deferring to a single big flush like that might introduce additional
latency before the GPU starts processing a frame and hurt some apps.


 Possible improvement:
 - The userspace should emit commands into a GPU buffer and not in the
 user memory, so that we don't have to do copy_from_user in the kernel.
 I expect the CS ioctl to unmap the GPU buffer and forbid later mapping
 as well as putting the buffer in the relocation list.

Unmapping etc. shouldn't be necessary in the long run with GPUVM.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-11-12 Thread Marek Olšák
On Mon, Nov 12, 2012 at 12:23 PM, Christian König
deathsim...@vodafone.de wrote:
 On 12.11.2012 11:08, Michel Dänzer wrote:

 [snip]

 - Deferring to a single big flush like that might introduce additional
 latency before the GPU starts processing a frame and hurt some apps.


 Instead of fencing the buffers in userspace how about something like this
 for the kernel CS interface:

 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_RELOCS
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_RELOCS
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_IB
 RADEON_CHUNK_ID_RELOCS
 RADEON_CHUNK_ID_FLAGS

 Fences are only emitted at RADEON_CHUNK_ID_RELOCS borders, but the whole CS
 call is submitted as one single chunk of work and so all BOs get reserved
 and placed at once. That of course doesn't help with the higher latency
 before starting a frame, but I don't think it would be such a big problem.

The latency can add input lag, which can negatively impact the gaming
experience, especially in first-person shooters.

In the long run, I think radeon/TTM should plan buffer moves across
several frames, i.e. take an incremental approach that eventually
converges to the ideal state: after some 10-20 CS ioctls, the process
generating the highest GPU load should get much better buffer
placements than idling processes, with a guarantee that its buffers
won't be evicted anytime soon.

Also I think the domains in the relocation list should be ignored
completely and only the initial domain from GEM_CREATE should be taken
into account, because that's the domain the gallium driver wanted in
the first place.

Marek


Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-11-10 Thread Marek Olšák
On Fri, Nov 9, 2012 at 9:44 PM, Jerome Glisse j.gli...@gmail.com wrote:
 [snip]
 Btw, as a follow-up on this, I did some experiments with TTM and eviction.
 Blocking any VRAM eviction improves average fps (20-30%) and minimum fps
 (40-60%), but it diminishes maximum fps (100%). Overall, blocking eviction
 just makes the framerate more consistent.

 I then tried several heuristics in the eviction process (not evicting a
 buffer if it was used in the last 1 ms, 10 ms, 20 ms, ...; sorting the LRU
 differently between buffers used for rendering and auxiliary buffers used
 by the kernel; ...). None of those heuristics improved anything. I also
 removed the bo-wait in the eviction pipeline, but still no improvement. I
 haven't had time to look further, but the bottom line is that some
 benchmarks are memory-tight and constant eviction hurts.

 (I used Unigine Heaven and Reaction Quake for the benchmarks.)

I've come up with the following solution, which I think would help
improve the situation a lot.

We should prepare a list of command streams and one list of
relocations for an entire frame, do buffer validation/placements for
the entire frame at the beginning and then just render the whole frame
(schedule all the command streams at once). That would minimize the
buffer evictions and give us the ideal buffer placements for the whole
frame and then the GPU would run the commands uninterrupted by other
processes (and we don't have to flush caches so much).

The only downsides are:
- Buffers would be marked as busy for the entire frame, because the
fence would only be at the end of the frame. We definitely need more
fine-grained distribution of fences for apps which map buffers during
rendering. One possible solution is to let userspace emit fences by
itself and associate the fences with the buffers in the relocation
list. The bo-wait mechanism would then use the fence from the (buffer,
fence) pair, while TTM would use the end-of-frame fence (we can't
trust userspace to give us the right fences).
- We should find out how to offload flushing and SwapBuffers to
another thread, because the final CS ioctl will be really big.
Currently, the radeon winsys doesn't offload the CS ioctl if it's in
the SwapBuffers call.

Possible improvement:
- The userspace should emit commands into a GPU buffer and not in the
user memory, so that we don't have to do copy_from_user in the kernel.
I expect the CS ioctl to unmap the GPU buffer and forbid later mapping
as well as putting the buffer in the relocation list.

Marek


Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-11-10 Thread Alex Deucher
On Sat, Nov 10, 2012 at 10:52 AM, Marek Olšák mar...@gmail.com wrote:
  [snip]

  I've come up with the following solution, which I think would help
  improve the situation a lot.

 We should prepare a list of command streams and one list of
 relocations for an entire frame, do buffer validation/placements for
 the entire frame at the beginning and then just render the whole frame
 (schedule all the command streams at once). That would minimize the
 buffer evictions and give us the ideal buffer placements for the whole
 frame and then the GPU would run the commands uninterrupted by other
 processes (and we don't have to flush caches so much).


Another possibility would be to allocate a small number of very large
buffers and then sub-allocate from them in the 3D driver.  That should
alleviate some of the overhead of dealing with lots of small buffers in
TTM and also reduce fragmentation.

Alex

 [snip]


Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-11-10 Thread Dave Airlie
On Sun, Nov 11, 2012 at 1:52 AM, Marek Olšák mar...@gmail.com wrote:
  [snip]

  I've come up with the following solution, which I think would help
  improve the situation a lot.

 We should prepare a list of command streams and one list of
 relocations for an entire frame, do buffer validation/placements for
 the entire frame at the beginning and then just render the whole frame
 (schedule all the command streams at once). That would minimize the
 buffer evictions and give us the ideal buffer placements for the whole
 frame and then the GPU would run the commands uninterrupted by other
 processes (and we don't have to flush caches so much).

I actually did something a bit similar once, but it didn't prove
useful at the time:

http://cgit.freedesktop.org/~airlied/linux/log/?h=radeon-cs-setup

It meant one ioctl per frame, and it avoided flushes between ioctls.
At the time I only tested it with OA and it made no difference, so I
didn't pursue it further. There might be something in there worth
picking over, though the code has moved a fair bit since then.

My only worry is fragmentation; I expect we need an evict-all
validation path in place (do we have this already? I'm not sure).

 - Buffers would be marked as busy for the entire frame, because the
 fence would only be at the end of the frame. We definitely need more
 fine-grained distribution of fences for apps which map buffers during
 rendering. One possible solution is to let userspace emit fences by
 itself and associate the fences with the buffers in the relocation
 list. The bo-wait mechanism would then use the fence from the (buffer,
 fence) pair, while TTM would use the end-of-frame fence (we can't
 trust the userspace giving us the right fences).
 - We should find out how to offload flushing and SwapBuffers to
 another thread, because the final CS ioctl will be really big.
 Currently, the radeon winsys doesn't offload the CS ioctl if it's in
 the SwapBuffers call.

The Intel guys were looking at some of this as well; interactions with
GLX and threads are very messy.

 Possible improvement:
 - The userspace should emit commands into a GPU buffer and not in the
 user memory, so that we don't have to do copy_from_user in the kernel.
 I expect the CS ioctl to unmap the GPU buffer and forbid later mapping
 as well as putting the buffer in the relocation list.

I've gone back and forth on this a few times. You'd be surprised how
much overhead there is in mapping/unmapping, and AGP would also need
to keep the older paths to avoid accessing uncached memory.

Dave.


Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-11-09 Thread Jerome Glisse
On Thu, Nov 01, 2012 at 03:13:31AM +0100, Marek Olšák wrote:
 On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher alexdeuc...@gmail.com wrote:
  On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák mar...@gmail.com wrote:
  The problem was we set VRAM|GTT for relocations of STATIC resources.
  Setting just VRAM increases the framerate 4 times on my machine.
 
  I rewrote the switch statement and adjusted the domains for window
  framebuffers too.
 
  Reviewed-by: Alex Deucher alexander.deuc...@amd.com
 
  Stable branches?
 
 Yes, good idea.
 
 Marek

Btw, as a follow-up on this, I did some experiments with TTM and eviction.
Blocking any VRAM eviction improves average fps (20-30%) and minimum fps
(40-60%), but it diminishes maximum fps (100%). Overall, blocking eviction
just makes the framerate more consistent.

I then tried several heuristics in the eviction process (not evicting a
buffer if it was used in the last 1 ms, 10 ms, 20 ms, ...; sorting the LRU
differently between buffers used for rendering and auxiliary buffers used
by the kernel; ...). None of those heuristics improved anything. I also
removed the bo-wait in the eviction pipeline, but still no improvement. I
haven't had time to look further, but the bottom line is that some
benchmarks are memory-tight and constant eviction hurts.

(I used Unigine Heaven and Reaction Quake for the benchmarks.)

Cheers,
Jerome


Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-10-31 Thread Alex Deucher
On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák mar...@gmail.com wrote:
 The problem was we set VRAM|GTT for relocations of STATIC resources.
 Setting just VRAM increases the framerate 4 times on my machine.

 I rewrote the switch statement and adjusted the domains for window
 framebuffers too.

Reviewed-by: Alex Deucher alexander.deuc...@amd.com

Stable branches?


 ---
  src/gallium/drivers/r600/r600_buffer.c  | 42 ++++++++++++++++++++++--------------------
  src/gallium/drivers/r600/r600_texture.c |  3 ++-
  2 files changed, 24 insertions(+), 21 deletions(-)

 diff --git a/src/gallium/drivers/r600/r600_buffer.c b/src/gallium/drivers/r600/r600_buffer.c
 index f4566ee..116ab51 100644
 --- a/src/gallium/drivers/r600/r600_buffer.c
 +++ b/src/gallium/drivers/r600/r600_buffer.c
 @@ -206,29 +206,31 @@ bool r600_init_resource(struct r600_screen *rscreen,
  {
 	uint32_t initial_domain, domains;

 -	/* Staging resources particpate in transfers and blits only
 -	 * and are used for uploads and downloads from regular
 -	 * resources.  We generate them internally for some transfers.
 -	 */
 -	if (usage == PIPE_USAGE_STAGING) {
 +	switch(usage) {
 +	case PIPE_USAGE_STAGING:
 +		/* Staging resources participate in transfers, i.e. are used
 +		 * for uploads and downloads from regular resources.
 +		 * We generate them internally for some transfers.
 +		 */
 +		initial_domain = RADEON_DOMAIN_GTT;
 		domains = RADEON_DOMAIN_GTT;
 +		break;
 +	case PIPE_USAGE_DYNAMIC:
 +	case PIPE_USAGE_STREAM:
 +		/* Default to GTT, but allow the memory manager to move it to VRAM. */
 		initial_domain = RADEON_DOMAIN_GTT;
 -	} else {
 		domains = RADEON_DOMAIN_GTT | RADEON_DOMAIN_VRAM;
 -
 -		switch(usage) {
 -		case PIPE_USAGE_DYNAMIC:
 -		case PIPE_USAGE_STREAM:
 -		case PIPE_USAGE_STAGING:
 -			initial_domain = RADEON_DOMAIN_GTT;
 -			break;
 -		case PIPE_USAGE_DEFAULT:
 -		case PIPE_USAGE_STATIC:
 -		case PIPE_USAGE_IMMUTABLE:
 -		default:
 -			initial_domain = RADEON_DOMAIN_VRAM;
 -			break;
 -		}
 +		break;
 +	case PIPE_USAGE_DEFAULT:
 +	case PIPE_USAGE_STATIC:
 +	case PIPE_USAGE_IMMUTABLE:
 +	default:
 +		/* Don't list GTT here, because the memory manager would put some
 +		 * resources to GTT no matter what the initial domain is.
 +		 * Not listing GTT in the domains improves performance a lot. */
 +		initial_domain = RADEON_DOMAIN_VRAM;
 +		domains = RADEON_DOMAIN_VRAM;
 +		break;
 	}

 	res->buf = rscreen->ws->buffer_create(rscreen->ws, size, alignment, bind, initial_domain);
 diff --git a/src/gallium/drivers/r600/r600_texture.c b/src/gallium/drivers/r600/r600_texture.c
 index 785eeff..2df390d 100644
 --- a/src/gallium/drivers/r600/r600_texture.c
 +++ b/src/gallium/drivers/r600/r600_texture.c
 @@ -421,9 +421,10 @@ r600_texture_create_object(struct pipe_screen *screen,
 			return NULL;
 		}
 	} else if (buf) {
 +		/* This is usually the window framebuffer. We want it in VRAM, always. */
 		resource->buf = buf;
 		resource->cs_buf = rscreen->ws->buffer_get_cs_handle(buf);
 -		resource->domains = RADEON_DOMAIN_GTT | RADEON_DOMAIN_VRAM;
 +		resource->domains = RADEON_DOMAIN_VRAM;
 	}

 	if (rtex->cmask_size) {
 --
 1.7.9.5



Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-10-31 Thread Jerome Glisse
On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák mar...@gmail.com wrote:
 The problem was we set VRAM|GTT for relocations of STATIC resources.
 Setting just VRAM increases the framerate 4 times on my machine.

 I rewrote the switch statement and adjusted the domains for window
 framebuffers too.

Reviewed-by: Jerome Glisse jgli...@redhat.com

 [patch snipped; identical to the patch quoted in Alex Deucher's review above]



Re: [Mesa-dev] [PATCH] r600g: fix abysmal performance in Reaction Quake

2012-10-31 Thread Marek Olšák
On Thu, Nov 1, 2012 at 2:13 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 On Wed, Oct 31, 2012 at 8:05 PM, Marek Olšák mar...@gmail.com wrote:
 The problem was we set VRAM|GTT for relocations of STATIC resources.
 Setting just VRAM increases the framerate 4 times on my machine.

 I rewrote the switch statement and adjusted the domains for window
 framebuffers too.

 Reviewed-by: Alex Deucher alexander.deuc...@amd.com

 Stable branches?

Yes, good idea.

Marek