Re: [Mesa-dev] [PATCH 00/13] Use the constant engine in radeonsi

2016-04-15 Thread Marek Olšák
On Thu, Apr 14, 2016 at 8:29 PM, Nicolai Hähnle  wrote:
> On 13.04.2016 20:34, Bas Nieuwenhuizen wrote:
>>
>> This series implements updating descriptors using the constant engine.
>> This should result in a 0%-3% improvement for CPU bound applications,
>> as we only have to upload the change descriptors from the CPU.
>
>
> Thanks again for looking at this!
>
>> There are very slight performance advantages on the GPU too, as the
>> CE uploads the data directly to the L2 cache. This series reduces the
>> number of cycles waited on LGKM counters by about 1/6th across multiple
>> applications. However this translates in almost negligible performance
>> improvements.
>>
>> Only the amdgpu winsys supports this for now. I think the radeon
>> winsys can be extended to support it, but I don't have any radeon
>> hardware so it is probably better that someone else implements that.
>>
>> This series does not include support for partially updating
>> uniforms. My gallium interface proof of concept became kind of messy
>> and the constant engine patches have been requested for descriptors.
>
>
> Interesting coincidence that Jakob wrote an email about "Rework uniform
> storage" on the same day...
>
> Two related general observations on the series:
>
> 1. The CE also has a LOAD_CONST_RAM command. Instead of writing all
> descriptors to the CE IB at CS begin, perhaps you could save space and CPU
> time by using that for the first upload per CS? Though it also might make
> sense to have this depend on the dirty_mask population count.
>
> 2. It seems like it should be possible to get rid of the list_dirty flag
> now, but I actually don't think that's true, because AFAICS doing that on
> top of this series will force a re-upload of all descriptors on every new
> CS, which is clearly bad. Still, logically there really shouldn't be a
> list_dirty _and_ a dirty_mask.
>
> So how about a dirty_mask and a ce_ram_dirty flag? Do not set the dirty_mask
> on begin_new_cs, and use the dirty_mask to gate descriptor uploads. Set the
> ce_ram_dirty flag on begin_new_cs, and use it to either (a) override the
> dirty_mask as appropriate for the first descriptor upload in each CS or (b)
> emit a LOAD_CONST_RAM command.
>
> I'd appreciate if you could address these points - I don't feel too strongly
> about the LOAD_CONST_RAM optimization, but having both list_dirty and
> dirty_mask is weird.

The AMDGPU kernel driver support the preamble CE IB, which is
conditionally executed before the CE IB. It was designed to skip CE
RAM initialization at the beginning of IBs. The usage should be:

The preamble CE IB should initialize CE RAM from memory (LOAD_CONST_RAM).
The main CE & DE IBs should assume CE RAM is initialized.

The kernel driver doesn't schedule the preamble CE IB if the command
submission immediately follows a previous submission from the same
context.

Marek
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 00/13] Use the constant engine in radeonsi

2016-04-14 Thread Nicolai Hähnle

On 13.04.2016 20:34, Bas Nieuwenhuizen wrote:

This series implements updating descriptors using the constant engine.
This should result in a 0%-3% improvement for CPU bound applications,
as we only have to upload the change descriptors from the CPU.


Thanks again for looking at this!


There are very slight performance advantages on the GPU too, as the
CE uploads the data directly to the L2 cache. This series reduces the
number of cycles waited on LGKM counters by about 1/6th across multiple
applications. However this translates in almost negligible performance
improvements.

Only the amdgpu winsys supports this for now. I think the radeon
winsys can be extended to support it, but I don't have any radeon
hardware so it is probably better that someone else implements that.

This series does not include support for partially updating
uniforms. My gallium interface proof of concept became kind of messy
and the constant engine patches have been requested for descriptors.


Interesting coincidence that Jakob wrote an email about "Rework uniform 
storage" on the same day...


Two related general observations on the series:

1. The CE also has a LOAD_CONST_RAM command. Instead of writing all 
descriptors to the CE IB at CS begin, perhaps you could save space and 
CPU time by using that for the first upload per CS? Though it also might 
make sense to have this depend on the dirty_mask population count.


2. It seems like it should be possible to get rid of the list_dirty flag 
now, but I actually don't think that's true, because AFAICS doing that 
on top of this series will force a re-upload of all descriptors on every 
new CS, which is clearly bad. Still, logically there really shouldn't be 
a list_dirty _and_ a dirty_mask.


So how about a dirty_mask and a ce_ram_dirty flag? Do not set the 
dirty_mask on begin_new_cs, and use the dirty_mask to gate descriptor 
uploads. Set the ce_ram_dirty flag on begin_new_cs, and use it to either 
(a) override the dirty_mask as appropriate for the first descriptor 
upload in each CS or (b) emit a LOAD_CONST_RAM command.


I'd appreciate if you could address these points - I don't feel too 
strongly about the LOAD_CONST_RAM optimization, but having both 
list_dirty and dirty_mask is weird.


Thanks,
Nicolai


Bas Nieuwenhuizen (10):
   winsys/amdgpu: Enlarge const IB size.
   radeonsi: Create CE IB.
   radeonsi: Add dirty_mask to descriptor list.
   radeonsi: Add CE packet definitions.
   radeonsi: Add CE synchronization.
   radeonsi: Allocate chunks of CE ram.
   radeonsi: Add CE uploader.
   radeonsi: Use CE for vertex buffers.
   gallium/util: Add u_bit_scan_consecutive_range64.
   radeonsi: Use CE for all descriptors.

Marek Olšák (3):
   gallium/radeon: move ring_type into winsyses
   winsys/amdgpu: split IB data into a new structure in preparation for
 CE
   winsys/amdgpu: add support for const IB

  src/gallium/auxiliary/util/u_math.h   |   8 ++
  src/gallium/drivers/radeon/r600_pipe_common.c |   1 +
  src/gallium/drivers/radeon/r600_pipe_common.h |   1 +
  src/gallium/drivers/radeon/radeon_winsys.h|  19 +++-
  src/gallium/drivers/radeonsi/si_descriptors.c | 149 +-
  src/gallium/drivers/radeonsi/si_hw_context.c  |   4 +-
  src/gallium/drivers/radeonsi/si_pipe.c|  18 
  src/gallium/drivers/radeonsi/si_pipe.h|   6 ++
  src/gallium/drivers/radeonsi/si_state.h   |   4 +
  src/gallium/drivers/radeonsi/si_state_draw.c  |  24 +
  src/gallium/drivers/radeonsi/sid.h|   6 ++
  src/gallium/winsys/amdgpu/drm/amdgpu_bo.c |   5 -
  src/gallium/winsys/amdgpu/drm/amdgpu_bo.h |   6 ++
  src/gallium/winsys/amdgpu/drm/amdgpu_cs.c | 125 ++---
  src/gallium/winsys/amdgpu/drm/amdgpu_cs.h |  26 +++--
  src/gallium/winsys/radeon/drm/radeon_drm_cs.c |  10 +-
  src/gallium/winsys/radeon/drm/radeon_drm_cs.h |   1 +
  17 files changed, 326 insertions(+), 87 deletions(-)


___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH 00/13] Use the constant engine in radeonsi

2016-04-13 Thread Bas Nieuwenhuizen
This series implements updating descriptors using the constant engine.
This should result in a 0%-3% improvement for CPU bound applications,
as we only have to upload the change descriptors from the CPU.

There are very slight performance advantages on the GPU too, as the
CE uploads the data directly to the L2 cache. This series reduces the
number of cycles waited on LGKM counters by about 1/6th across multiple
applications. However this translates in almost negligible performance
improvements.

Only the amdgpu winsys supports this for now. I think the radeon
winsys can be extended to support it, but I don't have any radeon
hardware so it is probably better that someone else implements that.

This series does not include support for partially updating
uniforms. My gallium interface proof of concept became kind of messy
and the constant engine patches have been requested for descriptors.

Bas Nieuwenhuizen (10):
  winsys/amdgpu: Enlarge const IB size.
  radeonsi: Create CE IB.
  radeonsi: Add dirty_mask to descriptor list.
  radeonsi: Add CE packet definitions.
  radeonsi: Add CE synchronization.
  radeonsi: Allocate chunks of CE ram.
  radeonsi: Add CE uploader.
  radeonsi: Use CE for vertex buffers.
  gallium/util: Add u_bit_scan_consecutive_range64.
  radeonsi: Use CE for all descriptors.

Marek Olšák (3):
  gallium/radeon: move ring_type into winsyses
  winsys/amdgpu: split IB data into a new structure in preparation for
CE
  winsys/amdgpu: add support for const IB

 src/gallium/auxiliary/util/u_math.h   |   8 ++
 src/gallium/drivers/radeon/r600_pipe_common.c |   1 +
 src/gallium/drivers/radeon/r600_pipe_common.h |   1 +
 src/gallium/drivers/radeon/radeon_winsys.h|  19 +++-
 src/gallium/drivers/radeonsi/si_descriptors.c | 149 +-
 src/gallium/drivers/radeonsi/si_hw_context.c  |   4 +-
 src/gallium/drivers/radeonsi/si_pipe.c|  18 
 src/gallium/drivers/radeonsi/si_pipe.h|   6 ++
 src/gallium/drivers/radeonsi/si_state.h   |   4 +
 src/gallium/drivers/radeonsi/si_state_draw.c  |  24 +
 src/gallium/drivers/radeonsi/sid.h|   6 ++
 src/gallium/winsys/amdgpu/drm/amdgpu_bo.c |   5 -
 src/gallium/winsys/amdgpu/drm/amdgpu_bo.h |   6 ++
 src/gallium/winsys/amdgpu/drm/amdgpu_cs.c | 125 ++---
 src/gallium/winsys/amdgpu/drm/amdgpu_cs.h |  26 +++--
 src/gallium/winsys/radeon/drm/radeon_drm_cs.c |  10 +-
 src/gallium/winsys/radeon/drm/radeon_drm_cs.h |   1 +
 17 files changed, 326 insertions(+), 87 deletions(-)

-- 
2.8.0

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev