drm/radeon/kms: improve performance of blit-copy

2011-10-13 Thread Roland Scheidegger
On 13.10.2011 05:29, Ilija Hadzic wrote:
> 
> The following set of patches will improve the performance of
> blit-copy functions for Radeon GPUs based on R600, R700, Evergreen
> and NI ASICs.
> 
> The foundation for improvement is the use of tiled mode access (which
> for copying bo's can be used regardless of whether the content is
> tiled or not), and segmenting the memory block being copied into
> rectangles whose edge ratio is between 1:1 and 1:2. This maximizes
> the number of PCIe transactions that use maximum payload size
> (typically 128 bytes) and also creates a memory access pattern that
> is more favorable for both VRAM and host DRAM than what's currently
> in the kernel.
> 
> To come up with the new blit-copy code, I did a lot of PCIe traffic
> analysis with the bus analyzer and also had many discussions with
> Alex, trying to explain what's going on (thanks to Alex for his
> time).
> 
> Below (at the end of this note) are the results of some benchmarks 
> that I did with various GPUs (all in the same host: Intel i7 CPU, X58
> chipset, three DRAM channels). To run the tests on your machine load
> the radeon module with 'benchmark=1 pcie_gen2=1' parameters. Most
> significant improvement is in the upstream (VRAM to GART) direction
> because that's where the PCIe transactions were fragmented and also
> where memory access pattern was such that it created a lot of 
> backpressure from the host.
> 
> It is also interesting that high-end devices (e.g. Cayman) exhibit 
> the least improvement and were the worst to begin with. This is 
> because high-end devices copy more tiles in parallel which in turn
> can create bank conflicts on host memory and cause the host to do
> lots of bank-close/precharge/bank-open cycles.
Interesting stuff! Nice results, showing the low-end devices completely
blowing away the high-end ones for VRAM->GTT blits :-).
I guess it isn't possible to temporarily disable some RBEs or otherwise
reconfigure the chip so that you could get the same performance for the
high-end chips? Granted, the high-end chips are only much slower for
VRAM->GTT according to these results, but even the other way they are
still ~20% or so slower.
Anyway, I can't comment much on the patches, though the idea certainly
seems to make sense.

Roland



> As an added "bonus", I also did some code cleanup and consolidated
> the repeated code into a common function, so r600 and evergreen/NI
> parts now share the blit-copy code. I also expanded the benchmark
> coverage, so the module now takes a benchmark parameter value between 1
> and 8, each of which runs a different benchmark.
> 
> For details, see the commit log messages and the code. I have been
> running with these patches for a few months (and I kept rebasing them
> to drm-core-next as the public git progressed) and I used them in a
> system setup that does *many* copies of this kind (and does them
> frequently); I have not seen instabilities introduced by these
> patches. I also verified the correctness of the copy using the test=1
> parameter for each GPU that I had, and the test passed.
> 
> I would welcome some feedback and if you run the benchmarks with the
> new blit code, I would very much like to hear what kind of
> improvement you are seeing.
> 


drm/radeon/kms: improve performance of blit-copy

2011-10-13 Thread Ilija Hadzic


On Thu, 13 Oct 2011, Roland Scheidegger wrote:

> I guess it isn't possible to temporarily disable some RBEs or otherwise
> reconfigure the chip so that you could get the same performance for the
> high-end chips?

According to the conversation I had with Alex, this *is* possible, but it
requires a pipeline and cache flush. So it is unclear what the overall
gain would be given the flush penalty.

Also, this phenomenon occurs only when GTT is involved in the copy. In
VRAM-to-VRAM copies, where no host memory is involved (I added a benchmark
for this, but didn't report it in my note yesterday), high-end devices
are beating low-end ones big time -- they better be ;-)

So if we can get RBE reduction to work, it should be turned on only when
one of the BOs is in the GTT domain. I looked at what it would take to do
this, and it's doable, but it requires hacks in many places.

-- Ilija


drm/radeon/kms: improve performance of blit-copy

2011-10-13 Thread Ilija Hadzic

The following set of patches will improve the performance
of blit-copy functions for Radeon GPUs based on 
R600, R700, Evergreen and NI ASICs.

The foundation for improvement is the use of tiled mode access
(which for copying bo's can be used regardless of whether the
content is tiled or not), and segmenting the memory block
being copied into rectangles whose edge ratio is between 1:1
and 1:2. This maximizes the number of PCIe transactions that
use maximum payload size (typically 128 bytes) and also 
creates a memory access pattern that is more favorable for
both VRAM and host DRAM than what's currently in the kernel.
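
As a rough illustration of the segmentation idea (not the code from the
patches), the sketch below splits a copy of n pages into rectangles whose
height is a power of two and whose width is between one and two times the
height. It assumes a simplified model in which one square unit holds
exactly one GPU page; the names pick_rect and count_rects are made up for
this example.

```c
#include <assert.h>

/* Hypothetical model: one "unit" is a square tile holding exactly one GPU
 * page, so a rectangle of w x h units covers w * h pages. */
struct rect {
    unsigned w;   /* width in units  */
    unsigned h;   /* height in units */
};

/* Pick the largest rectangle covering at most n pages (n > 0), with h a
 * power of two and h <= w <= 2 * h. */
static struct rect pick_rect(unsigned n)
{
    struct rect r;
    unsigned h = 1;

    assert(n > 0);
    /* Double h while a square of the doubled height still fits in n. */
    while ((unsigned long long)(2 * h) * (2 * h) <= n)
        h *= 2;

    r.h = h;
    r.w = n / h;        /* >= h, because h * h <= n */
    if (r.w > 2 * h)
        r.w = 2 * h;    /* cap the edge ratio at 1:2 */
    return r;
}

/* Split a copy of n pages into rectangles; returns how many were needed. */
static unsigned count_rects(unsigned n)
{
    unsigned count = 0;

    while (n > 0) {
        struct rect r = pick_rect(n);

        n -= r.w * r.h; /* 1 <= r.w * r.h <= n, so the loop terminates */
        count++;
    }
    return count;
}
```

Under this model a 10-page copy becomes a 4x2 rectangle (8 pages)
followed by a 2x1 rectangle (2 pages).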

To come up with the new blit-copy code, I did a lot of 
PCIe traffic analysis with the bus analyzer and also 
had many discussions with Alex, trying to explain what's 
going on (thanks to Alex for his time).

Below (at the end of this note) are the results of some benchmarks
that I did with various GPUs (all in the same host: Intel i7 CPU,
X58 chipset, three DRAM channels). To run the tests on your machine
load the radeon module with 'benchmark=1 pcie_gen2=1' parameters.
Most significant improvement is in the upstream (VRAM to GART)
direction because that's where the PCIe transactions were fragmented 
and also where memory access pattern was such that it created a lot of 
backpressure from the host.

It is also interesting that high-end devices (e.g. Cayman) exhibit
the least improvement and were the worst to begin with. This is
because high-end devices copy more tiles in parallel which 
in turn can create bank conflicts on host memory and cause the
host to do lots of bank-close/precharge/bank-open cycles. 

As an added "bonus", I also did some code cleanup and consolidated
the repeated code into a common function, so r600 and evergreen/NI
parts now share the blit-copy code. I also expanded the
benchmark coverage, so the module now takes a benchmark parameter
value between 1 and 8, each of which runs a different
benchmark.

For details, see the commit log messages and the code.
I have been running with these patches for a few months
(and I kept rebasing them to drm-core-next as the public
git progressed) and I used them in a system setup that does
*many* copies of this kind (and does them frequently); I
have not seen instabilities introduced by these patches. I also
verified the correctness of the copy using the test=1 parameter
for each GPU that I had, and the test passed.

I would welcome some feedback and if you run the benchmarks
with the new blit code, I would very much like to hear
what kind of improvement you are seeing.


BENCHMARK RESULTS:
==================

1) VRAM to GTT
==============

Card (ASIC)      VRAM           Before   After
----------------------------------------------
5570 (Redwood)   DDR3 1600MHz      454    3912
6450 (Caicos)    DDR5 3200MHz     3718    5090
6570 (Turks)     DDR3 1800MHz      484    4144
5450 (Cedar)     DDR3 1600MHz     3679    5090
5450 (Cedar)     DDR2  800MHz     2695    4639
E4690 (RV730)    DDR3 1400MHz      485    4969
E6760 (Turks)    DDR5 3200MHz      474    4177
V5700 (RV730)    DDR3 MHz          488    4297
2260 (RV620)     DDR2 MHz          494    3093
6870 (Barts)     DDR5 4200MHz      475    1113
6970 (Cayman)    DDR5 4200MHz      473     710

2) GTT to VRAM
==============

Card (ASIC)      VRAM           Before   After
----------------------------------------------
5570 (Redwood)   DDR3 1600MHz     3158    3360
6450 (Caicos)    DDR5 3200MHz     2995    3393
6570 (Turks)     DDR3 1800MHz     3039    3339
5450 (Cedar)     DDR3 1600MHz     3246    3404
5450 (Cedar)     DDR2  800MHz     2614    3371
E4690 (RV730)    DDR3 1400MHz     3084    3426
E6760 (Turks)    DDR5 3200MHz     2443    2570
V5700 (RV730)    DDR3 MHz         3187    3506
2260 (RV620)     DDR2 MHz          584    3246
6870 (Barts)     DDR5 4200MHz     2472    2601
6970 (Cayman)    DDR5 4200MHz     2460    2737
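
The improvement per card is simply After divided by Before. A minimal
sketch of that arithmetic follows; the helper names are made up for this
example, and the numbers carry whatever unit the benchmark prints. Taking
the Cayman rows (473 and 710 for VRAM to GTT, 2460 and 2737 for GTT to
VRAM) gives a factor of about 1.5 and a gain of about 11%, while the
Redwood VRAM-to-GTT row (454 and 3912) gives a factor of about 8.6.

```c
#include <assert.h>

/* Speedup factor after/before; before must be non-zero. */
static double speedup(unsigned before, unsigned after)
{
    assert(before != 0);
    return (double)after / (double)before;
}

/* Relative gain in percent, e.g. 100 -> 120 gives 20. */
static double gain_percent(unsigned before, unsigned after)
{
    return (speedup(before, after) - 1.0) * 100.0;
}
```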


Re: drm/radeon/kms: improve performance of blit-copy

2011-10-13 Thread Ilija Hadzic


Dave,

Alex pointed out to me that the patches I sent last night under this thread
may conflict with 003cefe0c238e683a29d2207dba945b508cd45b7, which currently
resides on the drm-fixes branch (my patches are based on drm-next or
drm-core-next).


I'd like to make sure that the eventual merge goes smoothly:

If you merge drm-fixes before my patches, then I'll rebase my patches and 
resend them after that happens and make sure everything is resolved 
correctly.


If you merge my patches first and then follow with the drm-fixes merge, two
things should happen with 003cefe0c238e683a29d2207dba945b508cd45b7. Hunks
related to evergreen.c will fall out, but that's expected and OK
because my patches consolidate the blit code for r600 and evergreen into a
common one. Then in r600.c, the hunks related to the r600_blit_prepare_copy
and r600_kms_blit_copy function calls will show conflicts, which should be
resolved such that the size argument is num_gpu_pages, not
num_gpu_pages * RADEON_GPU_PAGE_SIZE (this is because the new blit code
takes the size argument in pages, not bytes). Everything else will merge
smoothly.
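
To illustrate the unit change, here is a minimal user-space sketch of the
conversion between CPU pages, GPU pages and bytes. The helper names are
hypothetical and the 4 KiB value for RADEON_GPU_PAGE_SIZE follows the "4k"
GPU page size stated in the commit message; a 16 KiB CPU page is used only
as an example of a system where the two page sizes differ.

```c
#include <assert.h>

/* GPU page size of 4 KiB, as stated in the commit message. */
#define RADEON_GPU_PAGE_SIZE 4096u

/* Hypothetical helper: convert a count of CPU pages into GPU pages.
 * cpu_page_size must be a non-zero multiple of the GPU page size.
 * No overflow check is done on the result. */
static unsigned cpu_to_gpu_pages(unsigned num_cpu_pages, unsigned cpu_page_size)
{
    assert(cpu_page_size >= RADEON_GPU_PAGE_SIZE);
    assert(cpu_page_size % RADEON_GPU_PAGE_SIZE == 0);
    return num_cpu_pages * (cpu_page_size / RADEON_GPU_PAGE_SIZE);
}

/* Hypothetical helper: size in bytes for callers that still need it. */
static unsigned long long gpu_pages_to_bytes(unsigned num_gpu_pages)
{
    return (unsigned long long)num_gpu_pages * RADEON_GPU_PAGE_SIZE;
}
```

With 4 KiB CPU pages the page count is unchanged; with 16 KiB CPU pages,
8 CPU pages become 32 GPU pages, and that count (not the byte size) is what
a callback taking its size in GPU pages would expect.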


For reference, pasted below is a patch that resulted after I cherry-picked 
003cefe0c238e683a29d2207dba945b508cd45b7 into drm-next augmented with my 
blit-improvement patches and resolved the conflicts correctly.


I guess the first option is less work for you (and I will be glad to
rebase my patches if need be), but I hope that the info here is good
enough to make the second path as easy as it can be.


thanks,

Ilija


From b12516c003cb35059f16ace774ef5a21170d6d78 Mon Sep 17 00:00:00 2001
From: Alex Deucher alexander.deuc...@amd.com
Date: Fri, 16 Sep 2011 12:04:08 -0400
Subject: [PATCH 11/14] drm/radeon/kms: Make GPU/CPU page size handling
 consistent in blit code (v3)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The BO blit code inconsistently handled the page size.  This wasn't
an issue on systems with 4k pages since the GPU's page size is 4k as
well.  Switch the driver blit callbacks to take num pages in GPU
page units.

Fixes lemote mipsel systems using AMD rs780/rs880 chipsets.

v2: incorporate suggestions from Michel.

Signed-off-by: Alex Deucher alexander.deuc...@amd.com
Reviewed-by: Michel Dänzer michel.daen...@amd.com
Cc: sta...@kernel.org
Signed-off-by: Dave Airlie airl...@redhat.com

v3: reconcile with changes due to blit-copy improvements on drm-next
branch

substitutes the v2 patch that currently resides on drm-fixes
branch

Conflicts:

drivers/gpu/drm/radeon/evergreen.c
drivers/gpu/drm/radeon/r600.c
drivers/gpu/drm/radeon/radeon_asic.h

Signed-off-by: Ilija Hadzic ihad...@research.bell-labs.com
---
 drivers/gpu/drm/radeon/r100.c        |   12 ++--
 drivers/gpu/drm/radeon/r200.c        |    4 ++--
 drivers/gpu/drm/radeon/r600.c        |   10 ++
 drivers/gpu/drm/radeon/radeon.h      |    7 ---
 drivers/gpu/drm/radeon/radeon_asic.h |    6 +++---
 drivers/gpu/drm/radeon/radeon_ttm.c  |    7 ++-
 6 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 5985cb0..df60803 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -724,11 +724,11 @@ void r100_fence_ring_emit(struct radeon_device *rdev,
 int r100_copy_blit(struct radeon_device *rdev,
 		   uint64_t src_offset,
 		   uint64_t dst_offset,
-		   unsigned num_pages,
+		   unsigned num_gpu_pages,
 		   struct radeon_fence *fence)
 {
 	uint32_t cur_pages;
-	uint32_t stride_bytes = PAGE_SIZE;
+	uint32_t stride_bytes = RADEON_GPU_PAGE_SIZE;
 	uint32_t pitch;
 	uint32_t stride_pixels;
 	unsigned ndw;
@@ -740,7 +740,7 @@ int r100_copy_blit(struct radeon_device *rdev,
 	/* radeon pitch is /64 */
 	pitch = stride_bytes / 64;
 	stride_pixels = stride_bytes / 4;
-	num_loops = DIV_ROUND_UP(num_pages, 8191);
+	num_loops = DIV_ROUND_UP(num_gpu_pages, 8191);
 
 	/* Ask for enough room for blit + flush + fence */
 	ndw = 64 + (10 * num_loops);
@@ -749,12 +749,12 @@ int r100_copy_blit(struct radeon_device *rdev,
 		DRM_ERROR("radeon: moving bo (%d) asking for %u dw.\n", r, ndw);
 		return -EINVAL;
 	}
-	while (num_pages > 0) {
-		cur_pages = num_pages;
+	while (num_gpu_pages > 0) {
+		cur_pages = num_gpu_pages;
 		if (cur_pages > 8191) {
 			cur_pages = 8191;
 		}
-		num_pages -= cur_pages;
+		num_gpu_pages -= cur_pages;
 
 		/* pages are in Y direction - height
 		   page width in X direction - width */
diff --git a/drivers/gpu/drm/radeon/r200.c b/drivers/gpu/drm/radeon/r200.c
index f240583..a1f3ba0 100644
--- a/drivers/gpu/drm/radeon/r200.c
+++ 
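
The loop touched by the r100.c hunks above issues one blit per chunk of at
most 8191 GPU pages. A small user-space sketch of the same chunking (the
helper names are hypothetical, and it only counts the blits instead of
programming the ring) suggests why the number of iterations matches the
DIV_ROUND_UP value used to size the ring request:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PAGES_PER_BLIT 8191u  /* per-blit limit used in the hunk */

/* Same result as the kernel's DIV_ROUND_UP(n, d), written so that it
 * cannot overflow for large n. */
static unsigned div_round_up(unsigned n, unsigned d)
{
    return n / d + (n % d != 0);
}

/* Walk num_gpu_pages in chunks and return how many blits are needed.
 * If last_chunk is non-NULL it receives the size of the final chunk
 * (0 when there was nothing to copy). */
static unsigned split_blits(unsigned num_gpu_pages, unsigned *last_chunk)
{
    const unsigned expected = div_round_up(num_gpu_pages, MAX_PAGES_PER_BLIT);
    unsigned cur_pages = 0;
    unsigned loops = 0;

    while (num_gpu_pages > 0) {
        cur_pages = num_gpu_pages;
        if (cur_pages > MAX_PAGES_PER_BLIT)
            cur_pages = MAX_PAGES_PER_BLIT;
        num_gpu_pages -= cur_pages;
        loops++;
    }
    /* The loop count equals the rounded-up division used for num_loops. */
    assert(loops == expected);
    if (last_chunk)
        *last_chunk = cur_pages;
    return loops;
}
```

For instance, 20000 GPU pages are copied as two blits of 8191 pages and a
final blit of 3618 pages.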
