drm/radeon/kms: improve performance of blit-copy
On 13.10.2011 05:29, Ilija Hadzic wrote:
> The following set of patches will improve the performance of the
> blit-copy functions for Radeon GPUs based on R600, R700, Evergreen
> and NI ASICs.
>
> The foundation for the improvement is the use of tiled-mode access
> (which, for copying BOs, can be used regardless of whether the
> content is tiled or not) and segmenting the memory block being copied
> into rectangles whose edge ratio is between 1:1 and 1:2. This
> maximizes the number of PCIe transactions that use the maximum
> payload size (typically 128 bytes) and also creates a memory access
> pattern that is more favorable for both VRAM and host DRAM than what
> is currently in the kernel.
>
> To come up with the new blit-copy code, I did a lot of PCIe traffic
> analysis with a bus analyzer and also had many discussions with Alex
> trying to explain what's going on (thanks to Alex for his time).
>
> Below (at the end of this note) are the results of some benchmarks
> that I ran with various GPUs (all in the same host: Intel i7 CPU, X58
> chipset, three DRAM channels). To run the tests on your machine, load
> the radeon module with the 'benchmark=1 pcie_gen2=1' parameters. The
> most significant improvement is in the upstream (VRAM to GART)
> direction, because that is where the PCIe transactions were
> fragmented and also where the memory access pattern created a lot of
> backpressure from the host.
>
> It is also interesting that high-end devices (e.g. Cayman) exhibit
> the least improvement and were the worst to begin with. This is
> because high-end devices copy more tiles in parallel, which in turn
> can create bank conflicts on host memory and cause the host to do
> lots of bank-close/precharge/bank-open cycles.

Interesting stuff! Nice results showing the low-end devices completely blowing away the high-end ones for VRAM->GTT blits :-).

I guess it isn't possible to temporarily disable some RBEs or otherwise reconfigure the chip so that you could get the same performance on the high-end chips? Granted, the high-end chips are only much slower for VRAM->GTT according to these results, but even the other way it's still ~20% or so.

Anyway, I can't comment much on the patches, though the idea certainly seems to make sense.

Roland

> As an added "bonus", I also did some code cleanup and consolidated
> the repeated code into a common function, so the r600 and
> evergreen/NI parts now share the blit-copy code. I also expanded the
> benchmark coverage, so the module now takes a benchmark parameter
> value between 1 and 8, each of which runs a different benchmark.
>
> For details, see the commit log messages and the code. I have been
> running with these patches for a few months (and I kept rebasing them
> to drm-core-next as the public git progressed) and I used them in a
> system setup that does *many* copies of this kind (and does them
> frequently); I have not seen instabilities introduced by these
> patches. I also verified the correctness of the copy using the test=1
> parameter for each GPU that I had, and the test passed.
>
> I would welcome some feedback, and if you run the benchmarks with the
> new blit code, I would very much like to hear what kind of
> improvement you are seeing.
drm/radeon/kms: improve performance of blit-copy
On Thu, 13 Oct 2011, Roland Scheidegger wrote:

> I guess it isn't possible to temporarily disable some RBEs or otherwise
> reconfigure the chip so that you could get the same performance for the
> high-end chips?

According to the conversation I had with Alex, this *is* possible, but it requires a pipeline and cache flush. So it is unclear what the overall gain would be given the flush penalty. Also, this phenomenon occurs only when GTT is involved in the copy. For VRAM-to-VRAM copies, in which no host memory is involved (and for which I added a benchmark, but didn't report the results in my note yesterday), high-end devices beat the low-end ones big time (they'd better ;-)). So if we can get RBE reduction to work, it should be turned on only when one of the BOs is in the GTT domain. I looked at what it would take to do this, and it's doable, but it requires hacks in many places.

-- Ilija
drm/radeon/kms: improve performance of blit-copy
The following set of patches improves the performance of the blit-copy functions for Radeon GPUs based on R600, R700, Evergreen and NI ASICs.

The foundation for the improvement is the use of tiled-mode access (which, for copying BOs, can be used regardless of whether the content is tiled or not) and segmenting the memory block being copied into rectangles whose edge ratio is between 1:1 and 1:2. This maximizes the number of PCIe transactions that use the maximum payload size (typically 128 bytes) and also creates a memory access pattern that is more favorable for both VRAM and host DRAM than what is currently in the kernel.

To come up with the new blit-copy code, I did a lot of PCIe traffic analysis with a bus analyzer and also had many discussions with Alex trying to explain what's going on (thanks to Alex for his time).

Below (at the end of this note) are the results of some benchmarks that I ran with various GPUs (all in the same host: Intel i7 CPU, X58 chipset, three DRAM channels). To run the tests on your machine, load the radeon module with the 'benchmark=1 pcie_gen2=1' parameters. The most significant improvement is in the upstream (VRAM to GART) direction, because that is where the PCIe transactions were fragmented and also where the memory access pattern created a lot of backpressure from the host.

It is also interesting that high-end devices (e.g. Cayman) exhibit the least improvement and were the worst to begin with. This is because high-end devices copy more tiles in parallel, which in turn can create bank conflicts on host memory and cause the host to do lots of bank-close/precharge/bank-open cycles.

As an added "bonus", I also did some code cleanup and consolidated the repeated code into a common function, so the r600 and evergreen/NI parts now share the blit-copy code. I also expanded the benchmark coverage, so the module now takes a benchmark parameter value between 1 and 8, each of which runs a different benchmark.
For details, see the commit log messages and the code. I have been running with these patches for a few months (and I kept rebasing them to drm-core-next as the public git progressed) and I used them in a system setup that does *many* copies of this kind (and does them frequently); I have not seen instabilities introduced by these patches. I also verified the correctness of the copy using the test=1 parameter for each GPU that I had, and the test passed.

I would welcome some feedback, and if you run the benchmarks with the new blit code, I would very much like to hear what kind of improvement you are seeing.

BENCHMARK RESULTS:

1) VRAM to GTT

Card (ASIC)      VRAM            Before   After
------------------------------------------------
5570 (Redwood)   DDR3 1600MHz       454    3912
6450 (Caicos)    DDR5 3200MHz      3718    5090
6570 (Turks)     DDR3 1800MHz       484    4144
5450 (Cedar)     DDR3 1600MHz      3679    5090
5450 (Cedar)     DDR2  800MHz      2695    4639
E4690 (RV730)    DDR3 1400MHz       485    4969
E6760 (Turks)    DDR5 3200MHz       474    4177
V5700 (RV730)    DDR3     MHz       488    4297
2260 (RV620)     DDR2     MHz       494    3093
6870 (Barts)     DDR5 4200MHz       475    1113
6970 (Cayman)    DDR5 4200MHz       473     710

2) GTT to VRAM

Card (ASIC)      VRAM            Before   After
------------------------------------------------
5570 (Redwood)   DDR3 1600MHz      3158    3360
6450 (Caicos)    DDR5 3200MHz      2995    3393
6570 (Turks)     DDR3 1800MHz      3039    3339
5450 (Cedar)     DDR3 1600MHz      3246    3404
5450 (Cedar)     DDR2  800MHz      2614    3371
E4690 (RV730)    DDR3 1400MHz      3084    3426
E6760 (Turks)    DDR5 3200MHz      2443    2570
V5700 (RV730)    DDR3     MHz      3187    3506
2260 (RV620)     DDR2     MHz       584    3246
6870 (Barts)     DDR5 4200MHz      2472    2601
6970 (Cayman)    DDR5 4200MHz      2460    2737
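The rectangle-segmentation idea described above can be sketched in a few lines. This is an illustrative sketch only, not code from the patches (pick_rect and blit_rect are hypothetical names): for a power-of-two number of GPU pages it picks a width/height pair whose ratio is exactly 1:1 or 1:2, which is one way to satisfy the stated 1:1..1:2 edge-ratio constraint.

```c
#include <assert.h>

struct blit_rect {
	unsigned int w;	/* width in GPU pages  */
	unsigned int h;	/* height in GPU pages */
};

/* Hypothetical helper: for a power-of-two number of GPU pages, choose a
 * w x h rectangle with w == h or w == 2 * h, so the edge ratio stays
 * within the 1:1..1:2 range described in the note above. */
static struct blit_rect pick_rect(unsigned int num_pages)
{
	unsigned int order = 0;
	struct blit_rect r;

	while ((1u << (order + 1)) <= num_pages)
		order++;		/* order = floor(log2(num_pages)) */

	r.h = 1u << (order / 2);	/* shorter edge */
	r.w = num_pages / r.h;		/* exact for power-of-two counts */
	return r;
}
```

For example, a 16 MiB copy (4096 pages of 4 KiB) becomes a 64x64-page square, while 8192 pages become a 128x64 rectangle; either way every scanline is long enough to fill maximum-size PCIe payloads.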
Re: drm/radeon/kms: improve performance of blit-copy
Dave,

Alex pointed out to me that the patches I sent last night under this thread may conflict with 003cefe0c238e683a29d2207dba945b508cd45b7, which currently resides on the drm-fixes branch (my patches are based on drm-next/drm-core-next). I'd like to make sure that the eventual merge goes smoothly.

If you merge drm-fixes before my patches, then I'll rebase my patches and resend them after that happens, and make sure everything is resolved correctly.

If you merge my patches first and then follow with the drm-fixes merge, two things should happen with 003cefe0c238e683a29d2207dba945b508cd45b7. The hunks related to the evergreen.c file will fall out, but that's expected and OK, because my patches consolidate the blit code for r600 and evergreen into a common one. Then, in r600.c, the hunks related to the r600_blit_prepare_copy and r600_kms_blit_copy function calls will show conflicts, which should be resolved such that the size argument is num_gpu_pages, not num_gpu_pages * RADEON_GPU_PAGE_SIZE (the new blit code takes its size argument in pages, not bytes). Everything else will merge smoothly.

For reference, pasted below is the patch that resulted after I cherry-picked 003cefe0c238e683a29d2207dba945b508cd45b7 into drm-next augmented with my blit-improvement patches and resolved the conflicts correctly. I guess the first option is less work for you (and I will be glad to rebase my patches if need be), but I hope that the info here is good enough to make the second path as easy as it can be.

thanks,

Ilija

From b12516c003cb35059f16ace774ef5a21170d6d78 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deuc...@amd.com>
Date: Fri, 16 Sep 2011 12:04:08 -0400
Subject: [PATCH 11/14] drm/radeon/kms: Make GPU/CPU page size handling
 consistent in blit code (v3)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The BO blit code inconsistently handled the page size. This wasn't an
issue on systems with 4k pages since the GPU's page size is 4k as well.
Switch the driver blit callbacks to take the number of pages in GPU
page units.

Fixes lemote mipsel systems using AMD rs780/rs880 chipsets.

v2: incorporate suggestions from Michel.

Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
Reviewed-by: Michel Dänzer <michel.daen...@amd.com>
Cc: sta...@kernel.org
Signed-off-by: Dave Airlie <airl...@redhat.com>

v3: reconcile with changes due to blit-copy improvements on the
drm-next branch; substitutes the v2 patch that currently resides on
the drm-fixes branch

Conflicts:
	drivers/gpu/drm/radeon/evergreen.c
	drivers/gpu/drm/radeon/r600.c
	drivers/gpu/drm/radeon/radeon_asic.h

Signed-off-by: Ilija Hadzic <ihad...@research.bell-labs.com>
---
 drivers/gpu/drm/radeon/r100.c        | 12 ++--
 drivers/gpu/drm/radeon/r200.c        |  4 ++--
 drivers/gpu/drm/radeon/r600.c        | 10 ++
 drivers/gpu/drm/radeon/radeon.h      |  7 ---
 drivers/gpu/drm/radeon/radeon_asic.h |  6 +++---
 drivers/gpu/drm/radeon/radeon_ttm.c  |  7 ++-
 6 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index 5985cb0..df60803 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -724,11 +724,11 @@ void r100_fence_ring_emit(struct radeon_device *rdev,
 int r100_copy_blit(struct radeon_device *rdev,
 		   uint64_t src_offset,
 		   uint64_t dst_offset,
-		   unsigned num_pages,
+		   unsigned num_gpu_pages,
 		   struct radeon_fence *fence)
 {
 	uint32_t cur_pages;
-	uint32_t stride_bytes = PAGE_SIZE;
+	uint32_t stride_bytes = RADEON_GPU_PAGE_SIZE;
 	uint32_t pitch;
 	uint32_t stride_pixels;
 	unsigned ndw;
@@ -740,7 +740,7 @@ int r100_copy_blit(struct radeon_device *rdev,
 	/* radeon pitch is /64 */
 	pitch = stride_bytes / 64;
 	stride_pixels = stride_bytes / 4;
-	num_loops = DIV_ROUND_UP(num_pages, 8191);
+	num_loops = DIV_ROUND_UP(num_gpu_pages, 8191);
 
 	/* Ask for enough room for blit + flush + fence */
 	ndw = 64 + (10 * num_loops);
@@ -749,12 +749,12 @@ int r100_copy_blit(struct radeon_device *rdev,
 		DRM_ERROR("radeon: moving bo (%d) asking for %u dw.\n", r, ndw);
 		return -EINVAL;
 	}
-	while (num_pages > 0) {
-		cur_pages = num_pages;
+	while (num_gpu_pages > 0) {
+		cur_pages = num_gpu_pages;
 		if (cur_pages > 8191) {
 			cur_pages = 8191;
 		}
-		num_pages -= cur_pages;
+		num_gpu_pages -= cur_pages;
 		/* pages are in Y direction - height
 		   page width in X direction - width */
diff --git a/drivers/gpu/drm/radeon/r200.c b/drivers/gpu/drm/radeon/r200.c
index f240583..a1f3ba0 100644
--- a/drivers/gpu/drm/radeon/r200.c
+++
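The unit convention at the center of the conflict resolution can be illustrated with a small sketch. The helpers below are hypothetical (only RADEON_GPU_PAGE_SIZE and DIV_ROUND_UP appear in the driver itself): after the merge, the blit callbacks take the copy size in GPU pages rather than bytes, and the r100 hunk above then splits those pages into loops of at most 8191.

```c
#include <assert.h>
#include <stdint.h>

#define RADEON_GPU_PAGE_SIZE 4096u	/* GPU page size, independent of the CPU's */
#define DIV_ROUND_UP(n, d)   (((n) + (d) - 1) / (d))

/* Hypothetical helper: convert a byte count into GPU pages, the unit the
 * reconciled blit callbacks expect. Resolving the conflict correctly
 * means passing num_gpu_pages to the callbacks, not
 * num_gpu_pages * RADEON_GPU_PAGE_SIZE. */
static unsigned int size_to_gpu_pages(uint64_t size_bytes)
{
	return (unsigned int)DIV_ROUND_UP(size_bytes, RADEON_GPU_PAGE_SIZE);
}

/* Hypothetical helper mirroring DIV_ROUND_UP(num_gpu_pages, 8191) in the
 * r100 hunk above: each blit loop moves at most 8191 pages. */
static unsigned int blit_loops(unsigned int num_gpu_pages)
{
	return DIV_ROUND_UP(num_gpu_pages, 8191u);
}
```

On systems with 4 KiB CPU pages the two units happen to coincide, which is why the bug only showed up on the mipsel (16 KiB page) systems mentioned in the commit message.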