Re: [Mesa-dev] [PATCH] radeonsi: enable late VS export memory allocation
> > Axel Davy benchmarked this briefly. We may need more benchmarks though. > > Marek > I confirm setting this register helps get a few % with heaven. There was also another register to kill color exports early when doing depth only pass that helped a few % (but less). Axel ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: enable late VS export memory allocation
On Wed, Jan 13, 2016 at 4:35 PM, Axel Davywrote: > >> >> Axel Davy benchmarked this briefly. We may need more benchmarks though. >> >> Marek >> > > I confirm setting this register helps get a few % with heaven. > > There was also another register to kill color exports early when doing > depth only pass that helped a few % (but less). Do you remember which register it was? The hardware should not execute PS when doing depth-only rendering and the shader doesn't use KILL. Marek ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
[Mesa-dev] [PATCH] radeonsi: enable late VS export memory allocation
From: Marek OlšákThis allows more VS waves to run on CIK and VI. --- src/gallium/drivers/radeonsi/si_pipe.c | 4 +++- src/gallium/drivers/radeonsi/si_pipe.h | 4 src/gallium/drivers/radeonsi/si_state.c | 5 +++-- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c index 4e23cb1..cb0b8b2 100644 --- a/src/gallium/drivers/radeonsi/si_pipe.c +++ b/src/gallium/drivers/radeonsi/si_pipe.c @@ -207,8 +207,10 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen, /* XXX: This is the maximum value allowed. I'm not sure how to compute * this for non-cs shaders. Using the wrong value here can result in * GPU lockups, but the maximum value seems to always work. +* +* If this is changed, SI_LATE_ALLOC_VS_WAVES should be updated accordingly. */ - sctx->scratch_waves = 32 * sscreen->b.info.max_compute_units; + sctx->scratch_waves = SI_NUM_SCRATCH_WAVES * sscreen->b.info.max_compute_units; #if HAVE_LLVM >= 0x0306 /* Initialize LLVM TargetMachine */ diff --git a/src/gallium/drivers/radeonsi/si_pipe.h b/src/gallium/drivers/radeonsi/si_pipe.h index f83cb02..6ea3017 100644 --- a/src/gallium/drivers/radeonsi/si_pipe.h +++ b/src/gallium/drivers/radeonsi/si_pipe.h @@ -36,6 +36,10 @@ #define SI_BIG_ENDIAN 0 #endif +#define SI_NUM_SCRATCH_WAVES 32 +#define SI_LATE_ALLOC_VS_WAVES (SI_NUM_SCRATCH_WAVES - 1) + + /* The base vertex and primitive restart can be any number, but we must pick * one which will mean "unknown" for the purpose of state tracking and * the number shouldn't be a commonly-used one. */ diff --git a/src/gallium/drivers/radeonsi/si_state.c b/src/gallium/drivers/radeonsi/si_state.c index 2a6d2c6..a8b292f 100644 --- a/src/gallium/drivers/radeonsi/si_state.c +++ b/src/gallium/drivers/radeonsi/si_state.c @@ -3600,8 +3600,9 @@ static void si_init_config(struct si_context *sctx) si_pm4_set_reg(pm4, R_00B41C_SPI_SHADER_PGM_RSRC3_HS, 0); si_pm4_set_reg(pm4, R_00B31C_SPI_SHADER_PGM_RSRC3_ES, S_00B31C_CU_EN(0xfffe)); si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS, S_00B21C_CU_EN(0x)); - si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0x)); - si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(0)); + si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xfffe)); + si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, + S_00B11C_LIMIT(SI_LATE_ALLOC_VS_WAVES)); si_pm4_set_reg(pm4, R_00B01C_SPI_SHADER_PGM_RSRC3_PS, S_00B01C_CU_EN(0x)); } -- 2.1.4 ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: enable late VS export memory allocation
How well-tested is this? I'm a bit confused because I thought the scratch_waves is about the scratch buffer ring (which should always be allocated immediately when the shader starts), while the late_alloc is about allocating the space where exports go. So late_alloc is basically bounded by (a) the need to avoid lockups (when all compute units are running VS that don't have their export space allocated yet, so that there is nothing left for the next shader to run on and start processing the queue) and (b) even when there's no lockup, it may still hurt performance to have lots of VS waiting for export space to become available. I'm not sure if (a) is already covered by some of the registers that do reservations (i.e., reserve one CU to be used only for pixel shaders etc.), or if it's not an issue because every chip can have more than 31 waves in flight (which I think is true). As for (b), it's less critical but has anybody run benchmarks with this patch? Cheers, Nicolai On 12.01.2016 13:02, Marek Olšák wrote: From: Marek OlšákThis allows more VS waves to run on CIK and VI. --- src/gallium/drivers/radeonsi/si_pipe.c | 4 +++- src/gallium/drivers/radeonsi/si_pipe.h | 4 src/gallium/drivers/radeonsi/si_state.c | 5 +++-- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c index 4e23cb1..cb0b8b2 100644 --- a/src/gallium/drivers/radeonsi/si_pipe.c +++ b/src/gallium/drivers/radeonsi/si_pipe.c @@ -207,8 +207,10 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen, /* XXX: This is the maximum value allowed. I'm not sure how to compute * this for non-cs shaders. Using the wrong value here can result in * GPU lockups, but the maximum value seems to always work. +* +* If this is changed, SI_LATE_ALLOC_VS_WAVES should be updated accordingly. */ - sctx->scratch_waves = 32 * sscreen->b.info.max_compute_units; + sctx->scratch_waves = SI_NUM_SCRATCH_WAVES * sscreen->b.info.max_compute_units; #if HAVE_LLVM >= 0x0306 /* Initialize LLVM TargetMachine */ diff --git a/src/gallium/drivers/radeonsi/si_pipe.h b/src/gallium/drivers/radeonsi/si_pipe.h index f83cb02..6ea3017 100644 --- a/src/gallium/drivers/radeonsi/si_pipe.h +++ b/src/gallium/drivers/radeonsi/si_pipe.h @@ -36,6 +36,10 @@ #define SI_BIG_ENDIAN 0 #endif +#define SI_NUM_SCRATCH_WAVES 32 +#define SI_LATE_ALLOC_VS_WAVES (SI_NUM_SCRATCH_WAVES - 1) + + /* The base vertex and primitive restart can be any number, but we must pick * one which will mean "unknown" for the purpose of state tracking and * the number shouldn't be a commonly-used one. */ diff --git a/src/gallium/drivers/radeonsi/si_state.c b/src/gallium/drivers/radeonsi/si_state.c index 2a6d2c6..a8b292f 100644 --- a/src/gallium/drivers/radeonsi/si_state.c +++ b/src/gallium/drivers/radeonsi/si_state.c @@ -3600,8 +3600,9 @@ static void si_init_config(struct si_context *sctx) si_pm4_set_reg(pm4, R_00B41C_SPI_SHADER_PGM_RSRC3_HS, 0); si_pm4_set_reg(pm4, R_00B31C_SPI_SHADER_PGM_RSRC3_ES, S_00B31C_CU_EN(0xfffe)); si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS, S_00B21C_CU_EN(0x)); - si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0x)); - si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(0)); + si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xfffe)); + si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, + S_00B11C_LIMIT(SI_LATE_ALLOC_VS_WAVES)); si_pm4_set_reg(pm4, R_00B01C_SPI_SHADER_PGM_RSRC3_PS, S_00B01C_CU_EN(0x)); } ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: enable late VS export memory allocation
Hi Nicolai, I only tested piglit on Bonaire. The scratch buffer can deadlock too, though I guess we have enough scratch space that we should never run out of it. a) The lockup is prevented by the change to the value of SHADER_PGM_RSRC3_VS. It means that 1 compute unit is reserved for PS only. This is similar to tessellation, which isn't allowed to run on 2 compute units. b) Yes, it can hurt performance if there are too many PS waves. However, hw guys told me that the rule of thumb is to set the number of waves to about a half, which is 32. The number can also be calculated dynamically based on VS and PS resource usage, but I don't know the equations. Axel Davy benchmarked this briefly. We may need more benchmarks though. Marek On Tue, Jan 12, 2016 at 11:38 PM, Nicolai Hähnlewrote: > How well-tested is this? I'm a bit confused because I thought the > scratch_waves is about the scratch buffer ring (which should always be > allocated immediately when the shader starts), while the late_alloc is about > allocating the space where exports go. > > So late_alloc is basically bounded by > > (a) the need to avoid lockups (when all compute units are running VS that > don't have their export space allocated yet, so that there is nothing left > for the next shader to run on and start processing the queue) and > > (b) even when there's no lockup, it may still hurt performance to have lots > of VS waiting for export space to become available. > > I'm not sure if (a) is already covered by some of the registers that do > reservations (i.e., reserve one CU to be used only for pixel shaders etc.), > or if it's not an issue because every chip can have more than 31 waves in > flight (which I think is true). > > As for (b), it's less critical but has anybody run benchmarks with this > patch? > > Cheers, > Nicolai > > > On 12.01.2016 13:02, Marek Olšák wrote: >> >> From: Marek Olšák >> >> This allows more VS waves to run on CIK and VI. >> --- >> src/gallium/drivers/radeonsi/si_pipe.c | 4 +++- >> src/gallium/drivers/radeonsi/si_pipe.h | 4 >> src/gallium/drivers/radeonsi/si_state.c | 5 +++-- >> 3 files changed, 10 insertions(+), 3 deletions(-) >> >> diff --git a/src/gallium/drivers/radeonsi/si_pipe.c >> b/src/gallium/drivers/radeonsi/si_pipe.c >> index 4e23cb1..cb0b8b2 100644 >> --- a/src/gallium/drivers/radeonsi/si_pipe.c >> +++ b/src/gallium/drivers/radeonsi/si_pipe.c >> @@ -207,8 +207,10 @@ static struct pipe_context *si_create_context(struct >> pipe_screen *screen, >> /* XXX: This is the maximum value allowed. I'm not sure how to >> compute >> * this for non-cs shaders. Using the wrong value here can result >> in >> * GPU lockups, but the maximum value seems to always work. >> +* >> +* If this is changed, SI_LATE_ALLOC_VS_WAVES should be updated >> accordingly. >> */ >> - sctx->scratch_waves = 32 * sscreen->b.info.max_compute_units; >> + sctx->scratch_waves = SI_NUM_SCRATCH_WAVES * >> sscreen->b.info.max_compute_units; >> >> #if HAVE_LLVM >= 0x0306 >> /* Initialize LLVM TargetMachine */ >> diff --git a/src/gallium/drivers/radeonsi/si_pipe.h >> b/src/gallium/drivers/radeonsi/si_pipe.h >> index f83cb02..6ea3017 100644 >> --- a/src/gallium/drivers/radeonsi/si_pipe.h >> +++ b/src/gallium/drivers/radeonsi/si_pipe.h >> @@ -36,6 +36,10 @@ >> #define SI_BIG_ENDIAN 0 >> #endif >> >> +#define SI_NUM_SCRATCH_WAVES 32 >> +#define SI_LATE_ALLOC_VS_WAVES (SI_NUM_SCRATCH_WAVES - 1) >> + >> + >> /* The base vertex and primitive restart can be any number, but we must >> pick >>* one which will mean "unknown" for the purpose of state tracking and >>* the number shouldn't be a commonly-used one. */ >> diff --git a/src/gallium/drivers/radeonsi/si_state.c >> b/src/gallium/drivers/radeonsi/si_state.c >> index 2a6d2c6..a8b292f 100644 >> --- a/src/gallium/drivers/radeonsi/si_state.c >> +++ b/src/gallium/drivers/radeonsi/si_state.c >> @@ -3600,8 +3600,9 @@ static void si_init_config(struct si_context *sctx) >> si_pm4_set_reg(pm4, R_00B41C_SPI_SHADER_PGM_RSRC3_HS, 0); >> si_pm4_set_reg(pm4, R_00B31C_SPI_SHADER_PGM_RSRC3_ES, >> S_00B31C_CU_EN(0xfffe)); >> si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS, >> S_00B21C_CU_EN(0x)); >> - si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, >> S_00B118_CU_EN(0x)); >> - si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, >> S_00B11C_LIMIT(0)); >> + si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, >> S_00B118_CU_EN(0xfffe)); >> + si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, >> + S_00B11C_LIMIT(SI_LATE_ALLOC_VS_WAVES)); >> si_pm4_set_reg(pm4, R_00B01C_SPI_SHADER_PGM_RSRC3_PS, >> S_00B01C_CU_EN(0x)); >> } >> >> > ___ > mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: enable late VS export memory allocation
On 12.01.2016 18:55, Marek Olšák wrote: Hi Nicolai, I only tested piglit on Bonaire. The scratch buffer can deadlock too, though I guess we have enough scratch space that we should never run out of it. a) The lockup is prevented by the change to the value of SHADER_PGM_RSRC3_VS. It means that 1 compute unit is reserved for PS only. This is similar to tessellation, which isn't allowed to run on 2 compute units. b) Yes, it can hurt performance if there are too many PS waves. However, hw guys told me that the rule of thumb is to set the number of waves to about a half, which is 32. The number can also be calculated dynamically based on VS and PS resource usage, but I don't know the equations. Thanks for the explanation, that makes sense. Reviewed-by: Nicolai HähnleAxel Davy benchmarked this briefly. We may need more benchmarks though. Marek On Tue, Jan 12, 2016 at 11:38 PM, Nicolai Hähnle wrote: How well-tested is this? I'm a bit confused because I thought the scratch_waves is about the scratch buffer ring (which should always be allocated immediately when the shader starts), while the late_alloc is about allocating the space where exports go. So late_alloc is basically bounded by (a) the need to avoid lockups (when all compute units are running VS that don't have their export space allocated yet, so that there is nothing left for the next shader to run on and start processing the queue) and (b) even when there's no lockup, it may still hurt performance to have lots of VS waiting for export space to become available. I'm not sure if (a) is already covered by some of the registers that do reservations (i.e., reserve one CU to be used only for pixel shaders etc.), or if it's not an issue because every chip can have more than 31 waves in flight (which I think is true). As for (b), it's less critical but has anybody run benchmarks with this patch? Cheers, Nicolai On 12.01.2016 13:02, Marek Olšák wrote: From: Marek Olšák This allows more VS waves to run on CIK and VI. --- src/gallium/drivers/radeonsi/si_pipe.c | 4 +++- src/gallium/drivers/radeonsi/si_pipe.h | 4 src/gallium/drivers/radeonsi/si_state.c | 5 +++-- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c index 4e23cb1..cb0b8b2 100644 --- a/src/gallium/drivers/radeonsi/si_pipe.c +++ b/src/gallium/drivers/radeonsi/si_pipe.c @@ -207,8 +207,10 @@ static struct pipe_context *si_create_context(struct pipe_screen *screen, /* XXX: This is the maximum value allowed. I'm not sure how to compute * this for non-cs shaders. Using the wrong value here can result in * GPU lockups, but the maximum value seems to always work. +* +* If this is changed, SI_LATE_ALLOC_VS_WAVES should be updated accordingly. */ - sctx->scratch_waves = 32 * sscreen->b.info.max_compute_units; + sctx->scratch_waves = SI_NUM_SCRATCH_WAVES * sscreen->b.info.max_compute_units; #if HAVE_LLVM >= 0x0306 /* Initialize LLVM TargetMachine */ diff --git a/src/gallium/drivers/radeonsi/si_pipe.h b/src/gallium/drivers/radeonsi/si_pipe.h index f83cb02..6ea3017 100644 --- a/src/gallium/drivers/radeonsi/si_pipe.h +++ b/src/gallium/drivers/radeonsi/si_pipe.h @@ -36,6 +36,10 @@ #define SI_BIG_ENDIAN 0 #endif +#define SI_NUM_SCRATCH_WAVES 32 +#define SI_LATE_ALLOC_VS_WAVES (SI_NUM_SCRATCH_WAVES - 1) + + /* The base vertex and primitive restart can be any number, but we must pick * one which will mean "unknown" for the purpose of state tracking and * the number shouldn't be a commonly-used one. */ diff --git a/src/gallium/drivers/radeonsi/si_state.c b/src/gallium/drivers/radeonsi/si_state.c index 2a6d2c6..a8b292f 100644 --- a/src/gallium/drivers/radeonsi/si_state.c +++ b/src/gallium/drivers/radeonsi/si_state.c @@ -3600,8 +3600,9 @@ static void si_init_config(struct si_context *sctx) si_pm4_set_reg(pm4, R_00B41C_SPI_SHADER_PGM_RSRC3_HS, 0); si_pm4_set_reg(pm4, R_00B31C_SPI_SHADER_PGM_RSRC3_ES, S_00B31C_CU_EN(0xfffe)); si_pm4_set_reg(pm4, R_00B21C_SPI_SHADER_PGM_RSRC3_GS, S_00B21C_CU_EN(0x)); - si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0x)); - si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, S_00B11C_LIMIT(0)); + si_pm4_set_reg(pm4, R_00B118_SPI_SHADER_PGM_RSRC3_VS, S_00B118_CU_EN(0xfffe)); + si_pm4_set_reg(pm4, R_00B11C_SPI_SHADER_LATE_ALLOC_VS, + S_00B11C_LIMIT(SI_LATE_ALLOC_VS_WAVES)); si_pm4_set_reg(pm4, R_00B01C_SPI_SHADER_PGM_RSRC3_PS, S_00B01C_CU_EN(0x)); } ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org