Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
On Wed, Mar 27, 2013 at 4:45 AM, Christian König deathsim...@vodafone.de wrote: Am 27.03.2013 01:43, schrieb Jerome Glisse: On Tue, Mar 26, 2013 at 6:45 PM, Dave Airlie airl...@gmail.com wrote: correctly). But Marek is quite right that this only counts for state objects and makes no sense for set_* and draw_* calls (and I'm currently thinking how to avoid that and can't come up with a proper solution). Anyway it's definitely not an urgent problem for radeonsi. It will be a problem once we actually start caring about performance and, most importantly, the CPU overhead of the driver. I still think that writing into the command buffers directly (e.g. without wrapper functions) is a bad idea, cause that lead to mixing driver logic and I'm convinced the exact opposite is a bad idea, because it adds another layer all commands must go through. A layer which brings no advantage. Think about apps which issue 1k-10k draw calls per frame. It's obvious that every byte moved around counts and the key to high framerate is to do (almost) nothing in the driver. It looks like the idea here is to make the driver as slow as possible. packet building in r600g. For example just try to figure out how the relocation in NOPs work by reading the source (please keep in mind that one of the primary goals why AMD is supporting this driver is to give a good example code for customers who want to implement that stuff on their own systems). I'm shocked. Sacrificing performance in the name of making the code nicer for some customers? Seriously? I thought the plan was to make the best graphics driver ever. Well, maybe I'm repeating myself: Performance is not a priority, it's only nice to have! Sorry to say so, but if we sacrifice a bit of performance for more code readability than that is perfectly ok with me (Don't understand me wrong I would really prefer to replace the closed source driver today than tomorrow, it's unfortunately just not what I'm paid for). On the other hand, we are talking about perfectly optimizeable inline functions and/or macros. All I'm saying is that we should structurize the code a bit more. Its okay to take steps in the right direction, but if you start taking steps that away from performance in lieu of code readability then please be prepared to deal with objections. The thing is in a lot of cases, code readability is in the eye of the beholder, I'm sure Jerome though r600g was perfectly readable when he wrote it, but a lot of us didn't and spent a lot of time trying to remove the CPU overheads, not least the amount of time Marek spent. The thing is performance is measureable, code readability isn't. Dave. Maybe once again you forgot why i did things the way i did them, i explained myself to you back then, i designed r600g for a new kernel api which was violently different from the cs one, my hope was that the other kernel api would be better, it was not and i never pushed more on that front. So r600g design was definitely not adapted to the cs ioctl and not thinked for it. History often explain a lot of things and people seems to forget about them. That being said, i too find ironic the code readability argument, if one understand the cs ioctl then the r600g code as it's nowadays make sense, but the radeonsi code is closer to what r600g use to be. So assuming same ioctl i would say that radeonsi should move towards what r600g is nowadays. Anyway just wanted to set history straight. Well I think you hit the point here quite well, may I ask what your kernel interface would have been looked like? Christian. I use to have a branch on fdo with it, basicly what use to be r600_hw_context was a nop in gallium and you had state in kernel (cb, db, sampler view, sampler, ...) and you created them and then bound them so everything was mostly security check at creation time and bound time was pretty quick, it was also transaction based. Relocation was easier too. Anyway it was a bad API, i know that in closed world or more obscure stack you can have a kernel api that doesn't do much security check and call it a day which gives you a lot more freedom on api. Cheers, Jerome ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
Am 25.03.2013 18:15, schrieb j.gli...@gmail.com: From: Jerome Glisse jgli...@redhat.com Same as on r600, trace cs execution by writting cs offset after each states, this allow to pin point lockup inside command stream and narrow down the scope of lockup investigation. v2: Use WRITE_DATA packet instead of WRITE_MEM Signed-off-by: Jerome Glisse jgli...@redhat.com --- src/gallium/drivers/radeonsi/r600_hw_context.c | 61 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.c | 22 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.h | 12 + src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 + src/gallium/drivers/radeonsi/si_state_draw.c | 7 ++- src/gallium/drivers/radeonsi/sid.h | 14 ++ 6 files changed, 127 insertions(+), 1 deletion(-) diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index bd348f9..967f093 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx, unsigned num_dw, /* Save 16 dwords for the fence mechanism. */ num_dw += 16; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + num_dw += R600_TRACE_CS_DWORDS; + } +#endif + /* Flush if there's not enough space. */ if (num_dw RADEON_MAX_CMDBUF_DWORDS) { radeonsi_flush(ctx-context, NULL, RADEON_FLUSH_ASYNC); @@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) /* force to keep tiling flags */ flags |= RADEON_FLUSH_KEEP_TILING_FLAGS; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i cs-cdw; i++) { + fprintf(stderr, [%4d] [%5d] 0x%08x\n, rscreen-cs_count, i, cs-buf[i]); + } + rscreen-cs_count++; + } +#endif + /* Flush the CS. */ ctx-ws-cs_flush(ctx-cs, flags); +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i 10; i++) { + usleep(5); + if (!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf, RADEON_USAGE_READWRITE)) { + break; + } + } + if (i == 10) { + fprintf(stderr, timeout on cs lockup likely happen at cs %d dw %d\n, + rscreen-trace_ptr[1], rscreen-trace_ptr[0]); + } else { + fprintf(stderr, cs %d executed in %dms\n, rscreen-trace_ptr[1], i * 5); + } + } +#endif + ctx-pm4_dirty_cdwords = 0; ctx-flags = 0; @@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct r600_context *ctx, struct r600_so_tar cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size, RADEON_USAGE_READ); } + +#if R600_TRACE_CS +void r600_trace_emit(struct r600_context *rctx) +{ + struct r600_screen *rscreen = rctx-screen; + struct radeon_winsys_cs *cs = rctx-cs; + uint64_t va; + uint32_t reloc; + + va = r600_resource_va(rscreen-screen, (void*)rscreen-trace_bo); + reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo, RADEON_USAGE_READWRITE); + cs-buf[cs-cdw++] = PKT3(PKT3_WRITE_DATA, 4, 0); + cs-buf[cs-cdw++] = PKT3_WRITE_DATA_DST_SEL(PKT3_WRITE_DATA_DST_SEL_MEM_SYNC) | + PKT3_WRITE_DATA_WR_CONFIRM | + PKT3_WRITE_DATA_ENGINE_SEL(PKT3_WRITE_DATA_ENGINE_SEL_ME); + cs-buf[cs-cdw++] = va 0xUL; + cs-buf[cs-cdw++] = (va 32UL) 0xUL; + cs-buf[cs-cdw++] = cs-cdw; + cs-buf[cs-cdw++] = rscreen-cs_count; + cs-buf[cs-cdw++] = PKT3(PKT3_NOP, 0, 0); + cs-buf[cs-cdw++] = reloc; The NOP packet here is superfluous, also I don't really like how this is implemented after all. May I just use this patch as base of a cleaner implementation? Christian. +} +#endif diff --git a/src/gallium/drivers/radeonsi/radeonsi_pipe.c b/src/gallium/drivers/radeonsi/radeonsi_pipe.c index c5dac29..a370d7e 100644 --- a/src/gallium/drivers/radeonsi/radeonsi_pipe.c +++ b/src/gallium/drivers/radeonsi/radeonsi_pipe.c @@ -525,6 +525,14 @@ static void r600_destroy_screen(struct pipe_screen* pscreen) rscreen-ws-buffer_unmap(rscreen-fences.bo-cs_buf); si_resource_reference(rscreen-fences.bo, NULL); } + +#if R600_TRACE_CS + if (rscreen-trace_bo) { + rscreen-ws-buffer_unmap(rscreen-trace_bo-cs_buf); + pipe_resource_reference((struct pipe_resource**)rscreen-trace_bo, NULL); + } +#endif +
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
Am 26.03.2013 14:42, schrieb Jerome Glisse: On Tue, Mar 26, 2013 at 6:22 AM, Christian König deathsim...@vodafone.de wrote: Am 25.03.2013 18:15, schrieb j.gli...@gmail.com: From: Jerome Glisse jgli...@redhat.com Same as on r600, trace cs execution by writting cs offset after each states, this allow to pin point lockup inside command stream and narrow down the scope of lockup investigation. v2: Use WRITE_DATA packet instead of WRITE_MEM Signed-off-by: Jerome Glisse jgli...@redhat.com --- src/gallium/drivers/radeonsi/r600_hw_context.c | 61 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.c | 22 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.h | 12 + src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 + src/gallium/drivers/radeonsi/si_state_draw.c | 7 ++- src/gallium/drivers/radeonsi/sid.h | 14 ++ 6 files changed, 127 insertions(+), 1 deletion(-) diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index bd348f9..967f093 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx, unsigned num_dw, /* Save 16 dwords for the fence mechanism. */ num_dw += 16; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + num_dw += R600_TRACE_CS_DWORDS; + } +#endif + /* Flush if there's not enough space. */ if (num_dw RADEON_MAX_CMDBUF_DWORDS) { radeonsi_flush(ctx-context, NULL, RADEON_FLUSH_ASYNC); @@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) /* force to keep tiling flags */ flags |= RADEON_FLUSH_KEEP_TILING_FLAGS; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i cs-cdw; i++) { + fprintf(stderr, [%4d] [%5d] 0x%08x\n, rscreen-cs_count, i, cs-buf[i]); + } + rscreen-cs_count++; + } +#endif + /* Flush the CS. */ ctx-ws-cs_flush(ctx-cs, flags); +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i 10; i++) { + usleep(5); + if (!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf, RADEON_USAGE_READWRITE)) { + break; + } + } + if (i == 10) { + fprintf(stderr, timeout on cs lockup likely happen at cs %d dw %d\n, + rscreen-trace_ptr[1], rscreen-trace_ptr[0]); + } else { + fprintf(stderr, cs %d executed in %dms\n, rscreen-trace_ptr[1], i * 5); + } + } +#endif + ctx-pm4_dirty_cdwords = 0; ctx-flags = 0; @@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct r600_context *ctx, struct r600_so_tar cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size, RADEON_USAGE_READ); } + +#if R600_TRACE_CS +void r600_trace_emit(struct r600_context *rctx) +{ + struct r600_screen *rscreen = rctx-screen; + struct radeon_winsys_cs *cs = rctx-cs; + uint64_t va; + uint32_t reloc; + + va = r600_resource_va(rscreen-screen, (void*)rscreen-trace_bo); + reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo, RADEON_USAGE_READWRITE); + cs-buf[cs-cdw++] = PKT3(PKT3_WRITE_DATA, 4, 0); + cs-buf[cs-cdw++] = PKT3_WRITE_DATA_DST_SEL(PKT3_WRITE_DATA_DST_SEL_MEM_SYNC) | + PKT3_WRITE_DATA_WR_CONFIRM | + PKT3_WRITE_DATA_ENGINE_SEL(PKT3_WRITE_DATA_ENGINE_SEL_ME); + cs-buf[cs-cdw++] = va 0xUL; + cs-buf[cs-cdw++] = (va 32UL) 0xUL; + cs-buf[cs-cdw++] = cs-cdw; + cs-buf[cs-cdw++] = rscreen-cs_count; + cs-buf[cs-cdw++] = PKT3(PKT3_NOP, 0, 0); + cs-buf[cs-cdw++] = reloc; The NOP packet here is superfluous, also I don't really like how this is implemented after all. May I just use this patch as base of a cleaner implementation? Christian. Yeah nop is a left over, what don't you like ? This is a build time debug option only that proved very useful (at least to me) on r600g Yeah it's definitely useful and a quite nifty idea, but as one example I would like to have it as a runtime option instead of a compile time option. Additional to that on SI the stuff in r600_hw_context is more or less deprecated, e.g. all the commands still have the now unnecessary NOPs, and actually I think some of them aren't really supported any more on SI (or at least a bit differently). Also I hoped to move all generating of PKT3 to si_commands. and so separate generating the
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
Speaking of si_pm4_state, I think it's a horrible mechanism for anything other than constant state objects (create/bind/delete functions). For everything else (set/draw functions), you want to emit directly into the command stream. It's not so different from the bad state management which r600g used to have (which is now gone). If you have to call malloc or calloc in a set_* or draw_* function, you're doing it wrong. Are there plans to change it to something more efficient (e.g. how r300g and r600g emit non-CSO states right now), or will it be like this forever? Marek On Tue, Mar 26, 2013 at 3:00 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 14:42, schrieb Jerome Glisse: On Tue, Mar 26, 2013 at 6:22 AM, Christian König deathsim...@vodafone.de wrote: Am 25.03.2013 18:15, schrieb j.gli...@gmail.com: From: Jerome Glisse jgli...@redhat.com Same as on r600, trace cs execution by writting cs offset after each states, this allow to pin point lockup inside command stream and narrow down the scope of lockup investigation. v2: Use WRITE_DATA packet instead of WRITE_MEM Signed-off-by: Jerome Glisse jgli...@redhat.com --- src/gallium/drivers/radeonsi/r600_hw_context.c | 61 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.c | 22 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.h | 12 + src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 + src/gallium/drivers/radeonsi/si_state_draw.c | 7 ++- src/gallium/drivers/radeonsi/sid.h | 14 ++ 6 files changed, 127 insertions(+), 1 deletion(-) diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index bd348f9..967f093 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx, unsigned num_dw, /* Save 16 dwords for the fence mechanism. */ num_dw += 16; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + num_dw += R600_TRACE_CS_DWORDS; + } +#endif + /* Flush if there's not enough space. */ if (num_dw RADEON_MAX_CMDBUF_DWORDS) { radeonsi_flush(ctx-context, NULL, RADEON_FLUSH_ASYNC); @@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) /* force to keep tiling flags */ flags |= RADEON_FLUSH_KEEP_TILING_FLAGS; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i cs-cdw; i++) { + fprintf(stderr, [%4d] [%5d] 0x%08x\n, rscreen-cs_count, i, cs-buf[i]); + } + rscreen-cs_count++; + } +#endif + /* Flush the CS. */ ctx-ws-cs_flush(ctx-cs, flags); +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i 10; i++) { + usleep(5); + if (!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf, RADEON_USAGE_READWRITE)) { + break; + } + } + if (i == 10) { + fprintf(stderr, timeout on cs lockup likely happen at cs %d dw %d\n, + rscreen-trace_ptr[1], rscreen-trace_ptr[0]); + } else { + fprintf(stderr, cs %d executed in %dms\n, rscreen-trace_ptr[1], i * 5); + } + } +#endif + ctx-pm4_dirty_cdwords = 0; ctx-flags = 0; @@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct r600_context *ctx, struct r600_so_tar cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size, RADEON_USAGE_READ); } + +#if R600_TRACE_CS +void r600_trace_emit(struct r600_context *rctx) +{ + struct r600_screen *rscreen = rctx-screen; + struct radeon_winsys_cs *cs = rctx-cs; + uint64_t va; + uint32_t reloc; + + va = r600_resource_va(rscreen-screen, (void*)rscreen-trace_bo); + reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo, RADEON_USAGE_READWRITE); + cs-buf[cs-cdw++] = PKT3(PKT3_WRITE_DATA, 4, 0); + cs-buf[cs-cdw++] = PKT3_WRITE_DATA_DST_SEL(PKT3_WRITE_DATA_DST_SEL_MEM_SYNC) | + PKT3_WRITE_DATA_WR_CONFIRM | + PKT3_WRITE_DATA_ENGINE_SEL(PKT3_WRITE_DATA_ENGINE_SEL_ME); + cs-buf[cs-cdw++] = va 0xUL; + cs-buf[cs-cdw++] = (va 32UL) 0xUL; + cs-buf[cs-cdw++] = cs-cdw; + cs-buf[cs-cdw++] = rscreen-cs_count; + cs-buf[cs-cdw++] = PKT3(PKT3_NOP, 0, 0); + cs-buf[cs-cdw++] = reloc; The NOP packet here is superfluous,
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
Am 26.03.2013 15:34, schrieb Marek Olšák: Speaking of si_pm4_state, I think it's a horrible mechanism for anything other than constant state objects (create/bind/delete functions). For everything else (set/draw functions), you want to emit directly into the command stream. It's not so different from the bad state management which r600g used to have (which is now gone). If you have to call malloc or calloc in a set_* or draw_* function, you're doing it wrong. Are there plans to change it to something more efficient (e.g. how r300g and r600g emit non-CSO states right now), or will it be like this forever? Actually I hoped that r600g sooner or later moves into the same direction some more. The fact that we currently need to malloc every buffer indeed sucks badly, but that is still better than mixing packet generation with driver logic. Also I don't think that emitting directly into the command stream is such a good idea, we sooner or later want that buffer to be a buffer allocated in GART memory. And under this condition it is better to build up the commands in a (heavily cached) system memory and then memcpy then to the destination buffer. Christian. Marek On Tue, Mar 26, 2013 at 3:00 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 14:42, schrieb Jerome Glisse: On Tue, Mar 26, 2013 at 6:22 AM, Christian König deathsim...@vodafone.de wrote: Am 25.03.2013 18:15, schrieb j.gli...@gmail.com: From: Jerome Glisse jgli...@redhat.com Same as on r600, trace cs execution by writting cs offset after each states, this allow to pin point lockup inside command stream and narrow down the scope of lockup investigation. v2: Use WRITE_DATA packet instead of WRITE_MEM Signed-off-by: Jerome Glisse jgli...@redhat.com --- src/gallium/drivers/radeonsi/r600_hw_context.c | 61 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.c | 22 ++ src/gallium/drivers/radeonsi/radeonsi_pipe.h | 12 + src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 + src/gallium/drivers/radeonsi/si_state_draw.c | 7 ++- src/gallium/drivers/radeonsi/sid.h | 14 ++ 6 files changed, 127 insertions(+), 1 deletion(-) diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c b/src/gallium/drivers/radeonsi/r600_hw_context.c index bd348f9..967f093 100644 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c @@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx, unsigned num_dw, /* Save 16 dwords for the fence mechanism. */ num_dw += 16; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + num_dw += R600_TRACE_CS_DWORDS; + } +#endif + /* Flush if there's not enough space. */ if (num_dw RADEON_MAX_CMDBUF_DWORDS) { radeonsi_flush(ctx-context, NULL, RADEON_FLUSH_ASYNC); @@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx, unsigned flags) /* force to keep tiling flags */ flags |= RADEON_FLUSH_KEEP_TILING_FLAGS; +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i cs-cdw; i++) { + fprintf(stderr, [%4d] [%5d] 0x%08x\n, rscreen-cs_count, i, cs-buf[i]); + } + rscreen-cs_count++; + } +#endif + /* Flush the CS. */ ctx-ws-cs_flush(ctx-cs, flags); +#if R600_TRACE_CS + if (ctx-screen-trace_bo) { + struct r600_screen *rscreen = ctx-screen; + unsigned i; + + for (i = 0; i 10; i++) { + usleep(5); + if (!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf, RADEON_USAGE_READWRITE)) { + break; + } + } + if (i == 10) { + fprintf(stderr, timeout on cs lockup likely happen at cs %d dw %d\n, + rscreen-trace_ptr[1], rscreen-trace_ptr[0]); + } else { + fprintf(stderr, cs %d executed in %dms\n, rscreen-trace_ptr[1], i * 5); + } + } +#endif + ctx-pm4_dirty_cdwords = 0; ctx-flags = 0; @@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct r600_context *ctx, struct r600_so_tar cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size, RADEON_USAGE_READ); } + +#if R600_TRACE_CS +void r600_trace_emit(struct r600_context *rctx) +{ + struct r600_screen *rscreen = rctx-screen; + struct radeon_winsys_cs *cs = rctx-cs; + uint64_t va; + uint32_t reloc; + + va = r600_resource_va(rscreen-screen, (void*)rscreen-trace_bo); + reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo, RADEON_USAGE_READWRITE); + cs-buf[cs-cdw++] =
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
On Tue, Mar 26, 2013 at 3:59 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 15:34, schrieb Marek Olšák: Speaking of si_pm4_state, I think it's a horrible mechanism for anything other than constant state objects (create/bind/delete functions). For everything else (set/draw functions), you want to emit directly into the command stream. It's not so different from the bad state management which r600g used to have (which is now gone). If you have to call malloc or calloc in a set_* or draw_* function, you're doing it wrong. Are there plans to change it to something more efficient (e.g. how r300g and r600g emit non-CSO states right now), or will it be like this forever? Actually I hoped that r600g sooner or later moves into the same direction some more. The fact that we currently need to malloc every buffer indeed sucks badly, but that is still better than mixing packet generation with driver logic. I don't understand the last sentence. What mixing? The set_* and draw_* commands are supposed to be executed immediately, therefore it's reasonable and preferable to write to the CS directly. Having any intermediate storage for commands is a waste of time and space. Also I don't think that emitting directly into the command stream is such a good idea, we sooner or later want that buffer to be a buffer allocated in GART memory. And under this condition it is better to build up the commands in a (heavily cached) system memory and then memcpy then to the destination buffer. AFAIK, GART memory is cached on non-AGP systems, but even uncached access shouldn't be a big issue, because the access pattern is sequential and write-only. BTW, I have talked about emitting commands into a buffer object with Dave and he thinks it's a bad idea due to the map and unmap overhead. Also, we have to disallow writing to certain unsafe registers anyway. Marek ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák mar...@gmail.com wrote: On Tue, Mar 26, 2013 at 3:59 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 15:34, schrieb Marek Olšák: Speaking of si_pm4_state, I think it's a horrible mechanism for anything other than constant state objects (create/bind/delete functions). For everything else (set/draw functions), you want to emit directly into the command stream. It's not so different from the bad state management which r600g used to have (which is now gone). If you have to call malloc or calloc in a set_* or draw_* function, you're doing it wrong. Are there plans to change it to something more efficient (e.g. how r300g and r600g emit non-CSO states right now), or will it be like this forever? Actually I hoped that r600g sooner or later moves into the same direction some more. The fact that we currently need to malloc every buffer indeed sucks badly, but that is still better than mixing packet generation with driver logic. I don't understand the last sentence. What mixing? The set_* and draw_* commands are supposed to be executed immediately, therefore it's reasonable and preferable to write to the CS directly. Having any intermediate storage for commands is a waste of time and space. I agree here, i don't think uncached bo for command stream on new hw would bring huge perf increase, probably will just be noise. Also I don't think that emitting directly into the command stream is such a good idea, we sooner or later want that buffer to be a buffer allocated in GART memory. And under this condition it is better to build up the commands in a (heavily cached) system memory and then memcpy then to the destination buffer. AFAIK, GART memory is cached on non-AGP systems, but even uncached access shouldn't be a big issue, because the access pattern is sequential and write-only. BTW, I have talked about emitting commands into a buffer object with Dave and he thinks it's a bad idea due to the map and unmap overhead. Also, we have to disallow writing to certain unsafe registers anyway. Marek I think Christian is thinking about new hw cayman where we can skip register checking because of vm and hardware register checking (the hw CP checks that register in the user IB is not one of the privilege register and block write and throw irq if so). On this kind of hw you can have cmd stream in bo and don't do the map/unmap. Cheers, Jerome ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
On Tue, Mar 26, 2013 at 6:44 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 18:02, schrieb Jerome Glisse: On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák mar...@gmail.com wrote: On Tue, Mar 26, 2013 at 3:59 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 15:34, schrieb Marek Olšák: Speaking of si_pm4_state, I think it's a horrible mechanism for anything other than constant state objects (create/bind/delete functions). For everything else (set/draw functions), you want to emit directly into the command stream. It's not so different from the bad state management which r600g used to have (which is now gone). If you have to call malloc or calloc in a set_* or draw_* function, you're doing it wrong. Are there plans to change it to something more efficient (e.g. how r300g and r600g emit non-CSO states right now), or will it be like this forever? Actually I hoped that r600g sooner or later moves into the same direction some more. The fact that we currently need to malloc every buffer indeed sucks badly, but that is still better than mixing packet generation with driver logic. I don't understand the last sentence. What mixing? The set_* and draw_* commands are supposed to be executed immediately, therefore it's reasonable and preferable to write to the CS directly. Having any intermediate storage for commands is a waste of time and space. I agree here, i don't think uncached bo for command stream on new hw would bring huge perf increase, probably will just be noise. Also I don't think that emitting directly into the command stream is such a good idea, we sooner or later want that buffer to be a buffer allocated in GART memory. And under this condition it is better to build up the commands in a (heavily cached) system memory and then memcpy then to the destination buffer. AFAIK, GART memory is cached on non-AGP systems, but even uncached access shouldn't be a big issue, because the access pattern is sequential and write-only. BTW, I have talked about emitting commands into a buffer object with Dave and he thinks it's a bad idea due to the map and unmap overhead. Also, we have to disallow writing to certain unsafe registers anyway. Marek I think Christian is thinking about new hw cayman where we can skip register checking because of vm and hardware register checking (the hw CP checks that register in the user IB is not one of the privilege register and block write and throw irq if so). On this kind of hw you can have cmd stream in bo and don't do the map/unmap. Yes indeed, and my plan is to avoid the copying by referencing the state directly with indirect buffer commands. That should also make thinks like queries and predicated rendering a bit more simpler (think of PM4 subroutine calls). The problem on SI is that for embedded data and const IBs you need to patch up the buffer quite a bit after it is written (at least if I understand them If you need to patch up the buffer after it is written, i.e. it can't be immutable, you probably shouldn't use any buffer at all and just write the commands into the CS directly. How are you gonna update the buffer if it's busy? Unless you mean that some state depends on some other state and must be recomputed, which is another aspect of radeonsi which is pretty badly designed. Don't recompute states, just create multiple command buffers covering all possible combinations of states you can ever get with a single gallium CSO and switch between them at draw time. E.g. when some external state changes from I to J, switch from blend_state-variant[I] to blend_state-variant[J] (by just emitting variant[J]), where variant is an array of immutable command buffers. Same for pipe_surface (except that pipe_surface should initialize the variants on demand). correctly). But Marek is quite right that this only counts for state objects and makes no sense for set_* and draw_* calls (and I'm currently thinking how to avoid that and can't come up with a proper solution). Anyway it's definitely not an urgent problem for radeonsi. It will be a problem once we actually start caring about performance and, most importantly, the CPU overhead of the driver. I still think that writing into the command buffers directly (e.g. without wrapper functions) is a bad idea, cause that lead to mixing driver logic and I'm convinced the exact opposite is a bad idea, because it adds another layer all commands must go through. A layer which brings no advantage. Think about apps which issue 1k-10k draw calls per frame. It's obvious that every byte moved around counts and the key to high framerate is to do (almost) nothing in the driver. It looks like the idea here is to make the driver as slow as possible. packet building in r600g. For example just try to figure out how the relocation in NOPs work by reading the source (please keep in mind that one of the primary goals why AMD is supporting this driver is
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
Am 26.03.2013 20:28, schrieb Marek Olšák: On Tue, Mar 26, 2013 at 6:44 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 18:02, schrieb Jerome Glisse: On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák mar...@gmail.com wrote: On Tue, Mar 26, 2013 at 3:59 PM, Christian König deathsim...@vodafone.de wrote: Am 26.03.2013 15:34, schrieb Marek Olšák: Speaking of si_pm4_state, I think it's a horrible mechanism for anything other than constant state objects (create/bind/delete functions). For everything else (set/draw functions), you want to emit directly into the command stream. It's not so different from the bad state management which r600g used to have (which is now gone). If you have to call malloc or calloc in a set_* or draw_* function, you're doing it wrong. Are there plans to change it to something more efficient (e.g. how r300g and r600g emit non-CSO states right now), or will it be like this forever? Actually I hoped that r600g sooner or later moves into the same direction some more. The fact that we currently need to malloc every buffer indeed sucks badly, but that is still better than mixing packet generation with driver logic. I don't understand the last sentence. What mixing? The set_* and draw_* commands are supposed to be executed immediately, therefore it's reasonable and preferable to write to the CS directly. Having any intermediate storage for commands is a waste of time and space. I agree here, i don't think uncached bo for command stream on new hw would bring huge perf increase, probably will just be noise. Also I don't think that emitting directly into the command stream is such a good idea, we sooner or later want that buffer to be a buffer allocated in GART memory. And under this condition it is better to build up the commands in a (heavily cached) system memory and then memcpy then to the destination buffer. AFAIK, GART memory is cached on non-AGP systems, but even uncached access shouldn't be a big issue, because the access pattern is sequential and write-only. BTW, I have talked about emitting commands into a buffer object with Dave and he thinks it's a bad idea due to the map and unmap overhead. Also, we have to disallow writing to certain unsafe registers anyway. Marek I think Christian is thinking about new hw cayman where we can skip register checking because of vm and hardware register checking (the hw CP checks that register in the user IB is not one of the privilege register and block write and throw irq if so). On this kind of hw you can have cmd stream in bo and don't do the map/unmap. Yes indeed, and my plan is to avoid the copying by referencing the state directly with indirect buffer commands. That should also make thinks like queries and predicated rendering a bit more simpler (think of PM4 subroutine calls). The problem on SI is that for embedded data and const IBs you need to patch up the buffer quite a bit after it is written (at least if I understand them If you need to patch up the buffer after it is written, i.e. it can't be immutable, you probably shouldn't use any buffer at all and just write the commands into the CS directly. How are you gonna update the buffer if it's busy? Unless you mean that some state depends on some other state and must be recomputed, which is another aspect of radeonsi which is pretty badly designed. Don't recompute states, just create multiple command buffers covering all possible combinations of states you can ever get with a single gallium CSO and switch between them at draw time. E.g. when some external state changes from I to J, switch from blend_state-variant[I] to blend_state-variant[J] (by just emitting variant[J]), where variant is an array of immutable command buffers. Same for pipe_surface (except that pipe_surface should initialize the variants on demand). Yeah, I know. I only have two hands, where should do you suggest I should start? correctly). But Marek is quite right that this only counts for state objects and makes no sense for set_* and draw_* calls (and I'm currently thinking how to avoid that and can't come up with a proper solution). Anyway it's definitely not an urgent problem for radeonsi. It will be a problem once we actually start caring about performance and, most importantly, the CPU overhead of the driver. I still think that writing into the command buffers directly (e.g. without wrapper functions) is a bad idea, cause that lead to mixing driver logic and I'm convinced the exact opposite is a bad idea, because it adds another layer all commands must go through. A layer which brings no advantage. Think about apps which issue 1k-10k draw calls per frame. It's obvious that every byte moved around counts and the key to high framerate is to do (almost) nothing in the driver. It looks like the idea here is to make the driver as slow as possible. packet building in r600g. For example just try to figure out how the relocation in NOPs work by reading the source (please keep
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
correctly). But Marek is quite right that this only counts for state objects and makes no sense for set_* and draw_* calls (and I'm currently thinking how to avoid that and can't come up with a proper solution). Anyway it's definitely not an urgent problem for radeonsi. It will be a problem once we actually start caring about performance and, most importantly, the CPU overhead of the driver. I still think that writing into the command buffers directly (e.g. without wrapper functions) is a bad idea, cause that lead to mixing driver logic and I'm convinced the exact opposite is a bad idea, because it adds another layer all commands must go through. A layer which brings no advantage. Think about apps which issue 1k-10k draw calls per frame. It's obvious that every byte moved around counts and the key to high framerate is to do (almost) nothing in the driver. It looks like the idea here is to make the driver as slow as possible. packet building in r600g. For example just try to figure out how the relocation in NOPs work by reading the source (please keep in mind that one of the primary goals why AMD is supporting this driver is to give a good example code for customers who want to implement that stuff on their own systems). I'm shocked. Sacrificing performance in the name of making the code nicer for some customers? Seriously? I thought the plan was to make the best graphics driver ever. Well, maybe I'm repeating myself: Performance is not a priority, it's only nice to have! Sorry to say so, but if we sacrifice a bit of performance for more code readability than that is perfectly ok with me (Don't understand me wrong I would really prefer to replace the closed source driver today than tomorrow, it's unfortunately just not what I'm paid for). On the other hand, we are talking about perfectly optimizeable inline functions and/or macros. All I'm saying is that we should structurize the code a bit more. Its okay to take steps in the right direction, but if you start taking steps that away from performance in lieu of code readability then please be prepared to deal with objections. The thing is in a lot of cases, code readability is in the eye of the beholder, I'm sure Jerome though r600g was perfectly readable when he wrote it, but a lot of us didn't and spent a lot of time trying to remove the CPU overheads, not least the amount of time Marek spent. The thing is performance is measureable, code readability isn't. Dave. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
On 27 Mar 2013 08:45, Dave Airlie airl...@gmail.com wrote: correctly). But Marek is quite right that this only counts for state objects and makes no sense for set_* and draw_* calls (and I'm currently thinking how to avoid that and can't come up with a proper solution). Anyway it's definitely not an urgent problem for radeonsi. It will be a problem once we actually start caring about performance and, most importantly, the CPU overhead of the driver. I still think that writing into the command buffers directly (e.g. without wrapper functions) is a bad idea, cause that lead to mixing driver logic and I'm convinced the exact opposite is a bad idea, because it adds another layer all commands must go through. A layer which brings no advantage. Think about apps which issue 1k-10k draw calls per frame. It's obvious that every byte moved around counts and the key to high framerate is to do (almost) nothing in the driver. It looks like the idea here is to make the driver as slow as possible. packet building in r600g. For example just try to figure out how the relocation in NOPs work by reading the source (please keep in mind that one of the primary goals why AMD is supporting this driver is to give a good example code for customers who want to implement that stuff on their own systems). I'm shocked. Sacrificing performance in the name of making the code nicer for some customers? Seriously? I thought the plan was to make the best graphics driver ever. Well, maybe I'm repeating myself: Performance is not a priority, it's only nice to have! Sorry to say so, but if we sacrifice a bit of performance for more code readability than that is perfectly ok with me (Don't understand me wrong I would really prefer to replace the closed source driver today than tomorrow, it's unfortunately just not what I'm paid for). On the other hand, we are talking about perfectly optimizeable inline functions and/or macros. All I'm saying is that we should structurize the code a bit more. Its okay to take steps in the right direction, but if you start taking steps that away from performance in lieu of code readability then please be prepared to deal with objections. The thing is in a lot of cases, code readability is in the eye of the beholder, I'm sure Jerome though r600g was perfectly readable when he wrote it, but a lot of us didn't and spent a lot of time trying to remove the CPU overheads, not least the amount of time Marek spent. The thing is performance is measureable, code readability isn't. I also realised there is a bit of irony in using llvm if code readability is a prime goal :-P, I'm sure .td is readable to some, but it takes a fair bit of learning. Dave. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2
On Tue, Mar 26, 2013 at 6:45 PM, Dave Airlie airl...@gmail.com wrote: correctly). But Marek is quite right that this only counts for state objects and makes no sense for set_* and draw_* calls (and I'm currently thinking how to avoid that and can't come up with a proper solution). Anyway it's definitely not an urgent problem for radeonsi. It will be a problem once we actually start caring about performance and, most importantly, the CPU overhead of the driver. I still think that writing into the command buffers directly (e.g. without wrapper functions) is a bad idea, cause that lead to mixing driver logic and I'm convinced the exact opposite is a bad idea, because it adds another layer all commands must go through. A layer which brings no advantage. Think about apps which issue 1k-10k draw calls per frame. It's obvious that every byte moved around counts and the key to high framerate is to do (almost) nothing in the driver. It looks like the idea here is to make the driver as slow as possible. packet building in r600g. For example just try to figure out how the relocation in NOPs work by reading the source (please keep in mind that one of the primary goals why AMD is supporting this driver is to give a good example code for customers who want to implement that stuff on their own systems). I'm shocked. Sacrificing performance in the name of making the code nicer for some customers? Seriously? I thought the plan was to make the best graphics driver ever. Well, maybe I'm repeating myself: Performance is not a priority, it's only nice to have! Sorry to say so, but if we sacrifice a bit of performance for more code readability than that is perfectly ok with me (Don't understand me wrong I would really prefer to replace the closed source driver today than tomorrow, it's unfortunately just not what I'm paid for). On the other hand, we are talking about perfectly optimizeable inline functions and/or macros. All I'm saying is that we should structurize the code a bit more. Its okay to take steps in the right direction, but if you start taking steps that away from performance in lieu of code readability then please be prepared to deal with objections. The thing is in a lot of cases, code readability is in the eye of the beholder, I'm sure Jerome though r600g was perfectly readable when he wrote it, but a lot of us didn't and spent a lot of time trying to remove the CPU overheads, not least the amount of time Marek spent. The thing is performance is measureable, code readability isn't. Dave. Maybe once again you forgot why i did things the way i did them, i explained myself to you back then, i designed r600g for a new kernel api which was violently different from the cs one, my hope was that the other kernel api would be better, it was not and i never pushed more on that front. So r600g design was definitely not adapted to the cs ioctl and not thinked for it. History often explain a lot of things and people seems to forget about them. That being said, i too find ironic the code readability argument, if one understand the cs ioctl then the r600g code as it's nowadays make sense, but the radeonsi code is closer to what r600g use to be. So assuming same ioctl i would say that radeonsi should move towards what r600g is nowadays. Anyway just wanted to set history straight. Cheers, Jerome ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev