Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-27 Thread Jerome Glisse
On Wed, Mar 27, 2013 at 4:45 AM, Christian König
deathsim...@vodafone.de wrote:
 Am 27.03.2013 01:43, schrieb Jerome Glisse:

 On Tue, Mar 26, 2013 at 6:45 PM, Dave Airlie airl...@gmail.com wrote:

 correctly). But Marek is quite right that this only counts for state
 objects
 and makes no sense for set_* and draw_* calls (and I'm currently
 thinking
 how to avoid that and can't come up with a proper solution). Anyway
 it's
 definitely not an urgent problem for radeonsi.

 It will be a problem once we actually start caring about performance
 and, most importantly, the CPU overhead of the driver.

 I still think that writing into the command buffers directly (e.g.
 without
 wrapper functions) is a bad idea, cause that lead to mixing driver
 logic
 and

 I'm convinced the exact opposite is a bad idea, because it adds
 another layer all commands must go through. A layer which brings no
 advantage. Think about apps which issue 1k-10k draw calls per frame.
 It's obvious that every byte moved around counts and the key to high
 framerate is to do (almost) nothing in the driver. It looks like the
 idea here is to make the driver as slow as possible.

 packet building in r600g. For example just try to figure out how the
 relocation in NOPs work by reading the source (please keep in mind
 that
 one
 of the primary goals why AMD is supporting this driver is to give a
 good
 example code for customers who want to implement that stuff on their
 own
 systems).

 I'm shocked. Sacrificing performance in the name of making the code
 nicer for some customers? Seriously? I thought the plan was to make
 the best graphics driver ever.


 Well, maybe I'm repeating myself: Performance is not a priority, it's
 only
 nice to have!

 Sorry to say so, but if we sacrifice a bit of performance for more code
 readability than that is perfectly ok with me (Don't understand me wrong
 I
 would really prefer to replace the closed source driver today than
 tomorrow,
 it's unfortunately just not what I'm paid for).

 On the other hand, we are talking about perfectly optimizeable inline
 functions and/or macros. All I'm saying is that we should structurize
 the
 code a bit more.

 Its okay to take steps in the right direction, but if you start taking
 steps that away
 from performance in lieu of code readability then please be prepared
 to deal with
 objections.

 The thing is in a lot of cases, code readability is in the eye of the
 beholder, I'm sure
 Jerome though r600g was perfectly readable when he wrote it, but a lot
 of us didn't
 and spent a lot of time trying to remove the CPU overheads, not least
 the amount of
 time Marek spent. The thing is performance is measureable, code
 readability isn't.

 Dave.

 Maybe once again you forgot why i did things the way i did them, i
 explained myself to you back then, i designed r600g for a new kernel
 api which was violently different from the cs one, my hope was that
 the other kernel api would be better, it was not and i never pushed
 more on that front. So r600g design was definitely not adapted to the
 cs ioctl and not thinked for it. History often explain a lot of things
 and people seems to forget about them.

 That being said, i too find ironic the code readability argument, if
 one understand the cs ioctl then the r600g code as it's nowadays make
 sense, but the radeonsi code is closer to what r600g use to be. So
 assuming same ioctl i would say that radeonsi should move towards what
 r600g is nowadays.

 Anyway just wanted to set history straight.


 Well I think you hit the point here quite well, may I ask what your kernel
 interface would have been looked like?

 Christian.

I use to have a branch on fdo with it, basicly what use to be
r600_hw_context was a nop in gallium and you had state in kernel (cb,
db, sampler view, sampler, ...) and you created them and then bound
them so everything was mostly security check at creation time and
bound time was pretty quick, it was also transaction based. Relocation
was easier too. Anyway it was a bad API, i know that in closed world
or more obscure stack you can have a kernel api that doesn't do much
security check and call it a day which gives you a lot more freedom on
api.

Cheers,
Jerome
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Christian König

Am 25.03.2013 18:15, schrieb j.gli...@gmail.com:

From: Jerome Glisse jgli...@redhat.com

Same as on r600, trace cs execution by writting cs offset after each
states, this allow to pin point lockup inside command stream and
narrow down the scope of lockup investigation.

v2: Use WRITE_DATA packet instead of WRITE_MEM

Signed-off-by: Jerome Glisse jgli...@redhat.com
---
  src/gallium/drivers/radeonsi/r600_hw_context.c | 61 ++
  src/gallium/drivers/radeonsi/radeonsi_pipe.c   | 22 ++
  src/gallium/drivers/radeonsi/radeonsi_pipe.h   | 12 +
  src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 +
  src/gallium/drivers/radeonsi/si_state_draw.c   |  7 ++-
  src/gallium/drivers/radeonsi/sid.h | 14 ++
  6 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c 
b/src/gallium/drivers/radeonsi/r600_hw_context.c
index bd348f9..967f093 100644
--- a/src/gallium/drivers/radeonsi/r600_hw_context.c
+++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
@@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx, unsigned 
num_dw,
/* Save 16 dwords for the fence mechanism. */
num_dw += 16;
  
+#if R600_TRACE_CS

+   if (ctx-screen-trace_bo) {
+   num_dw += R600_TRACE_CS_DWORDS;
+   }
+#endif
+
/* Flush if there's not enough space. */
if (num_dw  RADEON_MAX_CMDBUF_DWORDS) {
radeonsi_flush(ctx-context, NULL, RADEON_FLUSH_ASYNC);
@@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx, unsigned 
flags)
/* force to keep tiling flags */
flags |= RADEON_FLUSH_KEEP_TILING_FLAGS;
  
+#if R600_TRACE_CS

+   if (ctx-screen-trace_bo) {
+   struct r600_screen *rscreen = ctx-screen;
+   unsigned i;
+
+   for (i = 0; i  cs-cdw; i++) {
+   fprintf(stderr, [%4d] [%5d] 0x%08x\n, rscreen-cs_count, 
i, cs-buf[i]);
+   }
+   rscreen-cs_count++;
+   }
+#endif
+
/* Flush the CS. */
ctx-ws-cs_flush(ctx-cs, flags);
  
+#if R600_TRACE_CS

+   if (ctx-screen-trace_bo) {
+   struct r600_screen *rscreen = ctx-screen;
+   unsigned i;
+
+   for (i = 0; i  10; i++) {
+   usleep(5);
+   if (!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf, 
RADEON_USAGE_READWRITE)) {
+   break;
+   }
+   }
+   if (i == 10) {
+   fprintf(stderr, timeout on cs lockup likely happen at cs %d 
dw %d\n,
+   rscreen-trace_ptr[1], rscreen-trace_ptr[0]);
+   } else {
+   fprintf(stderr, cs %d executed in %dms\n, 
rscreen-trace_ptr[1], i * 5);
+   }
+   }
+#endif
+
ctx-pm4_dirty_cdwords = 0;
ctx-flags = 0;
  
@@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct r600_context *ctx, struct r600_so_tar

cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size, 
RADEON_USAGE_READ);
  
  }

+
+#if R600_TRACE_CS
+void r600_trace_emit(struct r600_context *rctx)
+{
+   struct r600_screen *rscreen = rctx-screen;
+   struct radeon_winsys_cs *cs = rctx-cs;
+   uint64_t va;
+   uint32_t reloc;
+
+   va = r600_resource_va(rscreen-screen, (void*)rscreen-trace_bo);
+   reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo, 
RADEON_USAGE_READWRITE);
+   cs-buf[cs-cdw++] = PKT3(PKT3_WRITE_DATA, 4, 0);
+   cs-buf[cs-cdw++] = 
PKT3_WRITE_DATA_DST_SEL(PKT3_WRITE_DATA_DST_SEL_MEM_SYNC) |
+   PKT3_WRITE_DATA_WR_CONFIRM |
+   
PKT3_WRITE_DATA_ENGINE_SEL(PKT3_WRITE_DATA_ENGINE_SEL_ME);
+   cs-buf[cs-cdw++] = va  0xUL;
+   cs-buf[cs-cdw++] = (va  32UL)  0xUL;
+   cs-buf[cs-cdw++] = cs-cdw;
+   cs-buf[cs-cdw++] = rscreen-cs_count;
+   cs-buf[cs-cdw++] = PKT3(PKT3_NOP, 0, 0);
+   cs-buf[cs-cdw++] = reloc;


The NOP packet here is superfluous,  also I don't really like how this 
is implemented after all.


May I just use this patch as base of a cleaner implementation?

Christian.


+}
+#endif
diff --git a/src/gallium/drivers/radeonsi/radeonsi_pipe.c 
b/src/gallium/drivers/radeonsi/radeonsi_pipe.c
index c5dac29..a370d7e 100644
--- a/src/gallium/drivers/radeonsi/radeonsi_pipe.c
+++ b/src/gallium/drivers/radeonsi/radeonsi_pipe.c
@@ -525,6 +525,14 @@ static void r600_destroy_screen(struct pipe_screen* 
pscreen)
rscreen-ws-buffer_unmap(rscreen-fences.bo-cs_buf);
si_resource_reference(rscreen-fences.bo, NULL);
}
+
+#if R600_TRACE_CS
+   if (rscreen-trace_bo) {
+   rscreen-ws-buffer_unmap(rscreen-trace_bo-cs_buf);
+   pipe_resource_reference((struct 
pipe_resource**)rscreen-trace_bo, NULL);
+   }
+#endif
+

Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Christian König

Am 26.03.2013 14:42, schrieb Jerome Glisse:

On Tue, Mar 26, 2013 at 6:22 AM, Christian König
deathsim...@vodafone.de wrote:

Am 25.03.2013 18:15, schrieb j.gli...@gmail.com:


From: Jerome Glisse jgli...@redhat.com

Same as on r600, trace cs execution by writting cs offset after each
states, this allow to pin point lockup inside command stream and
narrow down the scope of lockup investigation.

v2: Use WRITE_DATA packet instead of WRITE_MEM

Signed-off-by: Jerome Glisse jgli...@redhat.com
---
   src/gallium/drivers/radeonsi/r600_hw_context.c | 61
++
   src/gallium/drivers/radeonsi/radeonsi_pipe.c   | 22 ++
   src/gallium/drivers/radeonsi/radeonsi_pipe.h   | 12 +
   src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 +
   src/gallium/drivers/radeonsi/si_state_draw.c   |  7 ++-
   src/gallium/drivers/radeonsi/sid.h | 14 ++
   6 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c
b/src/gallium/drivers/radeonsi/r600_hw_context.c
index bd348f9..967f093 100644
--- a/src/gallium/drivers/radeonsi/r600_hw_context.c
+++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
@@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx,
unsigned num_dw,
 /* Save 16 dwords for the fence mechanism. */
 num_dw += 16;
   +#if R600_TRACE_CS
+   if (ctx-screen-trace_bo) {
+   num_dw += R600_TRACE_CS_DWORDS;
+   }
+#endif
+
 /* Flush if there's not enough space. */
 if (num_dw  RADEON_MAX_CMDBUF_DWORDS) {
 radeonsi_flush(ctx-context, NULL, RADEON_FLUSH_ASYNC);
@@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx,
unsigned flags)
 /* force to keep tiling flags */
 flags |= RADEON_FLUSH_KEEP_TILING_FLAGS;
   +#if R600_TRACE_CS
+   if (ctx-screen-trace_bo) {
+   struct r600_screen *rscreen = ctx-screen;
+   unsigned i;
+
+   for (i = 0; i  cs-cdw; i++) {
+   fprintf(stderr, [%4d] [%5d] 0x%08x\n,
rscreen-cs_count, i, cs-buf[i]);
+   }
+   rscreen-cs_count++;
+   }
+#endif
+
 /* Flush the CS. */
 ctx-ws-cs_flush(ctx-cs, flags);
   +#if R600_TRACE_CS
+   if (ctx-screen-trace_bo) {
+   struct r600_screen *rscreen = ctx-screen;
+   unsigned i;
+
+   for (i = 0; i  10; i++) {
+   usleep(5);
+   if
(!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf, RADEON_USAGE_READWRITE)) {
+   break;
+   }
+   }
+   if (i == 10) {
+   fprintf(stderr, timeout on cs lockup likely
happen at cs %d dw %d\n,
+   rscreen-trace_ptr[1],
rscreen-trace_ptr[0]);
+   } else {
+   fprintf(stderr, cs %d executed in %dms\n,
rscreen-trace_ptr[1], i * 5);
+   }
+   }
+#endif
+
 ctx-pm4_dirty_cdwords = 0;
 ctx-flags = 0;
   @@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct
r600_context *ctx, struct r600_so_tar
 cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size,
RADEON_USAGE_READ);
 }
+
+#if R600_TRACE_CS
+void r600_trace_emit(struct r600_context *rctx)
+{
+   struct r600_screen *rscreen = rctx-screen;
+   struct radeon_winsys_cs *cs = rctx-cs;
+   uint64_t va;
+   uint32_t reloc;
+
+   va = r600_resource_va(rscreen-screen, (void*)rscreen-trace_bo);
+   reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo,
RADEON_USAGE_READWRITE);
+   cs-buf[cs-cdw++] = PKT3(PKT3_WRITE_DATA, 4, 0);
+   cs-buf[cs-cdw++] =
PKT3_WRITE_DATA_DST_SEL(PKT3_WRITE_DATA_DST_SEL_MEM_SYNC) |
+   PKT3_WRITE_DATA_WR_CONFIRM |
+
PKT3_WRITE_DATA_ENGINE_SEL(PKT3_WRITE_DATA_ENGINE_SEL_ME);
+   cs-buf[cs-cdw++] = va  0xUL;
+   cs-buf[cs-cdw++] = (va  32UL)  0xUL;
+   cs-buf[cs-cdw++] = cs-cdw;
+   cs-buf[cs-cdw++] = rscreen-cs_count;
+   cs-buf[cs-cdw++] = PKT3(PKT3_NOP, 0, 0);
+   cs-buf[cs-cdw++] = reloc;


The NOP packet here is superfluous,  also I don't really like how this is
implemented after all.

May I just use this patch as base of a cleaner implementation?

Christian.

Yeah nop is a left over, what don't you like ? This is a build time
debug option only that proved very useful (at least to me) on r600g


Yeah it's definitely useful and a quite nifty idea, but as one example I 
would like to have it as a runtime option instead of a compile time option.


Additional to that on SI the stuff in r600_hw_context is more or less 
deprecated, e.g. all the commands still have the now unnecessary NOPs, 
and actually I think some of them aren't really supported any more on SI 
(or at least a bit differently). Also I hoped to move all generating of 
PKT3 to si_commands. and so separate generating the 

Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Marek Olšák
Speaking of si_pm4_state, I think it's a horrible mechanism for
anything other than constant state objects (create/bind/delete
functions). For everything else (set/draw functions), you want to emit
directly into the command stream. It's not so different from the bad
state management which r600g used to have (which is now gone). If you
have to call malloc or calloc in a set_* or draw_* function, you're
doing it wrong. Are there plans to change it to something more
efficient (e.g. how r300g and r600g emit non-CSO states right now), or
will it be like this forever?

Marek

On Tue, Mar 26, 2013 at 3:00 PM, Christian König
deathsim...@vodafone.de wrote:
 Am 26.03.2013 14:42, schrieb Jerome Glisse:

 On Tue, Mar 26, 2013 at 6:22 AM, Christian König
 deathsim...@vodafone.de wrote:

 Am 25.03.2013 18:15, schrieb j.gli...@gmail.com:

 From: Jerome Glisse jgli...@redhat.com

 Same as on r600, trace cs execution by writting cs offset after each
 states, this allow to pin point lockup inside command stream and
 narrow down the scope of lockup investigation.

 v2: Use WRITE_DATA packet instead of WRITE_MEM

 Signed-off-by: Jerome Glisse jgli...@redhat.com
 ---
src/gallium/drivers/radeonsi/r600_hw_context.c | 61
 ++
src/gallium/drivers/radeonsi/radeonsi_pipe.c   | 22 ++
src/gallium/drivers/radeonsi/radeonsi_pipe.h   | 12 +
src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 +
src/gallium/drivers/radeonsi/si_state_draw.c   |  7 ++-
src/gallium/drivers/radeonsi/sid.h | 14 ++
6 files changed, 127 insertions(+), 1 deletion(-)

 diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c
 b/src/gallium/drivers/radeonsi/r600_hw_context.c
 index bd348f9..967f093 100644
 --- a/src/gallium/drivers/radeonsi/r600_hw_context.c
 +++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
 @@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx,
 unsigned num_dw,
  /* Save 16 dwords for the fence mechanism. */
  num_dw += 16;
+#if R600_TRACE_CS
 +   if (ctx-screen-trace_bo) {
 +   num_dw += R600_TRACE_CS_DWORDS;
 +   }
 +#endif
 +
  /* Flush if there's not enough space. */
  if (num_dw  RADEON_MAX_CMDBUF_DWORDS) {
  radeonsi_flush(ctx-context, NULL,
 RADEON_FLUSH_ASYNC);
 @@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx,
 unsigned flags)
  /* force to keep tiling flags */
  flags |= RADEON_FLUSH_KEEP_TILING_FLAGS;
+#if R600_TRACE_CS
 +   if (ctx-screen-trace_bo) {
 +   struct r600_screen *rscreen = ctx-screen;
 +   unsigned i;
 +
 +   for (i = 0; i  cs-cdw; i++) {
 +   fprintf(stderr, [%4d] [%5d] 0x%08x\n,
 rscreen-cs_count, i, cs-buf[i]);
 +   }
 +   rscreen-cs_count++;
 +   }
 +#endif
 +
  /* Flush the CS. */
  ctx-ws-cs_flush(ctx-cs, flags);
+#if R600_TRACE_CS
 +   if (ctx-screen-trace_bo) {
 +   struct r600_screen *rscreen = ctx-screen;
 +   unsigned i;
 +
 +   for (i = 0; i  10; i++) {
 +   usleep(5);
 +   if
 (!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf,
 RADEON_USAGE_READWRITE)) {
 +   break;
 +   }
 +   }
 +   if (i == 10) {
 +   fprintf(stderr, timeout on cs lockup likely
 happen at cs %d dw %d\n,
 +   rscreen-trace_ptr[1],
 rscreen-trace_ptr[0]);
 +   } else {
 +   fprintf(stderr, cs %d executed in %dms\n,
 rscreen-trace_ptr[1], i * 5);
 +   }
 +   }
 +#endif
 +
  ctx-pm4_dirty_cdwords = 0;
  ctx-flags = 0;
@@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct
 r600_context *ctx, struct r600_so_tar
  cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size,
 RADEON_USAGE_READ);
  }
 +
 +#if R600_TRACE_CS
 +void r600_trace_emit(struct r600_context *rctx)
 +{
 +   struct r600_screen *rscreen = rctx-screen;
 +   struct radeon_winsys_cs *cs = rctx-cs;
 +   uint64_t va;
 +   uint32_t reloc;
 +
 +   va = r600_resource_va(rscreen-screen,
 (void*)rscreen-trace_bo);
 +   reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo,
 RADEON_USAGE_READWRITE);
 +   cs-buf[cs-cdw++] = PKT3(PKT3_WRITE_DATA, 4, 0);
 +   cs-buf[cs-cdw++] =
 PKT3_WRITE_DATA_DST_SEL(PKT3_WRITE_DATA_DST_SEL_MEM_SYNC) |
 +   PKT3_WRITE_DATA_WR_CONFIRM |
 +
 PKT3_WRITE_DATA_ENGINE_SEL(PKT3_WRITE_DATA_ENGINE_SEL_ME);
 +   cs-buf[cs-cdw++] = va  0xUL;
 +   cs-buf[cs-cdw++] = (va  32UL)  0xUL;
 +   cs-buf[cs-cdw++] = cs-cdw;
 +   cs-buf[cs-cdw++] = rscreen-cs_count;
 +   cs-buf[cs-cdw++] = PKT3(PKT3_NOP, 0, 0);
 +   cs-buf[cs-cdw++] = reloc;


 The NOP packet here is superfluous,  

Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Christian König

Am 26.03.2013 15:34, schrieb Marek Olšák:

Speaking of si_pm4_state, I think it's a horrible mechanism for
anything other than constant state objects (create/bind/delete
functions). For everything else (set/draw functions), you want to emit
directly into the command stream. It's not so different from the bad
state management which r600g used to have (which is now gone). If you
have to call malloc or calloc in a set_* or draw_* function, you're
doing it wrong. Are there plans to change it to something more
efficient (e.g. how r300g and r600g emit non-CSO states right now), or
will it be like this forever?


Actually I hoped that r600g sooner or later moves into the same 
direction some more. The fact that we currently need to malloc every 
buffer indeed sucks badly, but that is still better than mixing packet 
generation with driver logic.


Also I don't think that emitting directly into the command stream is 
such a good idea, we sooner or later want that buffer to be a buffer 
allocated in GART memory. And under this condition it is better to build 
up the commands in a (heavily cached) system memory and then memcpy then 
to the destination buffer.


Christian.



Marek

On Tue, Mar 26, 2013 at 3:00 PM, Christian König
deathsim...@vodafone.de wrote:

Am 26.03.2013 14:42, schrieb Jerome Glisse:


On Tue, Mar 26, 2013 at 6:22 AM, Christian König
deathsim...@vodafone.de wrote:

Am 25.03.2013 18:15, schrieb j.gli...@gmail.com:


From: Jerome Glisse jgli...@redhat.com

Same as on r600, trace cs execution by writting cs offset after each
states, this allow to pin point lockup inside command stream and
narrow down the scope of lockup investigation.

v2: Use WRITE_DATA packet instead of WRITE_MEM

Signed-off-by: Jerome Glisse jgli...@redhat.com
---
src/gallium/drivers/radeonsi/r600_hw_context.c | 61
++
src/gallium/drivers/radeonsi/radeonsi_pipe.c   | 22 ++
src/gallium/drivers/radeonsi/radeonsi_pipe.h   | 12 +
src/gallium/drivers/radeonsi/radeonsi_pm4.c| 12 +
src/gallium/drivers/radeonsi/si_state_draw.c   |  7 ++-
src/gallium/drivers/radeonsi/sid.h | 14 ++
6 files changed, 127 insertions(+), 1 deletion(-)

diff --git a/src/gallium/drivers/radeonsi/r600_hw_context.c
b/src/gallium/drivers/radeonsi/r600_hw_context.c
index bd348f9..967f093 100644
--- a/src/gallium/drivers/radeonsi/r600_hw_context.c
+++ b/src/gallium/drivers/radeonsi/r600_hw_context.c
@@ -142,6 +142,12 @@ void si_need_cs_space(struct r600_context *ctx,
unsigned num_dw,
  /* Save 16 dwords for the fence mechanism. */
  num_dw += 16;
+#if R600_TRACE_CS
+   if (ctx-screen-trace_bo) {
+   num_dw += R600_TRACE_CS_DWORDS;
+   }
+#endif
+
  /* Flush if there's not enough space. */
  if (num_dw  RADEON_MAX_CMDBUF_DWORDS) {
  radeonsi_flush(ctx-context, NULL,
RADEON_FLUSH_ASYNC);
@@ -206,9 +212,41 @@ void si_context_flush(struct r600_context *ctx,
unsigned flags)
  /* force to keep tiling flags */
  flags |= RADEON_FLUSH_KEEP_TILING_FLAGS;
+#if R600_TRACE_CS
+   if (ctx-screen-trace_bo) {
+   struct r600_screen *rscreen = ctx-screen;
+   unsigned i;
+
+   for (i = 0; i  cs-cdw; i++) {
+   fprintf(stderr, [%4d] [%5d] 0x%08x\n,
rscreen-cs_count, i, cs-buf[i]);
+   }
+   rscreen-cs_count++;
+   }
+#endif
+
  /* Flush the CS. */
  ctx-ws-cs_flush(ctx-cs, flags);
+#if R600_TRACE_CS
+   if (ctx-screen-trace_bo) {
+   struct r600_screen *rscreen = ctx-screen;
+   unsigned i;
+
+   for (i = 0; i  10; i++) {
+   usleep(5);
+   if
(!ctx-ws-buffer_is_busy(rscreen-trace_bo-buf,
RADEON_USAGE_READWRITE)) {
+   break;
+   }
+   }
+   if (i == 10) {
+   fprintf(stderr, timeout on cs lockup likely
happen at cs %d dw %d\n,
+   rscreen-trace_ptr[1],
rscreen-trace_ptr[0]);
+   } else {
+   fprintf(stderr, cs %d executed in %dms\n,
rscreen-trace_ptr[1], i * 5);
+   }
+   }
+#endif
+
  ctx-pm4_dirty_cdwords = 0;
  ctx-flags = 0;
@@ -665,3 +703,26 @@ void r600_context_draw_opaque_count(struct
r600_context *ctx, struct r600_so_tar
  cs-buf[cs-cdw++] = r600_context_bo_reloc(ctx, t-filled_size,
RADEON_USAGE_READ);
  }
+
+#if R600_TRACE_CS
+void r600_trace_emit(struct r600_context *rctx)
+{
+   struct r600_screen *rscreen = rctx-screen;
+   struct radeon_winsys_cs *cs = rctx-cs;
+   uint64_t va;
+   uint32_t reloc;
+
+   va = r600_resource_va(rscreen-screen,
(void*)rscreen-trace_bo);
+   reloc = r600_context_bo_reloc(rctx, rscreen-trace_bo,
RADEON_USAGE_READWRITE);
+   cs-buf[cs-cdw++] = 

Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Marek Olšák
On Tue, Mar 26, 2013 at 3:59 PM, Christian König
deathsim...@vodafone.de wrote:
 Am 26.03.2013 15:34, schrieb Marek Olšák:

 Speaking of si_pm4_state, I think it's a horrible mechanism for
 anything other than constant state objects (create/bind/delete
 functions). For everything else (set/draw functions), you want to emit
 directly into the command stream. It's not so different from the bad
 state management which r600g used to have (which is now gone). If you
 have to call malloc or calloc in a set_* or draw_* function, you're
 doing it wrong. Are there plans to change it to something more
 efficient (e.g. how r300g and r600g emit non-CSO states right now), or
 will it be like this forever?


 Actually I hoped that r600g sooner or later moves into the same direction
 some more. The fact that we currently need to malloc every buffer indeed
 sucks badly, but that is still better than mixing packet generation with
 driver logic.

I don't understand the last sentence. What mixing? The set_* and
draw_* commands are supposed to be executed immediately, therefore
it's reasonable and preferable to write to the CS directly. Having any
intermediate storage for commands is a waste of time and space.


 Also I don't think that emitting directly into the command stream is such a
 good idea, we sooner or later want that buffer to be a buffer allocated in
 GART memory. And under this condition it is better to build up the commands
 in a (heavily cached) system memory and then memcpy then to the destination
 buffer.

AFAIK, GART memory is cached on non-AGP systems, but even uncached
access shouldn't be a big issue, because the access pattern is
sequential and write-only. BTW, I have talked about emitting commands
into a buffer object with Dave and he thinks it's a bad idea due to
the map and unmap overhead. Also, we have to disallow writing to
certain unsafe registers anyway.

Marek
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Jerome Glisse
On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák mar...@gmail.com wrote:
 On Tue, Mar 26, 2013 at 3:59 PM, Christian König
 deathsim...@vodafone.de wrote:
 Am 26.03.2013 15:34, schrieb Marek Olšák:

 Speaking of si_pm4_state, I think it's a horrible mechanism for
 anything other than constant state objects (create/bind/delete
 functions). For everything else (set/draw functions), you want to emit
 directly into the command stream. It's not so different from the bad
 state management which r600g used to have (which is now gone). If you
 have to call malloc or calloc in a set_* or draw_* function, you're
 doing it wrong. Are there plans to change it to something more
 efficient (e.g. how r300g and r600g emit non-CSO states right now), or
 will it be like this forever?


 Actually I hoped that r600g sooner or later moves into the same direction
 some more. The fact that we currently need to malloc every buffer indeed
 sucks badly, but that is still better than mixing packet generation with
 driver logic.

 I don't understand the last sentence. What mixing? The set_* and
 draw_* commands are supposed to be executed immediately, therefore
 it's reasonable and preferable to write to the CS directly. Having any
 intermediate storage for commands is a waste of time and space.

I agree here, i don't think uncached bo for command stream on new hw
would bring huge perf increase, probably will just be noise.


 Also I don't think that emitting directly into the command stream is such a
 good idea, we sooner or later want that buffer to be a buffer allocated in
 GART memory. And under this condition it is better to build up the commands
 in a (heavily cached) system memory and then memcpy then to the destination
 buffer.

 AFAIK, GART memory is cached on non-AGP systems, but even uncached
 access shouldn't be a big issue, because the access pattern is
 sequential and write-only. BTW, I have talked about emitting commands
 into a buffer object with Dave and he thinks it's a bad idea due to
 the map and unmap overhead. Also, we have to disallow writing to
 certain unsafe registers anyway.

 Marek

I think Christian is thinking about new hw  cayman where we can skip
register checking because of vm and hardware register checking (the hw
CP checks that register in the user IB is not one of the privilege
register and block write and throw irq if so). On this kind of hw you
can have cmd stream in bo and don't do the map/unmap.

Cheers,
Jerome
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Marek Olšák
On Tue, Mar 26, 2013 at 6:44 PM, Christian König
deathsim...@vodafone.de wrote:
 Am 26.03.2013 18:02, schrieb Jerome Glisse:

 On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák mar...@gmail.com wrote:

 On Tue, Mar 26, 2013 at 3:59 PM, Christian König
 deathsim...@vodafone.de wrote:

 Am 26.03.2013 15:34, schrieb Marek Olšák:

 Speaking of si_pm4_state, I think it's a horrible mechanism for
 anything other than constant state objects (create/bind/delete
 functions). For everything else (set/draw functions), you want to emit
 directly into the command stream. It's not so different from the bad
 state management which r600g used to have (which is now gone). If you
 have to call malloc or calloc in a set_* or draw_* function, you're
 doing it wrong. Are there plans to change it to something more
 efficient (e.g. how r300g and r600g emit non-CSO states right now), or
 will it be like this forever?


 Actually I hoped that r600g sooner or later moves into the same
 direction
 some more. The fact that we currently need to malloc every buffer indeed
 sucks badly, but that is still better than mixing packet generation with
 driver logic.

 I don't understand the last sentence. What mixing? The set_* and
 draw_* commands are supposed to be executed immediately, therefore
 it's reasonable and preferable to write to the CS directly. Having any
 intermediate storage for commands is a waste of time and space.

 I agree here, i don't think uncached bo for command stream on new hw
 would bring huge perf increase, probably will just be noise.

 Also I don't think that emitting directly into the command stream is
 such a
 good idea, we sooner or later want that buffer to be a buffer allocated
 in
 GART memory. And under this condition it is better to build up the
 commands
 in a (heavily cached) system memory and then memcpy then to the
 destination
 buffer.

 AFAIK, GART memory is cached on non-AGP systems, but even uncached
 access shouldn't be a big issue, because the access pattern is
 sequential and write-only. BTW, I have talked about emitting commands
 into a buffer object with Dave and he thinks it's a bad idea due to
 the map and unmap overhead. Also, we have to disallow writing to
 certain unsafe registers anyway.

 Marek

 I think Christian is thinking about new hw  cayman where we can skip
 register checking because of vm and hardware register checking (the hw
 CP checks that register in the user IB is not one of the privilege
 register and block write and throw irq if so). On this kind of hw you
 can have cmd stream in bo and don't do the map/unmap.


 Yes indeed, and my plan is to avoid the copying by referencing the state
 directly with indirect buffer commands. That should also make thinks like
 queries and predicated rendering a bit more simpler (think of PM4 subroutine
 calls).

 The problem on SI is that for embedded data and const IBs you need to patch
 up the buffer quite a bit after it is written (at least if I understand them

If you need to patch up the buffer after it is written, i.e. it can't
be immutable, you probably shouldn't use any buffer at all and just
write the commands into the CS directly. How are you gonna update the
buffer if it's busy?

Unless you mean that some state depends on some other state and must
be recomputed, which is another aspect of radeonsi which is pretty
badly designed. Don't recompute states, just create multiple command
buffers covering all possible combinations of states you can ever get
with a single gallium CSO and switch between them at draw time. E.g.
when some external state changes from I to J, switch from
blend_state-variant[I] to blend_state-variant[J] (by just emitting
variant[J]), where variant is an array of immutable command buffers.
Same for pipe_surface (except that pipe_surface should initialize the
variants on demand).

 correctly). But Marek is quite right that this only counts for state objects
 and makes no sense for set_* and draw_* calls (and I'm currently thinking
 how to avoid that and can't come up with a proper solution). Anyway it's
 definitely not an urgent problem for radeonsi.

It will be a problem once we actually start caring about performance
and, most importantly, the CPU overhead of the driver.


 I still think that writing into the command buffers directly (e.g. without
 wrapper functions) is a bad idea, cause that lead to mixing driver logic and

I'm convinced the exact opposite is a bad idea, because it adds
another layer all commands must go through. A layer which brings no
advantage. Think about apps which issue 1k-10k draw calls per frame.
It's obvious that every byte moved around counts and the key to high
framerate is to do (almost) nothing in the driver. It looks like the
idea here is to make the driver as slow as possible.

 packet building in r600g. For example just try to figure out how the
 relocation in NOPs work by reading the source (please keep in mind that one
 of the primary goals why AMD is supporting this driver is 

Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Christian König

Am 26.03.2013 20:28, schrieb Marek Olšák:

On Tue, Mar 26, 2013 at 6:44 PM, Christian König
deathsim...@vodafone.de wrote:

Am 26.03.2013 18:02, schrieb Jerome Glisse:


On Tue, Mar 26, 2013 at 12:40 PM, Marek Olšák mar...@gmail.com wrote:

On Tue, Mar 26, 2013 at 3:59 PM, Christian König
deathsim...@vodafone.de wrote:

Am 26.03.2013 15:34, schrieb Marek Olšák:


Speaking of si_pm4_state, I think it's a horrible mechanism for
anything other than constant state objects (create/bind/delete
functions). For everything else (set/draw functions), you want to emit
directly into the command stream. It's not so different from the bad
state management which r600g used to have (which is now gone). If you
have to call malloc or calloc in a set_* or draw_* function, you're
doing it wrong. Are there plans to change it to something more
efficient (e.g. how r300g and r600g emit non-CSO states right now), or
will it be like this forever?


Actually I hoped that r600g sooner or later moves into the same
direction
some more. The fact that we currently need to malloc every buffer indeed
sucks badly, but that is still better than mixing packet generation with
driver logic.

I don't understand the last sentence. What mixing? The set_* and
draw_* commands are supposed to be executed immediately, therefore
it's reasonable and preferable to write to the CS directly. Having any
intermediate storage for commands is a waste of time and space.

I agree here, i don't think uncached bo for command stream on new hw
would bring huge perf increase, probably will just be noise.


Also I don't think that emitting directly into the command stream is
such a
good idea, we sooner or later want that buffer to be a buffer allocated
in
GART memory. And under this condition it is better to build up the
commands
in a (heavily cached) system memory and then memcpy then to the
destination
buffer.

AFAIK, GART memory is cached on non-AGP systems, but even uncached
access shouldn't be a big issue, because the access pattern is
sequential and write-only. BTW, I have talked about emitting commands
into a buffer object with Dave and he thinks it's a bad idea due to
the map and unmap overhead. Also, we have to disallow writing to
certain unsafe registers anyway.

Marek

I think Christian is thinking about new hw  cayman where we can skip
register checking because of vm and hardware register checking (the hw
CP checks that register in the user IB is not one of the privilege
register and block write and throw irq if so). On this kind of hw you
can have cmd stream in bo and don't do the map/unmap.


Yes indeed, and my plan is to avoid the copying by referencing the state
directly with indirect buffer commands. That should also make thinks like
queries and predicated rendering a bit more simpler (think of PM4 subroutine
calls).

The problem on SI is that for embedded data and const IBs you need to patch
up the buffer quite a bit after it is written (at least if I understand them

If you need to patch up the buffer after it is written, i.e. it can't
be immutable, you probably shouldn't use any buffer at all and just
write the commands into the CS directly. How are you gonna update the
buffer if it's busy?

Unless you mean that some state depends on some other state and must
be recomputed, which is another aspect of radeonsi which is pretty
badly designed. Don't recompute states, just create multiple command
buffers covering all possible combinations of states you can ever get
with a single gallium CSO and switch between them at draw time. E.g.
when some external state changes from I to J, switch from
blend_state-variant[I] to blend_state-variant[J] (by just emitting
variant[J]), where variant is an array of immutable command buffers.
Same for pipe_surface (except that pipe_surface should initialize the
variants on demand).


Yeah, I know. I only have two hands, where should do you suggest I 
should start?



correctly). But Marek is quite right that this only counts for state objects
and makes no sense for set_* and draw_* calls (and I'm currently thinking
how to avoid that and can't come up with a proper solution). Anyway it's
definitely not an urgent problem for radeonsi.

It will be a problem once we actually start caring about performance
and, most importantly, the CPU overhead of the driver.


I still think that writing into the command buffers directly (e.g. without
wrapper functions) is a bad idea, cause that lead to mixing driver logic and

I'm convinced the exact opposite is a bad idea, because it adds
another layer all commands must go through. A layer which brings no
advantage. Think about apps which issue 1k-10k draw calls per frame.
It's obvious that every byte moved around counts and the key to high
framerate is to do (almost) nothing in the driver. It looks like the
idea here is to make the driver as slow as possible.


packet building in r600g. For example just try to figure out how the
relocation in NOPs work by reading the source (please keep 

Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Dave Airlie

 correctly). But Marek is quite right that this only counts for state
 objects
 and makes no sense for set_* and draw_* calls (and I'm currently thinking
 how to avoid that and can't come up with a proper solution). Anyway it's
 definitely not an urgent problem for radeonsi.

 It will be a problem once we actually start caring about performance
 and, most importantly, the CPU overhead of the driver.

 I still think that writing into the command buffers directly (e.g.
 without
 wrapper functions) is a bad idea, cause that lead to mixing driver logic
 and

 I'm convinced the exact opposite is a bad idea, because it adds
 another layer all commands must go through. A layer which brings no
 advantage. Think about apps which issue 1k-10k draw calls per frame.
 It's obvious that every byte moved around counts and the key to high
 framerate is to do (almost) nothing in the driver. It looks like the
 idea here is to make the driver as slow as possible.

 packet building in r600g. For example just try to figure out how the
 relocation in NOPs work by reading the source (please keep in mind that
 one
 of the primary goals why AMD is supporting this driver is to give a good
 example code for customers who want to implement that stuff on their own
 systems).

 I'm shocked. Sacrificing performance in the name of making the code
 nicer for some customers? Seriously? I thought the plan was to make
 the best graphics driver ever.


 Well, maybe I'm repeating myself: Performance is not a priority, it's only
 nice to have!

 Sorry to say so, but if we sacrifice a bit of performance for more code
 readability than that is perfectly ok with me (Don't understand me wrong I
 would really prefer to replace the closed source driver today than tomorrow,
 it's unfortunately just not what I'm paid for).

 On the other hand, we are talking about perfectly optimizeable inline
 functions and/or macros. All I'm saying is that we should structurize the
 code a bit more.

Its okay to take steps in the right direction, but if you start taking
steps that away
from performance in lieu of code readability then please be prepared
to deal with
objections.

The thing is in a lot of cases, code readability is in the eye of the
beholder, I'm sure
Jerome though r600g was perfectly readable when he wrote it, but a lot
of us didn't
and spent a lot of time trying to remove the CPU overheads, not least
the amount of
time Marek spent. The thing is performance is measureable, code
readability isn't.

Dave.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Dave Airlie
On 27 Mar 2013 08:45, Dave Airlie airl...@gmail.com wrote:

 
  correctly). But Marek is quite right that this only counts for state
  objects
  and makes no sense for set_* and draw_* calls (and I'm currently thinking
  how to avoid that and can't come up with a proper solution). Anyway it's
  definitely not an urgent problem for radeonsi.
 
  It will be a problem once we actually start caring about performance
  and, most importantly, the CPU overhead of the driver.
 
  I still think that writing into the command buffers directly (e.g.
  without
  wrapper functions) is a bad idea, cause that lead to mixing driver logic
  and
 
  I'm convinced the exact opposite is a bad idea, because it adds
  another layer all commands must go through. A layer which brings no
  advantage. Think about apps which issue 1k-10k draw calls per frame.
  It's obvious that every byte moved around counts and the key to high
  framerate is to do (almost) nothing in the driver. It looks like the
  idea here is to make the driver as slow as possible.
 
  packet building in r600g. For example just try to figure out how the
  relocation in NOPs work by reading the source (please keep in mind that
  one
  of the primary goals why AMD is supporting this driver is to give a good
  example code for customers who want to implement that stuff on their own
  systems).
 
  I'm shocked. Sacrificing performance in the name of making the code
  nicer for some customers? Seriously? I thought the plan was to make
  the best graphics driver ever.
 
 
  Well, maybe I'm repeating myself: Performance is not a priority, it's only
  nice to have!
 
  Sorry to say so, but if we sacrifice a bit of performance for more code
  readability than that is perfectly ok with me (Don't understand me wrong I
  would really prefer to replace the closed source driver today than tomorrow,
  it's unfortunately just not what I'm paid for).
 
  On the other hand, we are talking about perfectly optimizeable inline
  functions and/or macros. All I'm saying is that we should structurize the
  code a bit more.

 Its okay to take steps in the right direction, but if you start taking
 steps that away
 from performance in lieu of code readability then please be prepared
 to deal with
 objections.

 The thing is in a lot of cases, code readability is in the eye of the
 beholder, I'm sure
 Jerome though r600g was perfectly readable when he wrote it, but a lot
 of us didn't
 and spent a lot of time trying to remove the CPU overheads, not least
 the amount of
 time Marek spent. The thing is performance is measureable, code
 readability isn't.

I also realised there is a bit of irony in using llvm if code
readability is a prime goal :-P, I'm sure .td is readable to some, but
it takes a fair bit of learning.

Dave.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH] radeonsi: add cs tracing v2

2013-03-26 Thread Jerome Glisse
On Tue, Mar 26, 2013 at 6:45 PM, Dave Airlie airl...@gmail.com wrote:

 correctly). But Marek is quite right that this only counts for state
 objects
 and makes no sense for set_* and draw_* calls (and I'm currently thinking
 how to avoid that and can't come up with a proper solution). Anyway it's
 definitely not an urgent problem for radeonsi.

 It will be a problem once we actually start caring about performance
 and, most importantly, the CPU overhead of the driver.

 I still think that writing into the command buffers directly (e.g.
 without
 wrapper functions) is a bad idea, cause that lead to mixing driver logic
 and

 I'm convinced the exact opposite is a bad idea, because it adds
 another layer all commands must go through. A layer which brings no
 advantage. Think about apps which issue 1k-10k draw calls per frame.
 It's obvious that every byte moved around counts and the key to high
 framerate is to do (almost) nothing in the driver. It looks like the
 idea here is to make the driver as slow as possible.

 packet building in r600g. For example just try to figure out how the
 relocation in NOPs work by reading the source (please keep in mind that
 one
 of the primary goals why AMD is supporting this driver is to give a good
 example code for customers who want to implement that stuff on their own
 systems).

 I'm shocked. Sacrificing performance in the name of making the code
 nicer for some customers? Seriously? I thought the plan was to make
 the best graphics driver ever.


 Well, maybe I'm repeating myself: Performance is not a priority, it's only
 nice to have!

 Sorry to say so, but if we sacrifice a bit of performance for more code
 readability than that is perfectly ok with me (Don't understand me wrong I
 would really prefer to replace the closed source driver today than tomorrow,
 it's unfortunately just not what I'm paid for).

 On the other hand, we are talking about perfectly optimizeable inline
 functions and/or macros. All I'm saying is that we should structurize the
 code a bit more.

 Its okay to take steps in the right direction, but if you start taking
 steps that away
 from performance in lieu of code readability then please be prepared
 to deal with
 objections.

 The thing is in a lot of cases, code readability is in the eye of the
 beholder, I'm sure
 Jerome though r600g was perfectly readable when he wrote it, but a lot
 of us didn't
 and spent a lot of time trying to remove the CPU overheads, not least
 the amount of
 time Marek spent. The thing is performance is measureable, code
 readability isn't.

 Dave.

Maybe once again you forgot why i did things the way i did them, i
explained myself to you back then, i designed r600g for a new kernel
api which was violently different from the cs one, my hope was that
the other kernel api would be better, it was not and i never pushed
more on that front. So r600g design was definitely not adapted to the
cs ioctl and not thinked for it. History often explain a lot of things
and people seems to forget about them.

That being said, i too find ironic the code readability argument, if
one understand the cs ioctl then the r600g code as it's nowadays make
sense, but the radeonsi code is closer to what r600g use to be. So
assuming same ioctl i would say that radeonsi should move towards what
r600g is nowadays.

Anyway just wanted to set history straight.

Cheers,
Jerome
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev