Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-08 Thread Chris Wilson
On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote:
 Hi Chris,
 
 I made a genuine effort to review this patch, hoping to better understand
 the various changes and what you were trying to accomplish.  I spent many
 hours reading and trying to enumerate changes - or potential changes I
 needed to look hard at to convince myself whether they were correct.
 
 I came up with a frighteningly long list of changes:

* Locking completely overhauled.
 
 * Relocation handling changes considerably (the original point of
   Kristian's endeavour which led up to this).

It's been a long time since 2011.
 
 * Fencing, busy tracking, and sync objects are completely reworked.

sync obje
That's a slight overstatement.
 
 * Render-to-texture cache flushing and dirty buffer tracking is
   completely reworked.

And immensely simplified. A hash table? Did you notice that the current
dirty tracking misses where the same buffer is rewritten?
 
 * Gen7 SOL buffer offset resetting now uses MI_LOAD_REGISTER_IMM rather
   than the execbuf2 parameter, requiring the command validator on Haswell.
   This effectively bumps the kernel requirement from v3.6 to v4.2-rc1,
   which will simply not fly with distributions at this time.

My bad, completely missed that there was EXT_transform_feedback that only
did the subset. I wondered why you had code to the pipelined loads/save
but then still flushed the batch every time.
 
 * glBufferSubData() now uses intel_upload_data() rather than allocating
   a temporary BO.  This is the first use of the upload buffer by the
   BLT engine, and could imply that the upload buffer's lifetime now
   extends across batches - longer than before.  Separable change that
   requires separate evaluation and justification.

Yes and no, it is easy to demonstrate that the buffer's lifetime is
longer than a batch current.
 
 * Per buffer cache-coherency checking rather than brw-has_llc?

We've pointed out the bugs in the current usage of brw-has_llc before.
 
 * glBufferSubData()'s prefer_stall_to_blit flag appears to depend on
   per-buffer cache-coherency rather than being set globally.  Could
   impact performance of buffer uploads.
 
 * Potential missing flushes (which can cause hangs or misrendering):
 
   - It looks like calling brw_bo_busy() with BUSY_FLUSH causes a flush
 when necessary.  However, some instances of the old bo_busy,
 bo_references, batch_flush pattern are replaced without that flag.
 One occurrance was in BufferSubData(); I did not spend time to
 check every case.

The change is still accurate.

   - Flushes are often done implicitly by e.g. brw_bo_read calling
 brw_bo_map with the appropriate flags, and many explicit checks
 and flushes are removed.  Not bad, but needs careful review.

   - Gen6+ query object code might have dropped an implicit flush
 guaranteeing that when the GL application requests the result,
 any pending work will be kicked off so they can poll/spin
 repeatedly until the result arrives.

The flush is not missing.
 
   - New code to avoid redundant flushes.
 
 * perf_debug() warnings are removed all over the code for some reason:

The problem is perf_debug() was a layering violation (my goal was to
keep brw_batch a mini-library, either to extract it back to libdrm or
for reuse). I have an idea of using a macro like

#define PERF_DEBUG_LOC(brw, str) ((brw-perf_debug ? (struct loc){__FILE__,
__FUNCTION__, __LINE__, str)} : NULL))

brw_bo_map(..., PERF_DEBUG_LOC(brw));

Then pass back to a brw_context callback if a function stalls.

   - Unsynchronized maps/BufferSubData not working on !LLC platforms?
 If they work now, that's a huge change!  If not, why drop the warning?

They have always worked on !llc. The same caveats with using multiple
aliasing to the same physical address through different modes of write
caching apply to both llc and !llc.
 
   - Warnings about stalls on mapping buffers and miptrees are gone now.
 These have been useful in tracking down performance problems.  They
 might not always be accurate, but surely removing them should be done
 separately with justification?

See issues with calling perf_debug.
 
   - Warnings about stalls on query objects are gone.  I've used these when
 analyzing application performance.  Why?

See issues with calling perf_debug.
 
   - Warnings about implicit flushes are gone.

See issues with calling perf_debug.

It was at the back of my mind that I had to fix perf_debug - if I had
put in the list of issues, I would probably would have remembered to fix
it before posting.
 
 * BO unmap calls appears to be missing in some places.  A few map calls
   have moved around in hard-to-follow ways.  Unclear how lifetimes of
   buffers and lifetimes of maps are affected.

Bo unmap is a figment of your imagination. libdrm never unmaps a bo until
it is closed (simply to avoid the cost of refaulting the objects -
though that does come at some risk of address space 

Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-08 Thread Chris Wilson
On Wed, Jul 08, 2015 at 09:51:07AM +0100, Chris Wilson wrote:
 On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote:
  * Gen4-5 structure changes.

Did you mean brw_structs.h?

diff --git a/src/mesa/drivers/dri/i965/brw_structs.h 
b/src/mesa/drivers/dri/i965/brw_structs.h
index 55338c0..e167254 100644
--- a/src/mesa/drivers/dri/i965/brw_structs.h
+++ b/src/mesa/drivers/dri/i965/brw_structs.h
@@ -391,13 +391,16 @@ struct brw_sf_unit_state
   unsigned pad3:1;
} thread4;
 
-   struct
+   union
{
+  struct {
 unsigned front_winding:1;
 unsigned viewport_transform:1;
 unsigned pad0:3;
 unsigned sf_viewport_state_offset:27; /* Offset from 
GENERAL_STATE_BASE */
   } sf5;
+  uint32_t dw5;
+   };
 
struct
{
@@ -525,15 +528,17 @@ struct brw_wm_unit_state
struct thread2 thread2;
struct thread3 thread3;
 
+   union {
   struct {
 unsigned stats_enable:1;
 unsigned depth_buffer_clear:1;
 unsigned sampler_count:3;
 unsigned sampler_state_pointer:27;
   } wm4;
+  uint32_t dw4;
+   };
 
-   struct
-   {
+   struct {
   unsigned enable_8_pix:1;
   unsigned enable_16_pix:1;
   unsigned enable_32_pix:1;
diff --git a/src/mesa/drivers/dri/i965/brw_urb.c 
b/src/mesa/drivers/dri/i965/brw_urb.c

Or something else?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-08 Thread Chris Wilson
On Wed, Jul 08, 2015 at 10:49:24AM -0700, Kenneth Graunke wrote:
 On Wednesday, July 08, 2015 03:17:35 PM Chris Wilson wrote:
  On Wed, Jul 08, 2015 at 09:51:07AM +0100, Chris Wilson wrote:
   On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote:
* Gen4-5 structure changes.
  
  Did you mean brw_structs.h?
  
  diff --git a/src/mesa/drivers/dri/i965/brw_structs.h 
  b/src/mesa/drivers/dri/i965/brw_structs.h
  index 55338c0..e167254 100644
  --- a/src/mesa/drivers/dri/i965/brw_structs.h
  +++ b/src/mesa/drivers/dri/i965/brw_structs.h
  @@ -391,13 +391,16 @@ struct brw_sf_unit_state
 unsigned pad3:1;
  } thread4;
   
  -   struct
  +   union
  {
  +  struct {
   unsigned front_winding:1;
   unsigned viewport_transform:1;
   unsigned pad0:3;
   unsigned sf_viewport_state_offset:27; /* Offset from 
  GENERAL_STATE_BASE */
 } sf5;
  +  uint32_t dw5;
  +   };
   
  struct
  {
  @@ -525,15 +528,17 @@ struct brw_wm_unit_state
  struct thread2 thread2;
  struct thread3 thread3;
   
  +   union {
 struct {
   unsigned stats_enable:1;
   unsigned depth_buffer_clear:1;
   unsigned sampler_count:3;
   unsigned sampler_state_pointer:27;
 } wm4;
  +  uint32_t dw4;
  +   };
   
  -   struct
  -   {
  +   struct {
 unsigned enable_8_pix:1;
 unsigned enable_16_pix:1;
 unsigned enable_32_pix:1;
  diff --git a/src/mesa/drivers/dri/i965/brw_urb.c 
  b/src/mesa/drivers/dri/i965/brw_urb.c
  
  Or something else?
  -Chris
 
 Yes - and the changes are fine.  But...separable.

Would you like a patch to provide dword alternates for all the structs,
or just the ones used for relocation processing?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-08 Thread Kenneth Graunke
On Wednesday, July 08, 2015 03:17:35 PM Chris Wilson wrote:
 On Wed, Jul 08, 2015 at 09:51:07AM +0100, Chris Wilson wrote:
  On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote:
   * Gen4-5 structure changes.
 
 Did you mean brw_structs.h?
 
 diff --git a/src/mesa/drivers/dri/i965/brw_structs.h 
 b/src/mesa/drivers/dri/i965/brw_structs.h
 index 55338c0..e167254 100644
 --- a/src/mesa/drivers/dri/i965/brw_structs.h
 +++ b/src/mesa/drivers/dri/i965/brw_structs.h
 @@ -391,13 +391,16 @@ struct brw_sf_unit_state
unsigned pad3:1;
 } thread4;
  
 -   struct
 +   union
 {
 +  struct {
  unsigned front_winding:1;
  unsigned viewport_transform:1;
  unsigned pad0:3;
  unsigned sf_viewport_state_offset:27; /* Offset from 
 GENERAL_STATE_BASE */
} sf5;
 +  uint32_t dw5;
 +   };
  
 struct
 {
 @@ -525,15 +528,17 @@ struct brw_wm_unit_state
 struct thread2 thread2;
 struct thread3 thread3;
  
 +   union {
struct {
  unsigned stats_enable:1;
  unsigned depth_buffer_clear:1;
  unsigned sampler_count:3;
  unsigned sampler_state_pointer:27;
} wm4;
 +  uint32_t dw4;
 +   };
  
 -   struct
 -   {
 +   struct {
unsigned enable_8_pix:1;
unsigned enable_16_pix:1;
unsigned enable_32_pix:1;
 diff --git a/src/mesa/drivers/dri/i965/brw_urb.c 
 b/src/mesa/drivers/dri/i965/brw_urb.c
 
 Or something else?
 -Chris

Yes - and the changes are fine.  But...separable.


signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-07 Thread Chris Wilson
On Tue, Jul 07, 2015 at 01:14:53PM +0300, Abdiel Janulgue wrote:
 On 07/06/2015 01:33 PM, Chris Wilson wrote:
  @@ -600,7 +593,10 @@ brw_emit_null_surface_state(struct brw_context *brw,
1  BRW_SURFACE_WRITEDISABLE_B_SHIFT |
1  BRW_SURFACE_WRITEDISABLE_A_SHIFT);
  }
  -   surf[1] = bo ? bo-offset64 : 0;
  +   surf[1] = brw_batch_reloc(brw-batch, *out_offset + 4,
  + bo, 0,
  + I915_GEM_DOMAIN_RENDER,
  + I915_GEM_DOMAIN_RENDER);
 
 null check for bo?

I put the NULL check into the inline variant of brw_batch_reloc() for a
bit of syntatic sugar for these cases.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-07 Thread Abdiel Janulgue


On 07/06/2015 01:33 PM, Chris Wilson wrote:

 +/*
 + * Add a relocation entry for the target buffer into the current batch.
 + *
 + * This is the heart of performing fast relocations, both here and in
 + * the corresponding kernel relocation routines.
 + *
 + * - Instead of passing in handles for the kernel convert back into
 + *   the buffer for every relocation, we tell the kernel which
 + *   execobject slot corresponds with the relocation. The kernel is
 + *   able to use a simple LUT constructed as it first looks up each buffer
 + *   for the batch rather than search a small, overfull hashtable. As both
 + *   the number of relocations and buffers in a batch grow, the simple
 + *   LUT is much more efficient (though the LUT itself is less cache
 + *   friendly).
 + *   However, as the batch buffer is by definition the last object in
 + *   the execbuffer array we have to perform a pass to relabel the
 + *   target of all relocations pointing to the batch. (Except when
 + *   the kernel supports batch-first, in which case we can do the relocation
 + *   target processing for the batch inline.)
 + *
 + * - If the kernel has not moved the buffer, it will still be in the same
 + *   location as last time we used it. If we tell the kernel that all the
 + *   relocation entries are the same as the offset for the buffer, then
 + *   the kernel need only check that all the buffers are still in the same
 + *   location and then skip performing relocations entirely. A huge win.
 + *
 + * - As a consequence of telling the kernel to skip processing the 
 relocations,
 + *   we need to tell the kernel about the read/write domains and special 
 needs
 + *   of the buffers.
 + *
 + * - Alternatively, we can request the kernel place the buffer exactly
 + *   where we want it and forgo all relocations to that buffer entirely.
 + *   The buffer is effectively pinned for its lifetime (if the kernel
 + *   does have to move it, for example to swap it out to recover memory,
 + *   the kernel will return it back to our requested location at the start
 + *   of the next batch.) This of course imposes a lot of constraints on where
 + *   we can say the buffers are, they must meet all the alignment constraints
 + *   and not overlap.
 + *
 + * - Essential to all these techniques is that we always use the same
 + *   presumed_offset for the relocations as for submitting the execobject.
 + *   That value must be written into the batch and it must match the value
 + *   we tell the kernel. (This breaks down when using relocation tries shared
 + *   between multiple contexts, hence the need for context-local batch
 + *   management.)
 + *
 + * In contrast to libdrm, we can build the execbuffer array along with
 + * the batch by forgoing the ability to handle general relocation trees.
 + * This avoids having multiple passes to build the execbuffer parameter,
 + * and also gives us a means to cheaply track when a buffer has been
 + * referenced by the batch.
 + */
 +uint64_t __brw_batch_reloc(struct brw_batch *batch,
 +   uint32_t batch_offset,
 +   struct brw_bo *target_bo,
 +   uint64_t target_offset,
 +   unsigned read_domains,
 +   unsigned write_domain)
 +{
 +   assert(target_bo-refcnt);
 +   if (unlikely(target_bo-batch != batch)) {
 +  /* XXX legal sharing between contexts/threads? */
 +  target_bo = brw_bo_import(batch, target_bo-base, true);
 +  if (unlikely(target_bo == NULL))
 + longjmp(batch-jmpbuf, -ENOMEM);
 +  target_bo-refcnt--; /* kept alive by the implicit active reference */
 +   }
 +   assert(target_bo-batch == batch);
 +
 +   if (target_bo-exec == NULL) {
 +  int n;
 +
 +  /* reserve one exec entry for the batch */
 +  if (unlikely(batch-emit.nexec + 1 == batch-exec_size))
 + __brw_batch_grow_exec(batch);
 +
 +  n = batch-emit.nexec++;
 +  target_bo-target_handle = has_lut(batch) ? n : target_bo-handle;
 +  target_bo-exec = memset(batch-exec + n, 0, sizeof(*target_bo-exec));
 +  target_bo-exec-handle = target_bo-handle;
 +  target_bo-exec-alignment = target_bo-alignment;
 +  target_bo-exec-offset = target_bo-offset;
 +  if (target_bo-pinned)
 + target_bo-exec-flags = EXEC_OBJECT_PINNED;
 +
 +  /* Track the total amount of memory in use by all active requests */
 +  if (target_bo-read.rq == NULL) {
 +   batch-rss += target_bo-size;
 +   if (batch-rss  batch-peak_rss)
 +   batch-peak_rss = batch-rss;
 +  }
 +  target_bo-read.rq = batch-next_request;
 +  list_move_tail(target_bo-read.link, batch-next_request-read);
 +
 +  batch-aperture += target_bo-size;
 +   }
 +
 +   if (!target_bo-pinned) {
 +  int n;
 +
 +  if (unlikely(batch-emit.nreloc == batch-reloc_size))
 + __brw_batch_grow_reloc(batch);
 +
 +  n = batch-emit.nreloc++;
 +  

Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-07 Thread Abdiel Janulgue


On 07/07/2015 01:19 PM, Chris Wilson wrote:
 On Tue, Jul 07, 2015 at 01:14:53PM +0300, Abdiel Janulgue wrote:
 On 07/06/2015 01:33 PM, Chris Wilson wrote:
 @@ -600,7 +593,10 @@ brw_emit_null_surface_state(struct brw_context *brw,
   1  BRW_SURFACE_WRITEDISABLE_B_SHIFT |
   1  BRW_SURFACE_WRITEDISABLE_A_SHIFT);
 }
 -   surf[1] = bo ? bo-offset64 : 0;
 +   surf[1] = brw_batch_reloc(brw-batch, *out_offset + 4,
 + bo, 0,
 + I915_GEM_DOMAIN_RENDER,
 + I915_GEM_DOMAIN_RENDER);

 null check for bo?
 
 I put the NULL check into the inline variant of brw_batch_reloc() for a
 bit of syntatic sugar for these cases.
 -Chris
 

You're, right. I failed to notice the in-line variant.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-07 Thread Kenneth Graunke
Hi Chris,

I made a genuine effort to review this patch, hoping to better understand
the various changes and what you were trying to accomplish.  I spent many
hours reading and trying to enumerate changes - or potential changes I
needed to look hard at to convince myself whether they were correct.

I came up with a frighteningly long list of changes:

* Relocation handling changes considerably (the original point of
  Kristian's endeavour which led up to this).

* Fencing, busy tracking, and sync objects are completely reworked.

* Render-to-texture cache flushing and dirty buffer tracking is
  completely reworked.

* Gen7 SOL buffer offset resetting now uses MI_LOAD_REGISTER_IMM rather
  than the execbuf2 parameter, requiring the command validator on Haswell.
  This effectively bumps the kernel requirement from v3.6 to v4.2-rc1,
  which will simply not fly with distributions at this time.

* glBufferSubData() now uses intel_upload_data() rather than allocating
  a temporary BO.  This is the first use of the upload buffer by the
  BLT engine, and could imply that the upload buffer's lifetime now
  extends across batches - longer than before.  Separable change that
  requires separate evaluation and justification.

* Per buffer cache-coherency checking rather than brw-has_llc?

* glBufferSubData()'s prefer_stall_to_blit flag appears to depend on
  per-buffer cache-coherency rather than being set globally.  Could
  impact performance of buffer uploads.

* Potential missing flushes (which can cause hangs or misrendering):

  - It looks like calling brw_bo_busy() with BUSY_FLUSH causes a flush
when necessary.  However, some instances of the old bo_busy,
bo_references, batch_flush pattern are replaced without that flag.
One occurrance was in BufferSubData(); I did not spend time to
check every case.

  - Flushes are often done implicitly by e.g. brw_bo_read calling
brw_bo_map with the appropriate flags, and many explicit checks
and flushes are removed.  Not bad, but needs careful review.

  - Gen6+ query object code might have dropped an implicit flush
guaranteeing that when the GL application requests the result,
any pending work will be kicked off so they can poll/spin
repeatedly until the result arrives.

  - New code to avoid redundant flushes.

* perf_debug() warnings are removed all over the code for some reason:

  - Unsynchronized maps/BufferSubData not working on !LLC platforms?
If they work now, that's a huge change!  If not, why drop the warning?

  - Warnings about stalls on mapping buffers and miptrees are gone now.
These have been useful in tracking down performance problems.  They
might not always be accurate, but surely removing them should be done
separately with justification?

  - Warnings about stalls on query objects are gone.  I've used these when
analyzing application performance.  Why?

  - Warnings about implicit flushes are gone.

* BO unmap calls appears to be missing in some places.  A few map calls
  have moved around in hard-to-follow ways.  Unclear how lifetimes of
  buffers and lifetimes of maps are affected.

* Possible mmap vs. pwrite preference changes?  Hard to follow.

* Texture upload (tiled_memcpy) changes, which is notoriously fragile
  and can lose all of the performance benefit if the compiler isn't able
  to optimize it just right.  Ideally separate.

* Assertions change to GL errors in brw_get_graphics_reset_status().

* Aperture space checking significantly reworked, especially for the BLT
  paths.  Honestly, a lot nicer, but couldn't this be separated?

* The bo_reuse driconf option is removed.

* Gen4-5 structure changes.

* brw_get_timestamp() - removes initialization of result to 0.
  Probably unnecessary and OK to delete; should be separate.

* New helper functions and coding patterns.  Separable.

* Noise (renaming, moving code between files, some other trivial changes
  like removing 'brw' variables and moving code into else blocks).

* ...I probably missed some things.

Based upon this, I cannot in good conscience consider merging this patch.
The potential for breakage is staggering.  As a proof-of-concept, you've
done an excellent job in proving we can do much better, and introduced a
lot of good ideas.  But there's a lot of work left to be done before we
can consider applying it to our production quality driver.

Please advise whether you would like to work towards making a mergeable,
incremental patch series, or if someone else should embark on that
endeavour.

--Ken


signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-06 Thread Chris Wilson
Just skipping to interesting comments for the moment.

On Mon, Jul 06, 2015 at 12:34:00PM -0700, Kenneth Graunke wrote:
 While I really like this idea in principle, the current patch is rather
 huge, making it difficult to review; bisectability would also suffer.
 Would it be possible to split it up into several smaller patches?

First time I tried I ended up with a series of patches of equal churn
and I wasn't happy with the results. Honestly, I find mechanical noise
in patches easy to tune out, you quickly recognise the pattern and
deviations should stand out.

Some of the churn I think is important as it highlight the relative
emphasis on different coding patterns. In particular, how the relocation
must always emit the address in the batch, so in quite a few functions
the code had to be rearranged in order to make that happen
 
 One idea I had for how you might accomplish that is to introduce the
 brw_bo struct and related functions, but make it simply contain a
 drm_intel_bo * field, and fall back to the existing libdrm-based code.

Sure, what I had in mind was a changeover to a struct brw_bo that just
wrapped drm_intel_bo for typesafefty (so the compiler was there to catch
mistakes and missing pieces). I felt the patch was still about 500 lines
of mechcanical changes, 1000 lines removing the intel_batchbuffer.c and
2000 lines adding brw_bo (it has grown somewhat with the comments!). The
api to review was in one place, and the bits that required fine changes
(mostly intel_bufferobj.c) were equally obvious.

But what you really need is a slow progression in feature development of
brw_batch. Splitting the mechanical changes off with a touch more churn,
and even converting functions over to new API in stages is relatively
trivial compared to building up the feature set required to do
batchbuffer construction from scratch.

- relocations come first, as request tracking cannot happen with
integration into the relocations.
- execbuf
- request tracking
- mmap functions
- exporting busyness

Honestly though the design has never existed in a piecemeal fashion, it
has always been requests first and that drove the infrastructure
required to enable the request tracking.

Anyway splitting out the mechanical changes is worth it if it gets more
eyes onto brw_batch.c itself and some of its knock-on effects.

 That way, you could put all the mechanical renaming and refactoring
 across the entire code-base in one patch that should have no functional
 change.  You could then replace the contents or brw_bo and those
 functions with your new improved batch manager.
 
 We've talked about moving away from libdrm for a while, so having our
 own functions and structures makes a lot of sense.
 
 I'm also curious about the performance on non-LLC platforms (CHV or
 BYT)?  You've dropped a number of non-LLC paths - which is probably
 fine, honestly...they were always of dubious value - but it'd be nice
 to know the record the effects of this change on non-LLC in the commit
 message.

Ah, I didn't actually remove non-LLC paths, I added equivalent detiled
CPU access for non-LLC (strictly non cache coherent bo, incl scanouts) as
for LLC. So I guess you mean the few places that lost the brw-has_llc
differentiation?
 
  @@ -279,11 +254,6 @@ retry:
  brw-ctx.NewDriverState = ~0ull;
  brw-no_depth_or_stencil = false;
  brw-ib.type = -1;
  -
  -   /* Flush the sampler cache so any texturing from the destination is
  -* coherent.
  -*/
  -   brw_emit_mi_flush(brw);
 
 It looks like you moved this flush earlier, so it's in the section of
 code that retries?  That's probably reasonable.  Seems worth keeping the
 comment, and this could be done as a separate patch...

True. The reason why it has to be earlier is that I am much stricter on
having all BEGIN_BATCH/ADVANCE_BATCH chunks being inside a
brw_batch_begin/brw_batch_end - since brw_emit_mi_flush() itself doesn't
start a new brw_batch_begin section, it had to go in blorp's. The comment
was removed because we now have accurate dirty texture cache tracking and
that itself is not the reason wy flush has to exist. (The flush is for a
change in hiz mode iirc, or some similar change in GPU state that
requires a flush between primitives.)
 
 [snip]
  +static void
  +load_sized_register_mem(struct brw_context *brw,
  +uint32_t reg,
  +struct brw_bo *bo,
  +uint32_t read_domains, uint32_t write_domain,
  +uint32_t offset,
  +int size)
  +{
  +   int i;
  +
  +   /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */
  +   assert(brw-gen = 7);
  +
  +   if (brw-gen = 8) {
  +  BEGIN_BATCH(4 * size);
  +  for (i = 0; i  size; i++) {
  + OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2));
  + OUT_BATCH(reg + i * 4);
  + OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4);
  +  }
  +  ADVANCE_BATCH();
  +   } else {
  +  

Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager

2015-07-06 Thread Kenneth Graunke
On Monday, July 06, 2015 11:33:09 AM Chris Wilson wrote:
 When submitting commands to the GPU every cycle of latency counts;
 mutexes, spinlocks, even atomics quickly add to substantial overhead.
 
 This batch manager acts as thread-local shim over the buffer manager
 (drm_intel_bufmgr_gem). As we are only ever used from within a single
 context, we can rely on the upper layers providing thread safety.
 This allows us to import buffers from the shared screen (sharing buffers
 between multiple contexts, threads and users) and wrap that handle in
 our own. Similarly, we want to share the buffer cache between all
 users on the file and so allocate from the global threadsafe buffer
 manager, with a very small and transient local cache of active buffers.
 
 The batch manager provides a cheap way of busyness tracking and very
 efficient batch construction and kernel submission.
 
 The restrictions over and above the generic submission engine in
 intel_bufmgr_gem are:
  - not thread-safe
  - flat relocations, only the batch buffer itself carries
relocations. Relocations relative to auxiliary buffers
must be performed via STATE_BASE
  - direct mapping of the batch for writes, expect reads
from the batch to be slow
  - the batch is a fixed 64k in size
  - access to the batch must be wrapped by brw_batch_begin/_end
  - all relocations must be immediately written into the batch
 
 The importance of the flat relocation tree with local offset handling is
 that it allows us to use the relocation-less execbuffer interfaces,
 dramatically reducing the overhead of batch submission. However, that
 can be relaxed to allow other buffers than the batch buffer to carry
 relocations, if need be.
 
 ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation
 speedups.
 
 ISSUES:
 * shared mipmap trees
   - we instantiate a context local copy on use, but what are the semantics for
 serializing read/writes between them - do we need automagic flushing of
 execution on other contexts and common busyness tracking?
   - we retain references to the bo past the lifetime of its parent
 batchmgr as the mipmap_tree is retained past the lifetime of its
 original context, see glx_arb_create_context/default_major_version
 * OglMultithread is nevertheless unhappy; but that looks like undefined
   behaviour - i.e. a buggy client concurrently executing the same GL
   context in multiple threads, unpatched is equally buggy.
 * Add full-ppgtt softpinning support (no more relocations, at least for
   the first 256TiB), at the moment there is a limited proof-of-principle
   demonstration
 * polish and move to libdrm; though at the cost of sealing the structs?
 
 Signed-off-by: Chris Wilson ch...@chris-wilson.co.uk
 Cc: Daniel Vetter daniel.vet...@ffwll.ch
 Cc: Kristian Høgsberg k...@bitplanet.net
 Cc: Kenneth Graunke kenn...@whitecape.org
 Cc: Jesse Barnes jbar...@virtuousgeek.org
 Cc: Ian Romanick ian.d.roman...@intel.com
 Cc: Abdiel Janulgue abdiel.janul...@linux.intel.com
 Cc: Eero Tamminen eero.t.tammi...@intel.com
 Cc: Martin Peres martin.pe...@linux.intel.com
 ---
  src/mesa/drivers/dri/i965/Makefile.sources |4 +-
  src/mesa/drivers/dri/i965/brw_batch.c  | 1946 
 
  src/mesa/drivers/dri/i965/brw_batch.h  |  377 
  src/mesa/drivers/dri/i965/brw_binding_tables.c |1 -
  src/mesa/drivers/dri/i965/brw_blorp.cpp|   46 +-
  src/mesa/drivers/dri/i965/brw_cc.c |   16 +-
  src/mesa/drivers/dri/i965/brw_clear.c  |1 -
  src/mesa/drivers/dri/i965/brw_clip.c   |2 -
  src/mesa/drivers/dri/i965/brw_clip_line.c  |2 -
  src/mesa/drivers/dri/i965/brw_clip_point.c |2 -
  src/mesa/drivers/dri/i965/brw_clip_state.c |   14 +-
  src/mesa/drivers/dri/i965/brw_clip_tri.c   |2 -
  src/mesa/drivers/dri/i965/brw_clip_unfilled.c  |2 -
  src/mesa/drivers/dri/i965/brw_clip_util.c  |2 -
  src/mesa/drivers/dri/i965/brw_compute.c|   42 +-
  src/mesa/drivers/dri/i965/brw_conditional_render.c |2 +-
  src/mesa/drivers/dri/i965/brw_context.c|  233 ++-
  src/mesa/drivers/dri/i965/brw_context.h|  144 +-
  src/mesa/drivers/dri/i965/brw_cs.cpp   |6 +-
  src/mesa/drivers/dri/i965/brw_curbe.c  |1 -
  src/mesa/drivers/dri/i965/brw_draw.c   |  103 +-
  src/mesa/drivers/dri/i965/brw_draw_upload.c|   23 +-
  src/mesa/drivers/dri/i965/brw_ff_gs.c  |2 -
  src/mesa/drivers/dri/i965/brw_ff_gs_emit.c |1 -
  src/mesa/drivers/dri/i965/brw_fs.cpp   |5 +-
  src/mesa/drivers/dri/i965/brw_meta_fast_clear.c|   11 +-
  src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c  |1 -
  src/mesa/drivers/dri/i965/brw_meta_updownsample.c  |1 -
  src/mesa/drivers/dri/i965/brw_misc_state.c |   10 +-