Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote: Hi Chris, I made a genuine effort to review this patch, hoping to better understand the various changes and what you were trying to accomplish. I spent many hours reading and trying to enumerate changes - or potential changes I needed to look hard at to convince myself whether they were correct. I came up with a frighteningly long list of changes: * Locking completely overhauled. * Relocation handling changes considerably (the original point of Kristian's endeavour which led up to this). It's been a long time since 2011. * Fencing, busy tracking, and sync objects are completely reworked. sync obje That's a slight overstatement. * Render-to-texture cache flushing and dirty buffer tracking is completely reworked. And immensely simplified. A hash table? Did you notice that the current dirty tracking misses where the same buffer is rewritten? * Gen7 SOL buffer offset resetting now uses MI_LOAD_REGISTER_IMM rather than the execbuf2 parameter, requiring the command validator on Haswell. This effectively bumps the kernel requirement from v3.6 to v4.2-rc1, which will simply not fly with distributions at this time. My bad, completely missed that there was EXT_transform_feedback that only did the subset. I wondered why you had code to the pipelined loads/save but then still flushed the batch every time. * glBufferSubData() now uses intel_upload_data() rather than allocating a temporary BO. This is the first use of the upload buffer by the BLT engine, and could imply that the upload buffer's lifetime now extends across batches - longer than before. Separable change that requires separate evaluation and justification. Yes and no, it is easy to demonstrate that the buffer's lifetime is longer than a batch current. * Per buffer cache-coherency checking rather than brw-has_llc? We've pointed out the bugs in the current usage of brw-has_llc before. * glBufferSubData()'s prefer_stall_to_blit flag appears to depend on per-buffer cache-coherency rather than being set globally. Could impact performance of buffer uploads. * Potential missing flushes (which can cause hangs or misrendering): - It looks like calling brw_bo_busy() with BUSY_FLUSH causes a flush when necessary. However, some instances of the old bo_busy, bo_references, batch_flush pattern are replaced without that flag. One occurrance was in BufferSubData(); I did not spend time to check every case. The change is still accurate. - Flushes are often done implicitly by e.g. brw_bo_read calling brw_bo_map with the appropriate flags, and many explicit checks and flushes are removed. Not bad, but needs careful review. - Gen6+ query object code might have dropped an implicit flush guaranteeing that when the GL application requests the result, any pending work will be kicked off so they can poll/spin repeatedly until the result arrives. The flush is not missing. - New code to avoid redundant flushes. * perf_debug() warnings are removed all over the code for some reason: The problem is perf_debug() was a layering violation (my goal was to keep brw_batch a mini-library, either to extract it back to libdrm or for reuse). I have an idea of using a macro like #define PERF_DEBUG_LOC(brw, str) ((brw-perf_debug ? (struct loc){__FILE__, __FUNCTION__, __LINE__, str)} : NULL)) brw_bo_map(..., PERF_DEBUG_LOC(brw)); Then pass back to a brw_context callback if a function stalls. - Unsynchronized maps/BufferSubData not working on !LLC platforms? If they work now, that's a huge change! If not, why drop the warning? They have always worked on !llc. The same caveats with using multiple aliasing to the same physical address through different modes of write caching apply to both llc and !llc. - Warnings about stalls on mapping buffers and miptrees are gone now. These have been useful in tracking down performance problems. They might not always be accurate, but surely removing them should be done separately with justification? See issues with calling perf_debug. - Warnings about stalls on query objects are gone. I've used these when analyzing application performance. Why? See issues with calling perf_debug. - Warnings about implicit flushes are gone. See issues with calling perf_debug. It was at the back of my mind that I had to fix perf_debug - if I had put in the list of issues, I would probably would have remembered to fix it before posting. * BO unmap calls appears to be missing in some places. A few map calls have moved around in hard-to-follow ways. Unclear how lifetimes of buffers and lifetimes of maps are affected. Bo unmap is a figment of your imagination. libdrm never unmaps a bo until it is closed (simply to avoid the cost of refaulting the objects - though that does come at some risk of address space
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On Wed, Jul 08, 2015 at 09:51:07AM +0100, Chris Wilson wrote: On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote: * Gen4-5 structure changes. Did you mean brw_structs.h? diff --git a/src/mesa/drivers/dri/i965/brw_structs.h b/src/mesa/drivers/dri/i965/brw_structs.h index 55338c0..e167254 100644 --- a/src/mesa/drivers/dri/i965/brw_structs.h +++ b/src/mesa/drivers/dri/i965/brw_structs.h @@ -391,13 +391,16 @@ struct brw_sf_unit_state unsigned pad3:1; } thread4; - struct + union { + struct { unsigned front_winding:1; unsigned viewport_transform:1; unsigned pad0:3; unsigned sf_viewport_state_offset:27; /* Offset from GENERAL_STATE_BASE */ } sf5; + uint32_t dw5; + }; struct { @@ -525,15 +528,17 @@ struct brw_wm_unit_state struct thread2 thread2; struct thread3 thread3; + union { struct { unsigned stats_enable:1; unsigned depth_buffer_clear:1; unsigned sampler_count:3; unsigned sampler_state_pointer:27; } wm4; + uint32_t dw4; + }; - struct - { + struct { unsigned enable_8_pix:1; unsigned enable_16_pix:1; unsigned enable_32_pix:1; diff --git a/src/mesa/drivers/dri/i965/brw_urb.c b/src/mesa/drivers/dri/i965/brw_urb.c Or something else? -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On Wed, Jul 08, 2015 at 10:49:24AM -0700, Kenneth Graunke wrote: On Wednesday, July 08, 2015 03:17:35 PM Chris Wilson wrote: On Wed, Jul 08, 2015 at 09:51:07AM +0100, Chris Wilson wrote: On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote: * Gen4-5 structure changes. Did you mean brw_structs.h? diff --git a/src/mesa/drivers/dri/i965/brw_structs.h b/src/mesa/drivers/dri/i965/brw_structs.h index 55338c0..e167254 100644 --- a/src/mesa/drivers/dri/i965/brw_structs.h +++ b/src/mesa/drivers/dri/i965/brw_structs.h @@ -391,13 +391,16 @@ struct brw_sf_unit_state unsigned pad3:1; } thread4; - struct + union { + struct { unsigned front_winding:1; unsigned viewport_transform:1; unsigned pad0:3; unsigned sf_viewport_state_offset:27; /* Offset from GENERAL_STATE_BASE */ } sf5; + uint32_t dw5; + }; struct { @@ -525,15 +528,17 @@ struct brw_wm_unit_state struct thread2 thread2; struct thread3 thread3; + union { struct { unsigned stats_enable:1; unsigned depth_buffer_clear:1; unsigned sampler_count:3; unsigned sampler_state_pointer:27; } wm4; + uint32_t dw4; + }; - struct - { + struct { unsigned enable_8_pix:1; unsigned enable_16_pix:1; unsigned enable_32_pix:1; diff --git a/src/mesa/drivers/dri/i965/brw_urb.c b/src/mesa/drivers/dri/i965/brw_urb.c Or something else? -Chris Yes - and the changes are fine. But...separable. Would you like a patch to provide dword alternates for all the structs, or just the ones used for relocation processing? -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On Wednesday, July 08, 2015 03:17:35 PM Chris Wilson wrote: On Wed, Jul 08, 2015 at 09:51:07AM +0100, Chris Wilson wrote: On Tue, Jul 07, 2015 at 10:03:09PM -0700, Kenneth Graunke wrote: * Gen4-5 structure changes. Did you mean brw_structs.h? diff --git a/src/mesa/drivers/dri/i965/brw_structs.h b/src/mesa/drivers/dri/i965/brw_structs.h index 55338c0..e167254 100644 --- a/src/mesa/drivers/dri/i965/brw_structs.h +++ b/src/mesa/drivers/dri/i965/brw_structs.h @@ -391,13 +391,16 @@ struct brw_sf_unit_state unsigned pad3:1; } thread4; - struct + union { + struct { unsigned front_winding:1; unsigned viewport_transform:1; unsigned pad0:3; unsigned sf_viewport_state_offset:27; /* Offset from GENERAL_STATE_BASE */ } sf5; + uint32_t dw5; + }; struct { @@ -525,15 +528,17 @@ struct brw_wm_unit_state struct thread2 thread2; struct thread3 thread3; + union { struct { unsigned stats_enable:1; unsigned depth_buffer_clear:1; unsigned sampler_count:3; unsigned sampler_state_pointer:27; } wm4; + uint32_t dw4; + }; - struct - { + struct { unsigned enable_8_pix:1; unsigned enable_16_pix:1; unsigned enable_32_pix:1; diff --git a/src/mesa/drivers/dri/i965/brw_urb.c b/src/mesa/drivers/dri/i965/brw_urb.c Or something else? -Chris Yes - and the changes are fine. But...separable. signature.asc Description: This is a digitally signed message part. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On Tue, Jul 07, 2015 at 01:14:53PM +0300, Abdiel Janulgue wrote: On 07/06/2015 01:33 PM, Chris Wilson wrote: @@ -600,7 +593,10 @@ brw_emit_null_surface_state(struct brw_context *brw, 1 BRW_SURFACE_WRITEDISABLE_B_SHIFT | 1 BRW_SURFACE_WRITEDISABLE_A_SHIFT); } - surf[1] = bo ? bo-offset64 : 0; + surf[1] = brw_batch_reloc(brw-batch, *out_offset + 4, + bo, 0, + I915_GEM_DOMAIN_RENDER, + I915_GEM_DOMAIN_RENDER); null check for bo? I put the NULL check into the inline variant of brw_batch_reloc() for a bit of syntatic sugar for these cases. -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On 07/06/2015 01:33 PM, Chris Wilson wrote: +/* + * Add a relocation entry for the target buffer into the current batch. + * + * This is the heart of performing fast relocations, both here and in + * the corresponding kernel relocation routines. + * + * - Instead of passing in handles for the kernel convert back into + * the buffer for every relocation, we tell the kernel which + * execobject slot corresponds with the relocation. The kernel is + * able to use a simple LUT constructed as it first looks up each buffer + * for the batch rather than search a small, overfull hashtable. As both + * the number of relocations and buffers in a batch grow, the simple + * LUT is much more efficient (though the LUT itself is less cache + * friendly). + * However, as the batch buffer is by definition the last object in + * the execbuffer array we have to perform a pass to relabel the + * target of all relocations pointing to the batch. (Except when + * the kernel supports batch-first, in which case we can do the relocation + * target processing for the batch inline.) + * + * - If the kernel has not moved the buffer, it will still be in the same + * location as last time we used it. If we tell the kernel that all the + * relocation entries are the same as the offset for the buffer, then + * the kernel need only check that all the buffers are still in the same + * location and then skip performing relocations entirely. A huge win. + * + * - As a consequence of telling the kernel to skip processing the relocations, + * we need to tell the kernel about the read/write domains and special needs + * of the buffers. + * + * - Alternatively, we can request the kernel place the buffer exactly + * where we want it and forgo all relocations to that buffer entirely. + * The buffer is effectively pinned for its lifetime (if the kernel + * does have to move it, for example to swap it out to recover memory, + * the kernel will return it back to our requested location at the start + * of the next batch.) This of course imposes a lot of constraints on where + * we can say the buffers are, they must meet all the alignment constraints + * and not overlap. + * + * - Essential to all these techniques is that we always use the same + * presumed_offset for the relocations as for submitting the execobject. + * That value must be written into the batch and it must match the value + * we tell the kernel. (This breaks down when using relocation tries shared + * between multiple contexts, hence the need for context-local batch + * management.) + * + * In contrast to libdrm, we can build the execbuffer array along with + * the batch by forgoing the ability to handle general relocation trees. + * This avoids having multiple passes to build the execbuffer parameter, + * and also gives us a means to cheaply track when a buffer has been + * referenced by the batch. + */ +uint64_t __brw_batch_reloc(struct brw_batch *batch, + uint32_t batch_offset, + struct brw_bo *target_bo, + uint64_t target_offset, + unsigned read_domains, + unsigned write_domain) +{ + assert(target_bo-refcnt); + if (unlikely(target_bo-batch != batch)) { + /* XXX legal sharing between contexts/threads? */ + target_bo = brw_bo_import(batch, target_bo-base, true); + if (unlikely(target_bo == NULL)) + longjmp(batch-jmpbuf, -ENOMEM); + target_bo-refcnt--; /* kept alive by the implicit active reference */ + } + assert(target_bo-batch == batch); + + if (target_bo-exec == NULL) { + int n; + + /* reserve one exec entry for the batch */ + if (unlikely(batch-emit.nexec + 1 == batch-exec_size)) + __brw_batch_grow_exec(batch); + + n = batch-emit.nexec++; + target_bo-target_handle = has_lut(batch) ? n : target_bo-handle; + target_bo-exec = memset(batch-exec + n, 0, sizeof(*target_bo-exec)); + target_bo-exec-handle = target_bo-handle; + target_bo-exec-alignment = target_bo-alignment; + target_bo-exec-offset = target_bo-offset; + if (target_bo-pinned) + target_bo-exec-flags = EXEC_OBJECT_PINNED; + + /* Track the total amount of memory in use by all active requests */ + if (target_bo-read.rq == NULL) { + batch-rss += target_bo-size; + if (batch-rss batch-peak_rss) + batch-peak_rss = batch-rss; + } + target_bo-read.rq = batch-next_request; + list_move_tail(target_bo-read.link, batch-next_request-read); + + batch-aperture += target_bo-size; + } + + if (!target_bo-pinned) { + int n; + + if (unlikely(batch-emit.nreloc == batch-reloc_size)) + __brw_batch_grow_reloc(batch); + + n = batch-emit.nreloc++; +
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On 07/07/2015 01:19 PM, Chris Wilson wrote: On Tue, Jul 07, 2015 at 01:14:53PM +0300, Abdiel Janulgue wrote: On 07/06/2015 01:33 PM, Chris Wilson wrote: @@ -600,7 +593,10 @@ brw_emit_null_surface_state(struct brw_context *brw, 1 BRW_SURFACE_WRITEDISABLE_B_SHIFT | 1 BRW_SURFACE_WRITEDISABLE_A_SHIFT); } - surf[1] = bo ? bo-offset64 : 0; + surf[1] = brw_batch_reloc(brw-batch, *out_offset + 4, + bo, 0, + I915_GEM_DOMAIN_RENDER, + I915_GEM_DOMAIN_RENDER); null check for bo? I put the NULL check into the inline variant of brw_batch_reloc() for a bit of syntatic sugar for these cases. -Chris You're, right. I failed to notice the in-line variant. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
Hi Chris, I made a genuine effort to review this patch, hoping to better understand the various changes and what you were trying to accomplish. I spent many hours reading and trying to enumerate changes - or potential changes I needed to look hard at to convince myself whether they were correct. I came up with a frighteningly long list of changes: * Relocation handling changes considerably (the original point of Kristian's endeavour which led up to this). * Fencing, busy tracking, and sync objects are completely reworked. * Render-to-texture cache flushing and dirty buffer tracking is completely reworked. * Gen7 SOL buffer offset resetting now uses MI_LOAD_REGISTER_IMM rather than the execbuf2 parameter, requiring the command validator on Haswell. This effectively bumps the kernel requirement from v3.6 to v4.2-rc1, which will simply not fly with distributions at this time. * glBufferSubData() now uses intel_upload_data() rather than allocating a temporary BO. This is the first use of the upload buffer by the BLT engine, and could imply that the upload buffer's lifetime now extends across batches - longer than before. Separable change that requires separate evaluation and justification. * Per buffer cache-coherency checking rather than brw-has_llc? * glBufferSubData()'s prefer_stall_to_blit flag appears to depend on per-buffer cache-coherency rather than being set globally. Could impact performance of buffer uploads. * Potential missing flushes (which can cause hangs or misrendering): - It looks like calling brw_bo_busy() with BUSY_FLUSH causes a flush when necessary. However, some instances of the old bo_busy, bo_references, batch_flush pattern are replaced without that flag. One occurrance was in BufferSubData(); I did not spend time to check every case. - Flushes are often done implicitly by e.g. brw_bo_read calling brw_bo_map with the appropriate flags, and many explicit checks and flushes are removed. Not bad, but needs careful review. - Gen6+ query object code might have dropped an implicit flush guaranteeing that when the GL application requests the result, any pending work will be kicked off so they can poll/spin repeatedly until the result arrives. - New code to avoid redundant flushes. * perf_debug() warnings are removed all over the code for some reason: - Unsynchronized maps/BufferSubData not working on !LLC platforms? If they work now, that's a huge change! If not, why drop the warning? - Warnings about stalls on mapping buffers and miptrees are gone now. These have been useful in tracking down performance problems. They might not always be accurate, but surely removing them should be done separately with justification? - Warnings about stalls on query objects are gone. I've used these when analyzing application performance. Why? - Warnings about implicit flushes are gone. * BO unmap calls appears to be missing in some places. A few map calls have moved around in hard-to-follow ways. Unclear how lifetimes of buffers and lifetimes of maps are affected. * Possible mmap vs. pwrite preference changes? Hard to follow. * Texture upload (tiled_memcpy) changes, which is notoriously fragile and can lose all of the performance benefit if the compiler isn't able to optimize it just right. Ideally separate. * Assertions change to GL errors in brw_get_graphics_reset_status(). * Aperture space checking significantly reworked, especially for the BLT paths. Honestly, a lot nicer, but couldn't this be separated? * The bo_reuse driconf option is removed. * Gen4-5 structure changes. * brw_get_timestamp() - removes initialization of result to 0. Probably unnecessary and OK to delete; should be separate. * New helper functions and coding patterns. Separable. * Noise (renaming, moving code between files, some other trivial changes like removing 'brw' variables and moving code into else blocks). * ...I probably missed some things. Based upon this, I cannot in good conscience consider merging this patch. The potential for breakage is staggering. As a proof-of-concept, you've done an excellent job in proving we can do much better, and introduced a lot of good ideas. But there's a lot of work left to be done before we can consider applying it to our production quality driver. Please advise whether you would like to work towards making a mergeable, incremental patch series, or if someone else should embark on that endeavour. --Ken signature.asc Description: This is a digitally signed message part. ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
Just skipping to interesting comments for the moment. On Mon, Jul 06, 2015 at 12:34:00PM -0700, Kenneth Graunke wrote: While I really like this idea in principle, the current patch is rather huge, making it difficult to review; bisectability would also suffer. Would it be possible to split it up into several smaller patches? First time I tried I ended up with a series of patches of equal churn and I wasn't happy with the results. Honestly, I find mechanical noise in patches easy to tune out, you quickly recognise the pattern and deviations should stand out. Some of the churn I think is important as it highlight the relative emphasis on different coding patterns. In particular, how the relocation must always emit the address in the batch, so in quite a few functions the code had to be rearranged in order to make that happen One idea I had for how you might accomplish that is to introduce the brw_bo struct and related functions, but make it simply contain a drm_intel_bo * field, and fall back to the existing libdrm-based code. Sure, what I had in mind was a changeover to a struct brw_bo that just wrapped drm_intel_bo for typesafefty (so the compiler was there to catch mistakes and missing pieces). I felt the patch was still about 500 lines of mechcanical changes, 1000 lines removing the intel_batchbuffer.c and 2000 lines adding brw_bo (it has grown somewhat with the comments!). The api to review was in one place, and the bits that required fine changes (mostly intel_bufferobj.c) were equally obvious. But what you really need is a slow progression in feature development of brw_batch. Splitting the mechanical changes off with a touch more churn, and even converting functions over to new API in stages is relatively trivial compared to building up the feature set required to do batchbuffer construction from scratch. - relocations come first, as request tracking cannot happen with integration into the relocations. - execbuf - request tracking - mmap functions - exporting busyness Honestly though the design has never existed in a piecemeal fashion, it has always been requests first and that drove the infrastructure required to enable the request tracking. Anyway splitting out the mechanical changes is worth it if it gets more eyes onto brw_batch.c itself and some of its knock-on effects. That way, you could put all the mechanical renaming and refactoring across the entire code-base in one patch that should have no functional change. You could then replace the contents or brw_bo and those functions with your new improved batch manager. We've talked about moving away from libdrm for a while, so having our own functions and structures makes a lot of sense. I'm also curious about the performance on non-LLC platforms (CHV or BYT)? You've dropped a number of non-LLC paths - which is probably fine, honestly...they were always of dubious value - but it'd be nice to know the record the effects of this change on non-LLC in the commit message. Ah, I didn't actually remove non-LLC paths, I added equivalent detiled CPU access for non-LLC (strictly non cache coherent bo, incl scanouts) as for LLC. So I guess you mean the few places that lost the brw-has_llc differentiation? @@ -279,11 +254,6 @@ retry: brw-ctx.NewDriverState = ~0ull; brw-no_depth_or_stencil = false; brw-ib.type = -1; - - /* Flush the sampler cache so any texturing from the destination is -* coherent. -*/ - brw_emit_mi_flush(brw); It looks like you moved this flush earlier, so it's in the section of code that retries? That's probably reasonable. Seems worth keeping the comment, and this could be done as a separate patch... True. The reason why it has to be earlier is that I am much stricter on having all BEGIN_BATCH/ADVANCE_BATCH chunks being inside a brw_batch_begin/brw_batch_end - since brw_emit_mi_flush() itself doesn't start a new brw_batch_begin section, it had to go in blorp's. The comment was removed because we now have accurate dirty texture cache tracking and that itself is not the reason wy flush has to exist. (The flush is for a change in hiz mode iirc, or some similar change in GPU state that requires a flush between primitives.) [snip] +static void +load_sized_register_mem(struct brw_context *brw, +uint32_t reg, +struct brw_bo *bo, +uint32_t read_domains, uint32_t write_domain, +uint32_t offset, +int size) +{ + int i; + + /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */ + assert(brw-gen = 7); + + if (brw-gen = 8) { + BEGIN_BATCH(4 * size); + for (i = 0; i size; i++) { + OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2)); + OUT_BATCH(reg + i * 4); + OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4); + } + ADVANCE_BATCH(); + } else { +
Re: [Mesa-dev] [PATCH 04/18] i965: Introduce a context-local batch manager
On Monday, July 06, 2015 11:33:09 AM Chris Wilson wrote: When submitting commands to the GPU every cycle of latency counts; mutexes, spinlocks, even atomics quickly add to substantial overhead. This batch manager acts as thread-local shim over the buffer manager (drm_intel_bufmgr_gem). As we are only ever used from within a single context, we can rely on the upper layers providing thread safety. This allows us to import buffers from the shared screen (sharing buffers between multiple contexts, threads and users) and wrap that handle in our own. Similarly, we want to share the buffer cache between all users on the file and so allocate from the global threadsafe buffer manager, with a very small and transient local cache of active buffers. The batch manager provides a cheap way of busyness tracking and very efficient batch construction and kernel submission. The restrictions over and above the generic submission engine in intel_bufmgr_gem are: - not thread-safe - flat relocations, only the batch buffer itself carries relocations. Relocations relative to auxiliary buffers must be performed via STATE_BASE - direct mapping of the batch for writes, expect reads from the batch to be slow - the batch is a fixed 64k in size - access to the batch must be wrapped by brw_batch_begin/_end - all relocations must be immediately written into the batch The importance of the flat relocation tree with local offset handling is that it allows us to use the relocation-less execbuffer interfaces, dramatically reducing the overhead of batch submission. However, that can be relaxed to allow other buffers than the batch buffer to carry relocations, if need be. ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation speedups. ISSUES: * shared mipmap trees - we instantiate a context local copy on use, but what are the semantics for serializing read/writes between them - do we need automagic flushing of execution on other contexts and common busyness tracking? - we retain references to the bo past the lifetime of its parent batchmgr as the mipmap_tree is retained past the lifetime of its original context, see glx_arb_create_context/default_major_version * OglMultithread is nevertheless unhappy; but that looks like undefined behaviour - i.e. a buggy client concurrently executing the same GL context in multiple threads, unpatched is equally buggy. * Add full-ppgtt softpinning support (no more relocations, at least for the first 256TiB), at the moment there is a limited proof-of-principle demonstration * polish and move to libdrm; though at the cost of sealing the structs? Signed-off-by: Chris Wilson ch...@chris-wilson.co.uk Cc: Daniel Vetter daniel.vet...@ffwll.ch Cc: Kristian Høgsberg k...@bitplanet.net Cc: Kenneth Graunke kenn...@whitecape.org Cc: Jesse Barnes jbar...@virtuousgeek.org Cc: Ian Romanick ian.d.roman...@intel.com Cc: Abdiel Janulgue abdiel.janul...@linux.intel.com Cc: Eero Tamminen eero.t.tammi...@intel.com Cc: Martin Peres martin.pe...@linux.intel.com --- src/mesa/drivers/dri/i965/Makefile.sources |4 +- src/mesa/drivers/dri/i965/brw_batch.c | 1946 src/mesa/drivers/dri/i965/brw_batch.h | 377 src/mesa/drivers/dri/i965/brw_binding_tables.c |1 - src/mesa/drivers/dri/i965/brw_blorp.cpp| 46 +- src/mesa/drivers/dri/i965/brw_cc.c | 16 +- src/mesa/drivers/dri/i965/brw_clear.c |1 - src/mesa/drivers/dri/i965/brw_clip.c |2 - src/mesa/drivers/dri/i965/brw_clip_line.c |2 - src/mesa/drivers/dri/i965/brw_clip_point.c |2 - src/mesa/drivers/dri/i965/brw_clip_state.c | 14 +- src/mesa/drivers/dri/i965/brw_clip_tri.c |2 - src/mesa/drivers/dri/i965/brw_clip_unfilled.c |2 - src/mesa/drivers/dri/i965/brw_clip_util.c |2 - src/mesa/drivers/dri/i965/brw_compute.c| 42 +- src/mesa/drivers/dri/i965/brw_conditional_render.c |2 +- src/mesa/drivers/dri/i965/brw_context.c| 233 ++- src/mesa/drivers/dri/i965/brw_context.h| 144 +- src/mesa/drivers/dri/i965/brw_cs.cpp |6 +- src/mesa/drivers/dri/i965/brw_curbe.c |1 - src/mesa/drivers/dri/i965/brw_draw.c | 103 +- src/mesa/drivers/dri/i965/brw_draw_upload.c| 23 +- src/mesa/drivers/dri/i965/brw_ff_gs.c |2 - src/mesa/drivers/dri/i965/brw_ff_gs_emit.c |1 - src/mesa/drivers/dri/i965/brw_fs.cpp |5 +- src/mesa/drivers/dri/i965/brw_meta_fast_clear.c| 11 +- src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c |1 - src/mesa/drivers/dri/i965/brw_meta_updownsample.c |1 - src/mesa/drivers/dri/i965/brw_misc_state.c | 10 +-