Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On Mon, Oct 17, 2016 at 10:47:09AM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 15:42, Chris Wilson wrote:
> >On Fri, Oct 14, 2016 at 03:35:59PM +0100, Tvrtko Ursulin wrote:
> >>On 14/10/2016 14:53, Chris Wilson wrote:
> >We do pass NORETRY | NOWARN for the higher order allocations, so it
> >shouldn't be as bad as it seems?
> I don't know for sure without looking into the implementation
> details. But I assumed even with NORETRY it does some extra work to
> try and free up the space. And if it fails, and we ask for it again,
> it is just doing that extra work for nothing. Because within a
> single allocation it sounds unlikely that something would change so
> dramatically that it would start working.
> >>>iirc, NORETRY means abort after failure. In effect, it does
> >>>2 attempts from the freelist, a direct reclaim, and may then repeat
> >>>if the task's allowed set of nodes were concurrently changed.
> >>Do you think it makes sense doing all that after it started failing,
> >>within our single get_pages allocation?
> >I was thinking about skipping the DIRECT_RECLAIM for high order, but it
> >seems like that is beneficial for THP, so I'm presuming it should also
> >be for ourselves. Trimming back max_order seems sensible, but I still
> >like the idea of taking advantage of contiguous pages where possible
> >(primarily these will be used for ringbuffers and shadow batches).
> 
> Can we agree then to try larger order first and fall back gradually?

I thought we had already agreed on that! I thought we were looking at
what else we might do. :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On 14/10/2016 15:42, Chris Wilson wrote:
> On Fri, Oct 14, 2016 at 03:35:59PM +0100, Tvrtko Ursulin wrote:
>> On 14/10/2016 14:53, Chris Wilson wrote:
>>>>> We do pass NORETRY | NOWARN for the higher order allocations, so it
>>>>> shouldn't be as bad as it seems?
>>>> I don't know for sure without looking into the implementation
>>>> details. But I assumed even with NORETRY it does some extra work to
>>>> try and free up the space. And if it fails, and we ask for it again,
>>>> it is just doing that extra work for nothing. Because within a
>>>> single allocation it sounds unlikely that something would change so
>>>> dramatically that it would start working.
>>> iirc, NORETRY means abort after failure. In effect, it does
>>> 2 attempts from the freelist, a direct reclaim, and may then repeat
>>> if the task's allowed set of nodes were concurrently changed.
>> Do you think it makes sense doing all that after it started failing,
>> within our single get_pages allocation?
> I was thinking about skipping the DIRECT_RECLAIM for high order, but it
> seems like that is beneficial for THP, so I'm presuming it should also
> be for ourselves. Trimming back max_order seems sensible, but I still
> like the idea of taking advantage of contiguous pages where possible
> (primarily these will be used for ringbuffers and shadow batches).

Can we agree then to try larger order first and fall back gradually?

Regards,

Tvrtko
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On Fri, Oct 14, 2016 at 03:35:59PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 14:53, Chris Wilson wrote:
> >>>We do pass NORETRY | NOWARN for the higher order allocations, so it
> >>>shouldn't be as bad as it seems?
> >>I don't know for sure without looking into the implementation
> >>details. But I assumed even with NORETRY it does some extra work to
> >>try and free up the space. And if it fails, and we ask for it again,
> >>it is just doing that extra work for nothing. Because within a
> >>single allocation it sounds unlikely that something would change so
> >>dramatically that it would start working.
> >iirc, NORETRY means abort after failure. In effect, it does
> >2 attempts from the freelist, a direct reclaim, and may then repeat
> >if the task's allowed set of nodes were concurrently changed.
> 
> Do you think it makes sense doing all that after it started failing,
> within our single get_pages allocation?

I was thinking about skipping the DIRECT_RECLAIM for high order, but it
seems like that is beneficial for THP, so I'm presuming it should also
be for ourselves. Trimming back max_order seems sensible, but I still
like the idea of taking advantage of contiguous pages where possible
(primarily these will be used for ringbuffers and shadow batches).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On 14/10/2016 14:53, Chris Wilson wrote:
> On Fri, Oct 14, 2016 at 02:44:00PM +0100, Tvrtko Ursulin wrote:
>> On 14/10/2016 13:54, Chris Wilson wrote:
>>> On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:
>>>> On 14/10/2016 13:18, Chris Wilson wrote:
>>>>> +	gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
>>>>> +	if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
>>>>> +		/* 965gm cannot relocate objects above 4GiB. */
>>>>> +		gfp &= ~__GFP_HIGHMEM;
>>>>> +		gfp |= __GFP_DMA32;
>>>>> +	}
>>>>> +
>>>>> +	do {
>>>>> +		int order = min(fls(npages) - 1, max_order);
>>>> I still have reservations on going back to max_order when previous
>>>> chunks required an order decrease. Size of the object is unbound
>>>> since it is indirectly controlled by userspace, correct? How about
>>>> decreasing the max_order on every repeated order decrease, following
>>>> failed order allocation?
>>>>> +		struct page *page;
>>>>> +
>>>>> +		do {
>>>>> +			page = alloc_pages(gfp | (order ? QUIET : 0), order);
>>>>> +			if (page)
>>>>> +				break;
>>>>> +			if (!order--)
>>>>> +				goto err;
>>> Like:
>>> 	/* Limit future allocations as well */
>>> 	max_order = order;
>>>
>>>>> +		} while (1);
>>> We do pass NORETRY | NOWARN for the higher order allocations, so it
>>> shouldn't be as bad as it seems?
>> I don't know for sure without looking into the implementation
>> details. But I assumed even with NORETRY it does some extra work to
>> try and free up the space. And if it fails, and we ask for it again,
>> it is just doing that extra work for nothing. Because within a
>> single allocation it sounds unlikely that something would change so
>> dramatically that it would start working.
> iirc, NORETRY means abort after failure. In effect, it does
> 2 attempts from the freelist, a direct reclaim, and may then repeat
> if the task's allowed set of nodes were concurrently changed.

Do you think it makes sense doing all that after it started failing,
within our single get_pages allocation?

Regards,

Tvrtko
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On Fri, Oct 14, 2016 at 02:44:00PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 13:54, Chris Wilson wrote:
> >On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:
> >>On 14/10/2016 13:18, Chris Wilson wrote:
> >>>+	gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
> >>>+	if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
> >>>+		/* 965gm cannot relocate objects above 4GiB. */
> >>>+		gfp &= ~__GFP_HIGHMEM;
> >>>+		gfp |= __GFP_DMA32;
> >>>+	}
> >>>+
> >>>+	do {
> >>>+		int order = min(fls(npages) - 1, max_order);
> >>I still have reservations on going back to max_order when previous
> >>chunks required an order decrease. Size of the object is unbound
> >>since it is indirectly controlled by userspace, correct? How about
> >>decreasing the max_order on every repeated order decrease, following
> >>failed order allocation?
> >>>+		struct page *page;
> >>>+
> >>>+		do {
> >>>+			page = alloc_pages(gfp | (order ? QUIET : 0), order);
> >>>+			if (page)
> >>>+				break;
> >>>+			if (!order--)
> >>>+				goto err;
> >Like:
> >	/* Limit future allocations as well */
> >	max_order = order;
> >
> >>>+		} while (1);
> >We do pass NORETRY | NOWARN for the higher order allocations, so it
> >shouldn't be as bad as it seems?
> 
> I don't know for sure without looking into the implementation
> details. But I assumed even with NORETRY it does some extra work to
> try and free up the space. And if it fails, and we ask for it again,
> it is just doing that extra work for nothing. Because within a
> single allocation it sounds unlikely that something would change so
> dramatically that it would start working.

iirc, NORETRY means abort after failure. In effect, it does
2 attempts from the freelist, a direct reclaim, and may then repeat
if the task's allowed set of nodes were concurrently changed.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On 14/10/2016 13:54, Chris Wilson wrote:
> On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:
>> On 14/10/2016 13:18, Chris Wilson wrote:
>>> +	gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
>>> +	if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
>>> +		/* 965gm cannot relocate objects above 4GiB. */
>>> +		gfp &= ~__GFP_HIGHMEM;
>>> +		gfp |= __GFP_DMA32;
>>> +	}
>>> +
>>> +	do {
>>> +		int order = min(fls(npages) - 1, max_order);
>> I still have reservations on going back to max_order when previous
>> chunks required an order decrease. Size of the object is unbound
>> since it is indirectly controlled by userspace, correct? How about
>> decreasing the max_order on every repeated order decrease, following
>> failed order allocation?
>>> +		struct page *page;
>>> +
>>> +		do {
>>> +			page = alloc_pages(gfp | (order ? QUIET : 0), order);
>>> +			if (page)
>>> +				break;
>>> +			if (!order--)
>>> +				goto err;
> Like:
> 	/* Limit future allocations as well */
> 	max_order = order;
>
>>> +		} while (1);
> We do pass NORETRY | NOWARN for the higher order allocations, so it
> shouldn't be as bad as it seems?

I don't know for sure without looking into the implementation
details. But I assumed even with NORETRY it does some extra work to
try and free up the space. And if it fails, and we ask for it again,
it is just doing that extra work for nothing. Because within a
single allocation it sounds unlikely that something would change so
dramatically that it would start working.

So yes, permanently limiting would sound safer to me.

Regards,

Tvrtko
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 13:18, Chris Wilson wrote:
> >+	gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
> >+	if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
> >+		/* 965gm cannot relocate objects above 4GiB. */
> >+		gfp &= ~__GFP_HIGHMEM;
> >+		gfp |= __GFP_DMA32;
> >+	}
> >+
> >+	do {
> >+		int order = min(fls(npages) - 1, max_order);
> 
> I still have reservations on going back to max_order when previous
> chunks required an order decrease. Size of the object is unbound
> since it is indirectly controlled by userspace, correct? How about
> decreasing the max_order on every repeated order decrease, following
> failed order allocation?
> 
> >+		struct page *page;
> >+
> >+		do {
> >+			page = alloc_pages(gfp | (order ? QUIET : 0), order);
> >+			if (page)
> >+				break;
> >+			if (!order--)
> >+				goto err;

Like:
	/* Limit future allocations as well */
	max_order = order;

> >+		} while (1);

We do pass NORETRY | NOWARN for the higher order allocations, so it
shouldn't be as bad as it seems?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
On 14/10/2016 13:18, Chris Wilson wrote:

Quite a few of our objects used for internal hardware programming do not
benefit from being swappable or from being zero initialised. As such
they do not benefit from using a shmemfs backing storage and since they
are internal and never directly exposed to the user, we do not need to
worry about providing a filp.

For these we can use a drm_i915_gem_object wrapper around a sg_table of
plain struct page. They are not swap backed and not automatically
pinned. If they are reaped by the shrinker, the pages are released and
the contents discarded. For the internal use case, this is fine as, for
example, ringbuffers are pinned from being written by a request to being
read by the hardware. Once they are idle, they can be discarded
entirely. As such they are a good match for execlist ringbuffers and a
small variety of other internal objects.

In the first iteration, this is limited to the scratch batch buffers we
use (for command parsing and state initialisation).

v2: Allocate physically contiguous pages, where possible.
Signed-off-by: Chris Wilson
Cc: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/Makefile                |   1 +
 drivers/gpu/drm/i915/i915_drv.h              |   5 +
 drivers/gpu/drm/i915/i915_gem_batch_pool.c   |  27 ++---
 drivers/gpu/drm/i915/i915_gem_internal.c     | 164 +++
 drivers/gpu/drm/i915/i915_gem_render_state.c |   2 +-
 drivers/gpu/drm/i915/intel_engine_cs.c       |   2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c      |  14 ++-
 7 files changed, 191 insertions(+), 24 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_gem_internal.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 8790ae4fb171..efaf25b984af 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -35,6 +35,7 @@ i915-y += i915_cmd_parser.o \
 	  i915_gem_execbuffer.o \
 	  i915_gem_fence.o \
 	  i915_gem_gtt.o \
+	  i915_gem_internal.o \
 	  i915_gem.o \
 	  i915_gem_render_state.o \
 	  i915_gem_request.o \
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4c0c8ef6c084..38467dde1efe 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3540,6 +3540,11 @@ i915_gem_object_create_stolen_for_preallocated(struct drm_device *dev,
 					       u32 gtt_offset,
 					       u32 size);
 
+/* i915_gem_internal.c */
+struct drm_i915_gem_object *
+i915_gem_object_create_internal(struct drm_i915_private *dev_priv,
+				unsigned int size);
+
 /* i915_gem_shrinker.c */
 unsigned long i915_gem_shrink(struct drm_i915_private *dev_priv,
 			      unsigned long target,
diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
index cb25cad3318c..aa4e1e043b4e 100644
--- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
+++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
@@ -97,9 +97,9 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 			size_t size)
 {
 	struct drm_i915_gem_object *obj = NULL;
-	struct drm_i915_gem_object *tmp, *next;
+	struct drm_i915_gem_object *tmp;
 	struct list_head *list;
-	int n;
+	int n, ret;
 
 	lockdep_assert_held(&pool->engine->i915->drm.struct_mutex);
@@ -112,19 +112,12 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 	n = ARRAY_SIZE(pool->cache_list) - 1;
 	list = &pool->cache_list[n];
-	list_for_each_entry_safe(tmp, next, list, batch_pool_link) {
+	list_for_each_entry(tmp, list, batch_pool_link) {
 		/* The batches are strictly LRU ordered */
 		if (!i915_gem_active_is_idle(&tmp->last_read[pool->engine->id],
 					     &tmp->base.dev->struct_mutex))
 			break;
 
-		/* While we're looping, do some clean up */
-		if (tmp->madv == __I915_MADV_PURGED) {
-			list_del(&tmp->batch_pool_link);
-			i915_gem_object_put(tmp);
-			continue;
-		}
-
 		if (tmp->base.size >= size) {
 			obj = tmp;
 			break;
@@ -132,19 +125,15 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 	}
 
 	if (obj == NULL) {
-		int ret;
-
-		obj = i915_gem_object_create(&pool->engine->i915->drm, size);
+		obj = i915_gem_object_create_internal(pool->engine->i915, size);
 		if (IS_ERR(obj))
 			return obj;
-
-		ret = i915_gem_object_get_pages(obj);
-		if (ret)
-			return ERR_PTR(ret);
-
-		obj->madv = I915_MADV_DONTNEED;
 	}
 
+	ret = i915_gem_object_get_pages(obj);
[Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects
Quite a few of our objects used for internal hardware programming do not
benefit from being swappable or from being zero initialised. As such
they do not benefit from using a shmemfs backing storage and since they
are internal and never directly exposed to the user, we do not need to
worry about providing a filp.

For these we can use a drm_i915_gem_object wrapper around a sg_table of
plain struct page. They are not swap backed and not automatically
pinned. If they are reaped by the shrinker, the pages are released and
the contents discarded. For the internal use case, this is fine as, for
example, ringbuffers are pinned from being written by a request to being
read by the hardware. Once they are idle, they can be discarded
entirely. As such they are a good match for execlist ringbuffers and a
small variety of other internal objects.

In the first iteration, this is limited to the scratch batch buffers we
use (for command parsing and state initialisation).

v2: Allocate physically contiguous pages, where possible.
Signed-off-by: Chris Wilson
Cc: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/Makefile                |   1 +
 drivers/gpu/drm/i915/i915_drv.h              |   5 +
 drivers/gpu/drm/i915/i915_gem_batch_pool.c   |  27 ++---
 drivers/gpu/drm/i915/i915_gem_internal.c     | 164 +++
 drivers/gpu/drm/i915/i915_gem_render_state.c |   2 +-
 drivers/gpu/drm/i915/intel_engine_cs.c       |   2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c      |  14 ++-
 7 files changed, 191 insertions(+), 24 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_gem_internal.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 8790ae4fb171..efaf25b984af 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -35,6 +35,7 @@ i915-y += i915_cmd_parser.o \
 	  i915_gem_execbuffer.o \
 	  i915_gem_fence.o \
 	  i915_gem_gtt.o \
+	  i915_gem_internal.o \
 	  i915_gem.o \
 	  i915_gem_render_state.o \
 	  i915_gem_request.o \
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4c0c8ef6c084..38467dde1efe 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3540,6 +3540,11 @@ i915_gem_object_create_stolen_for_preallocated(struct drm_device *dev,
 					       u32 gtt_offset,
 					       u32 size);
 
+/* i915_gem_internal.c */
+struct drm_i915_gem_object *
+i915_gem_object_create_internal(struct drm_i915_private *dev_priv,
+				unsigned int size);
+
 /* i915_gem_shrinker.c */
 unsigned long i915_gem_shrink(struct drm_i915_private *dev_priv,
 			      unsigned long target,
diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
index cb25cad3318c..aa4e1e043b4e 100644
--- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
+++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
@@ -97,9 +97,9 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 			size_t size)
 {
 	struct drm_i915_gem_object *obj = NULL;
-	struct drm_i915_gem_object *tmp, *next;
+	struct drm_i915_gem_object *tmp;
 	struct list_head *list;
-	int n;
+	int n, ret;
 
 	lockdep_assert_held(&pool->engine->i915->drm.struct_mutex);
@@ -112,19 +112,12 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 	n = ARRAY_SIZE(pool->cache_list) - 1;
 	list = &pool->cache_list[n];
-	list_for_each_entry_safe(tmp, next, list, batch_pool_link) {
+	list_for_each_entry(tmp, list, batch_pool_link) {
 		/* The batches are strictly LRU ordered */
 		if (!i915_gem_active_is_idle(&tmp->last_read[pool->engine->id],
 					     &tmp->base.dev->struct_mutex))
 			break;
 
-		/* While we're looping, do some clean up */
-		if (tmp->madv == __I915_MADV_PURGED) {
-			list_del(&tmp->batch_pool_link);
-			i915_gem_object_put(tmp);
-			continue;
-		}
-
 		if (tmp->base.size >= size) {
 			obj = tmp;
 			break;
@@ -132,19 +125,15 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
 	}
 
 	if (obj == NULL) {
-		int ret;
-
-		obj = i915_gem_object_create(&pool->engine->i915->drm, size);
+		obj = i915_gem_object_create_internal(pool->engine->i915, size);
 		if (IS_ERR(obj))
 			return obj;
-
-		ret = i915_gem_object_get_pages(obj);
-		if (ret)
-			return ERR_PTR(ret);
-
-		obj->madv = I915_MADV_DONTNEED;
 	}
 
+	ret = i915_gem_object_get_pages(obj);
+	if (ret)
+