Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-17 Thread Chris Wilson
On Mon, Oct 17, 2016 at 10:47:09AM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 15:42, Chris Wilson wrote:
> >On Fri, Oct 14, 2016 at 03:35:59PM +0100, Tvrtko Ursulin wrote:
> >>On 14/10/2016 14:53, Chris Wilson wrote:
> >We do pass NORETRY | NOWARN for the higher order allocations, so it
> >shouldn't be as bad as it seems?
> I don't know for sure without looking into the implementation
> details. But I assumed even with NORETRY it does some extra work to
> try and free up the space. And if it fails, and we ask for it again,
> it is just doing that extra work for nothing. Because within a
> single allocation it sounds unlikely that something would change so
> dramatically that it would start working.
> >>>iirc, NORETRY means abort after failure. In effect, it does
> >>>2 attempts from the freelist, a direct reclaim, and may then repeat
> >>>if the task's allowed set of nodes were concurrently changed.
> >>Do you think it makes sense doing all that after it started failing,
> >>within our single get_pages allocation?
> >I was thinking about skipping the DIRECT_RECLAIM for high order, but it
> >seems like that is beneficial for THP, so I'm presuming it should also
> >be for us. Trimming back max_order seems sensible, but I still
> >like the idea of taking advantage of contiguous pages where possible
> >(primarily these will be used for ringbuffers and shadow batches).
> 
> Can we agree then to try larger order first and fall back gradually?

I thought we had already agreed on that! I thought we were looking at
what else we might do. :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
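
In code, the approach agreed on here looks roughly like the sketch below.
This is only an illustration of the shape under discussion, not code from
the series: the helper name grab_chunk, its parameters and the exact spot
where max_order is clamped are assumptions.

#include <linux/bitops.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/*
 * Try the largest order that still fits the remaining page count, fall
 * back one order at a time on failure, and clamp *max_order on success
 * so later chunks never re-attempt an order that has already failed.
 */
static struct page *
grab_chunk(gfp_t gfp, unsigned int npages, int *max_order, int *chunk_order)
{
        int order = min_t(int, fls(npages) - 1, *max_order);
        struct page *page;

        do {
                gfp_t flags = gfp;

                /* High-order attempts are opportunistic: fail fast, stay quiet. */
                if (order)
                        flags |= __GFP_NORETRY | __GFP_NOWARN;

                page = alloc_pages(flags, order);
                if (page) {
                        /* Limit future allocations as well. */
                        *max_order = order;
                        *chunk_order = order;
                        return page;
                }
        } while (order--);

        return NULL; /* even the order-0 attempt failed */
}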


Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-17 Thread Tvrtko Ursulin


On 14/10/2016 15:42, Chris Wilson wrote:

On Fri, Oct 14, 2016 at 03:35:59PM +0100, Tvrtko Ursulin wrote:

On 14/10/2016 14:53, Chris Wilson wrote:

We do pass NORETRY | NOWARN for the higher order allocations, so it
shouldn't be as bad as it seems?

I don't know for sure without looking into the implementation
details. But I assumed even with NORETRY it does some extra work to
try and free up the space. And if it fails, and we ask for it again,
it is just doing that extra work for nothing. Because within a
single allocation it sounds unlikely that something would change so
dramatically that it would start working.

iirc, NORETRY means abort after failure. In effect, it does
2 attempts from the freelist, a direct reclaim, and may then repeat
if the task's allowed set of nodes were concurrently changed.

Do you think it makes sense doing all that after it started failing,
within our single get_pages allocation?

I was thinking about skipping the DIRECT_RECLAIM for high order, but it
seems like that is beneficial for THP, so I'm presuming it should also
be for us. Trimming back max_order seems sensible, but I still
like the idea of taking advantage of contiguous pages where possible
(primarily these will be used for ringbuffers and shadow batches).


Can we agree then to try larger order first and fall back gradually?

Regards,

Tvrtko



Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Chris Wilson
On Fri, Oct 14, 2016 at 03:35:59PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 14:53, Chris Wilson wrote:
> >>>We do pass NORETRY | NOWARN for the higher order allocations, so it
> >>>shouldn't be as bad as it seems?
> >>I don't know for sure without looking into the implementation
> >>details. But I assumed even with NORETRY it does some extra work to
> >>try and free up the space. And if it fails, and we ask for it again,
> >>it is just doing that extra work for nothing. Because within a
> >>single allocation it sounds unlikely that something would change so
> >>dramatically that it would start working.
> >iirc, NORETRY means abort after failure. In effect, it does
> >2 attempts from the freelist, a direct reclaim, and may then repeat
> >if the task's allowed set of nodes were concurrently changed.
> 
> Do you think it makes sense doing all that after it started failing,
> within our single get_pages allocation?

I was thinking about skipping the DIRECT_RECLAIM for high order, but it
seems like that is beneficial for THP, so I'm presuming it should also
be for us. Trimming back max_order seems sensible, but I still
like the idea of taking advantage of contiguous pages where possible
(primarily these will be used for ringbuffers and shadow batches).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
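
A minimal sketch of the variant floated above, i.e. keeping direct reclaim
for the final order-0 attempt but skipping it for the opportunistic
high-order attempts. The helper name is illustrative and this is not code
from the series.

#include <linux/gfp.h>

/*
 * Pick the gfp mask for a single attempt at the given order.  Order-0
 * keeps the full base behaviour (direct reclaim, warnings); high-order
 * attempts fail fast, stay quiet and, in the variant discussed above,
 * skip direct reclaim entirely.
 */
static gfp_t chunk_gfp(gfp_t base, int order)
{
        if (!order)
                return base;

        return (base | __GFP_NORETRY | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
}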


Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Tvrtko Ursulin


On 14/10/2016 14:53, Chris Wilson wrote:

On Fri, Oct 14, 2016 at 02:44:00PM +0100, Tvrtko Ursulin wrote:

On 14/10/2016 13:54, Chris Wilson wrote:

On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:

On 14/10/2016 13:18, Chris Wilson wrote:

+   gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
+   if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
+   /* 965gm cannot relocate objects above 4GiB. */
+   gfp &= ~__GFP_HIGHMEM;
+   gfp |= __GFP_DMA32;
+   }
+
+   do {
+   int order = min(fls(npages) - 1, max_order);

I still have reservations about going back to max_order when previous
chunks required an order decrease. The size of the object is unbounded
since it is indirectly controlled by userspace, correct? How about
decreasing max_order on every order decrease that follows a failed
allocation?

+   struct page *page;
+
+   do {
+   page = alloc_pages(gfp | (order ? QUIET : 0), order);
+   if (page)
+   break;
+   if (!order--)
+   goto err;

Like:
/* Limit future allocations as well */
max_order = order;


+   } while (1);

We do pass NORETRY | NOWARN for the higher order allocations, so it
shouldn't be as bad as it seems?

I don't know for sure without looking into the implementation
details. But I assumed even with NORETRY it does some extra work to
try and free up the space. And if it fails, and we ask for it again,
it is just doing that extra work for nothing. Because within a
single allocation it sounds unlikely that something would change so
dramatically that it would start working.

iirc, NORETRY means abort after failure. In effect, it does
2 attempts from the freelist, a direct reclaim, and may then repeat
if the task's allowed set of nodes were concurrently changed.


Do you think it makes sense doing all that after it started failing, 
within our single get_pages allocation?


Regards,

Tvrtko



Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Chris Wilson
On Fri, Oct 14, 2016 at 02:44:00PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 13:54, Chris Wilson wrote:
> >On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:
> >>On 14/10/2016 13:18, Chris Wilson wrote:
> >>>+  gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
> >>>+  if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
> >>>+  /* 965gm cannot relocate objects above 4GiB. */
> >>>+  gfp &= ~__GFP_HIGHMEM;
> >>>+  gfp |= __GFP_DMA32;
> >>>+  }
> >>>+
> >>>+  do {
> >>>+  int order = min(fls(npages) - 1, max_order);
> >>I still have reservations about going back to max_order when previous
> >>chunks required an order decrease. The size of the object is unbounded
> >>since it is indirectly controlled by userspace, correct? How about
> >>decreasing max_order on every order decrease that follows a failed
> >>allocation?
> >>>+  struct page *page;
> >>>+
> >>>+  do {
> >>>+  page = alloc_pages(gfp | (order ? QUIET : 0), order);
> >>>+  if (page)
> >>>+  break;
> >>>+  if (!order--)
> >>>+  goto err;
> >Like:
> > /* Limit future allocations as well */
> > max_order = order;
> >
> >>>+  } while (1);
> >We do pass NORETRY | NOWARN for the higher order allocations, so it
> >shouldn't be as bad as it seems?
> 
> I don't know for sure without looking into the implementation
> details. But I assumed even with NORETRY it does some extra work to
> try and free up the space. And if it fails, and we ask for it again,
> it is just doing that extra work for nothing. Because within a
> single allocation it sounds unlikely that something would change so
> dramatically that it would start working.

iirc, NORETRY means abort after failure. In effect, it does
2 attempts from the freelist, a direct reclaim, and may then repeat
if the task's allowed set of nodes were concurrently changed.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Tvrtko Ursulin


On 14/10/2016 13:54, Chris Wilson wrote:

On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:

On 14/10/2016 13:18, Chris Wilson wrote:

+   gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
+   if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
+   /* 965gm cannot relocate objects above 4GiB. */
+   gfp &= ~__GFP_HIGHMEM;
+   gfp |= __GFP_DMA32;
+   }
+
+   do {
+   int order = min(fls(npages) - 1, max_order);

I still have reservations about going back to max_order when previous
chunks required an order decrease. The size of the object is unbounded
since it is indirectly controlled by userspace, correct? How about
decreasing max_order on every order decrease that follows a failed
allocation?

+   struct page *page;
+
+   do {
+   page = alloc_pages(gfp | (order ? QUIET : 0), order);
+   if (page)
+   break;
+   if (!order--)
+   goto err;

Like:
/* Limit future allocations as well */
max_order = order;


+   } while (1);

We do pass NORETRY | NOWARN for the higher order allocations, so it
shouldn't be as bad as it seems?


I don't know for sure without looking into the implementation details. 
But I assumed even with NORETRY it does some extra work to try and free 
up the space. And if it fails, and we ask for it again, it is just doing 
that extra work for nothing. Because within a single allocation it 
sounds unlikely that something would change so dramatically that it 
would start working.


So yes, permanently limiting would sound safer to me.

Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Chris Wilson
On Fri, Oct 14, 2016 at 01:42:35PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/10/2016 13:18, Chris Wilson wrote:
> >+gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_RECLAIMABLE;
> >+if (IS_CRESTLINE(i915) || IS_BROADWATER(i915)) {
> >+/* 965gm cannot relocate objects above 4GiB. */
> >+gfp &= ~__GFP_HIGHMEM;
> >+gfp |= __GFP_DMA32;
> >+}
> >+
> >+do {
> >+int order = min(fls(npages) - 1, max_order);
> 
> I still have reservations about going back to max_order when previous
> chunks required an order decrease. The size of the object is unbounded
> since it is indirectly controlled by userspace, correct? How about
> decreasing max_order on every order decrease that follows a failed
> allocation?

> >+struct page *page;
> >+
> >+do {
> >+page = alloc_pages(gfp | (order ? QUIET : 0), order);
> >+if (page)
> >+break;
> >+if (!order--)
> >+goto err;

Like:
/* Limit future allocations as well */
max_order = order;

> >+} while (1);

We do pass NORETRY | NOWARN for the higher order allocations, so it
shouldn't be as bad as it seems?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
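
For reference, the QUIET shorthand in the hunk quoted above is not defined
anywhere in this excerpt; going by the remark that NORETRY | NOWARN is
passed for the higher-order allocations, it presumably expands to something
like the following (an assumption, not quoted from the patch):

#define QUIET (__GFP_NORETRY | __GFP_NOWARN)

That is, the opportunistic high-order attempts bail out early and suppress
the allocation-failure warning, while the final order-0 attempt keeps the
default behaviour.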


Re: [Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Tvrtko Ursulin


On 14/10/2016 13:18, Chris Wilson wrote:

Quite a few of our objects used for internal hardware programming do not
benefit from being swappable or from being zero initialised. As such
they do not benefit from using shmemfs backing storage, and since they
are internal and never directly exposed to the user, we do not need to
worry about providing a filp. For these we can use a
drm_i915_gem_object wrapper around a sg_table of plain struct page. They
are not swap-backed and not automatically pinned. If they are reaped
by the shrinker, the pages are released and the contents discarded. For
the internal use case this is fine since, for example, ringbuffers are
pinned from the point a request writes them until the hardware has read
them. Once
they are idle, they can be discarded entirely. As such they are a good
match for execlist ringbuffers and a small variety of other internal
objects.

In the first iteration, this is limited to the scratch batch buffers we
use (for command parsing and state initialisation).

v2: Allocate physically contiguous pages, where possible.

Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/Makefile|   1 +
  drivers/gpu/drm/i915/i915_drv.h  |   5 +
  drivers/gpu/drm/i915/i915_gem_batch_pool.c   |  27 ++---
  drivers/gpu/drm/i915/i915_gem_internal.c | 164 +++
  drivers/gpu/drm/i915/i915_gem_render_state.c |   2 +-
  drivers/gpu/drm/i915/intel_engine_cs.c   |   2 +-
  drivers/gpu/drm/i915/intel_ringbuffer.c  |  14 ++-
  7 files changed, 191 insertions(+), 24 deletions(-)
  create mode 100644 drivers/gpu/drm/i915/i915_gem_internal.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 8790ae4fb171..efaf25b984af 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -35,6 +35,7 @@ i915-y += i915_cmd_parser.o \
  i915_gem_execbuffer.o \
  i915_gem_fence.o \
  i915_gem_gtt.o \
+ i915_gem_internal.o \
  i915_gem.o \
  i915_gem_render_state.o \
  i915_gem_request.o \
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4c0c8ef6c084..38467dde1efe 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3540,6 +3540,11 @@ i915_gem_object_create_stolen_for_preallocated(struct drm_device *dev,
   u32 gtt_offset,
   u32 size);
  
+/* i915_gem_internal.c */

+struct drm_i915_gem_object *
+i915_gem_object_create_internal(struct drm_i915_private *dev_priv,
+   unsigned int size);
+
  /* i915_gem_shrinker.c */
  unsigned long i915_gem_shrink(struct drm_i915_private *dev_priv,
  unsigned long target,
diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
index cb25cad3318c..aa4e1e043b4e 100644
--- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
+++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
@@ -97,9 +97,9 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
size_t size)
  {
struct drm_i915_gem_object *obj = NULL;
-   struct drm_i915_gem_object *tmp, *next;
+   struct drm_i915_gem_object *tmp;
struct list_head *list;
-   int n;
+   int n, ret;
  
  	lockdep_assert_held(&pool->engine->i915->drm.struct_mutex);
  
@@ -112,19 +112,12 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,

n = ARRAY_SIZE(pool->cache_list) - 1;
list = &pool->cache_list[n];
  
-	list_for_each_entry_safe(tmp, next, list, batch_pool_link) {

+   list_for_each_entry(tmp, list, batch_pool_link) {
/* The batches are strictly LRU ordered */
if (!i915_gem_active_is_idle(&tmp->last_read[pool->engine->id],
 &tmp->base.dev->struct_mutex))
break;
  
-		/* While we're looping, do some clean up */

-   if (tmp->madv == __I915_MADV_PURGED) {
-   list_del(&tmp->batch_pool_link);
-   i915_gem_object_put(tmp);
-   continue;
-   }
-
if (tmp->base.size >= size) {
obj = tmp;
break;
@@ -132,19 +125,15 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
}
  
  	if (obj == NULL) {

-   int ret;
-
-   obj = i915_gem_object_create(&pool->engine->i915->drm, size);
+   obj = i915_gem_object_create_internal(pool->engine->i915, size);
if (IS_ERR(obj))
return obj;
-
-   ret = i915_gem_object_get_pages(obj);
-   if (ret)
-   return ERR_PTR(ret);
-
-   obj->madv = I915_MADV_DONTNEED;
}
  
+	ret = i915_gem_object_get_pages(obj);

[Intel-gfx] [PATCH 11/41] drm/i915: Introduce an internal allocator for disposable private objects

2016-10-14 Thread Chris Wilson
Quite a few of our objects used for internal hardware programming do not
benefit from being swappable or from being zero initialised. As such
they do not benefit from using shmemfs backing storage, and since they
are internal and never directly exposed to the user, we do not need to
worry about providing a filp. For these we can use a
drm_i915_gem_object wrapper around a sg_table of plain struct page. They
are not swap-backed and not automatically pinned. If they are reaped
by the shrinker, the pages are released and the contents discarded. For
the internal use case this is fine since, for example, ringbuffers are
pinned from the point a request writes them until the hardware has read
them. Once
they are idle, they can be discarded entirely. As such they are a good
match for execlist ringbuffers and a small variety of other internal
objects.

In the first iteration, this is limited to the scratch batch buffers we
use (for command parsing and state initialisation).

v2: Allocate physically contiguous pages, where possible.

Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/Makefile|   1 +
 drivers/gpu/drm/i915/i915_drv.h  |   5 +
 drivers/gpu/drm/i915/i915_gem_batch_pool.c   |  27 ++---
 drivers/gpu/drm/i915/i915_gem_internal.c | 164 +++
 drivers/gpu/drm/i915/i915_gem_render_state.c |   2 +-
 drivers/gpu/drm/i915/intel_engine_cs.c   |   2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c  |  14 ++-
 7 files changed, 191 insertions(+), 24 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_gem_internal.c
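
The new drivers/gpu/drm/i915/i915_gem_internal.c is not included in this
excerpt. For orientation only, a rough sketch of how a get_pages
implementation along the lines described in the commit message might
collect (preferably high-order) pages into an sg_table is given below; the
function name, parameters and structure are illustrative assumptions, not
the patch's actual code.

#include <linux/bitops.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/*
 * Allocate npages worth of backing store as a few physically contiguous
 * chunks where possible, recording each chunk as one scatterlist entry.
 * Error handling is abbreviated: real code would also free the pages
 * already collected before bailing out.
 */
static struct sg_table *
internal_get_pages_sketch(unsigned int npages, int max_order, gfp_t gfp)
{
        struct sg_table *st;
        struct scatterlist *sg;

        st = kmalloc(sizeof(*st), GFP_KERNEL);
        if (!st)
                return NULL;

        if (sg_alloc_table(st, npages, GFP_KERNEL)) {
                kfree(st);
                return NULL;
        }

        sg = st->sgl;
        st->nents = 0;

        do {
                int order = min_t(int, fls(npages) - 1, max_order);
                struct page *page;

                do {
                        page = alloc_pages(gfp |
                                           (order ? __GFP_NORETRY | __GFP_NOWARN : 0),
                                           order);
                        if (page)
                                break;
                        if (!order--)
                                goto err; /* even order-0 failed */
                } while (1);

                sg_set_page(sg, page, PAGE_SIZE << order, 0);
                st->nents++;

                npages -= 1 << order;
                if (!npages) {
                        sg_mark_end(sg);
                        return st;
                }

                sg = sg_next(sg);
        } while (1);

err:
        sg_free_table(st);
        kfree(st);
        return NULL;
}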

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 8790ae4fb171..efaf25b984af 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -35,6 +35,7 @@ i915-y += i915_cmd_parser.o \
  i915_gem_execbuffer.o \
  i915_gem_fence.o \
  i915_gem_gtt.o \
+ i915_gem_internal.o \
  i915_gem.o \
  i915_gem_render_state.o \
  i915_gem_request.o \
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 4c0c8ef6c084..38467dde1efe 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3540,6 +3540,11 @@ i915_gem_object_create_stolen_for_preallocated(struct drm_device *dev,
   u32 gtt_offset,
   u32 size);
 
+/* i915_gem_internal.c */
+struct drm_i915_gem_object *
+i915_gem_object_create_internal(struct drm_i915_private *dev_priv,
+   unsigned int size);
+
 /* i915_gem_shrinker.c */
 unsigned long i915_gem_shrink(struct drm_i915_private *dev_priv,
  unsigned long target,
diff --git a/drivers/gpu/drm/i915/i915_gem_batch_pool.c b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
index cb25cad3318c..aa4e1e043b4e 100644
--- a/drivers/gpu/drm/i915/i915_gem_batch_pool.c
+++ b/drivers/gpu/drm/i915/i915_gem_batch_pool.c
@@ -97,9 +97,9 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
size_t size)
 {
struct drm_i915_gem_object *obj = NULL;
-   struct drm_i915_gem_object *tmp, *next;
+   struct drm_i915_gem_object *tmp;
struct list_head *list;
-   int n;
+   int n, ret;
 
lockdep_assert_held(&pool->engine->i915->drm.struct_mutex);
 
@@ -112,19 +112,12 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
n = ARRAY_SIZE(pool->cache_list) - 1;
list = &pool->cache_list[n];
 
-   list_for_each_entry_safe(tmp, next, list, batch_pool_link) {
+   list_for_each_entry(tmp, list, batch_pool_link) {
/* The batches are strictly LRU ordered */
if (!i915_gem_active_is_idle(&tmp->last_read[pool->engine->id],
 &tmp->base.dev->struct_mutex))
break;
 
-   /* While we're looping, do some clean up */
-   if (tmp->madv == __I915_MADV_PURGED) {
-   list_del(&tmp->batch_pool_link);
-   i915_gem_object_put(tmp);
-   continue;
-   }
-
if (tmp->base.size >= size) {
obj = tmp;
break;
@@ -132,19 +125,15 @@ i915_gem_batch_pool_get(struct i915_gem_batch_pool *pool,
}
 
if (obj == NULL) {
-   int ret;
-
-   obj = i915_gem_object_create(&pool->engine->i915->drm, size);
+   obj = i915_gem_object_create_internal(pool->engine->i915, size);
if (IS_ERR(obj))
return obj;
-
-   ret = i915_gem_object_get_pages(obj);
-   if (ret)
-   return ERR_PTR(ret);
-
-   obj->madv = I915_MADV_DONTNEED;
}
 
+   ret = i915_gem_object_get_pages(obj);
+   if (ret)
+