Re: [Intel-gfx] [RFC 15/44] drm/i915: Added deferred work handler for scheduler

2014-07-25 Thread Daniel Vetter
On Thu, Jul 24, 2014 at 04:42:55PM +0100, John Harrison wrote:
 
 On 23/07/2014 19:50, Daniel Vetter wrote:
 On Wed, Jul 23, 2014 at 5:37 PM, John Harrison
 john.c.harri...@intel.com wrote:
diff --git a/drivers/gpu/drm/i915/i915_drv.h
 b/drivers/gpu/drm/i915/i915_drv.h
 index 0977653..fbafa68 100644
 --- a/drivers/gpu/drm/i915/i915_drv.h
 +++ b/drivers/gpu/drm/i915/i915_drv.h
 @@ -1075,6 +1075,16 @@ struct i915_gem_mm {
  struct delayed_work idle_work;
  /**
 +* New scheme is to get an interrupt after every work packet
 +* in order to allow the low latency scheduling of pending
 +* packets. The idea behind adding new packets to a pending
 +* queue rather than directly into the hardware ring buffer
 +* is to allow high priority packets to overtake low priority
 +* ones.
 +*/
 +   struct work_struct scheduler_work;
 Latency for work items isn't too awesome, and e.g. Oscar's execlist code
 latches the next context right away from the irq handler. Why can't we do
 something similar for the scheduler? Fishing the next item out of a
 priority queue shouldn't be expensive ...
 -Daniel
 
 The problem is that taking batch buffers from the scheduler's queue and
 submitting them to the hardware requires lots of processing that is not IRQ
 compatible. It isn't just a simple register write. Half of the code in
 'i915_gem_do_execbuffer()' must be executed. It could possibly be made IRQ
 friendly, but that would place a lot of restrictions on a lot of
 code that currently doesn't expect to be restricted. Instead, the submission
 is done via a work handler that acquires the driver mutex lock.
 
 In order to cover the extra latency, the scheduler operates in a
 multi-buffered mode and aims to keep eight batch buffers in flight at all
 times. That number was obtained empirically by running lots of benchmarks
 on Android with lots of different settings and seeing where the buffer size
 stopped making a difference.
 So I've tried to stitch together that part of the scheduler from the
 patch series. Afaics you do the actual scheduling under the protection
 of irqsave spinlocks (well you also hold the dev->struct_mutex). That
 means you disable local interrupts. Up to the actual submit point I
 spotted two such critical sections encompassing pretty much all the
 code.
 
 If we'd run the same code from the interrupt handler then only our own
 interrupt handler is blocked, all other interrupt processing can
 continue. So that's actually a lot nicer than what you have. In any
 case you can't do expensive operations under an irqsave spinlock
 anyway.
 
 So either I've missed something big here, or this justification doesn't hold up.
 
 The irqsave spinlock is only held while manipulating the internal scheduler
 data structures. It is released immediately prior to calling
 i915_gem_do_execbuffer_final(). So the actual submission code path is done
 with the driver mutex but no spinlocks. I'm sure I got 'scheduling while
 atomic' bug checks the one time I accidentally left the spinlock held.

Ok, missed something ;-)

btw for big patch series please upload them somewhere on a public git
(github or so). Generally I review patches only by reading them because
trying to apply them usually results in some conflicts. But with big patch
series like this here that doesn't work, especially when everything is
tightly coupled (iirc I had to jump around in about 10 different patches
to figure out what the work handler looks like). Or maybe I didn't
understand the patch split flow and read it backwards ;-)
-Daniel
 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [RFC 15/44] drm/i915: Added deferred work handler for scheduler

2014-07-24 Thread John Harrison


On 23/07/2014 19:50, Daniel Vetter wrote:

On Wed, Jul 23, 2014 at 5:37 PM, John Harrison
john.c.harri...@intel.com wrote:

   diff --git a/drivers/gpu/drm/i915/i915_drv.h
b/drivers/gpu/drm/i915/i915_drv.h
index 0977653..fbafa68 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1075,6 +1075,16 @@ struct i915_gem_mm {
 struct delayed_work idle_work;
 /**
+* New scheme is to get an interrupt after every work packet
+* in order to allow the low latency scheduling of pending
+* packets. The idea behind adding new packets to a pending
+* queue rather than directly into the hardware ring buffer
+* is to allow high priority packets to overtake low priority
+* ones.
+*/
+   struct work_struct scheduler_work;

Latency for work items isn't too awesome, and e.g. Oscar's execlist code
latches the next context right away from the irq handler. Why can't we do
something similar for the scheduler? Fishing the next item out of a
priority queue shouldn't be expensive ...
-Daniel


The problem is that taking batch buffers from the scheduler's queue and
submitting them to the hardware requires lots of processing that is not IRQ
compatible. It isn't just a simple register write. Half of the code in
'i915_gem_do_execbuffer()' must be executed. It could possibly be made IRQ
friendly, but that would place a lot of restrictions on a lot of
code that currently doesn't expect to be restricted. Instead, the submission
is done via a work handler that acquires the driver mutex lock.

In order to cover the extra latency, the scheduler operates in a
multi-buffered mode and aims to keep eight batch buffers in flight at all
times. That number was obtained empirically by running lots of benchmarks
on Android with lots of different settings and seeing where the buffer size
stopped making a difference.

So I've tried to stitch together that part of the scheduler from the
patch series. Afaics you do the actual scheduling under the protection
of irqsave spinlocks (well you also hold the dev->struct_mutex). That
means you disable local interrupts. Up to the actual submit point I
spotted two such critical sections encompassing pretty much all the
code.

If we'd run the same code from the interrupt handler then only our own
interrupt handler is blocked, all other interrupt processing can
continue. So that's actually a lot nicer than what you have. In any
case you can't do expensive operations under an irqsave spinlock
anyway.

So either I've missed something big here, or this justification doesn't hold up.
-Daniel


The irqsave spinlock is only held while manipulating the internal 
scheduler data structures. It is released immediately prior to calling 
i915_gem_do_execbuffer_final(). So the actual submission code path is 
done with the driver mutex but no spinlocks. I'm sure I got 'scheduling 
while atomic' bug checks the one time I accidentally left the spinlock held.
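
[Illustrative sketch, not part of the series: roughly the flow described above. The
type, lock and helper names (i915_scheduler_queue_entry, scheduler->lock,
i915_scheduler_pop_node(), node->params, dev_priv->scheduler) and the exact
signature of i915_gem_do_execbuffer_final() are assumptions for illustration only;
the real i915_scheduler_remove() in this RFC is still a stub.]

static int i915_scheduler_submit_next(struct intel_engine_cs *ring)
{
	struct drm_i915_private *dev_priv = ring->dev->dev_private;
	struct i915_scheduler *scheduler = dev_priv->scheduler;	/* assumed field */
	struct i915_scheduler_queue_entry *node;
	unsigned long flags;

	/* Called from the work handler, so dev->struct_mutex is already held. */

	/* The irqsave spinlock only guards the scheduler's own data structures. */
	spin_lock_irqsave(&scheduler->lock, flags);
	node = i915_scheduler_pop_node(scheduler, ring);	/* hypothetical helper */
	spin_unlock_irqrestore(&scheduler->lock, flags);

	if (!node)
		return 0;

	/*
	 * The actual submission runs outside the spinlock, with only the
	 * driver mutex held -- it can sleep, so it must not be atomic.
	 */
	return i915_gem_do_execbuffer_final(ring->dev, node->params);
}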




Re: [Intel-gfx] [RFC 15/44] drm/i915: Added deferred work handler for scheduler

2014-07-23 Thread John Harrison


On 07/07/2014 20:14, Daniel Vetter wrote:

On Thu, Jun 26, 2014 at 06:24:06PM +0100, john.c.harri...@intel.com wrote:

From: John Harrison john.c.harri...@intel.com

The scheduler needs to do interrupt triggered work that is too complex to do in
the interrupt handler. Thus it requires a deferred work handler to process this
work asynchronously.
---
  drivers/gpu/drm/i915/i915_dma.c   |3 +++
  drivers/gpu/drm/i915/i915_drv.h   |   10 ++
  drivers/gpu/drm/i915/i915_gem.c   |   27 +++
  drivers/gpu/drm/i915/i915_scheduler.c |7 +++
  drivers/gpu/drm/i915/i915_scheduler.h |1 +
  5 files changed, 48 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 1668316..d1356f3 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1813,6 +1813,9 @@ int i915_driver_unload(struct drm_device *dev)
	WARN_ON(unregister_oom_notifier(&dev_priv->mm.oom_notifier));
	unregister_shrinker(&dev_priv->mm.shrinker);

+	/* Cancel the scheduler work handler, which should be idle now. */
+	cancel_work_sync(&dev_priv->mm.scheduler_work);
+
	io_mapping_free(dev_priv->gtt.mappable);
	arch_phys_wc_del(dev_priv->gtt.mtrr);
  
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h

index 0977653..fbafa68 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1075,6 +1075,16 @@ struct i915_gem_mm {
struct delayed_work idle_work;
  
  	/**

+* New scheme is to get an interrupt after every work packet
+* in order to allow the low latency scheduling of pending
+* packets. The idea behind adding new packets to a pending
+* queue rather than directly into the hardware ring buffer
+* is to allow high priority packets to overtake low priority
+* ones.
+*/
+   struct work_struct scheduler_work;

Latency for work items isn't too awesome, and e.g. Oscar's execlist code
latches the next context right away from the irq handler. Why can't we do
something similar for the scheduler? Fishing the next item out of a
priority queue shouldn't be expensive ...
-Daniel


The problem is that taking batch buffers from the scheduler's queue and 
submitting them to the hardware requires lots of processing that is not 
IRQ compatible. It isn't just a simple register write. Half of the code 
in 'i915_gem_do_execbuffer()' must be executed. It could possibly be made 
IRQ friendly, but that would place a lot of restrictions on 
a lot of code that currently doesn't expect to be restricted. Instead, 
the submission is done via a work handler that acquires the driver mutex 
lock.


In order to cover the extra latency, the scheduler operates in a 
multi-buffered mode and aims to keep eight batch buffers in flight at 
all times. That number was obtained empirically by running lots of 
benchmarks on Android with lots of different settings and seeing where 
the buffer size stopped making a difference.


John.
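
[Illustrative sketch, not part of the patch: the hunks quoted here add the handler,
its INIT_WORK() and the cancel_work_sync() on unload, but not the point where the
work item is kicked. Presumably the per-ring interrupt notification path does
something along these lines; the function name is made up and the real hook-up
lives elsewhere in the series.]

static void i915_scheduler_notify_ring(struct intel_engine_cs *ring)
{
	struct drm_i915_private *dev_priv = ring->dev->dev_private;

	/*
	 * Too much work to do safely in the IRQ handler itself, so just poke
	 * the deferred handler; it runs in process context, takes
	 * dev->struct_mutex and submits whatever the scheduler decides is
	 * ready.
	 */
	schedule_work(&dev_priv->mm.scheduler_work);
}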





+
+   /**
 * Are we in a non-interruptible section of code like
 * modesetting?
 */
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index fece5e7..57b24f0 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2712,6 +2712,29 @@ i915_gem_idle_work_handler(struct work_struct *work)
	intel_mark_idle(dev_priv->dev);
  }
  
+#ifdef CONFIG_DRM_I915_SCHEDULER

+static void
+i915_gem_scheduler_work_handler(struct work_struct *work)
+{
+   struct intel_engine_cs  *ring;
+   struct drm_i915_private *dev_priv;
+   struct drm_device   *dev;
+   int i;
+
+	dev_priv = container_of(work, struct drm_i915_private,
+				mm.scheduler_work);
+	dev = dev_priv->dev;
+
+	mutex_lock(&dev->struct_mutex);
+
+	/* Do stuff: */
+	for_each_ring(ring, dev_priv, i) {
+		i915_scheduler_remove(ring);
+	}
+
+	mutex_unlock(&dev->struct_mutex);
+}
+#endif
+
  /**
   * Ensures that an object will eventually get non-busy by flushing any 
required
   * write domains, emitting any outstanding lazy request and retiring and
@@ -4916,6 +4939,10 @@ i915_gem_load(struct drm_device *dev)
  i915_gem_retire_work_handler);
	INIT_DELAYED_WORK(&dev_priv->mm.idle_work,
			  i915_gem_idle_work_handler);
+#ifdef CONFIG_DRM_I915_SCHEDULER
+	INIT_WORK(&dev_priv->mm.scheduler_work,
+		  i915_gem_scheduler_work_handler);
+#endif
	init_waitqueue_head(&dev_priv->gpu_error.reset_queue);
  
  	/* On GEN3 we really need to make sure the ARB C3 LP bit is set */

diff --git a/drivers/gpu/drm/i915/i915_scheduler.c 
b/drivers/gpu/drm/i915/i915_scheduler.c
index 66a6568..37f8a98 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ 

Re: [Intel-gfx] [RFC 15/44] drm/i915: Added deferred work handler for scheduler

2014-07-23 Thread Daniel Vetter
On Wed, Jul 23, 2014 at 5:37 PM, John Harrison
john.c.harri...@intel.com wrote:
   diff --git a/drivers/gpu/drm/i915/i915_drv.h
 b/drivers/gpu/drm/i915/i915_drv.h
 index 0977653..fbafa68 100644
 --- a/drivers/gpu/drm/i915/i915_drv.h
 +++ b/drivers/gpu/drm/i915/i915_drv.h
 @@ -1075,6 +1075,16 @@ struct i915_gem_mm {
 struct delayed_work idle_work;
 /**
 +* New scheme is to get an interrupt after every work packet
 +* in order to allow the low latency scheduling of pending
 +* packets. The idea behind adding new packets to a pending
 +* queue rather than directly into the hardware ring buffer
 +* is to allow high priority packets to overtake low priority
 +* ones.
 +*/
 +   struct work_struct scheduler_work;

 Latency for work items isn't too awesome, and e.g. Oscar's execlist code
 latches the next context right away from the irq handler. Why can't we do
 something similar for the scheduler? Fishing the next item out of a
 priority queue shouldn't be expensive ...
 -Daniel


 The problem is that taking batch buffers from the scheduler's queue and
 submitting them to the hardware requires lots of processing that is not IRQ
 compatible. It isn't just a simple register write. Half of the code in
 'i915_gem_do_execbuffer()' must be executed. It could possibly be made IRQ
 friendly, but that would place a lot of restrictions on a lot of
 code that currently doesn't expect to be restricted. Instead, the submission
 is done via a work handler that acquires the driver mutex lock.

 In order to cover the extra latency, the scheduler operates in a
 multi-buffered mode and aims to keep eight batch buffers in flight at all
 times. That number was obtained empirically by running lots of benchmarks
 on Android with lots of different settings and seeing where the buffer size
 stopped making a difference.

So I've tried to stitch together that part of the scheduler from the
patch series. Afaics you do the actual scheduling under the protection
of irqsave spinlocks (well you also hold the dev->struct_mutex). That
means you disable local interrupts. Up to the actual submit point I
spotted two such critical sections encompassing pretty much all the
code.

If we'd run the same code from the interrupt handler then only our own
interrupt handler is blocked, all other interrupt processing can
continue. So that's actually a lot nicer than what you have. In any
case you can't do expensive operations under an irqsave spinlock
anyway.

So either I've missed something big here, or this justification doesn't hold up.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [RFC 15/44] drm/i915: Added deferred work handler for scheduler

2014-07-07 Thread Daniel Vetter
On Thu, Jun 26, 2014 at 06:24:06PM +0100, john.c.harri...@intel.com wrote:
 From: John Harrison john.c.harri...@intel.com
 
 The scheduler needs to do interrupt triggered work that is too complex to do
 in the interrupt handler. Thus it requires a deferred work handler to process
 this work asynchronously.
 ---
  drivers/gpu/drm/i915/i915_dma.c   |3 +++
  drivers/gpu/drm/i915/i915_drv.h   |   10 ++
  drivers/gpu/drm/i915/i915_gem.c   |   27 +++
  drivers/gpu/drm/i915/i915_scheduler.c |7 +++
  drivers/gpu/drm/i915/i915_scheduler.h |1 +
  5 files changed, 48 insertions(+)
 
 diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
 index 1668316..d1356f3 100644
 --- a/drivers/gpu/drm/i915/i915_dma.c
 +++ b/drivers/gpu/drm/i915/i915_dma.c
 @@ -1813,6 +1813,9 @@ int i915_driver_unload(struct drm_device *dev)
 	WARN_ON(unregister_oom_notifier(&dev_priv->mm.oom_notifier));
 	unregister_shrinker(&dev_priv->mm.shrinker);
 
 +	/* Cancel the scheduler work handler, which should be idle now. */
 +	cancel_work_sync(&dev_priv->mm.scheduler_work);
 +
 	io_mapping_free(dev_priv->gtt.mappable);
 	arch_phys_wc_del(dev_priv->gtt.mtrr);
  
 diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
 index 0977653..fbafa68 100644
 --- a/drivers/gpu/drm/i915/i915_drv.h
 +++ b/drivers/gpu/drm/i915/i915_drv.h
 @@ -1075,6 +1075,16 @@ struct i915_gem_mm {
   struct delayed_work idle_work;
  
   /**
 +  * New scheme is to get an interrupt after every work packet
 +  * in order to allow the low latency scheduling of pending
 +  * packets. The idea behind adding new packets to a pending
 +  * queue rather than directly into the hardware ring buffer
 +  * is to allow high priority packets to overtake low priority
 +  * ones.
 +  */
 + struct work_struct scheduler_work;

Latency for work items isn't too awesome, and e.g. Oscar's execlist code
latches the next context right away from the irq handler. Why can't we do
something similar for the scheduler? Fishing the next item out of a
priority queue shouldn't be expensive ...
-Daniel

 +
 + /**
* Are we in a non-interruptible section of code like
* modesetting?
*/
 diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
 index fece5e7..57b24f0 100644
 --- a/drivers/gpu/drm/i915/i915_gem.c
 +++ b/drivers/gpu/drm/i915/i915_gem.c
 @@ -2712,6 +2712,29 @@ i915_gem_idle_work_handler(struct work_struct *work)
 	intel_mark_idle(dev_priv->dev);
  }
  
 +#ifdef CONFIG_DRM_I915_SCHEDULER
 +static void
 +i915_gem_scheduler_work_handler(struct work_struct *work)
 +{
 + struct intel_engine_cs  *ring;
 + struct drm_i915_private *dev_priv;
 + struct drm_device   *dev;
 + int i;
 +
 +	dev_priv = container_of(work, struct drm_i915_private,
 +				mm.scheduler_work);
 +	dev = dev_priv->dev;
 +
 +	mutex_lock(&dev->struct_mutex);
 +
 +	/* Do stuff: */
 +	for_each_ring(ring, dev_priv, i) {
 +		i915_scheduler_remove(ring);
 +	}
 +
 +	mutex_unlock(&dev->struct_mutex);
 +}
 +#endif
 +
  /**
   * Ensures that an object will eventually get non-busy by flushing any 
 required
   * write domains, emitting any outstanding lazy request and retiring and
 @@ -4916,6 +4939,10 @@ i915_gem_load(struct drm_device *dev)
 i915_gem_retire_work_handler);
 	INIT_DELAYED_WORK(&dev_priv->mm.idle_work,
 			  i915_gem_idle_work_handler);
 +#ifdef CONFIG_DRM_I915_SCHEDULER
 +	INIT_WORK(&dev_priv->mm.scheduler_work,
 +		  i915_gem_scheduler_work_handler);
 +#endif
 	init_waitqueue_head(&dev_priv->gpu_error.reset_queue);
  
   /* On GEN3 we really need to make sure the ARB C3 LP bit is set */
 diff --git a/drivers/gpu/drm/i915/i915_scheduler.c 
 b/drivers/gpu/drm/i915/i915_scheduler.c
 index 66a6568..37f8a98 100644
 --- a/drivers/gpu/drm/i915/i915_scheduler.c
 +++ b/drivers/gpu/drm/i915/i915_scheduler.c
 @@ -58,6 +58,13 @@ int i915_scheduler_init(struct drm_device *dev)
   return 0;
  }
  
 +int i915_scheduler_remove(struct intel_engine_cs *ring)
 +{
 + /* Do stuff... */
 +
 + return 0;
 +}
 +
  bool i915_scheduler_is_seqno_in_flight(struct intel_engine_cs *ring,
  uint32_t seqno, bool *completed)
  {
 diff --git a/drivers/gpu/drm/i915/i915_scheduler.h 
 b/drivers/gpu/drm/i915/i915_scheduler.h
 index 95641f6..6b2cc51 100644
 --- a/drivers/gpu/drm/i915/i915_scheduler.h
 +++ b/drivers/gpu/drm/i915/i915_scheduler.h
 @@ -38,6 +38,7 @@ struct i915_scheduler {
 	uint32_t	index;
  };
  
 +int i915_scheduler_remove(struct intel_engine_cs *ring);
 bool	i915_scheduler_is_seqno_in_flight(struct intel_engine_cs *ring,
 uint32_t seqno, bool *completed);
  
 -- 
 1.7.9.5