[Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Imre Deak
Currently user space can access GEM buffers mapped to GTT through
existing mappings concurrently while the platform specific suspend
handlers are running.  Since these handlers may change the HW state in a
way that would break such accesses, remove the mappings before calling
the handlers.

Suggested by Ville.

Signed-off-by: Imre Deak imre.d...@intel.com
---
 drivers/gpu/drm/i915/i915_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 4024e16..2d4bb48 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1321,6 +1321,7 @@ static int intel_runtime_suspend(struct device *device)
 */
cancel_work_sync(dev_priv-rps.work);
intel_runtime_pm_disable_interrupts(dev);
+   i915_gem_release_all_mmaps(dev_priv);
 
if (IS_GEN6(dev)) {
ret = 0;
@@ -1340,8 +1341,6 @@ static int intel_runtime_suspend(struct device *device)
return ret;
}
 
-   i915_gem_release_all_mmaps(dev_priv);
-
del_timer_sync(dev_priv-gpu_error.hangcheck_timer);
dev_priv-pm.suspended = true;
 
-- 
1.8.4

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Chris Wilson
On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
 Currently user space can access GEM buffers mapped to GTT through
 existing mappings concurrently while the platform specific suspend
 handlers are running.  Since these handlers may change the HW state in a
 way that would break such accesses, remove the mappings before calling
 the handlers.

Hmm, but you never locked the device, so what is preventing those
concurrent accesses from refaulting in GTT entires anyway. Please explain
the context under which the runtime suspend code executes, and leave
that explanation within easy reach of intel_runtime_suspend() -
preferrably with testing of those assumptions.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Imre Deak
On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote:
 On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
  Currently user space can access GEM buffers mapped to GTT through
  existing mappings concurrently while the platform specific suspend
  handlers are running.  Since these handlers may change the HW state in a
  way that would break such accesses, remove the mappings before calling
  the handlers.
 
 Hmm, but you never locked the device, so what is preventing those
 concurrent accesses from refaulting in GTT entires anyway. Please explain
 the context under which the runtime suspend code executes, and leave
 that explanation within easy reach of intel_runtime_suspend() -
 preferrably with testing of those assumptions.

During faulting we take an RPM reference, so that avoids concurrent
re-faults. I could add a comment about this to the code.

--Imre



signature.asc
Description: This is a digitally signed message part
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Chris Wilson
On Tue, May 06, 2014 at 02:42:26PM +0300, Imre Deak wrote:
 On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote:
  On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
   Currently user space can access GEM buffers mapped to GTT through
   existing mappings concurrently while the platform specific suspend
   handlers are running.  Since these handlers may change the HW state in a
   way that would break such accesses, remove the mappings before calling
   the handlers.
  
  Hmm, but you never locked the device, so what is preventing those
  concurrent accesses from refaulting in GTT entires anyway. Please explain
  the context under which the runtime suspend code executes, and leave
  that explanation within easy reach of intel_runtime_suspend() -
  preferrably with testing of those assumptions.
 
 During faulting we take an RPM reference, so that avoids concurrent
 re-faults. I could add a comment about this to the code.

You are still iterating over lists that are not safe, right? Or are
there more magic rpm references that prevent ioctls whilst
intel_runtime_suspend is being called?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Imre Deak
On Tue, 2014-05-06 at 12:59 +0100, Chris Wilson wrote:
 On Tue, May 06, 2014 at 02:42:26PM +0300, Imre Deak wrote:
  On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote:
   On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
Currently user space can access GEM buffers mapped to GTT through
existing mappings concurrently while the platform specific suspend
handlers are running.  Since these handlers may change the HW state in a
way that would break such accesses, remove the mappings before calling
the handlers.
   
   Hmm, but you never locked the device, so what is preventing those
   concurrent accesses from refaulting in GTT entires anyway. Please explain
   the context under which the runtime suspend code executes, and leave
   that explanation within easy reach of intel_runtime_suspend() -
   preferrably with testing of those assumptions.
  
  During faulting we take an RPM reference, so that avoids concurrent
  re-faults. I could add a comment about this to the code.
 
 You are still iterating over lists that are not safe, right? Or are
 there more magic rpm references that prevent ioctls whilst
 intel_runtime_suspend is being called?

Tbh I haven't checked this, since moving the unmapping earlier doesn't
make a difference in this regard.

But it's a good point, I tried to audit now those paths. Currently the
assumption is that we hold an RPM reference everywhere where those lists
are changed. On the exec buffer path this is true, but for example in
i915_drop_caches_set() it's not.

We could fix this by taking struct_mutex around
i915_gem_release_all_mmaps() in intel_runtime_suspend(). Doing that
needs some more auditing to make sure we can't deadlock somehow. At
first glance it seems that the driver always schedules a deferred work
and calls intel_runtime_suspend() from that, so I think it's fine.

I suggest applying this patch regardless since the two issues are
separate.

--Imre


signature.asc
Description: This is a digitally signed message part
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Daniel Vetter
On Tue, May 06, 2014 at 05:46:01PM +0300, Imre Deak wrote:
 On Tue, 2014-05-06 at 12:59 +0100, Chris Wilson wrote:
  On Tue, May 06, 2014 at 02:42:26PM +0300, Imre Deak wrote:
   On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote:
On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
 Currently user space can access GEM buffers mapped to GTT through
 existing mappings concurrently while the platform specific suspend
 handlers are running.  Since these handlers may change the HW state 
 in a
 way that would break such accesses, remove the mappings before calling
 the handlers.

Hmm, but you never locked the device, so what is preventing those
concurrent accesses from refaulting in GTT entires anyway. Please 
explain
the context under which the runtime suspend code executes, and leave
that explanation within easy reach of intel_runtime_suspend() -
preferrably with testing of those assumptions.
   
   During faulting we take an RPM reference, so that avoids concurrent
   re-faults. I could add a comment about this to the code.
  
  You are still iterating over lists that are not safe, right? Or are
  there more magic rpm references that prevent ioctls whilst
  intel_runtime_suspend is being called?
 
 Tbh I haven't checked this, since moving the unmapping earlier doesn't
 make a difference in this regard.
 
 But it's a good point, I tried to audit now those paths. Currently the
 assumption is that we hold an RPM reference everywhere where those lists
 are changed. On the exec buffer path this is true, but for example in
 i915_drop_caches_set() it's not.
 
 We could fix this by taking struct_mutex around
 i915_gem_release_all_mmaps() in intel_runtime_suspend(). Doing that
 needs some more auditing to make sure we can't deadlock somehow. At
 first glance it seems that the driver always schedules a deferred work
 and calls intel_runtime_suspend() from that, so I think it's fine.
 
 I suggest applying this patch regardless since the two issues are
 separate.

If I understand the situation correctly the runtime suspend function only
ever gets called from a worker thread after the hysteris timeout expired.
Which means we should be able to wrap _just_ the gtt pte shotdown with
dev-struct_mutex and nothing else. Which is good since profileration of
dev-struct_mutex is awful.

On the resume side we don't need any locking since the gtt fault handler
will first grab the runtime reference and also dev-struct_mutex.

One issue which is looming is that this might deadlock. We might need a
trylock in the runtime suspend function and abort the runtime suspend if
we can't get the lock. Please test that lockdep catches this before we
commit to a design.

Just a very quick analysis, I didn't check the details so this might be
horribly wrong.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: remove user GTT mappings early during runtime suspend

2014-05-06 Thread Imre Deak
On Tue, 2014-05-06 at 21:27 +0200, Daniel Vetter wrote:
 On Tue, May 06, 2014 at 05:46:01PM +0300, Imre Deak wrote:
  On Tue, 2014-05-06 at 12:59 +0100, Chris Wilson wrote:
   On Tue, May 06, 2014 at 02:42:26PM +0300, Imre Deak wrote:
On Tue, 2014-05-06 at 12:40 +0100, Chris Wilson wrote:
 On Tue, May 06, 2014 at 02:28:50PM +0300, Imre Deak wrote:
  Currently user space can access GEM buffers mapped to GTT through
  existing mappings concurrently while the platform specific suspend
  handlers are running.  Since these handlers may change the HW state 
  in a
  way that would break such accesses, remove the mappings before 
  calling
  the handlers.
 
 Hmm, but you never locked the device, so what is preventing those
 concurrent accesses from refaulting in GTT entires anyway. Please 
 explain
 the context under which the runtime suspend code executes, and leave
 that explanation within easy reach of intel_runtime_suspend() -
 preferrably with testing of those assumptions.

During faulting we take an RPM reference, so that avoids concurrent
re-faults. I could add a comment about this to the code.
   
   You are still iterating over lists that are not safe, right? Or are
   there more magic rpm references that prevent ioctls whilst
   intel_runtime_suspend is being called?
  
  Tbh I haven't checked this, since moving the unmapping earlier doesn't
  make a difference in this regard.
  
  But it's a good point, I tried to audit now those paths. Currently the
  assumption is that we hold an RPM reference everywhere where those lists
  are changed. On the exec buffer path this is true, but for example in
  i915_drop_caches_set() it's not.
  
  We could fix this by taking struct_mutex around
  i915_gem_release_all_mmaps() in intel_runtime_suspend(). Doing that
  needs some more auditing to make sure we can't deadlock somehow. At
  first glance it seems that the driver always schedules a deferred work
  and calls intel_runtime_suspend() from that, so I think it's fine.
  
  I suggest applying this patch regardless since the two issues are
  separate.
 
 If I understand the situation correctly the runtime suspend function only
 ever gets called from a worker thread after the hysteris timeout expired.
 Which means we should be able to wrap _just_ the gtt pte shotdown with
 dev-struct_mutex and nothing else. Which is good since profileration of
 dev-struct_mutex is awful.

 On the resume side we don't need any locking since the gtt fault handler
 will first grab the runtime reference and also dev-struct_mutex.
 
 One issue which is looming is that this might deadlock. We might need a
 trylock in the runtime suspend function and abort the runtime suspend if
 we can't get the lock. Please test that lockdep catches this before we
 commit to a design.

After looking some more at different possible solutions and the RPM core
this looks the easiest way. There could be cleaner ones, for example
rearranging the order everywhere of getting struct_mutex and RPM ref in
the same order, so that we always get the RPM ref outside of
struct_mutex. By this we could just take the mutex during runtime
suspend without the possibility of deadlocking. But this would need a
lot of change due to the RPM get in i915_gem_free_object(). 

It's also clear that we need the trylock, since an RPM get with a struct
mutex already held can can block on a pending RPM put, and so getting
the mutex in the suspend handler could deadlock. In this case we can
fail the suspend with EAGAIN, so it'll get scheduled again. 

--Imre

 Just a very quick analysis, I didn't check the details so this might be
 horribly wrong.

 -Daniel


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx