Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting

2016-02-18 Thread Joonas Lahtinen
On ke, 2016-02-17 at 17:37 +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 05:33:51PM +0100, Daniel Vetter wrote:
> > On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote:
> > > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> > > > And for context we're hitting this on CI in a bunch of our machines, 
> > > > which
> > > 
> > > What's CI ?
> > 
> > Continuous integration, aka our own farm of machines dedicated to running
> > i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l
> > and also post-merge on our own little integration tree), but for just the
> > graphics team and our needs.
> 
> So what patch triggered this new issue? Did cpufreq change or what?

It appeared right after enabling lockdep debugging on the continuous
integration system. So we do not have a history of it not being there.

Taking an another look at my code, it could indeed end up in double-
wait-looping scenario if suspend and initialization was performed
simultaneously (it had a couple of other bugs too, fixed in v2).
Strange thing is, I think that should have been caught by cpuhp_lock_*
lockdep tracking.

So I'll move the discussion to linux-pm list to change the CPUfreq code. Thanks 
for the comments.

Regards, Joonas
-- 
Joonas Lahtinen
Open Source Technology Center
Intel Corporation



Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting

2016-02-17 Thread Peter Zijlstra
On Wed, Feb 17, 2016 at 05:33:51PM +0100, Daniel Vetter wrote:
> On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> > > And for context we're hitting this on CI in a bunch of our machines, which
> > 
> > What's CI ?
> 
> Continuous integration, aka our own farm of machines dedicated to running
> i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l
> and also post-merge on our own little integration tree), but for just the
> graphics team and our needs.

So what patch triggered this new issue? Did cpufreq change or what?


Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting

2016-02-17 Thread Daniel Vetter
On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote:
> > And for context we're hitting this on CI in a bunch of our machines, which
> 
> What's CI ?

Continuous integration, aka our own farm of machines dedicated to running
i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l
and also post-merge on our own little integration tree), but for just the
graphics team and our needs.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting

2016-02-15 Thread Daniel Vetter
On Mon, Feb 15, 2016 at 03:17:55PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 15, 2016 at 02:36:43PM +0200, Joonas Lahtinen wrote:
> > Instead of implementing a custom locked reference counting, use lockref.
> > 
> > Current implementation leads to a deadlock splat on Intel SKL platforms
> > when lockdep debugging is enabled.
> > 
> > This is due to few of CPUfreq drivers (including Intel P-state) having this;
> > policy->rwsem is locked during driver initialization and the functions 
> > called
> > during init that actually apply CPU limits use get_online_cpus (because they
> > have other calling paths too), which will briefly lock cpu_hotplug.lock to
> > increase cpu_hotplug.refcount.
> > 
> > On later calling path, when doing a suspend, when cpu_hotplug_begin() is 
> > called
> > in disable_nonboot_cpus(), callbacks to CPUfreq functions get called after,
> > which will lock policy->rwsem and cpu_hotplug.lock is already held by
> > cpu_hotplug_begin() and we do have a potential deadlock scenario reported by
> > our CI system (though it is a very unlikely one). See the Bugzilla link for 
> > more
> > details.
> 
> I've been meaning to change the thing into a percpu-rwsem, I just
> haven't had time to look into the lockdep splat that generated.

I've thrown Joonas patch into a local topic branch to shut up the noise in
our CI, and it seems to be effective at that (2 runs thus far). I'll drop
this again once we have a proper solution (whatever it'll be) upstream.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch