Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting
On ke, 2016-02-17 at 17:37 +0100, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 05:33:51PM +0100, Daniel Vetter wrote: > > On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote: > > > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote: > > > > And for context we're hitting this on CI in a bunch of our machines, > > > > which > > > > > > What's CI ? > > > > Continuous integration, aka our own farm of machines dedicated to running > > i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l > > and also post-merge on our own little integration tree), but for just the > > graphics team and our needs. > > So what patch triggered this new issue? Did cpufreq change or what? It appeared right after enabling lockdep debugging on the continuous integration system. So we do not have a history of it not being there. Taking an another look at my code, it could indeed end up in double- wait-looping scenario if suspend and initialization was performed simultaneously (it had a couple of other bugs too, fixed in v2). Strange thing is, I think that should have been caught by cpuhp_lock_* lockdep tracking. So I'll move the discussion to linux-pm list to change the CPUfreq code. Thanks for the comments. Regards, Joonas -- Joonas Lahtinen Open Source Technology Center Intel Corporation
Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting
On Wed, Feb 17, 2016 at 05:33:51PM +0100, Daniel Vetter wrote: > On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote: > > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote: > > > And for context we're hitting this on CI in a bunch of our machines, which > > > > What's CI ? > > Continuous integration, aka our own farm of machines dedicated to running > i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l > and also post-merge on our own little integration tree), but for just the > graphics team and our needs. So what patch triggered this new issue? Did cpufreq change or what?
Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting
On Wed, Feb 17, 2016 at 05:14:57PM +0100, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 05:13:21PM +0100, Daniel Vetter wrote: > > And for context we're hitting this on CI in a bunch of our machines, which > > What's CI ? Continuous integration, aka our own farm of machines dedicated to running i915.ko testcases. Kinda like 0day (it does pre-merge testing on the m-l and also post-merge on our own little integration tree), but for just the graphics team and our needs. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: [Intel-gfx] [PATCH] [RFC] kernel/cpu: Use lockref for online CPU reference counting
On Mon, Feb 15, 2016 at 03:17:55PM +0100, Peter Zijlstra wrote: > On Mon, Feb 15, 2016 at 02:36:43PM +0200, Joonas Lahtinen wrote: > > Instead of implementing a custom locked reference counting, use lockref. > > > > Current implementation leads to a deadlock splat on Intel SKL platforms > > when lockdep debugging is enabled. > > > > This is due to few of CPUfreq drivers (including Intel P-state) having this; > > policy->rwsem is locked during driver initialization and the functions > > called > > during init that actually apply CPU limits use get_online_cpus (because they > > have other calling paths too), which will briefly lock cpu_hotplug.lock to > > increase cpu_hotplug.refcount. > > > > On later calling path, when doing a suspend, when cpu_hotplug_begin() is > > called > > in disable_nonboot_cpus(), callbacks to CPUfreq functions get called after, > > which will lock policy->rwsem and cpu_hotplug.lock is already held by > > cpu_hotplug_begin() and we do have a potential deadlock scenario reported by > > our CI system (though it is a very unlikely one). See the Bugzilla link for > > more > > details. > > I've been meaning to change the thing into a percpu-rwsem, I just > haven't had time to look into the lockdep splat that generated. I've thrown Joonas patch into a local topic branch to shut up the noise in our CI, and it seems to be effective at that (2 runs thus far). I'll drop this again once we have a proper solution (whatever it'll be) upstream. Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch