Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM
On 2019/1/17 下午11:17, Dave Hansen wrote: On 1/17/19 12:19 AM, Yanmin Zhang wrote: I didn't try pmem and I am wondering it's slower than DRAM. Should a flag, such like _GFP_PMEM, be added to distinguish it from DRAM? Absolutely not. :) Agree. We already have performance-differentiated memory, and lots of ways to enumerate and select it in the kernel (all of our NUMA infrastructure). Kernel does manage memory like what you say. My question is: with your patch, PMEM becomes normal RAM, then there is a chance for kernel to allocate PMEM as DMA buffer. Some super speed devices like 10Giga NIC, USB (SSIC connecting modem), might not work well if DMA buffer is in PMEM as it's slower than DRAM. Should your patchset consider it? PMEM is also just the first of many "kinds" of memory that folks want to build in systems and use a "RAM". We literally don't have space to put a flag in for each type.
Re: [PATCH 4/4] dax: "Hotplug" persistent memory for use like normal RAM
On 2019/1/17 上午2:19, Dave Hansen wrote: From: Dave Hansen Currently, a persistent memory region is "owned" by a device driver, either the "Direct DAX" or "Filesystem DAX" drivers. These drivers allow applications to explicitly use persistent memory, generally by being modified to use special, new libraries. However, this limits persistent memory use to applications which *have* been modified. To make it more broadly usable, this driver "hotplugs" memory into the kernel, to be managed ad used just like normal RAM would be. To make this work, management software must remove the device from being controlled by the "Device DAX" infrastructure: echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind and then bind it to this new driver: echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind After this, there will be a number of new memory sections visible in sysfs that can be onlined, or that may get onlined by existing udev-initiated memory hotplug rules. Note: this inherits any existing NUMA information for the newly- added memory from the persistent memory device that came from the firmware. On Intel platforms, the firmware has guarantees that require each socket's persistent memory to be in a separate memory-only NUMA node. That means that this patch is not expected to create NUMA nodes, but will simply hotplug memory into existing nodes. There is currently some metadata at the beginning of pmem regions. The section-size memory hotplug restrictions, plus this small reserved area can cause the "loss" of a section or two of capacity. This should be fixable in follow-on patches. But, as a first step, losing 256MB of memory (worst case) out of hundreds of gigabytes is a good tradeoff vs. the required code to fix this up precisely. Signed-off-by: Dave Hansen Cc: Dan Williams Cc: Dave Jiang Cc: Ross Zwisler Cc: Vishal Verma Cc: Tom Lendacky Cc: Andrew Morton Cc: Michal Hocko Cc: linux-nvd...@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux...@kvack.org Cc: Huang Ying Cc: Fengguang Wu Cc: Borislav Petkov Cc: Bjorn Helgaas Cc: Yaowei Bai Cc: Takashi Iwai --- b/drivers/dax/Kconfig |5 ++ b/drivers/dax/Makefile |1 b/drivers/dax/kmem.c | 93 + 3 files changed, 99 insertions(+) diff -puN drivers/dax/Kconfig~dax-kmem-try-4 drivers/dax/Kconfig --- a/drivers/dax/Kconfig~dax-kmem-try-42019-01-08 09:54:44.051694874 -0800 +++ b/drivers/dax/Kconfig 2019-01-08 09:54:44.056694874 -0800 @@ -32,6 +32,11 @@ config DEV_DAX_PMEM Say M if unsure +config DEV_DAX_KMEM + def_bool y + depends on DEV_DAX_PMEM # Needs DEV_DAX_PMEM infrastructure + depends on MEMORY_HOTPLUG # for add_memory() and friends + config DEV_DAX_PMEM_COMPAT tristate "PMEM DAX: support the deprecated /sys/class/dax interface" depends on DEV_DAX_PMEM diff -puN /dev/null drivers/dax/kmem.c --- /dev/null 2018-12-03 08:41:47.355756491 -0800 +++ b/drivers/dax/kmem.c2019-01-08 09:54:44.056694874 -0800 @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2016-2018 Intel Corporation. All rights reserved. */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "dax-private.h" +#include "bus.h" + +int dev_dax_kmem_probe(struct device *dev) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct resource *res = _dax->region->res; + resource_size_t kmem_start; + resource_size_t kmem_size; + struct resource *new_res; + int numa_node; + int rc; + + /* Hotplug starting at the beginning of the next block: */ + kmem_start = ALIGN(res->start, memory_block_size_bytes()); + + kmem_size = resource_size(res); + /* Adjust the size down to compensate for moving up kmem_start: */ +kmem_size -= kmem_start - res->start; + /* Align the size down to cover only complete blocks: */ + kmem_size &= ~(memory_block_size_bytes() - 1); + + new_res = devm_request_mem_region(dev, kmem_start, kmem_size, + dev_name(dev)); + + if (!new_res) { + printk("could not reserve region %016llx -> %016llx\n", + kmem_start, kmem_start+kmem_size); + return -EBUSY; + } + + /* +* Set flags appropriate for System RAM. Leave ..._BUSY clear +* so that add_memory() can add a child resource. +*/ + new_res->flags = IORESOURCE_SYSTEM_RAM; + new_res->name = dev_name(dev); + + numa_node = dev_dax->target_node; + if (numa_node < 0) { + pr_warn_once("bad numa_node: %d, forcing to 0\n", numa_node); + numa_node = 0; + } + + rc =
Re: [PATCH] pnp: Bypass the calling to pnp_stop_dev at suspend when there is a protocol suspend
On 二, 2013-12-24 at 09:35 +0800, shuox@intel.com wrote: > From: Zhang Yanmin > > pnp pnp_bus_suspend/_resume have an issue. > pnp_bus_suspend calls pnp_stop_dev to disable the device. With ACPI, > pnp_stop_dev turns off the dev usually. Then, > pnp_bus_suspend=>pnp_dev->protocol->suspend accesses the device and > suspend it again. > > pnp_bus_resume has the similar issue. > > Another issue is firmware might just provide _DIS, but no_STS method. > > The patch fixes it by adding a checking. If there is > pnp_dev->protocol->suspend, pnp_bus_suspend doesn't call pnp_stop_dev. > Do the similar thing for _resume. Rafael, What's your idea about this patch? We hit the issue when enabling Android on a latest tablet. After suspend-to-ram wakeup, serial console doesn't work. This serial port is bound by pnpcore driver. At suspending, static int __pnp_bus_suspend(struct device *dev, pm_message_t state) { ... if (pnp_can_disable(pnp_dev)) { error = pnp_stop_dev(pnp_dev); if (error) return error; } if (pnp_dev->protocol->suspend) pnp_dev->protocol->suspend(pnp_dev, state); return 0; } pnp_stop_dev calls dev->protocol->disable. As for ACPI device, that disable callback calls _DIS. Based on ACPI spec, driver need turn off the device before disabling it by _DIS. That means, after pnp_stop_dev returns, the device is at OFF state. Then, __pnp_bus_suspend calls pnp_dev->protocol->suspend, which continues to access the device while the device is at OFF. Our firmware just provides _DIS for the device. There is no _STS method. Then, after wakeup, the device doesn't work. But just like what the patch points out, pnp_dev->protocol->suspend continues to access the device while the device is at OFF. It's not safe. The patch looks like a workaround. Another possible fix is to just delete the calling of pnp_stop_dev in function __pnp_bus_suspend, as suspend is not equal to _disable_. The deletion might be a little intrusive. That's why we sent a workaround patch to LKML. Which one is better? Thanks, Yanmin > > Signed-off-by: Zhang Yanmin > Signed-off-by: Liu ShuoX > --- > drivers/pnp/driver.c | 6 ++ > 1 file changed, 2 insertions(+), 4 deletions(-) > > diff --git a/drivers/pnp/driver.c b/drivers/pnp/driver.c > index f748cc8..2512e47 100644 > --- a/drivers/pnp/driver.c > +++ b/drivers/pnp/driver.c > @@ -176,7 +176,7 @@ static int __pnp_bus_suspend(struct device *dev, > pm_message_t state) > return error; > } > > - if (pnp_can_disable(pnp_dev)) { > + if (pnp_can_disable(pnp_dev) && !pnp_dev->protocol->suspend) { > error = pnp_stop_dev(pnp_dev); > if (error) > return error; > @@ -215,9 +215,7 @@ static int pnp_bus_resume(struct device *dev) > error = pnp_dev->protocol->resume(pnp_dev); > if (error) > return error; > - } > - > - if (pnp_can_write(pnp_dev)) { > + } else if (pnp_can_write(pnp_dev)) { > error = pnp_start_dev(pnp_dev); > if (error) > return error; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] pnp: Bypass the calling to pnp_stop_dev at suspend when there is a protocol suspend
On 二, 2013-12-24 at 09:35 +0800, shuox@intel.com wrote: From: Zhang Yanmin yanmin.zh...@intel.com pnp pnp_bus_suspend/_resume have an issue. pnp_bus_suspend calls pnp_stop_dev to disable the device. With ACPI, pnp_stop_dev turns off the dev usually. Then, pnp_bus_suspend=pnp_dev-protocol-suspend accesses the device and suspend it again. pnp_bus_resume has the similar issue. Another issue is firmware might just provide _DIS, but no_STS method. The patch fixes it by adding a checking. If there is pnp_dev-protocol-suspend, pnp_bus_suspend doesn't call pnp_stop_dev. Do the similar thing for _resume. Rafael, What's your idea about this patch? We hit the issue when enabling Android on a latest tablet. After suspend-to-ram wakeup, serial console doesn't work. This serial port is bound by pnpcore driver. At suspending, static int __pnp_bus_suspend(struct device *dev, pm_message_t state) { ... if (pnp_can_disable(pnp_dev)) { error = pnp_stop_dev(pnp_dev); if (error) return error; } if (pnp_dev-protocol-suspend) pnp_dev-protocol-suspend(pnp_dev, state); return 0; } pnp_stop_dev calls dev-protocol-disable. As for ACPI device, that disable callback calls _DIS. Based on ACPI spec, driver need turn off the device before disabling it by _DIS. That means, after pnp_stop_dev returns, the device is at OFF state. Then, __pnp_bus_suspend calls pnp_dev-protocol-suspend, which continues to access the device while the device is at OFF. Our firmware just provides _DIS for the device. There is no _STS method. Then, after wakeup, the device doesn't work. But just like what the patch points out, pnp_dev-protocol-suspend continues to access the device while the device is at OFF. It's not safe. The patch looks like a workaround. Another possible fix is to just delete the calling of pnp_stop_dev in function __pnp_bus_suspend, as suspend is not equal to _disable_. The deletion might be a little intrusive. That's why we sent a workaround patch to LKML. Which one is better? Thanks, Yanmin Signed-off-by: Zhang Yanmin yanmin.zh...@intel.com Signed-off-by: Liu ShuoX shuox@intel.com --- drivers/pnp/driver.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/pnp/driver.c b/drivers/pnp/driver.c index f748cc8..2512e47 100644 --- a/drivers/pnp/driver.c +++ b/drivers/pnp/driver.c @@ -176,7 +176,7 @@ static int __pnp_bus_suspend(struct device *dev, pm_message_t state) return error; } - if (pnp_can_disable(pnp_dev)) { + if (pnp_can_disable(pnp_dev) !pnp_dev-protocol-suspend) { error = pnp_stop_dev(pnp_dev); if (error) return error; @@ -215,9 +215,7 @@ static int pnp_bus_resume(struct device *dev) error = pnp_dev-protocol-resume(pnp_dev); if (error) return error; - } - - if (pnp_can_write(pnp_dev)) { + } else if (pnp_can_write(pnp_dev)) { error = pnp_start_dev(pnp_dev); if (error) return error; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] PM: avoid 'autosleep' in shutdown progress
On Sat, 2013-07-13 at 13:56 +0200, Pavel Machek wrote: > On Fri 2013-07-12 10:37:33, Alan Stern wrote: > > On Fri, 12 Jul 2013, Yanmin Zhang wrote: > > > > > On Thu, 2013-07-11 at 16:03 +0800, shuox@intel.com wrote: > > > > From: Liu ShuoX > > > > > > > > In shutdown progress, system is possible to do power transition > > > > (such as suspend-to-ram) in parallel. It is unreasonable. So, > > > > fixes it by adding a system_state checking and queue try_to_suspend > > > > again when system status is not running. > > > > > > > > Signed-off-by: Liu ShuoX > > > > --- > > > > kernel/power/autosleep.c |3 ++- > > > > 1 files changed, 2 insertions(+), 1 deletions(-) > > > > > > > Without this patch, we hit an hang issue on Android. > > > > > > Scenario: > > > Kernel starts shutdown and calls all device driver's shutdown callback. > > > When a driver's shutdown is called, the last wakelock is released and > > > suspend-to-ram starts. However, as some driver's shut down callbacks > > > already shut down devices and disable runtime pm, the suspend-to-ram > > > calls driver's suspend callback without noticing that device is already > > > off and causes crash. > > > We know the drivers should be fixed, but can we also change generic > > > codes a little to make it stronger? > > > > Indeed, this does seem like the sort of thing we want to have. > > Allowing an "automatic" or "opportunistic" sleep in the middle of a > > shutdown makes no sense at all. In fact, it might be a good idea to > > disable system sleep completely at this time -- not even a write to > > /sys/power/state should be allowed to interrupt a shutdown. > > I'm not completely sure, but as long as userland is running, we should > have system_state == RUNNING, no? No. The reboot/shutdown is started by application processes, then kernel changes system_state quickly. See kernel_restart_prepare, kernel_shutdown_prepare. But indeed, it's a good question. We hit many other issues around rebooting. When system is rebooting, user space processes are still running. There is no freeze step. Some processes might jump in to access devices. To avoid it, we have to add some tricks in some device drivers. It's a little crazy. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] PM: avoid 'autosleep' in shutdown progress
On Sat, 2013-07-13 at 13:56 +0200, Pavel Machek wrote: On Fri 2013-07-12 10:37:33, Alan Stern wrote: On Fri, 12 Jul 2013, Yanmin Zhang wrote: On Thu, 2013-07-11 at 16:03 +0800, shuox@intel.com wrote: From: Liu ShuoX shuox@intel.com In shutdown progress, system is possible to do power transition (such as suspend-to-ram) in parallel. It is unreasonable. So, fixes it by adding a system_state checking and queue try_to_suspend again when system status is not running. Signed-off-by: Liu ShuoX shuox@intel.com --- kernel/power/autosleep.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) Without this patch, we hit an hang issue on Android. Scenario: Kernel starts shutdown and calls all device driver's shutdown callback. When a driver's shutdown is called, the last wakelock is released and suspend-to-ram starts. However, as some driver's shut down callbacks already shut down devices and disable runtime pm, the suspend-to-ram calls driver's suspend callback without noticing that device is already off and causes crash. We know the drivers should be fixed, but can we also change generic codes a little to make it stronger? Indeed, this does seem like the sort of thing we want to have. Allowing an automatic or opportunistic sleep in the middle of a shutdown makes no sense at all. In fact, it might be a good idea to disable system sleep completely at this time -- not even a write to /sys/power/state should be allowed to interrupt a shutdown. I'm not completely sure, but as long as userland is running, we should have system_state == RUNNING, no? No. The reboot/shutdown is started by application processes, then kernel changes system_state quickly. See kernel_restart_prepare, kernel_shutdown_prepare. But indeed, it's a good question. We hit many other issues around rebooting. When system is rebooting, user space processes are still running. There is no freeze step. Some processes might jump in to access devices. To avoid it, we have to add some tricks in some device drivers. It's a little crazy. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] PM: avoid 'autosleep' in shutdown progress
On Thu, 2013-07-11 at 16:03 +0800, shuox@intel.com wrote: > From: Liu ShuoX > > In shutdown progress, system is possible to do power transition > (such as suspend-to-ram) in parallel. It is unreasonable. So, > fixes it by adding a system_state checking and queue try_to_suspend > again when system status is not running. > > Signed-off-by: Liu ShuoX > --- > kernel/power/autosleep.c |3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > Without this patch, we hit an hang issue on Android. Scenario: Kernel starts shutdown and calls all device driver's shutdown callback. When a driver's shutdown is called, the last wakelock is released and suspend-to-ram starts. However, as some driver's shut down callbacks already shut down devices and disable runtime pm, the suspend-to-ram calls driver's suspend callback without noticing that device is already off and causes crash. We know the drivers should be fixed, but can we also change generic codes a little to make it stronger? > diff --git a/kernel/power/autosleep.c b/kernel/power/autosleep.c > index c6422ff..9012ecf 100644 > --- a/kernel/power/autosleep.c > +++ b/kernel/power/autosleep.c > @@ -32,7 +32,8 @@ static void try_to_suspend(struct work_struct *work) > > mutex_lock(_lock); > > - if (!pm_save_wakeup_count(initial_count)) { > + if (!pm_save_wakeup_count(initial_count) || > + system_state != SYSTEM_RUNNING) { > mutex_unlock(_lock); > goto out; > } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] PM: avoid 'autosleep' in shutdown progress
On Thu, 2013-07-11 at 16:03 +0800, shuox@intel.com wrote: From: Liu ShuoX shuox@intel.com In shutdown progress, system is possible to do power transition (such as suspend-to-ram) in parallel. It is unreasonable. So, fixes it by adding a system_state checking and queue try_to_suspend again when system status is not running. Signed-off-by: Liu ShuoX shuox@intel.com --- kernel/power/autosleep.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) Without this patch, we hit an hang issue on Android. Scenario: Kernel starts shutdown and calls all device driver's shutdown callback. When a driver's shutdown is called, the last wakelock is released and suspend-to-ram starts. However, as some driver's shut down callbacks already shut down devices and disable runtime pm, the suspend-to-ram calls driver's suspend callback without noticing that device is already off and causes crash. We know the drivers should be fixed, but can we also change generic codes a little to make it stronger? diff --git a/kernel/power/autosleep.c b/kernel/power/autosleep.c index c6422ff..9012ecf 100644 --- a/kernel/power/autosleep.c +++ b/kernel/power/autosleep.c @@ -32,7 +32,8 @@ static void try_to_suspend(struct work_struct *work) mutex_lock(autosleep_lock); - if (!pm_save_wakeup_count(initial_count)) { + if (!pm_save_wakeup_count(initial_count) || + system_state != SYSTEM_RUNNING) { mutex_unlock(autosleep_lock); goto out; } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Sat, 2013-06-08 at 03:52 +0200, Rafael J. Wysocki wrote: > On Saturday, June 08, 2013 09:36:03 AM Yanmin Zhang wrote: > > On Sat, 2013-06-08 at 03:30 +0200, Rafael J. Wysocki wrote: > > > On Friday, June 07, 2013 06:16:25 PM Greg KH wrote: > > > > On Sat, Jun 08, 2013 at 08:42:12AM +0800, Yanmin Zhang wrote: > > > > > On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: > > > > > > On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: > > > > > > > dpm_run_callback is used in other stages of power states changing. > > > > > > > It provides debug info message and time measurement when call > > > > > > > these > > > > > > > callback. We also want to benefit ->prepare and ->complete. > > > > > > > > > > > > > > [PATCH 1/2] PM: use dpm_run_callback in device_prepare > > > > > > > [PATCH 2/2] PM: add dpm_run_callback_void and use it in > > > > > > > device_complete > > > > > > > > > > > > Is this an "Oh, why don't we do that?" series, or is it useful for > > > > > > anything > > > > > > in practice? I'm asking, because we haven't added that stuff to > > > > > > start with > > > > > > since we didn't see why it would be useful to anyone. > > > > > > > > > > > > And while patch [1/2] reduces the code size (by 1 line), so I can > > > > > > see some > > > > > > (tiny) benefit from applying it, patch [2/2] adds more code and is > > > > > > there any > > > > > > paractical reason? > > > > > Sometimes, suspend-to-ram path spends too much time (either suspend > > > > > slowly > > > > > or wakeup slowly) and we need optimize it. > > > > > With the 2 patches, we could collect initcall_debug printk info and > > > > > manually > > > > > check what prepare/complete callbacks consume too much time. > > > > > > > > But initcall information is for initialization stuff, not suspend/resume > > > > things, right? Doesn't the existing tools for parsing this choke if it > > > > sees the information at suspend/resume time? > > > > > > We've been using that for suspend/resume for quite some time too, but not > > > for the prepare/complete phases (because we still believe that's not > > > really > > > useful for them). > > > > > > Well, I'll be handling patches changing code under drivers/base/power, > > > I promise. :-) > > > > > > I've been doing that for quite a few years now ... > > Yes, indeed. Power is one of the most important features on embedded > > devices. > > Lots of smart phones don't really go through the full cycles of > > suspend-to-ram. > > We are following the full steps of the suspend. > > But if you go through the code, you'll see that alomost no drivers actually > implemet .prepare() and .complete(). Some subsystems do, but they really > don't > take too much time to execute. Which means that your patches with > initcall_debug will add quite a pile of useless garbage to the kernel log Does that mean we need add more log levels around such info instead of just having or not having? Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Fri, 2013-06-07 at 08:08 -0700, Greg KH wrote: > On Fri, Jun 07, 2013 at 12:54:55PM +0800, Yanmin Zhang wrote: > > On Thu, 2013-06-06 at 21:19 -0700, Greg KH wrote: > > > On Fri, Jun 07, 2013 at 11:18:06AM +0800, Yanmin Zhang wrote: > > > > On Thu, 2013-06-06 at 20:08 -0700, Greg KH wrote: > > > > > On Fri, Jun 07, 2013 at 10:37:52AM +0800, Yanmin Zhang wrote: > > > > > > On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: > > > > > > > On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: > > > > > > > > On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: > > > > > > > > > On Thu, 6 Jun 2013, shuox@intel.com wrote: > > > > > > > > > > From: Zhang Yanmin > > > > > > > > > > > > > > > > > > > > synchronize_irq waits pending IRQ handlers to be finished. > > > > > > > > > > If using this > > > > > > > > > > function while holding a resource, the IRQ handler may > > > > > > > > > > cause deadlock. > > > > > > > > > > > > > > > > > > > > Here we add a new function irq_in_progress which doesn't > > > > > > > > > > wait for the handlers > > > > > > > > > > to be finished. > > > > > > > > > > > > > > > > > > > > A typical use case at suspend-to-ram: > > > > > > > > > > > > > > > > > > > > device driver's irq handler is complicated and might hold a > > > > > > > > > > mutex at rare cases. > > > > > > > > > > Its suspend function is called and a suspended flag is set. > > > > > > > > > > In case its IRQ handler is running, suspend function calls > > > > > > > > > > irq_in_progress. if > > > > > > > > > > handler is running, abort suspend. > > > > > > > > > > The irq handler checks the suspended flag. If the device is > > > > > > > > > > suspended, irq handler > > > > > > > > > > either ignores the interrupt, or wakes up the whole system, > > > > > > > > > > and the driver's > > > > > > > > > > resume function could deal with the delayed interrupt > > > > > > > > > > handling. > > > > > > > > > > > > > > > > > > This is as wrong as it can be. Fix the driver instead of > > > > > > > > > hacking racy > > > > > > > > > functions into the core code. > > > > > > > > > > > > > > > > > > So your problem looks like this: > > > > > > > > > > > > > > > > > > CPU 0 CPU 1 > > > > > > > > > irq_handler_thread() suspend() > > > > > > > > >. mutex_lock(); > > > > > > > > >mutex_lock();synchronize_irq(); > > > > > > > > > > > > > > > > > > So why needs the mutex to be taken before synchronize_irq()? > > > > > > > > > Why not > > > > > > > > > doing the obvious? > > > > > > > > > > > > > > > > > > suspend() > > > > > > > > > disable_irq(); (Implies synchronize_irq) > > > > > > > > > mutex_lock(); > > > > > > > > > > > > > > > > > > mutex_unlock(); > > > > > > > > > enable_irq(); > > > > > > > > Thanks for the kind comment. > > > > > > > > > > > > > > > > We do consider your solution before and it works well indeed > > > > > > > > with some specific > > > > > > > > simple drivers. For example, some drives use GPIO pin as > > > > > > > > interrupt source. > > > > > > > > > > > > > > > > On one specific platform, disable_irq would really disable the > > > > > > > > irq at RTE entry, > > > > > > > >
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Sat, 2013-06-08 at 03:30 +0200, Rafael J. Wysocki wrote: > On Friday, June 07, 2013 06:16:25 PM Greg KH wrote: > > On Sat, Jun 08, 2013 at 08:42:12AM +0800, Yanmin Zhang wrote: > > > On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: > > > > On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: > > > > > dpm_run_callback is used in other stages of power states changing. > > > > > It provides debug info message and time measurement when call these > > > > > callback. We also want to benefit ->prepare and ->complete. > > > > > > > > > > [PATCH 1/2] PM: use dpm_run_callback in device_prepare > > > > > [PATCH 2/2] PM: add dpm_run_callback_void and use it in > > > > > device_complete > > > > > > > > Is this an "Oh, why don't we do that?" series, or is it useful for > > > > anything > > > > in practice? I'm asking, because we haven't added that stuff to start > > > > with > > > > since we didn't see why it would be useful to anyone. > > > > > > > > And while patch [1/2] reduces the code size (by 1 line), so I can see > > > > some > > > > (tiny) benefit from applying it, patch [2/2] adds more code and is > > > > there any > > > > paractical reason? > > > Sometimes, suspend-to-ram path spends too much time (either suspend slowly > > > or wakeup slowly) and we need optimize it. > > > With the 2 patches, we could collect initcall_debug printk info and > > > manually > > > check what prepare/complete callbacks consume too much time. > > > > But initcall information is for initialization stuff, not suspend/resume > > things, right? Doesn't the existing tools for parsing this choke if it > > sees the information at suspend/resume time? > > We've been using that for suspend/resume for quite some time too, but not > for the prepare/complete phases (because we still believe that's not really > useful for them). > > Well, I'll be handling patches changing code under drivers/base/power, > I promise. :-) > > I've been doing that for quite a few years now ... Yes, indeed. Power is one of the most important features on embedded devices. Lots of smart phones don't really go through the full cycles of suspend-to-ram. We are following the full steps of the suspend. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Fri, 2013-06-07 at 18:16 -0700, Greg KH wrote: > On Sat, Jun 08, 2013 at 08:42:12AM +0800, Yanmin Zhang wrote: > > On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: > > > On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: > > > > dpm_run_callback is used in other stages of power states changing. > > > > It provides debug info message and time measurement when call these > > > > callback. We also want to benefit ->prepare and ->complete. > > > > > > > > [PATCH 1/2] PM: use dpm_run_callback in device_prepare > > > > [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete > > > > > > Is this an "Oh, why don't we do that?" series, or is it useful for > > > anything > > > in practice? I'm asking, because we haven't added that stuff to start > > > with > > > since we didn't see why it would be useful to anyone. > > > > > > And while patch [1/2] reduces the code size (by 1 line), so I can see some > > > (tiny) benefit from applying it, patch [2/2] adds more code and is there > > > any > > > paractical reason? > > Sometimes, suspend-to-ram path spends too much time (either suspend slowly > > or wakeup slowly) and we need optimize it. > > With the 2 patches, we could collect initcall_debug printk info and manually > > check what prepare/complete callbacks consume too much time. > > But initcall information is for initialization stuff, not suspend/resume > things, right? initcall_debug also controls shutdown callbacks, and suspend/resume callbacks. See __device_suspend=>dpm_run_callback. It's very useful when we want to optimize suspend-to-ram, or debug some hard issues which happen when async suspend is enabled. > Doesn't the existing tools for parsing this choke if it > sees the information at suspend/resume time? Current kernel doesn't print out info around prepare/complete. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] PM: use dpm_run_callback in device_prepare
On Fri, 2013-06-07 at 18:15 -0700, Greg KH wrote: > On Sat, Jun 08, 2013 at 08:43:55AM +0800, Yanmin Zhang wrote: > > On Fri, 2013-06-07 at 10:37 -0700, Greg KH wrote: > > > On Fri, Jun 07, 2013 at 04:20:31PM +0800, shuox@intel.com wrote: > > > > From: ShuoX Liu > > > > > > > > dpm_run_callback could show more debug info around prepare stage. > > > > > > Why? Who needs this? What problem does it solve? > > > > > > Without answers to this, why would you expect us to accept such a > > > change? > > It's to provide more debug info and developers can quickly find what > > prepare/complete callback consumes too much time. > > What exact debug info is this providing, and why wouldn't you say all of > this in the original changelog information? Thanks for the reminder. Below are examples, after applying the patches. We can see clearly how much time every device prepare callback consumes. [ 600.384779] sr 1:0:0:0: preparing bus suspend [ 600.384780] call 1:0:0:0+ returned 0 after 2 usecs [ 600.384785] calling 2-1+ @ 2556, parent: usb2 [ 600.384788] usb 2-1: preparing type suspend [ 600.384789] call 2-1+ returned 0 after 2 usecs [ 600.384794] calling 1-1.1+ @ 2556, parent: 1-1 [ 600.384797] usb 1-1.1: preparing type suspend [ 600.384799] call 1-1.1+ returned 0 after 2 usecs [ 600.384802] calling 1-1.3+ @ 2556, parent: 1-1 [ 600.384804] usb 1-1.3: preparing type suspend [ 600.384805] call 1-1.3+ returned 0 after 2 usecs [ 600.384809] calling 1-1.4+ @ 2556, parent: 1-1 [ 600.384811] usb 1-1.4: preparing type suspend [ 600.384813] call 1-1.4+ returned 0 after 2 usecs [ 600.384816] calling 1-1.5+ @ 2556, parent: 1-1 [ 600.384819] usb 1-1.5: preparing type suspend, may wakeup [ 600.384820] call 1-1.5+ returned 0 after 2 usecs [ 600.384823] calling 1-1.4.1+ @ 2556, parent: 1-1.4 [ 600.384825] usb 1-1.4.1: preparing type suspend [ 600.384827] call 1-1.4.1+ returned 0 after 2 usecs [ 600.384829] calling 1-1.4.2+ @ 2556, parent: 1-1.4 [ 600.384831] usb 1-1.4.2: preparing type suspend [ 600.384833] call 1-1.4.2+ returned 0 after 2 usecs [ 600.384865] PM: prepare suspend of devices complete after 0.471 msecs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Sat, 2013-06-08 at 02:54 +0200, Rafael J. Wysocki wrote: > On Saturday, June 08, 2013 08:42:12 AM Yanmin Zhang wrote: > > On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: > > > On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: > > > > dpm_run_callback is used in other stages of power states changing. > > > > It provides debug info message and time measurement when call these > > > > callback. We also want to benefit ->prepare and ->complete. > > > > > > > > [PATCH 1/2] PM: use dpm_run_callback in device_prepare > > > > [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete > > > > > > Is this an "Oh, why don't we do that?" series, or is it useful for > > > anything > > > in practice? I'm asking, because we haven't added that stuff to start > > > with > > > since we didn't see why it would be useful to anyone. > > > > > > And while patch [1/2] reduces the code size (by 1 line), so I can see some > > > (tiny) benefit from applying it, patch [2/2] adds more code and is there > > > any > > > paractical reason? > > Sometimes, suspend-to-ram path spends too much time (either suspend slowly > > or wakeup slowly) and we need optimize it. > > With the 2 patches, we could collect initcall_debug printk info and manually > > check what prepare/complete callbacks consume too much time. > > Well, can you point me to a single driver where prepare/complete causes this > type of problems to happen? That's a good question. We are enabling Android mobiles and need optimize suspend-to-ram. In the current upstream, we don't find a driver's prepare/complete callback spends too much time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] PM: use dpm_run_callback in device_prepare
On Fri, 2013-06-07 at 10:37 -0700, Greg KH wrote: > On Fri, Jun 07, 2013 at 04:20:31PM +0800, shuox@intel.com wrote: > > From: ShuoX Liu > > > > dpm_run_callback could show more debug info around prepare stage. > > Why? Who needs this? What problem does it solve? > > Without answers to this, why would you expect us to accept such a > change? It's to provide more debug info and developers can quickly find what prepare/complete callback consumes too much time. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: > On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: > > dpm_run_callback is used in other stages of power states changing. > > It provides debug info message and time measurement when call these > > callback. We also want to benefit ->prepare and ->complete. > > > > [PATCH 1/2] PM: use dpm_run_callback in device_prepare > > [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete > > Is this an "Oh, why don't we do that?" series, or is it useful for anything > in practice? I'm asking, because we haven't added that stuff to start with > since we didn't see why it would be useful to anyone. > > And while patch [1/2] reduces the code size (by 1 line), so I can see some > (tiny) benefit from applying it, patch [2/2] adds more code and is there any > paractical reason? Sometimes, suspend-to-ram path spends too much time (either suspend slowly or wakeup slowly) and we need optimize it. With the 2 patches, we could collect initcall_debug printk info and manually check what prepare/complete callbacks consume too much time. Thanks, Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: dpm_run_callback is used in other stages of power states changing. It provides debug info message and time measurement when call these callback. We also want to benefit -prepare and -complete. [PATCH 1/2] PM: use dpm_run_callback in device_prepare [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete Is this an Oh, why don't we do that? series, or is it useful for anything in practice? I'm asking, because we haven't added that stuff to start with since we didn't see why it would be useful to anyone. And while patch [1/2] reduces the code size (by 1 line), so I can see some (tiny) benefit from applying it, patch [2/2] adds more code and is there any paractical reason? Sometimes, suspend-to-ram path spends too much time (either suspend slowly or wakeup slowly) and we need optimize it. With the 2 patches, we could collect initcall_debug printk info and manually check what prepare/complete callbacks consume too much time. Thanks, Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] PM: use dpm_run_callback in device_prepare
On Fri, 2013-06-07 at 10:37 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 04:20:31PM +0800, shuox@intel.com wrote: From: ShuoX Liu shuox@intel.com dpm_run_callback could show more debug info around prepare stage. Why? Who needs this? What problem does it solve? Without answers to this, why would you expect us to accept such a change? It's to provide more debug info and developers can quickly find what prepare/complete callback consumes too much time. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Sat, 2013-06-08 at 02:54 +0200, Rafael J. Wysocki wrote: On Saturday, June 08, 2013 08:42:12 AM Yanmin Zhang wrote: On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: dpm_run_callback is used in other stages of power states changing. It provides debug info message and time measurement when call these callback. We also want to benefit -prepare and -complete. [PATCH 1/2] PM: use dpm_run_callback in device_prepare [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete Is this an Oh, why don't we do that? series, or is it useful for anything in practice? I'm asking, because we haven't added that stuff to start with since we didn't see why it would be useful to anyone. And while patch [1/2] reduces the code size (by 1 line), so I can see some (tiny) benefit from applying it, patch [2/2] adds more code and is there any paractical reason? Sometimes, suspend-to-ram path spends too much time (either suspend slowly or wakeup slowly) and we need optimize it. With the 2 patches, we could collect initcall_debug printk info and manually check what prepare/complete callbacks consume too much time. Well, can you point me to a single driver where prepare/complete causes this type of problems to happen? That's a good question. We are enabling Android mobiles and need optimize suspend-to-ram. In the current upstream, we don't find a driver's prepare/complete callback spends too much time. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] PM: use dpm_run_callback in device_prepare
On Fri, 2013-06-07 at 18:15 -0700, Greg KH wrote: On Sat, Jun 08, 2013 at 08:43:55AM +0800, Yanmin Zhang wrote: On Fri, 2013-06-07 at 10:37 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 04:20:31PM +0800, shuox@intel.com wrote: From: ShuoX Liu shuox@intel.com dpm_run_callback could show more debug info around prepare stage. Why? Who needs this? What problem does it solve? Without answers to this, why would you expect us to accept such a change? It's to provide more debug info and developers can quickly find what prepare/complete callback consumes too much time. What exact debug info is this providing, and why wouldn't you say all of this in the original changelog information? Thanks for the reminder. Below are examples, after applying the patches. We can see clearly how much time every device prepare callback consumes. [ 600.384779] sr 1:0:0:0: preparing bus suspend [ 600.384780] call 1:0:0:0+ returned 0 after 2 usecs [ 600.384785] calling 2-1+ @ 2556, parent: usb2 [ 600.384788] usb 2-1: preparing type suspend [ 600.384789] call 2-1+ returned 0 after 2 usecs [ 600.384794] calling 1-1.1+ @ 2556, parent: 1-1 [ 600.384797] usb 1-1.1: preparing type suspend [ 600.384799] call 1-1.1+ returned 0 after 2 usecs [ 600.384802] calling 1-1.3+ @ 2556, parent: 1-1 [ 600.384804] usb 1-1.3: preparing type suspend [ 600.384805] call 1-1.3+ returned 0 after 2 usecs [ 600.384809] calling 1-1.4+ @ 2556, parent: 1-1 [ 600.384811] usb 1-1.4: preparing type suspend [ 600.384813] call 1-1.4+ returned 0 after 2 usecs [ 600.384816] calling 1-1.5+ @ 2556, parent: 1-1 [ 600.384819] usb 1-1.5: preparing type suspend, may wakeup [ 600.384820] call 1-1.5+ returned 0 after 2 usecs [ 600.384823] calling 1-1.4.1+ @ 2556, parent: 1-1.4 [ 600.384825] usb 1-1.4.1: preparing type suspend [ 600.384827] call 1-1.4.1+ returned 0 after 2 usecs [ 600.384829] calling 1-1.4.2+ @ 2556, parent: 1-1.4 [ 600.384831] usb 1-1.4.2: preparing type suspend [ 600.384833] call 1-1.4.2+ returned 0 after 2 usecs [ 600.384865] PM: prepare suspend of devices complete after 0.471 msecs -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Fri, 2013-06-07 at 18:16 -0700, Greg KH wrote: On Sat, Jun 08, 2013 at 08:42:12AM +0800, Yanmin Zhang wrote: On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: dpm_run_callback is used in other stages of power states changing. It provides debug info message and time measurement when call these callback. We also want to benefit -prepare and -complete. [PATCH 1/2] PM: use dpm_run_callback in device_prepare [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete Is this an Oh, why don't we do that? series, or is it useful for anything in practice? I'm asking, because we haven't added that stuff to start with since we didn't see why it would be useful to anyone. And while patch [1/2] reduces the code size (by 1 line), so I can see some (tiny) benefit from applying it, patch [2/2] adds more code and is there any paractical reason? Sometimes, suspend-to-ram path spends too much time (either suspend slowly or wakeup slowly) and we need optimize it. With the 2 patches, we could collect initcall_debug printk info and manually check what prepare/complete callbacks consume too much time. But initcall information is for initialization stuff, not suspend/resume things, right? initcall_debug also controls shutdown callbacks, and suspend/resume callbacks. See __device_suspend=dpm_run_callback. It's very useful when we want to optimize suspend-to-ram, or debug some hard issues which happen when async suspend is enabled. Doesn't the existing tools for parsing this choke if it sees the information at suspend/resume time? Current kernel doesn't print out info around prepare/complete. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Sat, 2013-06-08 at 03:30 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 06:16:25 PM Greg KH wrote: On Sat, Jun 08, 2013 at 08:42:12AM +0800, Yanmin Zhang wrote: On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: dpm_run_callback is used in other stages of power states changing. It provides debug info message and time measurement when call these callback. We also want to benefit -prepare and -complete. [PATCH 1/2] PM: use dpm_run_callback in device_prepare [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete Is this an Oh, why don't we do that? series, or is it useful for anything in practice? I'm asking, because we haven't added that stuff to start with since we didn't see why it would be useful to anyone. And while patch [1/2] reduces the code size (by 1 line), so I can see some (tiny) benefit from applying it, patch [2/2] adds more code and is there any paractical reason? Sometimes, suspend-to-ram path spends too much time (either suspend slowly or wakeup slowly) and we need optimize it. With the 2 patches, we could collect initcall_debug printk info and manually check what prepare/complete callbacks consume too much time. But initcall information is for initialization stuff, not suspend/resume things, right? Doesn't the existing tools for parsing this choke if it sees the information at suspend/resume time? We've been using that for suspend/resume for quite some time too, but not for the prepare/complete phases (because we still believe that's not really useful for them). Well, I'll be handling patches changing code under drivers/base/power, I promise. :-) I've been doing that for quite a few years now ... Yes, indeed. Power is one of the most important features on embedded devices. Lots of smart phones don't really go through the full cycles of suspend-to-ram. We are following the full steps of the suspend. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Fri, 2013-06-07 at 08:08 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 12:54:55PM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 21:19 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 11:18:06AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 20:08 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 10:37:52AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: On Thu, 6 Jun 2013, shuox@intel.com wrote: From: Zhang Yanmin yanmin.zh...@intel.com synchronize_irq waits pending IRQ handlers to be finished. If using this function while holding a resource, the IRQ handler may cause deadlock. Here we add a new function irq_in_progress which doesn't wait for the handlers to be finished. A typical use case at suspend-to-ram: device driver's irq handler is complicated and might hold a mutex at rare cases. Its suspend function is called and a suspended flag is set. In case its IRQ handler is running, suspend function calls irq_in_progress. if handler is running, abort suspend. The irq handler checks the suspended flag. If the device is suspended, irq handler either ignores the interrupt, or wakes up the whole system, and the driver's resume function could deal with the delayed interrupt handling. This is as wrong as it can be. Fix the driver instead of hacking racy functions into the core code. So your problem looks like this: CPU 0 CPU 1 irq_handler_thread() suspend() . mutex_lock(m); mutex_lock(m);synchronize_irq(); So why needs the mutex to be taken before synchronize_irq()? Why not doing the obvious? suspend() disable_irq(); (Implies synchronize_irq) mutex_lock(m); mutex_unlock(m); enable_irq(); Thanks for the kind comment. We do consider your solution before and it works well indeed with some specific simple drivers. For example, some drives use GPIO pin as interrupt source. On one specific platform, disable_irq would really disable the irq at RTE entry, which means we lose the wakeup capability of this device. synchronize_irq can be another solution. But we did hit 'DPM device timeout' issue reported by dpm_wd_handler at suspend-to-ram. The driver is complicated. We couldn't change too many functions to optimize it. In addition, we have to use the driver instead of throwing it away. What is preventing you from rewriting it to work properly? It's complicated. That sounds like your issue, not ours, so please don't push the problem onto someone else. Take ownership of the driver, fix it up, and all will be good. With irq_in_progress, we can resolve this issue and it does work, although it looks like ugly. Don't paper over driver bugs in the core kernel, fix the driver. It's hard to say it's a driver bug. Could generic kernel provide some flexibility for such complicated drivers? Please post the code for the driver, and then we will be glad to continue the dicussion. Greg, Thanks for your enthusiasm. It's a private driver for products. What do you mean private driver? All drivers need to be merged into the mainline kernel, it saves you time and money, and we will fix your bugs for you. You know that, your bosses know that, your management knows that, so why are you trying to hide things? totally confused, They are embedded device drivers, highly depending on specific devices which runs its own firmware in devices. Here the kernel drivers run at AP side. That's no different from loads of drivers that we have in the kernel today, no need to keep them from being merged, please submit them. It need managers' approval and is beyond my capability. One example is Graphics driver, which is very big and coding is not friendly. Kernel experts can raise tons of questions against the driver, but we have to make the driver work well on real products. So you are saying that kernel experts don't ask questions that actually make drivers work well on real products? Obviously, I didn't say that. Pls. also remove the , as I really think I'm
Re: [PATCH 0/2] Run callback of device_prepare/complete consistently
On Sat, 2013-06-08 at 03:52 +0200, Rafael J. Wysocki wrote: On Saturday, June 08, 2013 09:36:03 AM Yanmin Zhang wrote: On Sat, 2013-06-08 at 03:30 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 06:16:25 PM Greg KH wrote: On Sat, Jun 08, 2013 at 08:42:12AM +0800, Yanmin Zhang wrote: On Fri, 2013-06-07 at 12:36 +0200, Rafael J. Wysocki wrote: On Friday, June 07, 2013 04:20:30 PM shuox@intel.com wrote: dpm_run_callback is used in other stages of power states changing. It provides debug info message and time measurement when call these callback. We also want to benefit -prepare and -complete. [PATCH 1/2] PM: use dpm_run_callback in device_prepare [PATCH 2/2] PM: add dpm_run_callback_void and use it in device_complete Is this an Oh, why don't we do that? series, or is it useful for anything in practice? I'm asking, because we haven't added that stuff to start with since we didn't see why it would be useful to anyone. And while patch [1/2] reduces the code size (by 1 line), so I can see some (tiny) benefit from applying it, patch [2/2] adds more code and is there any paractical reason? Sometimes, suspend-to-ram path spends too much time (either suspend slowly or wakeup slowly) and we need optimize it. With the 2 patches, we could collect initcall_debug printk info and manually check what prepare/complete callbacks consume too much time. But initcall information is for initialization stuff, not suspend/resume things, right? Doesn't the existing tools for parsing this choke if it sees the information at suspend/resume time? We've been using that for suspend/resume for quite some time too, but not for the prepare/complete phases (because we still believe that's not really useful for them). Well, I'll be handling patches changing code under drivers/base/power, I promise. :-) I've been doing that for quite a few years now ... Yes, indeed. Power is one of the most important features on embedded devices. Lots of smart phones don't really go through the full cycles of suspend-to-ram. We are following the full steps of the suspend. But if you go through the code, you'll see that alomost no drivers actually implemet .prepare() and .complete(). Some subsystems do, but they really don't take too much time to execute. Which means that your patches with initcall_debug will add quite a pile of useless garbage to the kernel log Does that mean we need add more log levels around such info instead of just having or not having? Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 21:19 -0700, Greg KH wrote: > On Fri, Jun 07, 2013 at 11:18:06AM +0800, Yanmin Zhang wrote: > > On Thu, 2013-06-06 at 20:08 -0700, Greg KH wrote: > > > On Fri, Jun 07, 2013 at 10:37:52AM +0800, Yanmin Zhang wrote: > > > > On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: > > > > > On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: > > > > > > On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: > > > > > > > On Thu, 6 Jun 2013, shuox@intel.com wrote: > > > > > > > > From: Zhang Yanmin > > > > > > > > > > > > > > > > synchronize_irq waits pending IRQ handlers to be finished. If > > > > > > > > using this > > > > > > > > function while holding a resource, the IRQ handler may cause > > > > > > > > deadlock. > > > > > > > > > > > > > > > > Here we add a new function irq_in_progress which doesn't wait > > > > > > > > for the handlers > > > > > > > > to be finished. > > > > > > > > > > > > > > > > A typical use case at suspend-to-ram: > > > > > > > > > > > > > > > > device driver's irq handler is complicated and might hold a > > > > > > > > mutex at rare cases. > > > > > > > > Its suspend function is called and a suspended flag is set. > > > > > > > > In case its IRQ handler is running, suspend function calls > > > > > > > > irq_in_progress. if > > > > > > > > handler is running, abort suspend. > > > > > > > > The irq handler checks the suspended flag. If the device is > > > > > > > > suspended, irq handler > > > > > > > > either ignores the interrupt, or wakes up the whole system, and > > > > > > > > the driver's > > > > > > > > resume function could deal with the delayed interrupt handling. > > > > > > > > > > > > > > This is as wrong as it can be. Fix the driver instead of hacking > > > > > > > racy > > > > > > > functions into the core code. > > > > > > > > > > > > > > So your problem looks like this: > > > > > > > > > > > > > > CPU 0 CPU 1 > > > > > > > irq_handler_thread() suspend() > > > > > > >. mutex_lock(); > > > > > > >mutex_lock();synchronize_irq(); > > > > > > > > > > > > > > So why needs the mutex to be taken before synchronize_irq()? Why > > > > > > > not > > > > > > > doing the obvious? > > > > > > > > > > > > > > suspend() > > > > > > > disable_irq(); (Implies synchronize_irq) > > > > > > > mutex_lock(); > > > > > > > > > > > > > > mutex_unlock(); > > > > > > > enable_irq(); > > > > > > Thanks for the kind comment. > > > > > > > > > > > > We do consider your solution before and it works well indeed with > > > > > > some specific > > > > > > simple drivers. For example, some drives use GPIO pin as interrupt > > > > > > source. > > > > > > > > > > > > On one specific platform, disable_irq would really disable the irq > > > > > > at RTE entry, > > > > > > which means we lose the wakeup capability of this device. > > > > > > synchronize_irq can be another solution. But we did hit 'DPM device > > > > > > timeout' issue > > > > > > reported by dpm_wd_handler at suspend-to-ram. > > > > > > > > > > > > The driver is complicated. We couldn't change too many functions to > > > > > > optimize it. > > > > > > In addition, we have to use the driver instead of throwing it away. > > > > > > > > > > What is preventing you from rewriting it to work properly? > > > > It's complicated. > > > > > > That sounds like your issue, not ours, so please don't push the problem > &
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 20:08 -0700, Greg KH wrote: > On Fri, Jun 07, 2013 at 10:37:52AM +0800, Yanmin Zhang wrote: > > On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: > > > On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: > > > > On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: > > > > > On Thu, 6 Jun 2013, shuox@intel.com wrote: > > > > > > From: Zhang Yanmin > > > > > > > > > > > > synchronize_irq waits pending IRQ handlers to be finished. If using > > > > > > this > > > > > > function while holding a resource, the IRQ handler may cause > > > > > > deadlock. > > > > > > > > > > > > Here we add a new function irq_in_progress which doesn't wait for > > > > > > the handlers > > > > > > to be finished. > > > > > > > > > > > > A typical use case at suspend-to-ram: > > > > > > > > > > > > device driver's irq handler is complicated and might hold a mutex > > > > > > at rare cases. > > > > > > Its suspend function is called and a suspended flag is set. > > > > > > In case its IRQ handler is running, suspend function calls > > > > > > irq_in_progress. if > > > > > > handler is running, abort suspend. > > > > > > The irq handler checks the suspended flag. If the device is > > > > > > suspended, irq handler > > > > > > either ignores the interrupt, or wakes up the whole system, and the > > > > > > driver's > > > > > > resume function could deal with the delayed interrupt handling. > > > > > > > > > > This is as wrong as it can be. Fix the driver instead of hacking racy > > > > > functions into the core code. > > > > > > > > > > So your problem looks like this: > > > > > > > > > > CPU 0 CPU 1 > > > > > irq_handler_thread() suspend() > > > > >. mutex_lock(); > > > > >mutex_lock();synchronize_irq(); > > > > > > > > > > So why needs the mutex to be taken before synchronize_irq()? Why not > > > > > doing the obvious? > > > > > > > > > > suspend() > > > > > disable_irq(); (Implies synchronize_irq) > > > > > mutex_lock(); > > > > > > > > > > mutex_unlock(); > > > > > enable_irq(); > > > > Thanks for the kind comment. > > > > > > > > We do consider your solution before and it works well indeed with some > > > > specific > > > > simple drivers. For example, some drives use GPIO pin as interrupt > > > > source. > > > > > > > > On one specific platform, disable_irq would really disable the irq at > > > > RTE entry, > > > > which means we lose the wakeup capability of this device. > > > > synchronize_irq can be another solution. But we did hit 'DPM device > > > > timeout' issue > > > > reported by dpm_wd_handler at suspend-to-ram. > > > > > > > > The driver is complicated. We couldn't change too many functions to > > > > optimize it. > > > > In addition, we have to use the driver instead of throwing it away. > > > > > > What is preventing you from rewriting it to work properly? > > It's complicated. > > That sounds like your issue, not ours, so please don't push the problem > onto someone else. Take ownership of the driver, fix it up, and all > will be good. > > > > > > With irq_in_progress, we can resolve this issue and it does work, > > > > although it > > > > looks like ugly. > > > > > > Don't paper over driver bugs in the core kernel, fix the driver. > > It's hard to say it's a driver bug. Could generic kernel provide some > > flexibility for such complicated drivers? > > Please post the code for the driver, and then we will be glad to > continue the dicussion. Greg, Thanks for your enthusiasm. It's a private driver for products. Let's stop here and I wouldn't push the patch to upstream. Thanks all. Yanmin > > As for "complicated", just make it "uncomplicated", it's just code :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: > On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: > > On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: > > > On Thu, 6 Jun 2013, shuox@intel.com wrote: > > > > From: Zhang Yanmin > > > > > > > > synchronize_irq waits pending IRQ handlers to be finished. If using this > > > > function while holding a resource, the IRQ handler may cause deadlock. > > > > > > > > Here we add a new function irq_in_progress which doesn't wait for the > > > > handlers > > > > to be finished. > > > > > > > > A typical use case at suspend-to-ram: > > > > > > > > device driver's irq handler is complicated and might hold a mutex at > > > > rare cases. > > > > Its suspend function is called and a suspended flag is set. > > > > In case its IRQ handler is running, suspend function calls > > > > irq_in_progress. if > > > > handler is running, abort suspend. > > > > The irq handler checks the suspended flag. If the device is suspended, > > > > irq handler > > > > either ignores the interrupt, or wakes up the whole system, and the > > > > driver's > > > > resume function could deal with the delayed interrupt handling. > > > > > > This is as wrong as it can be. Fix the driver instead of hacking racy > > > functions into the core code. > > > > > > So your problem looks like this: > > > > > > CPU 0 CPU 1 > > > irq_handler_thread() suspend() > > >. mutex_lock(); > > >mutex_lock();synchronize_irq(); > > > > > > So why needs the mutex to be taken before synchronize_irq()? Why not > > > doing the obvious? > > > > > > suspend() > > > disable_irq(); (Implies synchronize_irq) > > > mutex_lock(); > > > > > > mutex_unlock(); > > > enable_irq(); > > Thanks for the kind comment. > > > > We do consider your solution before and it works well indeed with some > > specific > > simple drivers. For example, some drives use GPIO pin as interrupt source. > > > > On one specific platform, disable_irq would really disable the irq at RTE > > entry, > > which means we lose the wakeup capability of this device. > > synchronize_irq can be another solution. But we did hit 'DPM device > > timeout' issue > > reported by dpm_wd_handler at suspend-to-ram. > > > > The driver is complicated. We couldn't change too many functions to > > optimize it. > > In addition, we have to use the driver instead of throwing it away. > > What is preventing you from rewriting it to work properly? It's complicated. > > > With irq_in_progress, we can resolve this issue and it does work, although > > it > > looks like ugly. > > Don't paper over driver bugs in the core kernel, fix the driver. It's hard to say it's a driver bug. Could generic kernel provide some flexibility for such complicated drivers? For example, any driver's suspend can return error and the whole suspend-to-ram aborts. Can we say the driver has a bug? If so, why not to change all suspend/resume callbacks to return VOID? Current kernel already allows drivers to abort suspend-to-ram at some rare corner cases. Of course, if the abort happens frequently, it's a bug and we need fix it in driver. If it happens rarely, can we provide some flexibility for the driver? Thanks, Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: > On Thu, 6 Jun 2013, shuox@intel.com wrote: > > From: Zhang Yanmin > > > > synchronize_irq waits pending IRQ handlers to be finished. If using this > > function while holding a resource, the IRQ handler may cause deadlock. > > > > Here we add a new function irq_in_progress which doesn't wait for the > > handlers > > to be finished. > > > > A typical use case at suspend-to-ram: > > > > device driver's irq handler is complicated and might hold a mutex at rare > > cases. > > Its suspend function is called and a suspended flag is set. > > In case its IRQ handler is running, suspend function calls irq_in_progress. > > if > > handler is running, abort suspend. > > The irq handler checks the suspended flag. If the device is suspended, irq > > handler > > either ignores the interrupt, or wakes up the whole system, and the driver's > > resume function could deal with the delayed interrupt handling. > > This is as wrong as it can be. Fix the driver instead of hacking racy > functions into the core code. > > So your problem looks like this: > > CPU 0 CPU 1 > irq_handler_thread() suspend() >. mutex_lock(); >mutex_lock();synchronize_irq(); > > So why needs the mutex to be taken before synchronize_irq()? Why not > doing the obvious? > > suspend() > disable_irq(); (Implies synchronize_irq) > mutex_lock(); > > mutex_unlock(); > enable_irq(); Thanks for the kind comment. We do consider your solution before and it works well indeed with some specific simple drivers. For example, some drives use GPIO pin as interrupt source. On one specific platform, disable_irq would really disable the irq at RTE entry, which means we lose the wakeup capability of this device. synchronize_irq can be another solution. But we did hit 'DPM device timeout' issue reported by dpm_wd_handler at suspend-to-ram. The driver is complicated. We couldn't change too many functions to optimize it. In addition, we have to use the driver instead of throwing it away. With irq_in_progress, we can resolve this issue and it does work, although it looks like ugly. Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote: > On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang > wrote: > > On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote: > >> I'm not sure where we are with this patch. I think Joseph initially > >> reported a problem (though I haven't actually seen that), and this > >> patch fixed it, so it seems like there's something we want to do here. > > Yes, indeed. We checked Powerpc error handling and plan to use the > > similar method to deal with the issue. > > > > Sorry for replying very late. I move to Android smartphone area and am > > very busy on many top issues. > > OK. Can you point us at a bugzilla or email archive of Joseph's > original problem report, or post it to this thread? Then maybe > somebody else can pick it up and make progress on this. Joseph sent email to my outlook emailbox. I changed it a little to delete company sensitive product name. ===Joseph's original email to me Hi Tom and Yanmin, Hope this email can reach you well. I am working on a driver's PCI error recovery with AER. I have a question with one of my observations from platform AER driver's behavior. As your names and emails are listed to the PCIe AER driver coming with the kernel, I send this question to you: During injecting AER Non-Correctable/Fatal error and PCIe error recovery process, I observed the following from our Low Level Driver (LLD): 1. The LLD's error_detected() callback got called and the PCI channel state passed in as "pci_channel_io_frozen", as expected; 2. The LLD's error_detected() callback function returned with PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset; 3. The LLD's slot_reset() callback got called and it attempted to re-initialize the device hardware for the recovery. However, the PCI slot state was still in "pci_channel_io_frozen" and the pci_channel_offline() helper routine returned 1. This is not expected, and it actually prevented LLD in performing needed access to memory mapped I/O space for preparing the device for recovery; 4. Later, the LLD's resume() callback got called and the PCI slot state was set to "pci_channel_io_normal"; In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, "STEP 4 Slot Reset", it stated after platform has performed PCI slot reset and then calls LLD's slot_reset() callback: "This call gives drivers the chance to re-initialize the hardware (re-download firmware, etc.). At this point, the driver may assume that the card is in a fresh state and is fully functional. The slot is unfrozen and the driver has full access to PCI config space, memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) will also be available." I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the PCI slot state was set to "pci_channel_io_frozen" on AER_FATAL serverity, and only set back to "pci_channel_io_normal" in report_resume() function. The PCI slot state was not set to "pci_channel_io_normal" when invoking report_slot_reset(). As a comparison, I also perform the EEH error recovery with the LLD driver on a PPC platform, which indeed set the PCI slot state to "pci_channel_io_normal" when calling LLD's slot_reset() callback function. Can you please verify this platform AER driver's behavior is intended and will stay with the AER platform support, or it should be changed to be consistent with the PCI error recovery procedure described in the pci-error-recovery.txt documentation? I also noticed that before invoking the broadcast_error_message() function with "slot_reset", there is a comment in the 3.7.0 kernel source: /* * TODO: Should call platform-specific * functions to reset slot before calling * drivers' slot_reset callbacks? */ And I do see that only the PCIe Link_Reset was performed, no PCI fundamental reset was performed even the LLD set the PCIe device's "pdev->needs_freset = 1", is this the area that later will be added for performing needed PCI hot reset or fundamental reset at this stage? Your timely response is very appreciated. Thanks in advance and please let me know if you think I should redirect the question to someone else. Best Regards, Joseph Liu, Ph.D. Senior Principal Engineer Emulex Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote: On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang yanmin_zh...@linux.intel.com wrote: On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote: I'm not sure where we are with this patch. I think Joseph initially reported a problem (though I haven't actually seen that), and this patch fixed it, so it seems like there's something we want to do here. Yes, indeed. We checked Powerpc error handling and plan to use the similar method to deal with the issue. Sorry for replying very late. I move to Android smartphone area and am very busy on many top issues. OK. Can you point us at a bugzilla or email archive of Joseph's original problem report, or post it to this thread? Then maybe somebody else can pick it up and make progress on this. Joseph sent email to my outlook emailbox. I changed it a little to delete company sensitive product name. ===Joseph's original email to me Hi Tom and Yanmin, Hope this email can reach you well. I am working on a driver's PCI error recovery with AER. I have a question with one of my observations from platform AER driver's behavior. As your names and emails are listed to the PCIe AER driver coming with the kernel, I send this question to you: During injecting AER Non-Correctable/Fatal error and PCIe error recovery process, I observed the following from our Low Level Driver (LLD): 1. The LLD's error_detected() callback got called and the PCI channel state passed in as pci_channel_io_frozen, as expected; 2. The LLD's error_detected() callback function returned with PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset; 3. The LLD's slot_reset() callback got called and it attempted to re-initialize the device hardware for the recovery. However, the PCI slot state was still in pci_channel_io_frozen and the pci_channel_offline() helper routine returned 1. This is not expected, and it actually prevented LLD in performing needed access to memory mapped I/O space for preparing the device for recovery; 4. Later, the LLD's resume() callback got called and the PCI slot state was set to pci_channel_io_normal; In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, STEP 4 Slot Reset, it stated after platform has performed PCI slot reset and then calls LLD's slot_reset() callback: This call gives drivers the chance to re-initialize the hardware (re-download firmware, etc.). At this point, the driver may assume that the card is in a fresh state and is fully functional. The slot is unfrozen and the driver has full access to PCI config space, memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X) will also be available. I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the PCI slot state was set to pci_channel_io_frozen on AER_FATAL serverity, and only set back to pci_channel_io_normal in report_resume() function. The PCI slot state was not set to pci_channel_io_normal when invoking report_slot_reset(). As a comparison, I also perform the EEH error recovery with the LLD driver on a PPC platform, which indeed set the PCI slot state to pci_channel_io_normal when calling LLD's slot_reset() callback function. Can you please verify this platform AER driver's behavior is intended and will stay with the AER platform support, or it should be changed to be consistent with the PCI error recovery procedure described in the pci-error-recovery.txt documentation? I also noticed that before invoking the broadcast_error_message() function with slot_reset, there is a comment in the 3.7.0 kernel source: /* * TODO: Should call platform-specific * functions to reset slot before calling * drivers' slot_reset callbacks? */ And I do see that only the PCIe Link_Reset was performed, no PCI fundamental reset was performed even the LLD set the PCIe device's pdev-needs_freset = 1, is this the area that later will be added for performing needed PCI hot reset or fundamental reset at this stage? Your timely response is very appreciated. Thanks in advance and please let me know if you think I should redirect the question to someone else. Best Regards, Joseph Liu, Ph.D. Senior Principal Engineer Emulex Corporation -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: On Thu, 6 Jun 2013, shuox@intel.com wrote: From: Zhang Yanmin yanmin.zh...@intel.com synchronize_irq waits pending IRQ handlers to be finished. If using this function while holding a resource, the IRQ handler may cause deadlock. Here we add a new function irq_in_progress which doesn't wait for the handlers to be finished. A typical use case at suspend-to-ram: device driver's irq handler is complicated and might hold a mutex at rare cases. Its suspend function is called and a suspended flag is set. In case its IRQ handler is running, suspend function calls irq_in_progress. if handler is running, abort suspend. The irq handler checks the suspended flag. If the device is suspended, irq handler either ignores the interrupt, or wakes up the whole system, and the driver's resume function could deal with the delayed interrupt handling. This is as wrong as it can be. Fix the driver instead of hacking racy functions into the core code. So your problem looks like this: CPU 0 CPU 1 irq_handler_thread() suspend() . mutex_lock(m); mutex_lock(m);synchronize_irq(); So why needs the mutex to be taken before synchronize_irq()? Why not doing the obvious? suspend() disable_irq(); (Implies synchronize_irq) mutex_lock(m); mutex_unlock(m); enable_irq(); Thanks for the kind comment. We do consider your solution before and it works well indeed with some specific simple drivers. For example, some drives use GPIO pin as interrupt source. On one specific platform, disable_irq would really disable the irq at RTE entry, which means we lose the wakeup capability of this device. synchronize_irq can be another solution. But we did hit 'DPM device timeout' issue reported by dpm_wd_handler at suspend-to-ram. The driver is complicated. We couldn't change too many functions to optimize it. In addition, we have to use the driver instead of throwing it away. With irq_in_progress, we can resolve this issue and it does work, although it looks like ugly. Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: On Thu, 6 Jun 2013, shuox@intel.com wrote: From: Zhang Yanmin yanmin.zh...@intel.com synchronize_irq waits pending IRQ handlers to be finished. If using this function while holding a resource, the IRQ handler may cause deadlock. Here we add a new function irq_in_progress which doesn't wait for the handlers to be finished. A typical use case at suspend-to-ram: device driver's irq handler is complicated and might hold a mutex at rare cases. Its suspend function is called and a suspended flag is set. In case its IRQ handler is running, suspend function calls irq_in_progress. if handler is running, abort suspend. The irq handler checks the suspended flag. If the device is suspended, irq handler either ignores the interrupt, or wakes up the whole system, and the driver's resume function could deal with the delayed interrupt handling. This is as wrong as it can be. Fix the driver instead of hacking racy functions into the core code. So your problem looks like this: CPU 0 CPU 1 irq_handler_thread() suspend() . mutex_lock(m); mutex_lock(m);synchronize_irq(); So why needs the mutex to be taken before synchronize_irq()? Why not doing the obvious? suspend() disable_irq(); (Implies synchronize_irq) mutex_lock(m); mutex_unlock(m); enable_irq(); Thanks for the kind comment. We do consider your solution before and it works well indeed with some specific simple drivers. For example, some drives use GPIO pin as interrupt source. On one specific platform, disable_irq would really disable the irq at RTE entry, which means we lose the wakeup capability of this device. synchronize_irq can be another solution. But we did hit 'DPM device timeout' issue reported by dpm_wd_handler at suspend-to-ram. The driver is complicated. We couldn't change too many functions to optimize it. In addition, we have to use the driver instead of throwing it away. What is preventing you from rewriting it to work properly? It's complicated. With irq_in_progress, we can resolve this issue and it does work, although it looks like ugly. Don't paper over driver bugs in the core kernel, fix the driver. It's hard to say it's a driver bug. Could generic kernel provide some flexibility for such complicated drivers? For example, any driver's suspend can return error and the whole suspend-to-ram aborts. Can we say the driver has a bug? If so, why not to change all suspend/resume callbacks to return VOID? Current kernel already allows drivers to abort suspend-to-ram at some rare corner cases. Of course, if the abort happens frequently, it's a bug and we need fix it in driver. If it happens rarely, can we provide some flexibility for the driver? Thanks, Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 20:08 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 10:37:52AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: On Thu, 6 Jun 2013, shuox@intel.com wrote: From: Zhang Yanmin yanmin.zh...@intel.com synchronize_irq waits pending IRQ handlers to be finished. If using this function while holding a resource, the IRQ handler may cause deadlock. Here we add a new function irq_in_progress which doesn't wait for the handlers to be finished. A typical use case at suspend-to-ram: device driver's irq handler is complicated and might hold a mutex at rare cases. Its suspend function is called and a suspended flag is set. In case its IRQ handler is running, suspend function calls irq_in_progress. if handler is running, abort suspend. The irq handler checks the suspended flag. If the device is suspended, irq handler either ignores the interrupt, or wakes up the whole system, and the driver's resume function could deal with the delayed interrupt handling. This is as wrong as it can be. Fix the driver instead of hacking racy functions into the core code. So your problem looks like this: CPU 0 CPU 1 irq_handler_thread() suspend() . mutex_lock(m); mutex_lock(m);synchronize_irq(); So why needs the mutex to be taken before synchronize_irq()? Why not doing the obvious? suspend() disable_irq(); (Implies synchronize_irq) mutex_lock(m); mutex_unlock(m); enable_irq(); Thanks for the kind comment. We do consider your solution before and it works well indeed with some specific simple drivers. For example, some drives use GPIO pin as interrupt source. On one specific platform, disable_irq would really disable the irq at RTE entry, which means we lose the wakeup capability of this device. synchronize_irq can be another solution. But we did hit 'DPM device timeout' issue reported by dpm_wd_handler at suspend-to-ram. The driver is complicated. We couldn't change too many functions to optimize it. In addition, we have to use the driver instead of throwing it away. What is preventing you from rewriting it to work properly? It's complicated. That sounds like your issue, not ours, so please don't push the problem onto someone else. Take ownership of the driver, fix it up, and all will be good. With irq_in_progress, we can resolve this issue and it does work, although it looks like ugly. Don't paper over driver bugs in the core kernel, fix the driver. It's hard to say it's a driver bug. Could generic kernel provide some flexibility for such complicated drivers? Please post the code for the driver, and then we will be glad to continue the dicussion. Greg, Thanks for your enthusiasm. It's a private driver for products. Let's stop here and I wouldn't push the patch to upstream. Thanks all. Yanmin As for complicated, just make it uncomplicated, it's just code :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] irq: add a new function irq_in_progress to check pending IRQ handlers
On Thu, 2013-06-06 at 21:19 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 11:18:06AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 20:08 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 10:37:52AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 18:02 -0700, Greg KH wrote: On Fri, Jun 07, 2013 at 08:53:29AM +0800, Yanmin Zhang wrote: On Thu, 2013-06-06 at 15:18 +0200, Thomas Gleixner wrote: On Thu, 6 Jun 2013, shuox@intel.com wrote: From: Zhang Yanmin yanmin.zh...@intel.com synchronize_irq waits pending IRQ handlers to be finished. If using this function while holding a resource, the IRQ handler may cause deadlock. Here we add a new function irq_in_progress which doesn't wait for the handlers to be finished. A typical use case at suspend-to-ram: device driver's irq handler is complicated and might hold a mutex at rare cases. Its suspend function is called and a suspended flag is set. In case its IRQ handler is running, suspend function calls irq_in_progress. if handler is running, abort suspend. The irq handler checks the suspended flag. If the device is suspended, irq handler either ignores the interrupt, or wakes up the whole system, and the driver's resume function could deal with the delayed interrupt handling. This is as wrong as it can be. Fix the driver instead of hacking racy functions into the core code. So your problem looks like this: CPU 0 CPU 1 irq_handler_thread() suspend() . mutex_lock(m); mutex_lock(m);synchronize_irq(); So why needs the mutex to be taken before synchronize_irq()? Why not doing the obvious? suspend() disable_irq(); (Implies synchronize_irq) mutex_lock(m); mutex_unlock(m); enable_irq(); Thanks for the kind comment. We do consider your solution before and it works well indeed with some specific simple drivers. For example, some drives use GPIO pin as interrupt source. On one specific platform, disable_irq would really disable the irq at RTE entry, which means we lose the wakeup capability of this device. synchronize_irq can be another solution. But we did hit 'DPM device timeout' issue reported by dpm_wd_handler at suspend-to-ram. The driver is complicated. We couldn't change too many functions to optimize it. In addition, we have to use the driver instead of throwing it away. What is preventing you from rewriting it to work properly? It's complicated. That sounds like your issue, not ours, so please don't push the problem onto someone else. Take ownership of the driver, fix it up, and all will be good. With irq_in_progress, we can resolve this issue and it does work, although it looks like ugly. Don't paper over driver bugs in the core kernel, fix the driver. It's hard to say it's a driver bug. Could generic kernel provide some flexibility for such complicated drivers? Please post the code for the driver, and then we will be glad to continue the dicussion. Greg, Thanks for your enthusiasm. It's a private driver for products. What do you mean private driver? All drivers need to be merged into the mainline kernel, it saves you time and money, and we will fix your bugs for you. You know that, your bosses know that, your management knows that, so why are you trying to hide things? totally confused, They are embedded device drivers, highly depending on specific devices which runs its own firmware in devices. Here the kernel drivers run at AP side. One example is Graphics driver, which is very big and coding is not friendly. Kernel experts can raise tons of questions against the driver, but we have to make the driver work well on real products. Another reason is drivers need workaround many hardware issues. That makes it hard to implement kernel drivers in good shape sometimes. We need support all cases. We fixed lots of hard issues on embedded products and think if kernel could be more flexible to support complicated cases. Anyway, thanks. Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote: > I'm not sure where we are with this patch. I think Joseph initially > reported a problem (though I haven't actually seen that), and this > patch fixed it, so it seems like there's something we want to do here. Yes, indeed. We checked Powerpc error handling and plan to use the similar method to deal with the issue. Sorry for replying very late. I move to Android smartphone area and am very busy on many top issues. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote: I'm not sure where we are with this patch. I think Joseph initially reported a problem (though I haven't actually seen that), and this patch fixed it, so it seems like there's something we want to do here. Yes, indeed. We checked Powerpc error handling and plan to use the similar method to deal with the issue. Sorry for replying very late. I move to Android smartphone area and am very busy on many top issues. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Mon, 2013-05-20 at 10:37 -0500, Linas Vepstas wrote: > I think Joe Liu has it right. I'm going to top-post because things > are a bit tangled below. I urge a review of > /Documentation/PCI/pci-error-recovery.txt, as that gives the details. > > The intended sequence is that, after an error, the device driver gets > a shot at running some diagnostics & dumps, and then the pci > bridges/controllers/ports/links are reset (by platform code, viz. aer > in this case) to a state resembling a fresh power-on. Then the > .slot_reset() callback is called on the device driver, to tell the > driver "hey everything upstream is now working, go set yourself up for > normal ops." Thus, in particular, one should have pdev->error_state = > pci_channel_io_normal; before .slot_reset() is called, and the PCI > config space contents should resemble a fresh power-on state (and > **NOT** some other saved state!) > > If the device driver wanted to leave the card in a dead state, it had > several opportunities to say so, earlier in the callback sequence. If > the driver wanted to fiddle with the card with the old PCI config > space, it already had a chance to do that too, before the > bridges/controllers/ports/links were fully reset by the platform. By > the time that .slot_reset() is being called, both the platform and the > device driver are expecting smooth sailing ahead. Yes. It's flexible for drivers to do that in many callbacks. AER framework provides such flexibility. > > So, looking at the original patch, it seems reasonable. I agree. > My impression > is that maybe the AER driver had been doing not quite the right thing > for a long time. Pls. provide evidence/facts. The new patch is to facilitate device driver implementation. It doesn't mean current AER driver is incorrect. We need a tradeoff. Just like what Bjoin says, we shouldn't change error_state to pci_channel_io_normal before we really recover the hardware. The patch changes it just because drivers might call some functions to recover the devices, while such functions need (error_state==pci_channel_io_normal). > > -- Linas Vepstas -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Mon, 2013-05-20 at 16:48 -0600, Bjorn Helgaas wrote: > On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote: > > From: Zhang Long > > > > Specific pci device drivers might have many functions to call > > pci_channel_offline to check device states. When slot_reset happens, > > drivers' slot_reset callback might call such functions and eventually > > abort the reset. > > > > The patch resets pdev->error_state to pci_channel_io_normal at > > the begining of report_slot_reset. > > > > Thank Liu Joseph for pointing it out. > > > > Signed-off-by: Zhang Yanmin > > Signed-off-by: Zhang Long > > --- > > drivers/pci/pcie/aer/aerdrv_core.c |1 + > > drivers/pci/pcie/portdrv_pci.c | 12 +--- > > 2 files changed, 6 insertions(+), 7 deletions(-) Thank all for the kind comments. > > > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > > b/drivers/pci/pcie/aer/aerdrv_core.c > > index 564d97f..c61fd44 100644 > > --- a/drivers/pci/pcie/aer/aerdrv_core.c > > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > > @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void > > *data) > > result_data = (struct aer_broadcast_data *) data; > > > > device_lock(>dev); > > + dev->error_state = pci_channel_io_normal; > > if (!dev->driver || > > !dev->driver->err_handler || > > !dev->driver->err_handler->slot_reset) > > diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c > > index ed4d094..7abefd9 100644 > > --- a/drivers/pci/pcie/portdrv_pci.c > > +++ b/drivers/pci/pcie/portdrv_pci.c > > @@ -332,13 +332,11 @@ static pci_ers_result_t > > pcie_portdrv_slot_reset(struct pci_dev *dev) > > pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED; > > int retval; > > > > - /* If fatal, restore cfg space for possible link reset at upstream */ > > - if (dev->error_state == pci_channel_io_frozen) { > > - dev->state_saved = true; > > - pci_restore_state(dev); > > - pcie_portdrv_restore_config(dev); > > - pci_enable_pcie_error_reporting(dev); > > - } > > + /* restore cfg space for possible link reset at upstream */ > > + dev->state_saved = true; It's indeed a dirty trick to change it to true. The reason is suspend. The port would suspend/resume at suspend-to-ram. When the port is suspended, PCI power framework calls pci_save_state. When the port is resumed, PCI framework calls pci_restore_state and dev->state_saved is set to false. A better solution is to add double space in pci_dev->saved_config_space/saved_cap_space, which seems consume unnecessary resource. > > + pci_restore_state(dev); > > + pcie_portdrv_restore_config(dev); > > + pci_enable_pcie_error_reporting(dev); > > > > /* get true return value from */ > > retval = device_for_each_child(>dev, , slot_reset_iter); > I think this patch changes the behavior in the case of a non-fatal error > where one of the .error_detected() methods returned > PCI_ERS_RESULT_NEED_RESET. In that case, pcie_portdrv_slot_reset() > previously did not restore config space, but after your patch, it *will* > restore it. We need an explanation of why this is safe. AER standard doesn't define such behavior like if we need reset a slot. When we implemented the feature in kernel, we assumed at non-fatal error, config space is still good. With the new patch, we change the behavior a little. > > I think you should split this into two patches: the first would remove the > "if (dev->error_state == pci_channel_io_frozen)" test from portdrv_pci.c The first patch would depend on the 2nd patch as 2nd patch resets error_state to pci_channel_io_normal. > and explain the reason, and the second would make the aerdrv_core.c change. > > I'm also concerned that in that same case (a non-fatal error where one of > the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET), I don't > think we actually *do* any kind of device reset. This isn't related to > your patch, of course, so if you resolve the config space restore question, > we can deal with the reset question later. It's a good question. At the beginning when we enabled AER feature in kernel, we didn't really implement the real device reset. It's in the TODO list. Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Mon, 2013-05-20 at 16:48 -0600, Bjorn Helgaas wrote: On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote: From: Zhang Long longx.zh...@intel.com Specific pci device drivers might have many functions to call pci_channel_offline to check device states. When slot_reset happens, drivers' slot_reset callback might call such functions and eventually abort the reset. The patch resets pdev-error_state to pci_channel_io_normal at the begining of report_slot_reset. Thank Liu Joseph for pointing it out. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Signed-off-by: Zhang Long longx.zh...@intel.com --- drivers/pci/pcie/aer/aerdrv_core.c |1 + drivers/pci/pcie/portdrv_pci.c | 12 +--- 2 files changed, 6 insertions(+), 7 deletions(-) Thank all for the kind comments. diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 564d97f..c61fd44 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data) result_data = (struct aer_broadcast_data *) data; device_lock(dev-dev); + dev-error_state = pci_channel_io_normal; if (!dev-driver || !dev-driver-err_handler || !dev-driver-err_handler-slot_reset) diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index ed4d094..7abefd9 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev) pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED; int retval; - /* If fatal, restore cfg space for possible link reset at upstream */ - if (dev-error_state == pci_channel_io_frozen) { - dev-state_saved = true; - pci_restore_state(dev); - pcie_portdrv_restore_config(dev); - pci_enable_pcie_error_reporting(dev); - } + /* restore cfg space for possible link reset at upstream */ + dev-state_saved = true; It's indeed a dirty trick to change it to true. The reason is suspend. The port would suspend/resume at suspend-to-ram. When the port is suspended, PCI power framework calls pci_save_state. When the port is resumed, PCI framework calls pci_restore_state and dev-state_saved is set to false. A better solution is to add double space in pci_dev-saved_config_space/saved_cap_space, which seems consume unnecessary resource. + pci_restore_state(dev); + pcie_portdrv_restore_config(dev); + pci_enable_pcie_error_reporting(dev); /* get true return value from status */ retval = device_for_each_child(dev-dev, status, slot_reset_iter); I think this patch changes the behavior in the case of a non-fatal error where one of the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET. In that case, pcie_portdrv_slot_reset() previously did not restore config space, but after your patch, it *will* restore it. We need an explanation of why this is safe. AER standard doesn't define such behavior like if we need reset a slot. When we implemented the feature in kernel, we assumed at non-fatal error, config space is still good. With the new patch, we change the behavior a little. I think you should split this into two patches: the first would remove the if (dev-error_state == pci_channel_io_frozen) test from portdrv_pci.c The first patch would depend on the 2nd patch as 2nd patch resets error_state to pci_channel_io_normal. and explain the reason, and the second would make the aerdrv_core.c change. I'm also concerned that in that same case (a non-fatal error where one of the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET), I don't think we actually *do* any kind of device reset. This isn't related to your patch, of course, so if you resolve the config space restore question, we can deal with the reset question later. It's a good question. At the beginning when we enabled AER feature in kernel, we didn't really implement the real device reset. It's in the TODO list. Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Mon, 2013-05-20 at 10:37 -0500, Linas Vepstas wrote: I think Joe Liu has it right. I'm going to top-post because things are a bit tangled below. I urge a review of /Documentation/PCI/pci-error-recovery.txt, as that gives the details. The intended sequence is that, after an error, the device driver gets a shot at running some diagnostics dumps, and then the pci bridges/controllers/ports/links are reset (by platform code, viz. aer in this case) to a state resembling a fresh power-on. Then the .slot_reset() callback is called on the device driver, to tell the driver hey everything upstream is now working, go set yourself up for normal ops. Thus, in particular, one should have pdev-error_state = pci_channel_io_normal; before .slot_reset() is called, and the PCI config space contents should resemble a fresh power-on state (and **NOT** some other saved state!) If the device driver wanted to leave the card in a dead state, it had several opportunities to say so, earlier in the callback sequence. If the driver wanted to fiddle with the card with the old PCI config space, it already had a chance to do that too, before the bridges/controllers/ports/links were fully reset by the platform. By the time that .slot_reset() is being called, both the platform and the device driver are expecting smooth sailing ahead. Yes. It's flexible for drivers to do that in many callbacks. AER framework provides such flexibility. So, looking at the original patch, it seems reasonable. I agree. My impression is that maybe the AER driver had been doing not quite the right thing for a long time. Pls. provide evidence/facts. The new patch is to facilitate device driver implementation. It doesn't mean current AER driver is incorrect. We need a tradeoff. Just like what Bjoin says, we shouldn't change error_state to pci_channel_io_normal before we really recover the hardware. The patch changes it just because drivers might call some functions to recover the devices, while such functions need (error_state==pci_channel_io_normal). -- Linas Vepstas -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Fri, 2013-05-03 at 13:00 -0500, Linas Vepstas wrote: > Greg, > > I've been receiving (and reading!) these messages; I replied that I am > not maintaining a tree, and have no way of testing these patches (no > access to hardware.) I believe it unlikely that my situation will > change, and so I should probably be removed from the maintainers file. > > No acked-by or anything; the patches are not "obviously correct" by > visual inspection; they may be right, but would require deeper > thinking about what is actually happening than I'm capable of at this > time; I'm a bit rusty with this code base, and might miss something > important powerpc uses the similar method. See function eeh_report_reset. We worked out the patch after checking powerpc codes. Joseph Liu helped test the patch and the patch does work well. > . > > -- Linas > > p.s. you didn't see my earlier reply because I forgot to hit 'plain text > reply' > > > On 2 May 2013 22:13, Yanmin Zhang wrote: > > > > On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote: > > > On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote: > > > > On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote: > > > > > Hi, > > > > > > > > > > On 1 May 2013 19:30, Yanmin Zhang > > > > > wrote: > > > > > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: > > > > > > From: Zhang Long > > > > > > > > > > > > Specific pci device drivers might have many functions to > > > > > call > > > > > > pci_channel_offline to check device states. When slot_reset > > > > > happens, > > > > > > drivers' slot_reset callback might call such functions and > > > > > eventually > > > > > > abort the reset. > > > > > > > > > > > > The patch resets pdev->error_state to pci_channel_io_normal > > > > > at > > > > > > the begining of report_slot_reset. > > > > > > > > > > > > Thank Liu Joseph for pointing it out. > > > > > > > > > > Linas, Bjorn, > > > > > > > > > > Would you like to merge the patch to your testing tree? > > > > > The patch resolves one issue pointed out by Joseph. > > > > > > > > > > > > > > > I'm not maintaining a tree at this time, and am not able to test. > > > > > Sorry. > > > > Thanks Linas. > > > > > > > > Greg, > > > > > > > > Would you like to merge it into your testing tree? Joseph Liu tested > > > > the patch and it does work. > > > > > > You do know about the scripts/get_maintainer.pl script, right? > > > > > > Hint, try it out :) > > Greg, > > > > Thanks. We did send to the right people, Linas and Bjorn. > > > > scripts/get_maintainer.pl > > 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch > > Bjorn Helgaas (supporter:PCI > > SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%) > > Linas Vepstas (commit_signer:2/8=25%) > > Yijing Wang > > (commit_signer:2/8=25%,commit_signer:2/11=18%) > > Jiang Liu > > (commit_signer:2/8=25%,commit_signer:2/11=18%) > > Stephen Hemminger (commit_signer:1/8=12%) > > "Rafael J. Wysocki" (commit_signer:6/11=55%) > > Huang Ying (commit_signer:5/11=45%) > > linux-...@vger.kernel.org (open list:PCI SUBSYSTEM) > > linux-kernel@vger.kernel.org (open list) > > > > > > I remember Jesse was maintaining PCI subsystem and he responded quickly. > > > > Yanmin > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Fri, 2013-05-03 at 13:00 -0500, Linas Vepstas wrote: Greg, I've been receiving (and reading!) these messages; I replied that I am not maintaining a tree, and have no way of testing these patches (no access to hardware.) I believe it unlikely that my situation will change, and so I should probably be removed from the maintainers file. No acked-by or anything; the patches are not obviously correct by visual inspection; they may be right, but would require deeper thinking about what is actually happening than I'm capable of at this time; I'm a bit rusty with this code base, and might miss something important powerpc uses the similar method. See function eeh_report_reset. We worked out the patch after checking powerpc codes. Joseph Liu helped test the patch and the patch does work well. . -- Linas p.s. you didn't see my earlier reply because I forgot to hit 'plain text reply' On 2 May 2013 22:13, Yanmin Zhang yanmin_zh...@linux.intel.com wrote: On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote: On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote: On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote: Hi, On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com wrote: On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: From: Zhang Long longx.zh...@intel.com Specific pci device drivers might have many functions to call pci_channel_offline to check device states. When slot_reset happens, drivers' slot_reset callback might call such functions and eventually abort the reset. The patch resets pdev-error_state to pci_channel_io_normal at the begining of report_slot_reset. Thank Liu Joseph for pointing it out. Linas, Bjorn, Would you like to merge the patch to your testing tree? The patch resolves one issue pointed out by Joseph. I'm not maintaining a tree at this time, and am not able to test. Sorry. Thanks Linas. Greg, Would you like to merge it into your testing tree? Joseph Liu tested the patch and it does work. You do know about the scripts/get_maintainer.pl script, right? Hint, try it out :) Greg, Thanks. We did send to the right people, Linas and Bjorn. scripts/get_maintainer.pl 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch Bjorn Helgaas bhelg...@google.com (supporter:PCI SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%) Linas Vepstas linasveps...@gmail.com (commit_signer:2/8=25%) Yijing Wang wangyij...@huawei.com (commit_signer:2/8=25%,commit_signer:2/11=18%) Jiang Liu jiang@huawei.com (commit_signer:2/8=25%,commit_signer:2/11=18%) Stephen Hemminger shemmin...@vyatta.com (commit_signer:1/8=12%) Rafael J. Wysocki rafael.j.wyso...@intel.com (commit_signer:6/11=55%) Huang Ying ying.hu...@intel.com (commit_signer:5/11=45%) linux-...@vger.kernel.org (open list:PCI SUBSYSTEM) linux-kernel@vger.kernel.org (open list) I remember Jesse was maintaining PCI subsystem and he responded quickly. Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote: > On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote: > > On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote: > > > Hi, > > > > > > On 1 May 2013 19:30, Yanmin Zhang > > > wrote: > > > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: > > > > From: Zhang Long > > > > > > > > Specific pci device drivers might have many functions to > > > call > > > > pci_channel_offline to check device states. When slot_reset > > > happens, > > > > drivers' slot_reset callback might call such functions and > > > eventually > > > > abort the reset. > > > > > > > > The patch resets pdev->error_state to pci_channel_io_normal > > > at > > > > the begining of report_slot_reset. > > > > > > > > Thank Liu Joseph for pointing it out. > > > > > > Linas, Bjorn, > > > > > > Would you like to merge the patch to your testing tree? > > > The patch resolves one issue pointed out by Joseph. > > > > > > > > > I'm not maintaining a tree at this time, and am not able to test. > > > Sorry. > > Thanks Linas. > > > > Greg, > > > > Would you like to merge it into your testing tree? Joseph Liu tested > > the patch and it does work. > > You do know about the scripts/get_maintainer.pl script, right? > > Hint, try it out :) Greg, Thanks. We did send to the right people, Linas and Bjorn. scripts/get_maintainer.pl 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch Bjorn Helgaas (supporter:PCI SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%) Linas Vepstas (commit_signer:2/8=25%) Yijing Wang (commit_signer:2/8=25%,commit_signer:2/11=18%) Jiang Liu (commit_signer:2/8=25%,commit_signer:2/11=18%) Stephen Hemminger (commit_signer:1/8=12%) "Rafael J. Wysocki" (commit_signer:6/11=55%) Huang Ying (commit_signer:5/11=45%) linux-...@vger.kernel.org (open list:PCI SUBSYSTEM) linux-kernel@vger.kernel.org (open list) I remember Jesse was maintaining PCI subsystem and he responded quickly. Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote: > Hi, > > On 1 May 2013 19:30, Yanmin Zhang > wrote: > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: > > From: Zhang Long > > > > Specific pci device drivers might have many functions to > call > > pci_channel_offline to check device states. When slot_reset > happens, > > drivers' slot_reset callback might call such functions and > eventually > > abort the reset. > > > > The patch resets pdev->error_state to pci_channel_io_normal > at > > the begining of report_slot_reset. > > > > Thank Liu Joseph for pointing it out. > > Linas, Bjorn, > > Would you like to merge the patch to your testing tree? > The patch resolves one issue pointed out by Joseph. > > > I'm not maintaining a tree at this time, and am not able to test. > Sorry. Thanks Linas. Greg, Would you like to merge it into your testing tree? Joseph Liu tested the patch and it does work. Thanks, Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote: Hi, On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com wrote: On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: From: Zhang Long longx.zh...@intel.com Specific pci device drivers might have many functions to call pci_channel_offline to check device states. When slot_reset happens, drivers' slot_reset callback might call such functions and eventually abort the reset. The patch resets pdev-error_state to pci_channel_io_normal at the begining of report_slot_reset. Thank Liu Joseph for pointing it out. Linas, Bjorn, Would you like to merge the patch to your testing tree? The patch resolves one issue pointed out by Joseph. I'm not maintaining a tree at this time, and am not able to test. Sorry. Thanks Linas. Greg, Would you like to merge it into your testing tree? Joseph Liu tested the patch and it does work. Thanks, Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote: On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote: On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote: Hi, On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com wrote: On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: From: Zhang Long longx.zh...@intel.com Specific pci device drivers might have many functions to call pci_channel_offline to check device states. When slot_reset happens, drivers' slot_reset callback might call such functions and eventually abort the reset. The patch resets pdev-error_state to pci_channel_io_normal at the begining of report_slot_reset. Thank Liu Joseph for pointing it out. Linas, Bjorn, Would you like to merge the patch to your testing tree? The patch resolves one issue pointed out by Joseph. I'm not maintaining a tree at this time, and am not able to test. Sorry. Thanks Linas. Greg, Would you like to merge it into your testing tree? Joseph Liu tested the patch and it does work. You do know about the scripts/get_maintainer.pl script, right? Hint, try it out :) Greg, Thanks. We did send to the right people, Linas and Bjorn. scripts/get_maintainer.pl 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch Bjorn Helgaas bhelg...@google.com (supporter:PCI SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%) Linas Vepstas linasveps...@gmail.com (commit_signer:2/8=25%) Yijing Wang wangyij...@huawei.com (commit_signer:2/8=25%,commit_signer:2/11=18%) Jiang Liu jiang@huawei.com (commit_signer:2/8=25%,commit_signer:2/11=18%) Stephen Hemminger shemmin...@vyatta.com (commit_signer:1/8=12%) Rafael J. Wysocki rafael.j.wyso...@intel.com (commit_signer:6/11=55%) Huang Ying ying.hu...@intel.com (commit_signer:5/11=45%) linux-...@vger.kernel.org (open list:PCI SUBSYSTEM) linux-kernel@vger.kernel.org (open list) I remember Jesse was maintaining PCI subsystem and he responded quickly. Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: > From: Zhang Long > > Specific pci device drivers might have many functions to call > pci_channel_offline to check device states. When slot_reset happens, > drivers' slot_reset callback might call such functions and eventually > abort the reset. > > The patch resets pdev->error_state to pci_channel_io_normal at > the begining of report_slot_reset. > > Thank Liu Joseph for pointing it out. Linas, Bjorn, Would you like to merge the patch to your testing tree? The patch resolves one issue pointed out by Joseph. Thanks, Yanmin > > Signed-off-by: Zhang Yanmin > Signed-off-by: Zhang Long > --- > drivers/pci/pcie/aer/aerdrv_core.c |1 + > drivers/pci/pcie/portdrv_pci.c | 12 +--- > 2 files changed, 6 insertions(+), 7 deletions(-) > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > b/drivers/pci/pcie/aer/aerdrv_core.c > index 564d97f..c61fd44 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void > *data) > result_data = (struct aer_broadcast_data *) data; > > device_lock(>dev); > + dev->error_state = pci_channel_io_normal; > if (!dev->driver || > !dev->driver->err_handler || > !dev->driver->err_handler->slot_reset) > diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c > index ed4d094..7abefd9 100644 > --- a/drivers/pci/pcie/portdrv_pci.c > +++ b/drivers/pci/pcie/portdrv_pci.c > @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct > pci_dev *dev) > pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED; > int retval; > > - /* If fatal, restore cfg space for possible link reset at upstream */ > - if (dev->error_state == pci_channel_io_frozen) { > - dev->state_saved = true; > - pci_restore_state(dev); > - pcie_portdrv_restore_config(dev); > - pci_enable_pcie_error_reporting(dev); > - } > + /* restore cfg space for possible link reset at upstream */ > + dev->state_saved = true; > + pci_restore_state(dev); > + pcie_portdrv_restore_config(dev); > + pci_enable_pcie_error_reporting(dev); > > /* get true return value from */ > retval = device_for_each_child(>dev, , slot_reset_iter); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote: From: Zhang Long longx.zh...@intel.com Specific pci device drivers might have many functions to call pci_channel_offline to check device states. When slot_reset happens, drivers' slot_reset callback might call such functions and eventually abort the reset. The patch resets pdev-error_state to pci_channel_io_normal at the begining of report_slot_reset. Thank Liu Joseph for pointing it out. Linas, Bjorn, Would you like to merge the patch to your testing tree? The patch resolves one issue pointed out by Joseph. Thanks, Yanmin Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Signed-off-by: Zhang Long longx.zh...@intel.com --- drivers/pci/pcie/aer/aerdrv_core.c |1 + drivers/pci/pcie/portdrv_pci.c | 12 +--- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 564d97f..c61fd44 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data) result_data = (struct aer_broadcast_data *) data; device_lock(dev-dev); + dev-error_state = pci_channel_io_normal; if (!dev-driver || !dev-driver-err_handler || !dev-driver-err_handler-slot_reset) diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index ed4d094..7abefd9 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev) pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED; int retval; - /* If fatal, restore cfg space for possible link reset at upstream */ - if (dev-error_state == pci_channel_io_frozen) { - dev-state_saved = true; - pci_restore_state(dev); - pcie_portdrv_restore_config(dev); - pci_enable_pcie_error_reporting(dev); - } + /* restore cfg space for possible link reset at upstream */ + dev-state_saved = true; + pci_restore_state(dev); + pcie_portdrv_restore_config(dev); + pci_enable_pcie_error_reporting(dev); /* get true return value from status */ retval = device_for_each_child(dev-dev, status, slot_reset_iter); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] output the cpu number when printking.
On Mon, 2012-12-24 at 09:55 -0800, Greg KH wrote: > On Mon, Dec 24, 2012 at 01:01:55PM +0800, he, bo wrote: > > From: "he, bo" > > > > We often hit kernel panic issues on SMP machines because processes race > > on multiple cpu. By adding a new parameter printk.cpu, kernel prints > > cpu number at printk information line. It’s useful to debug what cpus > > are racing. > > How useful is this really for normal developers? It's very useful to debug race conditions under SMP environment. We applied the patch to our Android build image on our smartphones. > We complained when > this option was proposed by the ARM developers who were, for the first > time, handling more than one processor and the issues involved with > that. You are enabling this as a default option, for all developers, > and almost no one will ever need it. At least, I am a developer to use it every day. :). I need debug various weird issues and tried many debug tools. Anyway, we would change the default to _disable_. > > So, if you really want this, don't enable this by default. Also, go > back and read the old thread about this option and why it was rejeted > previously. Thanks for your kind reminder. I googled it. Is the link like https://lkml.org/lkml/2012/8/3/121? Or http://lkml.indiana.edu/hypermail/linux/kernel/1208.0/02097.html? Above links raise a good question that we might be able to use ftrace (trace_printk) instead of to extend printk. 1) There are indeed lots of debug methods/tools in kernel because of many excellent developers' contributions. However, the most frequently-used tool is still printk. 2) ftrace (I use it often) is good when system mostly doesn't crash. If system crashes, we couldn't get ftrace info sometimes. When we debug some hard issues on a multi-core device, the system often crashes when Graphics driver accesses registers incorrectly. There is no chance for kernel to dump ftrace or panic info. With printk, we find that's because 2 cpu are racing. One cpu turns off Graphics while another cpu is accessing it. Of course, we also applied the Android persistent ram patch, so after rebooting, we still could see the printk buffer of previous booting. 3) Current drivers usually have many dev_dbg statements. We could enable dynamic debug control to get such info quickly. With the printk cpu number patch, we quickly enhance them to dump info about potential races among cpu. With ftrace, we need change all these dev_dbg to trace_printk and that's not the original objective of driver developers. Mostly, the patch is to reuse all current printk statements in kernel with a little overhead. Thanks for your kind comments. Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] output the cpu number when printking.
On Mon, 2012-12-24 at 09:55 -0800, Greg KH wrote: On Mon, Dec 24, 2012 at 01:01:55PM +0800, he, bo wrote: From: he, bo bo...@intel.com We often hit kernel panic issues on SMP machines because processes race on multiple cpu. By adding a new parameter printk.cpu, kernel prints cpu number at printk information line. It’s useful to debug what cpus are racing. How useful is this really for normal developers? It's very useful to debug race conditions under SMP environment. We applied the patch to our Android build image on our smartphones. We complained when this option was proposed by the ARM developers who were, for the first time, handling more than one processor and the issues involved with that. You are enabling this as a default option, for all developers, and almost no one will ever need it. At least, I am a developer to use it every day. :). I need debug various weird issues and tried many debug tools. Anyway, we would change the default to _disable_. So, if you really want this, don't enable this by default. Also, go back and read the old thread about this option and why it was rejeted previously. Thanks for your kind reminder. I googled it. Is the link like https://lkml.org/lkml/2012/8/3/121? Or http://lkml.indiana.edu/hypermail/linux/kernel/1208.0/02097.html? Above links raise a good question that we might be able to use ftrace (trace_printk) instead of to extend printk. 1) There are indeed lots of debug methods/tools in kernel because of many excellent developers' contributions. However, the most frequently-used tool is still printk. 2) ftrace (I use it often) is good when system mostly doesn't crash. If system crashes, we couldn't get ftrace info sometimes. When we debug some hard issues on a multi-core device, the system often crashes when Graphics driver accesses registers incorrectly. There is no chance for kernel to dump ftrace or panic info. With printk, we find that's because 2 cpu are racing. One cpu turns off Graphics while another cpu is accessing it. Of course, we also applied the Android persistent ram patch, so after rebooting, we still could see the printk buffer of previous booting. 3) Current drivers usually have many dev_dbg statements. We could enable dynamic debug control to get such info quickly. With the printk cpu number patch, we quickly enhance them to dump info about potential races among cpu. With ftrace, we need change all these dev_dbg to trace_printk and that's not the original objective of driver developers. Mostly, the patch is to reuse all current printk statements in kernel with a little overhead. Thanks for your kind comments. Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] hrtimer: Printing timer info when hitting BUG_ON()
On Tue, 2012-10-30 at 18:27 +0800, Chuansheng Liu wrote: > We encounted one BUG_ON() issue at function __run_hrtimer(), > but the panic info is not enough to find out which hrtimer > users use the hrtimer wrongly. > (in this BUG_ON case, it is callback running at the same time > hrtimer_start() is executed on different context.) > > We can print basic timer info before BUG(), it is easy to find > out which hrtimer user has fault, the output info is like below: > > <3>[ 41.118552] timer info: state/3, func/test_timer_cb > <3>[ 41.118618] timer stats: pid/786(sh), site/hrtimer_start > <0>[ 41.118670] [ cut here ] > <2>[ 41.118691] kernel BUG at > /root/r4_code/hardware/intel/linux-2.6/kernel/hrtimer.c:1260! > <0>[ 41.118715] invalid opcode: [#1] PREEMPT SMP > > The infos of callback/state are very helpful for debugging this BUG_ON(). It's a useful patch. Thomas, Could you merge it into your testing tree, so other developers could also benefit from it? Thanks all. Yanmin > > Signed-off-by: Li Fei > Signed-off-by: liu chuansheng > --- > kernel/hrtimer.c | 21 - > 1 files changed, 20 insertions(+), 1 deletions(-) > > diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c > index 6db7a5e..0618eaf 100644 > --- a/kernel/hrtimer.c > +++ b/kernel/hrtimer.c > @@ -1204,6 +1204,22 @@ int hrtimer_get_res(const clockid_t which_clock, > struct timespec *tp) > } > EXPORT_SYMBOL_GPL(hrtimer_get_res); > > +/* > + * dump_hrtimer_callinfo - print hrtimer information including: > + * state, callback function, pid and start_site. > +*/ > +static void dump_hrtimer_callinfo(struct hrtimer *timer) > +{ > + > + pr_err("timer info: state/%lx, func/%pf\n", > + timer->state, timer->function); > + > +#ifdef CONFIG_TIMER_STATS > + pr_err("timer stats: pid/%d(%s), site/%pf\n", > + timer->start_pid, timer->start_comm, timer->start_site); > +#endif > +} > + > static void __run_hrtimer(struct hrtimer *timer, ktime_t *now) > { > struct hrtimer_clock_base *base = timer->base; > @@ -1235,7 +1251,10 @@ static void __run_hrtimer(struct hrtimer *timer, > ktime_t *now) >* hrtimer_start_range_ns() or in hrtimer_interrupt() >*/ > if (restart != HRTIMER_NORESTART) { > - BUG_ON(timer->state != HRTIMER_STATE_CALLBACK); > + if (timer->state != HRTIMER_STATE_CALLBACK) { > + dump_hrtimer_callinfo(timer); > + BUG(); > + } > enqueue_hrtimer(timer, base); > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2] hrtimer: Printing timer info when hitting BUG_ON()
On Tue, 2012-10-30 at 18:27 +0800, Chuansheng Liu wrote: We encounted one BUG_ON() issue at function __run_hrtimer(), but the panic info is not enough to find out which hrtimer users use the hrtimer wrongly. (in this BUG_ON case, it is callback running at the same time hrtimer_start() is executed on different context.) We can print basic timer info before BUG(), it is easy to find out which hrtimer user has fault, the output info is like below: 3[ 41.118552] timer info: state/3, func/test_timer_cb 3[ 41.118618] timer stats: pid/786(sh), site/hrtimer_start 0[ 41.118670] [ cut here ] 2[ 41.118691] kernel BUG at /root/r4_code/hardware/intel/linux-2.6/kernel/hrtimer.c:1260! 0[ 41.118715] invalid opcode: [#1] PREEMPT SMP The infos of callback/state are very helpful for debugging this BUG_ON(). It's a useful patch. Thomas, Could you merge it into your testing tree, so other developers could also benefit from it? Thanks all. Yanmin Signed-off-by: Li Fei fei...@intel.com Signed-off-by: liu chuansheng chuansheng@intel.com --- kernel/hrtimer.c | 21 - 1 files changed, 20 insertions(+), 1 deletions(-) diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 6db7a5e..0618eaf 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -1204,6 +1204,22 @@ int hrtimer_get_res(const clockid_t which_clock, struct timespec *tp) } EXPORT_SYMBOL_GPL(hrtimer_get_res); +/* + * dump_hrtimer_callinfo - print hrtimer information including: + * state, callback function, pid and start_site. +*/ +static void dump_hrtimer_callinfo(struct hrtimer *timer) +{ + + pr_err(timer info: state/%lx, func/%pf\n, + timer-state, timer-function); + +#ifdef CONFIG_TIMER_STATS + pr_err(timer stats: pid/%d(%s), site/%pf\n, + timer-start_pid, timer-start_comm, timer-start_site); +#endif +} + static void __run_hrtimer(struct hrtimer *timer, ktime_t *now) { struct hrtimer_clock_base *base = timer-base; @@ -1235,7 +1251,10 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now) * hrtimer_start_range_ns() or in hrtimer_interrupt() */ if (restart != HRTIMER_NORESTART) { - BUG_ON(timer-state != HRTIMER_STATE_CALLBACK); + if (timer-state != HRTIMER_STATE_CALLBACK) { + dump_hrtimer_callinfo(timer); + BUG(); + } enqueue_hrtimer(timer, base); } -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] hrtimer:__run_hrtimer races with enqueue_hrtimer
On Fri, 2012-10-26 at 14:09 +0200, Thomas Gleixner wrote: > On Fri, 26 Oct 2012, Zhang, Yanmin wrote: > > >From: Thomas Gleixner [mailto:t...@linutronix.de] > > >Your code is returning HRTIMER_RESTART from the timer callback and at > > >the same time it starts the timer from some other context. That's what > > >needs to be fixed. > > > > The timer user should fix it. But could we also change hrtimer to > > make it more stable? At least, instead of panic, could we print > > some information and go ahead to let kernel continue? > > That's unfortunately not possible. At this point the timer might be > already corrupted. > > CPU0 CPU 1 > > timer expires > callback runs > hrtimer_start() > expiry value is set > hrtimer_enqueue() > >hrtimer_forward() > expiry value is set > > return HRTIMER_RESTART > > So while we can prevent the double enqueue, we have no way to deal > with the corrupted expiry value and the inconsistent RB tree. We can > give better debugging information, but we can't pretend that > everything is nice and cool. > > If we really want to do something about it which keeps the machine > alive, then we need to > >1) dequeue the timer >2) run a consistency check over the rbtree >3) enqueue the timer > > Not sure if that's worth the trouble. I strongly agree with you. Such checking might cause more trouble than what we could benefit from it. Could we add more dumping, especially about the timer address and callback function? Next time, developers could find the bad timer quickly. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] hrtimer:__run_hrtimer races with enqueue_hrtimer
On Fri, 2012-10-26 at 14:09 +0200, Thomas Gleixner wrote: On Fri, 26 Oct 2012, Zhang, Yanmin wrote: From: Thomas Gleixner [mailto:t...@linutronix.de] Your code is returning HRTIMER_RESTART from the timer callback and at the same time it starts the timer from some other context. That's what needs to be fixed. The timer user should fix it. But could we also change hrtimer to make it more stable? At least, instead of panic, could we print some information and go ahead to let kernel continue? That's unfortunately not possible. At this point the timer might be already corrupted. CPU0 CPU 1 timer expires callback runs hrtimer_start() expiry value is set hrtimer_enqueue() hrtimer_forward() expiry value is set return HRTIMER_RESTART So while we can prevent the double enqueue, we have no way to deal with the corrupted expiry value and the inconsistent RB tree. We can give better debugging information, but we can't pretend that everything is nice and cool. If we really want to do something about it which keeps the machine alive, then we need to 1) dequeue the timer 2) run a consistency check over the rbtree 3) enqueue the timer Not sure if that's worth the trouble. I strongly agree with you. Such checking might cause more trouble than what we could benefit from it. Could we add more dumping, especially about the timer address and callback function? Next time, developers could find the bad timer quickly. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2] hrtimer:__run_hrtimer races with enqueue_hrtimer
V2 adds a checking HRTIMER_STATE_ENQUEUED. If HRTIMER_STATE_ENQUEUED is set, We don't call enqueue_hrtimer. It should be rare that HRTIMER_STATE_ENQUEUED is set. --- We hit a kernel panic at __run_hrtimer=>BUG_ON(timer->state != HRTIMER_STATE_CALLBACK). <2>[ 10.226053, 3] kernel BUG at /home/android/xiaobing/ymz/r4/hardware/intel/linux-2.6/kernel/hrtimer.c:1228! <0>[ 10.235682, 3] invalid opcode: [#1] PREEMPT SMP <4>[ 10.240716, 3] Modules linked in: wl12xx_sdio wl12xx mac80211 cfg80211 compat btwilink rmi4(C) fmdrv_chr st_drv matrix(C) <4>[ 10.251651, 3] <4>[ 10.253391, 3] Pid: 68, comm: kworker/3:4 Tainted: GWC 3.0.34-140430-g2af538d #45 Intel Corporation CloverTrail/FFRD <4>[ 10.264674, 3] EIP: 0060:[] EFLAGS: 00010002 CPU: 3 <4>[ 10.270411, 3] EIP is at __run_hrtimer+0xbd/0x240 <4>[ 10.275091, 3] EAX: 0001 EBX: f67fb6b8 ECX: f57b4000 EDX: 7301 <4>[ 10.281602, 3] ESI: c1d614c0 EDI: f67fb680 EBP: f57b5dd8 ESP: f57b5da8 <4>[ 10.288113, 3] DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 <0>[ 10.293754, 3] Process kworker/3:4 (pid: 68, ti=f57b4000 task=f57aa730 task.ti=f57b4000) <0>[ 10.301827, 3] Stack: <4>[ 10.304083, 3] c1afef40 f57b5dd8 c167a6e0 f67fb680 20b366e3 f67fb6b8 f57b5e14 <4>[ 10.312069, 3] 0001 f67fb6b8 0001 f67fb680 f57b5e28 c126d1e5 f57b5e08 c126f325 <4>[ 10.320055, 3] 86b9868d 0001 86b9868d 0001 0003 7fff <0>[ 10.328041, 3] Call Trace: <4>[ 10.330742, 3] [] ? gburst_thread_stop.isra.25+0x40/0x40 <4>[ 10.336988, 3] [] hrtimer_interrupt+0xd5/0x250 <4>[ 10.342368, 3] [] ? sched_clock_cpu+0xe5/0x150 <4>[ 10.347753, 3] [] smp_apic_timer_interrupt+0x54/0x88 <4>[ 10.353654, 3] [] ? trace_hardirqs_off_thunk+0xc/0x14 <4>[ 10.359643, 3] [] apic_timer_interrupt+0x2f/0x34 <4>[ 10.365199, 3] [] ? sub_preempt_count+0x1f/0x50 <4>[ 10.370669, 3] [] delay_tsc+0x3a/0xc0 <6>[ 10.371589, 0] android_work: did not send uevent (0 0 (null)) <4>[ 10.381171, 3] [] __const_udelay+0x23/0x30 <4>[ 10.386207, 3] [] mdfld_dsi_send_dcs+0x12a/0x5d0 <4>[ 10.391760, 3] [] ? _raw_spin_unlock_irqrestore+0x26/0x50 <4>[ 10.398101, 3] [] ? ospm_power_using_hw_begin+0xa1/0x350 <4>[ 10.399053, 3] [] ? __mutex_lock_slowpath+0x1ff/0x2f0 <4>[ 10.399069, 3] [] mdfld_dbi_update_panel+0x21e/0x2d0 <4>[ 10.399085, 3] [] mdfld_te_handler_work+0x71/0x80 <4>[ 10.399099, 3] [] process_one_work+0xfe/0x3f0 <4>[ 10.399114, 3] [] ? mdfld_async_flip_te_handler+0xf0/0xf0 Basically, __run_hrtimer has a race with enqueue_hrtimer. When __run_hrtimer calls the timer callback fn, another thread might call enqueue_hrtimer or hrtimer_start to requeue it, and the timer->state is equal to HRTIMER_STATE_CALLBACK|HRTIMER_STATE_ENQUEUED, which causes the BUG_ON(timer->state != HRTIMER_STATE_CALLBACK) checking fails. The patch fixes it by checking only bit HRTIMER_STATE_CALLBACK and also bypass enqueue_hrtimer when the timer is queued. Signed-off-by: Yanmin Zhang Signed-off-by: He Bo --- kernel/hrtimer.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 6db7a5e..535cdbb 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -1235,8 +1235,9 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now) * hrtimer_start_range_ns() or in hrtimer_interrupt() */ if (restart != HRTIMER_NORESTART) { - BUG_ON(timer->state != HRTIMER_STATE_CALLBACK); - enqueue_hrtimer(timer, base); + BUG_ON(!(timer->state & HRTIMER_STATE_CALLBACK)); + if (likely(!(timer->state & HRTIMER_STATE_ENQUEUED))) + enqueue_hrtimer(timer, base); } WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK)); -- 1.7.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] hrtimer:__run_hrtimer races with enqueue_hrtimer
On Fri, 2012-10-26 at 10:51 +0800, he, bo wrote: > From: Yanmin Zhang > > We hit a kernel panic at __run_hrtimer=>BUG_ON(timer->state != > HRTIMER_STATE_CALLBACK). > <2>[ 10.226053, 3] kernel BUG at > /home/android/xiaobing/ymz/r4/hardware/intel/linux-2.6/kernel/hrtimer.c:1228! > <0>[ 10.235682, 3] invalid opcode: [#1] PREEMPT SMP > <4>[ 10.240716, 3] Modules linked in: wl12xx_sdio wl12xx mac80211 cfg80211 > compat btwilink rmi4(C) fmdrv_chr st_drv matrix(C) > <4>[ 10.251651, 3] > <4>[ 10.253391, 3] Pid: 68, comm: kworker/3:4 Tainted: GWC > 3.0.34-140430-g2af538d #45 Intel Corporation CloverTrail/FFRD > <4>[ 10.264674, 3] EIP: 0060:[] EFLAGS: 00010002 CPU: 3 > <4>[ 10.270411, 3] EIP is at __run_hrtimer+0xbd/0x240 > <4>[ 10.275091, 3] EAX: 0001 EBX: f67fb6b8 ECX: f57b4000 EDX: 7301 > <4>[ 10.281602, 3] ESI: c1d614c0 EDI: f67fb680 EBP: f57b5dd8 ESP: f57b5da8 > <4>[ 10.288113, 3] DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 > <0>[ 10.293754, 3] Process kworker/3:4 (pid: 68, ti=f57b4000 task=f57aa730 > task.ti=f57b4000) > <0>[ 10.301827, 3] Stack: > <4>[ 10.304083, 3] c1afef40 f57b5dd8 c167a6e0 f67fb680 20b366e3 > f67fb6b8 f57b5e14 > <4>[ 10.312069, 3] 0001 f67fb6b8 0001 f67fb680 f57b5e28 c126d1e5 > f57b5e08 c126f325 > <4>[ 10.320055, 3] 86b9868d 0001 86b9868d 0001 0003 > 7fff > <0>[ 10.328041, 3] Call Trace: > <4>[ 10.330742, 3] [] ? gburst_thread_stop.isra.25+0x40/0x40 > <4>[ 10.336988, 3] [] hrtimer_interrupt+0xd5/0x250 > <4>[ 10.342368, 3] [] ? sched_clock_cpu+0xe5/0x150 > <4>[ 10.347753, 3] [] smp_apic_timer_interrupt+0x54/0x88 > <4>[ 10.353654, 3] [] ? trace_hardirqs_off_thunk+0xc/0x14 > <4>[ 10.359643, 3] [] apic_timer_interrupt+0x2f/0x34 > <4>[ 10.365199, 3] [] ? sub_preempt_count+0x1f/0x50 > <4>[ 10.370669, 3] [] delay_tsc+0x3a/0xc0 > <6>[ 10.371589, 0] android_work: did not send uevent (0 0 (null)) > <4>[ 10.381171, 3] [] __const_udelay+0x23/0x30 > <4>[ 10.386207, 3] [] mdfld_dsi_send_dcs+0x12a/0x5d0 > <4>[ 10.391760, 3] [] ? _raw_spin_unlock_irqrestore+0x26/0x50 > <4>[ 10.398101, 3] [] ? ospm_power_using_hw_begin+0xa1/0x350 > <4>[ 10.399053, 3] [] ? __mutex_lock_slowpath+0x1ff/0x2f0 > <4>[ 10.399069, 3] [] mdfld_dbi_update_panel+0x21e/0x2d0 > <4>[ 10.399085, 3] [] mdfld_te_handler_work+0x71/0x80 > <4>[ 10.399099, 3] [] process_one_work+0xfe/0x3f0 > <4>[ 10.399114, 3] [] ? mdfld_async_flip_te_handler+0xf0/0xf0 > > Basically, __run_hrtimer has a race with enqueue_hrtimer. When __run_hrtimer > calls > the timer callback fn, another thread might call enqueue_hrtimer or > hrtimer_start > to requeue it, and the timer->state is equal to > HRTIMER_STATE_CALLBACK|HRTIMER_STATE_ENQUEUED, > which causes the BUG_ON(timer->state != HRTIMER_STATE_CALLBACK) checking > fails. > > The patch fixes it by checking only bit HRTIMER_STATE_CALLBACK. The patch has an issue that enqueue_hrtimer doesn't check if the timer is queued. I will send a new patch. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] hrtimer:__run_hrtimer races with enqueue_hrtimer
On Fri, 2012-10-26 at 10:51 +0800, he, bo wrote: From: Yanmin Zhang yanmin.zh...@intel.com We hit a kernel panic at __run_hrtimer=BUG_ON(timer-state != HRTIMER_STATE_CALLBACK). 2[ 10.226053, 3] kernel BUG at /home/android/xiaobing/ymz/r4/hardware/intel/linux-2.6/kernel/hrtimer.c:1228! 0[ 10.235682, 3] invalid opcode: [#1] PREEMPT SMP 4[ 10.240716, 3] Modules linked in: wl12xx_sdio wl12xx mac80211 cfg80211 compat btwilink rmi4(C) fmdrv_chr st_drv matrix(C) 4[ 10.251651, 3] 4[ 10.253391, 3] Pid: 68, comm: kworker/3:4 Tainted: GWC 3.0.34-140430-g2af538d #45 Intel Corporation CloverTrail/FFRD 4[ 10.264674, 3] EIP: 0060:[c126c7ed] EFLAGS: 00010002 CPU: 3 4[ 10.270411, 3] EIP is at __run_hrtimer+0xbd/0x240 4[ 10.275091, 3] EAX: 0001 EBX: f67fb6b8 ECX: f57b4000 EDX: 7301 4[ 10.281602, 3] ESI: c1d614c0 EDI: f67fb680 EBP: f57b5dd8 ESP: f57b5da8 4[ 10.288113, 3] DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 0[ 10.293754, 3] Process kworker/3:4 (pid: 68, ti=f57b4000 task=f57aa730 task.ti=f57b4000) 0[ 10.301827, 3] Stack: 4[ 10.304083, 3] c1afef40 f57b5dd8 c167a6e0 f67fb680 20b366e3 f67fb6b8 f57b5e14 4[ 10.312069, 3] 0001 f67fb6b8 0001 f67fb680 f57b5e28 c126d1e5 f57b5e08 c126f325 4[ 10.320055, 3] 86b9868d 0001 86b9868d 0001 0003 7fff 0[ 10.328041, 3] Call Trace: 4[ 10.330742, 3] [c167a6e0] ? gburst_thread_stop.isra.25+0x40/0x40 4[ 10.336988, 3] [c126d1e5] hrtimer_interrupt+0xd5/0x250 4[ 10.342368, 3] [c126f325] ? sched_clock_cpu+0xe5/0x150 4[ 10.347753, 3] [c1871d44] smp_apic_timer_interrupt+0x54/0x88 4[ 10.353654, 3] [c1496558] ? trace_hardirqs_off_thunk+0xc/0x14 4[ 10.359643, 3] [c186be9f] apic_timer_interrupt+0x2f/0x34 4[ 10.365199, 3] [c186e60f] ? sub_preempt_count+0x1f/0x50 4[ 10.370669, 3] [c149558a] delay_tsc+0x3a/0xc0 6[ 10.371589, 0] android_work: did not send uevent (0 0 (null)) 4[ 10.381171, 3] [c14954e3] __const_udelay+0x23/0x30 4[ 10.386207, 3] [c16d043a] mdfld_dsi_send_dcs+0x12a/0x5d0 4[ 10.391760, 3] [c186b6c6] ? _raw_spin_unlock_irqrestore+0x26/0x50 4[ 10.398101, 3] [c16af431] ? ospm_power_using_hw_begin+0xa1/0x350 4[ 10.399053, 3] [c186a49f] ? __mutex_lock_slowpath+0x1ff/0x2f0 4[ 10.399069, 3] [c16bd59e] mdfld_dbi_update_panel+0x21e/0x2d0 4[ 10.399085, 3] [c16b1ae1] mdfld_te_handler_work+0x71/0x80 4[ 10.399099, 3] [c12642be] process_one_work+0xfe/0x3f0 4[ 10.399114, 3] [c16b1a70] ? mdfld_async_flip_te_handler+0xf0/0xf0 Basically, __run_hrtimer has a race with enqueue_hrtimer. When __run_hrtimer calls the timer callback fn, another thread might call enqueue_hrtimer or hrtimer_start to requeue it, and the timer-state is equal to HRTIMER_STATE_CALLBACK|HRTIMER_STATE_ENQUEUED, which causes the BUG_ON(timer-state != HRTIMER_STATE_CALLBACK) checking fails. The patch fixes it by checking only bit HRTIMER_STATE_CALLBACK. The patch has an issue that enqueue_hrtimer doesn't check if the timer is queued. I will send a new patch. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2] hrtimer:__run_hrtimer races with enqueue_hrtimer
V2 adds a checking HRTIMER_STATE_ENQUEUED. If HRTIMER_STATE_ENQUEUED is set, We don't call enqueue_hrtimer. It should be rare that HRTIMER_STATE_ENQUEUED is set. --- We hit a kernel panic at __run_hrtimer=BUG_ON(timer-state != HRTIMER_STATE_CALLBACK). 2[ 10.226053, 3] kernel BUG at /home/android/xiaobing/ymz/r4/hardware/intel/linux-2.6/kernel/hrtimer.c:1228! 0[ 10.235682, 3] invalid opcode: [#1] PREEMPT SMP 4[ 10.240716, 3] Modules linked in: wl12xx_sdio wl12xx mac80211 cfg80211 compat btwilink rmi4(C) fmdrv_chr st_drv matrix(C) 4[ 10.251651, 3] 4[ 10.253391, 3] Pid: 68, comm: kworker/3:4 Tainted: GWC 3.0.34-140430-g2af538d #45 Intel Corporation CloverTrail/FFRD 4[ 10.264674, 3] EIP: 0060:[c126c7ed] EFLAGS: 00010002 CPU: 3 4[ 10.270411, 3] EIP is at __run_hrtimer+0xbd/0x240 4[ 10.275091, 3] EAX: 0001 EBX: f67fb6b8 ECX: f57b4000 EDX: 7301 4[ 10.281602, 3] ESI: c1d614c0 EDI: f67fb680 EBP: f57b5dd8 ESP: f57b5da8 4[ 10.288113, 3] DS: 007b ES: 007b FS: 00d8 GS: SS: 0068 0[ 10.293754, 3] Process kworker/3:4 (pid: 68, ti=f57b4000 task=f57aa730 task.ti=f57b4000) 0[ 10.301827, 3] Stack: 4[ 10.304083, 3] c1afef40 f57b5dd8 c167a6e0 f67fb680 20b366e3 f67fb6b8 f57b5e14 4[ 10.312069, 3] 0001 f67fb6b8 0001 f67fb680 f57b5e28 c126d1e5 f57b5e08 c126f325 4[ 10.320055, 3] 86b9868d 0001 86b9868d 0001 0003 7fff 0[ 10.328041, 3] Call Trace: 4[ 10.330742, 3] [c167a6e0] ? gburst_thread_stop.isra.25+0x40/0x40 4[ 10.336988, 3] [c126d1e5] hrtimer_interrupt+0xd5/0x250 4[ 10.342368, 3] [c126f325] ? sched_clock_cpu+0xe5/0x150 4[ 10.347753, 3] [c1871d44] smp_apic_timer_interrupt+0x54/0x88 4[ 10.353654, 3] [c1496558] ? trace_hardirqs_off_thunk+0xc/0x14 4[ 10.359643, 3] [c186be9f] apic_timer_interrupt+0x2f/0x34 4[ 10.365199, 3] [c186e60f] ? sub_preempt_count+0x1f/0x50 4[ 10.370669, 3] [c149558a] delay_tsc+0x3a/0xc0 6[ 10.371589, 0] android_work: did not send uevent (0 0 (null)) 4[ 10.381171, 3] [c14954e3] __const_udelay+0x23/0x30 4[ 10.386207, 3] [c16d043a] mdfld_dsi_send_dcs+0x12a/0x5d0 4[ 10.391760, 3] [c186b6c6] ? _raw_spin_unlock_irqrestore+0x26/0x50 4[ 10.398101, 3] [c16af431] ? ospm_power_using_hw_begin+0xa1/0x350 4[ 10.399053, 3] [c186a49f] ? __mutex_lock_slowpath+0x1ff/0x2f0 4[ 10.399069, 3] [c16bd59e] mdfld_dbi_update_panel+0x21e/0x2d0 4[ 10.399085, 3] [c16b1ae1] mdfld_te_handler_work+0x71/0x80 4[ 10.399099, 3] [c12642be] process_one_work+0xfe/0x3f0 4[ 10.399114, 3] [c16b1a70] ? mdfld_async_flip_te_handler+0xf0/0xf0 Basically, __run_hrtimer has a race with enqueue_hrtimer. When __run_hrtimer calls the timer callback fn, another thread might call enqueue_hrtimer or hrtimer_start to requeue it, and the timer-state is equal to HRTIMER_STATE_CALLBACK|HRTIMER_STATE_ENQUEUED, which causes the BUG_ON(timer-state != HRTIMER_STATE_CALLBACK) checking fails. The patch fixes it by checking only bit HRTIMER_STATE_CALLBACK and also bypass enqueue_hrtimer when the timer is queued. Signed-off-by: Yanmin Zhang yanmin_zh...@linux.intel.com Signed-off-by: He Bo bo...@intel.com --- kernel/hrtimer.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index 6db7a5e..535cdbb 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -1235,8 +1235,9 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now) * hrtimer_start_range_ns() or in hrtimer_interrupt() */ if (restart != HRTIMER_NORESTART) { - BUG_ON(timer-state != HRTIMER_STATE_CALLBACK); - enqueue_hrtimer(timer, base); + BUG_ON(!(timer-state HRTIMER_STATE_CALLBACK)); + if (likely(!(timer-state HRTIMER_STATE_ENQUEUED))) + enqueue_hrtimer(timer, base); } WARN_ON_ONCE(!(timer-state HRTIMER_STATE_CALLBACK)); -- 1.7.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] drivers-core: move the calling to device_pm_remove behind the calling to bus_remove_device
On Thu, 2012-10-25 at 00:33 +0200, Rafael J. Wysocki wrote: > On Wednesday 24 of October 2012 15:13:27 Greg Kroah-Hartman wrote: > > On Wed, Oct 24, 2012 at 09:46:19PM +0200, Rafael J. Wysocki wrote: > > > On Tuesday 16 of October 2012 23:28:08 zhanglong wrote: > > > > We hit an hang issue when removing a mmc device on Medfield Android > > > > phone by sysfs interface. > > > > > > > > device_pm_remove will call pm_runtime_remove which would disable > > > > runtime PM of the device. After that pm_runtime_get* or > > > > pm_runtime_put* will be ignored. So if we disable the runtime PM > > > > before device really be removed, drivers' _remove callback may > > > > access HW even pm_runtime_get* fails. That is bad. > > > > > > > > Consider below call sequence when removing a device: > > > > device_del => device_pm_remove > > > > => class_intf->remove_dev(dev, class_intf) => > > > > pm_runtime_get_sync/put_sync > > > > => bus_remove_device => device_release_driver => > > > > pm_runtime_get_sync/put_sync > > > > > > > > remove_dev might call pm_runtime_get_sync/put_sync. > > > > Then, generic device_release_driver also calls > > > > pm_runtime_get_sync/put_sync. > > > > Since device_del => device_pm_remove firstly, later _get_sync wouldn't > > > > really wake up the device. > > > > > > > > I git log -p to find the patch which moves the calling to > > > > device_pm_remove ahead. > > > > It's below patch: > > > > > > > > commit 775b64d2b6ca37697de925f70799c710aab5849a > > > > Author: Rafael J. Wysocki > > > > Date: Sat Jan 12 20:40:46 2008 +0100 > > > > > > > > PM: Acquire device locks on suspend > > > > > > > > This patch reorganizes the way suspend and resume notifications are > > > > sent to drivers. The major changes are that now the PM core > > > > acquires > > > > every device semaphore before calling the methods, and calls to > > > > device_add() during suspends will fail, while calls to device_del() > > > > during suspends will block. > > > > > > > > It also provides a way to safely remove a suspended device with the > > > > help of the PM core, by using the device_pm_schedule_removal() > > > > callback > > > > introduced specifically for this purpose, and updates two drivers > > > > (msr > > > > and cpuid) that need to use it. > > > > > > > > > > > > As device_pm_schedule_removal is deleted by another patch, we need also > > > > revert other parts of the patch, > > > > i.e. move the calling of device_pm_remove after the calling to > > > > bus_remove_device. > > > > > > > > Signed-off-by: LongX Zhang > > > > > > Greg, do you see any potential problems with this patch? > > > > No, no objection from me: > > > > Acked-by: Greg Kroah-Hartman > > OK, thanks! > > Applied to linux-pm.git/linux-next as v3.8 material. Rafael, As usual, I really appreciate it! Yanmin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] drivers-core: move the calling to device_pm_remove behind the calling to bus_remove_device
On Thu, 2012-10-25 at 00:33 +0200, Rafael J. Wysocki wrote: On Wednesday 24 of October 2012 15:13:27 Greg Kroah-Hartman wrote: On Wed, Oct 24, 2012 at 09:46:19PM +0200, Rafael J. Wysocki wrote: On Tuesday 16 of October 2012 23:28:08 zhanglong wrote: We hit an hang issue when removing a mmc device on Medfield Android phone by sysfs interface. device_pm_remove will call pm_runtime_remove which would disable runtime PM of the device. After that pm_runtime_get* or pm_runtime_put* will be ignored. So if we disable the runtime PM before device really be removed, drivers' _remove callback may access HW even pm_runtime_get* fails. That is bad. Consider below call sequence when removing a device: device_del = device_pm_remove = class_intf-remove_dev(dev, class_intf) = pm_runtime_get_sync/put_sync = bus_remove_device = device_release_driver = pm_runtime_get_sync/put_sync remove_dev might call pm_runtime_get_sync/put_sync. Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. Since device_del = device_pm_remove firstly, later _get_sync wouldn't really wake up the device. I git log -p to find the patch which moves the calling to device_pm_remove ahead. It's below patch: commit 775b64d2b6ca37697de925f70799c710aab5849a Author: Rafael J. Wysocki r...@sisk.pl Date: Sat Jan 12 20:40:46 2008 +0100 PM: Acquire device locks on suspend This patch reorganizes the way suspend and resume notifications are sent to drivers. The major changes are that now the PM core acquires every device semaphore before calling the methods, and calls to device_add() during suspends will fail, while calls to device_del() during suspends will block. It also provides a way to safely remove a suspended device with the help of the PM core, by using the device_pm_schedule_removal() callback introduced specifically for this purpose, and updates two drivers (msr and cpuid) that need to use it. As device_pm_schedule_removal is deleted by another patch, we need also revert other parts of the patch, i.e. move the calling of device_pm_remove after the calling to bus_remove_device. Signed-off-by: LongX Zhang longx.zh...@intel.com Greg, do you see any potential problems with this patch? No, no objection from me: Acked-by: Greg Kroah-Hartman gre...@linuxfoundation.org OK, thanks! Applied to linux-pm.git/linux-next as v3.8 material. Rafael, As usual, I really appreciate it! Yanmin -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] drivers-core: move the calling to device_pm_remove behind the calling to bus_remove_device
On Tue, 2012-10-16 at 23:28 +0800, zhanglong wrote: > We hit an hang issue when removing a mmc device on Medfield Android phone by > sysfs interface. > > device_pm_remove will call pm_runtime_remove which would disable > runtime PM of the device. After that pm_runtime_get* or > pm_runtime_put* will be ignored. So if we disable the runtime PM > before device really be removed, drivers' _remove callback may > access HW even pm_runtime_get* fails. That is bad. > > Consider below call sequence when removing a device: > device_del => device_pm_remove > => class_intf->remove_dev(dev, class_intf) => > pm_runtime_get_sync/put_sync > => bus_remove_device => device_release_driver => > pm_runtime_get_sync/put_sync > > remove_dev might call pm_runtime_get_sync/put_sync. > Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. > Since device_del => device_pm_remove firstly, later _get_sync wouldn't really > wake up the device. > > I git log -p to find the patch which moves the calling to device_pm_remove > ahead. > It's below patch: > > commit 775b64d2b6ca37697de925f70799c710aab5849a > Author: Rafael J. Wysocki > Date: Sat Jan 12 20:40:46 2008 +0100 > > PM: Acquire device locks on suspend > > This patch reorganizes the way suspend and resume notifications are > sent to drivers. The major changes are that now the PM core acquires > every device semaphore before calling the methods, and calls to > device_add() during suspends will fail, while calls to device_del() > during suspends will block. > > It also provides a way to safely remove a suspended device with the > help of the PM core, by using the device_pm_schedule_removal() callback > introduced specifically for this purpose, and updates two drivers (msr > and cpuid) that need to use it. > > > As device_pm_schedule_removal is deleted by another patch, we need also > revert other parts of the patch, > i.e. move the calling of device_pm_remove after the calling to > bus_remove_device. > > Signed-off-by: LongX Zhang Rafael, With the patch V2, is there anything else we could improve before it can be merged into your testing tree? Thanks, Yanmin > --- > drivers/base/core.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/base/core.c b/drivers/base/core.c > index abea76c..150a415 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -1180,7 +1180,6 @@ void device_del(struct device *dev) > if (dev->bus) > blocking_notifier_call_chain(>bus->p->bus_notifier, > BUS_NOTIFY_DEL_DEVICE, dev); > - device_pm_remove(dev); > dpm_sysfs_remove(dev); > if (parent) > klist_del(>p->knode_parent); > @@ -1205,6 +1204,7 @@ void device_del(struct device *dev) > device_remove_file(dev, _attr); > device_remove_attrs(dev); > bus_remove_device(dev); > + device_pm_remove(dev); > driver_deferred_probe_del(dev); > > /* Notify the platform of the removal, in case they -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] drivers-core: move the calling to device_pm_remove behind the calling to bus_remove_device
On Tue, 2012-10-16 at 23:28 +0800, zhanglong wrote: We hit an hang issue when removing a mmc device on Medfield Android phone by sysfs interface. device_pm_remove will call pm_runtime_remove which would disable runtime PM of the device. After that pm_runtime_get* or pm_runtime_put* will be ignored. So if we disable the runtime PM before device really be removed, drivers' _remove callback may access HW even pm_runtime_get* fails. That is bad. Consider below call sequence when removing a device: device_del = device_pm_remove = class_intf-remove_dev(dev, class_intf) = pm_runtime_get_sync/put_sync = bus_remove_device = device_release_driver = pm_runtime_get_sync/put_sync remove_dev might call pm_runtime_get_sync/put_sync. Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. Since device_del = device_pm_remove firstly, later _get_sync wouldn't really wake up the device. I git log -p to find the patch which moves the calling to device_pm_remove ahead. It's below patch: commit 775b64d2b6ca37697de925f70799c710aab5849a Author: Rafael J. Wysocki r...@sisk.pl Date: Sat Jan 12 20:40:46 2008 +0100 PM: Acquire device locks on suspend This patch reorganizes the way suspend and resume notifications are sent to drivers. The major changes are that now the PM core acquires every device semaphore before calling the methods, and calls to device_add() during suspends will fail, while calls to device_del() during suspends will block. It also provides a way to safely remove a suspended device with the help of the PM core, by using the device_pm_schedule_removal() callback introduced specifically for this purpose, and updates two drivers (msr and cpuid) that need to use it. As device_pm_schedule_removal is deleted by another patch, we need also revert other parts of the patch, i.e. move the calling of device_pm_remove after the calling to bus_remove_device. Signed-off-by: LongX Zhang longx.zh...@intel.com Rafael, With the patch V2, is there anything else we could improve before it can be merged into your testing tree? Thanks, Yanmin --- drivers/base/core.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index abea76c..150a415 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1180,7 +1180,6 @@ void device_del(struct device *dev) if (dev-bus) blocking_notifier_call_chain(dev-bus-p-bus_notifier, BUS_NOTIFY_DEL_DEVICE, dev); - device_pm_remove(dev); dpm_sysfs_remove(dev); if (parent) klist_del(dev-p-knode_parent); @@ -1205,6 +1204,7 @@ void device_del(struct device *dev) device_remove_file(dev, uevent_attr); device_remove_attrs(dev); bus_remove_device(dev); + device_pm_remove(dev); driver_deferred_probe_del(dev); /* Notify the platform of the removal, in case they -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject: [PATCH] drivers-core: move device_pm_remove behind bus_remove_device
On Mon, 2012-10-15 at 22:59 +0200, Rafael J. Wysocki wrote: > On Monday 15 of October 2012 15:39:49 Yanmin Zhang wrote: > > On Fri, 2012-09-21 at 01:58 +, Zhang, LongX wrote: > > > From: LongX Zhang > > > > > > device_pm_remove will call pm_runtime_remove which would disable > > > runtime PM of the device. After that pm_runtime_get* or > > > pm_runtime_put* will be ingored. So if we disable the runtime PM > > > before device really be removed, drivers' _remove callback may > > > access HW even pm_runtime_get* fails. That is bad. > > The background about the patch: We hit an hang issue when removing a mmc > > device on Medfield Android phone by sysfs interface. > > > > Consider below call sequence when removing a device: > > device_del => device_pm_remove > >=> class_intf->remove_dev(dev, class_intf) => > > pm_runtime_get_sync/put_sync > >=> bus_remove_device => device_release_driver => > > pm_runtime_get_sync/put_sync > > > > remove_dev might call pm_runtime_get_sync/put_sync. > > Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. > > Since device_del => device_pm_remove firstly, later _get_sync wouldn't > > really > > wake up the device. > > > > I git log -p to find the patch which moves the calling to device_pm_remove > > ahead. > > It's below patch: > > > > commit 775b64d2b6ca37697de925f70799c710aab5849a > > Author: Rafael J. Wysocki > > Date: Sat Jan 12 20:40:46 2008 +0100 > > > > PM: Acquire device locks on suspend > > > > This patch reorganizes the way suspend and resume notifications are > > sent to drivers. The major changes are that now the PM core acquires > > every device semaphore before calling the methods, and calls to > > device_add() during suspends will fail, while calls to device_del() > > during suspends will block. > > > > It also provides a way to safely remove a suspended device with the > > help of the PM core, by using the device_pm_schedule_removal() callback > > introduced specifically for this purpose, and updates two drivers (msr > > and cpuid) that need to use it. > > > > > > As device_pm_schedule_removal is deleted by another patch, we need also > > revert > > other parts of the patch, i.e. move the calling of device_pm_remove after > > the calling to bus_remove_device. > > > > > > > > Signed-off-by: LongX Zhang > > Can you please resend the patch? I lost the track of it unfortunately. Long sent V2 out. See https://lkml.org/lkml/2012/10/16/594. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject: [PATCH] drivers-core: move device_pm_remove behind bus_remove_device
On Mon, 2012-10-15 at 22:59 +0200, Rafael J. Wysocki wrote: On Monday 15 of October 2012 15:39:49 Yanmin Zhang wrote: On Fri, 2012-09-21 at 01:58 +, Zhang, LongX wrote: From: LongX Zhang longx.zh...@intel.com device_pm_remove will call pm_runtime_remove which would disable runtime PM of the device. After that pm_runtime_get* or pm_runtime_put* will be ingored. So if we disable the runtime PM before device really be removed, drivers' _remove callback may access HW even pm_runtime_get* fails. That is bad. The background about the patch: We hit an hang issue when removing a mmc device on Medfield Android phone by sysfs interface. Consider below call sequence when removing a device: device_del = device_pm_remove = class_intf-remove_dev(dev, class_intf) = pm_runtime_get_sync/put_sync = bus_remove_device = device_release_driver = pm_runtime_get_sync/put_sync remove_dev might call pm_runtime_get_sync/put_sync. Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. Since device_del = device_pm_remove firstly, later _get_sync wouldn't really wake up the device. I git log -p to find the patch which moves the calling to device_pm_remove ahead. It's below patch: commit 775b64d2b6ca37697de925f70799c710aab5849a Author: Rafael J. Wysocki r...@sisk.pl Date: Sat Jan 12 20:40:46 2008 +0100 PM: Acquire device locks on suspend This patch reorganizes the way suspend and resume notifications are sent to drivers. The major changes are that now the PM core acquires every device semaphore before calling the methods, and calls to device_add() during suspends will fail, while calls to device_del() during suspends will block. It also provides a way to safely remove a suspended device with the help of the PM core, by using the device_pm_schedule_removal() callback introduced specifically for this purpose, and updates two drivers (msr and cpuid) that need to use it. As device_pm_schedule_removal is deleted by another patch, we need also revert other parts of the patch, i.e. move the calling of device_pm_remove after the calling to bus_remove_device. Signed-off-by: LongX Zhang longx.zh...@intel.com Can you please resend the patch? I lost the track of it unfortunately. Long sent V2 out. See https://lkml.org/lkml/2012/10/16/594. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject: [PATCH] drivers-core: move device_pm_remove behind bus_remove_device
On Fri, 2012-09-21 at 01:58 +, Zhang, LongX wrote: > From: LongX Zhang > > device_pm_remove will call pm_runtime_remove which would disable > runtime PM of the device. After that pm_runtime_get* or > pm_runtime_put* will be ingored. So if we disable the runtime PM > before device really be removed, drivers' _remove callback may > access HW even pm_runtime_get* fails. That is bad. The background about the patch: We hit an hang issue when removing a mmc device on Medfield Android phone by sysfs interface. Consider below call sequence when removing a device: device_del => device_pm_remove => class_intf->remove_dev(dev, class_intf) => pm_runtime_get_sync/put_sync => bus_remove_device => device_release_driver => pm_runtime_get_sync/put_sync remove_dev might call pm_runtime_get_sync/put_sync. Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. Since device_del => device_pm_remove firstly, later _get_sync wouldn't really wake up the device. I git log -p to find the patch which moves the calling to device_pm_remove ahead. It's below patch: commit 775b64d2b6ca37697de925f70799c710aab5849a Author: Rafael J. Wysocki Date: Sat Jan 12 20:40:46 2008 +0100 PM: Acquire device locks on suspend This patch reorganizes the way suspend and resume notifications are sent to drivers. The major changes are that now the PM core acquires every device semaphore before calling the methods, and calls to device_add() during suspends will fail, while calls to device_del() during suspends will block. It also provides a way to safely remove a suspended device with the help of the PM core, by using the device_pm_schedule_removal() callback introduced specifically for this purpose, and updates two drivers (msr and cpuid) that need to use it. As device_pm_schedule_removal is deleted by another patch, we need also revert other parts of the patch, i.e. move the calling of device_pm_remove after the calling to bus_remove_device. > > Signed-off-by: LongX Zhang > --- > drivers/base/core.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 5e6e00b..81ea7f2 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -1169,7 +1169,6 @@ void device_del(struct device *dev) > if (dev->bus) > blocking_notifier_call_chain(>bus->p->bus_notifier, >BUS_NOTIFY_DEL_DEVICE, dev); > - device_pm_remove(dev); > dpm_sysfs_remove(dev); > if (parent) > klist_del(>p->knode_parent); > @@ -1194,6 +1193,7 @@ void device_del(struct device *dev) > device_remove_file(dev, _attr); > device_remove_attrs(dev); > bus_remove_device(dev); > + device_pm_remove(dev); > driver_deferred_probe_del(dev); > > /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Subject: [PATCH] drivers-core: move device_pm_remove behind bus_remove_device
On Fri, 2012-09-21 at 01:58 +, Zhang, LongX wrote: From: LongX Zhang longx.zh...@intel.com device_pm_remove will call pm_runtime_remove which would disable runtime PM of the device. After that pm_runtime_get* or pm_runtime_put* will be ingored. So if we disable the runtime PM before device really be removed, drivers' _remove callback may access HW even pm_runtime_get* fails. That is bad. The background about the patch: We hit an hang issue when removing a mmc device on Medfield Android phone by sysfs interface. Consider below call sequence when removing a device: device_del = device_pm_remove = class_intf-remove_dev(dev, class_intf) = pm_runtime_get_sync/put_sync = bus_remove_device = device_release_driver = pm_runtime_get_sync/put_sync remove_dev might call pm_runtime_get_sync/put_sync. Then, generic device_release_driver also calls pm_runtime_get_sync/put_sync. Since device_del = device_pm_remove firstly, later _get_sync wouldn't really wake up the device. I git log -p to find the patch which moves the calling to device_pm_remove ahead. It's below patch: commit 775b64d2b6ca37697de925f70799c710aab5849a Author: Rafael J. Wysocki r...@sisk.pl Date: Sat Jan 12 20:40:46 2008 +0100 PM: Acquire device locks on suspend This patch reorganizes the way suspend and resume notifications are sent to drivers. The major changes are that now the PM core acquires every device semaphore before calling the methods, and calls to device_add() during suspends will fail, while calls to device_del() during suspends will block. It also provides a way to safely remove a suspended device with the help of the PM core, by using the device_pm_schedule_removal() callback introduced specifically for this purpose, and updates two drivers (msr and cpuid) that need to use it. As device_pm_schedule_removal is deleted by another patch, we need also revert other parts of the patch, i.e. move the calling of device_pm_remove after the calling to bus_remove_device. Signed-off-by: LongX Zhang longx.zh...@intel.com --- drivers/base/core.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 5e6e00b..81ea7f2 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1169,7 +1169,6 @@ void device_del(struct device *dev) if (dev-bus) blocking_notifier_call_chain(dev-bus-p-bus_notifier, BUS_NOTIFY_DEL_DEVICE, dev); - device_pm_remove(dev); dpm_sysfs_remove(dev); if (parent) klist_del(dev-p-knode_parent); @@ -1194,6 +1193,7 @@ void device_del(struct device *dev) device_remove_file(dev, uevent_attr); device_remove_attrs(dev); bus_remove_device(dev); + device_pm_remove(dev); driver_deferred_probe_del(dev); /* -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86/fixup_irq: using the cpu_online_mask instead of cpu_all_mask
On Tue, 2012-08-14 at 06:55 +, Liu, Chuansheng wrote: > From: liu chuansheng > Subject: [PATCH] x86/fixup_irq: using the cpu_online_mask instead of > cpu_all_mask > > When one CPU is going down, and this CPU is the last one in irq affinity, > current code is setting cpu_all_mask as the new affinity for that irq. > > But for some system the firmware maybe send the interrupt to each CPU > in irq affinity averagely, and cpu_all_mask include all CPUs. > > Here replacing cpu_all_mask with cpu_online_mask, it is more reasonable > and fittable. It's a good finding. The issue exists on Medfield Android mobile. > > Signed-off-by: liu chuansheng > --- > arch/x86/kernel/irq.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c > index 7ad683d..d44f782 100644 > --- a/arch/x86/kernel/irq.c > +++ b/arch/x86/kernel/irq.c > @@ -270,7 +270,7 @@ void fixup_irqs(void) > > if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) > { > break_affinity = 1; > - affinity = cpu_all_mask; > + affinity = cpu_online_mask; > } > > chip = irq_data_get_irq_chip(data); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86/fixup_irq: using the cpu_online_mask instead of cpu_all_mask
On Tue, 2012-08-14 at 06:55 +, Liu, Chuansheng wrote: From: liu chuansheng chuansheng@intel.com Subject: [PATCH] x86/fixup_irq: using the cpu_online_mask instead of cpu_all_mask When one CPU is going down, and this CPU is the last one in irq affinity, current code is setting cpu_all_mask as the new affinity for that irq. But for some system the firmware maybe send the interrupt to each CPU in irq affinity averagely, and cpu_all_mask include all CPUs. Here replacing cpu_all_mask with cpu_online_mask, it is more reasonable and fittable. It's a good finding. The issue exists on Medfield Android mobile. Signed-off-by: liu chuansheng chuansheng@intel.com --- arch/x86/kernel/irq.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c index 7ad683d..d44f782 100644 --- a/arch/x86/kernel/irq.c +++ b/arch/x86/kernel/irq.c @@ -270,7 +270,7 @@ void fixup_irqs(void) if (cpumask_any_and(affinity, cpu_online_mask) = nr_cpu_ids) { break_affinity = 1; - affinity = cpu_all_mask; + affinity = cpu_online_mask; } chip = irq_data_get_irq_chip(data); -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Fwd: [PATCH] x86/smp: Fix cpuN startup panic]
Peter, What's your opinion about the patch? We hit it when enabling Medfield Android mobile. This patch would put AP to a loop. Another method to fix it is just to enlarge the wait time, for example, from 2HZ to 10HZ. Yanmin Forwarded Message > From: Chen, LinX Z > To: linux-kernel@vger.kernel.org > Cc: mi...@redhat.com, t...@linutronix.de, h...@zytor.com, > yanmin_zh...@linux.intel.com > Subject: [PATCH] x86/smp: Fix cpuN startup panic > Date: Tue, 07 Aug 2012 18:50:40 +0900 > > From: Lin Chen > > We hit a panic while doing cpu hotplug test. > <0>[ 627.982857] Kernel panic - not syncing: smp_callin: CPU1 started up but > did not get a callout! > <0>[ 627.982864] > <4>[ 627.982876] Pid: 0, comm: kworker/0:1 Tainted: G ... > <4>[ 627.982883] Call Trace: > <4>[ 627.982903] [] panic+0x66/0x16c > <4>[ 627.982918] [] ? default_get_apic_id+0x1c/0x40 > <4>[ 627.982931] [] start_secondary+0xda/0x252 > > During BSP bootup AP, it is possible that BSP be preempted before > finishing STARTUP sequence of AP(set cpu_callout_mask) which maybe cause > AP busy wait for it. At present, AP will wait for 2 seconds then panic. > > This patch let AP waits until BSP finish the startup sequence and gives > WARNING when BSP is preempted more than 2 seconds. > > Signed-off-by: Yanmin Zhang > Signed-off-by: Lin Chen > --- > arch/x86/kernel/smpboot.c | 11 ++- > 1 files changed, 6 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 7c5a8c3..a9e3379 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -165,19 +165,20 @@ static void __cpuinit smp_callin(void) >* Waiting 2s total for startup (udelay is not yet working) >*/ > timeout = jiffies + 2*HZ; > - while (time_before(jiffies, timeout)) { > + while (1) { > /* >* Has the boot CPU finished it's STARTUP sequence? >*/ > if (cpumask_test_cpu(cpuid, cpu_callout_mask)) > break; > cpu_relax(); > + if (!time_before(jiffies, timeout)) { > + WARN(1, "%s: CPU%d started up but did not get a > callout!\n", > + __func__, cpuid); > + timeout = jiffies + 2*HZ; > + } > } > > - if (!time_before(jiffies, timeout)) { > - panic("%s: CPU%d started up but did not get a callout!\n", > - __func__, cpuid); > - } > > /* >* the boot CPU has finished the init stage and is spinning > -- > 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Fwd: [PATCH] x86/smp: Fix cpuN startup panic]
Peter, What's your opinion about the patch? We hit it when enabling Medfield Android mobile. This patch would put AP to a loop. Another method to fix it is just to enlarge the wait time, for example, from 2HZ to 10HZ. Yanmin Forwarded Message From: Chen, LinX Z linx.z.c...@intel.com To: linux-kernel@vger.kernel.org Cc: mi...@redhat.com, t...@linutronix.de, h...@zytor.com, yanmin_zh...@linux.intel.com Subject: [PATCH] x86/smp: Fix cpuN startup panic Date: Tue, 07 Aug 2012 18:50:40 +0900 From: Lin Chen linx.z.c...@intel.com We hit a panic while doing cpu hotplug test. 0[ 627.982857] Kernel panic - not syncing: smp_callin: CPU1 started up but did not get a callout! 0[ 627.982864] 4[ 627.982876] Pid: 0, comm: kworker/0:1 Tainted: G ... 4[ 627.982883] Call Trace: 4[ 627.982903] [c18f2977] panic+0x66/0x16c 4[ 627.982918] [c12234cc] ? default_get_apic_id+0x1c/0x40 4[ 627.982931] [c18ef96d] start_secondary+0xda/0x252 During BSP bootup AP, it is possible that BSP be preempted before finishing STARTUP sequence of AP(set cpu_callout_mask) which maybe cause AP busy wait for it. At present, AP will wait for 2 seconds then panic. This patch let AP waits until BSP finish the startup sequence and gives WARNING when BSP is preempted more than 2 seconds. Signed-off-by: Yanmin Zhang yanmin_zh...@linux.intel.com Signed-off-by: Lin Chen linx.z.c...@intel.com --- arch/x86/kernel/smpboot.c | 11 ++- 1 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 7c5a8c3..a9e3379 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -165,19 +165,20 @@ static void __cpuinit smp_callin(void) * Waiting 2s total for startup (udelay is not yet working) */ timeout = jiffies + 2*HZ; - while (time_before(jiffies, timeout)) { + while (1) { /* * Has the boot CPU finished it's STARTUP sequence? */ if (cpumask_test_cpu(cpuid, cpu_callout_mask)) break; cpu_relax(); + if (!time_before(jiffies, timeout)) { + WARN(1, %s: CPU%d started up but did not get a callout!\n, + __func__, cpuid); + timeout = jiffies + 2*HZ; + } } - if (!time_before(jiffies, timeout)) { - panic(%s: CPU%d started up but did not get a callout!\n, - __func__, cpuid); - } /* * the boot CPU has finished the init stage and is spinning -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86/smp: Fix cpuN startup panic
On Wed, 2012-08-08 at 00:33 +0800, Jiang Liu wrote: > On 08/07/2012 05:50 PM, Chen, LinX Z wrote: > > From: Lin Chen > > > > We hit a panic while doing cpu hotplug test. > > <0>[ 627.982857] Kernel panic - not syncing: smp_callin: CPU1 started up > > but did not get a callout! > > <0>[ 627.982864] > > <4>[ 627.982876] Pid: 0, comm: kworker/0:1 Tainted: G ... > > <4>[ 627.982883] Call Trace: > > <4>[ 627.982903] [] panic+0x66/0x16c > > <4>[ 627.982918] [] ? default_get_apic_id+0x1c/0x40 > > <4>[ 627.982931] [] start_secondary+0xda/0x252 > > > > During BSP bootup AP, it is possible that BSP be preempted before > > finishing STARTUP sequence of AP(set cpu_callout_mask) which maybe cause > > AP busy wait for it. At present, AP will wait for 2 seconds then panic. > > > > This patch let AP waits until BSP finish the startup sequence and gives > > WARNING when BSP is preempted more than 2 seconds. > > > > Signed-off-by: Yanmin Zhang > > Signed-off-by: Lin Chen > > --- > > arch/x86/kernel/smpboot.c | 11 ++- > > 1 files changed, 6 insertions(+), 5 deletions(-) > > > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 7c5a8c3..a9e3379 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > > @@ -165,19 +165,20 @@ static void __cpuinit smp_callin(void) > > * Waiting 2s total for startup (udelay is not yet working) > > */ > > timeout = jiffies + 2*HZ; > > -while (time_before(jiffies, timeout)) { > > +while (1) { > Hi Yanmin, > > Seems a little risky, what if a slave CPU can't be booted due to > hardware errors? Slave CPU runs the loop. Basically, there is a handshake between BSP and AP. The patch doesn't change BSP codes. So when slave CPU fails, BSP still goes ahead and kernel still works. > Regards! > Gerry > > > /* > > * Has the boot CPU finished it's STARTUP sequence? > > */ > > if (cpumask_test_cpu(cpuid, cpu_callout_mask)) > > break; > > cpu_relax(); > > +if (!time_before(jiffies, timeout)) { > > +WARN(1, "%s: CPU%d started up but did not get a callout!\n", > > +__func__, cpuid); > > +timeout = jiffies + 2*HZ; > > +} > > } > > > > -if (!time_before(jiffies, timeout)) { > > -panic("%s: CPU%d started up but did not get a callout!\n", > > - __func__, cpuid); > > -} > > > > /* > > * the boot CPU has finished the init stage and is spinning > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] x86/smp: Fix cpuN startup panic
On Wed, 2012-08-08 at 00:33 +0800, Jiang Liu wrote: On 08/07/2012 05:50 PM, Chen, LinX Z wrote: From: Lin Chen linx.z.c...@intel.com We hit a panic while doing cpu hotplug test. 0[ 627.982857] Kernel panic - not syncing: smp_callin: CPU1 started up but did not get a callout! 0[ 627.982864] 4[ 627.982876] Pid: 0, comm: kworker/0:1 Tainted: G ... 4[ 627.982883] Call Trace: 4[ 627.982903] [c18f2977] panic+0x66/0x16c 4[ 627.982918] [c12234cc] ? default_get_apic_id+0x1c/0x40 4[ 627.982931] [c18ef96d] start_secondary+0xda/0x252 During BSP bootup AP, it is possible that BSP be preempted before finishing STARTUP sequence of AP(set cpu_callout_mask) which maybe cause AP busy wait for it. At present, AP will wait for 2 seconds then panic. This patch let AP waits until BSP finish the startup sequence and gives WARNING when BSP is preempted more than 2 seconds. Signed-off-by: Yanmin Zhang yanmin_zh...@linux.intel.com Signed-off-by: Lin Chen linx.z.c...@intel.com --- arch/x86/kernel/smpboot.c | 11 ++- 1 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 7c5a8c3..a9e3379 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -165,19 +165,20 @@ static void __cpuinit smp_callin(void) * Waiting 2s total for startup (udelay is not yet working) */ timeout = jiffies + 2*HZ; -while (time_before(jiffies, timeout)) { +while (1) { Hi Yanmin, Seems a little risky, what if a slave CPU can't be booted due to hardware errors? Slave CPU runs the loop. Basically, there is a handshake between BSP and AP. The patch doesn't change BSP codes. So when slave CPU fails, BSP still goes ahead and kernel still works. Regards! Gerry /* * Has the boot CPU finished it's STARTUP sequence? */ if (cpumask_test_cpu(cpuid, cpu_callout_mask)) break; cpu_relax(); +if (!time_before(jiffies, timeout)) { +WARN(1, %s: CPU%d started up but did not get a callout!\n, +__func__, cpuid); +timeout = jiffies + 2*HZ; +} } -if (!time_before(jiffies, timeout)) { -panic(%s: CPU%d started up but did not get a callout!\n, - __func__, cpuid); -} /* * the boot CPU has finished the init stage and is spinning -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/