[PATCH] ARM: OMAP4+: wakeupgen: fix memory corruption
Fix a memory corruption bug caused by commit 247c445c0fbd52c77e497ff5bfcf0dceb8afea8d (ARM: OMAP5: Add the WakeupGen IP updates) and commit ec2c0825ca3183a646a24717966cc7752e8b0393 (ARM: OMAP2+: Remove hardcoded IRQs and enable SPARSE_IRQ).

The first commit, in the OMAP4+ wakeupgen code, has an implicit dependency on !SPARSE_IRQ. It allocates a static array with NR_IRQS elements, then proceeds to iterate over 128 or 160 elements of that array, clearing them to zero.

The second commit switched OMAP2+ to use sparse IRQs, but missed the NR_IRQS reference in the wakeupgen code. Before the second commit, NR_IRQS was 474 on OMAP4430; but afterwards, it became 16. This resulted in the wakeupgen code allocating a 16 element array, and then attempting to write to 128 or 160 of those elements, depending on the type of SoC. This trashed a chunk of whatever was allocated after the array.

The immediate manifestation was a set of boot warnings similar to the following:

WARNING: at arch/arm/mach-omap2/omap_hwmod.c:1941 _enable+0x1bc/0x204()
omap_hwmod: mpu: could not enable clockdomain mpuss_clkdm: -22

... since it blew away arch_clkdm. Ultimately the kernel crashed during boot.

Fix the problem in the OMAP4+ wakeupgen code by removing the reference to NR_IRQS, allocating a larger array, and warning if the iteration is larger than the array.

Signed-off-by: Paul Walmsley p...@pwsan.com
Cc: Tony Lindgren t...@atomide.com
Cc: Santosh Shilimkar santosh.shilim...@ti.com
---
Applies on arm-soc omap/cleanup-sparseirq and should ideally be merged there before the 3.7 merge window.
Test logs are here: http://www.pwsan.com/omap/testlogs/broken_sparseirq_fix_3.7/20120922012656/

 arch/arm/mach-omap2/omap-wakeupgen.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mach-omap2/omap-wakeupgen.c b/arch/arm/mach-omap2/omap-wakeupgen.c
index b54427d..869f16c 100644
--- a/arch/arm/mach-omap2/omap-wakeupgen.c
+++ b/arch/arm/mach-omap2/omap-wakeupgen.c
@@ -47,7 +47,7 @@ static void __iomem *wakeupgen_base;
 static void __iomem *sar_base;
 static DEFINE_SPINLOCK(wakeupgen_lock);
-static unsigned int irq_target_cpu[NR_IRQS];
+static unsigned int irq_target_cpu[MAX_IRQS];
 static unsigned int irq_banks = MAX_NR_REG_BANKS;
 static unsigned int max_irqs = MAX_IRQS;
 static unsigned int omap_secure_apis;
@@ -446,6 +446,12 @@ int __init omap_wakeupgen_init(void)
 	 * GIC code has necessary hooks in place.
 	 */
 
+	/*
+	 * If you see this warning, then the subsequent loop just
+	 * corrupted some memory
+	 */
+	WARN_ON(max_irqs > ARRAY_SIZE(irq_target_cpu));
+
 	/* Associate all the IRQs to boot CPU like GIC init does. */
 	for (i = 0; i < max_irqs; i++)
 		irq_target_cpu[i] = boot_cpu;
--
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
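As a rough illustration of the failure mode described in the patch above, the userspace sketch below models a statically sized table and a guarded initialisation loop. It is not the kernel code: DEMO_MAX_IRQS, demo_irq_target_cpu and demo_init_targets() are hypothetical names, the table size of 160 is simply the larger of the two counts mentioned in the changelog, and the guard clamps the count (the patch itself only warns) so that the example stays memory-safe.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel's MAX_IRQS and ARRAY_SIZE(). */
#define DEMO_MAX_IRQS 160
#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Table sized by its own constant, not by an unrelated global like NR_IRQS. */
unsigned int demo_irq_target_cpu[DEMO_MAX_IRQS];

/*
 * Point the first 'count' entries at 'boot_cpu' and return the number of
 * entries actually written. If 'count' exceeds the table size, print a
 * warning and clamp the write (an assumption of this sketch; the patch
 * above warns but does not clamp).
 */
size_t demo_init_targets(size_t count, unsigned int boot_cpu)
{
	size_t limit = DEMO_ARRAY_SIZE(demo_irq_target_cpu);
	size_t i;

	if (count > limit) {
		fprintf(stderr, "warning: %zu IRQs requested, table holds %zu\n",
			count, limit);
		count = limit;
	}

	for (i = 0; i < count; i++)
		demo_irq_target_cpu[i] = boot_cpu;

	return count;
}
```

Calling demo_init_targets(474, 0) in this model writes only 160 entries; with an unguarded loop over a 16-element array, the same call would overwrite whatever the linker placed after the array.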
Re: [PATCH v2] PM / Runtime: let rpm_resume() succeed if RPM_ACTIVE, even when disabled
On Saturday, September 22, 2012, Kevin Hilman wrote:

From: Kevin Hilman khil...@ti.com

There are several drivers where the return value of pm_runtime_get_sync() is used to decide whether or not it is safe to access hardware and that don't provide .suspend() callbacks for system suspend (but may use late/noirq callbacks.) If such a driver happens to call pm_runtime_get_sync() during system suspend, after the core has disabled runtime PM, it will get the error code and will decide that the hardware should not be accessed, although this may be a wrong conclusion, depending on the state of the device when runtime PM was disabled.

Drivers might work around this problem by using a test like:

	ret = pm_runtime_get_sync(dev);
	if (!ret || (ret == -EACCES && driver_private_data(dev)->suspended)) {
		/* access hardware */
	}

where driver_private_data(dev)->suspended is a flag set by the driver's .suspend() method (that would have to be added for this purpose). However, that potentially would need to be done by multiple drivers which means quite a lot of duplicated code and bloat.

To avoid that we can use the observation that the core sets dev->power.is_suspended before disabling runtime PM and use that instead of the driver's private flag. Still, potentially many drivers would need to repeat that same check in quite a few places, so it's better to let the core do it.

Then we can be a bit smarter and check whether or not runtime PM was disabled by the core only (disable_depth == 1) or by someone else in addition to the core (disable_depth > 1). In the former case rpm_resume() can return 1 if the runtime PM status is RPM_ACTIVE, because it means the device was active when the core disabled runtime PM. In the latter case it should still return -EACCES, because it isn't clear why runtime PM has been disabled.

Tested on AM3730/Beagle-xM where a wakeup IRQ firing during the late suspend phase triggers runtime PM activity in the I2C driver since the wakeup IRQ is on an I2C-connected PMIC.

Cc: Rafael J. Wysocki r...@sisk.pl
Cc: Alan Stern st...@rowland.harvard.edu
Signed-off-by: Kevin Hilman khil...@ti.com
---
v2:
- major changelog rewrite, based largely on input from Rafael
- add check for disable_depth == 1 and move to separate if statement, both suggested by Alan Stern

OK, this looks good to me, thanks!

Alan, what do you think?

Rafael

 drivers/base/power/runtime.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 7d9c1cb..d43856b 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -509,6 +509,9 @@ static int rpm_resume(struct device *dev, int rpmflags)
  repeat:
 	if (dev->power.runtime_error)
 		retval = -EINVAL;
+	else if (dev->power.disable_depth == 1 && dev->power.is_suspended &&
+		dev->power.runtime_status == RPM_ACTIVE)
+		retval = 1;
 	else if (dev->power.disable_depth > 0)
 		retval = -EACCES;
 	if (retval)
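The order of the checks in the patched rpm_resume() can be sketched in plain userspace C as below. This is only a model under stated assumptions: struct demo_pm_info, enum demo_rpm_status and demo_resume_precheck() are hypothetical names that carry just the four fields the changelog discusses, and a return value of 0 here simply means "no early exit, carry on with the resume".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the runtime PM status values. */
enum demo_rpm_status { DEMO_RPM_ACTIVE, DEMO_RPM_SUSPENDED };

/* Only the fields the check looks at; not the kernel's struct dev_pm_info. */
struct demo_pm_info {
	int runtime_error;
	unsigned int disable_depth;
	bool is_suspended;
	enum demo_rpm_status runtime_status;
};

/*
 * Mirror the order of the tests in the patch above.
 * Returns -EINVAL, 1, -EACCES, or 0 (meaning: no early exit).
 */
int demo_resume_precheck(const struct demo_pm_info *p)
{
	if (p->runtime_error)
		return -EINVAL;
	if (p->disable_depth == 1 && p->is_suspended &&
	    p->runtime_status == DEMO_RPM_ACTIVE)
		return 1;	/* disabled by the core only, device was active */
	if (p->disable_depth > 0)
		return -EACCES;	/* disabled by someone else as well */
	return 0;
}
```

In this model a caller that only treats negative values as errors would keep accessing the hardware in the "disabled by the core only, device active" case, which is the behaviour the changelog argues for.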
Re: [PATCH] ARM: OMAP4+: wakeupgen: fix memory corruption
Paul,

On Sat, Sep 22, 2012 at 1:41 PM, Paul Walmsley p...@pwsan.com wrote:

Fix a memory corruption bug caused by commit 247c445c0fbd52c77e497ff5bfcf0dceb8afea8d (ARM: OMAP5: Add the WakeupGen IP updates) and commit ec2c0825ca3183a646a24717966cc7752e8b0393 (ARM: OMAP2+: Remove hardcoded IRQs and enable SPARSE_IRQ).

The first commit, in the OMAP4+ wakeupgen code, has an implicit dependency on !SPARSE_IRQ. It allocates a static array with NR_IRQS elements, then proceeds to iterate over 128 or 160 elements of that array, clearing them to zero.

The second commit switched OMAP2+ to use sparse IRQs, but missed the NR_IRQS reference in the wakeupgen code. Before the second commit, NR_IRQS was 474 on OMAP4430; but afterwards, it became 16. This resulted in the wakeupgen code allocating a 16 element array, and then attempting to write to 128 or 160 of those elements, depending on the type of SoC. This trashed a chunk of whatever was allocated after the array.

The immediate manifestation was a set of boot warnings similar to the following:

WARNING: at arch/arm/mach-omap2/omap_hwmod.c:1941 _enable+0x1bc/0x204()
omap_hwmod: mpu: could not enable clockdomain mpuss_clkdm: -22

... since it blew away arch_clkdm. Ultimately the kernel crashed during boot.

Fix the problem in the OMAP4+ wakeupgen code by removing the reference to NR_IRQS, allocating a larger array, and warning if the iteration is larger than the array.

Signed-off-by: Paul Walmsley p...@pwsan.com
Cc: Tony Lindgren t...@atomide.com
Cc: Santosh Shilimkar santosh.shilim...@ti.com
---
Applies on arm-soc omap/cleanup-sparseirq and should ideally be merged there before the 3.7 merge window.

The issue is already fixed by commit e534e87 {ARM: OMAP4: Fix array size for irq_target_cpu} in mainline. The fix got merged after the 3.6-rc5 tag and hence does not appear in the 'omap/cleanup-sparseirq' branch, which seems to be based off 3.6-rc5.

If you merge the 3.6-rc6 tag or the latest mainline with omap/cleanup-sparseirq, the issue should go away. So from the 3.7 merge window point of view, the fix is already in place.

Regards
Santosh
Re: [PATCH v3 03/15] dmaengine: Add flags parameter to dmaengine_prep_dma_cyclic()
On Fri, Sep 14, 2012 at 03:05:46PM +0300, Peter Ujfalusi wrote:

With this parameter added to dmaengine_prep_dma_cyclic() the API will be in sync with other dmaengine_prep_*() functions. The dmaengine_prep_dma_cyclic() function is primarily used by audio for the cyclic transfers required by ALSA; the audio code uses the flags to ask DMA drivers to suppress interrupts (if DMA_PREP_INTERRUPT is cleared) when that is supported on the platform.

Are you sure this was generated against for-3.7? There's fuzz against dmaengine.h and git can't find the blobs to do resolution.

Anyway, I applied this and the rest of the series.
Re: [PATCH v4 00/14] MFD/ASoC/Input: twl4030-audio submodule DT support
On Mon, Sep 10, 2012 at 01:46:18PM +0300, Peter Ujfalusi wrote: Hello, Applied all, thanks.
Re: [GIT PULL] ARM: OMAP: hwmod/PMU/PRCM patches for 3.7
* Paul Walmsley p...@pwsan.com [120921 22:41]:

On Fri, 21 Sep 2012, Tony Lindgren wrote:

* Tony Lindgren t...@atomide.com [120921 13:55]:

Care to base this on something more mergeable? Maybe a merge of cleanup-fixes-for-v3.7 + omap-devel-am33xx-for-v3.7?

While working on this, noticed that the 4430ES2 Panda test boot failed on the merge base of cleanup-fixes-for-v3.7 and omap-devel-am33xx-for-v3.7. Enabling DEBUG_LL and adding some debug revealed that the static variable 'arch_clkdm' in mach-omap2/clockdomain.c was getting overwritten between omap44xx_clockdomains_init() and the end of IRQ setup.

This was bisected down to this commit:

commit ec2c0825ca3183a646a24717966cc7752e8b0393
Author: Tony Lindgren t...@atomide.com
Date: Mon Aug 27 17:43:01 2012 -0700

ARM: OMAP2+: Remove hardcoded IRQs and enable SPARSE_IRQ

Remove hardcoded IRQs in irqs.h and related files as these are no longer needed.
...

Looks to me like something is wrong with the IRQ allocation and it's corrupting memory.

Yeah I bet that's e534e871 (ARM: OMAP4: Fix array size for irq_target_cpu) already in mainline since -rc6.

Regards,

Tony
Re: rcu self-detected stall messages on OMAP3, 4 boards
2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:

On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:

* Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:

Just to make sure I understand the combinations:

o All stalls have happened when running a minimal userspace.
o CONFIG_NO_HZ=n suppresses the stalls.
o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has no observable effect on the stalls.

The reason why you may need minimal userspace is to cut down the number of timers waking up the system with NO_HZ. Booting with init=/bin/sh might also do the trick for that.

Good point! This does make for a very quiet system, but does not reproduce the problem under kvm, even after waiting for four minutes. I will leave it for more time, but it looks like I really might need to ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it and I'll try to manage to build a kernel and reproduce the issue.

I'll give more news soon.
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, Sep 22, 2012 at 05:45:12PM +0200, Frederic Weisbecker wrote:

2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:

On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:

* Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:

Just to make sure I understand the combinations:

o All stalls have happened when running a minimal userspace.
o CONFIG_NO_HZ=n suppresses the stalls.
o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has no observable effect on the stalls.

The reason why you may need minimal userspace is to cut down the number of timers waking up the system with NO_HZ. Booting with init=/bin/sh might also do the trick for that.

Good point! This does make for a very quiet system, but does not reproduce the problem under kvm, even after waiting for four minutes. I will leave it for more time, but it looks like I really might need to ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it and I'll try to manage to build a kernel and reproduce the issue.

I'll give more news soon.

Thank you!

My bet is that you have to have a userspace that is so small that it registers only a few (but at least one!) RCU callback at boot time, then never registers any callbacks ever again. I have coded up a crude test case, using Tony Lindgren's suggestion of init=/bin/sh, but I appear to have inadvertently fixed this bug in current -rcu (git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git, branch rcu/next).

But I have been wrong a few times already on this particular bug...

Thanx, Paul
Re: [REPOST PATCH v2 1/2] spi: omap2-mcspi: add pinctrl support
On Tue, Sep 18, 2012 at 08:01:25AM -0400, Matt Porter wrote: Adds pinctrl support to support OMAP platforms that boot from DT and rely on pinctrl support to set pinmuxes. Applied, thanks.
Re: CPU_IDLE causes random reboots on custom 4430
On 09/22/2012 07:45 AM, Shilimkar, Santosh wrote:

On Sat, Sep 22, 2012 at 4:19 AM, Chris Hoffmann chrmhoffm...@gmail.com wrote:

Hi, We're trying to get a custom 4430 board (aka. nook tablet with OMAP4430 ES2.3 HS TWL6030 ES2.1) working with p-android-omap-3.0 on android jelly bean. The board works quite well, but we experience random hangs and the watchdog kicks the board to reboot.

On the same kernel, you should have support for the persistent log. You might want to check the output. That should give you pointers on what CPU was doing before the freeze which resulted in reboot.

Hi,

I am having trouble providing logs. If I add -DDEBUG to cpuidle44xx.o, the problem doesn't seem to occur. It could be that the extra printk output alleviates the issue.

Also, the watchdog seems to shut the device down rather than reboot it (or the device hangs?), so I can't provide /proc/last_kmsg.

How could I provide more info?

Rgds, Chris
Re: [GIT PULL] ARM: OMAP: hwmod/PMU/PRCM patches for 3.7
On Fri, 21 Sep 2012, Tony Lindgren wrote:

Hmm I wonder what's causing it then? There must be something else in tmp-merge at commit abfee61f that causes the problems. Maybe try to merge with that commit and see what you get?

Probably the merge with the clock patches was causing trouble.

That commit can't be used as a base though as that's temporary most likely.. But we can create a base to use out of the branches once we know them, you can do it yourself too.

Your tmp-merge contains branch/tag merges that haven't yet gone upstream to arm-soc. I don't know which of those merges you consider stable (aside from the upstream ones, obviously).

For this one it looks like the clock patches were the ones causing the merge trouble. So since that series also came from me and is unmerged, will just merge the clock and hwmod patches into a new pull request on v3.6-rc6 + cleanup-fixes-for-v3.7 + omap-devel-am33xx-for-v3.7.

Hopefully that will work for you...

- Paul
Re: rcu self-detected stall messages on OMAP3, 4 boards
Hi Paul

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

I am wondering if your system somehow figured out how to start a grace period that had no RCU callbacks waiting for it. If that happened, then a CONFIG_NO_HZ=y system could in theory get into a state where all CPUs are in dyntick-idle mode, so that none of them is doing anything to force the grace period to complete. That should be easy to diagnose, anyway. Please see below, which includes the earlier diagnostic patch.

Here you go.

- Paul

[ 248.902618] INFO: rcu_sched self-detected stall on CPU
[ 248.905456] 0: (1 ticks this GP) idle=933/1/0
[ 248.907897] (t=26570 jiffies g=11 c=10 q=0)
[ 248.910339] [<c001bc90>] (unwind_backtrace+0x0/0xf0) from [<c00ad800>] (rcu_check_callbacks+0x220/0x714)
[ 248.915527] [<c00ad800>] (rcu_check_callbacks+0x220/0x714) from [<c00532a0>] (update_process_times+0x38/0x68)
[ 248.920928] [<c00532a0>] (update_process_times+0x38/0x68) from [<c008c9e8>] (tick_sched_timer+0x80/0xec)
[ 248.926116] [<c008c9e8>] (tick_sched_timer+0x80/0xec) from [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0)
[ 248.930999] [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0) from [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0)
[ 248.936035] [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3cc>] (twd_handler+0x30/0x44)
[ 248.940948] [<c001a3cc>] (twd_handler+0x30/0x44) from [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c)
[ 248.946075] [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4344>] (generic_handle_irq+0x30/0x48)
[ 248.951538] [<c00a4344>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
[ 248.956329] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
[ 248.960937] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fb1a4>] (__irq_svc+0x44/0x5c)
[ 248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
[ 248.968231] 9f40: 0003b832 0001
[ 248.972686] 9f60: c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 411fc092
[ 248.977142] 9f80: c074bfe8 0001 c0729fa0 0003b833 c0015130 2113
[ 248.981597] [<c04fb1a4>] (__irq_svc+0x44/0x5c) from [<c0015130>] (default_idle+0x20/0x44)
[ 248.986083] [<c0015130>] (default_idle+0x20/0x44) from [<c001535c>] (cpu_idle+0x9c/0x114)
[ 248.990539] [<c001535c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)
Re: OMAP baseline test results for v3.6-rc4
cc Santosh

Hi Igor,

I regret the delay in responding,

On Fri, 7 Sep 2012, Igor Grinberg wrote:

On 09/05/12 18:44, Paul Walmsley wrote:

* CM-T3517: L3 in-band error with USB OTG during boot
  - Cause unknown; longstanding issue; does not occur on the 3517EVM

We see this problem on several cm-t3517, but not all of them.

There's probably a dependency on the bootloader or X-Loader.

It looks something like:

cut
NET: Registered protocol family 16
GPMC revision 5.0
gpmc: irq-20 could not claim: err -22
OMAP GPIO hardware version 2.5
In-band Error seen by USB_OTG at address 0

That tag USB_OTG above isn't 100% accurate for AM3517/3505, by the way. omap_l3_smx.c doesn't have a correct initiator map for those chips. The offender could be USBOTG, but it could also be any other initiator in the IP subsystem, such as Camera/VPFE or EMAC. Table 5-18 InitiatorID Definition in the AM35x TRM vB (SPRUGR0B) lists these.

As far as I know, the message means that some module in the IPSS tried to initiate an L3 interconnect transaction, but that it failed. Probably the IPSS isn't clocked.

------------[ cut here ]------------
WARNING: at /home/lifshitz/workroot/git-repo/linux-cm-t3x/arch/arm/mach-omap2/omap_l3_smx.c:162 omap3_l3_app_irq+0xdc/0x120()
Modules linked in:
[<c001ad08>] (unwind_backtrace+0x0/0xf4) from [<c003f670>] (warn_slowpath_common+0x4c/0x64)
[<c003f670>] (warn_slowpath_common+0x4c/0x64) from [<c003f6a4>] (warn_slowpath_null+0x1c/0x24)
[<c003f6a4>] (warn_slowpath_null+0x1c/0x24) from [<c0033af0>] (omap3_l3_app_irq+0xdc/0x120)
[<c0033af0>] (omap3_l3_app_irq+0xdc/0x120) from [<c008b8bc>] (handle_irq_event_percpu+0xac/0x298)
[<c008b8bc>] (handle_irq_event_percpu+0xac/0x298) from [<c008bafc>] (handle_irq_event+0x54/0x74)
[<c008bafc>] (handle_irq_event+0x54/0x74) from [<c008e290>] (handle_level_irq+0xc4/0x118)
[<c008e290>] (handle_level_irq+0xc4/0x118) from [<c008b3ac>] (generic_handle_irq+0x2c/0x44)
[<c008b3ac>] (generic_handle_irq+0x2c/0x44) from [<c001500c>] (handle_IRQ+0x60/0x80)
[<c001500c>] (handle_IRQ+0x60/0x80) from [<c00085ec>] (omap3_intc_handle_irq+0x60/0x74)
[<c00085ec>] (omap3_intc_handle_irq+0x60/0x74) from [<c04e3100>] (__irq_svc+0x40/0x74)
Exception stack(0xcf02de00 to 0xcf02de48)
de00: 000a 0021 c074bcac cf046280 000a 6013
de20: c074bcdc c070020c 0001 cf02de48 c008c988
de40: 4013
[<c04e3100>] (__irq_svc+0x40/0x74) from [<c008c988>] (__setup_irq+0x2a8/0x404)
[<c008c988>] (__setup_irq+0x2a8/0x404) from [<c008cd18>] (request_threaded_irq+0xe8/0x13c)
[<c008cd18>] (request_threaded_irq+0xe8/0x13c) from [<c06c3d24>] (omap3_l3_probe+0x10c/0x16c)
[<c06c3d24>] (omap3_l3_probe+0x10c/0x16c) from [<c033586c>] (platform_drv_probe+0x18/0x1c)
[<c033586c>] (platform_drv_probe+0x18/0x1c) from [<c0334414>] (really_probe+0xac/0x1c8)
[<c0334414>] (really_probe+0xac/0x1c8) from [<c0334578>] (driver_probe_device+0x48/0x60)
[<c0334578>] (driver_probe_device+0x48/0x60) from [<c03345f0>] (__driver_attach+0x60/0x84)
[<c03345f0>] (__driver_attach+0x60/0x84) from [<c0332ce0>] (bus_for_each_dev+0x4c/0x80)
[<c0332ce0>] (bus_for_each_dev+0x4c/0x80) from [<c0333414>] (bus_add_driver+0xa4/0x294)
[<c0333414>] (bus_add_driver+0xa4/0x294) from [<c0334bdc>] (driver_register+0xa4/0x188)
[<c0334bdc>] (driver_register+0xa4/0x188) from [<c0335c5c>] (platform_driver_probe+0x18/0x98)
[<c0335c5c>] (platform_driver_probe+0x18/0x98) from [<c0008798>] (do_one_initcall+0xac/0x16c)
[<c0008798>] (do_one_initcall+0xac/0x16c) from [<c06b52ac>] (do_basic_setup+0x88/0xc0)
[<c06b52ac>] (do_basic_setup+0x88/0xc0) from [<c06b53c4>] (kernel_init+0x60/0xfc)
[<c06b53c4>] (kernel_init+0x60/0xfc) from [<c00150a4>] (kernel_thread_exit+0x0/0x8)
---[ end trace 1b75b31a2719ed1c ]---
-cut---

After that, the board continues to function properly. Any hints how to debug this?

Probably the core problem is that we don't yet have the IPSS correctly supported in the AM35xx hwmod data. This is partially due to the fact that we're missing hierarchical enables/disables in that code, a longstanding omission.

My guess is that if you hacked in some code to enable the IPSS early in boot (see the CONTROL_IPSS_CLK_CTRL register), the problem would probably go away.

- Paul
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Fri, 21 Sep 2012, Paul E. McKenney wrote:

Could you please point me to a recipe for creating a minimal userspace? Just in case it is the userspace rather than the architecture/hardware that makes the difference.

Tony's suggestion is pretty good. Note that there may also be differences in kernel timers -- either between x86 and ARM architectures, or loaded device drivers -- that may confound the problem.

Just to make sure I understand the combinations:

o All stalls have happened when running a minimal userspace.
o CONFIG_NO_HZ=n suppresses the stalls.
o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has no observable effect on the stalls.

Did I get that right, or am I missing a combination?

That's correct.

Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is cancelled upon exit from idle, and therefore should (almost) never actually execute. Its sole purpose is to wake up the CPU. ;-)

Right. Just curious, what would wake up the kernel from idle to handle a grace period expiration when CONFIG_RCU_FAST_NO_HZ=n? On a very idle system, the time between timer ticks could potentially be several tens of seconds.

- Paul
Re: [PATCH v4 2/3] ARM: omap: hwmod: get rid of all omap_clk_get_by_name usage
On Wed, 29 Aug 2012, Rajendra Nayak wrote:

Moving to Common clk framework for OMAP would mean we no longer use internal lookup mechanism like omap_clk_get_by_name(). get rid of all its usage mostly from hwmod and omap_device code. Also use IS_ERR_OR_NULL() for error checking.

Moving to clk_get() also means the respective platforms need the clkdev tables updated with an entry for all clocks used by hwmod to have clock name same as the alias.

Based on original changes from Mike Turquette.

Signed-off-by: Rajendra Nayak rna...@ti.com

Updated this one to include several missing clock aliases and to fix a bug it introduced with omap_96m_alwon_fck_3630; modified patch below.

- Paul

From: Rajendra Nayak rna...@ti.com
Date: Sat, 22 Sep 2012 02:24:16 -0600
Subject: [PATCH] ARM: OMAP2+: hwmod: get rid of all omap_clk_get_by_name usage

Moving to Common clk framework for OMAP would mean we no longer use internal lookup mechanism like omap_clk_get_by_name(). get rid of all its usage mostly from hwmod and omap_device code.

Moving to clk_get() also means the respective platforms need the clkdev tables updated with an entry for all clocks used by hwmod to have clock name same as the alias.

Based on original changes from Mike Turquette.

Signed-off-by: Rajendra Nayak rna...@ti.com
Cc: Russell King - ARM Linux li...@arm.linux.org.uk
[p...@pwsan.com: removed IS_ERR_OR_NULL() conversion (rmk comment); restricted omap_96m_alwon_fck_3630 to OMAP36xx; added missing AM35xx clock aliases for emac_fck, emac_ick, vpfe_ick, vpfe_fck; added aliases rng_ick and several emulation clocks]
Signed-off-by: Paul Walmsley p...@pwsan.com
---
 arch/arm/mach-omap2/clock2420_data.c | 17 +
 arch/arm/mach-omap2/clock2430_data.c | 22 ++
 arch/arm/mach-omap2/clock3xxx_data.c | 31 +++
 arch/arm/mach-omap2/clock44xx_data.c | 6 ++
 arch/arm/mach-omap2/omap_hwmod.c | 12 ++--
 arch/arm/plat-omap/clock.c | 27 ---
 arch/arm/plat-omap/omap_device.c | 4 ++--
 7 files changed, 84 insertions(+), 35 deletions(-)

diff --git a/arch/arm/mach-omap2/clock2420_data.c b/arch/arm/mach-omap2/clock2420_data.c
index 8c3bd2a..c3cde1a 100644
--- a/arch/arm/mach-omap2/clock2420_data.c
+++ b/arch/arm/mach-omap2/clock2420_data.c
@@ -1804,6 +1804,7 @@ static struct omap_clk omap2420_clks[] = {
 	CLK(NULL,	"gfx_ick",	&gfx_ick,	CK_242X),
 	/* DSS domain clocks */
 	CLK("omapdss_dss",	"ick",	&dss_ick,	CK_242X),
+	CLK(NULL,	"dss_ick",	&dss_ick,	CK_242X),
 	CLK(NULL,	"dss1_fck",	&dss1_fck,	CK_242X),
 	CLK(NULL,	"dss2_fck",	&dss2_fck,	CK_242X),
 	CLK(NULL,	"dss_54m_fck",	&dss_54m_fck,	CK_242X),
@@ -1843,12 +1844,16 @@ static struct omap_clk omap2420_clks[] = {
 	CLK(NULL,	"gpt12_ick",	&gpt12_ick,	CK_242X),
 	CLK(NULL,	"gpt12_fck",	&gpt12_fck,	CK_242X),
 	CLK("omap-mcbsp.1",	"ick",	&mcbsp1_ick,	CK_242X),
+	CLK(NULL,	"mcbsp1_ick",	&mcbsp1_ick,	CK_242X),
 	CLK(NULL,	"mcbsp1_fck",	&mcbsp1_fck,	CK_242X),
 	CLK("omap-mcbsp.2",	"ick",	&mcbsp2_ick,	CK_242X),
+	CLK(NULL,	"mcbsp2_ick",	&mcbsp2_ick,	CK_242X),
 	CLK(NULL,	"mcbsp2_fck",	&mcbsp2_fck,	CK_242X),
 	CLK("omap2_mcspi.1",	"ick",	&mcspi1_ick,	CK_242X),
+	CLK(NULL,	"mcspi1_ick",	&mcspi1_ick,	CK_242X),
 	CLK(NULL,	"mcspi1_fck",	&mcspi1_fck,	CK_242X),
 	CLK("omap2_mcspi.2",	"ick",	&mcspi2_ick,	CK_242X),
+	CLK(NULL,	"mcspi2_ick",	&mcspi2_ick,	CK_242X),
 	CLK(NULL,	"mcspi2_fck",	&mcspi2_fck,	CK_242X),
 	CLK(NULL,	"uart1_ick",	&uart1_ick,	CK_242X),
 	CLK(NULL,	"uart1_fck",	&uart1_fck,	CK_242X),
@@ -1859,12 +1864,15 @@ static struct omap_clk omap2420_clks[] = {
 	CLK(NULL,	"gpios_ick",	&gpios_ick,	CK_242X),
 	CLK(NULL,	"gpios_fck",	&gpios_fck,	CK_242X),
 	CLK("omap_wdt",	"ick",	&mpu_wdt_ick,	CK_242X),
+	CLK(NULL,	"mpu_wdt_ick",	&mpu_wdt_ick,	CK_242X),
 	CLK(NULL,	"mpu_wdt_fck",	&mpu_wdt_fck,	CK_242X),
 	CLK(NULL,	"sync_32k_ick",	&sync_32k_ick,	CK_242X),
 	CLK(NULL,	"wdt1_ick",	&wdt1_ick,	CK_242X),
 	CLK(NULL,	"omapctrl_ick",	&omapctrl_ick,	CK_242X),
 	CLK("omap24xxcam",	"fck",	&cam_fck,	CK_242X),
+	CLK(NULL,	"cam_fck",	&cam_fck,	CK_242X),
 	CLK("omap24xxcam",	"ick",	&cam_ick,	CK_242X),
+	CLK(NULL,	"cam_ick",	&cam_ick,	CK_242X),
 	CLK(NULL,	"mailboxes_ick",	&mailboxes_ick,	CK_242X),
 	CLK(NULL,	"wdt4_ick",	&wdt4_ick,	CK_242X),
 	CLK(NULL,	"wdt4_fck",	&wdt4_fck,	CK_242X),
@@ -1873,16 +1881,22 @@ static struct omap_clk
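To illustrate why the changelog above asks for extra table entries whose name matches the clock itself, here is a small userspace model of a device/connection-name lookup. It is a sketch under assumptions, loosely modeled on clkdev-style matching and not taken from the kernel: struct demo_clk_lookup, demo_table and demo_clk_find() are hypothetical, and integer IDs stand in for clock pointers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of a device/connection-name lookup entry. */
struct demo_clk_lookup {
	const char *dev_id;	/* NULL acts as a wildcard */
	const char *con_id;	/* NULL acts as a wildcard */
	int clk;		/* stand-in for a clock pointer */
};

/* First entry is device-specific; the second is a name-only alias. */
const struct demo_clk_lookup demo_table[] = {
	{ "omap-mcbsp.1", "ick",        101 },
	{ NULL,           "mcbsp1_ick", 101 },
	{ NULL,           "mcbsp1_fck", 102 },
};

/*
 * Return the clk of the best-matching entry among the first n, or -1.
 * An entry with a non-NULL field only matches a request carrying the same
 * string; a device match outweighs a connection-name match.
 */
int demo_clk_find(const struct demo_clk_lookup *tbl, size_t n,
		  const char *dev_id, const char *con_id)
{
	int best = -1, best_score = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int score = 0;

		if (tbl[i].dev_id) {
			if (!dev_id || strcmp(tbl[i].dev_id, dev_id))
				continue;
			score += 2;
		}
		if (tbl[i].con_id) {
			if (!con_id || strcmp(tbl[i].con_id, con_id))
				continue;
			score += 1;
		}
		if (score > best_score) {
			best = tbl[i].clk;
			best_score = score;
		}
	}
	return best;
}
```

In this model, a lookup by bare clock name with no device (the way the changelog describes hwmod using clk_get()) only succeeds once the name-only alias is present; the device-specific "ick" entry alone does not satisfy it.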
Re: [PATCH v2] PM / Runtime: let rpm_resume() succeed if RPM_ACTIVE, even when disabled
On Saturday, September 22, 2012, Alan Stern wrote:

On Sat, 22 Sep 2012, Rafael J. Wysocki wrote:

On Saturday, September 22, 2012, Kevin Hilman wrote:

OK, this looks good to me, thanks!

Alan, what do you think?

Rafael

--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -509,6 +509,9 @@ static int rpm_resume(struct device *dev, int rpmflags)
  repeat:
 	if (dev->power.runtime_error)
 		retval = -EINVAL;
+	else if (dev->power.disable_depth == 1 && dev->power.is_suspended &&
+		dev->power.runtime_status == RPM_ACTIVE)
+		retval = 1;
 	else if (dev->power.disable_depth > 0)
 		retval = -EACCES;
 	if (retval)

Well, I'd prefer the indentation on the continuation line to be different from the indentation of the following line, and I'd prefer to have a comment explaining the reason for the exception. But these are only matters of taste; the implementation itself looks good.

Thanks!

I've applied the patch as v3.7 material (and fixed up the white space).

Rafael
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote: On Fri, 21 Sep 2012, Paul E. McKenney wrote: Could you please point me to a recipe for creating a minimal userspace? Just in case it is the userspac erather than the architecture/hardware that makes the difference. Tony's suggestion is pretty good. Note that there may also be differences in kernel timers -- either between x86 and ARM architectures, or loaded device drivers -- that may confound the problem. For example, there must be at least one RCU callback outstanding after the boot sequence quiets down. Of course, the last time I tried Tony's approach, I was doing it on top of my -rcu stack, so am retrying on v3.6-rc6. Just to make sure I understand the combinations: o All stalls have happened when running a minimal userspace. o CONFIG_NO_HZ=n suppresses the stalls. o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has no observable effect on the stalls. Did I get that right, or am I missing a combination? That's correct. Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is cancelled upon exit from idle, and therefore should (almost) never actually execute. Its sole purpose is to wake up the CPU. ;-) Right. Just curious, what would wake up the kernel from idle to handle a grace period expiration when CONFIG_RCU_FAST_NO_HZ=n? On a very idle system, the time between timer ticks could potentially be several tens of seconds. If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted to shut off the scheduling-clock tick, so any CPU with RCU callbacks will be awakened every jiffy. The problem is that there appears to be a way to get an RCU grace period started without any CPU having any callbacks, which, as you surmise, would result in all the CPUs going to sleep and the grace period never ending. So if a CPU is awakened for any reason after this everlasting grace period has extended for more than a minute, the first thing that CPU will do is print an RCU CPU stall warning. 
I believe that I see how to prevent callback-free grace periods from ever starting. (Famous last words...) Thanx, Paul
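The hang condition described in this message can be written down as a small user-space model. This is only a sketch: struct cpu_state, may_stop_tick() and gp_may_hang() are hypothetical names and not kernel code. It assumes the CONFIG_RCU_FAST_NO_HZ=n rule stated above, namely that a CPU may stop its scheduling-clock tick only when it has no RCU callbacks queued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU state: only the callback queue length matters here. */
struct cpu_state {
	long qlen;	/* number of RCU callbacks queued on this CPU */
};

/*
 * Rule from the message above (CONFIG_RCU_FAST_NO_HZ=n): a CPU with
 * callbacks keeps its tick; a CPU without callbacks may stop it.
 */
static int may_stop_tick(const struct cpu_state *cpu)
{
	return cpu->qlen == 0;
}

/*
 * A grace period can hang when it is in progress and every CPU is
 * allowed to stop its tick, because then no CPU is left to drive the
 * grace period to completion.
 */
static int gp_may_hang(const struct cpu_state *cpus, size_t n,
		       int gp_in_progress)
{
	size_t i;

	assert(cpus != NULL || n == 0);
	if (!gp_in_progress)
		return 0;
	for (i = 0; i < n; i++)
		if (!may_stop_tick(&cpus[i]))
			return 0;
	return 1;
}
```

With two CPUs that both have qlen == 0 and a grace period in progress, gp_may_hang() returns 1; a single queued callback on either CPU makes it return 0.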
RE: Initialising omapfb on AM3517 issues
Hi Tomi,

Thank you for those hints... I now have the LCD displaying the bootup logo, so it’s a good start.

How do I push the patch for the new panel?

    /* HannStar HSD043i9W1 */
    {
        {
            .x_res          = 480,
            .y_res          = 272,
            .pixel_clock    = 9000,
            .hsw            = 41,
            .hfp            = 2,
            .hbp            = 2,
            .vsw            = 10,
            .vfp            = 2,
            .vbp            = 2,
            .vsync_level    = OMAPDSS_SIG_ACTIVE_HIGH,
            .hsync_level    = OMAPDSS_SIG_ACTIVE_HIGH,
            .data_pclk_edge = OMAPDSS_DRIVE_SIG_RISING_EDGE,
            .de_level       = OMAPDSS_SIG_ACTIVE_HIGH,
            .sync_pclk_edge = OMAPDSS_DRIVE_SIG_OPPOSITE_EDGES,
        },
        .name = "hannstar_hsd043i9w1",
    },

Regards
Marc

-----Original Message-----
From: Tomi Valkeinen [mailto:tomi.valkei...@ti.com]
Sent: 19 September 2012 08:24
To: Marc Murphy
Cc: 'linux-omap@vger.kernel.org'
Subject: Re: Initialising omapfb on AM3517 issues

Hi,

On Tue, 2012-09-18 at 16:41 +, Marc Murphy wrote:

Hello all,

I have been moving from the ti 2.6.37 BSP to the 3.x kernel with quite a bit of success; the main issue I have at the moment is trying to get the frame buffer and any displays I have initialised.

    [2.805358] omapfb omapfb: no driver for display: lcd
    [2.810729] omapfb omapfb: no displays
    [2.814666] omapfb omapfb: failed to setup omapfb

I have tried a few versions of release and none of them will initialise; currently on

    [0.00] Linux version 3.6.0-rc3

I have started with the board-am3517evm display config and even that doesn't initialise. Is there something I am missing with the configs, or is there a patch required to get the feature to work?

My current config options use:

    #
    # Graphics support
    #
    CONFIG_DRM=y

omapdrm and omapfb cannot be used at the same time. That said, you don't seem to enable omapdrm, only the core drm support, so it shouldn't matter. But you don't need CONFIG_DRM if you use omapfb.

    CONFIG_FB=y
    CONFIG_FB_CFB_FILLRECT=y
    CONFIG_FB_CFB_COPYAREA=y
    CONFIG_FB_CFB_IMAGEBLIT=y
    #
    # Frame buffer hardware drivers
    #
    CONFIG_OMAP2_VRAM=y
    CONFIG_OMAP2_VRFB=y
    CONFIG_OMAP2_DSS=y
    CONFIG_OMAP2_VRAM_SIZE=12
    CONFIG_OMAP2_DSS_DPI=y
    CONFIG_OMAP2_DSS_VENC=y
    CONFIG_OMAP2_DSS_DSI=y
    CONFIG_OMAP2_DSS_MIN_FCK_PER_PCK=1
    CONFIG_OMAP2_DSS_SLEEP_AFTER_VENC_RESET=y
    CONFIG_FB_OMAP2=y
    CONFIG_FB_OMAP2_NUM_FBS=3
    #
    # OMAP2/3 Display Device Drivers
    #
    CONFIG_PANEL_GENERIC_DPI=y
    CONFIG_PANEL_SHARP_LS037V7DW01=y
    CONFIG_BACKLIGHT_LCD_SUPPORT=y
    CONFIG_LCD_CLASS_DEVICE=y
    CONFIG_BACKLIGHT_CLASS_DEVICE=y
    CONFIG_BACKLIGHT_GENERIC=y

And the init structs are

    static int am3517_evm_panel_enable_lcd(struct omap_dss_device *dssdev)
    {
        gpio_set_value(TAM3517_DVI_PON_GPIO, 0);
        gpio_set_value(TAM3517_LCD_ENVDD_GPIO, 0);
        gpio_set_value(TAM3517_LCD_PON_GPIO, 1);
        printk("LCD voltage on\n");
        return 0;
    }

    static void am3517_evm_panel_disable_lcd(struct omap_dss_device *dssdev)
    {
        gpio_set_value(TAM3517_LCD_ENVDD_GPIO, 1);
        gpio_set_value(TAM3517_LCD_PON_GPIO, 0);
    }

    static struct panel_generic_dpi_data lcd_panel = {
        //.name = "generic_dpi_panel",

You need to define a name for the panel you have. You can see the list of supported panels in drivers/video/omap2/displays/panel-generic-dpi.c. If you don't give a name, the panel driver doesn't start.

There's also a problem with the vdds_dsi regulator. Search the list for "[PATCH] OMAPDSS: Do not require a VDDS_DSI regulator on am35xx". The patch to fix it hasn't been merged yet.

Tomi
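Tomi's point about the missing panel name can be illustrated with a stand-alone sketch. This is not the code from panel-generic-dpi.c: struct panel_config, panel_table and find_panel() are hypothetical, and the table holds only the HannStar resolution and pixel clock quoted earlier in this message. It shows why a lookup keyed on the name cannot succeed when .name is left commented out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one entry of a panel table. */
struct panel_config {
	const char *name;
	unsigned int x_res;
	unsigned int y_res;
	unsigned int pixel_clock;	/* value as given in the thread */
};

/* Values copied from the HannStar entry quoted in this thread. */
static const struct panel_config panel_table[] = {
	{ "hannstar_hsd043i9w1", 480, 272, 9000 },
};

/*
 * Return the table entry whose name matches, or NULL when there is
 * none.  A NULL name (the commented-out .name case) never matches.
 */
static const struct panel_config *find_panel(const char *name)
{
	size_t i;

	if (name == NULL)
		return NULL;

	for (i = 0; i < sizeof(panel_table) / sizeof(panel_table[0]); i++) {
		assert(panel_table[i].name != NULL);
		if (strcmp(panel_table[i].name, name) == 0)
			return &panel_table[i];
	}
	return NULL;
}
```

In this model find_panel(NULL) returns NULL, which corresponds to the "no driver for display: lcd" outcome, while find_panel("hannstar_hsd043i9w1") returns the 480x272 entry.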
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, Sep 22, 2012 at 06:16:15PM +, Paul Walmsley wrote: Hi Paul On Fri, 21 Sep 2012, Paul E. McKenney wrote: I am wondering if your system somehow figured out how to start a grace period that had no RCU callbacks waiting for it. If that happened, then a CONFIG_NO_HZ=y system could in theory get into a state where all CPUs are in dyntick-idle mode, so that none of them is doing anything to force the grace period to complete. That should be easy to diagnose, anyway. Please see below, which includes the earlier diagnostic patch. Here you go. - Paul [ 248.902618] INFO: rcu_sched self-detected stall on CPU [ 248.905456] 0: (1 ticks this GP) idle=933/1/0 [ 248.907897] (t=26570 jiffies g=11 c=10 q=0) Bingo!!! (q=0, in case you were wondering. And thank you for testing this!) Strangely enough, I believe that I have inadvertently fixed this in my -rcu tree: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next Nevertheless, if you get a chance to try it, I would be interested to hear if my guess is correct. The trick is that a kthread drives the grace period in -rcu, regardless of whether or not there are callbacks. However, the backport would not be something that -stable would be happy with, so I will be putting together a fix for mainline. This thing has been in the kernel since about 2004, not sure why you didn't hit it earlier. 
Thanx, Paul

[ 248.910339] [<c001bc90>] (unwind_backtrace+0x0/0xf0) from [<c00ad800>] (rcu_check_callbacks+0x220/0x714)
[ 248.915527] [<c00ad800>] (rcu_check_callbacks+0x220/0x714) from [<c00532a0>] (update_process_times+0x38/0x68)
[ 248.920928] [<c00532a0>] (update_process_times+0x38/0x68) from [<c008c9e8>] (tick_sched_timer+0x80/0xec)
[ 248.926116] [<c008c9e8>] (tick_sched_timer+0x80/0xec) from [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0)
[ 248.930999] [<c0068ed4>] (__run_hrtimer+0x7c/0x1e0) from [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0)
[ 248.936035] [<c0069cb8>] (hrtimer_interrupt+0x11c/0x2d0) from [<c001a3cc>] (twd_handler+0x30/0x44)
[ 248.940948] [<c001a3cc>] (twd_handler+0x30/0x44) from [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c)
[ 248.946075] [<c00a7bd0>] (handle_percpu_devid_irq+0x90/0x13c) from [<c00a4344>] (generic_handle_irq+0x30/0x48)
[ 248.951538] [<c00a4344>] (generic_handle_irq+0x30/0x48) from [<c0014e38>] (handle_IRQ+0x4c/0xac)
[ 248.956329] [<c0014e38>] (handle_IRQ+0x4c/0xac) from [<c00084cc>] (gic_handle_irq+0x28/0x5c)
[ 248.960937] [<c00084cc>] (gic_handle_irq+0x28/0x5c) from [<c04fb1a4>] (__irq_svc+0x44/0x5c)
[ 248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
[ 248.968231] 9f40: 0003b832 0001
[ 248.972686] 9f60: c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 411fc092
[ 248.977142] 9f80: c074bfe8 0001 c0729fa0 0003b833 c0015130 2113
[ 248.981597] [<c04fb1a4>] (__irq_svc+0x44/0x5c) from [<c0015130>] (default_idle+0x20/0x44)
[ 248.986083] [<c0015130>] (default_idle+0x20/0x44) from [<c001535c>] (cpu_idle+0x9c/0x114)
[ 248.990539] [<c001535c>] (cpu_idle+0x9c/0x114) from [<c06d77b0>] (start_kernel+0x2b4/0x304)
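For anyone scanning boot logs for the same signature, the summary line "(t=26570 jiffies g=11 c=10 q=0)" from the stall report can be picked apart with a few lines of C. This is a sketch only: struct stall_summary and both functions are hypothetical helpers, and the reading of t, g and c in the comments is an assumption on my part. The thread itself only confirms that q=0 is the telling field.

```c
#include <assert.h>
#include <stdio.h>

/* Fields of the "(t=... jiffies g=... c=... q=...)" summary line. */
struct stall_summary {
	unsigned long t;	/* assumed: jiffies the grace period has run */
	unsigned long g;	/* assumed: number of the current grace period */
	unsigned long c;	/* assumed: number of the last completed one */
	unsigned long q;	/* callbacks queued (q=0 is the case at issue) */
};

/* Parse one summary line; returns 0 on success, -1 if it does not match. */
static int parse_stall_summary(const char *line, struct stall_summary *s)
{
	assert(line != NULL && s != NULL);
	if (sscanf(line, " (t=%lu jiffies g=%lu c=%lu q=%lu)",
		   &s->t, &s->g, &s->c, &s->q) != 4)
		return -1;
	return 0;
}

/* True when a grace period is outstanding yet no callback waits on it. */
static int is_callback_free_stall(const struct stall_summary *s)
{
	return s->g != s->c && s->q == 0;
}
```

The caller is expected to strip the "[ 248.907897]" timestamp prefix before passing the line in.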
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:

On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

Could you please point me to a recipe for creating a minimal userspace? Just in case it is the userspace rather than the architecture/hardware that makes the difference.

Tony's suggestion is pretty good. Note that there may also be differences in kernel timers -- either between x86 and ARM architectures, or loaded device drivers -- that may confound the problem. For example, there must be at least one RCU callback outstanding after the boot sequence quiets down.

Of course, the last time I tried Tony's approach, I was doing it on top of my -rcu stack, so am retrying on v3.6-rc6.

Just to make sure I understand the combinations:

o All stalls have happened when running a minimal userspace.
o CONFIG_NO_HZ=n suppresses the stalls.
o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has no observable effect on the stalls.

Did I get that right, or am I missing a combination?

That's correct.

Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is cancelled upon exit from idle, and therefore should (almost) never actually execute. Its sole purpose is to wake up the CPU. ;-)

Right. Just curious, what would wake up the kernel from idle to handle a grace period expiration when CONFIG_RCU_FAST_NO_HZ=n? On a very idle system, the time between timer ticks could potentially be several tens of seconds.

If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted to shut off the scheduling-clock tick, so any CPU with RCU callbacks will be awakened every jiffy. The problem is that there appears to be a way to get an RCU grace period started without any CPU having any callbacks, which, as you surmise, would result in all the CPUs going to sleep and the grace period never ending.
So if a CPU is awakened for any reason after this everlasting grace period has extended for more than a minute, the first thing that CPU will do is print an RCU CPU stall warning. I believe that I see how to prevent callback-free grace periods from ever starting. (Famous last words...) And here is a patch. I am still having trouble reproducing the problem, but figured that I should avoid serializing things. Thanx, Paul b/kernel/rcutree.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) rcu: Fix day-one dyntick-idle stall-warning bug Each grace period is supposed to have at least one callback waiting for that grace period to complete. However, if CONFIG_NO_HZ=n, an extra callback-free grace period is no big problem -- it will chew up a tiny bit of CPU time, but it will complete normally. In contrast, CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to sleep indefinitely, in turn indefinitely delaying completion of the callback-free grace period. Given that nothing is waiting on this grace period, this is also not a problem. Unless RCU CPU stall warnings are also enabled, as they are in recent kernels. In this case, if a CPU wakes up after at least one minute of inactivity, an RCU CPU stall warning will result. The reason that no one noticed until quite recently is that most systems have enough OS noise that they will never remain absolutely idle for a full minute. But there are some embedded systems with cut-down userspace configurations that get into this mode quite easily. All this begs the question of exactly how a callback-free grace period gets started in the first place. This can happen due to the fact that CPUs do not necessarily agree on which grace period is in progress. If a CPU still believes that the grace period that just completed is still ongoing, it will believe that it has callbacks that need to wait for another grace period, never mind the fact that the grace period that they were waiting for just completed. 
This CPU can therefore erroneously decide to start a new grace period. Once this CPU notices that the earlier grace period completed, it will invoke its callbacks. It then won't have any callbacks left. If no other CPU has any callbacks, we now have a callback-free grace period. This commit therefore makes CPUs check more carefully before starting a new grace period. This new check relies on an array of tail pointers into each CPU's list of callbacks. If the CPU is up to date on which grace periods have completed, it checks to see if any callbacks follow the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks follow the RCU_WAIT_TAIL segment. The reason that this works is that the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment as soon as the CPU figures out that the old grace period has ended. This change
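The tail-pointer check described in the patch text above can be modelled in user space. The following is a simplified sketch, not the kernel's implementation: it uses three segments instead of the kernel's full set, all names (cpu_cbs, cbs_*, needs_gp_old, needs_gp_new) are hypothetical, and needs_gp_old() is my assumption of what a less careful check looks like, included only for contrast. It leaves out the separate test for a grace period already being in progress.

```c
#include <assert.h>
#include <stddef.h>

/* Segment indices, named after the ones in the patch description. */
enum { DONE_TAIL, WAIT_TAIL, NEXT_TAIL, NUM_TAILS };

struct cb {
	struct cb *next;
};

/* Hypothetical, simplified model of one CPU's callback bookkeeping. */
struct cpu_cbs {
	struct cb *list;		/* head of the single callback list */
	struct cb **tail[NUM_TAILS];	/* tail[i]: link that ends segment i */
	unsigned long completed;	/* last grace period this CPU saw end */
};

static void cbs_init(struct cpu_cbs *c, unsigned long completed)
{
	int i;

	c->list = NULL;
	for (i = 0; i < NUM_TAILS; i++)
		c->tail[i] = &c->list;
	c->completed = completed;
}

/* New callbacks always go on the end of the list (the NEXT segment). */
static void cbs_enqueue(struct cpu_cbs *c, struct cb *cb)
{
	cb->next = NULL;
	*c->tail[NEXT_TAIL] = cb;
	c->tail[NEXT_TAIL] = &cb->next;
}

/* A grace period starts: everything queued so far now waits on it. */
static void cbs_gp_started(struct cpu_cbs *c)
{
	c->tail[WAIT_TAIL] = c->tail[NEXT_TAIL];
}

/* The CPU notices that a grace period ended: WAIT callbacks become DONE. */
static void cbs_note_gp_end(struct cpu_cbs *c, unsigned long completed)
{
	c->tail[DONE_TAIL] = c->tail[WAIT_TAIL];
	c->completed = completed;
}

/* Assumed less careful check: anything after the DONE segment asks for a GP. */
static int needs_gp_old(const struct cpu_cbs *c)
{
	assert(c != NULL);
	return *c->tail[DONE_TAIL] != NULL;
}

/* Check as described: choose the segment by whether the CPU is up to date. */
static int needs_gp_new(const struct cpu_cbs *c, unsigned long global_completed)
{
	int seg = (c->completed == global_completed) ? DONE_TAIL : WAIT_TAIL;

	assert(c != NULL);
	return *c->tail[seg] != NULL;
}
```

Queue one callback, call cbs_gp_started(), and let the global completed count advance while the CPU's own copy lags: needs_gp_old() then asks for a new grace period although the callback's grace period is already over, whereas needs_gp_new() does not, both before and after cbs_note_gp_end().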
Re: rcu self-detected stall messages on OMAP3, 4 boards
Hi Paul On Sat, 22 Sep 2012, Paul E. McKenney wrote: Strangely enough, I believe that I have inadvertently fixed this in my -rcu tree: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next Nevertheless, if you get a chance to try it, I would be interested to hear if my guess is correct. Yes, good news: the stall warnings go away with that branch. The trick is that a kthread drives the grace period in -rcu, regardless of whether or not there are callbacks. This is rcu: Move quiescent-state forcing into kthread ? Added some debugging into rcu_gp_kthread() after that commit and can confirm that the quiescent-state forcing loop does start a few times when there are zero callbacks pending (modulo any races in my measurement code). However, the backport would not be something that -stable would be happy with, so I will be putting together a fix for mainline. This thing has been in the kernel since about 2004, not sure why you didn't hit it earlier. One other data point in that regard - noticed the warnings don't appear when the board is booted with: commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 Author: Paul E. McKenney paul.mcken...@linaro.org Date: Tue Jun 5 15:53:53 2012 -0700 rcu: Fix qlen_lazy breakage ... - Paul
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, 22 Sep 2012, Paul E. McKenney wrote: And here is a patch. I am still having trouble reproducing the problem, but figured that I should avoid serializing things. Thanks, testing this now on v3.6-rc6. One question though about the patch description: All this begs the question of exactly how a callback-free grace period gets started in the first place. This can happen due to the fact that CPUs do not necessarily agree on which grace period is in progress. If a CPU still believes that the grace period that just completed is still ongoing, it will believe that it has callbacks that need to wait for another grace period, never mind the fact that the grace period that they were waiting for just completed. This CPU can therefore erroneously decide to start a new grace period. Doesn't this imply that this bug would only affect multi-CPU systems? The recent tests here have been on Pandaboard, which is dual-CPU, but my recollection is that I also observed the warnings on a single-core Beagleboard. Will re-test. - Paul
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote: On Sat, 22 Sep 2012, Paul E. McKenney wrote: And here is a patch. I am still having trouble reproducing the problem, but figured that I should avoid serializing things. Thanks, testing this now on v3.6-rc6. Very cool, thank you! One question though about the patch description: All this begs the question of exactly how a callback-free grace period gets started in the first place. This can happen due to the fact that CPUs do not necessarily agree on which grace period is in progress. If a CPU still believes that the grace period that just completed is still ongoing, it will believe that it has callbacks that need to wait for another grace period, never mind the fact that the grace period that they were waiting for just completed. This CPU can therefore erroneously decide to start a new grace period. Doesn't this imply that this bug would only affect multi-CPU systems? Surprisingly not, at least when running TREE_RCU or TREE_PREEMPT_RCU. In order to keep lock contention down to a dull roar on larger systems, TREE_RCU keeps three sets of books: (1) the global state in the rcu_state structure, (2) the combining-tree per-node state in the rcu_node structure, and (3) the per-CPU state in the rcu_data structure. A CPU is not officially aware of the end of a grace period until it is reflected in its rcu_data structure. This has the perhaps-surprising consequence that the CPU that detected the end of the old grace period might start a new one before becoming officially aware that the old one ended. Why not have the CPU inform itself immediately upon noticing that the old grace period ended? Deadlock. The rcu_node locks must be acquired from leaf towards root, and the CPU is holding the root rcu_node lock when it notices that the grace period has ended.
I have made this a bit less problematic in the bigrt branch, working towards a goal of getting RCU into a state where automatic formal validation might one day be possible. And yes, I am starting to get some formal-validation people interested in this lofty goal, see for example: http://sites.google.com/site/popl13grace/paper.pdf. The recent tests here have been on Pandaboard, which is dual-CPU, but my recollection is that I also observed the warnings on a single-core Beagleboard. Will re-test. Anxiously awaiting the results. This has been a strange one, even by RCU's standards. Plus I need to add a few Reported-by lines. Next version... Thanx, Paul
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote: Hi Paul On Sat, 22 Sep 2012, Paul E. McKenney wrote: Strangely enough, I believe that I have inadvertently fixed this in my -rcu tree: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next Nevertheless, if you get a chance to try it, I would be interested to hear if my guess is correct. Yes, good news: the stall warnings go away with that branch. Very good! The trick is that a kthread drives the grace period in -rcu, regardless of whether or not there are callbacks. This is rcu: Move quiescent-state forcing into kthread ? Yep, plus the preceding commits moving grace-period initialization and cleanup into that same kthread. This was motivated by a bug report last February complaining about 200-microsecond latency spikes from RCU grace-period initialization. On systems with 4096 CPUs. Real-time response. It is far bigger than I thought. ;-) Added some debugging into rcu_gp_kthread() after that commit and can confirm that the quiescent-state forcing loop does start a few times when there are zero callbacks pending (modulo any races in my measurement code). Cool, thank you! Assuming it works, that indicates that there is long-term value to the fix for this problem. On larger systems, extra grace periods are not what you want, as their expense increases with the number of CPUs. However, the backport would not be something that -stable would be happy with, so I will be putting together a fix for mainline. This thing has been in the kernel since about 2004, not sure why you didn't hit it earlier. One other data point in that regard - noticed the warnings don't appear when the board is booted with: commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367 Author: Paul E. McKenney paul.mcken...@linaro.org Date: Tue Jun 5 15:53:53 2012 -0700 rcu: Fix qlen_lazy breakage You lost me on this one. This is already in mainline, so if you were using (say) 3.6-rc6, you would already have this commit applied. 
Thanx, Paul
Re: rcu self-detected stall messages on OMAP3, 4 boards
Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

And here is a patch. I am still having trouble reproducing the problem, but figured that I should avoid serializing things.

Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:

http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2

Booted that, and the stall warnings did not appear within 30 minutes.

To confirm that the problem being solved matched your hypothesis, the debugging patch below[1] was added to the RCU idle entry/exit code. Without the bugfix patch, a boot log transcript was obtained indicating that the idle loop was entered with tick_nohz_enabled=1 during a grace period with no callbacks present:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the boot. ENTER was pressed about 464 seconds in; this triggered the rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that indicated that the condition under test never occurred after waiting about 20 minutes:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's appreciated, and it's been quite instructive as well. Will address some remaining loose ends in follow-up E-mails.

- Paul

[1] Debugging patch to printk() if the previous idle loop entry occurred with tick_nohz_enabled=1 during a grace period with no RCU callbacks present:

---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@ void rcu_idle_enter(void)
 	unsigned long flags;
 	long long oldval;
 	struct rcu_dynticks *rdtp;
+	int cpu;
+	long totqlen = 0;
+	struct rcu_data *rdp;
 
 	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_sched_data);
 	rdtp = &__get_cpu_var(rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
 	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@ void rcu_idle_enter(void)
 	else
 		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 	rcu_idle_enter_common(rdtp, oldval);
+	if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
+		for_each_possible_cpu(cpu)
+			totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
+		if (totqlen == 0)
+			no_cbs_idle_entry_count = 1;
+	}
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@ void rcu_idle_exit(void)
 		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 	rcu_idle_exit_common(rdtp, oldval);
 	local_irq_restore(flags);
+	if (no_cbs_idle_entry_count) {
+		no_cbs_idle_entry_count = 0;
+		pr_err("* Tickless idle was entered with zero RCU callbacks\n");
+	}
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
-- 
1.7.10.4
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sun, Sep 23, 2012 at 01:42:10AM +, Paul Walmsley wrote:

Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

And here is a patch. I am still having trouble reproducing the problem, but figured that I should avoid serializing things.

Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:

http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2

Booted that, and the stall warnings did not appear within 30 minutes.

Very cool, thank you for your testing efforts!!! May I apply your Tested-by to this patch?

And good show on the debugging patch -- it is quite good to have such solid evidence that the bug that the fix was intended for was actually occurring.

Thanx, Paul

To confirm that the problem being solved matched your hypothesis, the debugging patch below[1] was added to the RCU idle entry/exit code. Without the bugfix patch, a boot log transcript was obtained indicating that the idle loop was entered with tick_nohz_enabled=1 during a grace period with no callbacks present:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the boot. ENTER was pressed about 464 seconds in; this triggered the rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that indicated that the condition under test never occurred after waiting about 20 minutes:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's appreciated, and it's been quite instructive as well. Will address some remaining loose ends in follow-up E-mails.

- Paul

[1] Debugging patch to printk() if the previous idle loop entry occurred with tick_nohz_enabled=1 during a grace period with no RCU callbacks present:

---
 kernel/rcutree.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@ void rcu_idle_enter(void)
 	unsigned long flags;
 	long long oldval;
 	struct rcu_dynticks *rdtp;
+	int cpu;
+	long totqlen = 0;
+	struct rcu_data *rdp;
 
 	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_sched_data);
 	rdtp = &__get_cpu_var(rcu_dynticks);
 	oldval = rdtp->dynticks_nesting;
 	WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@ void rcu_idle_enter(void)
 	else
 		rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
 	rcu_idle_enter_common(rdtp, oldval);
+	if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
+		for_each_possible_cpu(cpu)
+			totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
+		if (totqlen == 0)
+			no_cbs_idle_entry_count = 1;
+	}
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@ void rcu_idle_exit(void)
 		rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
 	rcu_idle_exit_common(rdtp, oldval);
 	local_irq_restore(flags);
+	if (no_cbs_idle_entry_count) {
+		no_cbs_idle_entry_count = 0;
+		pr_err("* Tickless idle was entered with zero RCU callbacks\n");
+	}
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
-- 
1.7.10.4
Re: rcu self-detected stall messages on OMAP3, 4 boards
On Sat, 22 Sep 2012, Paul E. McKenney wrote: Very cool, thank you for your testing efforts!!! You're welcome. May I apply your Tested-by to this patch? Please do: Tested-by: Paul Walmsley p...@pwsan.com # OMAP4430 Am testing on OMAP3730 (single-core) now. - Paul