[PATCH] ARM: OMAP4+: wakeupgen: fix memory corruption

2012-09-22 Thread Paul Walmsley

Fix a memory corruption bug caused by commit
247c445c0fbd52c77e497ff5bfcf0dceb8afea8d (ARM: OMAP5: Add the
WakeupGen IP updates) and commit
ec2c0825ca3183a646a24717966cc7752e8b0393 (ARM: OMAP2+: Remove
hardcoded IRQs and enable SPARSE_IRQ).

The first commit, in the OMAP4+ wakeupgen code, has an implicit
dependency on !SPARSE_IRQ.  It allocates a static array with NR_IRQS
elements, then proceeds to iterate over 128 or 160 elements of
that array, clearing them to zero.

The second commit switched OMAP2+ to use sparse IRQs, but missed the
NR_IRQS reference in the wakeupgen code.  Before the second commit,
NR_IRQS was 474 on OMAP4430; but afterwards, it became 16.

This resulted in the wakeupgen code allocating a 16 element array, and
then attempting to write to 128 or 160 of those elements, depending on the
type of SoC.  This trashed a chunk of whatever was allocated after the
array.

The immediate manifestation was a set of boot warnings similar to the
following:

   WARNING: at arch/arm/mach-omap2/omap_hwmod.c:1941 _enable+0x1bc/0x204()
   omap_hwmod: mpu: could not enable clockdomain mpuss_clkdm: -22
   ...

since it blew away arch_clkdm.  Ultimately the kernel crashed during boot.

Fix the problem in the OMAP4+ wakeupgen code by removing the reference to
NR_IRQS, allocating a larger array, and warning if the iteration is larger
than the array.

Signed-off-by: Paul Walmsley p...@pwsan.com
Cc: Tony Lindgren t...@atomide.com
Cc: Santosh Shilimkar santosh.shilim...@ti.com
---
Applies on arm-soc omap/cleanup-sparseirq and should ideally be merged 
there before the 3.7 merge window.

Test logs are here:

   http://www.pwsan.com/omap/testlogs/broken_sparseirq_fix_3.7/20120922012656/

 arch/arm/mach-omap2/omap-wakeupgen.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mach-omap2/omap-wakeupgen.c b/arch/arm/mach-omap2/omap-wakeupgen.c
index b54427d..869f16c 100644
--- a/arch/arm/mach-omap2/omap-wakeupgen.c
+++ b/arch/arm/mach-omap2/omap-wakeupgen.c
@@ -47,7 +47,7 @@
 static void __iomem *wakeupgen_base;
 static void __iomem *sar_base;
 static DEFINE_SPINLOCK(wakeupgen_lock);
-static unsigned int irq_target_cpu[NR_IRQS];
+static unsigned int irq_target_cpu[MAX_IRQS];
 static unsigned int irq_banks = MAX_NR_REG_BANKS;
 static unsigned int max_irqs = MAX_IRQS;
 static unsigned int omap_secure_apis;
@@ -446,6 +446,12 @@ int __init omap_wakeupgen_init(void)
 	 * GIC code has necessary hooks in place.
 	 */
 
+	/*
+	 * If you see this warning, then the subsequent loop just
+	 * corrupted some memory
+	 */
+	WARN_ON(max_irqs > ARRAY_SIZE(irq_target_cpu));
+
 	/* Associate all the IRQs to boot CPU like GIC init does. */
 	for (i = 0; i < max_irqs; i++)
 		irq_target_cpu[i] = boot_cpu;
-- 
1.7.10.4
--
To unsubscribe from this list: send the line unsubscribe linux-omap in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
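
The failure mode described in the patch above can be sketched in standalone C.
This is only an illustration, not the kernel code: TABLE_SIZE and
fill_targets() are hypothetical names, and unlike the patch (which warns and
then still runs the loop) this sketch refuses to write when the iteration
count exceeds the array size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MAX_IRQS; the real value is not taken from the patch. */
#define TABLE_SIZE 160
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static unsigned int irq_target_cpu[TABLE_SIZE];

/*
 * Point every IRQ slot at boot_cpu.
 * Returns the number of slots written, or -1 (writing nothing) when
 * max_irqs is larger than the array, which is the condition that the
 * WARN_ON() in the patch reports.
 */
static int fill_targets(unsigned int max_irqs, unsigned int boot_cpu)
{
	size_t i;

	if (max_irqs > ARRAY_SIZE(irq_target_cpu))
		return -1;

	for (i = 0; i < max_irqs; i++)
		irq_target_cpu[i] = boot_cpu;

	return (int)max_irqs;
}
```

With a 16-element array and 128 iterations, as in the bug report, the check
would fail instead of overwriting whatever the linker placed after the array.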


Re: [PATCH v2] PM / Runtime: let rpm_resume() succeed if RPM_ACTIVE, even when disabled

2012-09-22 Thread Rafael J. Wysocki
On Saturday, September 22, 2012, Kevin Hilman wrote:
 From: Kevin Hilman khil...@ti.com
 
 There are several drivers where the return value of
 pm_runtime_get_sync() is used to decide whether or not it is safe to
 access hardware and that don't provide .suspend() callbacks for system
 suspend (but may use late/noirq callbacks.)  If such a driver happens
 to call pm_runtime_get_sync() during system suspend, after the core
 has disabled runtime PM, it will get the error code and will decide
 that the hardware should not be accessed, although this may be a wrong
 conclusion, depending on the state of the device when runtime PM was
 disabled.
 
 Drivers might work around this problem by using a test like:
 
    ret = pm_runtime_get_sync(dev);
    if (!ret || (ret == -EACCES && driver_private_data(dev)->suspended)) {
       /* access hardware */
    }
 
 where driver_private_data(dev)->suspended is a flag set by the
 driver's .suspend() method (that would have to be added for this
 purpose).  However, that potentially would need to be done by multiple
 drivers which means quite a lot of duplicated code and bloat.
 
 To avoid that we can use the observation that the core sets
 dev->power.is_suspended before disabling runtime PM and use that
 instead of the driver's private flag.  Still, potentially many drivers
 would need to repeat that same check in quite a few places, so it's
 better to let the core do it.
 
 Then we can be a bit smarter and check whether or not runtime PM was
 disabled by the core only (disable_depth == 1) or by someone else in
 addition to the core (disable_depth > 1).  In the former case
 rpm_resume() can return 1 if the runtime PM status is RPM_ACTIVE,
 because it means the device was active when the core disabled runtime
 PM.  In the latter case it should still return -EACCES, because it
 isn't clear why runtime PM has been disabled.
 
 Tested on AM3730/Beagle-xM where a wakeup IRQ firing during the late
 suspend phase triggers runtime PM activity in the I2C driver since the
 wakeup IRQ is on an I2C-connected PMIC.
 
 Cc: Rafael J. Wysocki r...@sisk.pl
 Cc: Alan Stern st...@rowland.harvard.edu
 Signed-off-by: Kevin Hilman khil...@ti.com
 ---
 v2: 
 - major changelog rewrite, based largely on input from Rafael 
 - add check for disable_depth == 1 and move to separate if statement,
   both suggested by Alan Stern

OK, this looks good to me, thanks!

Alan, what do you think?

Rafael


  drivers/base/power/runtime.c |3 +++
  1 file changed, 3 insertions(+)
 
 diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
 index 7d9c1cb..d43856b 100644
 --- a/drivers/base/power/runtime.c
 +++ b/drivers/base/power/runtime.c
 @@ -509,6 +509,9 @@ static int rpm_resume(struct device *dev, int rpmflags)
   repeat:
  	if (dev->power.runtime_error)
  		retval = -EINVAL;
 +	else if (dev->power.disable_depth == 1 && dev->power.is_suspended
 +		&& dev->power.runtime_status == RPM_ACTIVE)
 +		retval = 1;
  	else if (dev->power.disable_depth > 0)
  		retval = -EACCES;
  	if (retval)
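
The order of the checks in the hunk above can be modelled in standalone C.
This is a rough sketch rather than the kernel function: the function name
rpm_resume_precheck() and its scalar parameters are hypothetical, and the enum
only mirrors the runtime PM status names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative copy of the runtime PM status names. */
enum rpm_status { RPM_ACTIVE, RPM_RESUMING, RPM_SUSPENDED, RPM_SUSPENDING };

/*
 * Model of the early-exit checks at the top of rpm_resume():
 *   -EINVAL  a runtime error is pending
 *   1        only the core disabled runtime PM (depth == 1) during system
 *            suspend and the device was active at that point
 *   -EACCES  runtime PM is disabled for any other reason
 *   0        no early exit, the normal resume path would continue
 */
static int rpm_resume_precheck(int runtime_error, unsigned int disable_depth,
			       bool is_suspended, enum rpm_status status)
{
	if (runtime_error)
		return -EINVAL;
	if (disable_depth == 1 && is_suspended && status == RPM_ACTIVE)
		return 1;
	if (disable_depth > 0)
		return -EACCES;
	return 0;
}
```

A driver that treats a non-negative return from pm_runtime_get_sync() as "safe
to touch the hardware" would then proceed in the depth == 1, active case and
still back off when someone else has also disabled runtime PM.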
 



Re: [PATCH] ARM: OMAP4+: wakeupgen: fix memory corruption

2012-09-22 Thread Shilimkar, Santosh
Paul,

On Sat, Sep 22, 2012 at 1:41 PM, Paul Walmsley p...@pwsan.com wrote:


 Fix a memory corruption bug caused by commit
 247c445c0fbd52c77e497ff5bfcf0dceb8afea8d (ARM: OMAP5: Add the
 WakeupGen IP updates) and commit
 ec2c0825ca3183a646a24717966cc7752e8b0393 (ARM: OMAP2+: Remove
 hardcoded IRQs and enable SPARSE_IRQ).

 The first commit, in the OMAP4+ wakeupgen code, has an implicit
 dependency on !SPARSE_IRQ.  It allocates a static array with NR_IRQS
 elements, then proceeds to iterate over 128 or 160 elements of
 that array, clearing them to zero.

 The second commit switched OMAP2+ to use sparse IRQs, but missed the
 NR_IRQS reference in the wakeupgen code.  Before the second commit,
 NR_IRQS was 474 on OMAP4430; but afterwards, it became 16.

 This resulted in the wakeupgen code allocating a 16 element array, and
 then attempting to write to 128 or 160 of those elements, depending on the
 type of SoC.  This trashed a chunk of whatever was allocated after the
 array.

 The immediate manifestation was a set of boot warnings similar to the
 following:

WARNING: at arch/arm/mach-omap2/omap_hwmod.c:1941 _enable+0x1bc/0x204()
omap_hwmod: mpu: could not enable clockdomain mpuss_clkdm: -22
...

 since it blew away arch_clkdm.  Ultimately the kernel crashed during boot.

 Fix the problem in the OMAP4+ wakeupgen code by removing the reference to
 NR_IRQS, allocating a larger array, and warning if the iteration is larger
 than the array.

 Signed-off-by: Paul Walmsley p...@pwsan.com
 Cc: Tony Lindgren t...@atomide.com
 Cc: Santosh Shilimkar santosh.shilim...@ti.com
 ---
 Applies on arm-soc omap/cleanup-sparseirq and should ideally be merged
 there before the 3.7 merge window.

The issue is already fixed in mainline by commit e534e87 {ARM: OMAP4: Fix
array size for irq_target_cpu}. The fix was merged after the 3.6-rc5 tag and
hence does not appear in the 'omap/cleanup-sparseirq' branch, which seems to
be based on 3.6-rc5.

If you merge the 3.6-rc6 tag or the latest mainline with
omap/cleanup-sparseirq, the issue should go away. So from the 3.7 merge window
point of view, the fix is already in place.

Regards
Santosh


Re: [PATCH v3 03/15] dmaengine: Add flags parameter to dmaengine_prep_dma_cyclic()

2012-09-22 Thread Mark Brown
On Fri, Sep 14, 2012 at 03:05:46PM +0300, Peter Ujfalusi wrote:
 With this parameter added to dmaengine_prep_dma_cyclic() the API will be in
 sync with other dmaengine_prep_*() functions.
 The dmaengine_prep_dma_cyclic() function is primarily used by audio for the
 cyclic transfers required by ALSA; we use the flags from audio to ask DMA
 drivers to suppress interrupts (if DMA_PREP_INTERRUPT is cleared) when that
 is supported on the platform.

Are you sure this was generated against for-3.7?  There's fuzz against
dmaengine.h and git can't find the blobs to do resolution.  Anyway, I
applied this and the rest of the series.


Re: [PATCH v4 00/14] MFD/ASoC/Input: twl4030-audio submodule DT support

2012-09-22 Thread Mark Brown
On Mon, Sep 10, 2012 at 01:46:18PM +0300, Peter Ujfalusi wrote:

 Hello,

Applied all, thanks.


Re: [GIT PULL] ARM: OMAP: hwmod/PMU/PRCM patches for 3.7

2012-09-22 Thread Tony Lindgren
* Paul Walmsley p...@pwsan.com [120921 22:41]:
 
 On Fri, 21 Sep 2012, Tony Lindgren wrote:
 
  * Tony Lindgren t...@atomide.com [120921 13:55]:
  
  Care to base this on something more mergeable? Maybe a merge
  of cleanup-fixes-for-v3.7 + omap-devel-am33xx-for-v3.7?
 
 While working on this, noticed that the 4430ES2 Panda test boot failed on 
 the merge base of cleanup-fixes-for-v3.7 and omap-devel-am33xx-for-v3.7.  
 Enabling DEBUG_LL and adding some debug revealed that the static variable 
 'arch_clkdm' in mach-omap2/clockdomain.c was getting overwritten between 
 omap44xx_clockdomains_init() and the end of IRQ setup.  This was bisected 
 down to this commit:
 
 commit ec2c0825ca3183a646a24717966cc7752e8b0393
 Author: Tony Lindgren t...@atomide.com
 Date:   Mon Aug 27 17:43:01 2012 -0700
 
 ARM: OMAP2+: Remove hardcoded IRQs and enable SPARSE_IRQ
 
 Remove hardcoded IRQs in irqs.h and related files as these
 are no longer needed.
 
 ...
 
 Looks to me like something is wrong with the IRQ allocation and it's 
 corrupting memory.

Yeah I bet that's e534e871 (ARM: OMAP4: Fix array size for irq_target_cpu)
already in mainline since -rc6.

Regards,

Tony


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Frederic Weisbecker
2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:
 On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
 * Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
 
  Just to make sure I understand the combinations:
 
  o   All stalls have happened when running a minimal userspace.
  o   CONFIG_NO_HZ=n suppresses the stalls.
  o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
  no observable effect on the stalls.

 The reason why you may need minimal userspace is to cut down
 the number of timers waking up the system with NO_HZ.
 Booting with init=/bin/sh might also do the trick for that.

 Good point!  This does make for a very quiet system, but does not
 reproduce the problem under kvm, even after waiting for four minutes.
 I will leave it for more time, but it looks like I really might need to
 ask Linaro for remote access to a Panda.

I have one. I'm currently installing Ubuntu on it and I'll try to build
a kernel and reproduce the issue.

I'll give more news soon.


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sat, Sep 22, 2012 at 05:45:12PM +0200, Frederic Weisbecker wrote:
 2012/9/22 Paul E. McKenney paul...@linux.vnet.ibm.com:
  On Fri, Sep 21, 2012 at 01:31:49PM -0700, Tony Lindgren wrote:
  * Paul E. McKenney paul...@linux.vnet.ibm.com [120921 12:58]:
  
   Just to make sure I understand the combinations:
  
   o   All stalls have happened when running a minimal userspace.
   o   CONFIG_NO_HZ=n suppresses the stalls.
   o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
   no observable effect on the stalls.
 
  The reason why you may need minimal userspace is to cut down
  the number of timers waking up the system with NO_HZ.
  Booting with init=/bin/sh might also do the trick for that.
 
  Good point!  This does make for a very quiet system, but does not
  reproduce the problem under kvm, even after waiting for four minutes.
  I will leave it for more time, but it looks like I really might need to
  ask Linaro for remote access to a Panda.
 
 I have one. I'm currently installing Ubuntu on it and I'll try to build
 a kernel and reproduce the issue.
 
 I'll give more news soon.

Thank you!

My bet is that you have to have a userspace that is so small that it
registers only a few (but at least one!) RCU callback at boot time,
then never registers any callbacks ever again.  I have coded up a
crude test case, using Tony Lindgren's suggestion of init=/bin/sh,
but I appear to have inadvertently fixed this bug in current -rcu
(git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git,
branch rcu/next).

But I have been wrong a few times already on this particular bug...

Thanx, Paul



Re: [REPOST PATCH v2 1/2] spi: omap2-mcspi: add pinctrl support

2012-09-22 Thread Mark Brown
On Tue, Sep 18, 2012 at 08:01:25AM -0400, Matt Porter wrote:
 Adds pinctrl support to support OMAP platforms that boot from DT
 and rely on pinctrl support to set pinmuxes.

Applied, thanks.


Re: CPU_IDLE causes random reboots on custom 4430

2012-09-22 Thread Chris Hoffmann

On 09/22/2012 07:45 AM, Shilimkar, Santosh wrote:

On Sat, Sep 22, 2012 at 4:19 AM, Chris Hoffmann chrmhoffm...@gmail.com wrote:

Hi,

We're trying to get a custom 4430 board (aka. nook tablet with OMAP4430
ES2.3 HS TWL6030 ES2.1) working with p-android-omap-3.0 on android jelly
bean. The board works quite well, but we experience random hangs and the
watchdog kicks the board to reboot.


On the same kernel, you should have support for the persistent log. You might
want to check the output. That should give you pointers on what CPU was
doing before the freeze which resulted in reboot.


Hi,

I have some problems providing logs. If I add -DDEBUG to cpuidle44xx.o,
the problem doesn't seem to occur. It could be that the printk calls
alleviate the issue.


Also, the watchdog seems to shut down the device rather than reboot it
(or it hangs?), so I can't provide /proc/last_kmsg.


How could I provide more info?

Rgds,
Chris





Re: [GIT PULL] ARM: OMAP: hwmod/PMU/PRCM patches for 3.7

2012-09-22 Thread Paul Walmsley
On Fri, 21 Sep 2012, Tony Lindgren wrote:

 Hmm I wonder what's causing it then? There must be something
 else in tmp-merge at commit abfee61f that causes the problems.
 Maybe try to merge with that commit and see what you get?

Probably the merge with the clock patches was causing trouble.

 That commit can't be used as a base though as that's temporary
 most likely.. But we can create a base to use out of the
 branches once we know them, you can do it yourself too.

Your tmp-merge contains branch/tag merges that haven't yet gone upstream 
to arm-soc.  I don't know which of those merges you consider stable (aside 
from the upstream ones, obviously).  For this one it looks like the clock 
patches were the ones causing the merge trouble.  So since that series 
also came from me and is unmerged, will just merge the clock and hwmod 
patches into a new pull request on v3.6-rc6 + cleanup-fixes-for-v3.7 + 
omap-devel-am33xx-for-v3.7.  Hopefully that will work for you...


- Paul


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley
Hi Paul

On Fri, 21 Sep 2012, Paul E. McKenney wrote:

 I am wondering if your system somehow figured out how to start a grace
 period that had no RCU callbacks waiting for it.  If that happened,
 then a CONFIG_NO_HZ=y system could in theory get into a state where all
 CPUs are in dyntick-idle mode, so that none of them is doing anything
 to force the grace period to complete.

 That should be easy to diagnose, anyway.  Please see below, which
 includes the earlier diagnostic patch.

Here you go.
 
- Paul

[  248.902618] INFO: rcu_sched self-detected stall on CPU
[  248.905456]  0: (1 ticks this GP) idle=933/1/0 
[  248.907897]   (t=26570 jiffies g=11 c=10 q=0)
[  248.910339] [c001bc90] (unwind_backtrace+0x0/0xf0) from [c00ad800] (rcu_check_callbacks+0x220/0x714)
[  248.915527] [c00ad800] (rcu_check_callbacks+0x220/0x714) from [c00532a0] (update_process_times+0x38/0x68)
[  248.920928] [c00532a0] (update_process_times+0x38/0x68) from [c008c9e8] (tick_sched_timer+0x80/0xec)
[  248.926116] [c008c9e8] (tick_sched_timer+0x80/0xec) from [c0068ed4] (__run_hrtimer+0x7c/0x1e0)
[  248.930999] [c0068ed4] (__run_hrtimer+0x7c/0x1e0) from [c0069cb8] (hrtimer_interrupt+0x11c/0x2d0)
[  248.936035] [c0069cb8] (hrtimer_interrupt+0x11c/0x2d0) from [c001a3cc] (twd_handler+0x30/0x44)
[  248.940948] [c001a3cc] (twd_handler+0x30/0x44) from [c00a7bd0] (handle_percpu_devid_irq+0x90/0x13c)
[  248.946075] [c00a7bd0] (handle_percpu_devid_irq+0x90/0x13c) from [c00a4344] (generic_handle_irq+0x30/0x48)
[  248.951538] [c00a4344] (generic_handle_irq+0x30/0x48) from [c0014e38] (handle_IRQ+0x4c/0xac)
[  248.956329] [c0014e38] (handle_IRQ+0x4c/0xac) from [c00084cc] (gic_handle_irq+0x28/0x5c)
[  248.960937] [c00084cc] (gic_handle_irq+0x28/0x5c) from [c04fb1a4] (__irq_svc+0x44/0x5c)
[  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
[  248.968231] 9f40:   0003b832 0001
[  248.972686] 9f60:  c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8  411fc092
[  248.977142] 9f80: c074bfe8  0001 c0729fa0 0003b833 c0015130 2113
[  248.981597] [c04fb1a4] (__irq_svc+0x44/0x5c) from [c0015130] (default_idle+0x20/0x44)
[  248.986083] [c0015130] (default_idle+0x20/0x44) from [c001535c] (cpu_idle+0x9c/0x114)
[  248.990539] [c001535c] (cpu_idle+0x9c/0x114) from [c06d77b0] (start_kernel+0x2b4/0x304)


Re: OMAP baseline test results for v3.6-rc4

2012-09-22 Thread Paul Walmsley
cc Santosh

Hi Igor,

I regret the delay in responding,

On Fri, 7 Sep 2012, Igor Grinberg wrote:

 On 09/05/12 18:44, Paul Walmsley wrote:
 
  * CM-T3517: L3 in-band error with USB OTG during boot
- Cause unknown; longstanding issue; does not occur on the 3517EVM
 
 We see this problem on several cm-t3517, but not all of them.

There's probably a dependency on the bootloader or X-Loader.

 It looks something like:
 cut
 NET: Registered protocol family 16
 GPMC revision 5.0
 gpmc: irq-20 could not claim: err -22
 OMAP GPIO hardware version 2.5
 In-band Error seen by USB_OTG  at address 0

That tag USB_OTG above isn't 100% accurate for AM3517/3505, by the way.  
omap_l3_smx.c doesn't have a correct initiator map for those chips.  The 
offender could be USBOTG, but it could also be any other initiator in the 
IP subsystem, such as Camera/VPFE or EMAC.  Table 5-18 InitiatorID 
Definition in the AM35x TRM vB (SPRUGR0B) lists these.

As far as I know, the message means that some module in the IPSS tried to 
initiate an L3 interconnect transaction, but that it failed.  Probably the 
IPSS isn't clocked.

 ------------[ cut here ]------------
 WARNING: at /home/lifshitz/workroot/git-repo/linux-cm-t3x/arch/arm/mach-omap2/omap_l3_smx.c:162 omap3_l3_app_irq+0xdc/0x120()
 Modules linked in:
 [c001ad08] (unwind_backtrace+0x0/0xf4) from [c003f670] (warn_slowpath_common+0x4c/0x64)
 [c003f670] (warn_slowpath_common+0x4c/0x64) from [c003f6a4] (warn_slowpath_null+0x1c/0x24)
 [c003f6a4] (warn_slowpath_null+0x1c/0x24) from [c0033af0] (omap3_l3_app_irq+0xdc/0x120)
 [c0033af0] (omap3_l3_app_irq+0xdc/0x120) from [c008b8bc] (handle_irq_event_percpu+0xac/0x298)
 [c008b8bc] (handle_irq_event_percpu+0xac/0x298) from [c008bafc] (handle_irq_event+0x54/0x74)
 [c008bafc] (handle_irq_event+0x54/0x74) from [c008e290] (handle_level_irq+0xc4/0x118)
 [c008e290] (handle_level_irq+0xc4/0x118) from [c008b3ac] (generic_handle_irq+0x2c/0x44)
 [c008b3ac] (generic_handle_irq+0x2c/0x44) from [c001500c] (handle_IRQ+0x60/0x80)
 [c001500c] (handle_IRQ+0x60/0x80) from [c00085ec] (omap3_intc_handle_irq+0x60/0x74)
 [c00085ec] (omap3_intc_handle_irq+0x60/0x74) from [c04e3100] (__irq_svc+0x40/0x74)
 Exception stack(0xcf02de00 to 0xcf02de48)
 de00:  000a  0021 c074bcac cf046280 000a 6013
 de20: c074bcdc c070020c 0001   cf02de48  c008c988
 de40: 4013 
 [c04e3100] (__irq_svc+0x40/0x74) from [c008c988] (__setup_irq+0x2a8/0x404)
 [c008c988] (__setup_irq+0x2a8/0x404) from [c008cd18] (request_threaded_irq+0xe8/0x13c)
 [c008cd18] (request_threaded_irq+0xe8/0x13c) from [c06c3d24] (omap3_l3_probe+0x10c/0x16c)
 [c06c3d24] (omap3_l3_probe+0x10c/0x16c) from [c033586c] (platform_drv_probe+0x18/0x1c)
 [c033586c] (platform_drv_probe+0x18/0x1c) from [c0334414] (really_probe+0xac/0x1c8)
 [c0334414] (really_probe+0xac/0x1c8) from [c0334578] (driver_probe_device+0x48/0x60)
 [c0334578] (driver_probe_device+0x48/0x60) from [c03345f0] (__driver_attach+0x60/0x84)
 [c03345f0] (__driver_attach+0x60/0x84) from [c0332ce0] (bus_for_each_dev+0x4c/0x80)
 [c0332ce0] (bus_for_each_dev+0x4c/0x80) from [c0333414] (bus_add_driver+0xa4/0x294)
 [c0333414] (bus_add_driver+0xa4/0x294) from [c0334bdc] (driver_register+0xa4/0x188)
 [c0334bdc] (driver_register+0xa4/0x188) from [c0335c5c] (platform_driver_probe+0x18/0x98)
 [c0335c5c] (platform_driver_probe+0x18/0x98) from [c0008798] (do_one_initcall+0xac/0x16c)
 [c0008798] (do_one_initcall+0xac/0x16c) from [c06b52ac] (do_basic_setup+0x88/0xc0)
 [c06b52ac] (do_basic_setup+0x88/0xc0) from [c06b53c4] (kernel_init+0x60/0xfc)
 [c06b53c4] (kernel_init+0x60/0xfc) from [c00150a4] (kernel_thread_exit+0x0/0x8)
 ---[ end trace 1b75b31a2719ed1c ]---
 -cut---
 
 After that, the board continues to function properly.
 Any hints how to debug this?

Probably the core problem is that we don't yet have the IPSS correctly 
supported in the AM35xx hwmod data.  This is partially due to the fact 
that we're missing hierarchical enables/disables in that code, a 
longstanding omission.  My guess is that if you hacked in some code to 
enable the IPSS early in boot (see the CONTROL_IPSS_CLK_CTRL register), 
the problem would probably go away.


- Paul


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley
On Fri, 21 Sep 2012, Paul E. McKenney wrote:

 Could you please point me to a recipe for creating a minimal userspace?
 Just in case it is the userspace rather than the architecture/hardware
 that makes the difference.

Tony's suggestion is pretty good.  Note that there may also be differences 
in kernel timers -- either between x86 and ARM architectures, or loaded 
device drivers -- that may confound the problem.

 Just to make sure I understand the combinations:
 
 o All stalls have happened when running a minimal userspace.
 o CONFIG_NO_HZ=n suppresses the stalls.
 o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
   no observable effect on the stalls.
 
 Did I get that right, or am I missing a combination?

That's correct.

 Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
 cancelled upon exit from idle, and therefore should (almost) never 
 actually execute. Its sole purpose is to wake up the CPU.  ;-)

Right.  Just curious, what would wake up the kernel from idle to handle a 
grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
system, the time between timer ticks could potentially be several tens of 
seconds.


- Paul


Re: [PATCH v4 2/3] ARM: omap: hwmod: get rid of all omap_clk_get_by_name usage

2012-09-22 Thread Paul Walmsley
On Wed, 29 Aug 2012, Rajendra Nayak wrote:

 Moving to Common clk framework for OMAP would mean we no longer use
 internal lookup mechanism like omap_clk_get_by_name().
 Get rid of all its usage, mostly from hwmod and omap_device
 code.
 
 Also use IS_ERR_OR_NULL() for error checking.
 
 Moving to clk_get() also means the respective platforms
 need the clkdev tables updated with an entry for all clocks
 used by hwmod to have clock name same as the alias.
 
 Based on original changes from Mike Turquette.
 
 Signed-off-by: Rajendra Nayak rna...@ti.com

Updated this one to include several missing clock aliases and to fix a bug 
it introduced with omap_96m_alwon_fck_3630; modified patch below.


- Paul


From: Rajendra Nayak rna...@ti.com
Date: Sat, 22 Sep 2012 02:24:16 -0600
Subject: [PATCH] ARM: OMAP2+: hwmod: get rid of all omap_clk_get_by_name
 usage

Moving to Common clk framework for OMAP would mean we no longer use
internal lookup mechanism like omap_clk_get_by_name().
Get rid of all its usage, mostly from hwmod and omap_device
code.

Moving to clk_get() also means the respective platforms
need the clkdev tables updated with an entry for all clocks
used by hwmod to have clock name same as the alias.

Based on original changes from Mike Turquette.

Signed-off-by: Rajendra Nayak rna...@ti.com
Cc: Russell King - ARM Linux li...@arm.linux.org.uk
[p...@pwsan.com: removed IS_ERR_OR_NULL() conversion (rmk comment);
 restricted omap_96m_alwon_fck_3630 to OMAP36xx; added missing AM35xx
 clock aliases for emac_fck, emac_ick, vpfe_ick, vpfe_fck; added
 aliases rng_ick and several emulation clocks]
Signed-off-by: Paul Walmsley p...@pwsan.com
---
 arch/arm/mach-omap2/clock2420_data.c |   17 +
 arch/arm/mach-omap2/clock2430_data.c |   22 ++
 arch/arm/mach-omap2/clock3xxx_data.c |   31 +++
 arch/arm/mach-omap2/clock44xx_data.c |6 ++
 arch/arm/mach-omap2/omap_hwmod.c |   12 ++--
 arch/arm/plat-omap/clock.c   |   27 ---
 arch/arm/plat-omap/omap_device.c |4 ++--
 7 files changed, 84 insertions(+), 35 deletions(-)

diff --git a/arch/arm/mach-omap2/clock2420_data.c b/arch/arm/mach-omap2/clock2420_data.c
index 8c3bd2a..c3cde1a 100644
--- a/arch/arm/mach-omap2/clock2420_data.c
+++ b/arch/arm/mach-omap2/clock2420_data.c
@@ -1804,6 +1804,7 @@ static struct omap_clk omap2420_clks[] = {
 	CLK(NULL,	"gfx_ick",	&gfx_ick,	CK_242X),
 	/* DSS domain clocks */
 	CLK("omapdss_dss",	"ick",	&dss_ick,	CK_242X),
+	CLK(NULL,	"dss_ick",	&dss_ick,	CK_242X),
 	CLK(NULL,	"dss1_fck",	&dss1_fck,	CK_242X),
 	CLK(NULL,	"dss2_fck",	&dss2_fck,	CK_242X),
 	CLK(NULL,	"dss_54m_fck",	&dss_54m_fck,	CK_242X),
@@ -1843,12 +1844,16 @@ static struct omap_clk omap2420_clks[] = {
 	CLK(NULL,	"gpt12_ick",	&gpt12_ick,	CK_242X),
 	CLK(NULL,	"gpt12_fck",	&gpt12_fck,	CK_242X),
 	CLK("omap-mcbsp.1",	"ick",	&mcbsp1_ick,	CK_242X),
+	CLK(NULL,	"mcbsp1_ick",	&mcbsp1_ick,	CK_242X),
 	CLK(NULL,	"mcbsp1_fck",	&mcbsp1_fck,	CK_242X),
 	CLK("omap-mcbsp.2",	"ick",	&mcbsp2_ick,	CK_242X),
+	CLK(NULL,	"mcbsp2_ick",	&mcbsp2_ick,	CK_242X),
 	CLK(NULL,	"mcbsp2_fck",	&mcbsp2_fck,	CK_242X),
 	CLK("omap2_mcspi.1",	"ick",	&mcspi1_ick,	CK_242X),
+	CLK(NULL,	"mcspi1_ick",	&mcspi1_ick,	CK_242X),
 	CLK(NULL,	"mcspi1_fck",	&mcspi1_fck,	CK_242X),
 	CLK("omap2_mcspi.2",	"ick",	&mcspi2_ick,	CK_242X),
+	CLK(NULL,	"mcspi2_ick",	&mcspi2_ick,	CK_242X),
 	CLK(NULL,	"mcspi2_fck",	&mcspi2_fck,	CK_242X),
 	CLK(NULL,	"uart1_ick",	&uart1_ick,	CK_242X),
 	CLK(NULL,	"uart1_fck",	&uart1_fck,	CK_242X),
@@ -1859,12 +1864,15 @@ static struct omap_clk omap2420_clks[] = {
 	CLK(NULL,	"gpios_ick",	&gpios_ick,	CK_242X),
 	CLK(NULL,	"gpios_fck",	&gpios_fck,	CK_242X),
 	CLK("omap_wdt",	"ick",	&mpu_wdt_ick,	CK_242X),
+	CLK(NULL,	"mpu_wdt_ick",	&mpu_wdt_ick,	CK_242X),
 	CLK(NULL,	"mpu_wdt_fck",	&mpu_wdt_fck,	CK_242X),
 	CLK(NULL,	"sync_32k_ick",	&sync_32k_ick,	CK_242X),
 	CLK(NULL,	"wdt1_ick",	&wdt1_ick,	CK_242X),
 	CLK(NULL,	"omapctrl_ick",	&omapctrl_ick,	CK_242X),
 	CLK("omap24xxcam",	"fck",	&cam_fck,	CK_242X),
+	CLK(NULL,	"cam_fck",	&cam_fck,	CK_242X),
 	CLK("omap24xxcam",	"ick",	&cam_ick,	CK_242X),
+	CLK(NULL,	"cam_ick",	&cam_ick,	CK_242X),
 	CLK(NULL,	"mailboxes_ick",	&mailboxes_ick,	CK_242X),
 	CLK(NULL,	"wdt4_ick",	&wdt4_ick,	CK_242X),
 	CLK(NULL,	"wdt4_fck",	&wdt4_fck,	CK_242X),
@@ -1873,16 +1881,22 @@ static struct omap_clk 
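
Why the extra CLK(NULL, ...) alias entries are needed can be illustrated with
a small standalone model of a clkdev-style lookup. This is a simplified sketch
and not the kernel's clkdev code: struct clk_entry, clk_table and clk_lookup()
are hypothetical names, and the scoring rule (device match weighted above
connection match, entries with a non-matching non-NULL field skipped) is an
assumption about how such a lookup behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct clk { const char *name; };

/* One lookup row: optional device name, optional connection name, clock. */
struct clk_entry {
	const char *dev_id;
	const char *con_id;
	struct clk *clk;
};

static struct clk dss_ick = { "dss_ick" };
static struct clk mcbsp1_ick = { "mcbsp1_ick" };

static struct clk_entry clk_table[] = {
	{ "omapdss_dss",  "ick",     &dss_ick },
	{ NULL,           "dss_ick", &dss_ick },    /* alias usable without a device */
	{ "omap-mcbsp.1", "ick",     &mcbsp1_ick }, /* no alias row for this clock */
};

/* Return the best-scoring entry's clock, or NULL when nothing matches. */
static struct clk *clk_lookup(const char *dev_id, const char *con_id)
{
	struct clk *best = NULL;
	int best_score = 0;
	size_t i;

	for (i = 0; i < sizeof(clk_table) / sizeof(clk_table[0]); i++) {
		const struct clk_entry *e = &clk_table[i];
		int score = 0;

		if (e->dev_id) {
			if (!dev_id || strcmp(e->dev_id, dev_id) != 0)
				continue;
			score += 2;
		}
		if (e->con_id) {
			if (!con_id || strcmp(e->con_id, con_id) != 0)
				continue;
			score += 1;
		}
		if (score > best_score) {
			best = e->clk;
			best_score = score;
		}
	}
	return best;
}
```

In this model a caller that only knows the clock name, as hwmod does after the
switch to clk_get(), finds "dss_ick" through the alias row but gets NULL for
"mcbsp1_ick" until a matching alias row is added.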

Re: [PATCH v2] PM / Runtime: let rpm_resume() succeed if RPM_ACTIVE, even when disabled

2012-09-22 Thread Rafael J. Wysocki
On Saturday, September 22, 2012, Alan Stern wrote:
 On Sat, 22 Sep 2012, Rafael J. Wysocki wrote:
 
  On Saturday, September 22, 2012, Kevin Hilman wrote:
 
  OK, this looks good to me, thanks!
  
  Alan, what do you think?
  
  Rafael
 
   --- a/drivers/base/power/runtime.c
   +++ b/drivers/base/power/runtime.c
   @@ -509,6 +509,9 @@ static int rpm_resume(struct device *dev, int rpmflags)
     repeat:
    	if (dev->power.runtime_error)
    		retval = -EINVAL;
   +	else if (dev->power.disable_depth == 1 && dev->power.is_suspended
   +		&& dev->power.runtime_status == RPM_ACTIVE)
   +		retval = 1;
    	else if (dev->power.disable_depth > 0)
    		retval = -EACCES;
    	if (retval)
 
 Well, I'd prefer the indentation on the continuation line to be 
 different from the indentation of the following line, and I'd prefer 
 to have a comment explaining the reason for the exception.
 
 But these are only matters of taste; the implementation itself looks 
 good.

Thanks!

I've applied the patch as v3.7 material (and fixed up the white space).

Rafael


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
 On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
  Could you please point me to a recipe for creating a minimal userspace?
  Just in case it is the userspace rather than the architecture/hardware
  that makes the difference.
 
 Tony's suggestion is pretty good.  Note that there may also be differences 
 in kernel timers -- either between x86 and ARM architectures, or loaded 
 device drivers -- that may confound the problem.

For example, there must be at least one RCU callback outstanding after
the boot sequence quiets down.  Of course, the last time I tried Tony's
approach, I was doing it on top of my -rcu stack, so am retrying on
v3.6-rc6.

  Just to make sure I understand the combinations:
  
  o   All stalls have happened when running a minimal userspace.
  o   CONFIG_NO_HZ=n suppresses the stalls.
  o   CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
  no observable effect on the stalls.
  
  Did I get that right, or am I missing a combination?
 
 That's correct.
 
  Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
  cancelled upon exit from idle, and therefore should (almost) never 
  actually execute. Its sole purpose is to wake up the CPU.  ;-)
 
 Right.  Just curious, what would wake up the kernel from idle to handle a 
 grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
 system, the time between timer ticks could potentially be several tens of 
 seconds.

If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
be awakened every jiffy.  The problem is that there appears to be a way
to get an RCU grace period started without any CPU having any callbacks,
which, as you surmise, would result in all the CPUs going to sleep and
the grace period never ending.  So if a CPU is awakened for any reason
after this everlasting grace period has extended for more than a minute,
the first thing that CPU will do is print an RCU CPU stall warning.

I believe that I see how to prevent callback-free grace periods from
ever starting.  (Famous last words...)

Thanx, Paul
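
The reasoning above can be modelled in a few lines of userspace C. This is a rough sketch, not kernel code: the names are hypothetical, and the 60-second timeout is an assumption taken from the "more than a minute" wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_STALL_TIMEOUT_SECS 60UL	/* assumed from "more than a minute" */

struct cpu_sketch {
	long qlen;	/* RCU callbacks queued on this CPU */
};

/* With CONFIG_RCU_FAST_NO_HZ=n, only a CPU holding callbacks keeps its tick. */
bool cpu_keeps_tick(const struct cpu_sketch *c)
{
	assert(c != NULL);
	return c->qlen > 0;
}

/* True when no CPU is left ticking to push a grace period forward. */
bool all_cpus_may_sleep(const struct cpu_sketch *cpus, int ncpus)
{
	for (int i = 0; i < ncpus; i++)
		if (cpu_keeps_tick(&cpus[i]))
			return false;
	return true;
}

/* What a CPU concludes when it finally wakes up, for whatever reason. */
bool stall_warning_due(bool gp_in_progress, unsigned long gp_start,
		       unsigned long now, unsigned long hz)
{
	return gp_in_progress &&
	       (now - gp_start) > SKETCH_STALL_TIMEOUT_SECS * hz;
}
```

A grace period that starts while all_cpus_may_sleep() is true has nothing driving it, so the next wakeup after the timeout reports a stall.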



RE: Initialising omapfb on AM3517 issues

2012-09-22 Thread Marc Murphy
Hi Tomi,
Thank you for those hints... I now have the LCD displaying the bootup logo, so 
it’s a good start.

How do I push the patch for the new panel?

/* HannStar HSD043i9W1 */
{
	{
		.x_res          = 480,
		.y_res          = 272,

		.pixel_clock    = 9000,

		.hsw            = 41,
		.hfp            = 2,
		.hbp            = 2,

		.vsw            = 10,
		.vfp            = 2,
		.vbp            = 2,

		.vsync_level    = OMAPDSS_SIG_ACTIVE_HIGH,
		.hsync_level    = OMAPDSS_SIG_ACTIVE_HIGH,
		.data_pclk_edge = OMAPDSS_DRIVE_SIG_RISING_EDGE,
		.de_level       = OMAPDSS_SIG_ACTIVE_HIGH,
		.sync_pclk_edge = OMAPDSS_DRIVE_SIG_OPPOSITE_EDGES,
	},
	.name                   = "hannstar_hsd043i9w1",
},
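
As a sanity check on the timings above, the resulting refresh rate can be worked out with a small userspace helper. This is a sketch under one assumption: that pixel_clock is expressed in kHz. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the timing fields used above. */
struct timings_sketch {
	unsigned x_res, y_res;
	unsigned pixel_clock_khz;	/* assumption: pixel_clock is in kHz */
	unsigned hsw, hfp, hbp;
	unsigned vsw, vfp, vbp;
};

/* Refresh rate in millihertz: pixel clock / (total width * total height). */
unsigned long long refresh_millihertz(const struct timings_sketch *t)
{
	unsigned long long htotal, vtotal;

	assert(t != NULL);
	htotal = (unsigned long long)t->x_res + t->hsw + t->hfp + t->hbp;
	vtotal = (unsigned long long)t->y_res + t->vsw + t->vfp + t->vbp;
	assert(htotal > 0 && vtotal > 0);

	/* kHz * 1000000 gives millihertz. */
	return (t->pixel_clock_khz * 1000000ULL) / (htotal * vtotal);
}
```

With the values above the totals are 525 x 286 = 150150 pixel clocks per frame, so a 9000 kHz clock gives 59940 mHz, or roughly 59.94 Hz.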

Regards
Marc


-Original Message-
From: Tomi Valkeinen [mailto:tomi.valkei...@ti.com] 
Sent: 19 September 2012 08:24
To: Marc Murphy
Cc: 'linux-omap@vger.kernel.org'
Subject: Re: Initialising omapfb on AM3517 issues

Hi,

On Tue, 2012-09-18 at 16:41 +, Marc Murphy wrote:
 Hello all,
 I have been moving from the TI 2.6.37 BSP to the 3.x kernel with quite a bit 
 of success. The main issue I have at the moment is getting the frame 
 buffer and my displays initialised.
 
 [2.805358] omapfb omapfb: no driver for display: lcd
 [2.810729] omapfb omapfb: no displays
 [2.814666] omapfb omapfb: failed to setup omapfb
 
 I have tried a few release versions and none of them will 
 initialise.
 
 Currently on
 [0.00] Linux version 3.6.0-rc3
 
 I have started with the board-am3517evm display config and even that doesn't 
 initialise.  Is there something I am missing in the configs, or is a 
 patch required to get the feature to work?
 
 My current config options are:
 #
 # Graphics support
 #
 CONFIG_DRM=y

omapdrm and omapfb cannot be used at the same time. That said, you don't seem 
to enable omapdrm, only the core drm support, so it shouldn't matter. But you 
don't need CONFIG_DRM if you use omapfb.

 CONFIG_FB=y
 CONFIG_FB_CFB_FILLRECT=y
 CONFIG_FB_CFB_COPYAREA=y
 CONFIG_FB_CFB_IMAGEBLIT=y
 
 #
 # Frame buffer hardware drivers
 #
 CONFIG_OMAP2_VRAM=y
 CONFIG_OMAP2_VRFB=y
 CONFIG_OMAP2_DSS=y
 CONFIG_OMAP2_VRAM_SIZE=12
 CONFIG_OMAP2_DSS_DPI=y
 CONFIG_OMAP2_DSS_VENC=y
 CONFIG_OMAP2_DSS_DSI=y
 CONFIG_OMAP2_DSS_MIN_FCK_PER_PCK=1
 CONFIG_OMAP2_DSS_SLEEP_AFTER_VENC_RESET=y
 CONFIG_FB_OMAP2=y
 CONFIG_FB_OMAP2_NUM_FBS=3
 
 #
 # OMAP2/3 Display Device Drivers
 #
 CONFIG_PANEL_GENERIC_DPI=y
 CONFIG_PANEL_SHARP_LS037V7DW01=y
 CONFIG_BACKLIGHT_LCD_SUPPORT=y
 CONFIG_LCD_CLASS_DEVICE=y
 CONFIG_BACKLIGHT_CLASS_DEVICE=y
 CONFIG_BACKLIGHT_GENERIC=y
 
 And the init structs are
 static int am3517_evm_panel_enable_lcd(struct omap_dss_device *dssdev)
 {
 	gpio_set_value(TAM3517_DVI_PON_GPIO, 0);
 	gpio_set_value(TAM3517_LCD_ENVDD_GPIO, 0);
 	gpio_set_value(TAM3517_LCD_PON_GPIO, 1);
 	printk("LCD voltage on\n");
 	return 0;
 }
 
 static void am3517_evm_panel_disable_lcd(struct omap_dss_device *dssdev)
 {
 	gpio_set_value(TAM3517_LCD_ENVDD_GPIO, 1);
 	gpio_set_value(TAM3517_LCD_PON_GPIO, 0);
 }
 
 static struct panel_generic_dpi_data lcd_panel = {
 	//.name   = "generic_dpi_panel",

You need to define name for the panel you have. You can see the list of 
supported panels in drivers/video/omap2/displays/panel-generic-dpi.c. If you 
don't give a name, the panel driver doesn't start.
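
Why a missing name stops the panel from starting can be pictured as a lookup by name over a table of known panels. The sketch below is illustrative only: the table, its second entry and the function name are hypothetical, and it does not reproduce the actual code in panel-generic-dpi.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down panel description. */
struct panel_config_sketch {
	const char *name;
	unsigned x_res, y_res;
};

/* Illustrative table; "example_panel" is a made-up entry. */
static const struct panel_config_sketch sketch_panels[] = {
	{ "hannstar_hsd043i9w1", 480, 272 },
	{ "example_panel",       800, 480 },
};

/* Returns the matching entry, or NULL when no name is given or none matches. */
const struct panel_config_sketch *find_panel(const char *name)
{
	size_t n = sizeof(sketch_panels) / sizeof(sketch_panels[0]);

	if (name == NULL)
		return NULL;	/* nothing to match against */

	for (size_t i = 0; i < n; i++) {
		assert(sketch_panels[i].name != NULL);
		if (strcmp(sketch_panels[i].name, name) == 0)
			return &sketch_panels[i];
	}
	return NULL;
}
```

A caller that gets NULL back has no timings to program, which corresponds to the "no driver for display" failure reported earlier in the thread.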

There's also a problem with the vdds_dsi regulator. Search the list for 
"[PATCH] OMAPDSS: Do not require a VDDS_DSI regulator on am35xx". The patch to 
fix it hasn't been merged yet.

 Tomi


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sat, Sep 22, 2012 at 06:16:15PM +, Paul Walmsley wrote:
 Hi Paul
 
 On Fri, 21 Sep 2012, Paul E. McKenney wrote:
 
  I am wondering if your system somehow figured out how to start a grace
  period that had no RCU callbacks waiting for it.  If that happened,
  then a CONFIG_NO_HZ=y system could in theory get into a state where all
  CPUs are in dyntick-idle mode, so that none of them is doing anything
  to force the grace period to complete.
 
  That should be easy to diagnose, anyway.  Please see below, which
  includes the earlier diagnostic patch.
 
 Here you go.
 
 - Paul
 
 [  248.902618] INFO: rcu_sched self-detected stall on CPU
 [  248.905456]  0: (1 ticks this GP) idle=933/1/0 
 [  248.907897]   (t=26570 jiffies g=11 c=10 q=0)

Bingo!!!  (q=0, in case you were wondering.  And thank you for testing this!)
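
For anyone wanting to scan boot logs for this signature, the summary line can be picked apart with sscanf(). The sketch assumes that g and c are the started and completed grace-period numbers and that q is the total number of queued callbacks, which is how the q=0 remark above reads. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Fields of the "(t=... jiffies g=... c=... q=...)" summary. */
struct stall_info_sketch {
	unsigned long t;	/* jiffies since the grace period started */
	unsigned long g;	/* assumed: number of the GP that was started */
	unsigned long c;	/* assumed: number of the last completed GP */
	unsigned long q;	/* assumed: total callbacks queued on all CPUs */
};

/* Parses the summary with the "[ timestamp]" prefix already removed. */
bool parse_stall_line(const char *line, struct stall_info_sketch *out)
{
	assert(line != NULL && out != NULL);
	return sscanf(line, " (t=%lu jiffies g=%lu c=%lu q=%lu)",
		      &out->t, &out->g, &out->c, &out->q) == 4;
}

/* A grace period is in flight (g != c) yet nothing is waiting on it. */
bool callback_free_gp(const struct stall_info_sketch *s)
{
	assert(s != NULL);
	return s->g != s->c && s->q == 0;
}
```

Applied to the line quoted above, g=11 with c=10 means a grace period is in progress, and q=0 means no callback is waiting for it.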

Strangely enough, I believe that I have inadvertently fixed this in
my -rcu tree:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next

Nevertheless, if you get a chance to try it, I would be interested to
hear if my guess is correct.  The trick is that a kthread drives the
grace period in -rcu, regardless of whether or not there are callbacks.

However, the backport would not be something that -stable would be happy
with, so I will be putting together a fix for mainline.  This thing
has been in the kernel since about 2004, not sure why you didn't hit
it earlier.

Thanx, Paul

 [  248.910339] [c001bc90] (unwind_backtrace+0x0/0xf0) from [c00ad800] 
 (rcu_check_callbacks+0x220/0x714)
 [  248.915527] [c00ad800] (rcu_check_callbacks+0x220/0x714) from 
 [c00532a0] (update_process_times+0x38/0x68)
 [  248.920928] [c00532a0] (update_process_times+0x38/0x68) from 
 [c008c9e8] (tick_sched_timer+0x80/0xec)
 [  248.926116] [c008c9e8] (tick_sched_timer+0x80/0xec) from [c0068ed4] 
 (__run_hrtimer+0x7c/0x1e0)
 [  248.930999] [c0068ed4] (__run_hrtimer+0x7c/0x1e0) from [c0069cb8] 
 (hrtimer_interrupt+0x11c/0x2d0)
 [  248.936035] [c0069cb8] (hrtimer_interrupt+0x11c/0x2d0) from [c001a3cc] 
 (twd_handler+0x30/0x44)
 [  248.940948] [c001a3cc] (twd_handler+0x30/0x44) from [c00a7bd0] 
 (handle_percpu_devid_irq+0x90/0x13c)
 [  248.946075] [c00a7bd0] (handle_percpu_devid_irq+0x90/0x13c) from 
 [c00a4344] (generic_handle_irq+0x30/0x48)
 [  248.951538] [c00a4344] (generic_handle_irq+0x30/0x48) from [c0014e38] 
 (handle_IRQ+0x4c/0xac)
 [  248.956329] [c0014e38] (handle_IRQ+0x4c/0xac) from [c00084cc] 
 (gic_handle_irq+0x28/0x5c)
 [  248.960937] [c00084cc] (gic_handle_irq+0x28/0x5c) from [c04fb1a4] 
 (__irq_svc+0x44/0x5c)
 [  248.965484] Exception stack(0xc0729f58 to 0xc0729fa0)
 [  248.968231] 9f40:   
 0003b832 0001
 [  248.972686] 9f60:  c074a8e8 c0728000 c07c42c8 c05065a0 c074bdc8 
  411fc092
 [  248.977142] 9f80: c074bfe8  0001 c0729fa0 0003b833 c0015130 
 2113 
 [  248.981597] [c04fb1a4] (__irq_svc+0x44/0x5c) from [c0015130] 
 (default_idle+0x20/0x44)
 [  248.986083] [c0015130] (default_idle+0x20/0x44) from [c001535c] 
 (cpu_idle+0x9c/0x114)
 [  248.990539] [c001535c] (cpu_idle+0x9c/0x114) from [c06d77b0] 
 (start_kernel+0x2b4/0x304)
 



Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sat, Sep 22, 2012 at 01:10:43PM -0700, Paul E. McKenney wrote:
 On Sat, Sep 22, 2012 at 06:42:08PM +, Paul Walmsley wrote:
  On Fri, 21 Sep 2012, Paul E. McKenney wrote:
  
   Could you please point me to a recipe for creating a minimal userspace?
   Just in case it is the userspace rather than the architecture/hardware
   that makes the difference.
  
  Tony's suggestion is pretty good.  Note that there may also be differences 
  in kernel timers -- either between x86 and ARM architectures, or loaded 
  device drivers -- that may confound the problem.
 
 For example, there must be at least one RCU callback outstanding after
 the boot sequence quiets down.  Of course, the last time I tried Tony's
 approach, I was doing it on top of my -rcu stack, so am retrying on
 v3.6-rc6.
 
   Just to make sure I understand the combinations:
   
   o All stalls have happened when running a minimal userspace.
   o CONFIG_NO_HZ=n suppresses the stalls.
   o CONFIG_RCU_FAST_NO_HZ (which depends on CONFIG_NO_HZ=y) has
 no observable effect on the stalls.
   
   Did I get that right, or am I missing a combination?
  
  That's correct.
  
   Indeed, rcu_idle_gp_timer_func() is a bit strange in that it is 
   cancelled upon exit from idle, and therefore should (almost) never 
   actually execute. Its sole purpose is to wake up the CPU.  ;-)
  
  Right.  Just curious, what would wake up the kernel from idle to handle a 
  grace period expiration when CONFIG_RCU_FAST_NO_HZ=n?  On a very idle 
  system, the time between timer ticks could potentially be several tens of 
  seconds.
 
 If CONFIG_RCU_FAST_NO_HZ=n, then CPUs with RCU callbacks are not permitted
 to shut off the scheduling-clock tick, so any CPU with RCU callbacks will
 be awakened every jiffy.  The problem is that there appears to be a way
 to get an RCU grace period started without any CPU having any callbacks,
 which, as you surmise, would result in all the CPUs going to sleep and
 the grace period never ending.  So if a CPU is awakened for any reason
 after this everlasting grace period has extended for more than a minute,
 the first thing that CPU will do is print an RCU CPU stall warning.
 
 I believe that I see how to prevent callback-free grace periods from
 ever starting.  (Famous last words...)

And here is a patch.  I am still having trouble reproducing the problem,
but figured that I should avoid serializing things.

Thanx, Paul



 b/kernel/rcutree.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

rcu: Fix day-one dyntick-idle stall-warning bug

Each grace period is supposed to have at least one callback waiting
for that grace period to complete.  However, if CONFIG_NO_HZ=n, an
extra callback-free grace period is no big problem -- it will chew up
a tiny bit of CPU time, but it will complete normally.  In contrast,
CONFIG_NO_HZ=y kernels have the potential for all the CPUs to go to
sleep indefinitely, in turn indefinitely delaying completion of the
callback-free grace period.  Given that nothing is waiting on this grace
period, this is also not a problem.

Unless RCU CPU stall warnings are also enabled, as they are in recent
kernels.  In this case, if a CPU wakes up after at least one minute
of inactivity, an RCU CPU stall warning will result.  The reason that
no one noticed until quite recently is that most systems have enough
OS noise that they will never remain absolutely idle for a full minute.
But there are some embedded systems with cut-down userspace configurations
that get into this mode quite easily.

All this begs the question of exactly how a callback-free grace period
gets started in the first place.  This can happen due to the fact that
CPUs do not necessarily agree on which grace period is in progress.
If a CPU still believes that the grace period that just completed is
still ongoing, it will believe that it has callbacks that need to wait
for another grace period, never mind the fact that the grace period
that they were waiting for just completed.  This CPU can therefore
erroneously decide to start a new grace period.

Once this CPU notices that the earlier grace period completed, it will
invoke its callbacks.  It then won't have any callbacks left.  If no
other CPU has any callbacks, we now have a callback-free grace period.

This commit therefore makes CPUs check more carefully before starting a
new grace period.  This new check relies on an array of tail pointers
into each CPU's list of callbacks.  If the CPU is up to date on which
grace periods have completed, it checks to see if any callbacks follow
the RCU_DONE_TAIL segment, otherwise it checks to see if any callbacks
follow the RCU_WAIT_TAIL segment.  The reason that this works is that
the RCU_WAIT_TAIL segment will be promoted to the RCU_DONE_TAIL segment
as soon as the CPU figures out that the old grace period has ended.
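
The tail-pointer check described above can be illustrated with a userspace model of a segmented callback list. This is a sketch, not the patch itself: all names are hypothetical, done callbacks are never invoked or removed, and the separate "no grace period currently in progress" condition is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SK_DONE_TAIL, SK_WAIT_TAIL, SK_NEXT_READY_TAIL, SK_NEXT_TAIL, SK_NSEG };

struct cb_sketch {
	struct cb_sketch *next;
};

/* tail[] points into this struct, so it must not be copied after init. */
struct cpu_cbs_sketch {
	struct cb_sketch *list;			/* one list, split into segments */
	struct cb_sketch **tail[SK_NSEG];	/* tail[i]: link just past segment i */
	unsigned long completed;		/* last GP this CPU knows has ended */
};

void cbs_init(struct cpu_cbs_sketch *c)
{
	assert(c != NULL);
	c->list = NULL;
	for (int i = 0; i < SK_NSEG; i++)
		c->tail[i] = &c->list;
	c->completed = 0;
}

/* New callbacks always go at the very end of the list. */
void cbs_enqueue(struct cpu_cbs_sketch *c, struct cb_sketch *cb)
{
	cb->next = NULL;
	*c->tail[SK_NEXT_TAIL] = cb;
	c->tail[SK_NEXT_TAIL] = &cb->next;
}

/* Simplification: everything queued so far waits for the current GP. */
void cbs_assign_all_to_wait(struct cpu_cbs_sketch *c)
{
	c->tail[SK_WAIT_TAIL] = c->tail[SK_NEXT_TAIL];
	c->tail[SK_NEXT_READY_TAIL] = c->tail[SK_NEXT_TAIL];
}

/* The CPU notices that a GP ended: the WAIT segment is promoted to DONE. */
void cbs_note_gp_end(struct cpu_cbs_sketch *c, unsigned long global_completed)
{
	c->tail[SK_DONE_TAIL] = c->tail[SK_WAIT_TAIL];
	c->tail[SK_WAIT_TAIL] = c->tail[SK_NEXT_READY_TAIL];
	c->tail[SK_NEXT_READY_TAIL] = c->tail[SK_NEXT_TAIL];
	c->completed = global_completed;
}

/* Check without the staleness test: anything after the DONE segment? */
bool cbs_follow_done(const struct cpu_cbs_sketch *c)
{
	return *c->tail[SK_DONE_TAIL] != NULL;
}

/* Check as described above: pick the segment by whether the CPU is current. */
bool cpu_has_cbs_needing_gp(const struct cpu_cbs_sketch *c,
			    unsigned long global_completed)
{
	int seg = (c->completed == global_completed) ? SK_DONE_TAIL
						     : SK_WAIT_TAIL;
	return *c->tail[seg] != NULL;
}
```

On a CPU that has not yet noticed the end of the grace period, cbs_follow_done() still reports work although the waiting callbacks are in fact already satisfied. The segment-aware check skips past the WAIT segment in that state and reports none.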

This change 

Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley
Hi Paul

On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 Strangely enough, I believe that I have inadvertently fixed this in
 my -rcu tree:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
 
 Nevertheless, if you get a chance to try it, I would be interested to
 hear if my guess is correct.

Yes, good news: the stall warnings go away with that branch.

 The trick is that a kthread drives the grace period in -rcu, regardless 
 of whether or not there are callbacks.

This is "rcu: Move quiescent-state forcing into kthread"?

Added some debugging into rcu_gp_kthread() after that commit and can 
confirm that the quiescent-state forcing loop does start a few times when 
there are zero callbacks pending (modulo any races in my measurement 
code).

 However, the backport would not be something that -stable would be happy
 with, so I will be putting together a fix for mainline.  This thing
 has been in the kernel since about 2004, not sure why you didn't hit
 it earlier.

One other data point in that regard - noticed the warnings don't appear 
when the board is booted with:

commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
Author: Paul E. McKenney paul.mcken...@linaro.org
Date:   Tue Jun 5 15:53:53 2012 -0700

rcu: Fix qlen_lazy breakage

...


- Paul


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley
On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 And here is a patch.  I am still having trouble reproducing the problem,
 but figured that I should avoid serializing things.

Thanks, testing this now on v3.6-rc6.  One question though about the patch 
description:

 All this begs the question of exactly how a callback-free grace period
 gets started in the first place.  This can happen due to the fact that
 CPUs do not necessarily agree on which grace period is in progress.
 If a CPU still believes that the grace period that just completed is
 still ongoing, it will believe that it has callbacks that need to wait
 for another grace period, never mind the fact that the grace period
 that they were waiting for just completed.  This CPU can therefore
 erroneously decide to start a new grace period.

Doesn't this imply that this bug would only affect multi-CPU systems?  

The recent tests here have been on Pandaboard, which is dual-CPU, but my 
recollection is that I also observed the warnings on a single-core 
Beagleboard.  Will re-test.


- Paul


Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sat, Sep 22, 2012 at 10:25:59PM +, Paul Walmsley wrote:
 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  And here is a patch.  I am still having trouble reproducing the problem,
  but figured that I should avoid serializing things.
 
 Thanks, testing this now on v3.6-rc6.

Very cool, thank you!

One question though about the patch 
 description:
 
  All this begs the question of exactly how a callback-free grace period
  gets started in the first place.  This can happen due to the fact that
  CPUs do not necessarily agree on which grace period is in progress.
  If a CPU still believes that the grace period that just completed is
  still ongoing, it will believe that it has callbacks that need to wait
  for another grace period, never mind the fact that the grace period
  that they were waiting for just completed.  This CPU can therefore
  erroneously decide to start a new grace period.
 
 Doesn't this imply that this bug would only affect multi-CPU systems?  

Surprisingly not, at least when running TREE_RCU or TREE_PREEMPT_RCU.
In order to keep lock contention down to a dull roar on larger systems,
TREE_RCU keeps three sets of books: (1) the global state in the rcu_state
structure, (2) the combining-tree per-node state in the rcu_node
structure, and (3) the per-CPU state in the rcu_data structure.  A CPU is
not officially aware of the end of a grace period until it is reflected
in its rcu_data structure.  This has the perhaps-surprising consequence
that the CPU that detected the end of the old grace period might start
a new one before becoming officially aware that the old one ended.

Why not have the CPU inform itself immediately upon noticing that the
old grace period ended?  Deadlock.  The rcu_node locks must be acquired
from leaf towards root, and the CPU is holding the root rcu_node lock
when it notices that the grace period has ended.
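
The ordering constraint can be sketched as a tiny userspace lock-order checker. It assumes that a CPU updating its own per-CPU state would need a leaf-level rcu_node lock; the level numbering and all names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_NO_LOCK_HELD (-1)

/* Level 0 is the root; larger numbers are closer to the leaves. */
struct lock_ctx_sketch {
	int lowest_level_held;	/* SK_NO_LOCK_HELD when nothing is held */
};

/* Leaf-to-root rule: a new lock must be strictly nearer the root. */
bool may_acquire(const struct lock_ctx_sketch *ctx, int level)
{
	assert(ctx != NULL && level >= 0);
	if (ctx->lowest_level_held == SK_NO_LOCK_HELD)
		return true;
	return level < ctx->lowest_level_held;
}

/* Records the acquisition if the ordering rule allows it. */
bool acquire(struct lock_ctx_sketch *ctx, int level)
{
	if (!may_acquire(ctx, level))
		return false;
	ctx->lowest_level_held = level;
	return true;
}
```

Once the root (level 0) is held, may_acquire() rejects every other level, which is the situation described above: the CPU cannot go back to a leaf to update itself without first dropping the root lock.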

I have made this a bit less problematic in the bigrt branch, working
towards a goal of getting RCU into a state where automatic formal
validation might one day be possible.  And yes, I am starting to get some
formal-validation people interested in this lofty goal, see for example:
http://sites.google.com/site/popl13grace/paper.pdf.

 The recent tests here have been on Pandaboard, which is dual-CPU, but my 
 recollection is that I also observed the warnings on a single-core 
 Beagleboard.  Will re-test.

Anxiously awaiting the results.  This has been a strange one, even by
RCU's standards.

Plus I need to add a few Reported-by lines.  Next version...

Thanx, Paul



Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sat, Sep 22, 2012 at 10:20:19PM +, Paul Walmsley wrote:
 Hi Paul
 
 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  Strangely enough, I believe that I have inadvertently fixed this in
  my -rcu tree:
  
  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
  
  Nevertheless, if you get a chance to try it, I would be interested to
  hear if my guess is correct.
 
 Yes, good news: the stall warnings go away with that branch.

Very good!

  The trick is that a kthread drives the grace period in -rcu, regardless 
  of whether or not there are callbacks.
 
 This is "rcu: Move quiescent-state forcing into kthread"?

Yep, plus the preceding commits moving grace-period initialization and
cleanup into that same kthread.  This was motivated by a bug report
last February complaining about 200-microsecond latency spikes from
RCU grace-period initialization.  On systems with 4096 CPUs.

Real-time response.  It is far bigger than I thought.  ;-)

 Added some debugging into rcu_gp_kthread() after that commit and can 
 confirm that the quiescent-state forcing loop does start a few times when 
 there are zero callbacks pending (modulo any races in my measurement 
 code).

Cool, thank you!  Assuming it works, that indicates that there is long-term
value to the fix for this problem.  On larger systems, extra grace periods
are not what you want, as their expense increases with the number of CPUs.

  However, the backport would not be something that -stable would be happy
  with, so I will be putting together a fix for mainline.  This thing
  has been in the kernel since about 2004, not sure why you didn't hit
  it earlier.
 
 One other data point in that regard - noticed the warnings don't appear 
 when the board is booted with:
 
 commit 4fa3b6cb1bc8c14b81b4c8ffdfd3f2500a7e9367
 Author: Paul E. McKenney paul.mcken...@linaro.org
 Date:   Tue Jun 5 15:53:53 2012 -0700
 
 rcu: Fix qlen_lazy breakage

You lost me on this one.  This is already in mainline, so if you were
using (say) 3.6-rc6, you would already have this commit applied.

Thanx, Paul



Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley
Hi Paul

On Sat, 22 Sep 2012, Paul Walmsley wrote:

 On Sat, 22 Sep 2012, Paul E. McKenney wrote:
 
  And here is a patch.  I am still having trouble reproducing the problem,
  but figured that I should avoid serializing things.
 
 Thanks, testing this now on v3.6-rc6.

Looks like you solved it!

Tested v3.6-rc6 + your stall diagnostic patch:

http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2

on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.

Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:

http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2

Booted that, and the stall warnings did not appear within 30 minutes.

To confirm that the problem being solved matched your hypothesis, the
debugging patch below[1] was added to the RCU idle entry/exit code.

Without the bugfix patch, a boot log transcript was obtained
indicating that the idle loop was entered with tick_nohz_enabled=1
during a grace period with no callbacks present:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt

The debugging events started to appear at 1.867370 seconds into the
boot.  ENTER was pressed about 464 seconds in; this triggered the
rcu_sched stall traceback.

With the bugfix patch, a boot log transcript was obtained that
indicated that the condition under test never occurred after waiting
about 20 minutes:

http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt

Thanks for being so willing to root-cause the issue, Paul; it's 
appreciated, and it's been quite instructive as well.  Will address some 
remaining loose ends in follow-up E-mails.


- Paul


[1] Debugging patch to printk() if the previous idle loop entry occurred 
with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
present:


---
 kernel/rcutree.c |   17 +
 1 file changed, 17 insertions(+)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f1eb7ad..f42941b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,6 +60,9 @@
 
 /* Data structures. */
 
+extern int tick_nohz_enabled;
+static int no_cbs_idle_entry_count;
+
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
@@ -400,8 +403,12 @@ void rcu_idle_enter(void)
unsigned long flags;
long long oldval;
struct rcu_dynticks *rdtp;
+   int cpu;
+   long totqlen = 0;
+   struct rcu_data *rdp;
 
local_irq_save(flags);
+   rdp = &__get_cpu_var(rcu_sched_data);
rdtp = &__get_cpu_var(rcu_dynticks);
oldval = rdtp->dynticks_nesting;
WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
@@ -410,6 +417,12 @@ void rcu_idle_enter(void)
else
rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
rcu_idle_enter_common(rdtp, oldval);
+   if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
+       for_each_possible_cpu(cpu)
+           totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
+       if (totqlen == 0)
+           no_cbs_idle_entry_count = 1;
+   }
local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(rcu_idle_enter);
@@ -503,6 +516,10 @@ void rcu_idle_exit(void)
rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
rcu_idle_exit_common(rdtp, oldval);
local_irq_restore(flags);
+   if (no_cbs_idle_entry_count) {
+       no_cbs_idle_entry_count = 0;
+       pr_err("* Tickless idle was entered with zero RCU callbacks\n");
+   }
 }
 EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
-- 
1.7.10.4
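
The logic of the debugging patch can also be tried outside the kernel. The following is a userspace sketch only: the per-CPU queue lengths are passed in as a plain array, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Set on idle entry, reported and cleared on idle exit. */
static int no_cbs_idle_entry_flag;

/* Idle entry: note a grace period in flight with nothing queued anywhere. */
void idle_enter_sketch(bool nohz_enabled, bool gp_in_progress,
		       const long *qlen, int ncpus)
{
	long totqlen = 0;

	assert(qlen != NULL || ncpus == 0);
	if (nohz_enabled && gp_in_progress) {
		for (int cpu = 0; cpu < ncpus; cpu++)
			totqlen += qlen[cpu];
		if (totqlen == 0)
			no_cbs_idle_entry_flag = 1;
	}
}

/* Idle exit: report once per flagged entry; returns true if it reported. */
bool idle_exit_sketch(void)
{
	if (!no_cbs_idle_entry_flag)
		return false;
	no_cbs_idle_entry_flag = 0;
	fprintf(stderr, "Tickless idle was entered with zero RCU callbacks\n");
	return true;
}
```

Deferring the message to idle exit mirrors the patch, which only sets a flag while interrupts are disabled on entry and prints later.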



Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul E. McKenney
On Sun, Sep 23, 2012 at 01:42:10AM +, Paul Walmsley wrote:
 Hi Paul
 
 On Sat, 22 Sep 2012, Paul Walmsley wrote:
 
  On Sat, 22 Sep 2012, Paul E. McKenney wrote:
  
   And here is a patch.  I am still having trouble reproducing the problem,
   but figured that I should avoid serializing things.
  
  Thanks, testing this now on v3.6-rc6.
 
 Looks like you solved it!
 
 Tested v3.6-rc6 + your stall diagnostic patch:
 
 http://marc.info/?l=linux-arm-kernel&m=134827237215882&w=2
 
 on OMAP4430ES2 Pandaboard using omap2plus_defconfig and 
 CONFIG_RCU_CPU_STALL_INFO=y; got the stall warnings.
 
 Then added "rcu: Fix day-one dyntick-idle stall-warning bug" from:
 
 http://marc.info/?l=linux-arm-kernel&m=134835120600590&w=2
 
 Booted that, and the stall warnings did not appear within 30 minutes.

Very cool, thank you for your testing efforts!!!

May I apply your Tested-by to this patch?

And good show on the debugging patch -- it is quite good to have such
solid evidence that the bug that the fix was intended for was actually
occurring.

Thanx, Paul

 To confirm that the problem being solved matched your hypothesis, the
 debugging patch below[1] was added to the RCU idle entry/exit code.
 
 Without the bugfix patch, a boot log transcript was obtained
 indicating that the idle loop was entered with tick_nohz_enabled=1
 during a grace period with no callbacks present:
 
 http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-pre-fix.txt
 
 The debugging events started to appear at 1.867370 seconds into the
 boot.  ENTER was pressed about 464 seconds in; this triggered the
 rcu_sched stall traceback.
 
 With the bugfix patch, a boot log transcript was obtained that
 indicated that the condition under test never occurred after waiting
 about 20 minutes:
 
 
 http://www.pwsan.com/omap/transcripts/20120922-rcu-stall-debug-post-fix.txt
 
 Thanks for being so willing to root-cause the issue, Paul; it's 
 appreciated, and it's been quite instructive as well.  Will address some 
 remaining loose ends in follow-up E-mails.
 
 
 - Paul
 
 
 [1] Debugging patch to printk() if the previous idle loop entry occurred 
 with tick_nohz_enabled=1 during a grace period with no RCU callbacks 
 present:
 
 
 ---
  kernel/rcutree.c |   17 +
  1 file changed, 17 insertions(+)
 
 diff --git a/kernel/rcutree.c b/kernel/rcutree.c
 index f1eb7ad..f42941b 100644
 --- a/kernel/rcutree.c
 +++ b/kernel/rcutree.c
 @@ -60,6 +60,9 @@
 
  /* Data structures. */
 
 +extern int tick_nohz_enabled;
 +static int no_cbs_idle_entry_count;
 +
  static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 
  #define RCU_STATE_INITIALIZER(sname, cr) { \
 @@ -400,8 +403,12 @@ void rcu_idle_enter(void)
   unsigned long flags;
   long long oldval;
   struct rcu_dynticks *rdtp;
 + int cpu;
 + long totqlen = 0;
 + struct rcu_data *rdp;
 
   local_irq_save(flags);
 + rdp = &__get_cpu_var(rcu_sched_data);
   rdtp = &__get_cpu_var(rcu_dynticks);
   oldval = rdtp->dynticks_nesting;
   WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
 @@ -410,6 +417,12 @@ void rcu_idle_enter(void)
   else
   rdtp->dynticks_nesting -= DYNTICK_TASK_NEST_VALUE;
   rcu_idle_enter_common(rdtp, oldval);
 + if (tick_nohz_enabled && rcu_gp_in_progress(rdp->rsp)) {
 +     for_each_possible_cpu(cpu)
 +         totqlen += per_cpu_ptr(rdp->rsp->rda, cpu)->qlen;
 +     if (totqlen == 0)
 +         no_cbs_idle_entry_count = 1;
 + }
   local_irq_restore(flags);
  }
  EXPORT_SYMBOL_GPL(rcu_idle_enter);
 @@ -503,6 +516,10 @@ void rcu_idle_exit(void)
   rdtp->dynticks_nesting = DYNTICK_TASK_EXIT_IDLE;
   rcu_idle_exit_common(rdtp, oldval);
   local_irq_restore(flags);
 + if (no_cbs_idle_entry_count) {
 +     no_cbs_idle_entry_count = 0;
 +     pr_err("* Tickless idle was entered with zero RCU callbacks\n");
 + }
  }
  EXPORT_SYMBOL_GPL(rcu_idle_exit);
 
 -- 
 1.7.10.4
 



Re: rcu self-detected stall messages on OMAP3, 4 boards

2012-09-22 Thread Paul Walmsley
On Sat, 22 Sep 2012, Paul E. McKenney wrote:

 Very cool, thank you for your testing efforts!!!

You're welcome.

 May I apply your Tested-by to this patch?

Please do:

Tested-by: Paul Walmsley p...@pwsan.com  # OMAP4430

Am testing on OMAP3730 (single-core) now.


- Paul