Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-05-01 Thread Johannes Weiner
On Wed, May 01, 2013 at 08:28:30AM -0700, Hugh Dickins wrote:
> On Tue, 30 Apr 2013, Johannes Weiner wrote:
> > On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
> > > On Wed, 24 Apr 2013, Johannes Weiner wrote:
> > > > On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
> > > > > On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
> > > > > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> > > > > > 
> > > > > > [   48.347963] [ cut here ]
> > > > > > [   48.347972] kernel BUG at mm/memcontrol.c:3994!
> > > > > > __mem_cgroup_uncharge_common() triggers:
> > > > > > 
> > > > > > [...]
> > > > > > if (mem_cgroup_disabled())
> > > > > > return NULL;
> > > > > > 
> > > > > > VM_BUG_ON(PageSwapCache(page));
> > > > > > [...]
> > > 
> > > I agree that the actual memcg uncharging should be okay, but the memsw
> > > swap stats will go wrong (doesn't matter t much), and mem_cgroup_put
> > > get missed (leaking a struct mem_cgroup).
> > 
> > Ok, so I just went over this again.  For the swapout path the memsw
> > uncharge is deferred, but if we "steal" this uncharge from the swap
> > code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), so we
> > may prematurely unaccount the swap page, but we never leak a charge.
> > Good.
> > 
> > Because of this stealing, we also don't do the following:
> > 
> > if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
> > mem_cgroup_swap_statistics(memcg, true);
> > mem_cgroup_get(memcg);
> > }
> > 
> > I.e. it does not matter that mem_cgroup_uncharge_swap() doesn't do the
> > put, we are also not doing the get.  We should not leak references.
> > 
> > So the only thing that I can see go wrong is that we may have a
> > swapped out page that is not charged to memsw and not accounted as
> > MEM_CGROUP_STAT_SWAP.  But I don't know how likely that is, because we
> > check for PG_swapcache in this uncharge path after the last pte is
> > torn down, so even though the page is put on swap cache, it probably
> > won't be swapped.  It would require that the PG_swapcache setting
> > would become visible only after the page has been added to the swap
> > cache AND rmap has established at least one swap pte for us to
> > uncharge a page that actually continues to be used.  And that's a bit
> > of a stretch, I think.
> 
> Sorry, our minds seem to work in different ways,
> I understood very little of what you wrote above :-(
> 
> But once I try to disprove you with a counter-example, I seem to
> arrive at the same conclusion as you have (well, I haven't quite
> arrived there yet, but cannot give it any more time).

I might be losing my mind.  But since you are reaching the same
conclusion, and I see the same mental milestones in your thought
process described below, it's more likely that I suck at describing my
train of thought coherently.  Or the third possibility: we're both
losing it!

> Looking at it from my point of view, I concentrate on the racy
>   if (PageSwapCache(page))
>   return;
>   __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
> in mem_cgroup_uncharge_page().
> 
> Now, that may or may not catch the case where last reference to page
> is unmapped at the same time as the page is added to swap: but being
> a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the
> memsw stats and get/put at all, those remain in balance.

Yes, exactly.

> And mem_cgroup_uncharge_swap() has all along been prepared to get
> a zero id from swap_cgroup_record(), if a SwapCache page should be
> uncharged when it was never quite charged as such.
> 
> Yes, we may occasionally fail to charge a SwapCache page as such
> if its final unmap from userspace races with its being added to swap;
> but it's heading towards swap_writepage()'s try_to_free_swap() anyway,
> so I don't think that's anything to worry about.

Agreed as well.  If there are no pte references to the swap slot, it
will be freed either way.  I didn't even think of the
try_to_free_swap() in the writeout call, but was looking at the
__remove_mapping later on in reclaim that will do a swapcache_free().

The only case I was worried about is the following:

#0  #1
page_remove_rmap()  shrink_page_list()
  if --page->mapcount == 0:   add_to_swap()
mem_cgroup_uncharge_page()  __add_to_swap_cache()
  if PageSwapCache:   SetPageSwapCache()
returntry_to_unmap()
  __mem_cgroup_uncharge_common()for each pte:
  install swp_entry_t
  page->mapcount--

Looking at #1, I don't see anything that would force concurrent
threads to observe SetSwapCache ordered against the page->mapcount--.
My concern was that if those get 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-05-01 Thread Hugh Dickins
On Tue, 30 Apr 2013, Johannes Weiner wrote:
> On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
> > On Wed, 24 Apr 2013, Johannes Weiner wrote:
> > > On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
> > > > On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
> > > > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> > > > > 
> > > > > [   48.347963] [ cut here ]
> > > > > [   48.347972] kernel BUG at mm/memcontrol.c:3994!
> > > > > __mem_cgroup_uncharge_common() triggers:
> > > > > 
> > > > > [...]
> > > > > if (mem_cgroup_disabled())
> > > > > return NULL;
> > > > > 
> > > > > VM_BUG_ON(PageSwapCache(page));
> > > > > [...]
> > 
> > I agree that the actual memcg uncharging should be okay, but the memsw
> > swap stats will go wrong (doesn't matter t much), and mem_cgroup_put
> > get missed (leaking a struct mem_cgroup).
> 
> Ok, so I just went over this again.  For the swapout path the memsw
> uncharge is deferred, but if we "steal" this uncharge from the swap
> code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), so we
> may prematurely unaccount the swap page, but we never leak a charge.
> Good.
> 
> Because of this stealing, we also don't do the following:
> 
>   if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
>   mem_cgroup_swap_statistics(memcg, true);
>   mem_cgroup_get(memcg);
>   }
> 
> I.e. it does not matter that mem_cgroup_uncharge_swap() doesn't do the
> put, we are also not doing the get.  We should not leak references.
> 
> So the only thing that I can see go wrong is that we may have a
> swapped out page that is not charged to memsw and not accounted as
> MEM_CGROUP_STAT_SWAP.  But I don't know how likely that is, because we
> check for PG_swapcache in this uncharge path after the last pte is
> torn down, so even though the page is put on swap cache, it probably
> won't be swapped.  It would require that the PG_swapcache setting
> would become visible only after the page has been added to the swap
> cache AND rmap has established at least one swap pte for us to
> uncharge a page that actually continues to be used.  And that's a bit
> of a stretch, I think.

Sorry, our minds seem to work in different ways,
I understood very little of what you wrote above :-(

But once I try to disprove you with a counter-example, I seem to
arrive at the same conclusion as you have (well, I haven't quite
arrived there yet, but cannot give it any more time).

Looking at it from my point of view, I concentrate on the racy
if (PageSwapCache(page))
return;
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
in mem_cgroup_uncharge_page().

Now, that may or may not catch the case where last reference to page
is unmapped at the same time as the page is added to swap: but being
a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the
memsw stats and get/put at all, those remain in balance.

And mem_cgroup_uncharge_swap() has all along been prepared to get
a zero id from swap_cgroup_record(), if a SwapCache page should be
uncharged when it was never quite charged as such.

Yes, we may occasionally fail to charge a SwapCache page as such
if its final unmap from userspace races with its being added to swap;
but it's heading towards swap_writepage()'s try_to_free_swap() anyway,
so I don't think that's anything to worry about.

(If I had time to stop and read through that, I'd probably find it
just as hard to understand as what you wrote!)

> 
> Did I miss something?  If not, I'll just send a patch that removes the
> VM_BUG_ON() and adds a comment describing the scenarios and a note
> that we may want to fix this in the future.

I don't think you missed something.  Yes, please just send Linus and
Andrew a patch to remove the VM_BUG_ON() (with Cc stable tag), I now
agree that's all that's really needed - thanks.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-05-01 Thread Hugh Dickins
On Tue, 30 Apr 2013, Johannes Weiner wrote:
 On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
  On Wed, 24 Apr 2013, Johannes Weiner wrote:
   On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
 On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
 
 [   48.347963] [ cut here ]
 [   48.347972] kernel BUG at mm/memcontrol.c:3994!
 __mem_cgroup_uncharge_common() triggers:
 
 [...]
 if (mem_cgroup_disabled())
 return NULL;
 
 VM_BUG_ON(PageSwapCache(page));
 [...]
  
  I agree that the actual memcg uncharging should be okay, but the memsw
  swap stats will go wrong (doesn't matter t much), and mem_cgroup_put
  get missed (leaking a struct mem_cgroup).
 
 Ok, so I just went over this again.  For the swapout path the memsw
 uncharge is deferred, but if we steal this uncharge from the swap
 code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), so we
 may prematurely unaccount the swap page, but we never leak a charge.
 Good.
 
 Because of this stealing, we also don't do the following:
 
   if (do_swap_account  ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
   mem_cgroup_swap_statistics(memcg, true);
   mem_cgroup_get(memcg);
   }
 
 I.e. it does not matter that mem_cgroup_uncharge_swap() doesn't do the
 put, we are also not doing the get.  We should not leak references.
 
 So the only thing that I can see go wrong is that we may have a
 swapped out page that is not charged to memsw and not accounted as
 MEM_CGROUP_STAT_SWAP.  But I don't know how likely that is, because we
 check for PG_swapcache in this uncharge path after the last pte is
 torn down, so even though the page is put on swap cache, it probably
 won't be swapped.  It would require that the PG_swapcache setting
 would become visible only after the page has been added to the swap
 cache AND rmap has established at least one swap pte for us to
 uncharge a page that actually continues to be used.  And that's a bit
 of a stretch, I think.

Sorry, our minds seem to work in different ways,
I understood very little of what you wrote above :-(

But once I try to disprove you with a counter-example, I seem to
arrive at the same conclusion as you have (well, I haven't quite
arrived there yet, but cannot give it any more time).

Looking at it from my point of view, I concentrate on the racy
if (PageSwapCache(page))
return;
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
in mem_cgroup_uncharge_page().

Now, that may or may not catch the case where last reference to page
is unmapped at the same time as the page is added to swap: but being
a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the
memsw stats and get/put at all, those remain in balance.

And mem_cgroup_uncharge_swap() has all along been prepared to get
a zero id from swap_cgroup_record(), if a SwapCache page should be
uncharged when it was never quite charged as such.

Yes, we may occasionally fail to charge a SwapCache page as such
if its final unmap from userspace races with its being added to swap;
but it's heading towards swap_writepage()'s try_to_free_swap() anyway,
so I don't think that's anything to worry about.

(If I had time to stop and read through that, I'd probably find it
just as hard to understand as what you wrote!)

 
 Did I miss something?  If not, I'll just send a patch that removes the
 VM_BUG_ON() and adds a comment describing the scenarios and a note
 that we may want to fix this in the future.

I don't think you missed something.  Yes, please just send Linus and
Andrew a patch to remove the VM_BUG_ON() (with Cc stable tag), I now
agree that's all that's really needed - thanks.

Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-05-01 Thread Johannes Weiner
On Wed, May 01, 2013 at 08:28:30AM -0700, Hugh Dickins wrote:
 On Tue, 30 Apr 2013, Johannes Weiner wrote:
  On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
   On Wed, 24 Apr 2013, Johannes Weiner wrote:
On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
 On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
  On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
  
  [   48.347963] [ cut here ]
  [   48.347972] kernel BUG at mm/memcontrol.c:3994!
  __mem_cgroup_uncharge_common() triggers:
  
  [...]
  if (mem_cgroup_disabled())
  return NULL;
  
  VM_BUG_ON(PageSwapCache(page));
  [...]
   
   I agree that the actual memcg uncharging should be okay, but the memsw
   swap stats will go wrong (doesn't matter t much), and mem_cgroup_put
   get missed (leaking a struct mem_cgroup).
  
  Ok, so I just went over this again.  For the swapout path the memsw
  uncharge is deferred, but if we steal this uncharge from the swap
  code, we actually do uncharge memsw in mem_cgroup_do_uncharge(), so we
  may prematurely unaccount the swap page, but we never leak a charge.
  Good.
  
  Because of this stealing, we also don't do the following:
  
  if (do_swap_account  ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
  mem_cgroup_swap_statistics(memcg, true);
  mem_cgroup_get(memcg);
  }
  
  I.e. it does not matter that mem_cgroup_uncharge_swap() doesn't do the
  put, we are also not doing the get.  We should not leak references.
  
  So the only thing that I can see go wrong is that we may have a
  swapped out page that is not charged to memsw and not accounted as
  MEM_CGROUP_STAT_SWAP.  But I don't know how likely that is, because we
  check for PG_swapcache in this uncharge path after the last pte is
  torn down, so even though the page is put on swap cache, it probably
  won't be swapped.  It would require that the PG_swapcache setting
  would become visible only after the page has been added to the swap
  cache AND rmap has established at least one swap pte for us to
  uncharge a page that actually continues to be used.  And that's a bit
  of a stretch, I think.
 
 Sorry, our minds seem to work in different ways,
 I understood very little of what you wrote above :-(
 
 But once I try to disprove you with a counter-example, I seem to
 arrive at the same conclusion as you have (well, I haven't quite
 arrived there yet, but cannot give it any more time).

I might be losing my mind.  But since you are reaching the same
conclusion, and I see the same mental milestones in your thought
process described below, it's more likely that I suck at describing my
train of thought coherently.  Or the third possibility: we're both
losing it!

 Looking at it from my point of view, I concentrate on the racy
   if (PageSwapCache(page))
   return;
   __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
 in mem_cgroup_uncharge_page().
 
 Now, that may or may not catch the case where last reference to page
 is unmapped at the same time as the page is added to swap: but being
 a MEM_CGROUP_CHARGE_TYPE_ANON call, it does not interfere with the
 memsw stats and get/put at all, those remain in balance.

Yes, exactly.

 And mem_cgroup_uncharge_swap() has all along been prepared to get
 a zero id from swap_cgroup_record(), if a SwapCache page should be
 uncharged when it was never quite charged as such.
 
 Yes, we may occasionally fail to charge a SwapCache page as such
 if its final unmap from userspace races with its being added to swap;
 but it's heading towards swap_writepage()'s try_to_free_swap() anyway,
 so I don't think that's anything to worry about.

Agreed as well.  If there are no pte references to the swap slot, it
will be freed either way.  I didn't even think of the
try_to_free_swap() in the writeout call, but was looking at the
__remove_mapping later on in reclaim that will do a swapcache_free().

The only case I was worried about is the following:

#0  #1
page_remove_rmap()  shrink_page_list()
  if --page-mapcount == 0:   add_to_swap()
mem_cgroup_uncharge_page()  __add_to_swap_cache()
  if PageSwapCache:   SetPageSwapCache()
returntry_to_unmap()
  __mem_cgroup_uncharge_common()for each pte:
  install swp_entry_t
  page-mapcount--

Looking at #1, I don't see anything that would force concurrent
threads to observe SetSwapCache ordered against the page-mapcount--.
My concern was that if those get reordered, #0 may see page-mapcount
== 1 AND !PageSwapcache, and then go ahead and uncharge the page while
there is actually a swp_entry_t pointing to it.  The page will be a
proper 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-30 Thread Johannes Weiner
On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
> On Wed, 24 Apr 2013, Johannes Weiner wrote:
> > On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
> > > On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
> > > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> > > > > Ok, thanks for verifying! I'll look into it; hopefully I can 
> > > > > reproduce it
> > > > > here as well.
> > > > 
> > > > That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
> > > > below (when I force the system to swap):
> > > > 
> > > > [   48.347963] [ cut here ]
> > > > [   48.347972] kernel BUG at mm/memcontrol.c:3994!
> > > > [   48.348012] illegal operation: 0001 [#1] SMP 
> > > > [   48.348015] Modules linked in:
> > > > [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
> > > > [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
> > > > 2e91b938)
> > > > [   48.348022] Krnl PSW : 0704f0018000 0026552c 
> > > > (__mem_cgroup_uncharge_common+0x2c4/0x33c)
> > > > [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 
> > > > PM:0 EA:3
> > > >Krnl GPRS: 0008 0009 
> > > > 03d1002a9200 
> > > > [   48.348039] 006812d8 
> > > > 03ffdf339000 321a6f98
> > > > [   48.348043]03fffce11000  
> > > > 0001 03d1002a9200
> > > > [   48.348046]0001 00681b88 
> > > > 2e91bc18 2e91bbd0
> > > > [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
> > > > %r14,1b9a60
> > > >   00265524: a7f4ff7dbrc 
> > > > 15,26541e
> > > >  #00265528: a7f40001brc 
> > > > 15,26552a
> > > >  >0026552c: e3c0b8200124stg 
> > > > %r12,6176(%r11)
> > > >   00265532: a7f4ff57brc 
> > > > 15,2653e0
> > > >   00265536: e310b8280104lg  
> > > > %r1,6184(%r11)
> > > >   0026553c: a71b0001aghi
> > > > %r1,1
> > > >   00265540: e310b8280124stg 
> > > > %r1,6184(%r11)
> > > > [   48.348099] Call Trace:
> > > > [   48.348100] ([<03d1002a91c0>] 0x3d1002a91c0)
> > > > [   48.348102]  [<002404aa>] page_remove_rmap+0xf2/0x16c
> > > > [   48.348106]  [<00232dc8>] unmap_single_vma+0x494/0x7d8
> > > > [   48.348107]  [<00233ac0>] unmap_vmas+0x50/0x74
> > > > [   48.348109]  [<002396ec>] unmap_region+0x9c/0x110
> > > > [   48.348110]  [<0023bd18>] do_munmap+0x284/0x470
> > > > [   48.348111]  [<0023bf56>] vm_munmap+0x52/0x70
> > > > [   48.348113]  [<0023cf32>] SyS_munmap+0x3a/0x4c
> > > > [   48.348114]  [<00665e14>] sysc_noemu+0x22/0x28
> > > > [   48.348118]  [<03fffcf187b2>] 0x3fffcf187b2
> > > > [   48.348119] Last Breaking-Event-Address:
> > > > [   48.348120]  [<00265528>] 
> > > > __mem_cgroup_uncharge_common+0x2c0/0x33c
> > > > 
> > > > Looking at the code, the code flow is:
> > > > 
> > > > page_remove_rmap() -> mem_cgroup_uncharge_page() -> 
> > > > __mem_cgroup_uncharge_common()
> > > > 
> > > > Note that in mem_cgroup_uncharge_page() the page in question passed the 
> > > > check:
> > > > 
> > > > [...]
> > > > if (PageSwapCache(page))
> > > > return;
> > > > [...]
> > > > 
> > > > and just a couple of instructions later the VM_BUG_ON() within
> > > > __mem_cgroup_uncharge_common() triggers:
> > > > 
> > > > [...]
> > > > if (mem_cgroup_disabled())
> > > > return NULL;
> > > > 
> > > > VM_BUG_ON(PageSwapCache(page));
> > > > [...]
> > > > 
> > > > Which means that another cpu changed the pageflags concurrently. In 
> > > > fact,
> > > > looking at the dump a different cpu is indeed busy with running kswapd.
> > > 
> > > Hmm, maybe I am missing something but it really looks like we can race
> > > here. Reclaim path takes the page lock while zap_pte takes page table
> > > lock so nothing prevents them from racing here:
> > > shrink_page_list  zap_pte_range
> > >   trylock_page  pte_offset_map_lock
> > >   add_to_swap page_remove_rmap
> > > /* Page can be still mapped */
> > > add_to_swap_cache   atomic_add_negative(_mapcount)
> > >   __add_to_swap_cache mem_cgroup_uncharge_page
> > > (PageSwapCache(page)) && 
> > > return
> > > SetPageSwapCache
> > > __mem_cgroup_uncharge_common
> > >   VM_BUG_ON(PageSwapCache(page))
> > > 
> > > Maybe not many people run 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-30 Thread Johannes Weiner
On Wed, Apr 24, 2013 at 08:50:01PM -0700, Hugh Dickins wrote:
 On Wed, 24 Apr 2013, Johannes Weiner wrote:
  On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
   On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
 Ok, thanks for verifying! I'll look into it; hopefully I can 
 reproduce it
 here as well.

That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
below (when I force the system to swap):

[   48.347963] [ cut here ]
[   48.347972] kernel BUG at mm/memcontrol.c:3994!
[   48.348012] illegal operation: 0001 [#1] SMP 
[   48.348015] Modules linked in:
[   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
[   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
2e91b938)
[   48.348022] Krnl PSW : 0704f0018000 0026552c 
(__mem_cgroup_uncharge_common+0x2c4/0x33c)
[   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 
PM:0 EA:3
   Krnl GPRS: 0008 0009 
03d1002a9200 
[   48.348039] 006812d8 
03ffdf339000 321a6f98
[   48.348043]03fffce11000  
0001 03d1002a9200
[   48.348046]0001 00681b88 
2e91bc18 2e91bbd0
[   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
%r14,1b9a60
  00265524: a7f4ff7dbrc 
15,26541e
 #00265528: a7f40001brc 
15,26552a
 0026552c: e3c0b8200124stg 
%r12,6176(%r11)
  00265532: a7f4ff57brc 
15,2653e0
  00265536: e310b8280104lg  
%r1,6184(%r11)
  0026553c: a71b0001aghi
%r1,1
  00265540: e310b8280124stg 
%r1,6184(%r11)
[   48.348099] Call Trace:
[   48.348100] ([03d1002a91c0] 0x3d1002a91c0)
[   48.348102]  [002404aa] page_remove_rmap+0xf2/0x16c
[   48.348106]  [00232dc8] unmap_single_vma+0x494/0x7d8
[   48.348107]  [00233ac0] unmap_vmas+0x50/0x74
[   48.348109]  [002396ec] unmap_region+0x9c/0x110
[   48.348110]  [0023bd18] do_munmap+0x284/0x470
[   48.348111]  [0023bf56] vm_munmap+0x52/0x70
[   48.348113]  [0023cf32] SyS_munmap+0x3a/0x4c
[   48.348114]  [00665e14] sysc_noemu+0x22/0x28
[   48.348118]  [03fffcf187b2] 0x3fffcf187b2
[   48.348119] Last Breaking-Event-Address:
[   48.348120]  [00265528] 
__mem_cgroup_uncharge_common+0x2c0/0x33c

Looking at the code, the code flow is:

page_remove_rmap() - mem_cgroup_uncharge_page() - 
__mem_cgroup_uncharge_common()

Note that in mem_cgroup_uncharge_page() the page in question passed the 
check:

[...]
if (PageSwapCache(page))
return;
[...]

and just a couple of instructions later the VM_BUG_ON() within
__mem_cgroup_uncharge_common() triggers:

[...]
if (mem_cgroup_disabled())
return NULL;

VM_BUG_ON(PageSwapCache(page));
[...]

Which means that another cpu changed the pageflags concurrently. In 
fact,
looking at the dump a different cpu is indeed busy with running kswapd.
   
   Hmm, maybe I am missing something but it really looks like we can race
   here. Reclaim path takes the page lock while zap_pte takes page table
   lock so nothing prevents them from racing here:
   shrink_page_list  zap_pte_range
 trylock_page  pte_offset_map_lock
 add_to_swap page_remove_rmap
   /* Page can be still mapped */
   add_to_swap_cache   atomic_add_negative(_mapcount)
 __add_to_swap_cache mem_cgroup_uncharge_page
   (PageSwapCache(page))  
   return
   SetPageSwapCache
   __mem_cgroup_uncharge_common
 VM_BUG_ON(PageSwapCache(page))
   
   Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
   do not this more often (even me testing configs are not consistent in
   that regards and only few have it on). The only thing that changed in
   this area recently is 0c59b89c which made the test VM_BUG_ON rather then
   simple return in 3.6
   And maybe the BUG_ON is too harsh as CgroupUsed should guarantee that
   the uncharge will eventually go away. What do you think 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Hugh Dickins
On Wed, 24 Apr 2013, Johannes Weiner wrote:
> On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
> > On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
> > > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> > > > Ok, thanks for verifying! I'll look into it; hopefully I can reproduce 
> > > > it
> > > > here as well.
> > > 
> > > That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
> > > below (when I force the system to swap):
> > > 
> > > [   48.347963] [ cut here ]
> > > [   48.347972] kernel BUG at mm/memcontrol.c:3994!
> > > [   48.348012] illegal operation: 0001 [#1] SMP 
> > > [   48.348015] Modules linked in:
> > > [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
> > > [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
> > > 2e91b938)
> > > [   48.348022] Krnl PSW : 0704f0018000 0026552c 
> > > (__mem_cgroup_uncharge_common+0x2c4/0x33c)
> > > [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 
> > > PM:0 EA:3
> > >Krnl GPRS: 0008 0009 
> > > 03d1002a9200 
> > > [   48.348039] 006812d8 
> > > 03ffdf339000 321a6f98
> > > [   48.348043]03fffce11000  
> > > 0001 03d1002a9200
> > > [   48.348046]0001 00681b88 
> > > 2e91bc18 2e91bbd0
> > > [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
> > > %r14,1b9a60
> > >   00265524: a7f4ff7dbrc 
> > > 15,26541e
> > >  #00265528: a7f40001brc 
> > > 15,26552a
> > >  >0026552c: e3c0b8200124stg 
> > > %r12,6176(%r11)
> > >   00265532: a7f4ff57brc 
> > > 15,2653e0
> > >   00265536: e310b8280104lg  
> > > %r1,6184(%r11)
> > >   0026553c: a71b0001aghi
> > > %r1,1
> > >   00265540: e310b8280124stg 
> > > %r1,6184(%r11)
> > > [   48.348099] Call Trace:
> > > [   48.348100] ([<03d1002a91c0>] 0x3d1002a91c0)
> > > [   48.348102]  [<002404aa>] page_remove_rmap+0xf2/0x16c
> > > [   48.348106]  [<00232dc8>] unmap_single_vma+0x494/0x7d8
> > > [   48.348107]  [<00233ac0>] unmap_vmas+0x50/0x74
> > > [   48.348109]  [<002396ec>] unmap_region+0x9c/0x110
> > > [   48.348110]  [<0023bd18>] do_munmap+0x284/0x470
> > > [   48.348111]  [<0023bf56>] vm_munmap+0x52/0x70
> > > [   48.348113]  [<0023cf32>] SyS_munmap+0x3a/0x4c
> > > [   48.348114]  [<00665e14>] sysc_noemu+0x22/0x28
> > > [   48.348118]  [<03fffcf187b2>] 0x3fffcf187b2
> > > [   48.348119] Last Breaking-Event-Address:
> > > [   48.348120]  [<00265528>] 
> > > __mem_cgroup_uncharge_common+0x2c0/0x33c
> > > 
> > > Looking at the code, the code flow is:
> > > 
> > > page_remove_rmap() -> mem_cgroup_uncharge_page() -> 
> > > __mem_cgroup_uncharge_common()
> > > 
> > > Note that in mem_cgroup_uncharge_page() the page in question passed the 
> > > check:
> > > 
> > > [...]
> > > if (PageSwapCache(page))
> > > return;
> > > [...]
> > > 
> > > and just a couple of instructions later the VM_BUG_ON() within
> > > __mem_cgroup_uncharge_common() triggers:
> > > 
> > > [...]
> > > if (mem_cgroup_disabled())
> > > return NULL;
> > > 
> > > VM_BUG_ON(PageSwapCache(page));
> > > [...]
> > > 
> > > Which means that another cpu changed the pageflags concurrently. In fact,
> > > looking at the dump a different cpu is indeed busy with running kswapd.
> > 
> > Hmm, maybe I am missing something but it really looks like we can race
> > here. Reclaim path takes the page lock while zap_pte takes page table
> > lock so nothing prevents them from racing here:
> > shrink_page_listzap_pte_range
> >   trylock_pagepte_offset_map_lock
> >   add_to_swap   page_remove_rmap
> > /* Page can be still mapped */
> > add_to_swap_cache atomic_add_negative(_mapcount)
> >   __add_to_swap_cache   mem_cgroup_uncharge_page
> >   (PageSwapCache(page)) && 
> > return
> > SetPageSwapCache
> >   __mem_cgroup_uncharge_common
> > VM_BUG_ON(PageSwapCache(page))
> > 
> > Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
> > do not this more often (even me testing configs are not consistent in
> > that regards and only few have it on). The only thing that changed in
> > this area recently is 0c59b89c which made the test VM_BUG_ON 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Johannes Weiner
On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
> On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
> > On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> > > Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
> > > here as well.
> > 
> > That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
> > below (when I force the system to swap):
> > 
> > [   48.347963] [ cut here ]
> > [   48.347972] kernel BUG at mm/memcontrol.c:3994!
> > [   48.348012] illegal operation: 0001 [#1] SMP 
> > [   48.348015] Modules linked in:
> > [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
> > [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
> > 2e91b938)
> > [   48.348022] Krnl PSW : 0704f0018000 0026552c 
> > (__mem_cgroup_uncharge_common+0x2c4/0x33c)
> > [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 
> > PM:0 EA:3
> >Krnl GPRS: 0008 0009 
> > 03d1002a9200 
> > [   48.348039] 006812d8 
> > 03ffdf339000 321a6f98
> > [   48.348043]03fffce11000  
> > 0001 03d1002a9200
> > [   48.348046]0001 00681b88 
> > 2e91bc18 2e91bbd0
> > [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
> > %r14,1b9a60
> >   00265524: a7f4ff7dbrc 
> > 15,26541e
> >  #00265528: a7f40001brc 
> > 15,26552a
> >  >0026552c: e3c0b8200124stg 
> > %r12,6176(%r11)
> >   00265532: a7f4ff57brc 
> > 15,2653e0
> >   00265536: e310b8280104lg  
> > %r1,6184(%r11)
> >   0026553c: a71b0001aghi
> > %r1,1
> >   00265540: e310b8280124stg 
> > %r1,6184(%r11)
> > [   48.348099] Call Trace:
> > [   48.348100] ([<03d1002a91c0>] 0x3d1002a91c0)
> > [   48.348102]  [<002404aa>] page_remove_rmap+0xf2/0x16c
> > [   48.348106]  [<00232dc8>] unmap_single_vma+0x494/0x7d8
> > [   48.348107]  [<00233ac0>] unmap_vmas+0x50/0x74
> > [   48.348109]  [<002396ec>] unmap_region+0x9c/0x110
> > [   48.348110]  [<0023bd18>] do_munmap+0x284/0x470
> > [   48.348111]  [<0023bf56>] vm_munmap+0x52/0x70
> > [   48.348113]  [<0023cf32>] SyS_munmap+0x3a/0x4c
> > [   48.348114]  [<00665e14>] sysc_noemu+0x22/0x28
> > [   48.348118]  [<03fffcf187b2>] 0x3fffcf187b2
> > [   48.348119] Last Breaking-Event-Address:
> > [   48.348120]  [<00265528>] 
> > __mem_cgroup_uncharge_common+0x2c0/0x33c
> > 
> > Looking at the code, the code flow is:
> > 
> > page_remove_rmap() -> mem_cgroup_uncharge_page() -> 
> > __mem_cgroup_uncharge_common()
> > 
> > Note that in mem_cgroup_uncharge_page() the page in question passed the 
> > check:
> > 
> > [...]
> > if (PageSwapCache(page))
> > return;
> > [...]
> > 
> > and just a couple of instructions later the VM_BUG_ON() within
> > __mem_cgroup_uncharge_common() triggers:
> > 
> > [...]
> > if (mem_cgroup_disabled())
> > return NULL;
> > 
> > VM_BUG_ON(PageSwapCache(page));
> > [...]
> > 
> > Which means that another cpu changed the pageflags concurrently. In fact,
> > looking at the dump a different cpu is indeed busy with running kswapd.
> 
> Hmm, maybe I am missing something but it really looks like we can race
> here. Reclaim path takes the page lock while zap_pte takes page table
> lock so nothing prevents them from racing here:
> shrink_page_list  zap_pte_range
>   trylock_page  pte_offset_map_lock
>   add_to_swap page_remove_rmap
> /* Page can be still mapped */
> add_to_swap_cache   atomic_add_negative(_mapcount)
>   __add_to_swap_cache mem_cgroup_uncharge_page
> (PageSwapCache(page)) && 
> return
> SetPageSwapCache
> __mem_cgroup_uncharge_common
>   VM_BUG_ON(PageSwapCache(page))
> 
> Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
> do not this more often (even me testing configs are not consistent in
> that regards and only few have it on). The only thing that changed in
> this area recently is 0c59b89c which made the test VM_BUG_ON rather then
> simple return in 3.6
> And maybe the BUG_ON is too harsh as CgroupUsed should guarantee that
> the uncharge will eventually go away. What do you think Johannes?

Interesting.  We need to ensure there is ordering between setting
PG_swapcache and 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Michal Hocko
On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
> On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> > Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
> > here as well.
> 
> That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
> below (when I force the system to swap):
> 
> [   48.347963] [ cut here ]
> [   48.347972] kernel BUG at mm/memcontrol.c:3994!
> [   48.348012] illegal operation: 0001 [#1] SMP 
> [   48.348015] Modules linked in:
> [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
> [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
> 2e91b938)
> [   48.348022] Krnl PSW : 0704f0018000 0026552c 
> (__mem_cgroup_uncharge_common+0x2c4/0x33c)
> [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
> EA:3
>Krnl GPRS: 0008 0009 03d1002a9200 
> 
> [   48.348039] 006812d8 03ffdf339000 
> 321a6f98
> [   48.348043]03fffce11000  0001 
> 03d1002a9200
> [   48.348046]0001 00681b88 2e91bc18 
> 2e91bbd0
> [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
> %r14,1b9a60
>   00265524: a7f4ff7dbrc 
> 15,26541e
>  #00265528: a7f40001brc 
> 15,26552a
>  >0026552c: e3c0b8200124stg 
> %r12,6176(%r11)
>   00265532: a7f4ff57brc 
> 15,2653e0
>   00265536: e310b8280104lg  
> %r1,6184(%r11)
>   0026553c: a71b0001aghi%r1,1
>   00265540: e310b8280124stg 
> %r1,6184(%r11)
> [   48.348099] Call Trace:
> [   48.348100] ([<03d1002a91c0>] 0x3d1002a91c0)
> [   48.348102]  [<002404aa>] page_remove_rmap+0xf2/0x16c
> [   48.348106]  [<00232dc8>] unmap_single_vma+0x494/0x7d8
> [   48.348107]  [<00233ac0>] unmap_vmas+0x50/0x74
> [   48.348109]  [<002396ec>] unmap_region+0x9c/0x110
> [   48.348110]  [<0023bd18>] do_munmap+0x284/0x470
> [   48.348111]  [<0023bf56>] vm_munmap+0x52/0x70
> [   48.348113]  [<0023cf32>] SyS_munmap+0x3a/0x4c
> [   48.348114]  [<00665e14>] sysc_noemu+0x22/0x28
> [   48.348118]  [<03fffcf187b2>] 0x3fffcf187b2
> [   48.348119] Last Breaking-Event-Address:
> [   48.348120]  [<00265528>] __mem_cgroup_uncharge_common+0x2c0/0x33c
> 
> Looking at the code, the code flow is:
> 
> page_remove_rmap() -> mem_cgroup_uncharge_page() -> 
> __mem_cgroup_uncharge_common()
> 
> Note that in mem_cgroup_uncharge_page() the page in question passed the check:
> 
> [...]
> if (PageSwapCache(page))
> return;
> [...]
> 
> and just a couple of instructions later the VM_BUG_ON() within
> __mem_cgroup_uncharge_common() triggers:
> 
> [...]
> if (mem_cgroup_disabled())
> return NULL;
> 
> VM_BUG_ON(PageSwapCache(page));
> [...]
> 
> Which means that another cpu changed the pageflags concurrently. In fact,
> looking at the dump a different cpu is indeed busy with running kswapd.

Hmm, maybe I am missing something but it really looks like we can race
here. Reclaim path takes the page lock while zap_pte takes page table
lock so nothing prevents them from racing here:
shrink_page_listzap_pte_range
  trylock_pagepte_offset_map_lock
  add_to_swap   page_remove_rmap
/* Page can be still mapped */
add_to_swap_cache atomic_add_negative(_mapcount)
  __add_to_swap_cache   mem_cgroup_uncharge_page
  (PageSwapCache(page)) && return
SetPageSwapCache
  __mem_cgroup_uncharge_common
VM_BUG_ON(PageSwapCache(page))

Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
do not this more often (even me testing configs are not consistent in
that regards and only few have it on). The only thing that changed in
this area recently is 0c59b89c which made the test VM_BUG_ON rather then
simple return in 3.6
And maybe the BUG_ON is too harsh as CgroupUsed should guarantee that
the uncharge will eventually go away. What do you think Johannes?
 
> So.. this seems to be somewhat broken. Anyone familiar with memcontrol?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Heiko Carstens
On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
> Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
> here as well.

That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
below (when I force the system to swap):

[   48.347963] [ cut here ]
[   48.347972] kernel BUG at mm/memcontrol.c:3994!
[   48.348012] illegal operation: 0001 [#1] SMP 
[   48.348015] Modules linked in:
[   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
[   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
2e91b938)
[   48.348022] Krnl PSW : 0704f0018000 0026552c 
(__mem_cgroup_uncharge_common+0x2c4/0x33c)
[   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
EA:3
   Krnl GPRS: 0008 0009 03d1002a9200 

[   48.348039] 006812d8 03ffdf339000 
321a6f98
[   48.348043]03fffce11000  0001 
03d1002a9200
[   48.348046]0001 00681b88 2e91bc18 
2e91bbd0
[   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
%r14,1b9a60
  00265524: a7f4ff7dbrc 
15,26541e
 #00265528: a7f40001brc 
15,26552a
 >0026552c: e3c0b8200124stg 
%r12,6176(%r11)
  00265532: a7f4ff57brc 
15,2653e0
  00265536: e310b8280104lg  
%r1,6184(%r11)
  0026553c: a71b0001aghi%r1,1
  00265540: e310b8280124stg 
%r1,6184(%r11)
[   48.348099] Call Trace:
[   48.348100] ([<03d1002a91c0>] 0x3d1002a91c0)
[   48.348102]  [<002404aa>] page_remove_rmap+0xf2/0x16c
[   48.348106]  [<00232dc8>] unmap_single_vma+0x494/0x7d8
[   48.348107]  [<00233ac0>] unmap_vmas+0x50/0x74
[   48.348109]  [<002396ec>] unmap_region+0x9c/0x110
[   48.348110]  [<0023bd18>] do_munmap+0x284/0x470
[   48.348111]  [<0023bf56>] vm_munmap+0x52/0x70
[   48.348113]  [<0023cf32>] SyS_munmap+0x3a/0x4c
[   48.348114]  [<00665e14>] sysc_noemu+0x22/0x28
[   48.348118]  [<03fffcf187b2>] 0x3fffcf187b2
[   48.348119] Last Breaking-Event-Address:
[   48.348120]  [<00265528>] __mem_cgroup_uncharge_common+0x2c0/0x33c

Looking at the code, the code flow is:

page_remove_rmap() -> mem_cgroup_uncharge_page() -> 
__mem_cgroup_uncharge_common()

Note that in mem_cgroup_uncharge_page() the page in question passed the check:

[...]
if (PageSwapCache(page))
return;
[...]

and just a couple of instructions later the VM_BUG_ON() within
__mem_cgroup_uncharge_common() triggers:

[...]
if (mem_cgroup_disabled())
return NULL;

VM_BUG_ON(PageSwapCache(page));
[...]

Which means that another cpu changed the pageflags concurrently. In fact,
looking at the dump a different cpu is indeed busy with running kswapd.

So.. this seems to be somewhat broken. Anyone familiar with memcontrol?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Heiko Carstens
On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
 Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
 here as well.

That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
below (when I force the system to swap):

[   48.347963] [ cut here ]
[   48.347972] kernel BUG at mm/memcontrol.c:3994!
[   48.348012] illegal operation: 0001 [#1] SMP 
[   48.348015] Modules linked in:
[   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
[   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
2e91b938)
[   48.348022] Krnl PSW : 0704f0018000 0026552c 
(__mem_cgroup_uncharge_common+0x2c4/0x33c)
[   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
EA:3
   Krnl GPRS: 0008 0009 03d1002a9200 

[   48.348039] 006812d8 03ffdf339000 
321a6f98
[   48.348043]03fffce11000  0001 
03d1002a9200
[   48.348046]0001 00681b88 2e91bc18 
2e91bbd0
[   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
%r14,1b9a60
  00265524: a7f4ff7dbrc 
15,26541e
 #00265528: a7f40001brc 
15,26552a
 0026552c: e3c0b8200124stg 
%r12,6176(%r11)
  00265532: a7f4ff57brc 
15,2653e0
  00265536: e310b8280104lg  
%r1,6184(%r11)
  0026553c: a71b0001aghi%r1,1
  00265540: e310b8280124stg 
%r1,6184(%r11)
[   48.348099] Call Trace:
[   48.348100] ([03d1002a91c0] 0x3d1002a91c0)
[   48.348102]  [002404aa] page_remove_rmap+0xf2/0x16c
[   48.348106]  [00232dc8] unmap_single_vma+0x494/0x7d8
[   48.348107]  [00233ac0] unmap_vmas+0x50/0x74
[   48.348109]  [002396ec] unmap_region+0x9c/0x110
[   48.348110]  [0023bd18] do_munmap+0x284/0x470
[   48.348111]  [0023bf56] vm_munmap+0x52/0x70
[   48.348113]  [0023cf32] SyS_munmap+0x3a/0x4c
[   48.348114]  [00665e14] sysc_noemu+0x22/0x28
[   48.348118]  [03fffcf187b2] 0x3fffcf187b2
[   48.348119] Last Breaking-Event-Address:
[   48.348120]  [00265528] __mem_cgroup_uncharge_common+0x2c0/0x33c

Looking at the code, the code flow is:

page_remove_rmap() - mem_cgroup_uncharge_page() - 
__mem_cgroup_uncharge_common()

Note that in mem_cgroup_uncharge_page() the page in question passed the check:

[...]
if (PageSwapCache(page))
return;
[...]

and just a couple of instructions later the VM_BUG_ON() within
__mem_cgroup_uncharge_common() triggers:

[...]
if (mem_cgroup_disabled())
return NULL;

VM_BUG_ON(PageSwapCache(page));
[...]

Which means that another cpu changed the pageflags concurrently. In fact,
looking at the dump a different cpu is indeed busy with running kswapd.

So.. this seems to be somewhat broken. Anyone familiar with memcontrol?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Michal Hocko
On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
 On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
  Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
  here as well.
 
 That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
 below (when I force the system to swap):
 
 [   48.347963] [ cut here ]
 [   48.347972] kernel BUG at mm/memcontrol.c:3994!
 [   48.348012] illegal operation: 0001 [#1] SMP 
 [   48.348015] Modules linked in:
 [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
 [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
 2e91b938)
 [   48.348022] Krnl PSW : 0704f0018000 0026552c 
 (__mem_cgroup_uncharge_common+0x2c4/0x33c)
 [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
 EA:3
Krnl GPRS: 0008 0009 03d1002a9200 
 
 [   48.348039] 006812d8 03ffdf339000 
 321a6f98
 [   48.348043]03fffce11000  0001 
 03d1002a9200
 [   48.348046]0001 00681b88 2e91bc18 
 2e91bbd0
 [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
 %r14,1b9a60
   00265524: a7f4ff7dbrc 
 15,26541e
  #00265528: a7f40001brc 
 15,26552a
  0026552c: e3c0b8200124stg 
 %r12,6176(%r11)
   00265532: a7f4ff57brc 
 15,2653e0
   00265536: e310b8280104lg  
 %r1,6184(%r11)
   0026553c: a71b0001aghi%r1,1
   00265540: e310b8280124stg 
 %r1,6184(%r11)
 [   48.348099] Call Trace:
 [   48.348100] ([03d1002a91c0] 0x3d1002a91c0)
 [   48.348102]  [002404aa] page_remove_rmap+0xf2/0x16c
 [   48.348106]  [00232dc8] unmap_single_vma+0x494/0x7d8
 [   48.348107]  [00233ac0] unmap_vmas+0x50/0x74
 [   48.348109]  [002396ec] unmap_region+0x9c/0x110
 [   48.348110]  [0023bd18] do_munmap+0x284/0x470
 [   48.348111]  [0023bf56] vm_munmap+0x52/0x70
 [   48.348113]  [0023cf32] SyS_munmap+0x3a/0x4c
 [   48.348114]  [00665e14] sysc_noemu+0x22/0x28
 [   48.348118]  [03fffcf187b2] 0x3fffcf187b2
 [   48.348119] Last Breaking-Event-Address:
 [   48.348120]  [00265528] __mem_cgroup_uncharge_common+0x2c0/0x33c
 
 Looking at the code, the code flow is:
 
 page_remove_rmap() - mem_cgroup_uncharge_page() - 
 __mem_cgroup_uncharge_common()
 
 Note that in mem_cgroup_uncharge_page() the page in question passed the check:
 
 [...]
 if (PageSwapCache(page))
 return;
 [...]
 
 and just a couple of instructions later the VM_BUG_ON() within
 __mem_cgroup_uncharge_common() triggers:
 
 [...]
 if (mem_cgroup_disabled())
 return NULL;
 
 VM_BUG_ON(PageSwapCache(page));
 [...]
 
 Which means that another cpu changed the pageflags concurrently. In fact,
 looking at the dump a different cpu is indeed busy with running kswapd.

Hmm, maybe I am missing something but it really looks like we can race
here. Reclaim path takes the page lock while zap_pte takes page table
lock so nothing prevents them from racing here:
shrink_page_listzap_pte_range
  trylock_pagepte_offset_map_lock
  add_to_swap   page_remove_rmap
/* Page can be still mapped */
add_to_swap_cache atomic_add_negative(_mapcount)
  __add_to_swap_cache   mem_cgroup_uncharge_page
  (PageSwapCache(page))  return
SetPageSwapCache
  __mem_cgroup_uncharge_common
VM_BUG_ON(PageSwapCache(page))

Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
do not this more often (even me testing configs are not consistent in
that regards and only few have it on). The only thing that changed in
this area recently is 0c59b89c which made the test VM_BUG_ON rather then
simple return in 3.6
And maybe the BUG_ON is too harsh as CgroupUsed should guarantee that
the uncharge will eventually go away. What do you think Johannes?
 
 So.. this seems to be somewhat broken. Anyone familiar with memcontrol?

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Johannes Weiner
On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
 On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
  On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
   Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
   here as well.
  
  That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
  below (when I force the system to swap):
  
  [   48.347963] [ cut here ]
  [   48.347972] kernel BUG at mm/memcontrol.c:3994!
  [   48.348012] illegal operation: 0001 [#1] SMP 
  [   48.348015] Modules linked in:
  [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
  [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
  2e91b938)
  [   48.348022] Krnl PSW : 0704f0018000 0026552c 
  (__mem_cgroup_uncharge_common+0x2c4/0x33c)
  [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 
  PM:0 EA:3
 Krnl GPRS: 0008 0009 
  03d1002a9200 
  [   48.348039] 006812d8 
  03ffdf339000 321a6f98
  [   48.348043]03fffce11000  
  0001 03d1002a9200
  [   48.348046]0001 00681b88 
  2e91bc18 2e91bbd0
  [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
  %r14,1b9a60
00265524: a7f4ff7dbrc 
  15,26541e
   #00265528: a7f40001brc 
  15,26552a
   0026552c: e3c0b8200124stg 
  %r12,6176(%r11)
00265532: a7f4ff57brc 
  15,2653e0
00265536: e310b8280104lg  
  %r1,6184(%r11)
0026553c: a71b0001aghi
  %r1,1
00265540: e310b8280124stg 
  %r1,6184(%r11)
  [   48.348099] Call Trace:
  [   48.348100] ([03d1002a91c0] 0x3d1002a91c0)
  [   48.348102]  [002404aa] page_remove_rmap+0xf2/0x16c
  [   48.348106]  [00232dc8] unmap_single_vma+0x494/0x7d8
  [   48.348107]  [00233ac0] unmap_vmas+0x50/0x74
  [   48.348109]  [002396ec] unmap_region+0x9c/0x110
  [   48.348110]  [0023bd18] do_munmap+0x284/0x470
  [   48.348111]  [0023bf56] vm_munmap+0x52/0x70
  [   48.348113]  [0023cf32] SyS_munmap+0x3a/0x4c
  [   48.348114]  [00665e14] sysc_noemu+0x22/0x28
  [   48.348118]  [03fffcf187b2] 0x3fffcf187b2
  [   48.348119] Last Breaking-Event-Address:
  [   48.348120]  [00265528] 
  __mem_cgroup_uncharge_common+0x2c0/0x33c
  
  Looking at the code, the code flow is:
  
  page_remove_rmap() - mem_cgroup_uncharge_page() - 
  __mem_cgroup_uncharge_common()
  
  Note that in mem_cgroup_uncharge_page() the page in question passed the 
  check:
  
  [...]
  if (PageSwapCache(page))
  return;
  [...]
  
  and just a couple of instructions later the VM_BUG_ON() within
  __mem_cgroup_uncharge_common() triggers:
  
  [...]
  if (mem_cgroup_disabled())
  return NULL;
  
  VM_BUG_ON(PageSwapCache(page));
  [...]
  
  Which means that another cpu changed the pageflags concurrently. In fact,
  looking at the dump a different cpu is indeed busy with running kswapd.
 
 Hmm, maybe I am missing something but it really looks like we can race
 here. Reclaim path takes the page lock while zap_pte takes page table
 lock so nothing prevents them from racing here:
 shrink_page_list  zap_pte_range
   trylock_page  pte_offset_map_lock
   add_to_swap page_remove_rmap
 /* Page can be still mapped */
 add_to_swap_cache   atomic_add_negative(_mapcount)
   __add_to_swap_cache mem_cgroup_uncharge_page
 (PageSwapCache(page))  
 return
 SetPageSwapCache
 __mem_cgroup_uncharge_common
   VM_BUG_ON(PageSwapCache(page))
 
 Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
 do not this more often (even me testing configs are not consistent in
 that regards and only few have it on). The only thing that changed in
 this area recently is 0c59b89c which made the test VM_BUG_ON rather then
 simple return in 3.6
 And maybe the BUG_ON is too harsh as CgroupUsed should guarantee that
 the uncharge will eventually go away. What do you think Johannes?

Interesting.  We need to ensure there is ordering between setting
PG_swapcache and installing swap entries because I think we are the
only ones looking at PG_swapcache without the page lock held.  So we
don't have a safe way to check for PG_swapcache but if we get it
wrong, we may steal an uncharge that 

Re: [v3.9-rc8]: kernel BUG at mm/memcontrol.c:3994! (was: Re: [BUG][s390x] mm: system crashed)

2013-04-24 Thread Hugh Dickins
On Wed, 24 Apr 2013, Johannes Weiner wrote:
 On Wed, Apr 24, 2013 at 03:18:51PM +0200, Michal Hocko wrote:
  On Wed 24-04-13 12:42:55, Heiko Carstens wrote:
   On Thu, Apr 18, 2013 at 09:13:03AM +0200, Heiko Carstens wrote:
Ok, thanks for verifying! I'll look into it; hopefully I can reproduce 
it
here as well.
   
   That seems to be a common code bug. I can easily trigger the VM_BUG_ON()
   below (when I force the system to swap):
   
   [   48.347963] [ cut here ]
   [   48.347972] kernel BUG at mm/memcontrol.c:3994!
   [   48.348012] illegal operation: 0001 [#1] SMP 
   [   48.348015] Modules linked in:
   [   48.348017] CPU: 1 Not tainted 3.9.0-rc8+ #38
   [   48.348020] Process mmap2 (pid: 635, task: 29476100, ksp: 
   2e91b938)
   [   48.348022] Krnl PSW : 0704f0018000 0026552c 
   (__mem_cgroup_uncharge_common+0x2c4/0x33c)
   [   48.348032]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 
   PM:0 EA:3
  Krnl GPRS: 0008 0009 
   03d1002a9200 
   [   48.348039] 006812d8 
   03ffdf339000 321a6f98
   [   48.348043]03fffce11000  
   0001 03d1002a9200
   [   48.348046]0001 00681b88 
   2e91bc18 2e91bbd0
   [   48.348057] Krnl Code: 0026551e: c0e5fffaa2a1brasl   
   %r14,1b9a60
 00265524: a7f4ff7dbrc 
   15,26541e
#00265528: a7f40001brc 
   15,26552a
0026552c: e3c0b8200124stg 
   %r12,6176(%r11)
 00265532: a7f4ff57brc 
   15,2653e0
 00265536: e310b8280104lg  
   %r1,6184(%r11)
 0026553c: a71b0001aghi
   %r1,1
 00265540: e310b8280124stg 
   %r1,6184(%r11)
   [   48.348099] Call Trace:
   [   48.348100] ([03d1002a91c0] 0x3d1002a91c0)
   [   48.348102]  [002404aa] page_remove_rmap+0xf2/0x16c
   [   48.348106]  [00232dc8] unmap_single_vma+0x494/0x7d8
   [   48.348107]  [00233ac0] unmap_vmas+0x50/0x74
   [   48.348109]  [002396ec] unmap_region+0x9c/0x110
   [   48.348110]  [0023bd18] do_munmap+0x284/0x470
   [   48.348111]  [0023bf56] vm_munmap+0x52/0x70
   [   48.348113]  [0023cf32] SyS_munmap+0x3a/0x4c
   [   48.348114]  [00665e14] sysc_noemu+0x22/0x28
   [   48.348118]  [03fffcf187b2] 0x3fffcf187b2
   [   48.348119] Last Breaking-Event-Address:
   [   48.348120]  [00265528] 
   __mem_cgroup_uncharge_common+0x2c0/0x33c
   
   Looking at the code, the code flow is:
   
   page_remove_rmap() - mem_cgroup_uncharge_page() - 
   __mem_cgroup_uncharge_common()
   
   Note that in mem_cgroup_uncharge_page() the page in question passed the 
   check:
   
   [...]
   if (PageSwapCache(page))
   return;
   [...]
   
   and just a couple of instructions later the VM_BUG_ON() within
   __mem_cgroup_uncharge_common() triggers:
   
   [...]
   if (mem_cgroup_disabled())
   return NULL;
   
   VM_BUG_ON(PageSwapCache(page));
   [...]
   
   Which means that another cpu changed the pageflags concurrently. In fact,
   looking at the dump a different cpu is indeed busy with running kswapd.
  
  Hmm, maybe I am missing something but it really looks like we can race
  here. Reclaim path takes the page lock while zap_pte takes page table
  lock so nothing prevents them from racing here:
  shrink_page_listzap_pte_range
trylock_pagepte_offset_map_lock
add_to_swap   page_remove_rmap
  /* Page can be still mapped */
  add_to_swap_cache atomic_add_negative(_mapcount)
__add_to_swap_cache   mem_cgroup_uncharge_page
(PageSwapCache(page))  
  return
  SetPageSwapCache
__mem_cgroup_uncharge_common
  VM_BUG_ON(PageSwapCache(page))
  
  Maybe not many people run with CONFIG_DEBUG_VM enabled these days so we
  do not this more often (even me testing configs are not consistent in
  that regards and only few have it on). The only thing that changed in
  this area recently is 0c59b89c which made the test VM_BUG_ON rather then
  simple return in 3.6
  And maybe the BUG_ON is too harsh as CgroupUsed should guarantee that
  the uncharge will eventually go away. What do you think Johannes?
 
 Interesting.  We need to ensure there is ordering between setting
 PG_swapcache and installing swap entries because I think we are the
 only ones 

Re: [BUG][s390x] mm: system crashed

2013-04-18 Thread Heiko Carstens
On Thu, Apr 18, 2013 at 02:27:45AM -0400, Zhouping Liu wrote:
> Hello Heiko,
> > If you have some time, could you please repeat your test with the kernel
> > command line option " user_mode=home "?
> 
> I tested the system with the kernel parameter, but the issue still appeared,
> I just to say it takes longer time to reproduce the issue than the before.
> 
> > 
> > As far as I can tell there was only one s390 patch merged that was
> > mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 "s390/mm: Fix crst
> > upgrade of mmap with MAP_FIXED".
> 
> also I tested the revert commit, unluckily, the same issue as the before.

Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
here as well.

Thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-18 Thread Zhouping Liu
Hello Heiko,

- Original Message -
> From: "Heiko Carstens" 
> To: "Zhouping Liu" 
> Cc: linux...@kvack.org, "LKML" , "caiqian" 
> , "Caspar Zhang"
> , "Martin Schwidefsky" 
> Sent: Tuesday, April 16, 2013 3:50:47 PM
> Subject: Re: [BUG][s390x] mm: system crashed
> 
> On Mon, Apr 15, 2013 at 02:16:55PM +0800, Zhouping Liu wrote:
> > On 04/15/2013 01:56 PM, Heiko Carstens wrote:
> > >On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:
> > >>� 16109.346170¨ Call Trace:
> > >>� 16109.346179¨ (� <00100920>¨ show_trace+0x128/0x12c)
> > >>� 16109.346195¨  � <001cd320>¨ rcu_check_callbacks+0x458/0xccc
> > >>� 16109.346209¨  � <00140f2e>¨ update_process_times+0x4a/0x74
> > >>� 16109.346222¨  � <00199452>¨
> > >>tick_sched_handle.isra.12+0x5e/0x70
> > >>� 16109.346235¨  � <001995aa>¨ tick_sched_timer+0x6a/0x98
> > >>� 16109.346247¨  � <0015c1ea>¨ __run_hrtimer+0x8e/0x200
> > >>� 16109.346381¨  � <0015d1b2>¨ hrtimer_interrupt+0x212/0x2b0
> > >>� 16109.346385¨  � <001040f6>¨ clock_comparator_work+0x4a/0x54
> > >>� 16109.346390¨  � <0010d658>¨ do_extint+0x158/0x15c
> > >>� 16109.346396¨  � <0062aa24>¨ ext_skip+0x38/0x3c
> > >>� 16109.346404¨  � <001153c8>¨ smp_yield_cpu+0x44/0x48
> > >>� 16109.346412¨ (� <03d10051aec0>¨ 0x3d10051aec0)
> > >>� 16109.346457¨  � <0024206a>¨ __page_check_address+0x16a/0x170
> > >>� 16109.346466¨  � <002423a2>¨ page_referenced_one+0x3e/0xa0
> > >>� 16109.346501¨  � <0024427c>¨ page_referenced+0x32c/0x41c
> > >>� 16109.346510¨  � <0021b1dc>¨ shrink_page_list+0x380/0xb9c
> > >>� 16109.346521¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c
> > >>� 16109.346532¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c
> > >>� 16109.346542¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc
> > >>� 16109.346553¨  � <0021d080>¨ balance_pgdat+0x50c/0x658
> > >>� 16109.346564¨  � <0021d318>¨ kswapd+0x14c/0x470
> > >>� 16109.346576¨  � <00158292>¨ kthread+0xda/0xe4
> > >>� 16109.346656¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc
> > >>� 16109.346682¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc
> > >>[-- MARK -- Fri Apr 12 06:15:00 2013]
> > >>� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010
> > >>jiffies
> > >>  g=89766 c=89765 q=10627)
> > >Did the system really crash or did you just see the rcu related
> > >warning(s)?
> > 
> > I just check it again, actually at first the system didn't really
> > crash, but the system is very slow in response.
> > and the reproducer process can't be killed, after I did some common
> > actions such as 'ls' 'vim' etc, the system
> > seemed to be really crashed, no any response.
> > 
> > also in the previous testing, I can remember that the system would
> > be no any response for a long time, just only
> > repeatedly print out the such above 'Call Trace' into console.
> 
> Ok, thanks.
> Just a couple of more questions: did you see this also on other archs, or
> just
> s390 (if you tried other platforms at all).
> 
> If you have some time, could you please repeat your test with the kernel
> command line option " user_mode=home "?

I tested the system with the kernel parameter, but the issue still appeared,
I just to say it takes longer time to reproduce the issue than the before.

> 
> As far as I can tell there was only one s390 patch merged that was
> mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 "s390/mm: Fix crst
> upgrade of mmap with MAP_FIXED".

also I tested the revert commit, unluckily, the same issue as the before.


-- 
Thanks,
Zhouping
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-18 Thread Zhouping Liu
Hello Heiko,

- Original Message -
 From: Heiko Carstens heiko.carst...@de.ibm.com
 To: Zhouping Liu z...@redhat.com
 Cc: linux...@kvack.org, LKML linux-kernel@vger.kernel.org, caiqian 
 caiq...@redhat.com, Caspar Zhang
 czh...@redhat.com, Martin Schwidefsky schwidef...@de.ibm.com
 Sent: Tuesday, April 16, 2013 3:50:47 PM
 Subject: Re: [BUG][s390x] mm: system crashed
 
 On Mon, Apr 15, 2013 at 02:16:55PM +0800, Zhouping Liu wrote:
  On 04/15/2013 01:56 PM, Heiko Carstens wrote:
  On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:
  � 16109.346170¨ Call Trace:
  � 16109.346179¨ (� 00100920¨ show_trace+0x128/0x12c)
  � 16109.346195¨  � 001cd320¨ rcu_check_callbacks+0x458/0xccc
  � 16109.346209¨  � 00140f2e¨ update_process_times+0x4a/0x74
  � 16109.346222¨  � 00199452¨
  tick_sched_handle.isra.12+0x5e/0x70
  � 16109.346235¨  � 001995aa¨ tick_sched_timer+0x6a/0x98
  � 16109.346247¨  � 0015c1ea¨ __run_hrtimer+0x8e/0x200
  � 16109.346381¨  � 0015d1b2¨ hrtimer_interrupt+0x212/0x2b0
  � 16109.346385¨  � 001040f6¨ clock_comparator_work+0x4a/0x54
  � 16109.346390¨  � 0010d658¨ do_extint+0x158/0x15c
  � 16109.346396¨  � 0062aa24¨ ext_skip+0x38/0x3c
  � 16109.346404¨  � 001153c8¨ smp_yield_cpu+0x44/0x48
  � 16109.346412¨ (� 03d10051aec0¨ 0x3d10051aec0)
  � 16109.346457¨  � 0024206a¨ __page_check_address+0x16a/0x170
  � 16109.346466¨  � 002423a2¨ page_referenced_one+0x3e/0xa0
  � 16109.346501¨  � 0024427c¨ page_referenced+0x32c/0x41c
  � 16109.346510¨  � 0021b1dc¨ shrink_page_list+0x380/0xb9c
  � 16109.346521¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c
  � 16109.346532¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c
  � 16109.346542¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc
  � 16109.346553¨  � 0021d080¨ balance_pgdat+0x50c/0x658
  � 16109.346564¨  � 0021d318¨ kswapd+0x14c/0x470
  � 16109.346576¨  � 00158292¨ kthread+0xda/0xe4
  � 16109.346656¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc
  � 16109.346682¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc
  [-- MARK -- Fri Apr 12 06:15:00 2013]
  � 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010
  jiffies
g=89766 c=89765 q=10627)
  Did the system really crash or did you just see the rcu related
  warning(s)?
  
  I just check it again, actually at first the system didn't really
  crash, but the system is very slow in response.
  and the reproducer process can't be killed, after I did some common
  actions such as 'ls' 'vim' etc, the system
  seemed to be really crashed, no any response.
  
  also in the previous testing, I can remember that the system would
  be no any response for a long time, just only
  repeatedly print out the such above 'Call Trace' into console.
 
 Ok, thanks.
 Just a couple of more questions: did you see this also on other archs, or
 just
 s390 (if you tried other platforms at all).
 
 If you have some time, could you please repeat your test with the kernel
 command line option  user_mode=home ?

I tested the system with the kernel parameter, but the issue still appeared,
I just to say it takes longer time to reproduce the issue than the before.

 
 As far as I can tell there was only one s390 patch merged that was
 mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 s390/mm: Fix crst
 upgrade of mmap with MAP_FIXED.

also I tested the revert commit, unluckily, the same issue as the before.


-- 
Thanks,
Zhouping
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-18 Thread Heiko Carstens
On Thu, Apr 18, 2013 at 02:27:45AM -0400, Zhouping Liu wrote:
 Hello Heiko,
  If you have some time, could you please repeat your test with the kernel
  command line option  user_mode=home ?
 
 I tested the system with the kernel parameter, but the issue still appeared,
 I just to say it takes longer time to reproduce the issue than the before.
 
  
  As far as I can tell there was only one s390 patch merged that was
  mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 s390/mm: Fix crst
  upgrade of mmap with MAP_FIXED.
 
 also I tested the revert commit, unluckily, the same issue as the before.

Ok, thanks for verifying! I'll look into it; hopefully I can reproduce it
here as well.

Thanks!

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-16 Thread Heiko Carstens
On Tue, Apr 16, 2013 at 03:56:59PM +0800, Simon Jeons wrote:
> Hi Heiko,
> >If you have some time, could you please repeat your test with the kernel
> >command line option " user_mode=home "?
> 
> What's the meaning of this command line? I can't find it in
> Documentation/kernel-parameters.txt/

It switches the architectural address space where kernel and user space
reside in.
We only recently switched the default address space for user space from
home space to primary space, since that's needed for kvm.
The user space runs in home space mode will be removed in the future; we
keep it currently as fallback, just in case something breaks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-16 Thread Simon Jeons

Hi Heiko,
On 04/16/2013 03:50 PM, Heiko Carstens wrote:

On Mon, Apr 15, 2013 at 02:16:55PM +0800, Zhouping Liu wrote:

On 04/15/2013 01:56 PM, Heiko Carstens wrote:

On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:

� 16109.346170¨ Call Trace:
� 16109.346179¨ (� <00100920>¨ show_trace+0x128/0x12c)
� 16109.346195¨  � <001cd320>¨ rcu_check_callbacks+0x458/0xccc
� 16109.346209¨  � <00140f2e>¨ update_process_times+0x4a/0x74
� 16109.346222¨  � <00199452>¨ tick_sched_handle.isra.12+0x5e/0x70
� 16109.346235¨  � <001995aa>¨ tick_sched_timer+0x6a/0x98
� 16109.346247¨  � <0015c1ea>¨ __run_hrtimer+0x8e/0x200
� 16109.346381¨  � <0015d1b2>¨ hrtimer_interrupt+0x212/0x2b0
� 16109.346385¨  � <001040f6>¨ clock_comparator_work+0x4a/0x54
� 16109.346390¨  � <0010d658>¨ do_extint+0x158/0x15c
� 16109.346396¨  � <0062aa24>¨ ext_skip+0x38/0x3c
� 16109.346404¨  � <001153c8>¨ smp_yield_cpu+0x44/0x48
� 16109.346412¨ (� <03d10051aec0>¨ 0x3d10051aec0)
� 16109.346457¨  � <0024206a>¨ __page_check_address+0x16a/0x170
� 16109.346466¨  � <002423a2>¨ page_referenced_one+0x3e/0xa0
� 16109.346501¨  � <0024427c>¨ page_referenced+0x32c/0x41c
� 16109.346510¨  � <0021b1dc>¨ shrink_page_list+0x380/0xb9c
� 16109.346521¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c
� 16109.346532¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c
� 16109.346542¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc
� 16109.346553¨  � <0021d080>¨ balance_pgdat+0x50c/0x658
� 16109.346564¨  � <0021d318>¨ kswapd+0x14c/0x470
� 16109.346576¨  � <00158292>¨ kthread+0xda/0xe4
� 16109.346656¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc
� 16109.346682¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc
[-- MARK -- Fri Apr 12 06:15:00 2013]
� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
jiffies
  g=89766 c=89765 q=10627)

Did the system really crash or did you just see the rcu related warning(s)?

I just check it again, actually at first the system didn't really
crash, but the system is very slow in response.
and the reproducer process can't be killed, after I did some common
actions such as 'ls' 'vim' etc, the system
seemed to be really crashed, no any response.

also in the previous testing, I can remember that the system would
be no any response for a long time, just only
repeatedly print out the such above 'Call Trace' into console.

Ok, thanks.
Just a couple of more questions: did you see this also on other archs, or just
s390 (if you tried other platforms at all).

If you have some time, could you please repeat your test with the kernel
command line option " user_mode=home "?


What's the meaning of this command line? I can't find it in 
Documentation/kernel-parameters.txt/




As far as I can tell there was only one s390 patch merged that was
mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 "s390/mm: Fix crst
upgrade of mmap with MAP_FIXED".
Even though I don't think it explains the bug you've seen it might be worth
to try to revert it.

And at last, can you share your kernel config?

Thanks,
Heiko

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: mailto:"d...@kvack.org;> em...@kvack.org 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-16 Thread Heiko Carstens
On Mon, Apr 15, 2013 at 02:16:55PM +0800, Zhouping Liu wrote:
> On 04/15/2013 01:56 PM, Heiko Carstens wrote:
> >On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:
> >>� 16109.346170¨ Call Trace:
> >>� 16109.346179¨ (� <00100920>¨ show_trace+0x128/0x12c)
> >>� 16109.346195¨  � <001cd320>¨ rcu_check_callbacks+0x458/0xccc
> >>� 16109.346209¨  � <00140f2e>¨ update_process_times+0x4a/0x74
> >>� 16109.346222¨  � <00199452>¨ tick_sched_handle.isra.12+0x5e/0x70
> >>� 16109.346235¨  � <001995aa>¨ tick_sched_timer+0x6a/0x98
> >>� 16109.346247¨  � <0015c1ea>¨ __run_hrtimer+0x8e/0x200
> >>� 16109.346381¨  � <0015d1b2>¨ hrtimer_interrupt+0x212/0x2b0
> >>� 16109.346385¨  � <001040f6>¨ clock_comparator_work+0x4a/0x54
> >>� 16109.346390¨  � <0010d658>¨ do_extint+0x158/0x15c
> >>� 16109.346396¨  � <0062aa24>¨ ext_skip+0x38/0x3c
> >>� 16109.346404¨  � <001153c8>¨ smp_yield_cpu+0x44/0x48
> >>� 16109.346412¨ (� <03d10051aec0>¨ 0x3d10051aec0)
> >>� 16109.346457¨  � <0024206a>¨ __page_check_address+0x16a/0x170
> >>� 16109.346466¨  � <002423a2>¨ page_referenced_one+0x3e/0xa0
> >>� 16109.346501¨  � <0024427c>¨ page_referenced+0x32c/0x41c
> >>� 16109.346510¨  � <0021b1dc>¨ shrink_page_list+0x380/0xb9c
> >>� 16109.346521¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c
> >>� 16109.346532¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c
> >>� 16109.346542¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc
> >>� 16109.346553¨  � <0021d080>¨ balance_pgdat+0x50c/0x658
> >>� 16109.346564¨  � <0021d318>¨ kswapd+0x14c/0x470
> >>� 16109.346576¨  � <00158292>¨ kthread+0xda/0xe4
> >>� 16109.346656¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc
> >>� 16109.346682¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc
> >>[-- MARK -- Fri Apr 12 06:15:00 2013]
> >>� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
> >>jiffies
> >>  g=89766 c=89765 q=10627)
> >Did the system really crash or did you just see the rcu related warning(s)?
> 
> I just check it again, actually at first the system didn't really
> crash, but the system is very slow in response.
> and the reproducer process can't be killed, after I did some common
> actions such as 'ls' 'vim' etc, the system
> seemed to be really crashed, no any response.
> 
> also in the previous testing, I can remember that the system would
> be no any response for a long time, just only
> repeatedly print out the such above 'Call Trace' into console.

Ok, thanks.
Just a couple of more questions: did you see this also on other archs, or just
s390 (if you tried other platforms at all).

If you have some time, could you please repeat your test with the kernel
command line option " user_mode=home "?

As far as I can tell there was only one s390 patch merged that was
mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 "s390/mm: Fix crst
upgrade of mmap with MAP_FIXED".
Even though I don't think it explains the bug you've seen it might be worth
to try to revert it.

And at last, can you share your kernel config?

Thanks,
Heiko

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-16 Thread Heiko Carstens
On Mon, Apr 15, 2013 at 02:16:55PM +0800, Zhouping Liu wrote:
 On 04/15/2013 01:56 PM, Heiko Carstens wrote:
 On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:
 � 16109.346170¨ Call Trace:
 � 16109.346179¨ (� 00100920¨ show_trace+0x128/0x12c)
 � 16109.346195¨  � 001cd320¨ rcu_check_callbacks+0x458/0xccc
 � 16109.346209¨  � 00140f2e¨ update_process_times+0x4a/0x74
 � 16109.346222¨  � 00199452¨ tick_sched_handle.isra.12+0x5e/0x70
 � 16109.346235¨  � 001995aa¨ tick_sched_timer+0x6a/0x98
 � 16109.346247¨  � 0015c1ea¨ __run_hrtimer+0x8e/0x200
 � 16109.346381¨  � 0015d1b2¨ hrtimer_interrupt+0x212/0x2b0
 � 16109.346385¨  � 001040f6¨ clock_comparator_work+0x4a/0x54
 � 16109.346390¨  � 0010d658¨ do_extint+0x158/0x15c
 � 16109.346396¨  � 0062aa24¨ ext_skip+0x38/0x3c
 � 16109.346404¨  � 001153c8¨ smp_yield_cpu+0x44/0x48
 � 16109.346412¨ (� 03d10051aec0¨ 0x3d10051aec0)
 � 16109.346457¨  � 0024206a¨ __page_check_address+0x16a/0x170
 � 16109.346466¨  � 002423a2¨ page_referenced_one+0x3e/0xa0
 � 16109.346501¨  � 0024427c¨ page_referenced+0x32c/0x41c
 � 16109.346510¨  � 0021b1dc¨ shrink_page_list+0x380/0xb9c
 � 16109.346521¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c
 � 16109.346532¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c
 � 16109.346542¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc
 � 16109.346553¨  � 0021d080¨ balance_pgdat+0x50c/0x658
 � 16109.346564¨  � 0021d318¨ kswapd+0x14c/0x470
 � 16109.346576¨  � 00158292¨ kthread+0xda/0xe4
 � 16109.346656¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc
 � 16109.346682¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc
 [-- MARK -- Fri Apr 12 06:15:00 2013]
 � 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
 jiffies
   g=89766 c=89765 q=10627)
 Did the system really crash or did you just see the rcu related warning(s)?
 
 I just check it again, actually at first the system didn't really
 crash, but the system is very slow in response.
 and the reproducer process can't be killed, after I did some common
 actions such as 'ls' 'vim' etc, the system
 seemed to be really crashed, no any response.
 
 also in the previous testing, I can remember that the system would
 be no any response for a long time, just only
 repeatedly print out the such above 'Call Trace' into console.

Ok, thanks.
Just a couple of more questions: did you see this also on other archs, or just
s390 (if you tried other platforms at all).

If you have some time, could you please repeat your test with the kernel
command line option  user_mode=home ?

As far as I can tell there was only one s390 patch merged that was
mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 s390/mm: Fix crst
upgrade of mmap with MAP_FIXED.
Even though I don't think it explains the bug you've seen it might be worth
to try to revert it.

And at last, can you share your kernel config?

Thanks,
Heiko

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-16 Thread Simon Jeons

Hi Heiko,
On 04/16/2013 03:50 PM, Heiko Carstens wrote:

On Mon, Apr 15, 2013 at 02:16:55PM +0800, Zhouping Liu wrote:

On 04/15/2013 01:56 PM, Heiko Carstens wrote:

On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:

� 16109.346170¨ Call Trace:
� 16109.346179¨ (� 00100920¨ show_trace+0x128/0x12c)
� 16109.346195¨  � 001cd320¨ rcu_check_callbacks+0x458/0xccc
� 16109.346209¨  � 00140f2e¨ update_process_times+0x4a/0x74
� 16109.346222¨  � 00199452¨ tick_sched_handle.isra.12+0x5e/0x70
� 16109.346235¨  � 001995aa¨ tick_sched_timer+0x6a/0x98
� 16109.346247¨  � 0015c1ea¨ __run_hrtimer+0x8e/0x200
� 16109.346381¨  � 0015d1b2¨ hrtimer_interrupt+0x212/0x2b0
� 16109.346385¨  � 001040f6¨ clock_comparator_work+0x4a/0x54
� 16109.346390¨  � 0010d658¨ do_extint+0x158/0x15c
� 16109.346396¨  � 0062aa24¨ ext_skip+0x38/0x3c
� 16109.346404¨  � 001153c8¨ smp_yield_cpu+0x44/0x48
� 16109.346412¨ (� 03d10051aec0¨ 0x3d10051aec0)
� 16109.346457¨  � 0024206a¨ __page_check_address+0x16a/0x170
� 16109.346466¨  � 002423a2¨ page_referenced_one+0x3e/0xa0
� 16109.346501¨  � 0024427c¨ page_referenced+0x32c/0x41c
� 16109.346510¨  � 0021b1dc¨ shrink_page_list+0x380/0xb9c
� 16109.346521¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c
� 16109.346532¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c
� 16109.346542¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc
� 16109.346553¨  � 0021d080¨ balance_pgdat+0x50c/0x658
� 16109.346564¨  � 0021d318¨ kswapd+0x14c/0x470
� 16109.346576¨  � 00158292¨ kthread+0xda/0xe4
� 16109.346656¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc
� 16109.346682¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc
[-- MARK -- Fri Apr 12 06:15:00 2013]
� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
jiffies
  g=89766 c=89765 q=10627)

Did the system really crash or did you just see the rcu related warning(s)?

I just check it again, actually at first the system didn't really
crash, but the system is very slow in response.
and the reproducer process can't be killed, after I did some common
actions such as 'ls' 'vim' etc, the system
seemed to be really crashed, no any response.

also in the previous testing, I can remember that the system would
be no any response for a long time, just only
repeatedly print out the such above 'Call Trace' into console.

Ok, thanks.
Just a couple of more questions: did you see this also on other archs, or just
s390 (if you tried other platforms at all).

If you have some time, could you please repeat your test with the kernel
command line option  user_mode=home ?


What's the meaning of this command line? I can't find it in 
Documentation/kernel-parameters.txt/




As far as I can tell there was only one s390 patch merged that was
mmap related: 486c0a0bc80d370471b21662bf03f04fbb37cdc6 s390/mm: Fix crst
upgrade of mmap with MAP_FIXED.
Even though I don't think it explains the bug you've seen it might be worth
to try to revert it.

And at last, can you share your kernel config?

Thanks,
Heiko

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: a href=mailto:d...@kvack.org; em...@kvack.org /a


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-16 Thread Heiko Carstens
On Tue, Apr 16, 2013 at 03:56:59PM +0800, Simon Jeons wrote:
 Hi Heiko,
 If you have some time, could you please repeat your test with the kernel
 command line option  user_mode=home ?
 
 What's the meaning of this command line? I can't find it in
 Documentation/kernel-parameters.txt/

It switches the architectural address space where kernel and user space
reside in.
We only recently switched the default address space for user space from
home space to primary space, since that's needed for kvm.
The user space runs in home space mode will be removed in the future; we
keep it currently as fallback, just in case something breaks.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-15 Thread Zhouping Liu

On 04/15/2013 01:56 PM, Heiko Carstens wrote:

On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:

Hi All,

I hit the below crashed when doing memory related tests[1] on s390x:

--- snip -
� 15929.351639¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c
� 15929.351647¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c
� 15929.351654¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc
� 15929.351662¨  � <0021d080>¨ balance_pgdat+0x50c/0x658
� 15929.351671¨  � <0021d318>¨ kswapd+0x14c/0x470
� 15929.351680¨  � <00158292>¨ kthread+0xda/0xe4
� 15929.351690¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc
� 15929.351700¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc
� 16109.346061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=24006 
jiffies
  g=89766 c=89765 q=10544)
� 16109.346101¨ CPU: 0 Tainted: G  D  3.9.0-rc6+ #1
� 16109.346106¨ Process kswapd0 (pid: 28, task: 3b2a, ksp: 
3b
2ab8c0)
� 16109.346110¨0001bb60 0001bb70 0002 
000
0
0001bc00 0001bb78 0001bb78 001009ca
 2930 000a 000a
0001bbc0 0001bb60  
0063bb18 001009ca 0001bb60 0001bbb0
� 16109.346170¨ Call Trace:
� 16109.346179¨ (� <00100920>¨ show_trace+0x128/0x12c)
� 16109.346195¨  � <001cd320>¨ rcu_check_callbacks+0x458/0xccc
� 16109.346209¨  � <00140f2e>¨ update_process_times+0x4a/0x74
� 16109.346222¨  � <00199452>¨ tick_sched_handle.isra.12+0x5e/0x70
� 16109.346235¨  � <001995aa>¨ tick_sched_timer+0x6a/0x98
� 16109.346247¨  � <0015c1ea>¨ __run_hrtimer+0x8e/0x200
� 16109.346381¨  � <0015d1b2>¨ hrtimer_interrupt+0x212/0x2b0
� 16109.346385¨  � <001040f6>¨ clock_comparator_work+0x4a/0x54
� 16109.346390¨  � <0010d658>¨ do_extint+0x158/0x15c
� 16109.346396¨  � <0062aa24>¨ ext_skip+0x38/0x3c
� 16109.346404¨  � <001153c8>¨ smp_yield_cpu+0x44/0x48
� 16109.346412¨ (� <03d10051aec0>¨ 0x3d10051aec0)
� 16109.346457¨  � <0024206a>¨ __page_check_address+0x16a/0x170
� 16109.346466¨  � <002423a2>¨ page_referenced_one+0x3e/0xa0
� 16109.346501¨  � <0024427c>¨ page_referenced+0x32c/0x41c
� 16109.346510¨  � <0021b1dc>¨ shrink_page_list+0x380/0xb9c
� 16109.346521¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c
� 16109.346532¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c
� 16109.346542¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc
� 16109.346553¨  � <0021d080>¨ balance_pgdat+0x50c/0x658
� 16109.346564¨  � <0021d318>¨ kswapd+0x14c/0x470
� 16109.346576¨  � <00158292>¨ kthread+0xda/0xe4
� 16109.346656¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc
� 16109.346682¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc
[-- MARK -- Fri Apr 12 06:15:00 2013]
� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
jiffies
  g=89766 c=89765 q=10627)

Did the system really crash or did you just see the rcu related warning(s)?


I just check it again, actually at first the system didn't really crash, 
but the system is very slow in response.
and the reproducer process can't be killed, after I did some common 
actions such as 'ls' 'vim' etc, the system

seemed to be really crashed, no any response.

also in the previous testing, I can remember that the system would be no 
any response for a long time, just only

repeatedly print out the such above 'Call Trace' into console.

Thanks,
Zhouping
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-15 Thread Zhouping Liu

On 04/15/2013 01:56 PM, Heiko Carstens wrote:

On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:

Hi All,

I hit the below crashed when doing memory related tests[1] on s390x:

--- snip -
� 15929.351639¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c
� 15929.351647¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c
� 15929.351654¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc
� 15929.351662¨  � 0021d080¨ balance_pgdat+0x50c/0x658
� 15929.351671¨  � 0021d318¨ kswapd+0x14c/0x470
� 15929.351680¨  � 00158292¨ kthread+0xda/0xe4
� 15929.351690¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc
� 15929.351700¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc
� 16109.346061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=24006 
jiffies
  g=89766 c=89765 q=10544)
� 16109.346101¨ CPU: 0 Tainted: G  D  3.9.0-rc6+ #1
� 16109.346106¨ Process kswapd0 (pid: 28, task: 3b2a, ksp: 
3b
2ab8c0)
� 16109.346110¨0001bb60 0001bb70 0002 
000
0
0001bc00 0001bb78 0001bb78 001009ca
 2930 000a 000a
0001bbc0 0001bb60  
0063bb18 001009ca 0001bb60 0001bbb0
� 16109.346170¨ Call Trace:
� 16109.346179¨ (� 00100920¨ show_trace+0x128/0x12c)
� 16109.346195¨  � 001cd320¨ rcu_check_callbacks+0x458/0xccc
� 16109.346209¨  � 00140f2e¨ update_process_times+0x4a/0x74
� 16109.346222¨  � 00199452¨ tick_sched_handle.isra.12+0x5e/0x70
� 16109.346235¨  � 001995aa¨ tick_sched_timer+0x6a/0x98
� 16109.346247¨  � 0015c1ea¨ __run_hrtimer+0x8e/0x200
� 16109.346381¨  � 0015d1b2¨ hrtimer_interrupt+0x212/0x2b0
� 16109.346385¨  � 001040f6¨ clock_comparator_work+0x4a/0x54
� 16109.346390¨  � 0010d658¨ do_extint+0x158/0x15c
� 16109.346396¨  � 0062aa24¨ ext_skip+0x38/0x3c
� 16109.346404¨  � 001153c8¨ smp_yield_cpu+0x44/0x48
� 16109.346412¨ (� 03d10051aec0¨ 0x3d10051aec0)
� 16109.346457¨  � 0024206a¨ __page_check_address+0x16a/0x170
� 16109.346466¨  � 002423a2¨ page_referenced_one+0x3e/0xa0
� 16109.346501¨  � 0024427c¨ page_referenced+0x32c/0x41c
� 16109.346510¨  � 0021b1dc¨ shrink_page_list+0x380/0xb9c
� 16109.346521¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c
� 16109.346532¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c
� 16109.346542¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc
� 16109.346553¨  � 0021d080¨ balance_pgdat+0x50c/0x658
� 16109.346564¨  � 0021d318¨ kswapd+0x14c/0x470
� 16109.346576¨  � 00158292¨ kthread+0xda/0xe4
� 16109.346656¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc
� 16109.346682¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc
[-- MARK -- Fri Apr 12 06:15:00 2013]
� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
jiffies
  g=89766 c=89765 q=10627)

Did the system really crash or did you just see the rcu related warning(s)?


I just check it again, actually at first the system didn't really crash, 
but the system is very slow in response.
and the reproducer process can't be killed, after I did some common 
actions such as 'ls' 'vim' etc, the system

seemed to be really crashed, no any response.

also in the previous testing, I can remember that the system would be no 
any response for a long time, just only

repeatedly print out the such above 'Call Trace' into console.

Thanks,
Zhouping
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-14 Thread Heiko Carstens
On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:
> Hi All,
> 
> I hit the below crashed when doing memory related tests[1] on s390x:
> 
> --- snip -
> � 15929.351639¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c 
> � 15929.351647¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c 
> � 15929.351654¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc 
> � 15929.351662¨  � <0021d080>¨ balance_pgdat+0x50c/0x658 
> � 15929.351671¨  � <0021d318>¨ kswapd+0x14c/0x470 
> � 15929.351680¨  � <00158292>¨ kthread+0xda/0xe4 
> � 15929.351690¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc 
> � 15929.351700¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc 
> � 16109.346061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=24006 
> jiffies 
>  g=89766 c=89765 q=10544) 
> � 16109.346101¨ CPU: 0 Tainted: G  D  3.9.0-rc6+ #1 
> � 16109.346106¨ Process kswapd0 (pid: 28, task: 3b2a, ksp: 
> 3b 
> 2ab8c0) 
> � 16109.346110¨0001bb60 0001bb70 0002 
> 000 
> 0 
>0001bc00 0001bb78 0001bb78 001009ca 
> 2930 000a 000a 
>0001bbc0 0001bb60   
>0063bb18 001009ca 0001bb60 0001bbb0 
> � 16109.346170¨ Call Trace: 
> � 16109.346179¨ (� <00100920>¨ show_trace+0x128/0x12c) 
> � 16109.346195¨  � <001cd320>¨ rcu_check_callbacks+0x458/0xccc 
> � 16109.346209¨  � <00140f2e>¨ update_process_times+0x4a/0x74 
> � 16109.346222¨  � <00199452>¨ tick_sched_handle.isra.12+0x5e/0x70 
> � 16109.346235¨  � <001995aa>¨ tick_sched_timer+0x6a/0x98 
> � 16109.346247¨  � <0015c1ea>¨ __run_hrtimer+0x8e/0x200 
> � 16109.346381¨  � <0015d1b2>¨ hrtimer_interrupt+0x212/0x2b0 
> � 16109.346385¨  � <001040f6>¨ clock_comparator_work+0x4a/0x54 
> � 16109.346390¨  � <0010d658>¨ do_extint+0x158/0x15c 
> � 16109.346396¨  � <0062aa24>¨ ext_skip+0x38/0x3c 
> � 16109.346404¨  � <001153c8>¨ smp_yield_cpu+0x44/0x48 
> � 16109.346412¨ (� <03d10051aec0>¨ 0x3d10051aec0) 
> � 16109.346457¨  � <0024206a>¨ __page_check_address+0x16a/0x170 
> � 16109.346466¨  � <002423a2>¨ page_referenced_one+0x3e/0xa0 
> � 16109.346501¨  � <0024427c>¨ page_referenced+0x32c/0x41c 
> � 16109.346510¨  � <0021b1dc>¨ shrink_page_list+0x380/0xb9c 
> � 16109.346521¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c 
> � 16109.346532¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c 
> � 16109.346542¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc 
> � 16109.346553¨  � <0021d080>¨ balance_pgdat+0x50c/0x658 
> � 16109.346564¨  � <0021d318>¨ kswapd+0x14c/0x470 
> � 16109.346576¨  � <00158292>¨ kthread+0xda/0xe4 
> � 16109.346656¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc 
> � 16109.346682¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc 
> [-- MARK -- Fri Apr 12 06:15:00 2013] 
> � 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
> jiffies 
>  g=89766 c=89765 q=10627) 

Did the system really crash or did you just see the rcu related warning(s)?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG][s390x] mm: system crashed

2013-04-14 Thread Zhouping Liu
Hi All,

I hit the below crashed when doing memory related tests[1] on s390x:

--- snip -
� 15929.351639¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c 
� 15929.351647¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c 
� 15929.351654¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc 
� 15929.351662¨  � <0021d080>¨ balance_pgdat+0x50c/0x658 
� 15929.351671¨  � <0021d318>¨ kswapd+0x14c/0x470 
� 15929.351680¨  � <00158292>¨ kthread+0xda/0xe4 
� 15929.351690¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc 
� 15929.351700¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc 
� 16109.346061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=24006 
jiffies 
 g=89766 c=89765 q=10544) 
� 16109.346101¨ CPU: 0 Tainted: G  D  3.9.0-rc6+ #1 
� 16109.346106¨ Process kswapd0 (pid: 28, task: 3b2a, ksp: 
3b 
2ab8c0) 
� 16109.346110¨0001bb60 0001bb70 0002 
000 
0 
   0001bc00 0001bb78 0001bb78 001009ca 
    2930 000a 000a 
   0001bbc0 0001bb60   
   0063bb18 001009ca 0001bb60 0001bbb0 
� 16109.346170¨ Call Trace: 
� 16109.346179¨ (� <00100920>¨ show_trace+0x128/0x12c) 
� 16109.346195¨  � <001cd320>¨ rcu_check_callbacks+0x458/0xccc 
� 16109.346209¨  � <00140f2e>¨ update_process_times+0x4a/0x74 
� 16109.346222¨  � <00199452>¨ tick_sched_handle.isra.12+0x5e/0x70 
� 16109.346235¨  � <001995aa>¨ tick_sched_timer+0x6a/0x98 
� 16109.346247¨  � <0015c1ea>¨ __run_hrtimer+0x8e/0x200 
� 16109.346381¨  � <0015d1b2>¨ hrtimer_interrupt+0x212/0x2b0 
� 16109.346385¨  � <001040f6>¨ clock_comparator_work+0x4a/0x54 
� 16109.346390¨  � <0010d658>¨ do_extint+0x158/0x15c 
� 16109.346396¨  � <0062aa24>¨ ext_skip+0x38/0x3c 
� 16109.346404¨  � <001153c8>¨ smp_yield_cpu+0x44/0x48 
� 16109.346412¨ (� <03d10051aec0>¨ 0x3d10051aec0) 
� 16109.346457¨  � <0024206a>¨ __page_check_address+0x16a/0x170 
� 16109.346466¨  � <002423a2>¨ page_referenced_one+0x3e/0xa0 
� 16109.346501¨  � <0024427c>¨ page_referenced+0x32c/0x41c 
� 16109.346510¨  � <0021b1dc>¨ shrink_page_list+0x380/0xb9c 
� 16109.346521¨  � <0021c0a6>¨ shrink_inactive_list+0x1c6/0x56c 
� 16109.346532¨  � <0021c69e>¨ shrink_lruvec+0x252/0x56c 
� 16109.346542¨  � <0021ca44>¨ shrink_zone+0x8c/0x1bc 
� 16109.346553¨  � <0021d080>¨ balance_pgdat+0x50c/0x658 
� 16109.346564¨  � <0021d318>¨ kswapd+0x14c/0x470 
� 16109.346576¨  � <00158292>¨ kthread+0xda/0xe4 
� 16109.346656¨  � <0062a5de>¨ kernel_thread_starter+0x6/0xc 
� 16109.346682¨  � <0062a5d8>¨ kernel_thread_starter+0x0/0xc 
[-- MARK -- Fri Apr 12 06:15:00 2013] 
� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
jiffies 
 g=89766 c=89765 q=10627) 
-- snip --

The testing system has 1Gb RAM, kernel is new latest mainline.
please let me know if you need any more info.

[1] reproducer is come from LTP: 
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap2.c
and execute it using this command: `./mmap2 -x 0.002 -a -p`

-- 
Thanks,
Zhouping
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG][s390x] mm: system crashed

2013-04-14 Thread Zhouping Liu
Hi All,

I hit the below crashed when doing memory related tests[1] on s390x:

--- snip -
� 15929.351639¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c 
� 15929.351647¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c 
� 15929.351654¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc 
� 15929.351662¨  � 0021d080¨ balance_pgdat+0x50c/0x658 
� 15929.351671¨  � 0021d318¨ kswapd+0x14c/0x470 
� 15929.351680¨  � 00158292¨ kthread+0xda/0xe4 
� 15929.351690¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc 
� 15929.351700¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc 
� 16109.346061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=24006 
jiffies 
 g=89766 c=89765 q=10544) 
� 16109.346101¨ CPU: 0 Tainted: G  D  3.9.0-rc6+ #1 
� 16109.346106¨ Process kswapd0 (pid: 28, task: 3b2a, ksp: 
3b 
2ab8c0) 
� 16109.346110¨0001bb60 0001bb70 0002 
000 
0 
   0001bc00 0001bb78 0001bb78 001009ca 
    2930 000a 000a 
   0001bbc0 0001bb60   
   0063bb18 001009ca 0001bb60 0001bbb0 
� 16109.346170¨ Call Trace: 
� 16109.346179¨ (� 00100920¨ show_trace+0x128/0x12c) 
� 16109.346195¨  � 001cd320¨ rcu_check_callbacks+0x458/0xccc 
� 16109.346209¨  � 00140f2e¨ update_process_times+0x4a/0x74 
� 16109.346222¨  � 00199452¨ tick_sched_handle.isra.12+0x5e/0x70 
� 16109.346235¨  � 001995aa¨ tick_sched_timer+0x6a/0x98 
� 16109.346247¨  � 0015c1ea¨ __run_hrtimer+0x8e/0x200 
� 16109.346381¨  � 0015d1b2¨ hrtimer_interrupt+0x212/0x2b0 
� 16109.346385¨  � 001040f6¨ clock_comparator_work+0x4a/0x54 
� 16109.346390¨  � 0010d658¨ do_extint+0x158/0x15c 
� 16109.346396¨  � 0062aa24¨ ext_skip+0x38/0x3c 
� 16109.346404¨  � 001153c8¨ smp_yield_cpu+0x44/0x48 
� 16109.346412¨ (� 03d10051aec0¨ 0x3d10051aec0) 
� 16109.346457¨  � 0024206a¨ __page_check_address+0x16a/0x170 
� 16109.346466¨  � 002423a2¨ page_referenced_one+0x3e/0xa0 
� 16109.346501¨  � 0024427c¨ page_referenced+0x32c/0x41c 
� 16109.346510¨  � 0021b1dc¨ shrink_page_list+0x380/0xb9c 
� 16109.346521¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c 
� 16109.346532¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c 
� 16109.346542¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc 
� 16109.346553¨  � 0021d080¨ balance_pgdat+0x50c/0x658 
� 16109.346564¨  � 0021d318¨ kswapd+0x14c/0x470 
� 16109.346576¨  � 00158292¨ kthread+0xda/0xe4 
� 16109.346656¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc 
� 16109.346682¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc 
[-- MARK -- Fri Apr 12 06:15:00 2013] 
� 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
jiffies 
 g=89766 c=89765 q=10627) 
-- snip --

The testing system has 1Gb RAM, kernel is new latest mainline.
please let me know if you need any more info.

[1] reproducer is come from LTP: 
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap2.c
and execute it using this command: `./mmap2 -x 0.002 -a -p`

-- 
Thanks,
Zhouping
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG][s390x] mm: system crashed

2013-04-14 Thread Heiko Carstens
On Sun, Apr 14, 2013 at 11:28:40PM -0400, Zhouping Liu wrote:
 Hi All,
 
 I hit the below crashed when doing memory related tests[1] on s390x:
 
 --- snip -
 � 15929.351639¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c 
 � 15929.351647¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c 
 � 15929.351654¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc 
 � 15929.351662¨  � 0021d080¨ balance_pgdat+0x50c/0x658 
 � 15929.351671¨  � 0021d318¨ kswapd+0x14c/0x470 
 � 15929.351680¨  � 00158292¨ kthread+0xda/0xe4 
 � 15929.351690¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc 
 � 15929.351700¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc 
 � 16109.346061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=24006 
 jiffies 
  g=89766 c=89765 q=10544) 
 � 16109.346101¨ CPU: 0 Tainted: G  D  3.9.0-rc6+ #1 
 � 16109.346106¨ Process kswapd0 (pid: 28, task: 3b2a, ksp: 
 3b 
 2ab8c0) 
 � 16109.346110¨0001bb60 0001bb70 0002 
 000 
 0 
0001bc00 0001bb78 0001bb78 001009ca 
 2930 000a 000a 
0001bbc0 0001bb60   
0063bb18 001009ca 0001bb60 0001bbb0 
 � 16109.346170¨ Call Trace: 
 � 16109.346179¨ (� 00100920¨ show_trace+0x128/0x12c) 
 � 16109.346195¨  � 001cd320¨ rcu_check_callbacks+0x458/0xccc 
 � 16109.346209¨  � 00140f2e¨ update_process_times+0x4a/0x74 
 � 16109.346222¨  � 00199452¨ tick_sched_handle.isra.12+0x5e/0x70 
 � 16109.346235¨  � 001995aa¨ tick_sched_timer+0x6a/0x98 
 � 16109.346247¨  � 0015c1ea¨ __run_hrtimer+0x8e/0x200 
 � 16109.346381¨  � 0015d1b2¨ hrtimer_interrupt+0x212/0x2b0 
 � 16109.346385¨  � 001040f6¨ clock_comparator_work+0x4a/0x54 
 � 16109.346390¨  � 0010d658¨ do_extint+0x158/0x15c 
 � 16109.346396¨  � 0062aa24¨ ext_skip+0x38/0x3c 
 � 16109.346404¨  � 001153c8¨ smp_yield_cpu+0x44/0x48 
 � 16109.346412¨ (� 03d10051aec0¨ 0x3d10051aec0) 
 � 16109.346457¨  � 0024206a¨ __page_check_address+0x16a/0x170 
 � 16109.346466¨  � 002423a2¨ page_referenced_one+0x3e/0xa0 
 � 16109.346501¨  � 0024427c¨ page_referenced+0x32c/0x41c 
 � 16109.346510¨  � 0021b1dc¨ shrink_page_list+0x380/0xb9c 
 � 16109.346521¨  � 0021c0a6¨ shrink_inactive_list+0x1c6/0x56c 
 � 16109.346532¨  � 0021c69e¨ shrink_lruvec+0x252/0x56c 
 � 16109.346542¨  � 0021ca44¨ shrink_zone+0x8c/0x1bc 
 � 16109.346553¨  � 0021d080¨ balance_pgdat+0x50c/0x658 
 � 16109.346564¨  � 0021d318¨ kswapd+0x14c/0x470 
 � 16109.346576¨  � 00158292¨ kthread+0xda/0xe4 
 � 16109.346656¨  � 0062a5de¨ kernel_thread_starter+0x6/0xc 
 � 16109.346682¨  � 0062a5d8¨ kernel_thread_starter+0x0/0xc 
 [-- MARK -- Fri Apr 12 06:15:00 2013] 
 � 16289.386061¨ INFO: rcu_sched self-detected stall on CPU { 0}  (t=42010 
 jiffies 
  g=89766 c=89765 q=10627) 

Did the system really crash or did you just see the rcu related warning(s)?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/