Re: x86-64 bad pmds in 2.6.11.6

2005-08-08 Thread Andy Davidson

On Wed, 6 Apr, 2005 22:49:03 -0400, Dave Jones wrote:

On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
 > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
 > >  I arrived at the office today to find my workstation had this spew
 > >  in its dmesg buffer..
 > Looks like random memory corruption to me.
 > Can you enable slab debugging etc.?
 > >  mm/memory.c:97: bad pmd 81004b017438(0038a5500a88).
 > >  mm/memory.c:97: bad pmd 81004b017440(0003).
 > >  mm/memory.c:97: bad pmd 81004b017448(773b).
 > >  mm/memory.c:97: bad pmd 81004b017450(773c).
I realised today that this happens every time X starts up for
the first time.   I did some experiments, and found that with 2.6.12rc1
it's gone. Either it got fixed accidentally, or its hidden now
by one of the many changes in 4-level patches.
I'll try and narrow this down a little more tomorrow, to see if I
can pinpoint the exact -bk snapshot (may be tricky given they were
broken for a while), as it'd be good to get this fixed in 2.6.11.x
if .12 isn't going to show up any time soon.


Hi, Dave, all --

Does anyone remember if they saw any system instability at the time of
these messages?


I'm running 2.6.11 on an SMP Opteron box, which is exhibiting these 
notices.  The box occasionally then behaves as it would during a 
serious memory leak - the load average shoots up, the box becomes 
unresponsive and stops accepting network connections (but memory resources 
are not entirely starved, nor does the kernel kill any processes off).


Then, a few minutes later, the computer returns to normal.  This seems 
to happen maybe twice a week.  Thankfully, it hasn't yet ruined my weekend 
with a phone call from support, but it might. ;-)


If you do remember instability around the time of these messages that was 
cured by an upgrade, then I will schedule some downtime to try this out.



--

Regards, Andy Davidson                  [EMAIL PROTECTED]
Systems Administrator, Ebuyer (UK) Ltd


Debugging patch was Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-23 Thread Andi Kleen

Can people who can reproduce the x86-64 2.6.11 bad pmd problem please apply
the following patch, see if the problem can still be reproduced with it,
and send the output generated. Also a strace of the program that showed
it (its pid and name should be dumped) would be useful if not too big.

After staring at the code for some time I can't find the problem, but 
I somehow suspect it has to do with early page table frees. That is
why they are disabled here. This should not cause any memory leaks -
the page tables will still always be freed at process exit - so it is
safe to apply even on production machines.

Thanks,

-Andi


diff -u linux-2.6.11/mm/memory.c-o linux-2.6.11/mm/memory.c
--- linux-2.6.11/mm/memory.c-o  2005-03-02 08:38:08.0 +0100
+++ linux-2.6.11/mm/memory.c2005-04-22 19:32:30.305402456 +0200
@@ -94,6 +94,7 @@
if (pmd_none(*pmd))
return;
if (unlikely(pmd_bad(*pmd))) {
+   printk("%s:%d: ", current->comm, current->pid);
pmd_ERROR(*pmd);
pmd_clear(pmd);
return;
diff -u linux-2.6.11/mm/mmap.c-o linux-2.6.11/mm/mmap.c
--- linux-2.6.11/mm/mmap.c-o2005-03-02 08:38:12.0 +0100
+++ linux-2.6.11/mm/mmap.c  2005-04-22 19:33:10.354580428 +0200
@@ -1645,11 +1645,13 @@
return;
if (first < FIRST_USER_PGD_NR * PGDIR_SIZE)
first = FIRST_USER_PGD_NR * PGDIR_SIZE;
+#if 0
/* No point trying to free anything if we're in the same pte page */
if ((first & PMD_MASK) < (last & PMD_MASK)) {
clear_page_range(tlb, first, last);
flush_tlb_pgtables(mm, first, last);
}
+#endif
 }
 
 /* Normal function to fix up a mapping



Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-19 Thread Hugh Dickins
On Tue, 19 Apr 2005, Andi Kleen wrote:
> On Fri, Apr 15, 2005 at 06:58:20PM +0100, Hugh Dickins wrote:
> > 
> > I must confess, with all due respect to Andi, that I don't understand his
> > dismissal of the possibility that load_cr3 in leave_mm might be the fix
> > (to create_elf_tables writing user stack data into the pmd).
> 
> Sorry for the late answer.

Not at all.  I didn't expect you to persist in trying to persuade me,
thank you for doing so, and I apologize for taking your time on this.

> Ok, lets try again. The hole fixed by this patch only covers
> the case of an kernel thread with lazy mm doing some memory access
> (or more likely the CPU doing a prefetch there). But ELF loading
> never happens in lazy mm kernel threads.AFAIK in a "real" process
> the TLB is always fully consistent.
> 
> Does that explanation satisfy you? 

It does.  Well, I needed to restudy exec_mmap and switch_mm in detail,
and having done so, I agree that the only way you can get through
exec_mmap's activate_mm without fully flushing the cpu's TLB, is if
the active_mm matches the newly allocated mm (itself impossible since
there's a reference on the active_mm), and the cpu bit is still set
in cpu_vm_mask - precisely not the case if we went through leave_mm.
Yet I was claiming your leave_mm fix could flush TLB for exec_mmap
where it wasn't already done.

Sorry for letting the neatness of my pmd/stack story blind me
to its impossibility, and for wasting your time.

Hugh


Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-19 Thread Andi Kleen
On Fri, Apr 15, 2005 at 06:58:20PM +0100, Hugh Dickins wrote:
> On Fri, 15 Apr 2005, Chris Wright wrote:
> > * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > > On Thu, Apr 14, 2005 at 11:27:12AM -0700, Chris Wright wrote:
> > > > Yes, I've seen it in .11 and earlier kernels.  Happen to have same
> > > > "x86_64" string on my bad pmd dumps, but can't reproduce it at all.
> > > > So, for now, I can hold off on adding the reload cr3 patch to -stable
> > > > unless you think it should be there anyway.
> > > 
> > > It is a bug fix (actually there is another related patch that fixes
> > > a similar bug), but we lived with the problems for years so I guess
> > > they can wait for .12. 
> > 
> > Sounds good.
> 
> I must confess, with all due respect to Andi, that I don't understand his
> dismissal of the possibility that load_cr3 in leave_mm might be the fix
> (to create_elf_tables writing user stack data into the pmd).

Sorry for the late answer.

Ok, let's try again. The hole fixed by this patch only covers
the case of a kernel thread with a lazy mm doing some memory access
(or more likely the CPU doing a prefetch there). But ELF loading
never happens in lazy-mm kernel threads. AFAIK in a "real" process
the TLB is always fully consistent.

Does that explanation satisfy you? 

I agree that my earlier one was a bit dubious because I argued about
the direct mapping, but the argv setup actually uses user addresses.
But I still think it must be something else.

-Andi


Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-15 Thread Dave Jones
On Fri, Apr 15, 2005 at 06:58:20PM +0100, Hugh Dickins wrote:

 > > > If there was a fix for the bad pmd problem it might be a candidate
 > > > for stable, but so far we dont know what causes it yet.
 > > If I figure a way to trigger here, I'll report back.
 > 
 > Dave, earlier on you were quite able to reproduce the problem on 2.6.11,
 > finding it happened the first time you ran X.  Do you have any time to
 > reverify that, then try to reproduce with the load_cr3 in leave_mm patch?
 > 
 > But please don't waste your time on this unless you think it's plausible.

I used to be able to reproduce it 100% by doing this on a vanilla
upstream kernel. Then it changed behaviour so I only saw it happening
on the Fedora kernel.  For the latest Fedora update kernel I backported
this change..
- x86_64: Only free PMDs and PUDs after other CPUs have been flushed
as a 'try it and see'.  At first I thought it killed the bug, but
a day or so later, it started doing it again.

In the Fedora kernel we have a patch which restricts /dev/mem reading,
so I got suspicious about this interacting with any of the changes
that had happened to drivers/char/mem.c.
Out of curiosity, I backported the 3-4 patches from .12rc to
the Fedora .11 kernel, and haven't seen the problem since.

The bizarre thing is I can't explain why any of those patches would
make such a difference.  Given that the bug seems to be coming and going
for me, it's possible they've just masked the problem.

Dave



Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-15 Thread Hugh Dickins
On Fri, 15 Apr 2005, Chris Wright wrote:
> * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > On Thu, Apr 14, 2005 at 11:27:12AM -0700, Chris Wright wrote:
> > > Yes, I've seen it in .11 and earlier kernels.  Happen to have same
> > > "x86_64" string on my bad pmd dumps, but can't reproduce it at all.
> > > So, for now, I can hold off on adding the reload cr3 patch to -stable
> > > unless you think it should be there anyway.
> > 
> > It is a bug fix (actually there is another related patch that fixes
> > a similar bug), but we lived with the problems for years so I guess
> > they can wait for .12. 
> 
> Sounds good.

I must confess, with all due respect to Andi, that I don't understand his
dismissal of the possibility that load_cr3 in leave_mm might be the fix
(to create_elf_tables writing user stack data into the pmd).

My belief is that leaving any opening for unpredictable speculation to
pull stale translations into the TLB is a recipe for strange trouble
down the line when those translations may get used in actuality.

I'd been hoping Andi would come to see it my way overnight,
since I'm clearly not up to arguing the case persuasively!

But I certainly don't expect Chris to add an unjustified patch to -stable.

> > If there was a fix for the bad pmd problem it might be a candidate
> > for stable, but so far we dont know what causes it yet.
> 
> If I figure a way to trigger here, I'll report back.

Dave, earlier on you were quite able to reproduce the problem on 2.6.11,
finding it happened the first time you ran X.  Do you have any time to
reverify that, then try to reproduce with the load_cr3 in leave_mm patch?

But please don't waste your time on this unless you think it's plausible.

Thanks,
Hugh


Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-15 Thread Chris Wright
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> On Thu, Apr 14, 2005 at 11:27:12AM -0700, Chris Wright wrote:
> > Yes, I've seen it in .11 and earlier kernels.  Happen to have same
> > "x86_64" string on my bad pmd dumps, but can't reproduce it at all.
> > So, for now, I can hold off on adding the reload cr3 patch to -stable
> > unless you think it should be there anyway.
> 
> It is a bug fix (actually there is another related patch that fixes
> a similar bug), but we lived with the problems for years so I guess
> they can wait for .12. 

Sounds good.

> If there was a fix for the bad pmd problem it might be a candidate
> for stable, but so far we dont know what causes it yet.

If I figure a way to trigger here, I'll report back.

thanks,
-chris


Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-15 Thread Andi Kleen
On Thu, Apr 14, 2005 at 11:27:12AM -0700, Chris Wright wrote:
> * Andi Kleen ([EMAIL PROTECTED]) wrote:
> > > I will take a closer look at the rc1/rc2 patches later this evening
> > > and see if I can spot something. Can only report back tomorrow though.
> > 
> > Actually itt started in .11 already - sigh - on rereading the thread.
> > That will make the code audit harder :/
> 
> Yes, I've seen it in .11 and earlier kernels.  Happen to have same
> "x86_64" string on my bad pmd dumps, but can't reproduce it at all.
> So, for now, I can hold off on adding the reload cr3 patch to -stable
> unless you think it should be there anyway.

It is a bug fix (actually there is another related patch that fixes
a similar bug), but we lived with the problems for years so I guess
they can wait for .12. 

If there was a fix for the bad pmd problem it might be a candidate
for stable, but so far we don't know what causes it yet.

-Andi


Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-14 Thread Chris Wright
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> > I will take a closer look at the rc1/rc2 patches later this evening
> > and see if I can spot something. Can only report back tomorrow though.
> 
> Actually itt started in .11 already - sigh - on rereading the thread.
> That will make the code audit harder :/

Yes, I've seen it in .11 and earlier kernels.  I happen to have the same
"x86_64" string on my bad pmd dumps, but can't reproduce it at all.
So, for now, I can hold off on adding the reload cr3 patch to -stable
unless you think it should be there anyway.

thanks,
-chris
-- 
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net


Re: x86-64 bad pmds in 2.6.11.6 II

2005-04-14 Thread Andi Kleen
> I will take a closer look at the rc1/rc2 patches later this evening
> and see if I can spot something. Can only report back tomorrow though.

Actually it started in .11 already - sigh - on rereading the thread.
That will make the code audit harder :/

-Andi


Re: x86-64 bad pmds in 2.6.11.6

2005-04-14 Thread Andi Kleen
On Thu, Apr 14, 2005 at 06:34:58PM +0100, Hugh Dickins wrote:
> On Thu, 14 Apr 2005, Andi Kleen wrote:
> > 
> > Thanks for the analysis. However I doubt the load_cr3 patch can fix
> > it. All it does is to stop the CPU from prefetching mappings (which
> > can cause different problem).
> 
> I thought that the leave_mm code (before your patch) flushes the TLB, but
> restores cr3 to the mm, while removing that cpu from the mm's cpu_vm_mask.
> 
> So any speculation, not just prefetching, on that cpu is in danger of
> bringing address translations according to that mm back into the TLB.
> 
> But when the mm is torn down in exit_mmap, there's no longer any record
> that the TLB on that cpu needs flushing, so stale translations remain.
> 
> As a rule, we always flush TLB _after_ invalidating, not just before,
> for this kind of reason.

Yes this is all true. In fact I have several bug fixes for problems
in this area.

But all this cannot explain corruptions coming from the kernel; 
you tend to only see problems with the CPU prefetching something.

Note that with the cr3 reload you end up with init_mm, which
is not a useful mm. So even if there were a store from the kernel
into a stale mapping it would cause -EFAULT now.  But that is
not happening.

> 
> My paranoia of speculation may be excessive: I _think_ what I outline
> above is a real possibility on Intel, but you and others know AMD much
> better than I (and the reports I've seen are on AMD64, not EM64T).

It is not excessive, on both Intel and AMD :) These CPUs do a lot of prefetching
behind your back; any stale mappings at any time in the TLB eventually
cause problems - but other ones than this.


> Sure, the "mm/memory.c:97: bad pmd" messages are coming from
> clear_pmd_range, when the corrupted task exits later (but probably
> not much later, since its user stack is oddly distributed across
> two different pages: some mentioned SIGSEGVs I think).
> 
> The pmd really is bad, but it got to be bad because it had stack data
> written into it by create_elf_tables, when the TLB mistakenly thought
> it already knew what physical page 0x7000 was mapped to
> (prior kernel accesses to that user stack are not by user address).

What I meant is that the overwriting must be from Linux code
acting in the direct mapping, not due to stale TLBs for addresses < __PAGE_OFFSET.

I will take a closer look at the rc1/rc2 patches later this evening
and see if I can spot something. Can only report back tomorrow though.

-Andi


Re: x86-64 bad pmds in 2.6.11.6

2005-04-14 Thread Hugh Dickins
On Thu, 14 Apr 2005, Andi Kleen wrote:
> 
> Thanks for the analysis. However I doubt the load_cr3 patch can fix
> it. All it does is to stop the CPU from prefetching mappings (which
> can cause different problem).

I thought that the leave_mm code (before your patch) flushes the TLB, but
restores cr3 to the mm, while removing that cpu from the mm's cpu_vm_mask.

So any speculation, not just prefetching, on that cpu is in danger of
bringing address translations according to that mm back into the TLB.

But when the mm is torn down in exit_mmap, there's no longer any record
that the TLB on that cpu needs flushing, so stale translations remain.

As a rule, we always flush TLB _after_ invalidating, not just before,
for this kind of reason.

My paranoia of speculation may be excessive: I _think_ what I outline
above is a real possibility on Intel, but you and others know AMD much
better than I (and the reports I've seen are on AMD64, not EM64T).

> But the Linux code who does bad pmd checks
> never looks at CR3 anyways, it always uses the current->mm. If
> bad pmd sees a bad page it must be still in the page tables of the MM,
> not a stable TLB entry.

Sure, the "mm/memory.c:97: bad pmd" messages are coming from
clear_pmd_range, when the corrupted task exits later (but probably
not much later, since its user stack is oddly distributed across
two different pages: some mentioned SIGSEGVs I think).

The pmd really is bad, but it got to be bad because it had stack data
written into it by create_elf_tables, when the TLB mistakenly thought
it already knew what physical page 0x7000 was mapped to
(prior kernel accesses to that user stack are not by user address).

Hugh


Re: x86-64 bad pmds in 2.6.11.6

2005-04-14 Thread Andi Kleen
> It looks very much as if the mm being created has for pmd a page
> which was used for user stack in the outgoing mm; but somehow exec's
> exit_mmap TLB flushing hasn't taken effect.  I only now noticed this
> patch where you fix just such an issue.

Thanks for the analysis. However I doubt the load_cr3 patch can fix
it. All it does is stop the CPU from prefetching mappings (which
can cause different problems). But the Linux code that does the bad pmd checks
never looks at CR3 anyway; it always uses current->mm. If
bad pmd sees a bad page it must still be in the page tables of the mm,
not a stale TLB entry.

It must be something else. Somehow we get a freed page into
the page table hierarchy. After the initial 4-level implementation
I did not make many changes there; my suspicion would rather be
on the recent memory.c changes.

-Andi


Re: x86-64 bad pmds in 2.6.11.6

2005-04-14 Thread Hugh Dickins
On Thu, 7 Apr 2005, Andi Kleen wrote:
> Dave Jones wrote:
> > I realised today that this happens every time X starts up for
> > the first time.   I did some experiments, and found that with 2.6.12rc1
> > it's gone. Either it got fixed accidentally, or its hidden now
> > by one of the many changes in 4-level patches.
> > 
> > I'll try and narrow this down a little more tomorrow, to see if I
> > can pinpoint the exact -bk snapshot (may be tricky given they were
> > broken for a while), as it'd be good to get this fixed in 2.6.11.x
> > if .12 isn't going to show up any time soon.
> 
> Can you supply a strace of the /dev/mem, /dev/kmem accesses of 
> your X server? (including the mmaps or read/writes if available)
> 
> My X server doesn't seem to cause that.

I can't explain why it should appear fixed in 2.6.12-rc1 (probably
other complicating factors at work), but I do believe you've fixed
this in 2.6.12-rc2, and the patch which should go into -stable is
your load_cr3 patch below, which Linus took from Andrew on 28 March.

I say this because I was intrigued by the resemblance between Sergey's
and Dave's corruptions, and spent a while trying to work out where they
come from.  The giveaway is the little ASCII string they share at the
end (seen also in Clem's extract)

 mm/memory.c:97: bad pmd 81004b017730(5f363878).
 mm/memory.c:97: bad pmd 81004b017738(3436).

That says "x86_64", and a grep for that as a string shows ELF_PLATFORM,
and a grep for that shows create_elf_tables in fs/binfmt_elf.c.  _All_
this pmd corruption (except for the first line, presumably pushing a
user address on stack) originates from create_elf_tables (the neatly
ascending stack addresses being the argv and envp pointers, incrementing
by 1 because only a NUL-string is found for each, the real strings being
off elsewhere in the intended new stack page, not in this pmd page).
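
For anyone wanting to check that reading by hand, here is a quick userspace
sketch (purely illustrative, nothing kernel-specific) which prints the string
hidden in those two values by taking each one's bytes low-order first, the
way x86-64 stores them:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* the two "bad pmd" values quoted above, leading zero bytes omitted */
		uint64_t pmds[] = { 0x5f363878ULL, 0x3436ULL };

		for (int i = 0; i < 2; i++) {
			for (int shift = 0; shift < 64; shift += 8) {
				unsigned char c = (pmds[i] >> shift) & 0xff;
				if (c == 0)
					break;
				putchar(c);	/* 'x' '8' '6' '_' then '6' '4' */
			}
		}
		putchar('\n');			/* prints "x86_64" */
		return 0;
	}

Compiled with any C compiler it prints "x86_64": the ELF_PLATFORM string
landing across two adjacent pmd entries.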

It looks very much as if the mm being created has for pmd a page
which was used for user stack in the outgoing mm; but somehow exec's
exit_mmap TLB flushing hasn't taken effect.  I only now noticed this
patch where you fix just such an issue.

Hugh

From: "Andi Kleen" <[EMAIL PROTECTED]>

Always reload CR3 completely when a lazy MM thread drops a MM.  This avoids
keeping stale mappings around in the TLB that could be run into by the CPU by
itself (e.g.  during prefetches).

Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 25-akpm/arch/x86_64/kernel/smp.c |3 ++-
 25-akpm/include/asm-x86_64/mmu_context.h |   10 ++++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff -puN 
arch/x86_64/kernel/smp.c~x86_64-always-reload-cr3-completely-when-a-lazy-mm 
arch/x86_64/kernel/smp.c
--- 
25/arch/x86_64/kernel/smp.c~x86_64-always-reload-cr3-completely-when-a-lazy-mm  
Wed Mar 23 15:38:58 2005
+++ 25-akpm/arch/x86_64/kernel/smp.cWed Mar 23 15:38:58 2005
@@ -25,6 +25,7 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/mach_apic.h>
+#include <asm/mmu_context.h>
 #include <asm/proto.h>
 
 /*
@@ -52,7 +53,7 @@ static inline void leave_mm (unsigned lo
if (read_pda(mmu_state) == TLBSTATE_OK)
BUG();
clear_bit(cpu, &read_pda(active_mm)->cpu_vm_mask);
-   __flush_tlb();
+   load_cr3(swapper_pg_dir);
 }
 
 /*
diff -puN 
include/asm-x86_64/mmu_context.h~x86_64-always-reload-cr3-completely-when-a-lazy-mm
 include/asm-x86_64/mmu_context.h
--- 
25/include/asm-x86_64/mmu_context.h~x86_64-always-reload-cr3-completely-when-a-lazy-mm
  Wed Mar 23 15:38:58 2005
+++ 25-akpm/include/asm-x86_64/mmu_context.hWed Mar 23 15:38:58 2005
@@ -28,6 +28,11 @@ static inline void enter_lazy_tlb(struct
 }
 #endif
 
+static inline void load_cr3(pgd_t *pgd)
+{
+   asm volatile("movq %0,%%cr3" :: "r" (__pa(pgd)) : "memory");
+}
+
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, 
 struct task_struct *tsk)
 {
@@ -40,7 +45,8 @@ static inline void switch_mm(struct mm_s
write_pda(active_mm, next);
 #endif
set_bit(cpu, &next->cpu_vm_mask);
-   asm volatile("movq %0,%%cr3" :: "r" (__pa(next->pgd)) : "memory");
+   load_cr3(next->pgd);
+
if (unlikely(next->context.ldt != prev->context.ldt)) 
load_LDT_nolock(&next->context, cpu);
}
@@ -54,7 +60,7 @@ static inline void switch_mm(struct mm_s
 * tlb flush IPI delivery. We must reload CR3
 * to make sure to use no freed page tables.
 */
-   asm volatile("movq %0,%%cr3" :: "r" (__pa(next->pgd)) : "memory");
+   load_cr3(next->pgd);
load_LDT_nolock(&next->context, cpu);
}
}


re: x86-64 bad pmds in 2.6.11.6

2005-04-08 Thread Clem Taylor
Dave Jones reported seeing bad pmd messages in 2.6.11.6. I've been
seeing them with 2.6.11 and today with 2.6.11.6. When I first saw the
problem I ran memtest86, and it didn't catch anything after ~3 hours.
However, I don't see them when X starts. They tend to happen after a
program segfaults:

2.6.11:
Apr  3 23:23:33 klaatu kernel: sh[16361]: segfault at 
rip  rsp 7020 error 14
Apr  3 23:23:33 klaatu kernel: mm/memory.c:97: bad pmd
810027171010(006b68b9).
.. many more ...

2.6.11.6:
Apr  8 12:03:17 klaatu kernel: grep[20971]: segfault at
 rip  rsp 7090 error 14
Apr  8 12:03:17 klaatu kernel: mm/memory.c:97: bad pmd
810095929010(0015).
 many more ...
Apr  8 12:03:18 klaatu kernel: mm/memory.c:97: bad pmd
8100959299d0(34365f363878).
Apr  8 12:03:18 klaatu kernel: grep[21116]: segfault at
 rip  rsp 70a0 error 14
Apr  8 12:03:18 klaatu kernel: mm/memory.c:97: bad pmd
810095f5b000(000f).
...

At the time I was doing a
find ... -exec grep -H ...
over a linux kernel tree.

I repeated the find and I didn't see segfaults the second run.

--Clem


Re: x86-64 bad pmds in 2.6.11.6

2005-04-07 Thread Andi Kleen
> I realised today that this happens every time X starts up for
> the first time.   I did some experiments, and found that with 2.6.12rc1
> it's gone. Either it got fixed accidentally, or its hidden now
> by one of the many changes in 4-level patches.
> 
> I'll try and narrow this down a little more tomorrow, to see if I
> can pinpoint the exact -bk snapshot (may be tricky given they were
> broken for a while), as it'd be good to get this fixed in 2.6.11.x
> if .12 isn't going to show up any time soon.

Can you supply a strace of the /dev/mem, /dev/kmem accesses of 
your X server? (including the mmaps or read/writes if available)

My X server doesn't seem to cause that.

-Andi


Re: x86-64 bad pmds in 2.6.11.6

2005-04-06 Thread Dave Jones
On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
 > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
 > > [apologies to Andi for getting this twice, I goofed the l-k address
 > >  the first time]
 > > 
 > >  
 > >  I arrived at the office today to find my workstation had this spew
 > >  in its dmesg buffer..
 > 
 > Looks like random memory corruption to me.
 > 
 > Can you enable slab debugging etc.?
 > 
 > >  mm/memory.c:97: bad pmd 81004b017438(0038a5500a88).
 > >  mm/memory.c:97: bad pmd 81004b017440(0003).
 > >  mm/memory.c:97: bad pmd 81004b017448(773b).
 > >  mm/memory.c:97: bad pmd 81004b017450(773c).
 > > etc..

I realised today that this happens every time X starts up for
the first time.   I did some experiments, and found that with 2.6.12rc1
it's gone. Either it got fixed accidentally, or it's hidden now
by one of the many changes in 4-level patches.

I'll try and narrow this down a little more tomorrow, to see if I
can pinpoint the exact -bk snapshot (may be tricky given they were
broken for a while), as it'd be good to get this fixed in 2.6.11.x
if .12 isn't going to show up any time soon.

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: x86-64 bad pmds in 2.6.11.6

2005-04-01 Thread Sergey S. Kostyliov
On Friday 01 April 2005 01:52, Dave Jones wrote:
> On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
>  > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
>  > > [apologies to Andi for getting this twice, I goofed the l-k address
>  > >  the first time]
>  > > 
>  > >  
>  > >  I arrived at the office today to find my workstation had this spew
>  > >  in its dmesg buffer..
>  > 
>  > Looks like random memory corruption to me.
>  > 
>  > Can you enable slab debugging etc.?
> 
> SLAB_DEBUG=y.  Nothing in the logs.
> 
>  > Yes I saw them, but I supposed it is some driver going bad.
>  > If you want you can collect hardware data and see if there is
>  > a common driver.
> 
> There's quite a bit in this box 
> 
> 00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07) 
> 00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05) 
> 00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03) 
> 00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
> 00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
> 00:07.5 Multimedia audio controller: Advanced Micro Devices [AMD] AMD-8111 
> AC97 Audio (rev 03)
> 00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 
> 12)
> 00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
> 00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 
> 12)
> 00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> HyperTransport Technology Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
> Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Miscellaneous Control
> 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> HyperTransport Technology Configuration
> 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Address Map
> 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
> Controller
> 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Miscellaneous Control
> 02:07.0 USB Controller: NEC Corporation USB (rev 41)
> 02:07.1 USB Controller: NEC Corporation USB (rev 41)
> 02:07.2 USB Controller: NEC Corporation USB 2.0 (rev 02)
> 02:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz  
> Ultra3 SCSI Adapter (rev 01)
> 02:08.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz  
> Ultra3 SCSI Adapter (rev 01)
> 02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit 
> Ethernet (rev 02)
> 03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
> 03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
> 03:0a.0 Unknown mass storage controller: Triones Technologies, Inc. 
> HPT366/368/370/370A/372 (rev 03)
> 03:0b.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD 
> Technology Inc) SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
> 03:0c.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 
> Controller (PHY/Link)
> 04:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-8151 System Controller 
> (rev 13)
> 04:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8151 AGP Bridge (rev 13)
> 05:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G550 AGP (rev 01)
> 
> The SATA & SCSI controllers have no disks attached.  Firewire can be ignored 
> (there's
> no actual connector even for it on the board). The various USB controllers
> are mostly unused. Only one of them is USB2.0, so that sees occasional
> usb-storage use. Not noticed anything going bad there though.
> 
>   Dave

And here is my box (looks like there are not many hardware drivers
in common).

[EMAIL PROTECTED] rathamahata $ /sbin/lspci
0000:00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
0000:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
0000:00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
0000:00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
0000:00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
0000:00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
0000:00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
0000:00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
0000:00:18.3 Host bridge: Advanced Micro 

Re: x86-64 bad pmds in 2.6.11.6

2005-03-31 Thread Dave Jones
On Thu, Mar 31, 2005 at 12:41:17PM +0200, Andi Kleen wrote:
 > On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
 > > [apologies to Andi for getting this twice, I goofed the l-k address
 > >  the first time]
 > > 
 > >  
 > >  I arrived at the office today to find my workstation had this spew
 > >  in its dmesg buffer..
 > 
 > Looks like random memory corruption to me.
 > 
 > Can you enable slab debugging etc.?

SLAB_DEBUG=y.  Nothing in the logs.

 > Yes I saw them, but I supposed it is some driver going bad.
 > If you want you can collect hardware data and see if there is
 > a common driver.

There's quite a bit in this box 

00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:07.5 Multimedia audio controller: Advanced Micro Devices [AMD] AMD-8111 AC97 
Audio (rev 03)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
02:07.0 USB Controller: NEC Corporation USB (rev 41)
02:07.1 USB Controller: NEC Corporation USB (rev 41)
02:07.2 USB Controller: NEC Corporation USB 2.0 (rev 02)
02:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz  
Ultra3 SCSI Adapter (rev 01)
02:08.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz  
Ultra3 SCSI Adapter (rev 01)
02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5703X Gigabit 
Ethernet (rev 02)
03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
03:0a.0 Unknown mass storage controller: Triones Technologies, Inc. 
HPT366/368/370/370A/372 (rev 03)
03:0b.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD 
Technology Inc) SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
03:0c.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 
Controller (PHY/Link)
04:00.0 Host bridge: Advanced Micro Devices [AMD] AMD-8151 System Controller 
(rev 13)
04:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8151 AGP Bridge (rev 13)
05:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G550 AGP (rev 01)

The SATA & SCSI controllers have no disks attached.  Firewire can be ignored 
(there's
no actual connector even for it on the board). The various USB controllers
are mostly unused. Only one of them is USB2.0, so that sees occasional
usb-storage use. Not noticed anything going bad there though.

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: x86-64 bad pmds in 2.6.11.6

2005-03-31 Thread Andi Kleen
On Wed, Mar 30, 2005 at 04:44:55PM -0500, Dave Jones wrote:
> [apologies to Andi for getting this twice, I goofed the l-k address
>  the first time]
> 
>  
>  I arrived at the office today to find my workstation had this spew
>  in its dmesg buffer..

Looks like random memory corruption to me.

Can you enable slab debugging etc.?

>  mm/memory.c:97: bad pmd 81004b017438(0038a5500a88).
>  mm/memory.c:97: bad pmd 81004b017440(0003).
>  mm/memory.c:97: bad pmd 81004b017448(773b).
>  mm/memory.c:97: bad pmd 81004b017450(773c).
>  mm/memory.c:97: bad pmd 81004b017458(773d).
>  mm/memory.c:97: bad pmd 81004b017468(773e).
>  mm/memory.c:97: bad pmd 81004b017470(773f).
>  mm/memory.c:97: bad pmd 81004b017478(7740).
>  mm/memory.c:97: bad pmd 81004b017480(7741).
>  mm/memory.c:97: bad pmd 81004b017488(7742).
>  mm/memory.c:97: bad pmd 81004b017490(7743).
>  mm/memory.c:97: bad pmd 81004b017498(7744).
>  mm/memory.c:97: bad pmd 81004b0174a0(7745).
>  mm/memory.c:97: bad pmd 81004b0174a8(7746).
>  mm/memory.c:97: bad pmd 81004b0174b0(7747).
>  mm/memory.c:97: bad pmd 81004b0174b8(7748).
>  mm/memory.c:97: bad pmd 81004b0174c0(7749).
>  mm/memory.c:97: bad pmd 81004b0174c8(774a).
>  mm/memory.c:97: bad pmd 81004b0174d0(774b).
>  mm/memory.c:97: bad pmd 81004b0174d8(774c).
>  mm/memory.c:97: bad pmd 81004b0174e0(774d).
>  mm/memory.c:97: bad pmd 81004b0174e8(774e).
>  mm/memory.c:97: bad pmd 81004b0174f0(774f).
>  mm/memory.c:97: bad pmd 81004b0174f8(7750).
>  mm/memory.c:97: bad pmd 81004b017500(7751).
>  mm/memory.c:97: bad pmd 81004b017508(7752).
>  mm/memory.c:97: bad pmd 81004b017510(7753).
>  mm/memory.c:97: bad pmd 81004b017518(7754).
>  mm/memory.c:97: bad pmd 81004b017520(7755).
>  mm/memory.c:97: bad pmd 81004b017528(7756).
>  mm/memory.c:97: bad pmd 81004b017530(7757).
>  mm/memory.c:97: bad pmd 81004b017538(7758).
>  mm/memory.c:97: bad pmd 81004b017540(7759).
>  mm/memory.c:97: bad pmd 81004b017548(775a).
>  mm/memory.c:97: bad pmd 81004b017550(775b).
>  mm/memory.c:97: bad pmd 81004b017558(775c).
>  mm/memory.c:97: bad pmd 81004b017560(775d).
>  mm/memory.c:97: bad pmd 81004b017568(775e).
>  mm/memory.c:97: bad pmd 81004b017570(775f).
>  mm/memory.c:97: bad pmd 81004b017578(7760).
>  mm/memory.c:97: bad pmd 81004b017580(7761).
>  mm/memory.c:97: bad pmd 81004b017588(7762).
>  mm/memory.c:97: bad pmd 81004b017590(7763).
>  mm/memory.c:97: bad pmd 81004b017598(7764).
>  mm/memory.c:97: bad pmd 81004b0175a0(7765).
>  mm/memory.c:97: bad pmd 81004b0175a8(7766).
>  mm/memory.c:97: bad pmd 81004b0175b0(7767).
>  mm/memory.c:97: bad pmd 81004b0175b8(7768).
>  mm/memory.c:97: bad pmd 81004b0175c0(7769).
>  mm/memory.c:97: bad pmd 81004b0175c8(776a).
>  mm/memory.c:97: bad pmd 81004b0175d0(776b).
>  mm/memory.c:97: bad pmd 81004b0175d8(776c).
>  mm/memory.c:97: bad pmd 81004b0175e0(776d).
>  mm/memory.c:97: bad pmd 81004b0175e8(776e).
>  mm/memory.c:97: bad pmd 81004b0175f0(776f).
>  mm/memory.c:97: bad pmd 81004b0175f8(7770).
>  mm/memory.c:97: bad pmd 81004b017600(7771).
>  mm/memory.c:97: bad pmd 81004b017608(7772).
>  mm/memory.c:97: bad pmd 81004b017610(7773).
>  mm/memory.c:97: bad pmd 81004b017618(7774).
>  mm/memory.c:97: bad pmd 81004b017628(0010).
>  mm/memory.c:97: bad pmd 81004b017630(078bfbff).
>  mm/memory.c:97: bad pmd 81004b017638(0006).
>  mm/memory.c:97: bad pmd 81004b017640(1000).
>  mm/memory.c:97: bad pmd 81004b017648(0011).
>  mm/memory.c:97: bad pmd 81004b017650(0064).
>  mm/memory.c:97: bad pmd 81004b017658(0003).
>  mm/memory.c:97: bad pmd 81004b017660(00400040).
>  mm/memory.c:97: bad pmd 81004b017668(0004).
>  mm/memory.c:97: bad pmd 81004b017670(0038).
>  mm/memory.c:97: bad pmd 81004b017678(0005).
>  mm/memory.c:97: bad pmd 81004b017680(0008).
>  mm/memory.c:97: bad pmd 81004b017688(0007).
>  mm/memory.c:97: bad pmd 81004b017698(0008).
>  

x86-64 bad pmds in 2.6.11.6

2005-03-30 Thread Dave Jones
[apologies to Andi for getting this twice, I goofed the l-k address
 the first time]

 
 I arrived at the office today to find my workstation had this spew
 in its dmesg buffer..
 
 mm/memory.c:97: bad pmd 81004b017438(0038a5500a88).
 mm/memory.c:97: bad pmd 81004b017440(0003).
 mm/memory.c:97: bad pmd 81004b017448(773b).
 mm/memory.c:97: bad pmd 81004b017450(773c).
 mm/memory.c:97: bad pmd 81004b017458(773d).
 mm/memory.c:97: bad pmd 81004b017468(773e).
 mm/memory.c:97: bad pmd 81004b017470(773f).
 mm/memory.c:97: bad pmd 81004b017478(7740).
 mm/memory.c:97: bad pmd 81004b017480(7741).
 mm/memory.c:97: bad pmd 81004b017488(7742).
 mm/memory.c:97: bad pmd 81004b017490(7743).
 mm/memory.c:97: bad pmd 81004b017498(7744).
 mm/memory.c:97: bad pmd 81004b0174a0(7745).
 mm/memory.c:97: bad pmd 81004b0174a8(7746).
 mm/memory.c:97: bad pmd 81004b0174b0(7747).
 mm/memory.c:97: bad pmd 81004b0174b8(7748).
 mm/memory.c:97: bad pmd 81004b0174c0(7749).
 mm/memory.c:97: bad pmd 81004b0174c8(774a).
 mm/memory.c:97: bad pmd 81004b0174d0(774b).
 mm/memory.c:97: bad pmd 81004b0174d8(774c).
 mm/memory.c:97: bad pmd 81004b0174e0(774d).
 mm/memory.c:97: bad pmd 81004b0174e8(774e).
 mm/memory.c:97: bad pmd 81004b0174f0(774f).
 mm/memory.c:97: bad pmd 81004b0174f8(7750).
 mm/memory.c:97: bad pmd 81004b017500(7751).
 mm/memory.c:97: bad pmd 81004b017508(7752).
 mm/memory.c:97: bad pmd 81004b017510(7753).
 mm/memory.c:97: bad pmd 81004b017518(7754).
 mm/memory.c:97: bad pmd 81004b017520(7755).
 mm/memory.c:97: bad pmd 81004b017528(7756).
 mm/memory.c:97: bad pmd 81004b017530(7757).
 mm/memory.c:97: bad pmd 81004b017538(7758).
 mm/memory.c:97: bad pmd 81004b017540(7759).
 mm/memory.c:97: bad pmd 81004b017548(775a).
 mm/memory.c:97: bad pmd 81004b017550(775b).
 mm/memory.c:97: bad pmd 81004b017558(775c).
 mm/memory.c:97: bad pmd 81004b017560(775d).
 mm/memory.c:97: bad pmd 81004b017568(775e).
 mm/memory.c:97: bad pmd 81004b017570(775f).
 mm/memory.c:97: bad pmd 81004b017578(7760).
 mm/memory.c:97: bad pmd 81004b017580(7761).
 mm/memory.c:97: bad pmd 81004b017588(7762).
 mm/memory.c:97: bad pmd 81004b017590(7763).
 mm/memory.c:97: bad pmd 81004b017598(7764).
 mm/memory.c:97: bad pmd 81004b0175a0(7765).
 mm/memory.c:97: bad pmd 81004b0175a8(7766).
 mm/memory.c:97: bad pmd 81004b0175b0(7767).
 mm/memory.c:97: bad pmd 81004b0175b8(7768).
 mm/memory.c:97: bad pmd 81004b0175c0(7769).
 mm/memory.c:97: bad pmd 81004b0175c8(776a).
 mm/memory.c:97: bad pmd 81004b0175d0(776b).
 mm/memory.c:97: bad pmd 81004b0175d8(776c).
 mm/memory.c:97: bad pmd 81004b0175e0(776d).
 mm/memory.c:97: bad pmd 81004b0175e8(776e).
 mm/memory.c:97: bad pmd 81004b0175f0(776f).
 mm/memory.c:97: bad pmd 81004b0175f8(7770).
 mm/memory.c:97: bad pmd 81004b017600(7771).
 mm/memory.c:97: bad pmd 81004b017608(7772).
 mm/memory.c:97: bad pmd 81004b017610(7773).
 mm/memory.c:97: bad pmd 81004b017618(7774).
 mm/memory.c:97: bad pmd 81004b017628(0010).
 mm/memory.c:97: bad pmd 81004b017630(078bfbff).
 mm/memory.c:97: bad pmd 81004b017638(0006).
 mm/memory.c:97: bad pmd 81004b017640(1000).
 mm/memory.c:97: bad pmd 81004b017648(0011).
 mm/memory.c:97: bad pmd 81004b017650(0064).
 mm/memory.c:97: bad pmd 81004b017658(0003).
 mm/memory.c:97: bad pmd 81004b017660(00400040).
 mm/memory.c:97: bad pmd 81004b017668(0004).
 mm/memory.c:97: bad pmd 81004b017670(0038).
 mm/memory.c:97: bad pmd 81004b017678(0005).
 mm/memory.c:97: bad pmd 81004b017680(0008).
 mm/memory.c:97: bad pmd 81004b017688(0007).
 mm/memory.c:97: bad pmd 81004b017698(0008).
 mm/memory.c:97: bad pmd 81004b0176a8(0009).
 mm/memory.c:97: bad pmd 81004b0176b0(00403840).
 mm/memory.c:97: bad pmd 81004b0176b8(000b).
 mm/memory.c:97: bad pmd 81004b0176c0(01f4).
 mm/memory.c:97: bad pmd 
