Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-08-01 Thread Phillip Susi

Hendrik . wrote:

> So I think there is a problem with this specific CK804
> ATA controller causing the MCE... Any clues?


Yes, the SATA chip is broken.  Probably time to check the known errata 
on the chip, and if it isn't known, bring nvidia in to debug their 
silicon.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-08-01 Thread Robert Hancock

Hendrik . wrote:

> Ok, I did actually not copy the correct code in the
> mcelog, leaving me some errors about the Northbridge.
> If I do it again it gives me something else. I made 2
> digital photos of 2 lockups when it happened and this
> is the result of the tool, the TSC is different in
> both errors, the rest is the same:
>
> CPU 0 4 northbridge TSC b7d4a144d0
>   Northbridge Watchdog error
>    bit57 = processor context corrupt
>    bit61 = error uncorrected
>   bus error 'generic participation, request timed out
>   generic error mem transaction
>   generic access, level generic'
> STATUS b2070f0f MCGSTATUS 4
> This is not a software problem!


Presumably some access that the CPU is doing to the controller has timed 
out and caused the MCE. It might be useful if we could get a stack trace 
from where the MCE was triggered - does anyone know if it's possible to 
do this?


It's possible that only NVidia really could tell why this error would 
result from a disk problem, though.
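
For reference, the 'request timed out' wording in the mcelog output corresponds to the low 16 bits of the status word, which follow the architectural bus-error layout 0000 1PPT RRRR IILL. Below is a minimal Python sketch of that decoding. It assumes the trailing 0f0f of the quoted status is the 16-bit error code; decode_bus_error and the label strings are made up for illustration and are not mcelog's own.

```python
# Bus/interconnect error code layout (16 bits): 0000 1PPT RRRR IILL
PARTICIPATION = ["originated request", "responded to request",
                 "observed as third party", "generic"]
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
MEMORY_IO = ["memory", "reserved", "I/O", "generic"]
LEVEL = ["level 0", "level 1", "level 2", "generic"]


def decode_bus_error(code):
    """Decode a 16-bit MCA error code; return None if it is not a bus error."""
    if (code & 0xF800) != 0x0800:  # bits 15..11 must read 0000 1
        return None
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],   # PP
        "timed_out": bool((code >> 8) & 0x1),                # T
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),  # RRRR
        "memory_io": MEMORY_IO[(code >> 2) & 0x3],           # II
        "level": LEVEL[code & 0x3],                          # LL
    }


# Assumption: 0x0f0f is the low half of the status value quoted in the thread.
print(decode_bus_error(0x0F0F))
```

Under that assumption the T bit is set and every other field decodes as generic, which is consistent with the decoded text quoted above.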


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-30 Thread Hendrik .
After even more tests I found out the following:

- Running 'dd_rescue /dev/sda1 /dev/zero' on the
on-board Silicon Image Inc. SiI 3114 controller
handles the bad sector just fine and does not give an
MCE. This is on the same motherboard that does give
the MCE error on the Nvidia port.

The following SATA controllers are in that machine:
* IDE interface: nVidia Corporation CK804 Serial ATA
Controller (rev f3)
* RAID bus controller: Silicon Image, Inc. SiI 3114
[SATALink/SATARaid] Serial ATA Controller (rev 02)

- Running the dd_rescue command on another PC with a
different type of motherboard (M2NPV-VM), also with an
Nvidia Nforce 4 (although different) chipset, works fine
and reports the bad sector like the SiI 3114
controller on the other PC.

This PC has the following lspci listing for SATA
controller:
* IDE interface: nVidia Corporation MCP51 Serial ATA
Controller (rev a1)

So I think there is a problem with this specific CK804
ATA controller causing the MCE... Any clues?



   





Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Hendrik .
Ok, I did actually not copy the correct code in the
mcelog, leaving me some errors about the Northbridge.
If I do it again it gives me something else. I made 2
digital photos of 2 lockups when it happened and this
is the result of the tool, the TSC is different in
both errors, the rest is the same:


CPU 0 4 northbridge TSC b7d4a144d0 
  Northbridge Watchdog error
   bit57 = processor context corrupt
   bit61 = error uncorrected
  bus error 'generic participation, request timed out
  generic error mem transaction
  generic access, level generic'
STATUS b2070f0f MCGSTATUS 4
This is not a software problem!

CPU 0 4 northbridge TSC c4dd3a549f 
  Northbridge Watchdog error
   bit57 = processor context corrupt
   bit61 = error uncorrected
  bus error 'generic participation, request timed out
  generic error mem transaction
  generic access, level generic'
STATUS b2070f0f MCGSTATUS 4
This is not a software problem!
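
The bit57 and bit61 lines can be cross-checked by hand. Here is a small Python sketch, assuming the leading byte b2 of the printed status occupies bits 63..56 of the 64-bit bank status register. The thread only shows the shortened value b2070f0f, so the full register contents are not known, and decode_status_flags is a hypothetical helper name.

```python
# Architectural flag bits in the top byte of a 64-bit MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error uncorrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
}


def decode_status_flags(status):
    """Return the names of the flag bits set in a 64-bit status value."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


# Assumption: the leading 0xb2 byte sits in bits 63..56.
print(decode_status_flags(0xB2 << 56))
```

With that placement the set flags include UC (bit 61) and PCC (bit 57), the two bits mcelog reports above.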


It's a bit strange, but if I copy the results from my
first post I get the Northbridge error, perhaps
because there is a line break ('enter') between the first
line with the 'Bank 4' and the 'b2070f0f' line. The
mcelog tool handles this differently from the error on
one line.
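
If the wrapped line is indeed what confuses the tool, the pasted text can be tidied before it is fed to mcelog. A minimal Python sketch, under the assumption that only mail quote markers and the break after 'Bank N:' need repairing; both helper names are hypothetical, and whether mcelog then decodes the text correctly is not verified here.

```python
import re


def strip_quote_markers(text):
    """Remove leading '>' mail quote markers from every line."""
    return "\n".join(re.sub(r"^(?:>\s?)+", "", line)
                     for line in text.splitlines())


def rejoin_bank_line(text):
    """Put a status value wrapped onto its own line back after 'Bank N:'."""
    return re.sub(r"(Bank\s+\d+:)\s*\n\s*", r"\1 ", text)


pasted = (
    "> > CPU 0: Machine Check Exception: 4  Bank 4:\n"
    "> > b2070f0f\n"
    "> > TSC b7d4a144d0\n"
)
print(rejoin_bank_line(strip_quote_markers(pasted)))
```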

Regards,
Hendrik


   



Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Hendrik .
>> hangs. If I try it after a reboot with 'mcelog --k8
>> --ascii' or whatever parameter, there is no output at
> You could type error back in from the email ?

Ok I copied it into the tool, it gives me:

CPU 0 4 northbridge TSC b7d4a144d0 
  Northbridge ECC error
  ECC syndrome = 0
STATUS 0 MCGSTATUS 4

This is a bit strange because I repeatedly tested the
RAM yesterday and it gives no problems. And even more
interesting: the error occurs at a reproducible
moment: when reading the bad sector from the Seagate
harddisk. And with an older kernel I was able to just
copy all stuff from the drive using dd_rescue... I do
not have ECC RAM in my PC by the way.

> > Isn't it strange to say that the controller does
> > something bad if there is just a bad sector on the
> > drive that is reported and handled correctly in an
> > older kernel 
> Not really. Its very strange it gives an MCE at all
> but this is a known
> failure path (and should be a fixed known failure
> path) for the Nvidia SATA.

So how to proceed in tackling this problem now? Is
there anything I can do to (help you guys ;)) fix it?
At this moment it unfortunately does not look to me as
a fixed failure path...


   



Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Alan Cox
> How can I do this? I have installed mcelog but I
> cannot run it after the MCE error because the whole PC
> hangs. If I try it after a reboot with 'mcelog --k8
> --ascii' or whatever parameter, there is no output at

You could type the error back in from the email?

> Isn't it strange to say that the controller does
> something bad if there is just a bad sector on the
> drive that is reported and handled correctly in an
> older kernel 

Not really. It's very strange it gives an MCE at all but this is a known
failure path (and should be a fixed known failure path) for the Nvidia
SATA.


Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Hendrik .
Probably a similar problem is described in the
linux-ide mailing list a while ago:

http://www.opensubscriber.com/message/[EMAIL PROTECTED]/6490911.html

>> Argh. I'm seeing a show stopper bug on sata_nv here.
>> ata_exec_internal
>> is MCE-ing on the READ_NATIVE_MAX_EXT command on
>> both i386 and amd64, with
>> top of Linus' tree + this patch. :(
>
> Oddly, the command at least executes and doesn't MCE
> (but it's not at all
> happy either) if I use ATA_PROT_PIO. I wonder if
> ATA_PROT_NODATA is buggered on this sata_nv chip
> (Asus A8N-E).

At least it is a similar motherboard that is used
(although, to be exact, I have the A8N-E Deluxe
edition).

I will hold off on repairing my SATA disk with
Seatools for now, so if there is some testing to be
done, I can run it with the bad disk.

Regards,
Hendrik


   



Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Hendrik .
> > HARDWARE ERROR
> > CPU 0: Machine Check Exception: 4  Bank 4:
> > b2070f0f
> > TSC b7d4a144d0
> > This is not a software problem!
> > Run through mcelog --ascii to decode and contact your
> > hardware vendor
> > Kernel panic - not syncing: Machine check
>
> You should run this through mcelog as it suggests and see what it shows.
> The kernel should be handling this properly, unless the drive problem
> is causing the controller to do something bad. Note that kernels 2.6.20
> and later use ADMA mode on the nForce4 SATA controller whereas previous
> versions used it essentially like a PATA controller, so it is not
> surprising that the behavior is different.

How can I do this? I have installed mcelog but I
cannot run it after the MCE error because the whole PC
hangs. If I try it after a reboot with 'mcelog --k8
--ascii' or whatever parameter, there is no output at
all. If I try to redirect the output to the syslog,
nothing is in there because the computer stopped
working and did not save the log anymore.

Isn't it strange to say that the controller does
something bad if there is just a bad sector on the
drive that is reported and handled correctly in an
older kernel (I have confirmed a bad sector on the
drive using the Seatools software from Seagate)? In my
opinion a kernel should not stop responding at all
with a bad sector on the disk. I cannot change the
controller's behavior and applied all the updates there
are to make it function, but the problem was introduced
with the newer kernel series.

Perhaps nobody has tried accessing a bad SATA drive
before, to simulate such an error? If it helps I could
try a different type of motherboard to see what
happens there? (Asus M2NPV-VM)

Regards,
Hendrik


   



Re: Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Robert Hancock

Hendrik . wrote:

> Last night I discovered a problem in my RAID5 array
> and finally after a lot of tests I narrowed it down to
> a bad sector on one of the hard disks and some goofy
> kernels.
>
> I just yesterday build a new PC using an existing
> array of 5 disks in RAID 5. I did build the array with
> only 4 out of 5 disks in the system but the rebuild
> processes stopped over and over again apparently at
> the same position. At last I found out that the
> harddisk at the first SATA port had developed some bad
> sectors which made the kernel stop completely when it
> tried to read that sector with the following error on
> the screen:
>
> HARDWARE ERROR
> CPU 0: Machine Check Exception: 4  Bank 4:
> b2070f0f
> TSC b7d4a144d0
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your
> hardware vendor
> Kernel panic - not syncing: Machine check


You should run this through mcelog as it suggests and see what it shows. 
 The kernel should be handling this properly, unless the drive problem 
is causing the controller to do something bad. Note that kernels 2.6.20 
and later use ADMA mode on the nForce4 SATA controller whereas previous 
versions used it essentially like a PATA controller, so it is not 
surprising that the behavior is different.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Reading a bad sector does not report failure as 'read error' but hangs PC with 'Machine Check Exception'

2007-07-29 Thread Hendrik .
Last night I discovered a problem in my RAID5 array
and finally after a lot of tests I narrowed it down to
a bad sector on one of the hard disks and some goofy
kernels.

Just yesterday I built a new PC using an existing
array of 5 disks in RAID 5. I built the array with
only 4 out of 5 disks in the system, but the rebuild
process stopped over and over again, apparently at
the same position. At last I found out that the
hard disk on the first SATA port had developed some bad
sectors, which made the kernel stop completely when it
tried to read such a sector, with the following error on
the screen:

HARDWARE ERROR
CPU 0: Machine Check Exception: 4  Bank 4:
b2070f0f
TSC b7d4a144d0
This is not a software problem!
Run through mcelog --ascii to decode and contact your
hardware vendor
Kernel panic - not syncing: Machine check

Googling around made me check memory, upgrade the BIOS
and things like that, but now I DO think that this IS a
software problem, which is in the Linux kernel.

I was running the standard 2.6.20-16 kernel series
from Ubuntu Feisty Fawn (using the generic and server
builds) and I built my own 2.6.22.1, but the problem
still persisted. When copying manually with dd_rescue
I was not able to copy past the bad sector without the
MCE error reappearing. Only when using the standard Ubuntu
Edgy Eft kernel (2.6.17-12-server) did the problem go
away completely, and the syslog was filled with normal
lines like:

Jul 28 22:58:26 mediaserver kernel: [ 6562.446868] ata2: error=0x40 { UncorrectableError }
Jul 28 22:58:26 mediaserver kernel: [ 6562.446875] sd 1:0:0:0: SCSI error: return code = 0x802
Jul 28 22:58:26 mediaserver kernel: [ 6562.446880] Additional sense: Unrecovered read error - auto reallocate failed
Jul 28 22:58:26 mediaserver kernel: [ 6562.446887] end_request: I/O error, dev sda, sector 205534870

So in the end I was able to copy my stuff off the bad
hard disk to a new disk (losing some bytes because of
my already dirty RAID5 array), but I do think this is a
kernel bug, or at least strange behavior, as an old
kernel is willing to continue operation on something
as 'minor' as a bad sector. As it stands, when I start
scrubbing the drive array overnight, a simple bad
sector on the array will take down the complete system
instead of just continuing with 1 faulty drive in the
array!
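
The end_request line in the syslog excerpt above is the one that names the failing sector. A minimal Python sketch of pulling it out of a saved log and converting it to a byte offset; the 512-byte sector size and the helper names are assumptions for illustration and are not taken from the thread.

```python
import re

SECTOR_SIZE = 512  # assumption: logical sector size in bytes

# Matches kernel lines such as:
#   end_request: I/O error, dev sda, sector 205534870
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def bad_sectors(log_text):
    """Return the sorted, de-duplicated (device, sector) pairs in a log."""
    return sorted({(dev, int(sector))
                   for dev, sector in IO_ERROR.findall(log_text)})


def byte_offset(sector, sector_size=SECTOR_SIZE):
    """Convert a sector number into a byte offset on the device."""
    return sector * sector_size


log = ("Jul 28 22:58:26 mediaserver kernel: [ 6562.446887] "
       "end_request: I/O error, dev sda, sector 205534870\n")
for dev, sector in bad_sectors(log):
    print(dev, sector, byte_offset(sector))
```

The resulting offset could serve as a starting position for re-reading only the suspect region instead of the whole disk.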

Some information about the hardware:
AMD Athlon 64 3000+
Asus A8N-E Deluxe motherboard
1 GB RAM
4 Seagate 7200.9 drives on the NVIDIA SATA controller
(sda ... sdd)
2 WD drives on the IDE controller (hda, hdc)
Running Feisty Fawn 64 bit Server edition

The faulty drive is /dev/sda and is thus on the first SATA
port. Changing this to a different port on the
motherboard gives the same lockup. There is also a SIL
3114 controller on the motherboard but I have not
tried to dd_rescue with the faulty drive on that
controller to see if it locks up the kernel.

Regards,
Hendrik van den Boogaard




  


