Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels

2010-02-06 Thread Cesare Leonardi

M. Dietrich wrote:

Does this mean that with normal 686 or 486 kernel the corruption
doesn't happen?


yes.


So it could be a kernel bug. Or the bigmem kernel just triggers the problem 
earlier or more frequently.
Have you already searched the internet to see if someone else has hit your 
problem? Because i suspect it's not a kernel problem (see later)...



there are no special settings installed using hdparm:

/dev/sda:
 multcount =  0 (off)
 IO_support=  1 (32-bit)
 readonly  =  0 (off)
 readahead = 256 (on)
 geometry  = 30401/255/63, sectors = 488397168, start = 0


This is the output of the command, but it doesn't show everything 
you could have changed from the defaults. Have you customized 
/etc/hdparm.conf?

For example i've set apm=254 but the above output doesn't report it.

My suggestion is: try commenting out everything you have customized 
for the disk.
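
To see at a glance what is actually active in /etc/hdparm.conf, something like the following could be used. It is only a minimal sketch: the helper name active_lines is made up for illustration, and it assumes that comment lines in the file start with "#". It does not interpret the settings, it only lists lines that are neither blank nor commented out.

```python
from pathlib import Path


def active_lines(text):
    """Return the non-blank, non-comment lines of a config file's text.

    Assumption: comment lines start with '#'.
    """
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            result.append(stripped)
    return result


if __name__ == "__main__":
    conf = Path("/etc/hdparm.conf")
    if conf.exists():
        for line in active_lines(conf.read_text()):
            print(line)
    else:
        print("no /etc/hdparm.conf found")
```

If this prints nothing, no setting in that file is active; anything it does print is a candidate for commenting out while testing.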



it's already installed, this is the output:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 085   069   034    Pre-fail Always  -           98867399
  3 Spin_Up_Time            0x0003 100   100   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 001   001   020    Old_age  Always  FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 075   060   030    Pre-fail Always  -           40211526
  9 Power_On_Hours          0x0032 095   095   000    Old_age  Always  -           269350284038985
 10 Spin_Retry_Count        0x0013 100   100   034    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           448
184 End-to-End_Error        0x0032 100   253   000    Old_age  Always  -           0
187 Reported_Uncorrect      0x003a 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x0022 100   100   045    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 071   052   000    Old_age  Always  -           29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           19
192 Power-Off_Retract_Count 0x0022 062   062   000    Old_age  Always  -           77434
193 Load_Cycle_Count        0x001a 001   001   000    Old_age  Always  -           320283
194 Temperature_Celsius     0x0012 029   048   000    Old_age  Always  -           29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010 070   061   000    Old_age  Offline -           98881899
196 Reallocated_Event_Count 0x003e 096   096   000    Old_age  Always  -           3645 (28548, 0)
197 Current_Pending_Sector  0x     100   100   000    Old_age  Offline -           0
198 Offline_Uncorrectable   0x0032 100   100   000    Old_age  Always  -           0
199 UDMA_CRC_Error_Count    0x     200   200   000    Old_age  Offline -           0
200 Multi_Zone_Error_Rate   0x     100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x     100   253   000    Old_age  Offline -           0

i wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because
hdaps is stopping the device often? why is that bad? hm.


Good question. I suggest downloading a diagnostic tool from your disk 
vendor's site and checking whether it reports the disk as failing.


One of the problems with SMART is that the semantics of the table above 
are not consistent between manufacturers. So i suggest you look at the 
Wikipedia SMART page, in particular the "Known ATA S.M.A.R.T. 
attributes" table, but take it with a grain of salt:

http://en.wikipedia.org/wiki/S.M.A.R.T.
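
On the FAILING_NOW question: smartmontools marks an attribute as failed when its normalized VALUE is less than or equal to THRESH. The sketch below applies that rule to three rows copied from the table in this thread. The helper names (parse_rows, at_or_below_threshold) are invented for illustration, and the parser assumes each row has the ten whitespace-separated columns of the smartctl attribute table.

```python
# Three rows copied from the SMART table in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f 085   069   034    Pre-fail Always  -           98867399
  4 Start_Stop_Count        0x0032 001   001   020    Old_age  Always  FAILING_NOW 248712
193 Load_Cycle_Count        0x001a 001   001   000    Old_age  Always  -           320283
"""


def parse_rows(text):
    """Parse attribute rows into dicts; skip lines that are not rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)  # the raw value may contain spaces
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "type": parts[6],
            "when_failed": parts[8],
            "raw": parts[9].strip(),
        })
    return rows


def at_or_below_threshold(rows):
    """Names of attributes whose normalized VALUE is <= THRESH."""
    return [row["name"] for row in rows if row["value"] <= row["thresh"]]


if __name__ == "__main__":
    print(at_or_below_threshold(parse_rows(SAMPLE)))
```

Note that this only looks at the normalized columns. Load_Cycle_Count has the same VALUE of 001 but a THRESH of 000, so it is not flagged, and the raw values still need the vendor-specific interpretation discussed here.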

That said, based on your SMART table i'd search for information on these 
attributes together with your disk manufacturer:

* Raw_Read_Error_Rate
* Start_Stop_Count
* Seek_Error_Rate
* Power-Off_Retract_Count
* Load_Cycle_Count
* Hardware_ECC_Recovered
* Reallocated_Event_Count

I'd check whether the raw values of Raw_Read_Error_Rate and Seek_Error_Rate, 
as used by your manufacturer, are worrying or not.
The same goes for Hardware_ECC_Recovered. At work we have at least 4 Maxtor 
drives that show high and constantly increasing raw values, but they have 
worked without problems for years.


Reallocated_Event_Count also deserves some investigation: why is it 
so high while Reallocated_Sector_Ct and Current_Pending_Sector are zero?


Last, from your SMART table it seems that your drive often goes into 
standby/sleep mode. This can be seen from the high raw values of 
Start_Stop_Count, Load_Cycle_Count and Power-Off_Retract_Count. And in 
your initial report you said that you used suspend/resume.
I think you should stop these values from growing so fast, because they 
are already very high and all these start/stop cycles will reduce (or 
already have reduced) the life of your disk.
Maybe there is something on your system that forces overly aggressive 
power saving on your disk. Laptop-mode-tools 
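
To put the start/stop concern into numbers, the raw counters from the table can be set against Power_Cycle_Count. This is only a rough sketch: dividing by the number of power cycles is a normalization chosen here for illustration, not something smartctl reports, and it assumes that the raw values of these attributes are plain event counts on this drive.

```python
# Raw values copied from the SMART table in this thread.
RAW = {
    "Start_Stop_Count": 248712,
    "Power-Off_Retract_Count": 77434,
    "Load_Cycle_Count": 320283,
    "Power_Cycle_Count": 448,
}


def per_power_cycle(count, power_cycles):
    """Average number of events per power cycle."""
    if power_cycles <= 0:
        raise ValueError("power_cycles must be positive")
    return count / power_cycles


def summarize(raw):
    """Ratio of every counter to Power_Cycle_Count."""
    cycles = raw["Power_Cycle_Count"]
    return {
        name: per_power_cycle(value, cycles)
        for name, value in raw.items()
        if name != "Power_Cycle_Count"
    }


if __name__ == "__main__":
    for name, ratio in summarize(RAW).items():
        print(f"{name}: about {ratio:.0f} per power cycle")
```

Under those assumptions the heads are loaded and unloaded several hundred times for every time the machine is powered on, which fits the picture of very aggressive power saving.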

2010-02-04 Thread M. Dietrich
On Wed, Feb 03, 2010 at 11:22:06AM +0100, Cesare Leonardi wrote:
 M. Dietrich wrote:
  my system had serious filesystem corruption with several -bigmem
  kernel in the past (from 2.6.28 to 2.6.32).
 
 Does this mean that with normal 686 or 486 kernel the corruption
 doesn't happen?

yes.
 
 However, many years ago i experienced frequent filesystem
 corruption but i couldn't figure out why. Eventually i discovered
 it was caused by some hdparm settings...
 It was very hard to find, so i hope this could help you.  ;-)

there are no special settings installed using hdparm:

/dev/sda:
 multcount =  0 (off)
 IO_support=  1 (32-bit)
 readonly  =  0 (off)
 readahead = 256 (on)
 geometry  = 30401/255/63, sectors = 488397168, start = 0

  for sure i can't guarantee that this isn't related to some hardware
  fault like broken ram or the like but i checked ram with memtest86+.
 
 If i were you, i would also install smartmontools and try something
 like:
 smartctl -a /dev/yourdisk
 I'd pay particular attention to the "Vendor Specific SMART
 Attributes with Thresholds" table to find something strange.

it's already installed, this is the output:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 085   069   034    Pre-fail Always  -           98867399
  3 Spin_Up_Time            0x0003 100   100   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 001   001   020    Old_age  Always  FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 075   060   030    Pre-fail Always  -           40211526
  9 Power_On_Hours          0x0032 095   095   000    Old_age  Always  -           269350284038985
 10 Spin_Retry_Count        0x0013 100   100   034    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           448
184 End-to-End_Error        0x0032 100   253   000    Old_age  Always  -           0
187 Reported_Uncorrect      0x003a 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x0022 100   100   045    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 071   052   000    Old_age  Always  -           29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           19
192 Power-Off_Retract_Count 0x0022 062   062   000    Old_age  Always  -           77434
193 Load_Cycle_Count        0x001a 001   001   000    Old_age  Always  -           320283
194 Temperature_Celsius     0x0012 029   048   000    Old_age  Always  -           29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010 070   061   000    Old_age  Offline -           98881899
196 Reallocated_Event_Count 0x003e 096   096   000    Old_age  Always  -           3645 (28548, 0)
197 Current_Pending_Sector  0x     100   100   000    Old_age  Offline -           0
198 Offline_Uncorrectable   0x0032 100   100   000    Old_age  Always  -           0
199 UDMA_CRC_Error_Count    0x     200   200   000    Old_age  Offline -           0
200 Multi_Zone_Error_Rate   0x     100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x     100   253   000    Old_age  Offline -           0

i wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because
hdaps is stopping the device often? why is that bad? hm.

but everything else looks fine, right?

 And listen for whether the disk makes any suspicious noise.

it doesn't - silent as a sleeping baby.
 
 If you have the slightest suspicion about the ram, try temporarily
 removing a bank, if you have more than one, or replace it completely
 if you can. In the past i've seen at least two cases where memtest ran
 ok for about a day but the system had sporadic freezes and BSODs
 (Windows PCs). When i replaced the ram the problems disappeared.
 
removing would reduce mem size and make the need for a bigmem kernel obsolete.
replacing isn't possible right now. point is: i never had strange behaviour
related to mem like kernel freezes or program core dumps and i use the system
quite a lot with big (cross-)compiles and everything that uses a lot of mem...

thing is that i discovered fs corruption by accident - git complained
about a defective repo. then i forced a fsck run at boot and that failed.
maybe all bigmem users should force a fsck and see if they already
suffer from a similar corruption. if not this bug should be closed
because it seems to be hw related. but i don't know how and where to
search, especially because this computer is a tool to do my work on.

best regards,
michael



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



2010-02-03 Thread Cesare Leonardi

M. Dietrich wrote:

my system had serious filesystem corruption with several -bigmem
kernel in the past (from 2.6.28 to 2.6.32).


Does this mean that with normal 686 or 486 kernel the corruption doesn't 
happen?


However, many years ago i experienced frequent filesystem corruption 
but i couldn't figure out why. Eventually i discovered it was caused by 
some hdparm settings...

It was very hard to find, so i hope this could help you.  ;-)


for sure i can't guarantee that this isn't related to some hardware
fault like broken ram or the like but i checked ram with memtest86+.


If i were you, i would also install smartmontools and try something like:
smartctl -a /dev/yourdisk
I'd pay particular attention to the "Vendor Specific SMART Attributes 
with Thresholds" table to find something strange.

And listen for whether the disk makes any suspicious noise.

If you have the slightest suspicion about the ram, try temporarily removing 
a bank, if you have more than one, or replace it completely if you can. In 
the past i've seen at least two cases where memtest ran ok for about a 
day but the system had sporadic freezes and BSODs (Windows PCs). 
When i replaced the ram the problems disappeared.


Bye.

Cesare.



2010-01-27 Thread M. Dietrich
Package: linux-image-686-bigmem
Severity: grave
Justification: renders package unusable

my system had serious filesystem corruption with several -bigmem
kernels in the past (from 2.6.28 to 2.6.32).

git first discovered the problem because it complained about corruption
in the repo. a filesystem check found lots of double allocated blocks
and other problems.

i use an ext3 filesystem. i first thought it's related to suspend/resume
so i didn't suspend anymore - file corruption still occurred. so i
switched from a -bigmem kernel to a normal kernel with the same version
and even after heavy usage (cross compile openwrt) no corruption was
found.

for sure i can't guarantee that this isn't related to some hardware
fault like broken ram or the like but i checked ram with memtest86+.

please adjust Severity and/or Justification if i chose that wrong.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.32-trunk-686 (SMP w/2 CPU cores)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash



2010-01-27 Thread Bastian Blank
reassign 567204 linux-2.6 2.6.32-1
severity 567204 important
thanks

On Wed, Jan 27, 2010 at 11:25:57PM +0100, M. Dietrich wrote:
 my system had serious filesystem corruption with several -bigmem
 kernel in the past (from 2.6.28 to 2.6.32). 

Well, none of mine have.

 for sure i can't guarantee that this isn't related to some hardware
 fault like broken ram or the like but i checked ram with memtest86+.

A usual test is to run md5sum several times over the same data and see
if it changes.
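
A minimal sketch of that test, assuming Python is at hand instead of the md5sum command: it hashes the same file several times and checks that every run gives the same digest. The function names and the default of five runs are arbitrary choices for illustration. Keep in mind that repeated reads of a file that fits in memory are probably served from the page cache, so a very large file (or several of them) exercises more of the hardware.

```python
import hashlib
import sys


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, runs=5):
    """Hash the same file `runs` times and return all digests."""
    return [md5_of(path) for _ in range(runs)]


def is_stable(digests):
    """True if all digests are identical (or the list is empty)."""
    return len(set(digests)) <= 1


if __name__ == "__main__" and len(sys.argv) > 1:
    results = repeated_digests(sys.argv[1])
    print("stable" if is_stable(results) else "DIGEST CHANGED", results)
```

A digest that changes between runs over unchanged data points at hardware (RAM, controller, cabling) rather than at the filesystem.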

You may also want to check that you are using an up-to-date BIOS.

Bastian

-- 
Dammit Jim, I'm an actor, not a doctor.



2010-01-27 Thread Ben Hutchings
On Wed, 2010-01-27 at 23:25 +0100, M. Dietrich wrote:
 Package: linux-image-686-bigmem

This isn't a real package name, so your report is lacking the system
information that should be gathered automatically.

Please follow up (run 'reportbug -N 567204') to add that system
information.

 Severity: grave
 Justification: renders package unusable
[...]
 please adjust Severity and/or Justification if i chose that wrong.

This justification only applies when the package is unusable on all or
most systems.  However, data loss is grave, so I think the severity is
correct.

Ben.

-- 
Ben Hutchings
The obvious mathematical breakthrough [to break modern encryption] would be
development of an easy way to factor large prime numbers. - Bill Gates

