Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels
M. Dietrich wrote:
>> Does this mean that with normal 686 or 486 kernel the corruption doesn't happen?
>
> yes.

So it could be a kernel bug. Or the bigmem kernel triggers the problem earlier or more frequently. Have you already searched the internet to see whether someone else has hit your problem? Because I suspect it's not a kernel problem (see later)...

> there are no special settings installed using hdparm:
>
> /dev/sda:
>  multcount     =  0 (off)
>  IO_support    =  1 (32-bit)
>  readonly      =  0 (off)
>  readahead     = 256 (on)
>  geometry      = 30401/255/63, sectors = 488397168, start = 0

This is the output of the command, but it doesn't show everything you could have changed from the default. Have you customized /etc/hdparm.conf? For example, I've set apm=254 but the above output doesn't report it. My suggestion is: try to comment out everything you have customized about the disk.

> it's already installed, this is the output:
>
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 085   069   034    Pre-fail Always  -           98867399
>   3 Spin_Up_Time            0x0003 100   100   000    Pre-fail Always  -           0
>   4 Start_Stop_Count        0x0032 001   001   020    Old_age  Always  FAILING_NOW 248712
>   5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
>   7 Seek_Error_Rate         0x000f 075   060   030    Pre-fail Always  -           40211526
>   9 Power_On_Hours          0x0032 095   095   000    Old_age  Always  -           269350284038985
>  10 Spin_Retry_Count        0x0013 100   100   034    Pre-fail Always  -           0
>  12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           448
> 184 End-to-End_Error        0x0032 100   253   000    Old_age  Always  -           0
> 187 Reported_Uncorrect      0x003a 100   100   000    Old_age  Always  -           0
> 189 High_Fly_Writes         0x0022 100   100   045    Old_age  Always  -           0
> 190 Airflow_Temperature_Cel 0x0032 071   052   000    Old_age  Always  -           29 (Lifetime Min/Max 10/48)
> 191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           19
> 192 Power-Off_Retract_Count 0x0022 062   062   000    Old_age  Always  -           77434
> 193 Load_Cycle_Count        0x001a 001   001   000    Old_age  Always  -           320283
> 194 Temperature_Celsius     0x0012 029   048   000    Old_age  Always  -           29 (0 10 0 0)
> 195 Hardware_ECC_Recovered  0x0010 070   061   000    Old_age  Offline -           98881899
> 196 Reallocated_Event_Count 0x003e 096   096   000    Old_age  Always  -           3645 (28548, 0)
> 197 Current_Pending_Sector  0x     100   100   000    Old_age  Offline -           0
> 198 Offline_Uncorrectable   0x0032 100   100   000    Old_age  Always  -           0
> 199 UDMA_CRC_Error_Count    0x     200   200   000    Old_age  Offline -           0
> 200 Multi_Zone_Error_Rate   0x     100   253   000    Old_age  Offline -           0
> 202 Data_Address_Mark_Errs  0x     100   253   000    Old_age  Offline -           0
>
> i wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because hdaps is stopping the device often? why is that bad? hm.

Good question. I suggest downloading a diagnostic tool from your disk vendor's site and seeing whether it reports the disk as failing. One of the problems with SMART is that the semantics of the table above are not consistent between manufacturers. So I suggest you look at the Wikipedia SMART page, in particular the "Known ATA S.M.A.R.T. attributes" table, but take it with a grain of salt:
http://en.wikipedia.org/wiki/S.M.A.R.T.

That said, based on your SMART table I'd do some searching on these attributes for your disk manufacturer:
* Raw_Read_Error_Rate
* Start_Stop_Count
* Seek_Error_Rate
* Power-Off_Retract_Count
* Load_Cycle_Count
* Hardware_ECC_Recovered
* Reallocated_Event_Count

I'd check whether the raw values of Raw_Read_Error_Rate and Seek_Error_Rate, as used by your manufacturer, are worrying or not. Same thing for Hardware_ECC_Recovered. At work we have at least 4 Maxtor drives that show high and constantly increasing raw values, but they have worked without problems for years.

Reallocated_Event_Count also deserves some investigation: why is it so high while Reallocated_Sector_Ct and Current_Pending_Sector are zero?

Last, your SMART table suggests that your drive often goes into standby/sleep mode. This can be seen from the high raw values of Start_Stop_Count, Load_Cycle_Count and Power-Off_Retract_Count. And in your initial report you said that you used suspend/resume. I think that you should reduce these values because they are very high, and all these start/stop cycles will reduce (or already have reduced) the life of your disk. Maybe on your system there is something that forces too aggressive power saving on your disk. Laptop-mode-tools
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels
On Wed, Feb 03, 2010 at 11:22:06AM +0100, Cesare Leonardi wrote:
> M. Dietrich wrote:
>> my system had serious filesystem corruption with several -bigmem kernels in the past (from 2.6.28 to 2.6.32).
>
> Does this mean that with normal 686 or 486 kernel the corruption doesn't happen?

yes.

> However many years ago I experienced frequent filesystem corruption but I couldn't figure out why. Eventually I discovered it was some hdparm settings... It was very hard to find, so I hope this could help you. ;-)

there are no special settings installed using hdparm:

/dev/sda:
 multcount     =  0 (off)
 IO_support    =  1 (32-bit)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 30401/255/63, sectors = 488397168, start = 0

>> for sure i can't guarantee that this isn't related to some hardware fault like broken ram or the like but i checked ram with memtest86+.
>
> If I were you, I would also install smartmontools and try something like:
> smartctl -a /dev/yourdisk
> I'd pay particular attention to the "Vendor Specific SMART Attributes with Thresholds" table to find something strange.

it's already installed, this is the output:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 085   069   034    Pre-fail Always  -           98867399
  3 Spin_Up_Time            0x0003 100   100   000    Pre-fail Always  -           0
  4 Start_Stop_Count        0x0032 001   001   020    Old_age  Always  FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x000f 075   060   030    Pre-fail Always  -           40211526
  9 Power_On_Hours          0x0032 095   095   000    Old_age  Always  -           269350284038985
 10 Spin_Retry_Count        0x0013 100   100   034    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   020    Old_age  Always  -           448
184 End-to-End_Error        0x0032 100   253   000    Old_age  Always  -           0
187 Reported_Uncorrect      0x003a 100   100   000    Old_age  Always  -           0
189 High_Fly_Writes         0x0022 100   100   045    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 071   052   000    Old_age  Always  -           29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032 100   100   000    Old_age  Always  -           19
192 Power-Off_Retract_Count 0x0022 062   062   000    Old_age  Always  -           77434
193 Load_Cycle_Count        0x001a 001   001   000    Old_age  Always  -           320283
194 Temperature_Celsius     0x0012 029   048   000    Old_age  Always  -           29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010 070   061   000    Old_age  Offline -           98881899
196 Reallocated_Event_Count 0x003e 096   096   000    Old_age  Always  -           3645 (28548, 0)
197 Current_Pending_Sector  0x     100   100   000    Old_age  Offline -           0
198 Offline_Uncorrectable   0x0032 100   100   000    Old_age  Always  -           0
199 UDMA_CRC_Error_Count    0x     200   200   000    Old_age  Offline -           0
200 Multi_Zone_Error_Rate   0x     100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x     100   253   000    Old_age  Offline -           0

i wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because hdaps is stopping the device often? why is that bad? hm. but everything else looks fine, right?

> And listen for any suspicious noise from the disk.

it doesn't - silent as a sleeping baby.

> If you have the slightest suspicion about the ram, try to temporarily remove some banks, if you have more than one, or replace it completely if you can. In the past I've seen at least two cases where memtest ran ok for about a day but the system had sporadic system freezes and BSODs (Windows PCs). When I replaced the ram the problems disappeared.

removing would reduce mem size and make the need for a bigmem kernel obsolete. replacing isn't possible right now.

point is: i never had strange behaviour related to mem like kernel freezes or program core dumps, and i use the system quite a lot with big (cross-)compiles and everything that uses a lot of mem...

thing is that i discovered the fs corruption by accident - git complained about a defective repo. then i forced a fsck run at boot and that failed.

maybe all bigmem users should force a fsck and see if they already suffer from a similar corruption. if not, this bug should be closed because it seems to be hw related. but i don't know how or where to search, especially because this computer is a tool to do my work on.

best regards,
michael
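On the question of how to read the WHEN_FAILED column: as far as the smartctl documentation describes it, an attribute is flagged as failing when its normalized VALUE has dropped to or below THRESH, regardless of how alarming the raw value looks. The following is only a minimal Python sketch of that comparison, using four rows copied from the table above. The function name `failing_attributes` is made up for this example, and treating a threshold of 000 as "no threshold" is an assumption of the sketch, not something stated in this thread.

```python
# Sketch: list SMART attributes whose normalized VALUE has reached THRESH.
# Rows are copied from the smartctl output quoted in this thread.
ROWS = """\
  4 Start_Stop_Count        0x0032 001   001   020    Old_age  Always  FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
193 Load_Cycle_Count        0x001a 001   001   000    Old_age  Always  -           320283
196 Reallocated_Event_Count 0x003e 096   096   000    Old_age  Always  -           3645 (28548, 0)
"""


def failing_attributes(table):
    """Return the names of attributes with 0 < THRESH and VALUE <= THRESH."""
    failing = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # not a complete attribute row
        name = fields[1]
        value = int(fields[3])   # normalized VALUE column
        thresh = int(fields[5])  # THRESH column
        # Assumption of this sketch: a threshold of 000 means "never fails".
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


print(failing_attributes(ROWS))
```

Under these assumptions only Start_Stop_Count is listed: Load_Cycle_Count has the same normalized value of 001, but its threshold is 000, so it is not reported as failing even though its raw count is higher.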
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels
M. Dietrich wrote:
> my system had serious filesystem corruption with several -bigmem kernels in the past (from 2.6.28 to 2.6.32).

Does this mean that with normal 686 or 486 kernel the corruption doesn't happen?

However many years ago I experienced frequent filesystem corruption but I couldn't figure out why. Eventually I discovered it was some hdparm settings... It was very hard to find, so I hope this could help you. ;-)

> for sure i can't guarantee that this isn't related to some hardware fault like broken ram or the like but i checked ram with memtest86+.

If I were you, I would also install smartmontools and try something like:
smartctl -a /dev/yourdisk
I'd pay particular attention to the "Vendor Specific SMART Attributes with Thresholds" table to find something strange.
And listen for any suspicious noise from the disk.

If you have the slightest suspicion about the ram, try to temporarily remove some banks, if you have more than one, or replace it completely if you can. In the past I've seen at least two cases where memtest ran ok for about a day but the system had sporadic system freezes and BSODs (Windows PCs). When I replaced the ram the problems disappeared.

Bye.

Cesare.
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels
Package: linux-image-686-bigmem
Severity: grave
Justification: renders package unusable

my system had serious filesystem corruption with several -bigmem kernels in the past (from 2.6.28 to 2.6.32). git first discovered the problem because it complained about corruption in the repo. a filesystem check found lots of double-allocated blocks and other problems. i use an ext3 filesystem.

i first thought it was related to suspend/resume so i didn't suspend anymore - file corruption still occurred. so i switched from a -bigmem kernel to a normal kernel of the same version, and even after heavy usage (cross-compiling openwrt) no corruption was found.

for sure i can't guarantee that this isn't related to some hardware fault like broken ram or the like but i checked ram with memtest86+.

please adjust Severity and/or Justification if i chose that wrong.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.32-trunk-686 (SMP w/2 CPU cores)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels
reassign 567204 linux-2.6 2.6.32-1
severity 567204 important
thanks

On Wed, Jan 27, 2010 at 11:25:57PM +0100, M. Dietrich wrote:
> my system had serious filesystem corruption with several -bigmem kernels in the past (from 2.6.28 to 2.6.32).

Well, none of mine have.

> for sure i can't guarantee that this isn't related to some hardware fault like broken ram or the like but i checked ram with memtest86+.

A usual test is to run md5sum several times over the same data and see if the result changes. Also, you may want to check that you are using an up-to-date BIOS.

Bastian

--
Dammit Jim, I'm an actor, not a doctor.
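The repeated-checksum test suggested above can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool anyone in this thread used; the function names `file_md5` and `distinct_digests` and the default of five runs are assumptions made for this example.

```python
import hashlib
import sys


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, runs=5):
    """Hash the same file `runs` times and return the set of digests seen."""
    return {file_md5(path) for _ in range(runs)}


if __name__ == "__main__" and len(sys.argv) > 1:
    digests = distinct_digests(sys.argv[1])
    # More than one distinct digest for unchanged data means something
    # between the disk and the CPU is returning different bytes.
    print("stable" if len(digests) == 1 else "MISMATCH", sorted(digests))
```

Which component the test exercises depends on the data: a file that fits in the page cache is mostly re-read from RAM after the first pass, while a file larger than memory forces repeated reads from the disk.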
Bug#567204: linux-image-686-bigmem: serious filesystem corruption with bigmem kernels
On Wed, 2010-01-27 at 23:25 +0100, M. Dietrich wrote:
> Package: linux-image-686-bigmem

This isn't a real package name, so your report is lacking the system information that should be gathered automatically. Please follow up (run 'reportbug -N 567204') to add that system information.

> Severity: grave
> Justification: renders package unusable
[...]
> please adjust Severity and/or Justification if i chose that wrong.

This justification only applies when the package is unusable on all or most systems. However, data loss is grave, so I think the severity is correct.

Ben.

--
Ben Hutchings
The obvious mathematical breakthrough [to break modern encryption] would be development of an easy way to factor large prime numbers. - Bill Gates