Re: btrfs csum failed on git .pack file

2009-09-17 Thread Markus Trippelsdorf
On Thu, Sep 17, 2009 at 08:44:56AM +0200, Jens Axboe wrote:
 On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
  On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
   On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
Just got this error today in my dmesg:
btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 
43905798

linux % find . -inum 1483065
./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack

It's the main pack file from my git linux kernel tree:

   
   Hmm, I ran into something very similar. Care to check what the corrupted
   block of data looks like (and how big it is)?
  
  I've hit the same problem again today:
  
  btrfs csum failed ino 1826333 off 150208512 csum 4148434891 private 
  1660028275
  
  The file in question is:
  ./.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack
  
  I can't read the file directly, because of the csum mismatch:
 
 Chris, is there a way to force reading the file? Seems like that would
 be a very handy feature.
 
 Markus, not sure if that works, but you could always try and remount
 with data checksumming disabled.
 
 mount /dev/fooX -o remount,rw,nodatasum
 
 should do the trick.

That doesn't work unfortunately, btrfs still calculates and compares the
checksums (it won't write new ones I guess).
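
One userspace fallback that does not depend on any mount option is to copy the file block by block and skip whatever the kernel refuses to return. A minimal sketch, assuming the failed checksum surfaces as an I/O error on read for the affected block only (not verified against btrfs here); `salvage_copy` and the 4096-byte block size are illustrative choices, not anything btrfs provides:

```python
import os

BLOCK_SIZE = 4096  # assumed block size; adjust to match the filesystem


def salvage_copy(src_path, dst_path, block_size=BLOCK_SIZE):
    """Copy src to dst block by block, zero-filling blocks that fail to read.

    Returns the list of byte offsets whose reads raised an error.
    """
    bad_offsets = []
    fd = os.open(src_path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        with open(dst_path, "wb") as dst:
            offset = 0
            while offset < size:
                want = min(block_size, size - offset)
                try:
                    data = os.pread(fd, want, offset)
                except OSError:
                    # Unreadable block: remember where it was and keep going.
                    data = b""
                    bad_offsets.append(offset)
                # Pad failed (or short) reads with zeros so offsets stay aligned.
                dst.write(data.ljust(want, b"\0"))
                offset += want
    finally:
        os.close(fd)
    return bad_offsets
```

The returned offsets can be compared with the `off` value in the dmesg line to confirm that only the reported block was lost.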

-- 
Markus
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs csum failed on git .pack file

2009-09-17 Thread Jens Axboe
On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
 On Thu, Sep 17, 2009 at 08:44:56AM +0200, Jens Axboe wrote:
  On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
   On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
 Just got this error today in my dmesg:
 btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 
 43905798
 
 linux % find . -inum 1483065
 ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
 
 It's the main pack file from my git linux kernel tree:
 

Hmm, I ran into something very similar. Care to check what the corrupted
block of data looks like (and how big it is)?
   
   I've hit the same problem again today:
   
   btrfs csum failed ino 1826333 off 150208512 csum 4148434891 private 
   1660028275
   
   The file in question is:
   ./.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack
   
   I can't read the file directly, because of the csum mismatch:
  
  Chris, is there a way to force reading the file? Seems like that would
  be a very handy feature.
  
  Markus, not sure if that works, but you could always try and remount
  with data checksumming disabled.
  
  mount /dev/fooX -o remount,rw,nodatasum
  
  should do the trick.
 
 That doesn't work unfortunately, btrfs still calculates and compares the
 checksums (it won't write new ones I guess).

Ah ok, as mentioned I wasn't sure whether that would work or not. I'll
defer to Chris :-)

-- 
Jens Axboe



Re: btrfs csum failed on git .pack file

2009-09-17 Thread Markus Trippelsdorf
On Thu, Sep 17, 2009 at 11:05:49AM +0200, Jens Axboe wrote:
 On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
  On Thu, Sep 17, 2009 at 08:44:56AM +0200, Jens Axboe wrote:
   On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
 On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
  Just got this error today in my dmesg:
  btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 
  43905798
  
  linux % find . -inum 1483065
  ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
  
  It's the main pack file from my git linux kernel tree:
  
 
 Hmm, I ran into something very similar. Care to check what the 
 corrupted
 block of data looks like (and how big it is)?

I've hit the same problem again today:

btrfs csum failed ino 1826333 off 150208512 csum 4148434891 private 
1660028275

The file in question is:
./.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack

I can't read the file directly, because of the csum mismatch:
   
   Chris, is there a way to force reading the file? Seems like that would
   be a very handy feature.
   
   Markus, not sure if that works, but you could always try and remount
   with data checksumming disabled.
   
   mount /dev/fooX -o remount,rw,nodatasum
   
   should do the trick.
  
  That doesn't work unfortunately, btrfs still calculates and compares the
  checksums (it won't write new ones I guess).
 
 Ah ok, as mentioned I wasn't sure whether that would work or not. I'll
 defer to Chris :-)

Understood.

I did some further investigations and was able to reconstruct exactly
the same pack file in question by starting from an older backup copy of
my git repo and then running the same git commands as before.
Then I did a binary comparison between this reconstructed file and a
corrupted backup copy from the time before the csum errors occurred (I
automatically back up every 4h).

This is the result (first line good pack file, second line corrupted
file):

vbindiff 
debug/.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack 
debug2/.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack

0130 9FA0: E2 3B 43 AA 63 BF 28 B3  87 B7 FD AB DA 74 2D 1C
0130 9FA0: E2 3B 43 AA 63 BF 28 B3  87 33 FD AB DA 74 2D 1C

06CD DF90: B0 22 6B 46 9F ED 6E 47  73 5E 7E EB DA 5F D6 11
06CD DF90: B0 22 6B 46 9F ED 6E 47  73 1E 7E EB DA 5F D6 11

06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 4B 08 94 C0 65 17 3A
06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 0B 08 94 C0 65 17 3A

0802 C3C0: 5C A5 E1 4A 1C BC 14 04  16 4A 29 D3 CC EF A6 80
0802 C3C0: 5C 25 E1 4A 1C BC 14 04  16 48 29 D3 CC EF A6 80

081A B3C0: 7D 7A 2C CD 20 89 E5 F2  A8 D3 32 38 04 BA 8A B5
081A B3C0: 7D 3A 2C CD 20 89 E5 F2  A8 D3 32 38 04 BA 8A B5

098E C430: FE 24 4A 19 09 F4 D5 1F  22 E8 36 FA F8 55 B2 6E
098E C430: FE 24 4A 19 09 F4 D5 1F  22 E0 36 FA F8 55 B2 6E

098E C440: 1B 3F C1 B4 BB 80 F8 5A  FB EE 0D A3 3F C5 A4 DB
098E C440: 1B 3D C1 B4 BB 80 F8 5A  FB EE 0D A3 3F C5 A4 DB

098E C4D0: F8 6C E2 65 18 7A 5D 33  2E 35 77 64 B2 81 BE DF
098E C4D0: F8 6C E2 65 18 7A 5D 33  2E 25 77 64 B2 81 BE DF

098E C4E0: 05 18 DE E3 00 78 D2 2C  4F 91 8F AF 0B F6 0C 31
098E C4E0: 05 1C DE E3 00 78 D2 2C  4F 91 8F AF 0B F6 0C 31

098E C500: 0A 12 D3 E7 FA B8 40 DE  0D 71 94 88 5D 4C 97 21
098E C500: 0A 12 D3 E7 FA B8 40 DE  0D 51 94 88 5D 4C 97 21

098E C540: 93 F2 58 C7 49 9A AA EB  30 3D 28 AA E3 09 4B 7B
098E C540: 93 F2 58 C7 49 9A AA EB  30 3C 28 AA E3 09 4B 7B

0FDE C420: F3 6A C2 38 76 43 9E 86  0D 9C 89 86 F1 E6 B0 F2
0FDE C420: F3 6A C2 38 76 43 9E 86  0D DC 89 86 F1 E6 B0 F2

0FDE C430: 38 E4 69 2E 22 1D E4 FF  90 A7 C6 E8 9F 08 4C 98
0FDE C430: 38 E4 69 2E 22 1D E4 FF  90 A5 C6 E8 9F 08 4C 98

1214 A4C0: 24 D6 56 AC 8B D8 D0 9B  D2 62 7B 83 C7 0B 3D BE
1214 A4C0: 24 D4 56 AC 8B D8 D0 9B  D2 62 7B 83 C7 0B 3D BE

1214 A500: EC 51 D3 FF C5 7D 30 DD  6D 45 50 FE E9 64 A4 FC
1214 A500: EC 11 D3 FF C5 7D 30 DD  6D 45 50 FE E9 64 A4 FC

1214 A520: D9 4D 63 EB 77 4D F0 BE  5E B3 6B DE E6 D2 28 67
1214 A520: D9 4D 63 EB 77 4D F0 BE  5E 33 6B DE E6 D2 28 67
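
A listing like the one above can also be produced without a hex viewer. The following is only a sketch of such a comparison, not what vbindiff does internally; `diff_bytes` is a hypothetical helper, and bytes beyond the end of the shorter file are ignored:

```python
def diff_bytes(good_path, bad_path, chunk_size=1 << 20):
    """Yield (offset, good_byte, bad_byte, xor_mask) for every differing byte."""
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        base = 0
        while True:
            a = good.read(chunk_size)
            b = bad.read(chunk_size)
            if not a and not b:
                break
            if a != b:
                # Only walk byte by byte through chunks that actually differ.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        yield base + i, x, y, x ^ y
            base += max(len(a), len(b))
```

Printing each result as `f"{offset:08X}: {good:02X} -> {bad:02X} (xor {mask:08b})"` gives one line per damaged byte, with the XOR mask showing which bits changed.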

-- 
Markus


Re: btrfs csum failed on git .pack file

2009-09-17 Thread Markus Trippelsdorf
On Thu, Sep 17, 2009 at 02:15:01PM +0200, Markus Trippelsdorf wrote:
 On Thu, Sep 17, 2009 at 11:05:49AM +0200, Jens Axboe wrote:
  On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
   On Thu, Sep 17, 2009 at 08:44:56AM +0200, Jens Axboe wrote:
On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
 On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
  On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
   Just got this error today in my dmesg:
   btrfs csum failed ino 1483065 off 158482432 csum 4283543305 
   private 43905798
   
   linux % find . -inum 1483065
   ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
   
   It's the main pack file from my git linux kernel tree:
   
  
  Hmm, I ran into something very similar. Care to check what the 
  corrupted
  block of data looks like (and how big it is)?
 
 I've hit the same problem again today:
 
 btrfs csum failed ino 1826333 off 150208512 csum 4148434891 private 
 1660028275
 
 The file in question is:
 ./.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack
 
 I can't read the file directly, because of the csum mismatch:

Chris, is there a way to force reading the file? Seems like that would
be a very handy feature.

Markus, not sure if that works, but you could always try and remount
with data checksumming disabled.

mount /dev/fooX -o remount,rw,nodatasum

should do the trick.
   
   That doesn't work unfortunately, btrfs still calculates and compares the
   checksums (it won't write new ones I guess).
  
  Ah ok, as mentioned I wasn't sure whether that would work or not. I'll
  defer to Chris :-)
 
 Understood.
 
 I did some further investigations and was able to reconstruct exactly
 the same pack file in question by starting from an older backup copy of
 my git repo and then running the same git commands as before.
 Then I did a binary comparison between this reconstructed file and a
 corrupted backup copy from the time before the csum errors occurred (I
 automatically back up every 4h).
 
Thanks to Chris' patch (from IRC) I was able to compare the file with
the csum error to the reconstructed one. You'll find the results as
attachments.

-- 
Markus

08F403A0   5D 8E B3 32  7D 8F 5D E7  54 B6 9D 1E  E6 0C 9B 0D  BE 1D 9D 0C  34 BA 7F FE  7F D4 E5 1A  0A 16 29 96
105AC3A0   76 80 1E 0A  3F 8A 7E FC  B3 2E 2B 9E  9E 53 82 10  C3 F6 4B C1  C0 12 FC 61  A5 0E 63 70  B0 A4 7B 27
105AC3C0   DC AE 26 CE  48 5D CA 07  B7 26 B6 3C  BC 91 AD 00  55 97 BF E4  8C D7 EF AA  28 B7 54 65  30 DB 78 A6
105AC3E0   26 90 18 88  8F F4 25 91  48 5F 9C F6  4F 0D 46 72  A2 04 77 1A  AF FB 88 23  93 AF FB AA  B9 82 BC CC

08F403A0   5D 8E B3 32  7D 8F 5D E7  54 B4 9D 1E  E6 0C 9B 0D  BE 1D 9D 0C  34 BA 7F FE  7F D4 E5 1A  0A 16 29 96
105AC3A0   76 80 1E 0A  3F 8A 7E FC  B3 2E 2B 9E  9E 53 82 10  C3 F7 4B C1  C0 12 FC 61  A5 0E 63 70  B0 A4 7B 27
105AC3C0   DC AE 26 CE  48 5D CA 07  B7 77 B6 3C  BC 91 AD 00  55 97 BF E4  8C D7 EF AA  28 A7 54 65  30 DB 78 A6
105AC3E0   26 90 18 88  8F F4 25 91  48 5F 9C F6  4F 0D 46 72  A2 04 77 1A  AF FB 88 23  93 AF FB AA  B9 82 BC CC


Re: btrfs csum failed on git .pack file

2009-09-17 Thread Zach Brown

 0130 9FA0: E2 3B 43 AA 63 BF 28 B3  87 B7 FD AB DA 74 2D 1C
 0130 9FA0: E2 3B 43 AA 63 BF 28 B3  87 33 FD AB DA 74 2D 1C

B7 = 10110111
33 = 00110011

 06CD DF90: B0 22 6B 46 9F ED 6E 47  73 5E 7E EB DA 5F D6 11
 06CD DF90: B0 22 6B 46 9F ED 6E 47  73 1E 7E EB DA 5F D6 11

5E = 01011110
1E = 00011110

 06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 4B 08 94 C0 65 17 3A
 06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 0B 08 94 C0 65 17 3A

4B = 01001011
0B = 00001011

And so on.

It looks like a few bits are getting flipped at the same byte offset.
One can imagine software bugs that would do this, certainly, but upset
hardware seems awfully likely too.
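
The same comparison can be done mechanically by XOR-ing each pair of rows: set bits in the result mark the flipped positions. A small sketch using the three row pairs quoted above (`flipped_bits` is just an illustrative helper name):

```python
# (good row, corrupted row) pairs copied from the vbindiff output in this thread.
ROWS = [
    ("E2 3B 43 AA 63 BF 28 B3 87 B7 FD AB DA 74 2D 1C",
     "E2 3B 43 AA 63 BF 28 B3 87 33 FD AB DA 74 2D 1C"),
    ("B0 22 6B 46 9F ED 6E 47 73 5E 7E EB DA 5F D6 11",
     "B0 22 6B 46 9F ED 6E 47 73 1E 7E EB DA 5F D6 11"),
    ("0D 86 2B B2 57 A4 5A CD 78 4B 08 94 C0 65 17 3A",
     "0D 86 2B B2 57 A4 5A CD 78 0B 08 94 C0 65 17 3A"),
]


def flipped_bits(good_row, bad_row):
    """Return (column, xor_mask_in_binary) for each byte that differs."""
    good = bytes.fromhex(good_row)
    bad = bytes.fromhex(bad_row)
    return [
        (col, format(g ^ b, "08b"))
        for col, (g, b) in enumerate(zip(good, bad))
        if g != b
    ]


for good_row, bad_row in ROWS:
    print(flipped_bits(good_row, bad_row))
# [(9, '10000100')]
# [(9, '01000000')]
# [(9, '01000000')]
```

All three differences sit in column 9 of their 16-byte row, and each mask has only one or two bits set.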

- z


Re: btrfs csum failed on git .pack file

2009-09-17 Thread Markus Trippelsdorf
On Thu, Sep 17, 2009 at 10:00:28AM -0700, Zach Brown wrote:
 
  0130 9FA0: E2 3B 43 AA 63 BF 28 B3  87 B7 FD AB DA 74 2D 1C
  0130 9FA0: E2 3B 43 AA 63 BF 28 B3  87 33 FD AB DA 74 2D 1C
 
 B7 = 10110111
 33 = 00110011
 
  06CD DF90: B0 22 6B 46 9F ED 6E 47  73 5E 7E EB DA 5F D6 11
  06CD DF90: B0 22 6B 46 9F ED 6E 47  73 1E 7E EB DA 5F D6 11
 
 5E = 01011110
 1E = 00011110
 
  06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 4B 08 94 C0 65 17 3A
  06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 0B 08 94 C0 65 17 3A
 
 4B = 01001011
 0B = 00001011
 
 And so on.
 
 It looks like a few bits are getting flipped at the same byte offset.
 One can imagine software bugs that would do this, certainly, but upset
 hardware seems awfully likely too.

I'm afraid you're right. I did some further tests and now I'm pretty
sure that a bad RAM module was the root cause of it all...
Oh well.

-- 
Markus


Re: btrfs csum failed on git .pack file

2009-09-17 Thread Tomasz Torcz
On Thu, Sep 17, 2009 at 07:10:06PM +0200, Markus Trippelsdorf wrote:
   06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 4B 08 94 C0 65 17 3A
   06CD DFC0: 0D 86 2B B2 57 A4 5A CD  78 0B 08 94 C0 65 17 3A
  
  4B = 01001011
  0B = 00001011
  
  And so on.
  
  It looks like a few bits are getting flipped at the same byte offset.
  One can imagine software bugs that would do this, certainly, but upset
  hardware seems awfully likely too.
 
 I'm afraid you're right. I did some further tests and now I'm pretty
 sure that a bad RAM module was the root cause of it all...
 Oh well.

  On the other hand, that's what's so great about checksumming filesystems.
You found the bad module thanks to btrfs; otherwise you wouldn't have suspected
anything was wrong. If you had used raid-1 for data, this corruption would
have been fixed by btrfs.
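
As a rough illustration of that repair path (not btrfs code): on a mirrored read, the filesystem can checksum each copy, return the one that matches, and rewrite the one that does not. The sketch below assumes two in-memory copies of a block and uses `zlib.crc32` as a stand-in for the CRC-32C that btrfs uses; `read_with_repair` is a made-up name.

```python
import zlib


def read_with_repair(copies, expected_csum):
    """Return (data, repaired_indexes) for a mirrored block.

    copies is a mutable list of bytes objects, one per mirror. The first
    copy whose checksum matches is treated as good, and every copy that
    differs from it is overwritten in place.
    """
    good = next((c for c in copies if zlib.crc32(c) == expected_csum), None)
    if good is None:
        # No mirror matches the stored checksum: nothing to repair from.
        raise IOError("csum failed on every copy")
    repaired = []
    for i, c in enumerate(copies):
        if c != good:
            copies[i] = good
            repaired.append(i)
    return good, repaired
```

This only helps when at least one copy still matches the stored checksum.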

-- 
Tomasz Torcz   72-|   80-|
xmpp: zdzich...@chrome.pl  72-|   80-|



Re: btrfs csum failed on git .pack file

2009-09-10 Thread Bryan Østergaard
On Wed, Sep 9, 2009 at 11:01 PM, Oliver Mattos
<oliver.matto...@imperial.ac.uk> wrote:

 What a strange coincidence that it affected git pack files in both cases.
 It's almost too improbable...

I had similar problems with a broken git repository about two weeks
ago. This was on a regular laptop harddrive that's never reported any
errors.

Unfortunately I rm'ed the repository and cloned it again so I can't
check exactly what caused the corruption. Interestingly I've just
discovered a broken tar.bz2 file that shows similar symptoms as what's
been described here earlier.

The first (and by far largest) chunk of the file consists entirely of
0x01 bytes followed by a smaller chunk that appears to be a PNG file
and then arch/sparc/include/asm/fhc.h from the linux kernel. After
this I have a small chunk of 0x00 bytes followed by
arch/sparc/include/asm/floppy.h.

This pattern is repeated several times with different include files
from the kernel sources and the file ends with a small chunk of 0x01
bytes again.
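
Long stretches of a single repeated byte like these are easy to locate programmatically. A minimal sketch, assuming the file fits in memory; `constant_runs` and the 4096-byte threshold are illustrative choices:

```python
from itertools import groupby


def constant_runs(data, min_len=4096):
    """Return (offset, length, byte_value) for each run of one repeated byte."""
    runs = []
    offset = 0
    for value, group in groupby(data):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((offset, length, value))
        offset += length
    return runs
```

Calling it as `constant_runs(open(path, "rb").read())` lists where each 0x00, 0x01 or 0xff chunk starts and how large it is.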

The harddisk in question is:
=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MHV series
Device Model:     FUJITSU MHV2080BH
Serial Number:    NW05T6425FRY
Firmware Version: 00840028
User Capacity:    80,025,280,000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Thu Sep 10 12:40:10 2009 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

As already mentioned it's never reported any errors and I also haven't
seen any problems like this before when using ext3 or ext4. The broken
file is available at http://omploader.org/vMmJtbg if that's any help.

Regards,
Bryan Østergaard


Re: btrfs csum failed on git .pack file

2009-09-09 Thread Markus Trippelsdorf
On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
 On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
  On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
   On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
Just got this error today in my dmesg:
btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 
43905798

linux % find . -inum 1483065
./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack

It's the main pack file from my git linux kernel tree:

   
   Hmm, I ran into something very similar. Care to check what the corrupted
   block of data looks like (and how big it is)?
  
  I've already deleted the file in question unfortunately.
  On IRC Chris decided that either bad RAM or a harddrive error was the
  most likely reason for this checksum mismatch.
 
 Darn, that's too bad. The corruption issue I had was also in a git pack
 file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
 in the file, and I blamed it on the (cheap) SSD drive that hosted the
 local git repo. It's still the most likely explanation given the nature
 of the problem, however it would have been really interesting to see
 what corruption you had.

If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
be using the same hardware (30GB Vertex in my case). 
What a strange coincidence that it affected git pack files in both cases.
It's almost too improbable...

-- 
Markus


Re: btrfs csum failed on git .pack file

2009-09-09 Thread Markus Trippelsdorf
On Wed, Sep 09, 2009 at 09:01:41AM +0200, Jens Axboe wrote:
 On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
  On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
   On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
 On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
  Just got this error today in my dmesg:
  btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 
  43905798
  
  linux % find . -inum 1483065
  ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
  
  It's the main pack file from my git linux kernel tree:
  
 
 Hmm, I ran into something very similar. Care to check what the 
 corrupted
 block of data looks like (and how big it is)?

I've already deleted the file in question unfortunately.
On IRC Chris decided that either bad RAM or a harddrive error was the
most likely reason for this checksum mismatch.
   
   Darn, that's too bad. The corruption issue I had was also in a git pack
   file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
   in the file, and I blamed it on the (cheap) SSD drive that hosted the
   local git repo. It's still the most likely explanation given the nature
   of the problem, however it would have been really interesting to see
   what corruption you had.
  
  If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
  be using the same hardware (30GB Vertex in my case). 
 
 Spooky, yes indeed that's the very same drive I'm using. Also see my
 postings on this very issue here, top two entries:
 
 http://axboe.livejournal.com/
 
 So that pretty much looks like it reaffirms some of my suspicions. Is
 the drive in a laptop that you suspend and resume?

No. I use it in my workstation, that I never switch off normally.

  What a strange coincidence that it affected git pack files in both cases.
  It's almost too improbable...
 
 Probably more than a coincidence I think, the question is what though...

If it really was an SSD error, then it should happen randomly, messing up
random files. But (contrary to your experience) I never had any issues with 
this SSD until this single failed checksum.

-- 
Markus


Re: btrfs csum failed on git .pack file

2009-09-09 Thread Jens Axboe
On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
 On Wed, Sep 09, 2009 at 09:01:41AM +0200, Jens Axboe wrote:
  On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
   On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
 On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
  On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
   Just got this error today in my dmesg:
   btrfs csum failed ino 1483065 off 158482432 csum 4283543305 
   private 43905798
   
   linux % find . -inum 1483065
   ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
   
   It's the main pack file from my git linux kernel tree:
   
  
  Hmm, I ran into something very similar. Care to check what the 
  corrupted
  block of data looks like (and how big it is)?
 
 I've already deleted the file in question unfortunately.
 On IRC Chris decided that either bad RAM or a harddrive error was the
 most likely reason for this checksum mismatch.

Darn, that's too bad. The corruption issue I had was also in a git pack
file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
in the file, and I blamed it on the (cheap) SSD drive that hosted the
local git repo. It's still the most likely explanation given the nature
of the problem, however it would have been really interesting to see
what corruption you had.
   
   If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
   be using the same hardware (30GB Vertex in my case). 
  
  Spooky, yes indeed that's the very same drive I'm using. Also see my
  postings on this very issue here, top two entries:
  
  http://axboe.livejournal.com/
  
  So that pretty much looks like it reaffirms some of my suspicions. Is
  the drive in a laptop that you suspend and resume?
 
 No. I use it in my workstation, that I never switch off normally.

OK, so we can rule out any interactions between suspending and resuming
the drive. That's at least something.

   What a strange coincidence that it affected git pack files in both cases.
   It's almost too improbable...
  
  Probably more than a coincidence I think, the question is what though...
 
 If it really was an SSD error, then it should happen randomly, messing up
 random files. But (contrary to your experience) I never had any issues with 
 this SSD until this single failed checksum.

Not necessarily, there may be some pattern to how the pack files are
accessed (that propagates through to the drive). The fact is, 0xff is an
extremely weird piece of corruption that just reeks of bad flash blocks.
It's almost impossible that it is a software error. If it was all
zeroes, or a bit flip, the likely causes would be very different.

-- 
Jens Axboe



Re: btrfs csum failed on git .pack file

2009-09-09 Thread Daniel J Blueman
On Wed, Sep 9, 2009 at 8:01 AM, Jens Axboe <jens.ax...@oracle.com> wrote:
 On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
 On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
  On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
   On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
 Just got this error today in my dmesg:
 btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 
 43905798

 linux % find . -inum 1483065
 ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack

 It's the main pack file from my git linux kernel tree:

   
Hmm, I ran into something very similar. Care to check what the 
corrupted
block of data looks like (and how big it is)?
  
   I've already deleted the file in question unfortunately.
   On IRC Chris decided that either bad RAM or a harddrive error was the
   most likely reason for this checksum mismatch.
 
  Darn, that's too bad. The corruption issue I had was also in a git pack
  file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
  in the file, and I blamed it on the (cheap) SSD drive that hosted the
  local git repo. It's still the most likely explanation given the nature
  of the problem, however it would have been really interesting to see
  what corruption you had.

 If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
 be using the same hardware (30GB Vertex in my case).

 Spooky, yes indeed that's the very same drive I'm using. Also see my
 postings on this very issue here, top two entries:

 http://axboe.livejournal.com/

 So that pretty much looks like it reaffirms some of my suspicions. Is
 the drive in a laptop that you suspend and resume?

If you're on firmware < 1.30, the changelog includes some fixes which
may be relevant, e.g. if block 0 is relative, or you're
suspending/resuming:

- Race condition occurred during soft reset handler
- If read fail occurs during reading stamp information, firmware
corrupted block 0.
- Power off recovery had bug in certain circumstances

http://www.ocztechnologyforum.com/forum/showthread.php?t=57516
-- 
Daniel J Blueman


Re: btrfs csum failed on git .pack file

2009-09-09 Thread Jens Axboe
On Wed, Sep 09 2009, Daniel J Blueman wrote:
 On Wed, Sep 9, 2009 at 8:01 AM, Jens Axboe <jens.ax...@oracle.com> wrote:
  On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
  On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
   On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
 On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
  Just got this error today in my dmesg:
  btrfs csum failed ino 1483065 off 158482432 csum 4283543305 
  private 43905798
 
  linux % find . -inum 1483065
  ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
 
  It's the main pack file from my git linux kernel tree:
 

 Hmm, I ran into something very similar. Care to check what the 
 corrupted
 block of data looks like (and how big it is)?
   
I've already deleted the file in question unfortunately.
On IRC Chris decided that either bad RAM or a harddrive error was the
most likely reason for this checksum mismatch.
  
   Darn, that's too bad. The corruption issue I had was also in a git pack
   file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
   in the file, and I blamed it on the (cheap) SSD drive that hosted the
   local git repo. It's still the most likely explanation given the nature
   of the problem, however it would have been really interesting to see
   what corruption you had.
 
  If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
  be using the same hardware (30GB Vertex in my case).
 
  Spooky, yes indeed that's the very same drive I'm using. Also see my
  postings on this very issue here, top two entries:
 
  http://axboe.livejournal.com/
 
  So that pretty much looks like it reaffirms some of my suspicions. Is
  the drive in a laptop that you suspend and resume?
 
 If you're on firmware < 1.30, the changelog includes some fixes which
 may be relevant, e.g. if block 0 is relative, or you're
 suspending/resuming:
 
 - Race condition occurred during soft reset handler
 - If read fail occurs during reading stamp information, firmware
 corrupted block 0.
 - Power off recovery had bug in certain circumstances
 
 http://www.ocztechnologyforum.com/forum/showthread.php?t=57516

The issue is pretty much moot at this point, since OCZ support were not
really interested in providing any sort of real technical support to
find out what really caused this issue. My main worry was reliability of
these cheaper SSD drives, and that worry is still not resolved. If you
read the blog entries, I do comment on the apparently scary basic bugs
that are still being fixed on the Indilinx controllers. I do expect some
basic level of data integrity from a consumer product and at least some
interest in resolving weird corruption issues if things go wrong. Since
OCZ cannot provide anything like that, I have a hard time recommending
these drives for anything but very casual use. Fast, cheap, reliable.
Pick any two.

My drive was running 1.10 at the time of the problem.

-- 
Jens Axboe



Re: btrfs csum failed on git .pack file

2009-09-09 Thread Daniel J Blueman
On Wed, Sep 9, 2009 at 9:26 AM, Jens Axboe <jens.ax...@oracle.com> wrote:
 On Wed, Sep 09 2009, Daniel J Blueman wrote:
 On Wed, Sep 9, 2009 at 8:01 AM, Jens Axboe <jens.ax...@oracle.com> wrote:
  On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
  On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
   On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
 On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
  Just got this error today in my dmesg:
  btrfs csum failed ino 1483065 off 158482432 csum 4283543305 
  private 43905798
 
  linux % find . -inum 1483065
  ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
 
  It's the main pack file from my git linux kernel tree:
 

 Hmm, I ran into something very similar. Care to check what the 
 corrupted
 block of data looks like (and how big it is)?
   
I've already deleted the file in question unfortunately.
On IRC Chris decided that either bad RAM or a harddrive error was the
most likely reason for this checksum mismatch.
  
   Darn, that's too bad. The corruption issue I had was also in a git pack
   file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
   in the file, and I blamed it on the (cheap) SSD drive that hosted the
   local git repo. It's still the most likely explanation given the nature
   of the problem, however it would have been really interesting to see
   what corruption you had.
 
  If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
  be using the same hardware (30GB Vertex in my case).
 
  Spooky, yes indeed that's the very same drive I'm using. Also see my
  postings on this very issue here, top two entries:
 
  http://axboe.livejournal.com/
 
  So that pretty much looks like it reaffirms some of my suspicions. Is
  the drive in a laptop that you suspend and resume?

 If you're on firmware < 1.30, the changelog includes some fixes which
 may be relevant, e.g. if block 0 is relative, or you're
 suspending/resuming:

 - Race condition occurred during soft reset handler
 - If read fail occurs during reading stamp information, firmware
 corrupted block 0.
 - Power off recovery had bug in certain circumstances

 http://www.ocztechnologyforum.com/forum/showthread.php?t=57516

 The issue is pretty much moot at this point, since OCZ support were not
 really interested in providing any sort of real technical support to
 find out what really caused this issue. My main worry was reliability of
 these cheaper SSD drives, and that worry is still not resolved. If you
 read the blog entries, I do comment on the apparently scary basic bugs
 that are still being fixed on the Indilinx controllers. I do expect some
 basic level of data integrity from a consumer product and at least some
 interest in resolving weird corruption issues if things go wrong. Since
 OCZ cannot provide anything like that, I have a hard time recommending
 these drives for anything but very casual use. Fast, cheap, reliable.
 Pick any two.

 My drive was running 1.10 at the time of the problem.

It looks like we need a small tool which performs patterned block I/O
to the device, updating a checksum as it goes, and performing
integrity sweeps at intervals, lower level than fsx. It must be
trusted or not.
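
A rough sketch of what such a tool could look like, purely as an
illustration. The names (make_block, write_pattern, sweep), the 4 KiB
block size and the block layout are all assumptions, not an existing
utility. Each block encodes its own index and a seed, plus a CRC32 of
the payload, so a sweep can tell garbage apart from an intact block that
landed in the wrong place. A real tool would also need O_DIRECT (or a
cache drop) so the sweep reads the medium rather than the page cache.

```python
import os
import struct
import zlib

BLOCK_SIZE = 4096  # assumed block size; adjust to the device under test


def make_block(index, seed):
    """Build a block derived from its index and a seed, ending in a CRC32."""
    header = struct.pack("<QQ", index, seed)
    payload = (header * (BLOCK_SIZE // len(header)))[: BLOCK_SIZE - 4]
    return payload + struct.pack("<I", zlib.crc32(payload))


def write_pattern(path, blocks, seed):
    """Write `blocks` patterned blocks to a scratch file or device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for i in range(blocks):
            os.pwrite(fd, make_block(i, seed), i * BLOCK_SIZE)
        os.fsync(fd)
    finally:
        os.close(fd)


def check_block(data, index, seed):
    """Return None if the block is as written, else a short reason string."""
    if len(data) != BLOCK_SIZE:
        return "short"
    stored = struct.unpack("<I", data[-4:])[0]
    if zlib.crc32(data[:-4]) != stored:
        return "corrupt"    # content does not match its own checksum
    if data != make_block(index, seed):
        return "misplaced"  # intact block, but not the one expected here
    return None


def sweep(path, blocks, seed):
    """Re-read every block and return a list of (index, reason) mismatches."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            reason = check_block(f.read(BLOCK_SIZE), i, seed)
            if reason is not None:
                bad.append((i, reason))
    return bad
```

Running write_pattern once and then sweep at intervals (for example
around suspend/resume cycles) would show whether blocks change on their
own; this only overwrites the target, so it belongs on a scratch device.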

I had a problem like this with nVidia CK804/MCP55 chipsets corrupting
data under a triple-edge case workload.
-- 
Daniel J Blueman


Re: btrfs csum failed on git .pack file

2009-09-09 Thread Chris Mason
On Wed, Sep 09, 2009 at 09:37:42AM +0100, Daniel J Blueman wrote:
 
  http://www.ocztechnologyforum.com/forum/showthread.php?t=57516
 
  The issue is pretty much moot at this point, since OCZ support were not
  really interested in providing any sort of real technical support to
  find out what really caused this issue. My main worry was reliability of
  these cheaper SSD drives, and that worry is still not resolved. If you
  read the blog entries, I do comment on the apparently scary basic bugs
  that are still being fixed on the Indilinx controllers. I do expect some
  basic level of data integrity from a consumer product and at least some
  interest in resolving weird corruption issues if things go wrong. Since
  OCZ cannot provide anything like that, I have a hard time recommending
  these drives for anything but very casual use. Fast, cheap, reliable.
  Pick any two.
 
  My drive was running 1.10 at the time of the problem.
 
 It looks like we need a small tool which performs patterned block I/O
 to the device, updating a checksum as it goes, and performing
 integrity sweeps at intervals, lower level than fsx. It must be
 trusted or not.
 
 I had a problem like this with nVidia CK804/MCP55 chipsets corrupting
 data under a triple-edge case workload.

Well, just use git ;)  Apply a bunch of patches (say the mm tree) with
guilt and repack in a loop.
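
The repack half of that loop could be scripted roughly as below. This is
only a sketch: the function name repack_and_verify and the injectable
`run` parameter are made up for illustration, the patch-applying step
with guilt is left out, and it assumes a git new enough to accept -C.
It rewrites every pack from scratch and then lets git check the result.

```python
import subprocess


def repack_and_verify(repo, rounds=10, run=subprocess.run):
    """Repack `repo` repeatedly; return the first round whose fsck fails.

    Returns None if every round verifies cleanly. `run` defaults to
    subprocess.run and is a parameter only so it can be swapped out.
    """
    for n in range(1, rounds + 1):
        # -a -d: put everything into one new pack and drop the old ones.
        # -f: recompute deltas instead of reusing them, so all data is rewritten.
        run(["git", "-C", repo, "repack", "-a", "-d", "-f"], check=True)
        # fsck re-reads the objects and verifies their checksums.
        result = run(["git", "-C", repo, "fsck", "--full"], check=False)
        if result.returncode != 0:
            return n
    return None
```

Something like repack_and_verify("/path/to/linux", rounds=50) on the
suspect drive would then generate a lot of large sequential writes with
a built-in integrity check after each pass.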

-chris



Re: btrfs csum failed on git .pack file

2009-09-09 Thread Oliver Mattos



What a strange coincidence that it affected git pack files in both cases.
It's almost too improbable...



Probably more than a coincidence I think, the question is what though...


Some SSD drives (or rather the cheap wear levelling controllers in things
like USB sticks) have firmware which tries to recognise certain data
structures of common filesystems (like FAT and NTFS), and uses information
in those data structures to optimise the allocation and erasure of blocks
(for example the free space linked list in FAT).  If the data you were
saving to the disk was similar to one of those data structures, you might've
triggered one of those algorithms, which would cause data corruption.  This
is common in high performance usb sticks because they want to pre-erase
blocks on file deletion for operating systems not supporting SCSI TRIM - I
imagine the same technology might carry across to cheap SSDs.

Not much BTRFS can do about it though.  If the piece of data that triggers
the bug could be identified, workarounds could possibly be introduced for
the particular buggy controllers.

Oliver Mattos

(resent as I emailed the wrong recipients before)




Re: btrfs csum failed on git .pack file

2009-09-08 Thread Jens Axboe
On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
 On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
  On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
   Just got this error today in my dmesg:
   btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798
   
   linux % find . -inum 1483065
   ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
   
   It's the main pack file from my git linux kernel tree:
   
   linux % ls -l ./.git/objects/pack/
   total 562848
   -rw-r--r-- 1 markus markus   1891324 2008-11-29 19:49 pack-011b43fa6956667db5e67fba859e40cb4b154226.idx
   -rw-r--r-- 1 markus markus  44002938 2008-11-29 19:54 pack-011b43fa6956667db5e67fba859e40cb4b154226.pack.temp
   -rw-r--r-- 1 markus markus    730332 2008-11-29 19:49 pack-67be92b3fab3dab175683582dab0b719517e55a5.idx
   -r--r--r-- 1 markus markus  36061684 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.idx
   -r--r--r-- 1 markus markus 335202742 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
   -rw------- 1 markus markus 158457856 2009-09-07 22:15 tmp_pack_OUdxER
   
   I'm running the latest git kernel and I've been using btrfs as my root
   fs for the last few weeks without problems so far.
  
  Hmm, I ran into something very similar. Care to check what the corrupted
  block of data looks like (and how big it is)?
 
 I've already deleted the file in question unfortunately.
 On IRC Chris decided that either bad RAM or a hard drive error was the
 most likely reason for this checksum mismatch.

Darn, that's too bad. The corruption issue I had was also in a git pack
file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
in the file, and I blamed it on the (cheap) SSD drive that hosted the
local git repo. It's still the most likely explanation given the nature
of the problem, however it would have been really interesting to see
what corruption you had.
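
For anyone who still has a readable copy of a suspect file, a quick way
to look for that kind of damage is to scan for whole blocks filled with
one byte value. The sketch below is only an illustration: the name
find_fill_runs and the 4 KiB block size are assumptions. Pack data is
compressed, so a long run of identical bytes is very unlikely to be
legitimate content.

```python
def find_fill_runs(path, fill=0xFF, block_size=4096):
    """Return (offset, length) for each run of blocks consisting only of `fill`."""
    pattern = bytes([fill]) * block_size
    runs = []
    start = None   # offset where the current run began, if any
    offset = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            # Compare against a prefix so a short final block is handled too.
            if block == pattern[: len(block)]:
                if start is None:
                    start = offset
            elif start is not None:
                runs.append((start, offset - start))
                start = None
            offset += len(block)
    if start is not None:
        runs.append((start, offset - start))
    return runs
```

A 16 KB hole of 0xff would show up as a single entry with length 16384.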

-- 
Jens Axboe



Re: btrfs csum failed on git .pack file

2009-09-08 Thread Jens Axboe
On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
 Just got this error today in my dmesg:
 btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798
 
 linux % find . -inum 1483065
 ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
 
 It's the main pack file from my git linux kernel tree:
 
 linux % ls -l ./.git/objects/pack/
 total 562848
 -rw-r--r-- 1 markus markus   1891324 2008-11-29 19:49 pack-011b43fa6956667db5e67fba859e40cb4b154226.idx
 -rw-r--r-- 1 markus markus  44002938 2008-11-29 19:54 pack-011b43fa6956667db5e67fba859e40cb4b154226.pack.temp
 -rw-r--r-- 1 markus markus    730332 2008-11-29 19:49 pack-67be92b3fab3dab175683582dab0b719517e55a5.idx
 -r--r--r-- 1 markus markus  36061684 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.idx
 -r--r--r-- 1 markus markus 335202742 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
 -rw------- 1 markus markus 158457856 2009-09-07 22:15 tmp_pack_OUdxER
 
 I'm running the latest git kernel and I've been using btrfs as my root
 fs for the last few weeks without problems so far.

Hmm, I ran into something very similar. Care to check what the corrupted
block of data looks like (and how big it is)?
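
To narrow down where to look, the numbers in the dmesg line can be
turned into a file and a block index. A small sketch, with the caveat
that the helper names (parse_csum_error, find_by_inode) are made up and
the 4 KiB sector size is an assumption; find_by_inode does the same job
as the find -inum call quoted above.

```python
import os
import re

CSUM_RE = re.compile(
    r"btrfs csum failed ino (\d+) off (\d+) csum (\d+) private (\d+)"
)


def parse_csum_error(line, sector_size=4096):
    """Extract the fields of a 'btrfs csum failed' line, or return None."""
    m = CSUM_RE.search(line)
    if m is None:
        return None
    ino, off, csum, private = map(int, m.groups())
    # Block index within the file, assuming `sector_size` byte blocks.
    return {"ino": ino, "off": off, "csum": csum, "private": private,
            "block": off // sector_size}


def find_by_inode(root, ino):
    """Yield paths of regular directory entries under `root` with inode `ino`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_ino == ino:
                yield path
```

For the line quoted above, the offset 158482432 is a multiple of 4096,
so under that assumption the bad data starts at block 38692 of the file.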

-- 
Jens Axboe



Re: btrfs csum failed on git .pack file

2009-09-08 Thread Markus Trippelsdorf
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
 On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
  Just got this error today in my dmesg:
  btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798
  
  linux % find . -inum 1483065
  ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
  
  It's the main pack file from my git linux kernel tree:
  
  linux % ls -l ./.git/objects/pack/
  total 562848
  -rw-r--r-- 1 markus markus   1891324 2008-11-29 19:49 pack-011b43fa6956667db5e67fba859e40cb4b154226.idx
  -rw-r--r-- 1 markus markus  44002938 2008-11-29 19:54 pack-011b43fa6956667db5e67fba859e40cb4b154226.pack.temp
  -rw-r--r-- 1 markus markus    730332 2008-11-29 19:49 pack-67be92b3fab3dab175683582dab0b719517e55a5.idx
  -r--r--r-- 1 markus markus  36061684 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.idx
  -r--r--r-- 1 markus markus 335202742 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
  -rw------- 1 markus markus 158457856 2009-09-07 22:15 tmp_pack_OUdxER
  
  I'm running the latest git kernel and I've been using btrfs as my root
  fs for the last few weeks without problems so far.
 
 Hmm, I ran into something very similar. Care to check what the corrupted
 block of data looks like (and how big it is)?

I've already deleted the file in question unfortunately.
On IRC Chris decided that either bad RAM or a hard drive error was the
most likely reason for this checksum mismatch.

-- 
Markus