Re: nonzero mismatch_cnt with no earlier error

2007-03-04 Thread Tejun Heo
Eyal Lebedinsky wrote: > I CC'ed linux-ide to see if they think the reported error was really innocent: > > Question: does this error report suggest that a disk could be corrupted? > > This SATA disk is part of an md raid and no error was reported by md. > > [937567.332751] ata3.00: exception Em

Re: nonzero mismatch_cnt with no earlier error

2007-02-26 Thread Eyal Lebedinsky
I CC'ed linux-ide to see if they think the reported error was really innocent: Question: does this error report suggest that a disk could be corrupted? This SATA disk is part of an md raid and no error was reported by md. [937567.332751] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4190002 acti

Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Jeff Breidenbach
Ok, so hearing all the excitement I ran a check on a multi-disk RAID-1. One of the RAID-1 disks failed out, maybe by coincidence but presumably due to the check. (I also have another disk in the array deliberately removed as a backup mechanism.) And of course there is a big mismatch count. Questi

Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Neil Brown
On Saturday February 24, [EMAIL PROTECTED] wrote: > But is this not a good opportunity to repair the bad stripe for a very > low cost (no complete resync required)? In this case, 'md' knew nothing about an error. The SCSI layer detected something and thought it had fixed it itself. Nothing for m

Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Bill Davidsen
Justin Piszcz wrote: On Sat, 24 Feb 2007, Michael Tokarev wrote: Jason Rainforest wrote: I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200 +). I then ordered a resync. The mismatch_cnt returned to 0 at

Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Justin Piszcz
On Sun, 25 Feb 2007, Christian Pernegger wrote: Sorry to hijack the thread a little but I just noticed that the mismatch_cnt for my mirror is at 256. I'd always thought the monthly check done by the mdadm Debian package does repair as well - apparently it doesn't. So I guess I should run rep

Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Christian Pernegger
Sorry to hijack the thread a little but I just noticed that the mismatch_cnt for my mirror is at 256. I'd always thought the monthly check done by the mdadm Debian package does repair as well - apparently it doesn't. So I guess I should run repair but I'm wondering ... - is it safe / bugfree con

Re: nonzero mismatch_cnt with no earlier error

2007-02-25 Thread Frank van Maarseveen
On Sat, Feb 24, 2007 at 11:23:55AM +1100, Eyal Lebedinsky wrote: [...] > > fsck (ext3 with logging) found no errors but I may have bad data > somewhere. I've written a program for fast MD5/SHA256 summing which may be useful for tracking this kind of silent corruption. See http://www.fr

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
On Sat, 24 Feb 2007, Michael Tokarev wrote: Jason Rainforest wrote: I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200 +). I then ordered a resync. The mismatch_cnt returned to 0 at the start of As poin

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Michael Tokarev
Jason Rainforest wrote: > I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, > multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200 > +). > > I then ordered a resync. The mismatch_cnt returned to 0 at the start of As pointed out later it was repair, not resync.

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
Ahh, perhaps Neil can fix that? ;) cat /sys/block/md0/md/sync_action will tell you what it is really doing. On Sat, 24 Feb 2007, Jason Rainforest wrote: Yes, I meant repair, sorry. I checked my bash history and I did indeed order a repair (echo repair >/sys/block/md0/md/sync_action). I think

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Jason Rainforest
Yes, I meant repair, sorry. I checked my bash history and I did indeed order a repair (echo repair >/sys/block/md0/md/sync_action). I think I called it a resync because that's what /proc/mdstat told me it was doing. On Sat, 2007-02-24 at 04:50 -0500, Justin Piszcz wrote: > A resync? You're suppos

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
A resync? You're supposed to run a 'repair' are you not? Justin. On Sat, 24 Feb 2007, Jason Rainforest wrote: I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200 +). I then ordered a resync. The mismatch_c

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Jason Rainforest
I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200 +). I then ordered a resync. The mismatch_cnt returned to 0 at the start of the resync, but around the same time that it went up to 8 with the check, it went u

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
Of course you could just run repair but then you would never know that mismatch_cnt was > 0. Justin. On Sat, 24 Feb 2007, Justin Piszcz wrote: Perhaps. The way it works (I believe) is as follows: 1. echo check > sync_action 2. If mismatch_cnt > 0 then run: 3. echo repair > sync_action 4. Re-

Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Justin Piszcz
Perhaps. The way it works (I believe) is as follows: 1. echo check > sync_action 2. If mismatch_cnt > 0 then run: 3. echo repair > sync_action 4. Re-run #1 5. Check to make sure it is back to 0. Justin. On Sat, 24 Feb 2007, Eyal Lebedinsky wrote: I did a resync since, which ended up with the
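The five steps in this post can be sketched as a small script. This is only an illustration, not anything posted in the thread: the function names, the polling interval and the `wait` hook are made up here, and it assumes the array's sysfs directory (for example /sys/block/md0/md) exposes `sync_action` and `mismatch_cnt` as described above. Writing to `sync_action` on a real array needs root.

```python
import time
from pathlib import Path


def read_mismatch_cnt(md_dir):
    """Return the integer currently stored in <md_dir>/mismatch_cnt."""
    return int((Path(md_dir) / "mismatch_cnt").read_text().strip())


def request_action(md_dir, action):
    """Write 'check' or 'repair' to <md_dir>/sync_action."""
    if action not in ("check", "repair"):
        raise ValueError(f"unsupported action: {action!r}")
    (Path(md_dir) / "sync_action").write_text(action + "\n")


def wait_until_idle(md_dir, poll_seconds=30.0):
    """Poll sync_action until it reads 'idle' again (hypothetical helper)."""
    sync_action = Path(md_dir) / "sync_action"
    while sync_action.read_text().strip() != "idle":
        time.sleep(poll_seconds)


def check_then_repair(md_dir, wait=wait_until_idle):
    """Follow steps 1-5; return (count after first check, count after re-check)."""
    request_action(md_dir, "check")           # step 1
    wait(md_dir)
    found = read_mismatch_cnt(md_dir)
    if found == 0:                            # step 2: nothing to repair
        return found, found
    request_action(md_dir, "repair")          # step 3
    wait(md_dir)
    request_action(md_dir, "check")           # step 4: re-run the check
    wait(md_dir)
    return found, read_mismatch_cnt(md_dir)   # step 5: caller confirms this is 0
```

Calling `check_then_repair("/sys/block/md0/md")` would return a pair such as `(8, 0)` if the repair cleared the mismatches. Keeping the first count in the return value preserves the information that a plain repair would hide, namely that mismatch_cnt was nonzero to begin with.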

Re: nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Eyal Lebedinsky
I did a resync since, which ended up with the same mismatch_cnt of 184. I noticed that the count *was* reset to zero when the resync started, but ended up with 184 (same as after the check). I thought that the resync just calculates fresh parity and does not bother checking if it is different. So

Re: nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Eyal Lebedinsky
But is this not a good opportunity to repair the bad stripe for a very low cost (no complete resync required)? At time of error we actually know which disk failed and can re-write it, something we do not know at resync time, so I assume we always write to the parity disk. Justin Piszcz wrote: > S

Re: nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Justin Piszcz
Should the raid have noticed the error, checked the offending stripe and taken appropriate action? The messages from that error are below. I don't think so, that is why we need to run check every once in a while and check the mismatch_cnt file for each md raid device. Run repair then re-run c
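Reading the mismatch_cnt file for each md device, as suggested here, could look roughly like the sketch below. It assumes the usual sysfs layout /sys/block/mdN/md/mismatch_cnt; the function names and the `sys_block` parameter are invented for illustration (the parameter only exists so the code can be pointed at a test directory).

```python
from pathlib import Path


def mismatch_counts(sys_block="/sys/block"):
    """Map each md device name (e.g. 'md0') to its current mismatch_cnt."""
    counts = {}
    for cnt_file in sorted(Path(sys_block).glob("md*/md/mismatch_cnt")):
        device = cnt_file.parent.parent.name  # .../md0/md/mismatch_cnt -> 'md0'
        counts[device] = int(cnt_file.read_text().strip())
    return counts


def arrays_needing_attention(sys_block="/sys/block"):
    """Return the md devices whose last check or repair left a nonzero count."""
    return [dev for dev, cnt in mismatch_counts(sys_block).items() if cnt > 0]


if __name__ == "__main__":
    for dev, cnt in mismatch_counts().items():
        print(f"{dev}: mismatch_cnt={cnt}")
```

Run from cron after the periodic check has finished, this only reports the counts. Whether to follow up with a repair is left to the admin.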

nonzero mismatch_cnt with no earlier error

2007-02-23 Thread Eyal Lebedinsky
I run a 'check' weekly, and yesterday it came up with a non-zero mismatch count (184). There were no earlier RAID errors logged and the count was zero after the run a week ago. Now, the interesting part is that there was one i/o error logged during the check *last week*, however the raid did not s