Spontaneous rebuild

2007-12-01 Thread Oliver Martin
[Please CC me on replies as I'm not subscribed]

Hello!

I've been experimenting with software RAID a bit lately, using two
external 500GB drives. One is connected via USB, one via Firewire. It is
set up as a RAID5 with LVM on top so that I can easily add more drives
when I run out of space.
About a day after the initial setup, things went belly up. First, EXT3
reported strange errors:
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system
zone - blocks from 106561536, length 1
EXT3-fs error (device dm-0): ext3_new_block: Allocating block in system
zone - blocks from 106561537, length 1
...

There were literally hundreds of these, and they came back immediately
when I reformatted the array. So I tried ReiserFS, which worked fine for
about a day. Then I got errors like these:
ReiserFS: warning: is_tree_node: node level 0 does not match to the
expected one 2
ReiserFS: dm-0: warning: vs-5150: search_by_key: invalid format found in
block 69839092. Fsck?
ReiserFS: dm-0: warning: vs-13070: reiserfs_read_locked_inode: i/o
failure occurred trying to find stat data of [6 10 0x0 SD]

Again, hundreds. So I ran badblocks on the LVM volume, and it reported
some bad blocks near the end. Running badblocks on the md array came back
clean, so I recreated the LVM setup and attributed the failures to the
undervolting experiments I had been doing (this is my old laptop running
as a server).
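
(For reference, the two scans described above written out; the LV path is
hypothetical, only /dev/md0 is from this setup:

badblocks -sv /dev/md0                # read-only scan of the raw md array
badblocks -svn /dev/mapper/vg0-lv0    # non-destructive read-write scan of the LV

The -n mode restores each block's original contents after testing it, so it
is safe on an unmounted filesystem.)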

Anyway, the problems are back. To test my theory that everything is
fine with the CPU running within its specs, I removed one of the drives
yesterday while copying some large files. Initially, everything seemed to
work out nicely, and by the morning the rebuild had finished.
Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
ran without gripes for some hours, but just now I saw md had started to
rebuild the array again out of the blue:

Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
using ehci_hcd and address 4
Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
bandwidth (but not more than 20 KB/sec) for data-check.
Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
488383936 blocks.
Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
using ehci_hcd and address 4

I'm not sure the USB resets are related to the problem - device 4-5.2 is
part of the array, but I get these sometimes at random intervals and
they don't seem to hurt normally. Besides, the first one was long before
the rebuild started, and the second one long afterwards.

Any ideas why md is rebuilding the array? And could this be related to
the bad blocks problem I had first? badblocks is still running; I'll
post an update when it finishes.
In the meantime, mdadm --detail /dev/md0 and mdadm --examine
/dev/sd[bc]1 don't give me any clues as to what went wrong: both disks
are marked as "active sync", and the whole array is "active, recovering".
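
(For what it's worth, md distinguishes a periodic consistency check from a
real rebuild, and that is visible in sysfs; a quick way to tell the two apart
using the standard md interfaces:

cat /proc/mdstat
cat /sys/block/md0/md/sync_action     # "check" = scheduled scrub, "resync"/"recover" = rebuild
cat /sys/block/md0/md/mismatch_cnt    # mismatches found by the last check

Since the log above says "data-check" rather than "resync", this may simply
be a scheduled scrub: some distributions ship a cron job for it, e.g.
Debian's mdadm package installs /etc/cron.d/mdadm, which runs checkarray
early on the first Sunday of each month.)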

Before I forget, I'm running 2.6.23.1 with this config:
http://stud4.tuwien.ac.at/~e0626486/config-2.6.23.1-hrt3-fw

Thanks,
Oliver


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Robert Hancock

Justin Piszcz wrote:
I am putting a new machine together and I have dual raptor raid 1 for 
the root, which works just fine under all stress tests.


Then I have the WD 750 GB drives (not RE2, the desktop ones on sale for
~150-160 nowadays):


I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (I had gone to sleep and let it run), this 
occurred:


[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 
action 0x2 frozen

[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 
cdb 0x0 data 512 in
[42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 
0x10 (ATA bus error)

[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status 
0x80)

[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
(750156 MB)

[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA


Usually when I see this sort of thing with another box I have full of 
raptors, it was due to a bad raptor, and I never saw it again after I 
replaced the disk it happened on, but that was using the Intel P965 
chipset.


For this board, a Gigabyte GSP-P35-DS4 (Rev 2.0), I have all of the 
drives (2 raptors, 3 750s) connected to the Intel ICH9 southbridge.


I am going to do some further testing but does this indicate a bad 
drive? Bad cable?  Bad connector?


Could be any of the above.



As you can see above, /dev/sdc stopped responding for a little bit and 
then the kernel reset the port.


It looks like the first thing that happened is that the controller 
reported it lost the SATA link, and then the drive didn't respond until 
it was bashed with a few hard resets.




Why is this though?  What is the likely root cause?  Should I replace 
the drive?  Obviously this is not normal and cannot be good at all; the 
idea is to put these drives in a RAID5, and if one is going to time out, 
that is going to cause the array to go degraded and thus be worthless in 
a RAID5 configuration.


Can anyone offer any insight here?

Thank you,

Justin.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Bill Davidsen

Jan Engelhardt wrote:

On Dec 1 2007 06:26, Justin Piszcz wrote:

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)


Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)

Do you not test your drives for minimum functionality before using them? 
Also, if you have the tools to check for reallocated sectors before and 
after doing this, that's a good idea as well. S.M.A.R.T. is your friend. 
And when writing /dev/zero to a drive, if it craps out you have less 
emotional attachment to the data.
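
(A minimal sketch of what checking before and after might look like with
smartmontools; the device name is just an example:

smartctl -A /dev/sdc | egrep -i 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'

A rising Reallocated_Sector_Ct or a non-zero Current_Pending_Sector after
the full-disk write is exactly the kind of thing worth catching before the
drive goes into an array.)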


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Janek Kozicki wrote:


Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 (EST))


dd if=/dev/zero of=/dev/sdc


The purpose is that with any new disk it's good to write to all the blocks
and let the drive do all of the re-mapping before you put 'real' data on it.
Let it crap out or fail before I put my data on it.


Better to use badblocks. It writes data, then reads it back afterwards.
In this example the data is semi-random (quicker than /dev/urandom ;)

badblocks -c 10240 -s -w -t random -v /dev/sdc

--
Janek Kozicki |



Will give this a shot and see if I can reproduce the error, thanks.


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Janek Kozicki
Justin Piszcz said: (by the date of Sat, 1 Dec 2007 07:23:41 -0500 (EST))

> >> dd if=/dev/zero of=/dev/sdc
>
> The purpose is that with any new disk it's good to write to all the blocks and 
> let the drive do all of the re-mapping before you put 'real' data on it. 
> Let it crap out or fail before I put my data on it.

Better to use badblocks. It writes data, then reads it back afterwards.
In this example the data is semi-random (quicker than /dev/urandom ;)

badblocks -c 10240 -s -w -t random -v /dev/sdc

-- 
Janek Kozicki |


Re: raid5 reshape/resync

2007-12-01 Thread Nagilum

- Message from [EMAIL PROTECTED] -
Date: Thu, 29 Nov 2007 16:48:47 +1100
From: Neil Brown <[EMAIL PROTECTED]>
Reply-To: Neil Brown <[EMAIL PROTECTED]>
 Subject: Re: raid5 reshape/resync
  To: Nagilum <[EMAIL PROTECTED]>
  Cc: linux-raid@vger.kernel.org


> Hi,
> I'm running 2.6.23.8 x86_64 using mdadm v2.6.4.
> I was adding a disk (/dev/sdf) to an existing raid5 (/dev/sd[a-e] -> md0)
> During that reshape (at around 4%) /dev/sdd reported read errors and
> went offline.


Sad.


> I replaced /dev/sdd with a new drive and tried to reassemble the array
> (/dev/sdd was shown as removed and now as spare).


There must be a step missing here.
Just because one drive goes offline, that doesn't mean that you need
to reassemble the array.  It should just continue with the reshape
until that is finished.  Did you shut the machine down, or did it crash,
or what?

> Assembly worked but it would not run unless I use --force.


That suggests an unclean shutdown.  Maybe it did crash?


I started the reshape and went out. When I came back, the controller
was beeping (indicating the failing disk). I tried to log on but could
not get in. The machine was responding to pings but that was about it
(no ssh or xdm login worked). So I hard rebooted.
I booted into a rescue root; /etc/mdadm/mdadm.conf didn't yet include
the new disk, so the raid was missing one disk and not started.
Since I didn't know exactly what was going on, I --re-added sdf (the
new disk) and tried to resume reshaping. A second into that, the read
failure on /dev/sdd was reported. So I stopped md0 and shut down to
verify the read error with another controller.
After I had verified that, I replaced /dev/sdd with a new drive and put
the broken drive back in as /dev/sdg, just in case.



> Since I'm always reluctant to use force, I put the bad disk back in,
> this time as /dev/sdg. I re-added the drive and could run the array.
> The array started to resync (since the disk can be read up to 4%) and
> then I marked the disk as failed. Now the array is "active, degraded,
> recovering":


It should have restarted the reshape from wherever it was up to, so
it should have hit the read error almost immediately.  Do you remember
where it started the reshape from?  If it restarted from the beginning
that would be bad.


It must have continued where it left off since the reshape position in  
all superblocks was at about 4%.
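
(For anyone following along, the reshape position he mentions can be read
out of each member's superblock; the device names are the ones from this
thread:

for d in /dev/sd[a-g]; do mdadm --examine $d | egrep -i 'reshape|events'; done

mdadm --examine should print a reshape position line for an array stopped
mid-reshape, and matching event counts show which members are still in sync.)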



Did you just "--assemble" all the drives or did you do something else?


Sorry for being a bit imprecise here: I didn't actually have to use
--assemble. When booting into the rescue root, the raid came up with
/dev/sdd and /dev/sdf removed; I just had to --re-add /dev/sdf



> unusually low which seems to indicate a lot of seeking as if two
> operations are happening at the same time.


Well reshape is always slow as it has to read from one part of the
drive and write to another part of the drive.


Actually it was resyncing at the minimum speed; I managed to crank the
speed up to >20 MB/s by adjusting /sys/block/md0/md/sync_speed_min
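
(For reference, that knob takes a value in KB/s, so raising the floor to
roughly 20 MB/s would look like:

echo 20000 > /sys/block/md0/md/sync_speed_min

There is also a system-wide default in /proc/sys/dev/raid/speed_limit_min;
the per-array sysfs file overrides it.)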



> Can someone relieve my doubts as to whether md does the right thing here?
> Thanks,


I believe it is doing "the right thing".


>
- End message from [EMAIL PROTECTED] -

Ok, so the reshape tried to continue without the failed drive and
after that resynced to the new spare.


As I would expect.


Unfortunately the result is a mess. On top of the RAID5 I have


Hmm.  This I would not expect.


dm-crypt and LVM.
Although dm-crypt and LVM don't appear to have a problem, the filesystems
on top are a mess now.


Can you be more specific about what sort of "mess" they are in?


Sure.
So here is the vg-layout:
nas:~# lvdisplay vg01
  --- Logical volume ---
  LV Name                /dev/vg01/lv1
  VG Name                vg01
  LV UUID                4HmzU2-VQpO-vy5R-Wdys-PmwH-AuUg-W02CKS
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                512.00 MB
  Current LE             128
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:1

  --- Logical volume ---
  LV Name                /dev/vg01/lv2
  VG Name                vg01
  LV UUID                4e2ZB9-29Rb-dy4M-EzEY-cEIG-Nm1I-CPI0kk
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                7.81 GB
  Current LE             2000
  Segments               1
  Allocation             inherit
  Read ahead sectors     0
  Block device           253:2

  --- Logical volume ---
  LV Name                /dev/vg01/lv3
  VG Name                vg01
  LV UUID                YQRd0X-5hF8-2dd3-GG4v-wQLH-WGH0-ntGgug
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                1.81 TB
  Current LE             474735
  Segments               1
  Allocation             inherit
  Rea

Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:



On Dec 1 2007 06:26, Justin Piszcz wrote:

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)


Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)



The purpose is that with any new disk it's good to write to all the blocks
and let the drive do all of the re-mapping before you put 'real' data on it.
Let it crap out or fail before I put my data on it.


Justin.


Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:



On Dec 1 2007 07:12, Justin Piszcz wrote:

On Sat, 1 Dec 2007, Jan Engelhardt wrote:

On Dec 1 2007 06:19, Justin Piszcz wrote:


RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
you use 1.x superblocks with LILO you can't boot)


Says who? (Don't use LILO ;-)


I like LILO :)


LILO cares much less about disk layout / filesystems than GRUB does,
so I would have expected LILO to cope with all sorts of superblocks.
OTOH I would suspect GRUB to only handle 0.90 and 1.0, where the MDSB
is at the end of the disk <=> the filesystem SB is at the very beginning.


So two questions:

1) If it rebuilt by itself, how come it only rebuilt /dev/md0?


So md1/md2 was NOT rebuilt?


Correct.


Well, they should be, once they are re-added using -a.
If they still aren't, then perhaps another resync is in progress.



There was nothing in progress; md0 was synced up and md1/md2 were degraded.


Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Jan Engelhardt

On Dec 1 2007 07:12, Justin Piszcz wrote:
> On Sat, 1 Dec 2007, Jan Engelhardt wrote:
>> On Dec 1 2007 06:19, Justin Piszcz wrote:
>>
>> > RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
>> > you use 1.x superblocks with LILO you can't boot)
>>
>> Says who? (Don't use LILO ;-)
>
> I like LILO :)

LILO cares much less about disk layout / filesystems than GRUB does,
so I would have expected LILO to cope with all sorts of superblocks.
OTOH I would suspect GRUB to only handle 0.90 and 1.0, where the MDSB
is at the end of the disk <=> the filesystem SB is at the very beginning.
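
(To make that concrete, the superblock-at-the-end formats are chosen at
creation time, so an array meant to hold /boot for such a boot loader would
be created roughly like this; the device names are just the ones from this
thread:

mdadm --create /dev/md1 --level=1 --raid-devices=2 --metadata=0.90 /dev/sda2 /dev/sdb2

With 0.90 (or 1.0) metadata the filesystem still starts at the beginning of
each member partition, so a boot loader that knows nothing about md can read
it like a plain partition.)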

>> > So two questions:
>> >
>> > 1) If it rebuilt by itself, how come it only rebuilt /dev/md0?
>>
>> So md1/md2 was NOT rebuilt?
>
> Correct.

Well, they should be, once they are re-added using -a.
If they still aren't, then perhaps another resync is in progress.


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Jan Engelhardt

On Dec 1 2007 06:26, Justin Piszcz wrote:
> I ran the following:
>
> dd if=/dev/zero of=/dev/sdc
> dd if=/dev/zero of=/dev/sdd
> dd if=/dev/zero of=/dev/sde
>
> (as it is always a very good idea to do this with any new disk)

Why would you care about what's on the disk? fdisk, mkfs and
the day-to-day operation will overwrite it _anyway_.

(If you think the disk is not empty, you should look at it
and copy off all usable warez beforehand :-)



Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Justin Piszcz



On Sat, 1 Dec 2007, Jan Engelhardt wrote:



On Dec 1 2007 06:19, Justin Piszcz wrote:


RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
you use 1.x superblocks with LILO you can't boot)


Says who? (Don't use LILO ;-)

I like LILO :)




, and then:

/dev/sda1+sdb1 <-> /dev/md0 <-> swap
/dev/sda2+sdb2 <-> /dev/md1 <-> /boot (ext3)
/dev/sda3+sdb3 <-> /dev/md2 <-> / (xfs)

All works fine, no issues...

Quick question though: I turned off the machine, disconnected /dev/sda
from the machine, booted from /dev/sdb, no problems, it showed up as a
degraded RAID1.  Turn the machine off.  Re-attach the first drive.  When
I booted, my first partition either re-synced by itself or was never
degraded; why is this?


If md0 was not touched (written to) after you disconnected sda, it also
should not be in a degraded state.


So two questions:

1) If it rebuilt by itself, how come it only rebuilt /dev/md0?


So md1/md2 was NOT rebuilt?

Correct.




2) If it did not rebuild, is it because the kernel knows it does not
   need to re-calculate parity etc for swap?


The kernel does not usually know what's inside an md device, and it
should not try to be smart.

Ok.




I had to:

mdadm /dev/md1 -a /dev/sda2
and
mdadm /dev/md2 -a /dev/sda3

To rebuild /boot and /, which worked fine. I am just curious why it
works like this; I figured it would be all or nothing.


Devices are not automatically re-added. Who knows, maybe you inserted a
different disk as sda which you don't want to be overwritten.

Makes sense, I just wanted to confirm that it was normal..




More info:

Not using ANY initramfs/initrd images, everything is compiled into 1
kernel image (makes things MUCH simpler and the expected device layout
etc is always the same, unlike initrd/etc).


My expected device layout is also always the same, _with_ initrd. Why?
Simply because mdadm.conf is copied to the initrd, and mdadm will
use your defined order.


That is another way as well; people seem to be divided.



Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Jan Engelhardt

On Dec 1 2007 06:19, Justin Piszcz wrote:

> RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
> you use 1.x superblocks with LILO you can't boot)

Says who? (Don't use LILO ;-)

>, and then:
>
> /dev/sda1+sdb1 <-> /dev/md0 <-> swap
> /dev/sda2+sdb2 <-> /dev/md1 <-> /boot (ext3)
> /dev/sda3+sdb3 <-> /dev/md2 <-> / (xfs)
>
> All works fine, no issues...
>
> Quick question though: I turned off the machine, disconnected /dev/sda 
> from the machine, booted from /dev/sdb, no problems, it showed up as a 
> degraded RAID1.  Turn the machine off.  Re-attach the first drive.  When 
> I booted, my first partition either re-synced by itself or was never 
> degraded; why is this?

If md0 was not touched (written to) after you disconnected sda, it also 
should not be in a degraded state.
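
(One way to see whether the array really needs a rebuild after plugging sda
back in is to compare the event counters in the member superblocks; the
partition names are the ones from this thread:

mdadm --examine /dev/sda1 | grep -i events
mdadm --examine /dev/sdb1 | grep -i events

If the counts still match, the members never diverged and md can simply
accept the returning disk; if they differ, the stale member needs a resync.)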

> So two questions:
>
> 1) If it rebuilt by itself, how come it only rebuilt /dev/md0?

So md1/md2 was NOT rebuilt?

> 2) If it did not rebuild, is it because the kernel knows it does not 
>need to re-calculate parity etc for swap?

The kernel does not usually know what's inside an md device, and it 
should not try to be smart.

> I had to:
>
> mdadm /dev/md1 -a /dev/sda2
> and
> mdadm /dev/md2 -a /dev/sda3
>
> To rebuild /boot and /, which worked fine. I am just curious why it 
> works like this; I figured it would be all or nothing.

Devices are not automatically re-added. Who knows, maybe you inserted a 
different disk as sda which you don't want to be overwritten.

> More info:
>
> Not using ANY initramfs/initrd images, everything is compiled into 1 
> kernel image (makes things MUCH simpler and the expected device layout 
> etc is always the same, unlike initrd/etc).
>
My expected device layout is also always the same, _with_ initrd. Why? 
Simply because mdadm.conf is copied to the initrd, and mdadm will 
use your defined order.


Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-01 Thread Justin Piszcz
I am putting a new machine together and I have dual raptor raid 1 for the 
root, which works just fine under all stress tests.


Then I have the WD 750 GB drives (not RE2, the desktop ones on sale for 
~150-160 nowadays):


I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (I had gone to sleep and let it run), this 
occurred:


[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 
action 0x2 frozen

[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 
0x0 data 512 in
[42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 
(ATA bus error)

[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status 
0x80)

[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
(750156 MB)

[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
enabled, doesn't support DPO or FUA


Usually when I see this sort of thing with another box I have full of 
raptors, it was due to a bad raptor, and I never saw it again after I 
replaced the disk it happened on, but that was using the Intel P965 
chipset.


For this board, a Gigabyte GSP-P35-DS4 (Rev 2.0), I have all of the 
drives (2 raptors, 3 750s) connected to the Intel ICH9 southbridge.


I am going to do some further testing but does this indicate a bad drive? 
Bad cable?  Bad connector?


As you can see above, /dev/sdc stopped responding for a little bit and 
then the kernel reset the port.


Why is this though?  What is the likely root cause?  Should I replace the 
drive?  Obviously this is not normal and cannot be good at all; the idea 
is to put these drives in a RAID5, and if one is going to time out, that is 
going to cause the array to go degraded and thus be worthless in a RAID5 
configuration.
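
(One hedged aside on the timeout worry: desktop drives can spend longer on
internal error recovery than the kernel's default 30-second command timeout,
so some people raise that per-device limit when such drives go into an
array; sdc here is just the device from the log above:

cat /sys/block/sdc/device/timeout      # default is 30 (seconds)
echo 60 > /sys/block/sdc/device/timeout

That does not fix a flaky link or a dying drive, but it makes it less likely
that a long internal recovery alone kicks a member out of the RAID5.)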


Can anyone offer any insight here?

Thank you,

Justin.


Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-01 Thread Justin Piszcz

Quick question,

Set up a new machine last night with two raptor 150 disks.  Set up RAID1 as 
I do everywhere else, 0.90.03 superblocks (in order to be compatible with 
LILO, if you use 1.x superblocks with LILO you can't boot), and then:


/dev/sda1+sdb1 <-> /dev/md0 <-> swap
/dev/sda2+sdb2 <-> /dev/md1 <-> /boot (ext3)
/dev/sda3+sdb3 <-> /dev/md2 <-> / (xfs)

All works fine, no issues...

Quick question though: I turned off the machine, disconnected /dev/sda 
from the machine, booted from /dev/sdb, no problems, it showed up as a 
degraded RAID1.  Turn the machine off.  Re-attach the first drive.  When I 
booted, my first partition either re-synced by itself or was never 
degraded; why is this?


So two questions:

1) If it rebuilt by itself, how come it only rebuilt /dev/md0?
2) If it did not rebuild, is it because the kernel knows it does not need 
to re-calculate parity etc for swap?


I had to:

mdadm /dev/md1 -a /dev/sda2
and
mdadm /dev/md2 -a /dev/sda3

To rebuild /boot and /, which worked fine. I am just curious why it works 
like this; I figured it would be all or nothing.


More info:

Not using ANY initramfs/initrd images, everything is compiled into 1 
kernel image (makes things MUCH simpler and the expected device layout etc 
is always the same, unlike initrd/etc).


Justin.