Re[2]: mdadm 2.6.4 : How can I check the current status of reshaping?

2008-02-05 Thread Andreas-Sokov
Hello, Neil.

You wrote on 5 February 2008 at 01:48:33:
 On Monday February 4, [EMAIL PROTECTED] wrote:
 
 [EMAIL PROTECTED]:/# cat /proc/mdstat
 Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
 [multipath] [faulty]
 md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
   1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
 
 unused devices: <none>
 
 ##
 But how can I see the status of the reshape?
 Is it really reshaping? Or has it hung? Or is mdadm simply not doing
 anything at all?
 How long should I wait for the reshape to finish?
 ##
 

 The reshape hasn't restarted.

 Did you do that mdadm -w /dev/md1 like I suggested?  If so, what
 happened?

 Possibly you tried mounting the filesystem before trying the mdadm
 -w.  There seems to be a bug such that doing this would cause the
 reshape not to restart, and mdadm -w would not help any more.

 I suggest you:

   echo 0 > /sys/module/md_mod/parameters/start_ro

 stop the array 
   mdadm -S /dev/md1
 (after unmounting if necessary).

 Then assemble the array again.
 Then
   mdadm -w /dev/md1

 just to be sure.
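
 To watch the progress once the array is writable again, something like
 this should do (the --detail output should include a reshape status line
 while a reshape is running):

   watch -n 60 'cat /proc/mdstat'
   mdadm --detail /dev/md1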

 If this doesn't work, please report exactly what you did, exactly what
 message you got, and exactly where the message appeared in the kernel log.

 NeilBrown

I read your letter again.
The first time, I did not do

echo 0 > /sys/module/md_mod/parameters/start_ro

Now I have done this, and then ran:
mdadm -S /dev/md1
mdadm /dev/md1 -A /dev/sd[bcdef]
mdadm -w /dev/md1

And after about 2 minutes the kernel printed something (see below),
but the reshape still appears to be in progress:

[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
[multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
  1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
  [==>..................]  reshape = 10.1% (49591552/488386496) finish=12127.2min speed=602K/sec

unused devices: <none>
[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
[multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
  1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
  [==>..................]  reshape = 10.1% (49591552/488386496) finish=12259.0min speed=596K/sec

unused devices: <none>
[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
[multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
  1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
  [==>..................]  reshape = 10.1% (49591552/488386496) finish=12311.7min speed=593K/sec

unused devices: <none>
[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
[multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
  1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
  [==>..................]  reshape = 10.1% (49591552/488386496) finish=12338.1min speed=592K/sec

unused devices: <none>




Feb  5 11:54:21 raid01 kernel: raid5: reshape will continue
Feb  5 11:54:21 raid01 kernel: raid5: device sdc operational as raid disk 0
Feb  5 11:54:21 raid01 kernel: raid5: device sdf operational as raid disk 3
Feb  5 11:54:21 raid01 kernel: raid5: device sde operational as raid disk 2
Feb  5 11:54:21 raid01 kernel: raid5: device sdd operational as raid disk 1
Feb  5 11:54:21 raid01 kernel: raid5: allocated 5245kB for md1
Feb  5 11:54:21 raid01 kernel: raid5: raid level 5 set md1 active with 4 out of 
5 devices, algorithm 2
Feb  5 11:54:21 raid01 kernel: RAID5 conf printout:
Feb  5 11:54:21 raid01 kernel:  --- rd:5 wd:4
Feb  5 11:54:21 raid01 kernel:  disk 0, o:1, dev:sdc
Feb  5 11:54:21 raid01 kernel:  disk 1, o:1, dev:sdd
Feb  5 11:54:21 raid01 kernel:  disk 2, o:1, dev:sde
Feb  5 11:54:21 raid01 kernel:  disk 3, o:1, dev:sdf
Feb  5 11:54:21 raid01 kernel: ...ok start reshape thread
Feb  5 11:54:21 raid01 mdadm: RebuildStarted event detected on md device 
/dev/md1
Feb  5 11:54:21 raid01 kernel: md: reshape of RAID array md1
Feb  5 11:54:21 raid01 kernel: md: minimum _guaranteed_  speed: 1000 
KB/sec/disk.
Feb  5 11:54:21 raid01 kernel: md: using maximum available idle IO bandwidth 
(but not more than 200000 KB/sec) for reshape.
Feb  5 11:54:21 raid01 kernel: md: using 128k window, over a total of 488386496 
blocks.
Feb  5 11:56:12 raid01 kernel: BUG: unable to handle kernel paging request at 
virtual address 001cd901
Feb  5 11:56:12 raid01 kernel:  printing eip:

Re: Re[2]: mdadm 2.6.4 : How can I check the current status of reshaping?

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
 Feb  5 11:56:12 raid01 kernel: BUG: unable to handle kernel paging request at 
 virtual address 001cd901

This looks like some sort of memory corruption.

 Feb  5 11:56:12 raid01 kernel: EIP is at md_do_sync+0x629/0xa32

This tells us what code is executing.

 Feb  5 11:56:12 raid01 kernel: Code: 54 24 48 0f 87 a4 01 00 00 72 0a 3b 44 
 24 44 0f 87 98 01 00 00 3b 7c 24 40 75 0a 3b 74 24 3c 0f 84 88 01 00 00 0b 85 
 30 01 00 00 88 08 0f 85 90 01 00 00 8b 85 30 01 00 00 a8 04 0f 85 82 01 00

This tells us what the actual bytes of code were.
If I feed this line (from Code: onwards) into ksymoops I get 

   0:   54                      push   %esp
   1:   24 48                   and    $0x48,%al
   3:   0f 87 a4 01 00 00       ja     1ad <_EIP+0x1ad>
   9:   72 0a                   jb     15 <_EIP+0x15>
   b:   3b 44 24 44             cmp    0x44(%esp),%eax
   f:   0f 87 98 01 00 00       ja     1ad <_EIP+0x1ad>
  15:   3b 7c 24 40             cmp    0x40(%esp),%edi
  19:   75 0a                   jne    25 <_EIP+0x25>
  1b:   3b 74 24 3c             cmp    0x3c(%esp),%esi
  1f:   0f 84 88 01 00 00       je     1ad <_EIP+0x1ad>
  25:   0b 85 30 01 00 00       or     0x130(%ebp),%eax
Code;   Before first symbol
  2b:   88 08                   mov    %cl,(%eax)
  2d:   0f 85 90 01 00 00       jne    1c3 <_EIP+0x1c3>
  33:   8b 85 30 01 00 00       mov    0x130(%ebp),%eax
  39:   a8 04                   test   $0x4,%al
  3b:   0f                      .byte 0xf
  3c:   85                      .byte 0x85
  3d:   82                      (bad)
  3e:   01 00                   add    %eax,(%eax)


I removed the Code;... lines as they are just noise, except for the
one that points to the current instruction in the middle.
Note that it is dereferencing %eax, after just 'or'ing some value into
it, which is rather unusual.

Now get the md-mod.ko for the kernel you are running.
run
   gdb md-mod.ko

and give the command

   disassemble md_do_sync

and look for code at offset 0x629, which is 1577 in decimal.

I found a similar kernel to what you are running, and the matching code
is 

0x55c0 <md_do_sync+1485>:   cmp    0x30(%esp),%eax
0x55c4 <md_do_sync+1489>:   ja     0x5749 <md_do_sync+1878>
0x55ca <md_do_sync+1495>:   cmp    0x2c(%esp),%edi
0x55ce <md_do_sync+1499>:   jne    0x55da <md_do_sync+1511>
0x55d0 <md_do_sync+1501>:   cmp    0x28(%esp),%esi
0x55d4 <md_do_sync+1505>:   je     0x5749 <md_do_sync+1878>
0x55da <md_do_sync+1511>:   mov    0x130(%ebp),%eax
0x55e0 <md_do_sync+1517>:   test   $0x8,%al
0x55e2 <md_do_sync+1519>:   jne    0x575f <md_do_sync+1900>
0x55e8 <md_do_sync+1525>:   mov    0x130(%ebp),%eax
0x55ee <md_do_sync+1531>:   test   $0x4,%al
0x55f0 <md_do_sync+1533>:   jne    0x575f <md_do_sync+1900>
0x55f6 <md_do_sync+1539>:   mov    0x38(%esp),%ecx
0x55fa <md_do_sync+1543>:   mov    0x0,%eax
-

Note the sequence cmp, ja, cmp, jne, cmp, je,
where the cmp arguments are consecutive 4-byte values on the stack
(%esp).
In the code from your oops, the offsets are 0x44 0x40 0x3c.
In the kernel I found they are 0x30 0x2c 0x28.  The difference comes from
some subtle variation in the kernel build, possibly a different compiler
or something.

Anyway, your code crashed at 


  25:   0b 85 30 01 00 00       or     0x130(%ebp),%eax
Code;   Before first symbol
  2b:   88 08                   mov    %cl,(%eax)

The matching code in the kernel I found is 

0x55da <md_do_sync+1511>:   mov    0x130(%ebp),%eax
0x55e0 <md_do_sync+1517>:   test   $0x8,%al

Note that you have an 'or', the kernel I found has 'mov'.

If we look at the actual bytes of code for those two instructions,
the code that crashed shows the bytes above:

0b 85 30 01 00 00
88 08

if I get the same bytes with gdb:

(gdb) x/8b 0x55da
0x55da <md_do_sync+1511>:   0x8b  0x85  0x30  0x01  0x00  0x00  0xa8  0x08
(gdb)

So what should be 8b has become 0b, and what should be a8 has
become 08.

If you look for the same data in your md-mod.ko, you might find
slightly different details but it is clear to me that the code in
memory is bad.

Possibly you have bad memory, or a bad CPU, or you are overclocking
the CPU, or it is getting hot, or something.


But you clearly have a hardware error.

NeilBrown


Deleting mdadm RAID arrays

2008-02-05 Thread Marcin Krol
Hello everyone,

I have had a problem with a RAID array (udev messed up the disk names; I had
RAID on whole disks, without raid partitions) on a Debian Etch server with 6
disks, so I decided to rearrange this.

Deleted the disks from (2 RAID-5) arrays, deleted the md* devices from /dev,
created /dev/sd[a-f]1 Linux raid auto-detect partitions and rebooted the host.

Now the mdadm startup script is writing, in a loop, a message like "mdadm:
warning: /dev/sda1 and /dev/sdb1 have similar superblocks. If they are not
identical, --zero the superblock ..."

The host can't boot up now because of this.

If I boot the server with some disks, I can't even zero that superblock:

% mdadm --zero-superblock /dev/sdb1
mdadm: Couldn't open /dev/sdb1 for write - not zeroing

It's the same even after:

% mdadm --manage /dev/md2 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md2


Now, I have NEVER created a /dev/md2 array, yet it shows up automatically!

% cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md2 : active(auto-read-only) raid1 sdb1[1]
  390708736 blocks [3/1] [_U_]

md1 : inactive sda1[2]
  390708736 blocks

unused devices: <none>


Questions:

1. Where does this info on the array reside?! I have deleted /etc/mdadm/mdadm.conf 
and the /dev/md* devices, and yet it seemingly comes out of nowhere.

2. How can I delete that damn array so it doesn't hang my server up in a loop?


-- 
Marcin Krol



Re: Deleting mdadm RAID arrays

2008-02-05 Thread Moshe Yudkowsky


1. Where does this info on the array reside?! I have deleted /etc/mdadm/mdadm.conf 
and the /dev/md* devices, and yet it seemingly comes out of nowhere.


/boot has a copy of mdadm.conf so that / and other drives can be started 
and then mounted. update-initramfs will update /boot's copy of mdadm.conf.
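
For example (a sketch, assuming the standard Debian mdadm initramfs hook is
installed):

  # refresh the copy of /etc/mdadm/mdadm.conf embedded in every initramfs
  update-initramfs -u -k all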


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Janek Kozicki
Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)

 2. How can I delete that damn array so it doesn't hang my server up in a loop?

dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

I'm not using mdadm.conf at all. Everything is stored in the
superblock of the device. So if you don't erase it, the info about the raid
array will still be found automatically.

-- 
Janek Kozicki |


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Linda Walsh



Michael Tokarev wrote:

note that with some workloads, write caching in
the drive actually makes write speed worse, not better - namely,
in case of massive writes.


With write barriers enabled, I did a quick test of a large copy from one
backup filesystem to another.  I'm not sure what you refer to when you say
"large", but this disk has 387G used with 975 files, averaging about
406MB/file.

I was copying from /hde (ATA100, 750G) to /sdb (SATA-300, 750G)
(both basically the same underlying model).

Of course your mileage may vary; these were averages over 12 runs each
(with and without write caching):

(write cache on)       write   read
dev       ave TPS      MB/s    MB/s
hde ave    64.67       30.94    0.0
sdb ave   249.51        0.24   30.93

(write cache off)      write   read
dev       ave TPS      MB/s    MB/s
hde ave    45.63       21.81    0.0
xx: ave   177.76        0.24   21.96

write w/cache = (30.94-21.86)/21.86 = 45% faster
w/o write cache =   100-(100*21.81/30.94)   = 30% slower

These disks have barrier support, so I'd guess the differences would
have been greater if you didn't worry about losing w-cache contents.

If  barrier support doesn't work and one has to disable write-caching,
that is a noticeable performance penalty.

All writes with noatime, nodiratime, logbufs=8.


FWIW, and slightly OT: the rates under Windows for write-through (FAT32)
vs. write-back caching (NTFS) were FAT32 about 60% faster than NTFS, or
NTFS about 40% slower than FAT32 (with options set for no last-access update
and no Windows 3.1-style (8.3) short-filename creation).





Re: Deleting mdadm RAID arrays

2008-02-05 Thread Michael Tokarev
Janek Kozicki wrote:
 Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)
 
 2. How can I delete that damn array so it doesn't hang my server up in a 
 loop?
 
 dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

This works provided the superblocks are at the beginning of the
component devices.  Which is not the case by default (0.90
superblocks, at the end of components), or with 1.0 superblocks.

  mdadm --zero-superblock /dev/sdb1

is the way to go here.
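
You can then double-check with --examine (device name as in this thread);
it should report that no md superblock was found:

  mdadm --examine /dev/sdb1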

 I'm not using mdadm.conf at all. Everything is stored in the
 superblock of the device. So if you don't erase it - info about raid
 array will be still automatically found.

That's wrong, as you need at least something to identify the array
components.  UUID is the most reliable and commonly used.  You
assemble the arrays as

  mdadm --assemble /dev/md1 --uuid=123456789

or something like that anyway.  If not, your arrays may not start
properly in case you shuffled disks (e.g. replaced a bad one), or
your disks were renumbered after a kernel or other hardware change,
and so on.  The most convenient place to store that info is mdadm.conf.
Here, it looks just like:

DEVICE partitions
ARRAY /dev/md1 UUID=4ee58096:e5bc04ac:b02137be:3792981a
ARRAY /dev/md2 UUID=b4dec03f:24ec8947:1742227c:761aa4cb

By default mdadm offers additional information which helps to
diagnose possible problems, namely:

ARRAY /dev/md5 level=raid5 num-devices=4 
UUID=6dc4e503:85540e55:d935dea5:d63df51b

This new info isn't necessary for mdadm to work (but the UUID is),
yet it comes in handy sometimes.

/mjt


Re: Auto generation of mdadm.conf (was: Deleting mdadm RAID arrays)

2008-02-05 Thread Janek Kozicki
Michael Tokarev said: (by the date of Tue, 05 Feb 2008 16:52:18 +0300)

 Janek Kozicki wrote:
  I'm not using mdadm.conf at all. 
 
 That's wrong, as you need at least something to identify the array
 components. 

I was afraid of that ;-) So, is this a correct way to automatically
generate a correct mdadm.conf? I did it after some digging in the man pages:

  echo 'DEVICE partitions' > mdadm.conf
  mdadm --examine --scan --config=mdadm.conf >> ./mdadm.conf

Now, when I do 'cat mdadm.conf' I get:

 DEVICE partitions
 ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=3 
UUID=75b0f87879:539d6cee:f22092f4:7a6e6f name='backup':0
 ARRAY /dev/md/2 level=raid1 metadata=1 num-devices=3 
UUID=4fd340a6c4:db01d6f7:1e03da2d:bdd574 name=backup:2
 ARRAY /dev/md/1 level=raid5 metadata=1 num-devices=3 
UUID=22f22c3599:613d5231:d407a655:bdeb84 name=backup:1

Looks quite reasonable. Should I append it to /etc/mdadm/mdadm.conf ?
This file currently contains: (commented lines are left out)

  DEVICE partitions
  CREATE owner=root group=disk mode=0660 auto=yes
  HOMEHOST <system>
  MAILADDR root

This is the default content of /etc/mdadm/mdadm.conf on a fresh Debian
Etch install.

best regards
-- 
Janek Kozicki


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 Michael Tokarev wrote:
 Janek Kozicki wrote:
 Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)

 2. How can I delete that damn array so it doesn't hang my server up
 in a loop?
 dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

 This works provided the superblocks are at the beginning of the
 component devices.  Which is not the case by default (0.90
 superblocks, at the end of components), or with 1.0 superblocks.

   mdadm --zero-superblock /dev/sdb1
 
 Would that work even if he doesn't update his mdadm.conf inside the
 /boot image? Or would mdadm attempt to build the array according to the
 instructions in mdadm.conf? I expect that it might depend on whether the
 instructions are given in terms of UUID or in terms of devices.

After zeroing the superblocks, mdadm will NOT assemble the array,
regardless of whether you use UUIDs or devices or whatever.  In order
to assemble the array, all component devices MUST have valid
superblocks and the superblocks must match each other.

mdadm --assemble in the initramfs will simply fail to do its work.

/mjt


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Moshe Yudkowsky

Michael Tokarev wrote:

Janek Kozicki wrote:

Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)


2. How can I delete that damn array so it doesn't hang my server up in a loop?

dd if=/dev/zero of=/dev/sdb1 bs=1M count=10


This works provided the superblocks are at the beginning of the
component devices.  Which is not the case by default (0.90
superblocks, at the end of components), or with 1.0 superblocks.

  mdadm --zero-superblock /dev/sdb1


Would that work even if he doesn't update his mdadm.conf inside the 
/boot image? Or would mdadm attempt to build the array according to the 
instructions in mdadm.conf? I expect that it might depend on whether the 
instructions are given in terms of UUID or in terms of devices.


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 I think it a greater honour to have my head standing on the ports
  of this town for this quarrel, than to have my portrait in the
  King's bedchamber. -- Montrose, 20 May 1650


Re: Auto generation of mdadm.conf

2008-02-05 Thread Michael Tokarev
Janek Kozicki wrote:
 Michael Tokarev said: (by the date of Tue, 05 Feb 2008 16:52:18 +0300)
 
 Janek Kozicki wrote:
 I'm not using mdadm.conf at all. 
 That's wrong, as you need at least something to identify the array
 components. 
 
 I was afraid of that ;-) So, is that a correct way to automatically
 generate a correct mdadm.conf ? I did it after some digging in man pages:
 
   echo 'DEVICE partitions' > mdadm.conf
   mdadm --examine --scan --config=mdadm.conf >> ./mdadm.conf
 
 Now, when I do 'cat mdadm.conf' i get:
 
  DEVICE partitions
  ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=3 
 UUID=75b0f87879:539d6cee:f22092f4:7a6e6f name='backup':0
  ARRAY /dev/md/2 level=raid1 metadata=1 num-devices=3 
 UUID=4fd340a6c4:db01d6f7:1e03da2d:bdd574 name=backup:2
  ARRAY /dev/md/1 level=raid5 metadata=1 num-devices=3 
 UUID=22f22c3599:613d5231:d407a655:bdeb84 name=backup:1

Hmm.  I wonder why the name for md/0 is in quotes, while others are not.

 Looks quite reasonable. Should I append it to /etc/mdadm/mdadm.conf ?

Probably... see below.

 This file currently contains: (commented lines are left out)
 
   DEVICE partitions
   CREATE owner=root group=disk mode=0660 auto=yes
   HOMEHOST <system>
   MAILADDR root
 
 This is the default content of /etc/mdadm/mdadm.conf on fresh debian
 etch install.

But now I wonder HOW your arrays get assembled in the first place.

Let me guess... mdrun?  Or maybe in-kernel auto-detection?

The thing is that mdadm will NOT assemble your arrays given this
config.

If you have your disk/controller and md drivers built into the
kernel, AND have marked the partitions as "linux raid autodetect",
the kernel may assemble them right at boot.  But I don't remember
whether the kernel will even consider v1 superblocks for its auto-
assembly.  In any case, don't rely on the kernel to do this
work; the in-kernel assembly code is very simplistic and only works
up to the moment when anything changes or breaks.  It's almost
the same code as was in the old raidtools...

Another possibility is the mdrun utility (a shell script) shipped
with Debian's mdadm package.  It's deprecated now, but still
provided for compatibility.  mdrun is even worse: it will
try to assemble ALL arrays found, giving them random names
and numbers, not handling failures correctly, and failing
badly if, for example, a foreign disk is found which
happens to contain a valid raid superblock somewhere...

Well.  There's another, third possibility: mdadm can assemble
all arrays automatically (even if not listed explicitly in
mdadm.conf) using homehost (only available with v1 superblocks).
I haven't tried this option yet, so I don't remember how it
works.  From the mdadm(8) manpage:

   Auto Assembly
   When --assemble is used with --scan and no devices  are  listed,  mdadm
   will  first  attempt  to  assemble  all the arrays listed in the config
   file.

   If a homehost has been specified (either in the config file or  on  the
   command line), mdadm will look further for possible arrays and will try
   to assemble anything that it finds which is tagged as belonging to  the
   given  homehost.   This is the only situation where mdadm will assemble
   arrays without being given specific device name or identity information
   for the array.

   If  mdadm  finds a consistent set of devices that look like they should
   comprise an array, and if the superblock is tagged as belonging to  the
   given  home host, it will automatically choose a device name and try to
   assemble the array.  If the array uses version-0.90 metadata, then  the
   minor  number as recorded in the superblock is used to create a name in
   /dev/md/ so for example /dev/md/3.  If the array uses  version-1  meta‐
   data,  then  the name from the superblock is used to similarly create a
   name in /dev/md (the name will have any ’host’ prefix stripped  first).

So.. probably this is the way your arrays are being assembled, since you
do have HOMEHOST in your mdadm.conf...  Looks like it should work, after
all... ;)  And in this case there's no need to specify additional array
information in the config file.
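
In other words, something like the following (just a sketch) should be
enough to pick up homehost-tagged arrays, with no ARRAY lines at all:

  # requires HOMEHOST <system> (or an explicit host name) in mdadm.conf
  mdadm --assemble --scan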

/mjt


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Keld Jørn Simonsen
On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
 On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
  Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 
  +0100)
  
 
 All the raid10's will have double time for writing, and raid5 and raid6
 will also have double or triple writing times, given that you can do
 striped writes on the raid0. 

For raid5 and raid6 I think this is even worse. My take is that for
raid5 when you write something, you first read the chunk data involved,
then you read the parity data, then you xor-subtract the data to be
changed, and you xor-add the new data, and then write the new data chunk
and the new parity chunk. In total 2 reads and 2 writes. The read/writes
happen on the same chunks, so latency is minimized. But in essence it is
still 4 IO operations, where raid1/raid10 needs only 2 writes;
that is, only half the write speed on raid5 compared to raid1/10.

On raid6 this amounts to 6 IO operations, resulting in 1/3 of the
writing speed of raid1/10.

I note in passing that there is no difference between xor-subtract and
xor-add. 

Also I assume that you can calculate the parities of both raid5 and
raid6 given the old parity chunks and the old and new data chunk.
If you have to calculate the new parities by reading all the component
data chunks, this is going to be really expensive, both in IO and CPU.
For a 10-drive raid5 this would involve reading 9 data chunks, and
would make writes 5 times as expensive as raid1/10.
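
As a back-of-the-envelope sketch of the per-write I/O counts described
above (these numbers are the assumptions in this mail, not measurements):

  raid1_ios=2    # write the block to both mirrors
  raid5_ios=4    # read data + read parity, then write data + write parity
  raid6_ios=6    # as raid5, plus read + write of the second parity
  echo "raid5 costs $((raid5_ios / raid1_ios))x the I/O of raid1/10 per small write"
  echo "raid6 costs $((raid6_ios / raid1_ios))x the I/O of raid1/10 per small write"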

best regards
keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Justin Piszcz



On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:


On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:

On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:

Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 +0100)



All the raid10's will have double time for writing, and raid5 and raid6
will also have double or triple writing times, given that you can do
striped writes on the raid0.


For raid5 and raid6 I think this is even worse. My take is that for
raid5 when you write something, you first read the chunk data involved,
then you read the parity data, then you xor-subtract the data to be
changed, and you xor-add the new data, and then write the new data chunk
and the new parity chunk. In total 2 reads and 2 writes. The read/writes
happen on the same chunks, so latency is minimized. But in essence it is
still 4 IO operations, where it is only 2 writes on raid1/raid10,
that is only half the speed for writing on raid5 compared to raid1/10.

On raid6 this amounts to 6 IO operations, resulting in 1/3 of the
writing speed of raid1/10.

I note in passing that there is no difference between xor-subtract and
xor-add.

Also I assume that you can calculate the parities of both raid5 and
raid6 given the old parities chunks and the old and new data chunk.
If you have to calculate the new parities by reading all the component
data chunks this is going to be really expensive, both in IO and CPU.
For a 10 drive raid5 this would involve reading 9 data chunks, and
making writes 5 times as expensive as raid1/10.

best regards
keld



In my benchmarks RAID5 gave the best overall speed with 10 raptors,
although I did not play with the various RAID10 offsets etc. as much as I
have tweaked the RAID5.


Justin.

recommendations for stripe/chunk size

2008-02-05 Thread Keld Jørn Simonsen
Hi

I am looking at revising our howto. I see a number of places where a
chunk size of 32 kiB is recommended, and even recommendations on
maybe using sizes of 4 kiB. 

My own take on that is that this really hurts performance. 
Normal disks have a rotation speed of between 5400 (laptop),
7200 (IDE/SATA) and 10000 (SCSI) rounds per minute, giving an average
spinning time for one round of 6 to 12 ms, and an average rotational latency
of half this, that is 3 to 6 ms. Then you need to add head movement, which
is something like 2 to 20 ms - in total an average access time of 5 to 26 ms,
averaging around 13-17 ms.

In about 15 ms you can read, on current SATA-II (300 MB/s) or ATA/133,
something like 600 to 1200 kB, at actual transfer rates of
80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck,
and actually transfer some data, you should have something like 256/512 kiB
chunks. With a transfer rate of 50 MB/s and a chunk size of 256 kiB,
giving a time of about 20 ms per transaction,
you should be able to transfer about 12 MB/s with random reads - my
actual figures are about 30 MB/s, which is possibly because of the
elevator effect of the file system driver. With a size of 4 kiB per chunk
you would have a time of about 15 ms per transaction, or 66 transactions per
second, or a transfer rate of about 250 kB/s. So 256 kiB vs 4 kiB speeds up
the transfer by a factor of 50.
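
The arithmetic above can be checked with a small shell snippet (the 15 ms
access time and 50 MB/s transfer rate are the assumptions from this mail):

  seek_ms=15; rate_kb_per_ms=50        # 50 MB/s == 50 kB per ms
  for chunk_kb in 4 256; do
      total_ms=$(echo "$seek_ms + $chunk_kb / $rate_kb_per_ms" | bc -l)
      rate=$(echo "$chunk_kb / $total_ms * 1000" | bc -l)
      printf "chunk %3d kiB: %.1f ms/transaction, ~%.0f kB/s random read\n" \
          "$chunk_kb" "$total_ms" "$rate"
  done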

I actually think the kernel should operate with block sizes
like this and not with 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kiB at a time.

I also see that there are some memory constraints on this.
With maybe 1000 processes reading, as for my mirror service,
256 kiB buffers would be acceptable, occupying 256 MB of RAM.
That is reasonable, and I could even tolerate 512 MB of RAM used.
But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for today's equipment?

Best regards
Keld


Re: Auto generation of mdadm.conf

2008-02-05 Thread Janek Kozicki
Michael Tokarev said: (by the date of Tue, 05 Feb 2008 18:34:47 +0300)

...

 So.. probably this is the way your arrays are being assembled, since you
 do have HOMEHOST in your mdadm.conf...  Looks like it should work, after
 all... ;)  And in this case there's no need to specify additional array
 information in the config file.

Whew, that was a long read. Thanks for the detailed analysis. I hope that
your conclusion is correct, since I have no way to verify this
myself. My knowledge is not enough here :)

best regards
-- 
Janek Kozicki |


Re: recommendations for stripe/chunk size

2008-02-05 Thread Justin Piszcz



On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:


Hi

I am looking at revising our howto. I see a number of places where a
chunk size of 32 kiB is recommended, and even recommendations on
maybe using sizes of 4 kiB.

My own take on that is that this really hurts performance.
Normal disks have a rotation speed of between 5400 (laptop)
7200 (ide/sata) and 1 (SCSI) rounds per minute, giving an average
spinning time for one round of 6 to 12 ms, and average latency of half
this, that is 3 to 6 ms. Then you need to add head movement which
is something like 2 to 20 ms - in total average seek time 5 to 26 ms,
averaging around 13-17 ms.

in about 15 ms you can read on current SATA-II (300 MB/s) or ATA/133
something like between 600 to 1200 kB, actual transfer rates of
80 MB/s on SATA-II and 40 MB/s on ATA/133. So to get some bang for the buck,
and transfer some data you should have something like 256/512 kiB
chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB
giving about a time of 20 ms per transaction
you should be able with random reads to transfer 12 MB/s  - my
actual figures is about 30 MB/s which is possibly because of the
elevator effect of the file system driver. With a size of 4 kb per chunk
you should have a time of 15 ms per transaction, or 66 transactions per
second, or a transfer rate of 250 kb/s. So 256 kb vs 4 kb speeds up
the transfer by a factor of 50.

I actually  think the kernel should operate with block sizes
like this and not wth 4 kiB blocks. It is the readahead and the elevator
algorithms that save us from randomly reading 4 kb a time.

I also see that there are some memory constrints on this.
Having maybe 1000 processes reading, as for my mirror service,
256 kib buffers would be acceptable, occupying 256 MB RAM.
That is reasonable, and I could even tolerate 512 MB ram used.
But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for todays equipment?

Best regards
Keld



My benchmarks concluded that 256 KiB to 1024 KiB is optimal; going too far
below or above that range results in degradation.
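
(For reference, the chunk size is chosen when the array is created; a
sketch, with purely illustrative device names:

  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=256 /dev/sd[b-e]1

where --chunk takes the size in KiB.)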


Justin.

Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Keld Jørn Simonsen
On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote:
 
 
 On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
 
 On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
 On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
 Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 
 +0100)
 
 
 All the raid10's will have double time for writing, and raid5 and raid6
 will also have double or triple writing times, given that you can do
 striped writes on the raid0.
 
 For raid5 and raid6 I think this is even worse. My take is that for
 raid5 when you write something, you first read the chunk data involved,
 then you read the parity data, then you xor-subtract the data to be
 changed, and you xor-add the new data, and then write the new data chunk
 and the new parity chunk. In total 2 reads and 2 writes. The read/writes
 happen on the same chunks, so latency is minimized. But in essence it is
 still 4 IO operations, where it is only 2 writes on raid1/raid10,
 that is only half the speed for writing on raid5 compared to raid1/10.
 
 On raid6 this amounts to 6 IO operations, resulting in 1/3 of the
 writing speed of raid1/10.
 
 I note in passing that there is no difference between xor-subtract and
 xor-add.
 
 Also I assume that you can calculate the parities of both raid5 and
 raid6 given the old parities chunks and the old and new data chunk.
 If you have to calculate the new parities by reading all the component
 data chunks this is going to be really expensive, both in IO and CPU.
 For a 10 drive raid5 this would involve reading 9 data chunks, and
 making writes 5 times as expensive as raid1/10.
 
 best regards
 keld
 
 
 On my benchmarks RAID5 gave the best overall speed with 10 raptors, 
 although I did not play with the various offsets/etc as much as I have 
 tweaked the RAID5.

Could you give some figures? 

best regards
keld


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
 
 % mdadm --zero-superblock /dev/sdb1
 mdadm: Couldn't open /dev/sdb1 for write - not zeroing

That's weird.
Why can't it open it?

Maybe you aren't running as root (The '%' prompt is suspicious).
Maybe the kernel has  been told to forget about the partitions of
/dev/sdb.
mdadm will sometimes tell it to do that, but only if you try to
assemble arrays out of whole components.

If that is the problem, then
   blockdev --rereadpt /dev/sdb

will fix it.

NeilBrown


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Justin Piszcz



On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:


On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote:



On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:


On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:

On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:

Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07
+0100)



All the raid10's will have double time for writing, and raid5 and raid6
will also have double or triple writing times, given that you can do
striped writes on the raid0.


For raid5 and raid6 I think this is even worse. My take is that for
raid5 when you write something, you first read the chunk data involved,
then you read the parity data, then you xor-subtract the data to be
changed, and you xor-add the new data, and then write the new data chunk
and the new parity chunk. In total 2 reads and 2 writes. The read/writes
happen on the same chunks, so latency is minimized. But in essence it is
still 4 IO operations, where it is only 2 writes on raid1/raid10,
that is only half the speed for writing on raid5 compared to raid1/10.

On raid6 this amounts to 6 IO operations, resulting in 1/3 of the
writing speed of raid1/10.

I note in passing that there is no difference between xor-subtract and
xor-add.

Also I assume that you can calculate the parities of both raid5 and
raid6 given the old parities chunks and the old and new data chunk.
If you have to calculate the new parities by reading all the component
data chunks this is going to be really expensive, both in IO and CPU.
For a 10 drive raid5 this would involve reading 9 data chunks, and
making writes 5 times as expensive as raid1/10.

best regards
keld



On my benchmarks RAID5 gave the best overall speed with 10 raptors,
although I did not play with the various offsets/etc as much as I have
tweaked the RAID5.


Could you give some figures?


I remember testing with bonnie++: raid10 was about half the speed 
(200-265 MiB/s) of RAID5 (400-420 MiB/s) for sequential output, but input 
was closer to RAID5 speeds / did not seem affected (~550 MiB/s).


Justin.

Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Janek Kozicki
Justin Piszcz said: (by the date of Tue, 5 Feb 2008 17:28:27 -0500 (EST))

 I remember testing with bonnie++ and raid10 was about half the speed 
 (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, 

writing on raid10 is supposed to be half the speed of reading. That's
because it must write to both mirrors.

IMHO raid5 could perform well here, because in a *continuous* write
operation the blocks from the other HDDs have just been written;
they stay in cache and can be used to calculate the xor. So you could get
close to raid-0 performance here.

Randomly scattered small-sized write operations will kill raid5
performance, for sure, because the corresponding blocks from the other
drives must be read to calculate the parity correctly. I'm wondering
how much raid5 performance would go down... Is there a bonnie++ test
for that, or any other benchmark software for this?
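
Perhaps something like fio's random-write mode would do it (just a guess on
my part; the path and sizes below are made up):

  fio --name=randwrite --directory=/mnt/raid5 --rw=randwrite --bs=4k \
      --size=1g --numjobs=4 --ioengine=libaio --direct=1 --group_reporting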


 but input was closer to RAID5 speeds/did not seem affected (~550MiB/s).

reading in raid5 and raid10 is supposed to be close to raid-0 speed.

-- 
Janek Kozicki |


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Keld Jørn Simonsen
On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote:
 
 
 Could you give some figures?
 
 I remember testing with bonnie++ and raid10 was about half the speed 
 (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input 
 was closer to RAID5 speeds/did not seem affected (~550MiB/s).

Impressive. What level of raid10 was involved? And what type of
equipment, how many disks? Maybe the better output for raid5 could be
due to some striping - AFAIK raid5 stripes quite well, and
write speed almost equal to read speed indicates that the writes are
striping too.

best regards
keld


Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)

2008-02-05 Thread Justin Piszcz



On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:


On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote:




Could you give some figures?


I remember testing with bonnie++ and raid10 was about half the speed
(200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input
was closer to RAID5 speeds/did not seem affected (~550MiB/s).


Impressive. What levet of raid10 was involved? and what type of
Like I said, it was baseline testing, so pretty much the default raid10 
you get when you create it via mdadm; I did not mess with offsets, etc.



equipment, how many disks?

Ten 10,000rpm raptors.


Maybe the better output for raid5 could be
due to some striping - AFAIK raid5 will be striping quite well, and
writes almost equal to reading time indicates that the writes are
striping too.

best regards
keld


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Linda Walsh


Michael Tokarev wrote:

Unfortunately an UPS does not *really* help here.  Because unless
it has control program which properly shuts system down on the loss
of input power, and the battery really has the capacity to power the
system while it's shutting down (anyone tested this? 


Yes.  I must say, I am not connected or paid by APC.


With a new UPS?
and after an year of use, when the battery is not new?), -- unless
the UPS actually has the capacity to shutdown system, it will cut
the power at an unexpected time, while the disk(s) still has dirty
caches...


If you have a SmartUPS by APC, there is a freeware daemon that monitors
its status.  The UPS has USB and serial connections.
 It's included in some distributions (SuSE).  The config
file is pretty straightforward.

I recommend the 1000XL (1000 VA peak load -- usually at startup;
note, this is not the same as watts, as some of us were taught in basic
electronics class, since the load isn't a simple resistor like a light
bulb) over the 1500XL, because with the 1000XL you can buy several
add-on batteries that plug into the back.

One minor (but not fatal) design flaw: the add-on batteries give no indication
that they are live (I knocked a cord loose on one, and only got 7 minutes
of uptime before things shut down, instead of my expected 20).
I have 3 cells total (controller + 1 extra pack).  So why is my run time
so short?  I am being lazy about buying more extension packs.
The UPS is running 3 computers, the house phone (answering machine and wireless
handsets), a digital clock, and 1 LCD (usually off).  The real killer is a
new workstation with 2 dual-core Core 2 chips and other comparable equipment.

The 1500XL doesn't allow for adding more power packs.
The 2200XL does allow extra packs but comes in a rack-mount format.

It's not just a battery backup -- it conditions the power, filtering out
spikes and emitting a pure sine wave.  It will kick in during over- or under-
voltage conditions (you can set the sensitivity).  There is an adjustable alarm
when on battery, and a setting for output volts (115, 230, 120, 240).  It
self-tests at least every 2 weeks, or more often if you like.

It also has a network feature (that I haven't gotten to work yet -- they just
changed the format), that allows other computers on the same net to also be
notified and take action.

You specify what scripts to run at what times (power off, power on, getting
critically low, etc).

Hasn't failed me 'yet' -- cept when a charger died and was replaced free of
cost (within warantee).  I have a separate setup another room for another
computer.

The upspowerd runs on linux or windows (under cygwin, I think).


You can specify when to shut down -- like 5 minutes of battery life left.
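
For example, assuming the daemon in question is apcupsd (the usual APC
daemon on Linux), the shutdown thresholds look roughly like this in its
config file -- illustrative values only:

  # /etc/apcupsd/apcupsd.conf
  UPSCABLE usb
  UPSTYPE usb
  DEVICE
  BATTERYLEVEL 10    # shut down when charge falls below 10%
  MINUTES 5          # ...or when estimated runtime drops below 5 minutes
  TIMEOUT 0          # 0 = rely on BATTERYLEVEL/MINUTES, not a fixed delay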

The controller unit has 1 battery.  But the add-ons have 2 batteries
each, so the first add-on adds 3x to the run-time.  When my system
did shut down prematurely, it went through the full halt sequence,
which I'd presume flushes disk caches.




the drive claims to have metadata safe on disk but actually does not,
and you lose power, the data claimed safe will evaporate, there's not
much the fs can do.  IO write barriers address this by forcing the drive
to flush order-critical data before continuing; xfs has them on by
default, although they are tested at mount time and if you have
something in between xfs and the disks which does not support barriers
(i.e. lvm...) then they are disabled again, with a notice in the logs.



Note also that with linux software raid barriers are NOT supported.

--
Are you sure about this?  When my system boots, I used to have
3 new IDE's, and one older one.  XFS checked each drive for barriers
and turned off barriers for a disk that didn't support it.  ... or
are you referring specifically to linux-raid setups?

Would it be possible on boot to have xfs probe the RAID array,
physically, to see if barriers are really supported (or not), and disable
them if they are not (and optionally disable write caching, but that's
a major performance hit in my experience)?

Linda


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Michael Tokarev
Linda Walsh wrote:
 
 Michael Tokarev wrote:
 Unfortunately an UPS does not *really* help here.  Because unless
 it has control program which properly shuts system down on the loss
 of input power, and the battery really has the capacity to power the
 system while it's shutting down (anyone tested this? 
 
 Yes.  I must say, I am not connected or paid by APC.
 
 With new UPS?
 and after an year of use, when the battery is not new?), -- unless
 the UPS actually has the capacity to shutdown system, it will cut
 the power at an unexpected time, while the disk(s) still has dirty
 caches...
 
 If you have a SmartUPS by APC, their is a freeware demon that monitors
[...]

Good stuff.  I knew at least SOME UPSes are good... ;)
Too bad I rarely see such stuff in use by regular
home users...
[]
 Note also that with linux software raid barriers are NOT supported.
 --
 Are you sure about this?  When my system boots, I used to have
 3 new IDE's, and one older one.  XFS checked each drive for barriers
 and turned off barriers for a disk that didn't support it.  ... or
 are you referring specifically to linux-raid setups?

I'm referring specifically to linux-raid setups (software raid).
md devices don't support barriers, for a very simple
reason: once more than one disk drive is involved, the md layer
can't guarantee ordering ACROSS drives.  The problem is
that in case of a power loss during writes, when an array needs
recovery/resync (at least of the parts which were being written,
if bitmaps are in use), the md layer will choose an arbitrary drive
as the master and copy its data to the other drive (speaking
of the simplest case, a 2-drive raid1 array).  But the thing
is that one drive may have the last two barriers written (I mean
the data that was associated with the barriers), while the
other has neither of the two - in two different places.  And
hence we may see quite some inconsistency here.

This is regardless of whether underlying component devices
supports barriers or not.

 Would it be possible on boot to have xfs probe the Raid array,
 physically, to see if barriers are really supported (or not), and disable
 them if they are not (and optionally disabling write caching, but that's
 a major performance hit in my experience.

Xfs already probes the devices as you describe, exactly the
same way as you've seen with your ide disks, and disables
barriers.
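
One quick way to see whether that happened is to look for the barrier
message in the kernel log after mounting (the exact wording varies by
kernel version; this is only approximate):

  dmesg | grep -i barrier
  # e.g.  Filesystem "md1": Disabling barriers, not supported by the underlying device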

The question and confusion was about what happens when the
barriers are disabled (provided, again, that we don't rely
on a UPS and other external things).  As far as I understand,
when barriers are working properly, xfs should be safe wrt
power losses (I am still a bit unsure about this).  Now, when
barriers are turned off (for whatever reason), is it still
as safe?  I don't know.  Does it use regular cache flushes
in place of barriers in that case (which ARE supported by
the md layer)?

Generally, it has been said numerous times that XFS is not
powercut-friendly, and that it should only be used when everything
is stable, including power.  Hence I'm afraid to deploy it
where I know the power is not stable (we have about 70 such
places here, with servers in each, where they don't always
replace the UPS batteries in time - ext3 has never crashed so
far, while ext2 did).

Thanks.

/mjt