Re[2]: mdadm 2.6.4 : How can I check the current status of reshaping?
Hello, Neil. You wrote (5 February 2008, 01:48:33):

> On Monday February 4, [EMAIL PROTECTED] wrote:
>> [EMAIL PROTECTED]:/# cat /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
>> md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
>>       1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
>> unused devices: <none>
>>
>> But how can I see the status of the reshape? Is it really reshaping, or has it hung, or is mdadm not doing anything at all? How long should I wait for the reshape to finish?
>
> The reshape hasn't restarted. Did you do that "mdadm -w /dev/md1" like I suggested? If so, what happened?
>
> Possibly you tried mounting the filesystem before trying the "mdadm -w". There seems to be a bug such that doing this would cause the reshape not to restart, and "mdadm -w" would not help any more.
>
> I suggest you:
>
>   echo 0 > /sys/module/md_mod/parameters/start_ro
>
> stop the array
>
>   mdadm -S /dev/md1
>
> (after unmounting if necessary). Then assemble the array again. Then "mdadm -w /dev/md1" just to be sure.
>
> If this doesn't work, please report exactly what you did, exactly what message you got and exactly where the message appeared in the kernel log.
>
> NeilBrown
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I read your letter again. The first time I did not do

  echo 0 > /sys/module/md_mod/parameters/start_ro

Now I have done this, then:

  mdadm -S /dev/md1
  mdadm /dev/md1 -A /dev/sd[bcdef]
  mdadm -w /dev/md1

After 2 minutes the kernel showed something (see the log below), but the reshape still appears to be in progress:

[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
      1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [==>..................]  reshape = 10.1% (49591552/488386496) finish=12127.2min speed=602K/sec
unused devices: <none>

[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
      1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [==>..................]  reshape = 10.1% (49591552/488386496) finish=12259.0min speed=596K/sec
unused devices: <none>

[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
      1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [==>..................]  reshape = 10.1% (49591552/488386496) finish=12311.7min speed=593K/sec
unused devices: <none>

[EMAIL PROTECTED]:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
      1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [==>..................]  reshape = 10.1% (49591552/488386496) finish=12338.1min speed=592K/sec
unused devices: <none>

Feb  5 11:54:21 raid01 kernel: raid5: reshape will continue
Feb  5 11:54:21 raid01 kernel: raid5: device sdc operational as raid disk 0
Feb  5 11:54:21 raid01 kernel: raid5: device sdf operational as raid disk 3
Feb  5 11:54:21 raid01 kernel: raid5: device sde operational as raid disk 2
Feb  5 11:54:21 raid01 kernel: raid5: device sdd operational as raid disk 1
Feb  5 11:54:21 raid01 kernel: raid5: allocated 5245kB for md1
Feb  5 11:54:21 raid01 kernel: raid5: raid level 5 set md1 active with 4 out of 5 devices, algorithm 2
Feb  5 11:54:21 raid01 kernel: RAID5 conf printout:
Feb  5 11:54:21 raid01 kernel:  --- rd:5 wd:4
Feb  5 11:54:21 raid01 kernel:  disk 0, o:1, dev:sdc
Feb  5 11:54:21 raid01 kernel:  disk 1, o:1, dev:sdd
Feb  5 11:54:21 raid01 kernel:  disk 2, o:1, dev:sde
Feb  5 11:54:21 raid01 kernel:  disk 3, o:1, dev:sdf
Feb  5 11:54:21 raid01 kernel: ...ok start reshape thread
Feb  5 11:54:21 raid01 mdadm: RebuildStarted event detected on md device /dev/md1
Feb  5 11:54:21 raid01 kernel: md: reshape of RAID array md1
Feb  5 11:54:21 raid01 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb  5 11:54:21 raid01 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Feb  5 11:54:21 raid01 kernel: md: using 128k window, over a total of 488386496 blocks.
Feb  5 11:56:12 raid01 kernel: BUG: unable to handle kernel paging request at virtual address 001cd901
Feb  5 11:56:12 raid01 kernel: printing eip:
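Worth noting in the samples above: the position field (49591552/488386496) is identical in all four readings, and only the decaying speed average changes, so the reshape is not actually advancing. The finish estimate can be re-derived from the position and speed fields; a minimal sketch (the helper name is mine, not part of mdadm, and assumes the 1K-block mdstat format shown above):

```python
import re

def reshape_eta(line):
    """Parse an mdstat reshape status line and estimate minutes to finish.

    Assumes position/total are in 1K blocks and speed is in K/sec,
    as in the /proc/mdstat output above.
    """
    m = re.search(r"\((\d+)/(\d+)\)\s+finish=[\d.]+min\s+speed=(\d+)K/sec", line)
    done, total, speed = (int(g) for g in m.groups())
    remaining_kib = total - done
    return remaining_kib / speed / 60.0  # minutes

line = "reshape = 10.1% (49591552/488386496) finish=12127.2min speed=602K/sec"
print(round(reshape_eta(line)))  # → 12148, close to the kernel's finish=12127.2min
```

The small gap between 12148 and the reported 12127.2 comes from the kernel using a smoothed recent-speed average rather than the instantaneous figure.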
Re: Re[2]: mdadm 2.6.4 : How can I check the current status of reshaping?
On Tuesday February 5, [EMAIL PROTECTED] wrote:
> Feb 5 11:56:12 raid01 kernel: BUG: unable to handle kernel paging request at virtual address 001cd901

This looks like some sort of memory corruption.

> Feb 5 11:56:12 raid01 kernel: EIP is at md_do_sync+0x629/0xa32

This tells us what code is executing.

> Feb 5 11:56:12 raid01 kernel: Code: 54 24 48 0f 87 a4 01 00 00 72 0a 3b 44 24 44 0f 87 98 01 00 00 3b 7c 24 40 75 0a 3b 74 24 3c 0f 84 88 01 00 00 0b 85 30 01 00 00 88 08 0f 85 90 01 00 00 8b 85 30 01 00 00 a8 04 0f 85 82 01 00

This tells us what the actual bytes of code were. If I feed this line (from "Code:" onwards) into ksymoops I get:

   0:  54                     push   %esp
   1:  24 48                  and    $0x48,%al
   3:  0f 87 a4 01 00 00      ja     1ad <_EIP+0x1ad>
   9:  72 0a                  jb     15 <_EIP+0x15>
   b:  3b 44 24 44            cmp    0x44(%esp),%eax
   f:  0f 87 98 01 00 00      ja     1ad <_EIP+0x1ad>
  15:  3b 7c 24 40            cmp    0x40(%esp),%edi
  19:  75 0a                  jne    25 <_EIP+0x25>
  1b:  3b 74 24 3c            cmp    0x3c(%esp),%esi
  1f:  0f 84 88 01 00 00      je     1ad <_EIP+0x1ad>
  25:  0b 85 30 01 00 00      or     0x130(%ebp),%eax
Code;  Before first symbol
  2b:  88 08                  mov    %cl,(%eax)
  2d:  0f 85 90 01 00 00      jne    1c3 <_EIP+0x1c3>
  33:  8b 85 30 01 00 00      mov    0x130(%ebp),%eax
  39:  a8 04                  test   $0x4,%al
  3b:  0f                     .byte 0xf
  3c:  85                     .byte 0x85
  3d:  82                     (bad)
  3e:  01 00                  add    %eax,(%eax)

I removed the "Code;..." lines as they are just noise, except for the one that points to the current instruction in the middle. Note that it is dereferencing %eax after just 'or'ing some value into it, which is rather unusual.

Now get the md-mod.ko for the kernel you are running, run

  gdb md-mod.ko

and give the command

  disassemble md_do_sync

and look for code at offset 0x629, which is 1577 in decimal.
I found a similar kernel to the one you are running, and the matching code is:

  0x55c0 <md_do_sync+1485>:  cmp    0x30(%esp),%eax
  0x55c4 <md_do_sync+1489>:  ja     0x5749 <md_do_sync+1878>
  0x55ca <md_do_sync+1495>:  cmp    0x2c(%esp),%edi
  0x55ce <md_do_sync+1499>:  jne    0x55da <md_do_sync+1511>
  0x55d0 <md_do_sync+1501>:  cmp    0x28(%esp),%esi
  0x55d4 <md_do_sync+1505>:  je     0x5749 <md_do_sync+1878>
  0x55da <md_do_sync+1511>:  mov    0x130(%ebp),%eax
  0x55e0 <md_do_sync+1517>:  test   $0x8,%al
  0x55e2 <md_do_sync+1519>:  jne    0x575f <md_do_sync+1900>
  0x55e8 <md_do_sync+1525>:  mov    0x130(%ebp),%eax
  0x55ee <md_do_sync+1531>:  test   $0x4,%al
  0x55f0 <md_do_sync+1533>:  jne    0x575f <md_do_sync+1900>
  0x55f6 <md_do_sync+1539>:  mov    0x38(%esp),%ecx
  0x55fa <md_do_sync+1543>:  mov    0x0,%eax

Note the sequence cmp, ja, cmp, jne, cmp, je, where the cmp arguments are consecutive 4-byte values on the stack (%esp). In the code from your oops the offsets are 0x44 0x40 0x3c; in the kernel I found they are 0x30 0x2c 0x28. The difference is some subtle difference in the kernel, possibly a different compiler or something.

Anyway, your code crashed at:

  25:  0b 85 30 01 00 00      or     0x130(%ebp),%eax
Code;  Before first symbol
  2b:  88 08                  mov    %cl,(%eax)

The matching code in the kernel I found is:

  0x55da <md_do_sync+1511>:  mov    0x130(%ebp),%eax
  0x55e0 <md_do_sync+1517>:  test   $0x8,%al

Note that you have an 'or' where the kernel I found has a 'mov'. If we look at the actual bytes of code for those two instructions, the code that crashed shows the bytes above:

  0b 85 30 01 00 00 88 08

If I get the same bytes with gdb:

  (gdb) x/8b 0x55da
  0x55da <md_do_sync+1511>:  0x8b  0x85  0x30  0x01  0x00  0x00  0xa8  0x08
  (gdb)

So what should be 8b has become 0b, and what should be a8 has become 88. If you look for the same data in your md-mod.ko you might find slightly different details, but it is clear to me that the code in memory is bad. Possibly you have bad memory, or a bad CPU, or you are overclocking the CPU, or it is getting hot, or something. But you clearly have a hardware error.
NeilBrown
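The substitution identified above (8b has become 0b, and a8 has become 88 per the crashed byte listing) can be checked mechanically: each corrupted byte differs from the expected one by a single cleared bit, which fits a bad-RAM or overheating diagnosis. A small sketch, with the byte pairs taken from the two listings:

```python
def bit_diff(expected, actual):
    """Return the bit positions that differ between two bytes."""
    return [i for i in range(8) if (expected ^ actual) >> i & 1]

# (byte from the good kernel's md-mod.ko, byte from the crashed Code: dump)
pairs = [(0x8b, 0x0b), (0xa8, 0x88)]
for good, bad in pairs:
    print(f"{good:#04x} -> {bad:#04x}: bits flipped {bit_diff(good, bad)}")
# 0x8b -> 0x0b: bits flipped [7]
# 0xa8 -> 0x88: bits flipped [5]
```

Each pair shows exactly one high bit cleared, the classic signature of flaky memory rather than a software bug.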
Deleting mdadm RAID arrays
Hello everyone,

I have had a problem with a RAID array (udev messed up the disk names; I had RAID on whole disks, without raid partitions) on a Debian Etch server with 6 disks, so I decided to rearrange this. I deleted the disks from the (2 RAID-5) arrays, deleted the md* devices from /dev, created /dev/sd[a-f]1 "Linux raid auto-detect" partitions and rebooted the host.

Now the mdadm startup script keeps writing, in a loop, a message like:

  mdadm: warning: /dev/sda1 and /dev/sdb1 have similar superblocks. If they are not identical, --zero the superblock ...

The host can't boot now because of this. If I boot the server with some disks, I can't even zero that superblock:

% mdadm --zero-superblock /dev/sdb1
mdadm: Couldn't open /dev/sdb1 for write - not zeroing

It's the same even after:

% mdadm --manage /dev/md2 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md2

Now, I have NEVER created a /dev/md2 array, yet it shows up automatically!

% cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md2 : active(auto-read-only) raid1 sdb1[1]
      390708736 blocks [3/1] [_U_]
md1 : inactive sda1[2]
      390708736 blocks
unused devices: <none>

Questions:

1. Where does this array info reside?! I have deleted /etc/mdadm/mdadm.conf and the /dev/md devices, and yet it comes seemingly out of nowhere.
2. How can I delete that damn array so it doesn't hang my server in a loop?

--
Marcin Krol
Re: Deleting mdadm RAID arrays
> 1. Where does this array info reside?! I have deleted /etc/mdadm/mdadm.conf and the /dev/md devices, and yet it comes seemingly out of nowhere.

/boot has a copy of mdadm.conf so that / and other drives can be started and then mounted. update-initramfs will update /boot's copy of mdadm.conf.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
Re: Deleting mdadm RAID arrays
Marcin Krol said (by the date of Tue, 5 Feb 2008 11:42:19 +0100):

> 2. How can I delete that damn array so it doesn't hang my server in a loop?

  dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

I'm not using mdadm.conf at all. Everything is stored in the superblock of the device. So if you don't erase it, info about the raid array will still be found automatically.

--
Janek Kozicki
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote:
> note that with some workloads, write caching in the drive actually makes write speed worse, not better - namely, in case of massive writes.

With write barriers enabled, I did a quick test of a large copy from one backup filesystem to another. I'm not sure what you refer to when you say large, but this disk has 387G used in 975 files, averaging about 406MB/file. I was copying from /hde (ATA100-750G) to /sdb (SATA-300-750G), both basically the same underlying model. Of course your 'mileage may vary', and these were averages over 12 runs each (with + without write caching):

  (write cache on)    TPS      write MB/s   read MB/s
  hde ave             64.67    30.94        0.0
  sdb ave             249.5    10.24        30.93

  (write cache off)   TPS      write MB/s   read MB/s
  hde ave             45.63    21.81        0.0
  xx: ave             177.76   0.24         21.96

  write w/cache   = (30.94-21.81)/21.81    ≈ 42% faster
  w/o write cache = 100-(100*21.81/30.94)  ≈ 30% slower

These disks have barrier support, so I'd guess the differences would have been greater if you didn't worry about losing write-cache contents. If barrier support doesn't work and one has to disable write-caching, that is a noticeable performance penalty. All writes with noatime, nodiratime, logbufs=8.

FWIW, slightly OT: the rates under Windows for write-through (FAT32) vs. write-back caching (NTFS) were FAT about 60% faster than NTFS, or NTFS ~40% slower than FAT32 (with options for no-last-access and no 3.1 filename creation).
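The percentage figures can be recomputed from the hde write averages in the table (30.94 MB/s cached, 21.81 MB/s uncached); a quick sketch, with helper names of my own choosing:

```python
def pct_faster(with_cache, without_cache):
    """Percent speedup of cached vs uncached write throughput (MB/s)."""
    return 100.0 * (with_cache - without_cache) / without_cache

def pct_slower(without_cache, with_cache):
    """Percent slowdown when the write cache is disabled."""
    return 100.0 - 100.0 * without_cache / with_cache

print(round(pct_faster(30.94, 21.81)))  # → 42 (% faster with write cache)
print(round(pct_slower(21.81, 30.94)))  # → 30 (% slower without it)
```

The two numbers differ because they use different baselines: the speedup is relative to the uncached rate, the slowdown relative to the cached rate.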
Re: Deleting mdadm RAID arrays
Janek Kozicki wrote:
> Marcin Krol said (by the date of Tue, 5 Feb 2008 11:42:19 +0100):
>> 2. How can I delete that damn array so it doesn't hang my server in a loop?
>
>   dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

This works provided the superblocks are at the beginning of the component devices. Which is not the case by default (0.90 superblocks are at the end of the components), nor with 1.0 superblocks.

  mdadm --zero-superblock /dev/sdb1

is the way to go here.

> I'm not using mdadm.conf at all. Everything is stored in the superblock of the device. So if you don't erase it, info about the raid array will still be found automatically.

That's wrong, as you need at least something to identify the array components. UUID is the most reliable and commonly used. You assemble the arrays as

  mdadm --assemble /dev/md1 --uuid=123456789

or something like that anyway. If not, your arrays may not start properly in case you shuffled disks (e.g. replaced a bad one), or your disks were renumbered after a kernel or other hardware change, and so on. The most convenient place to store that info is mdadm.conf. Here, it looks just like:

  DEVICE partitions
  ARRAY /dev/md1 UUID=4ee58096:e5bc04ac:b02137be:3792981a
  ARRAY /dev/md2 UUID=b4dec03f:24ec8947:1742227c:761aa4cb

By default mdadm offers additional information which helps to diagnose possible problems, namely:

  ARRAY /dev/md5 level=raid5 num-devices=4 UUID=6dc4e503:85540e55:d935dea5:d63df51b

This extra info isn't necessary for mdadm to work (but the UUID is), yet it comes in handy sometimes.

/mjt
Re: Auto generation of mdadm.conf (was: Deleting mdadm RAID arrays)
Michael Tokarev said (by the date of Tue, 05 Feb 2008 16:52:18 +0300):
> Janek Kozicki wrote:
>> I'm not using mdadm.conf at all.
>
> That's wrong, as you need at least something to identify the array components.

I was afraid of that ;-) So, is this a correct way to automatically generate a correct mdadm.conf? I did it after some digging in the man pages:

  echo 'DEVICE partitions' > mdadm.conf
  mdadm --examine --scan --config=mdadm.conf >> ./mdadm.conf

Now, when I do 'cat mdadm.conf' I get:

  DEVICE partitions
  ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=3 UUID=75b0f87879:539d6cee:f22092f4:7a6e6f name='backup':0
  ARRAY /dev/md/2 level=raid1 metadata=1 num-devices=3 UUID=4fd340a6c4:db01d6f7:1e03da2d:bdd574 name=backup:2
  ARRAY /dev/md/1 level=raid5 metadata=1 num-devices=3 UUID=22f22c3599:613d5231:d407a655:bdeb84 name=backup:1

Looks quite reasonable. Should I append it to /etc/mdadm/mdadm.conf? That file currently contains (commented lines left out):

  DEVICE partitions
  CREATE owner=root group=disk mode=0660 auto=yes
  HOMEHOST <system>
  MAILADDR root

This is the default content of /etc/mdadm/mdadm.conf on a fresh Debian Etch install.

best regards
--
Janek Kozicki
Re: Deleting mdadm RAID arrays
Moshe Yudkowsky wrote:
> Michael Tokarev wrote:
>> [...]
>>   mdadm --zero-superblock /dev/sdb1
>
> Would that work even if he doesn't update his mdadm.conf inside the /boot image? Or would mdadm attempt to build the array according to the instructions in mdadm.conf? I expect that it might depend on whether the instructions are given in terms of UUID or in terms of devices.

After zeroing the superblocks, mdadm will NOT assemble the array, regardless of whether it uses UUIDs or devices or whatever. In order to assemble the array, all component devices MUST have valid superblocks, and the superblocks must match each other. mdadm --assemble in the initramfs will simply fail to do its work.

/mjt
Re: Deleting mdadm RAID arrays
Michael Tokarev wrote:
> Janek Kozicki wrote:
>> Marcin Krol said (by the date of Tue, 5 Feb 2008 11:42:19 +0100):
>>> 2. How can I delete that damn array so it doesn't hang my server in a loop?
>>
>>   dd if=/dev/zero of=/dev/sdb1 bs=1M count=10
>
> This works provided the superblocks are at the beginning of the component devices. Which is not the case by default (0.90 superblocks, at the end of components), or with 1.0 superblocks.
>
>   mdadm --zero-superblock /dev/sdb1

Would that work even if he doesn't update his mdadm.conf inside the /boot image? Or would mdadm attempt to build the array according to the instructions in mdadm.conf? I expect that it might depend on whether the instructions are given in terms of UUID or in terms of devices.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
"I think it a greater honour to have my head standing on the ports of this town for this quarrel, than to have my portrait in the King's bedchamber." -- Montrose, 20 May 1650
Re: Auto generation of mdadm.conf
Janek Kozicki wrote:
> Michael Tokarev said (by the date of Tue, 05 Feb 2008 16:52:18 +0300):
>> Janek Kozicki wrote:
>>> I'm not using mdadm.conf at all.
>>
>> That's wrong, as you need at least something to identify the array components.
>
> I was afraid of that ;-) So, is this a correct way to automatically generate a correct mdadm.conf? I did it after some digging in the man pages:
>
>   echo 'DEVICE partitions' > mdadm.conf
>   mdadm --examine --scan --config=mdadm.conf >> ./mdadm.conf
>
> Now, when I do 'cat mdadm.conf' I get:
>
>   DEVICE partitions
>   ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=3 UUID=75b0f87879:539d6cee:f22092f4:7a6e6f name='backup':0
>   ARRAY /dev/md/2 level=raid1 metadata=1 num-devices=3 UUID=4fd340a6c4:db01d6f7:1e03da2d:bdd574 name=backup:2
>   ARRAY /dev/md/1 level=raid5 metadata=1 num-devices=3 UUID=22f22c3599:613d5231:d407a655:bdeb84 name=backup:1

Hmm. I wonder why the name for md/0 is in quotes, while the others are not.

> Looks quite reasonable. Should I append it to /etc/mdadm/mdadm.conf?

Probably... see below.

> That file currently contains (commented lines left out):
>
>   DEVICE partitions
>   CREATE owner=root group=disk mode=0660 auto=yes
>   HOMEHOST <system>
>   MAILADDR root
>
> This is the default content of /etc/mdadm/mdadm.conf on a fresh Debian Etch install.

But now I wonder HOW your arrays get assembled in the first place. Let me guess... mdrun? Or maybe in-kernel auto-detection? The thing is that mdadm will NOT assemble your arrays given this config.

If you have your disk/controller and md drivers built into the kernel, AND marked the partitions as "linux raid autodetect", the kernel may assemble them right at boot. But I don't remember if the kernel will even consider v.1 superblocks for its auto-assembly. In any case, don't rely on the kernel to do this work; the in-kernel assembly code is very simplistic and works only up to the moment when anything changes/breaks. It's almost the same code as was in the old raidtools...

Another possibility is the mdrun utility (a shell script) shipped with Debian's mdadm package.
It's deprecated now, but still provided for compatibility. mdrun is even worse: it will try to assemble ALL arrays found, giving them random names and numbers, not handling failures correctly, and failing badly in case, e.g., a foreign disk is found which happens to contain a valid raid superblock somewhere...

Well. There's a third possibility: mdadm can assemble all arrays automatically (even if not listed explicitly in mdadm.conf) using homehost (only available with v.1 superblocks). I haven't tried this option yet, so I don't remember how it works. From the mdadm(8) manpage:

   Auto Assembly
     When --assemble is used with --scan and no devices are listed, mdadm
     will first attempt to assemble all the arrays listed in the config
     file.

     If a homehost has been specified (either in the config file or on the
     command line), mdadm will look further for possible arrays and will
     try to assemble anything that it finds which is tagged as belonging
     to the given homehost. This is the only situation where mdadm will
     assemble arrays without being given specific device name or identity
     information for the array.

     If mdadm finds a consistent set of devices that look like they should
     comprise an array, and if the superblock is tagged as belonging to
     the given home host, it will automatically choose a device name and
     try to assemble the array. If the array uses version-0.90 metadata,
     then the minor number as recorded in the superblock is used to create
     a name in /dev/md/, so for example /dev/md/3. If the array uses
     version-1 metadata, then the name from the superblock is used to
     similarly create a name in /dev/md (the name will have any 'host'
     prefix stripped first).

So... probably this is the way your arrays are being assembled, since you do have HOMEHOST in your mdadm.conf. Looks like it should work after all... ;) And in this case there's no need to specify additional array information in the config file.
/mjt
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote:
> On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote:
>> Keld Jørn Simonsen said (by the date of Wed, 30 Jan 2008 23:00:07 +0100):
>>> All the raid10's will have double time for writing, and raid5 and raid6 will also have double or triple writing times, given that you can do striped writes on the raid0.
>
> For raid5 and raid6 I think this is even worse. My take is that for raid5, when you write something, you first read the chunk data involved, then you read the parity data, then you xor-subtract the data to be changed and xor-add the new data, and then write the new data chunk and the new parity chunk. In total: 2 reads and 2 writes. The reads/writes happen on the same chunks, so latency is minimized. But in essence it is still 4 IO operations, where it is only 2 writes on raid1/raid10; that is, only half the speed for writing on raid5 compared to raid1/10. On raid6 this amounts to 6 IO operations, resulting in 1/3 of the writing speed of raid1/10.

I note in passing that there is no difference between xor-subtract and xor-add. Also, I assume that you can calculate the parities of both raid5 and raid6 given the old parity chunks and the old and new data chunk. If you have to calculate the new parities by reading all the component data chunks, this is going to be really expensive, both in IO and CPU. For a 10 drive raid5 this would involve reading 9 data chunks, making writes 5 times as expensive as raid1/10.

best regards
keld
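The read-modify-write accounting in this post can be condensed into a toy calculator (a sketch of the reasoning above, not a model of what the md driver actually issues; function name is mine):

```python
def write_ios(level):
    """Back-of-envelope IO operations for one small random write.

    Follows the post's accounting: RAID1/10 issue two mirrored writes;
    RAID5 reads old data + parity and writes both back; RAID6 does the
    same with two parity chunks (P and Q).
    """
    if level in ("raid1", "raid10"):
        return 2          # two mirrored writes
    if level == "raid5":
        return 2 + 2      # read data + parity, write data + parity
    if level == "raid6":
        return 3 + 3      # read data + P + Q, write data + P + Q
    raise ValueError(level)

for lvl in ("raid1", "raid5", "raid6"):
    ratio = write_ios("raid1") / write_ios(lvl)
    print(f"{lvl}: {write_ios(lvl)} IOs, {ratio:.2f}x the raid1/10 write speed")
# raid1: 2 IOs, 1.00x; raid5: 4 IOs, 0.50x; raid6: 6 IOs, 0.33x
```

This reproduces the post's half-speed figure for raid5 and one-third for raid6, under the assumption that old parity can be reused rather than recomputed from all data chunks.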
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> [...]
> For a 10 drive raid5 this would involve reading 9 data chunks, and making writes 5 times as expensive as raid1/10.

On my benchmarks RAID5 gave the best overall speed with 10 raptors, although I did not play with the various offsets/etc as much as I have tweaked the RAID5.

Justin.
recommendations for stripe/chunk size
Hi,

I am looking at revising our howto. I see a number of places where a chunk size of 32 kiB is recommended, and even recommendations on maybe using sizes of 4 kiB. My own take on that is that this really hurts performance.

Normal disks have a rotation speed of between 5400 (laptop), 7200 (ide/sata) and 10000 (SCSI) rounds per minute, giving an average spinning time for one round of 6 to 12 ms, and an average rotational latency of half this, that is 3 to 6 ms. Then you need to add head movement, which is something like 2 to 20 ms - in total an average seek time of 5 to 26 ms, averaging around 13-17 ms. In about 15 ms you can read, on current SATA-II (300 MB/s) or ATA/133, something like 600 to 1200 kB, at actual transfer rates of 80 MB/s on SATA-II and 40 MB/s on ATA/133.

So to get some bang for the buck and actually transfer some data, you should have something like 256/512 kiB chunks. With a transfer rate of 50 MB/s and chunk sizes of 256 kiB, giving a time of about 20 ms per transaction, you should be able with random reads to transfer 12 MB/s - my actual figures are about 30 MB/s, which is possibly because of the elevator effect of the file system driver. With a size of 4 kB per chunk you would have a time of 15 ms per transaction, or 66 transactions per second, or a transfer rate of about 250 kB/s. So 256 kB vs 4 kB speeds up the transfer by a factor of 50.

I actually think the kernel should operate with block sizes like this, and not with 4 kiB blocks. It is the readahead and the elevator algorithms that save us from randomly reading 4 kB at a time.

I also see that there are some memory constraints on this. Having maybe 1000 processes reading, as for my mirror service, 256 kiB buffers would be acceptable, occupying 256 MB RAM. That is reasonable, and I could even tolerate 512 MB RAM used. But going to 1 MiB buffers would be overdoing it for my configuration.

What would be the recommended chunk size for today's equipment?
Best regards
Keld
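The arithmetic above can be reproduced with a small seek-plus-transfer model (the assumed numbers are the post's own: ~15 ms average positioning time and 50 MB/s sustained transfer; the function name is mine):

```python
def random_read_rate(chunk_kib, seek_ms=15.0, xfer_mb_s=50.0):
    """Estimated sustained MB/s for random reads of one chunk per seek.

    seek_ms: average positioning time (seek + rotational latency);
    xfer_mb_s: sustained media transfer rate.
    """
    xfer_ms = chunk_kib / 1024.0 / xfer_mb_s * 1000.0  # time spent transferring
    per_txn_ms = seek_ms + xfer_ms                     # total time per transaction
    return (chunk_kib / 1024.0) / (per_txn_ms / 1000.0)

print(f"{random_read_rate(4):.2f} MB/s")    # → 0.26 MB/s for 4 kiB chunks
print(f"{random_read_rate(256):.1f} MB/s")  # → 12.5 MB/s for 256 kiB chunks
```

This matches the post's ~250 kB/s and ~12 MB/s figures, and the ratio between the two (about 48x) is the quoted "factor of 50".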
Re: Auto generation of mdadm.conf
Michael Tokarev said (by the date of Tue, 05 Feb 2008 18:34:47 +0300):
> ... So... probably this is the way your arrays are being assembled, since you do have HOMEHOST in your mdadm.conf. Looks like it should work after all... ;) And in this case there's no need to specify additional array information in the config file.

Whew, that was a long read. Thanks for the detailed analysis. I hope that your conclusion is correct, since I have no way to decide this by myself; my knowledge is not enough here :)

best regards
--
Janek Kozicki
Re: recommendations for stripe/chunk size
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote:
> [...]
> What would be the recommended chunk size for today's equipment?

My benchmarks concluded that 256 KiB to 1024 KiB is optimal; too far below or above that range results in degradation.

Justin.
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote:
> [...]
> On my benchmarks RAID5 gave the best overall speed with 10 raptors, although I did not play with the various offsets/etc as much as I have tweaked the RAID5.

Could you give some figures?
best regards keld
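The read-modify-write arithmetic Keld describes can be sketched in a few lines (a minimal illustration in Python, not md's actual code). Because xor is its own inverse, "xor-subtract" and "xor-add" really are the same operation, and the incremental parity update needs only the old parity, the old chunk, and the new chunk:

```python
def full_parity(chunks):
    """Parity recomputed from every data chunk (the expensive path)."""
    p = 0
    for c in chunks:
        p ^= c
    return p

def rmw_parity(old_parity, old_chunk, new_chunk):
    """Incremental read-modify-write update: 2 reads + 2 writes,
    independent of how many data disks the stripe spans."""
    return old_parity ^ old_chunk ^ new_chunk

chunks = [0b1010, 0b0110, 0b1111, 0b0001]
parity = full_parity(chunks)

new_chunk = 0b0011                      # overwrite chunk 2
parity = rmw_parity(parity, chunks[2], new_chunk)
chunks[2] = new_chunk

assert parity == full_parity(chunks)    # incremental matches full recompute
```

The final assertion is exactly Keld's assumption: the new parity computed from the old parity and the changed chunk agrees with a full recompute over all component chunks.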
Re: Deleting mdadm RAID arrays
On Tuesday February 5, [EMAIL PROTECTED] wrote: % mdadm --zero-superblock /dev/sdb1 mdadm: Couldn't open /dev/sdb1 for write - not zeroing That's weird. Why can't it open it? Maybe you aren't running as root (The '%' prompt is suspicious). Maybe the kernel has been told to forget about the partitions of /dev/sdb. mdadm will sometimes tell it to do that, but only if you try to assemble arrays out of whole components. If that is the problem, then blockdev --rereadpt /dev/sdb will fix it. NeilBrown
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote: On Tue, Feb 05, 2008 at 11:54:27AM -0500, Justin Piszcz wrote: On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote: On Thu, Jan 31, 2008 at 02:55:07AM +0100, Keld Jørn Simonsen wrote: On Wed, Jan 30, 2008 at 11:36:39PM +0100, Janek Kozicki wrote: Keld Jørn Simonsen said: (by the date of Wed, 30 Jan 2008 23:00:07 +0100)

All the raid10's will have double time for writing, and raid5 and raid6 will also have double or triple writing times, given that you can do striped writes on the raid0.

For raid5 and raid6 I think this is even worse. My take is that for raid5, when you write something, you first read the data chunk involved, then you read the parity chunk, then you xor-subtract the data to be changed and xor-add the new data, and then write the new data chunk and the new parity chunk: in total 2 reads and 2 writes. The reads and writes happen on the same chunks, so latency is minimized, but in essence it is still 4 IO operations, where raid1/raid10 need only 2 writes; that is only half the writing speed of raid1/10. On raid6 this amounts to 6 IO operations, resulting in 1/3 of the writing speed of raid1/10.

I note in passing that there is no difference between xor-subtract and xor-add. Also, I assume that you can calculate the parities of both raid5 and raid6 given the old parity chunks and the old and new data chunk. If you have to calculate the new parities by reading all the component data chunks, this is going to be really expensive, both in IO and CPU. For a 10-drive raid5 this would involve reading 9 data chunks, making writes 5 times as expensive as raid1/10.
best regards keld

On my benchmarks RAID5 gave the best overall speed with 10 raptors, although I did not play with the various offsets etc. as much as I have tweaked the RAID5. Could you give some figures?

I remember testing with bonnie++: raid10 was about half the speed (200-265 MiB/s) of RAID5 (400-420 MiB/s) for sequential output, but input was closer to RAID5 speeds / did not seem affected (~550 MiB/s). Justin.
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
Justin Piszcz said: (by the date of Tue, 5 Feb 2008 17:28:27 -0500 (EST)) I remember testing with bonnie++ and raid10 was about half the speed (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output,

Writing on raid10 is supposed to be half the speed of reading; that's because it must write to both mirrors. IMHO raid5 could perform well here, because in a *continuous* write operation the blocks from the other HDDs have just been written, so they stay in cache and can be used to calculate the xor. So you could get close to almost raid-0 performance here. Randomly scattered small-sized write operations will kill raid5 performance, for sure, because the corresponding blocks from a few other drives must be read to calculate parity correctly. I'm wondering how much raid5 performance would go down... Is there a bonnie++ test for that, or any other benchmark software for this?

but input was closer to RAID5 speeds/did not seem affected (~550MiB/s).

Reading in raid5 and raid10 is supposed to be close to raid-0 speed. -- Janek Kozicki
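The distinction Janek draws, continuous full-stripe writes versus randomly scattered small writes, can be put into a toy cost model (an illustration under the read-modify-write assumption above, not a benchmark of md itself):

```python
def raid5_io_ops(data_disks, chunks_written):
    """IO operations needed to update one raid5 stripe (toy model).

    A full-stripe write can compute parity from the new data alone, so it
    needs no reads; a partial write must read each touched data chunk plus
    the old parity, then write the new data and the new parity back.
    """
    if chunks_written == data_disks:        # full stripe: raid0-like
        return {"reads": 0, "writes": data_disks + 1}
    return {"reads": chunks_written + 1, "writes": chunks_written + 1}

# 10-drive array (9 data + 1 parity): a sequential full stripe needs no reads,
print(raid5_io_ops(9, 9))   # {'reads': 0, 'writes': 10}
# while one scattered chunk costs 4 IOs for a single chunk of payload.
print(raid5_io_ops(9, 1))   # {'reads': 2, 'writes': 2}
```

This is why sequential output on raid5 can look striped (close to raid0) in bonnie++ while random small writes collapse: the per-write read traffic only disappears when whole stripes are written at once.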
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote: Could you give some figures? I remember testing with bonnie++ and raid10 was about half the speed (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input was closer to RAID5 speeds/did not seem affected (~550MiB/s).

Impressive. What level of raid10 was involved? And what type of equipment, how many disks? Maybe the better output for raid5 could be due to some striping - AFAIK raid5 will be striping quite well, and writes almost equal to reading time indicates that the writes are striping too. best regards keld
Re: which raid level gives maximum overall speed? (raid-10,f2 vs. raid-0)
On Tue, 5 Feb 2008, Keld Jørn Simonsen wrote: On Tue, Feb 05, 2008 at 05:28:27PM -0500, Justin Piszcz wrote: Could you give some figures? I remember testing with bonnie++ and raid10 was about half the speed (200-265 MiB/s) as RAID5 (400-420 MiB/s) for sequential output, but input was closer to RAID5 speeds/did not seem affected (~550MiB/s). Impressive. What level of raid10 was involved? and what type of

Like I said, it was baseline testing, so pretty much the default raid10 when you create it via mdadm; I did not mess with offsets, etc.

equipment, how many disks?

Ten 10,000rpm raptors.

Maybe the better output for raid5 could be due to some striping - AFAIK raid5 will be striping quite well, and writes almost equal to reading time indicates that the writes are striping too. best regards keld
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Michael Tokarev wrote: Unfortunately an UPS does not *really* help here. Because unless it has a control program which properly shuts the system down on the loss of input power, and the battery really has the capacity to power the system while it's shutting down (anyone tested this?

Yes. I must say, I am not connected to or paid by APC.

With a new UPS? and after a year of use, when the battery is not new?), -- unless the UPS actually has the capacity to shut down the system, it will cut the power at an unexpected time, while the disk(s) still have dirty caches...

If you have a SmartUPS by APC, there is a freeware daemon that monitors its status. The UPS has USB and serial connections. It's included in some distributions (SuSE). The config file is pretty straightforward. I recommend the 1000XL (1000 peak Volt-Amp load -- usually at startup; note, this is not the same as watts, as some of us were taught in basic electronics class, since the unit isn't a simple resistor like a light bulb) over the 1500XL, because with the 1000XL you can buy several add-on batteries that plug into the back. One minor (but not fatal) design flaw: the add-on batteries give no indication that they are live (I knocked a cord on one, and only got 7 minutes of uptime before things shut down instead of my expected 20). I have 3 cells total (controller + 1 extra pack). So why is my run time so short? I am being lazy in buying more extension packs. The UPS is running 3 computers, the house phone (answering machine and wireless handsets), a digital clock, and 1 LCD (usually off). The real killer is a new workstation with 2x2-Core-II chips and other comparable equipment. The 1500XL doesn't allow for adding more power packs. The 2200XL does allow extra packs but comes in a rack-mount format. It's not just a battery backup -- it conditions the power, filtering out spikes and emitting a pure sine wave. It will kick in during over- or under-voltage conditions (you can set the sensitivity).
Adjustable alarm when on battery, setting of output volts (115, 230, 120, 240). It self-tests at least every 2 weeks or more often (to your fancy). It also has a network feature (that I haven't gotten to work yet -- they just changed the format) that allows other computers on the same net to also be notified and take action. You specify what scripts to run at what times (power off, power on, getting critically low, etc.). Hasn't failed me 'yet' -- except when a charger died and was replaced free of cost (within warranty). I have a separate setup in another room for another computer. The upspowerd runs on linux or windows (under cygwin, I think). You can specify when to shut down -- like 5 minutes of battery life left. The controller unit has 1 battery, but the add-ons have 2 batteries each, so the first add-on adds 3x to the run-time. When my system did shut down prematurely, it went through the full halt sequence, which I'd presume flushes disk caches.

If the drive claims to have metadata safe on disk but actually does not, and you lose power, the data claimed safe will evaporate; there's not much the fs can do. IO write barriers address this by forcing the drive to flush order-critical data before continuing; xfs has them on by default, although they are tested at mount time, and if you have something in between xfs and the disks which does not support barriers (i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that with linux software raid barriers are NOT supported. -- Are you sure about this? When my system boots, I used to have 3 new IDEs and one older one. XFS checked each drive for barriers and turned off barriers for the disk that didn't support it. ... or are you referring specifically to linux-raid setups?
Would it be possible on boot to have xfs probe the Raid array, physically, to see if barriers are really supported (or not), and disable them if they are not (and optionally disable write caching, but that's a major performance hit in my experience)? Linda
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Linda Walsh wrote: Michael Tokarev wrote: Unfortunately an UPS does not *really* help here. Because unless it has a control program which properly shuts the system down on the loss of input power, and the battery really has the capacity to power the system while it's shutting down (anyone tested this? Yes. I must say, I am not connected to or paid by APC. With a new UPS? and after a year of use, when the battery is not new?), -- unless the UPS actually has the capacity to shut down the system, it will cut the power at an unexpected time, while the disk(s) still have dirty caches... If you have a SmartUPS by APC, there is a freeware daemon that monitors [...]

Good stuff. I knew at least SOME UPSes are good... ;) Too bad I rarely see such stuff in use by regular home users... []

Note also that with linux software raid barriers are NOT supported. -- Are you sure about this? When my system boots, I used to have 3 new IDEs and one older one. XFS checked each drive for barriers and turned off barriers for the disk that didn't support it. ... or are you referring specifically to linux-raid setups?

I'm referring especially to linux-raid setups (software raid). md devices don't support barriers, for a very simple reason: once more than one disk drive is involved, the md layer can't guarantee ordering ACROSS drives too. The problem is that in case of power loss during writes, when an array needs recovery/resync (at least for the parts which were being written, if bitmaps are in use), the md layer will choose an arbitrary drive as a master and will copy its data to the other drive (speaking of the simplest case of a 2-drive raid1 array). But the thing is that one drive may have the two last barriers written (I mean the data that was associated with the barriers), and the other neither of the two - in two different places. And hence we may see quite some inconsistency here. This is regardless of whether the underlying component devices support barriers or not.
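The resync hazard Michael describes can be put into a toy model (hypothetical Python, not md's implementation): the arbitrarily chosen master is copied over the other mirror wholesale, so a barriered write that reached only the non-master disk silently disappears.

```python
import copy

def resync(mirrors, master):
    """Toy raid1 resync: copy the chosen master over every other mirror."""
    for name in mirrors:
        if name != master:
            mirrors[name] = copy.deepcopy(mirrors[master])
    return mirrors

# Power was lost mid-write: disk_a received commit-42, disk_b never did.
mirrors = {
    "disk_a": {"journal": "commit-42"},
    "disk_b": {"journal": "commit-41"},
}

# md picks a master arbitrarily; suppose it picks disk_b.
after = resync(mirrors, master="disk_b")

# The commit the filesystem believed was safely barriered is now gone
# from both mirrors, even though every component disk honored its barriers.
print(after["disk_a"]["journal"])   # commit-41
```

This is the sense in which barrier ordering cannot be guaranteed across drives: each component disk may individually keep its promises, yet the post-crash copy direction decides which history survives.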
Would it be possible on boot to have xfs probe the Raid array, physically, to see if barriers are really supported (or not), and disable them if they are not (and optionally disable write caching, but that's a major performance hit in my experience)?

Xfs already probes the devices as you describe, exactly the same way as you've seen with your ide disks, and disables barriers. The question and confusion was about what happens when the barriers are disabled (provided, again, that we don't rely on a UPS and other external things). As far as I understand, when barriers are working properly, xfs should be safe wrt power losses (still a bit unsure about this). Now, when barriers are turned off (for whatever reason), is it still as safe? I don't know. Does it use regular cache flushes in place of barriers in that case (which ARE supported by the md layer)?

Generally, it has been said numerous times that XFS is not powercut-friendly, and that it should be used where everything is stable, including power. Hence I'm afraid to deploy it where I know the power is not stable (we've about 70 such places here, with a server in each, where they don't always replace UPS batteries in time - ext3fs has never crashed so far, while ext2 did). Thanks. /mjt