Re: Time to deprecate old RAID formats?
On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote:
> Doug Ledford wrote:
> > I would argue that ext[234] should be clearing those 512 bytes. Why aren't they cleared? Actually, I didn't think msdos used the first 512 bytes, for the same reason ext3 doesn't: space for a boot sector.
>
> The creators of MS-DOS put the superblock in the boot sector, so that the BIOS loads them both. It made sense in some diseased Microsoft programmer's mind. Either way, for RAID-1 booting, the boot sector really should be part of the protected area (and go through the MD stack.)

It depends on what you are calling the protected area. If by that you mean outside the filesystem itself, in a non-replicated area like where the superblock and internal bitmaps go, then yes, that would be ideal. If you mean in the filesystem proper, then that depends on the boot loader. The boot loader should deal with the offset problem by storing partition/filesystem-relative pointers, not absolute ones. Grub2 is on the way to this, but it isn't there yet.

-- Doug Ledford
Re: Time to deprecate old RAID formats?
Neil Brown wrote:
> On Friday October 26, [EMAIL PROTECTED] wrote:
> > Perhaps you could have called them 1.start, 1.end, and 1.4k in the beginning? Isn't hindsight wonderful?
>
> Those names seem good to me. I wonder if it is safe to generate them in -Eb output...

If you agree that they are better, using them in the obvious places would be better now than later. Are you going to put them in the metadata options as well? Let me know; looking at the documentation is on my list for next week, and I could include some text.

> Maybe the key confusion here is between version numbers and revision numbers. When you have multiple versions, there is no implicit assumption that one is better than another: "Here is my version of what happened, now let's hear yours." When you have multiple revisions, you do assume ongoing improvement. v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which itself is a revision of the v0...

Like kernel releases, people assume that the first number means *big* changes, the second incremental change.

-- bill davidsen
Re: Time to deprecate old RAID formats?
On Tue, 2007-10-30 at 07:55 +0100, Luca Berra wrote:
> Well, it might be a matter of personal preference, but I would prefer an initrd doing just the minimum necessary to mount the root filesystem (and/or activating resume from a swap device), and leaving all the rest to initscripts, rather than an initrd that tries to do everything.

The initrd does exactly that. The rescan for superblocks does not happen in the initrd or in mkinitrd; it must be done manually. The code in mkinitrd uses the mdadm.conf file as it stands, but in the initrd image it doesn't start all the arrays, just the arrays needed to get booted into your / partition.

-- Doug Ledford
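To illustrate the behaviour described above, a minimal mdadm.conf of the kind mkinitrd would copy into the initrd might look like this (a sketch; the device names and UUIDs here are invented for illustration):

    DEVICE partitions
    ARRAY /dev/md0 UUID=c3f8a2b1:0d4e9f6a:7b21c5d3:8e90ab12   # holds /
    ARRAY /dev/md1 UUID=5a6b7c8d:9e0f1a2b:3c4d5e6f:7a8b9c0d   # data only, not needed at boot

    # Inside the initrd, only the root array would be assembled:
    mdadm --assemble /dev/md0 --uuid=c3f8a2b1:0d4e9f6a:7b21c5d3:8e90ab12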
Re: Time to deprecate old RAID formats?
On Sun, Oct 28, 2007 at 01:47:55PM -0400, Doug Ledford wrote:
> On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:
> > On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
> > > It was only because I wasn't using mdadm in the initrd and specifying uuids that it found the right devices to start and ignored the whole disk devices. But, when I later made some more devices and went to update the mdadm.conf file using mdadm -Eb, it found the whole disk devices and added them to the mdadm.conf. If I hadn't checked it before remaking my initrd, it would have hosed the system.
> >
> > the above is not clear to me, afair redhat initrd still uses raidautorun,
>
> RHEL does, but this is on a personal machine I installed Fedora on, and the latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and starts the needed devices using the UUID. My first sentence above should have read that I *was* using mdadm.

ah, ok. i should look again at fedora's mkinitrd; the last one i checked was 6.0.9-1, and i see mdadm was added in 6.0.9-2.

> > which iirc does not work with recent superblocks. so you used uuids on the kernel command line? or do you use something else for initrd? why would remaking the initrd break it?
>
> Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and their UUID. Therein would have been the problem.

yes, i read the patch, and i don't like that code, as i don't like most of what has been put in mkinitrd from 5.0 onward. Imho the correct thing here would not have been copying the existing mdadm.conf but generating a safe one from the output of mdadm -D (note -D, not -E).

> > > And it would have passed all the tests you can throw at it. Quite simply, there is no way to tell the difference between those two situations with 100% certainty. Mdadm tries to be smart and start the newest devices, but Luca's original suggestion of "skip the partition scanning in the kernel and figure it out from user space" would not have shown mdadm the new devices and would have gotten it wrong every time.
> >
> > yes, in this particular case it would have. congratulations, you found a new creative way of shooting yourself in the foot.
>
> Creative, not so much. I just backed out of what I started and tried something else. Lots of people do that.
>
> > maybe mdadm should do checks when creating a device to prevent this kind of mistake. i.e. if creating an array on a partition, check the whole device for a superblock and refuse in case it finds one; if creating an array on a whole device that has a partition table, either require --force, or check for superblocks in every possible partition.
>
> What happens if you add the partition table *after* you make the whole disk device and there are stale superblocks in the partitions? This still isn't infallible.

It depends on what you do with that partitioned device *after* having created the partition table.
- If you try again to run mdadm on it (and the above is implemented), it would fail, and you will be given a chance to wipe the stale sb.
- If you don't, and use them as plain devices, _and_ leave the line in mdadm.conf, you will suffer a lot of pain.

Since the problem is known, and since fdisk/sfdisk/parted already do a lot of checks on the device, this could be another useful one.

L.
-- Luca Berra
Re: Time to deprecate old RAID formats?
On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
> > Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and their UUID. Therein would have been the problem.
>
> yes, i read the patch, i don't like that code, as i don't like most of what has been put in mkinitrd from 5.0 onward. Imho the correct thing here would not have been copying the existing mdadm.conf but generating a safe one from the output of mdadm -D (note -D, not -E)

I'm not sure I'd want that. Besides, what makes you say -D is safer than -E?

-- Doug Ledford
Re: Time to deprecate old RAID formats?
On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:
> On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
> > > Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and their UUID. Therein would have been the problem.
> >
> > yes, i read the patch, i don't like that code, as i don't like most of what has been put in mkinitrd from 5.0 onward.

in case you wonder, i am referring to things like

    emit "dm create $1 $UUID $(/sbin/dmsetup table $1)"

> > Imho the correct thing here would not have been copying the existing mdadm.conf but generating a safe one from the output of mdadm -D (note -D, not -E)
>
> I'm not sure I'd want that. Besides, what makes you say -D is safer than -E?

mdadm -D /dev/mdX works on an active md device, so i strongly doubt the information gathered from there would be stale, while mdadm -Es will scan disk devices for md superblocks, thus possibly even finding stale superblocks or leftovers. I would strongly recommend against blindly doing mdadm -Es > /etc/mdadm.conf and not supervising the result.

-- Luca Berra
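The distinction being drawn here, expressed as commands (array and device names are only examples):

    mdadm -D /dev/md0     # query a running array: state comes from the
                          # kernel and cannot be stale
    mdadm -E /dev/sda1    # read the superblock off a raw device: it may
                          # be a leftover from a long-gone array
    mdadm -Ds             # print ARRAY config lines for active arrays only
    mdadm -Es             # print ARRAY config lines from an on-disk scan;
                          # review the output before trusting it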
Re: Time to deprecate old RAID formats?
On Mon, 2007-10-29 at 22:44 +0100, Luca Berra wrote:
> in case you wonder, i am referring to things like
>
>     emit "dm create $1 $UUID $(/sbin/dmsetup table $1)"

I make no judgments on the dm setup stuff; I know too little about the dm stack to be qualified.

> mdadm -D /dev/mdX works on an active md device, so i strongly doubt the information gathered from there would be stale, while mdadm -Es will scan disk devices for md superblocks, thus possibly even finding stale superblocks or leftovers. I would strongly recommend against blindly doing mdadm -Es > /etc/mdadm.conf and not supervising the result.

Well, I agree that blindly doing mdadm -Esb > mdadm.conf would be bad, but that's not what mkinitrd is doing; it's using the mdadm.conf that's in place, so you can update the mdadm.conf whenever you find it appropriate. And I agree -D has less chance of finding a stale superblock, but it's also true that it has no chance of finding non-stale superblocks on devices that aren't even started. So, as a method of getting all the right information in the event of system failure and rescuecd boot, it leaves something to be desired ;-) In other words, I'd rather use a mode that finds everything and lets me remove the stale than a mode that might miss something. But that's a matter of personal choice. Considering that we only ever update mdadm.conf automatically during installs, and after that the user makes manual mdadm.conf changes themselves, they are free to use whichever they prefer.

The one thing I *do* like about mdadm -E over -D is that it includes the superblock format in its output. The one thing I don't like is that it almost universally gets the name wrong. What I really want is a brief query format that both gives me the right name (-D) and the superblock format (-E).

-- Doug Ledford
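One rough way to get both pieces today, assuming the array is running and uses v1 metadata (names illustrative; exact output fields vary by mdadm version):

    mdadm -D /dev/md0  | grep -E 'Version|Name'   # the right name, as the kernel sees it
    mdadm -E /dev/sda1 | grep -E 'Version|Name'   # the superblock format, but the name can be wrong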
Re: Time to deprecate old RAID formats?
On Friday October 26, [EMAIL PROTECTED] wrote:
> Perhaps you could have called them 1.start, 1.end, and 1.4k in the beginning? Isn't hindsight wonderful?

Those names seem good to me. I wonder if it is safe to generate them in -Eb output...

Maybe the key confusion here is between version numbers and revision numbers. When you have multiple versions, there is no implicit assumption that one is better than another: "Here is my version of what happened, now let's hear yours." When you have multiple revisions, you do assume ongoing improvement. v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which itself is a revision of the v0...

NeilBrown
Re: Time to deprecate old RAID formats?
On Mon, Oct 29, 2007 at 07:05:42PM -0400, Doug Ledford wrote:
> And I agree -D has less chance of finding a stale superblock, but it's also true that it has no chance of finding non-stale superblocks on devices that aren't even started.

Well, it might be a matter of personal preference, but i would prefer an initrd doing just the minimum necessary to mount the root filesystem (and/or activating resume from a swap device), and leaving all the rest to initscripts, rather than an initrd that tries to do everything.

> So, as a method of getting all the right information in the event of system failure and rescuecd boot, it leaves something to be desired ;-) In other words, I'd rather use a mode that finds everything and lets me remove the stale than a mode that might miss something. But, that's a matter of personal choice.

In case of a rescuecd boot, you will probably not have any md devices activated, and you will probably run mdadm -Es to check what md arrays are available; the data should still be on the disk, else you would be hosed anyway.

L.
-- Luca Berra
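The rescue-CD sequence sketched above, spelled out (inspect the scan output before committing it to a config file):

    mdadm -Es                      # scan all block devices for md superblocks
    mdadm -Es > /etc/mdadm.conf    # record the result -- after reviewing it
    mdadm --assemble --scan        # assemble the arrays the scan identified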
Re: Time to deprecate old RAID formats?
On Sat, Oct 27, 2007 at 04:09:03PM -0400, Doug Ledford wrote:
> On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:
> > On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> > > > just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device.
> > >
> > > In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And its matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed.
> >
> > then just ignore the device and log a warning, instead of doing a random choice.
>
> It also happened to be my OS drive pair. Ignoring it would have rendered the machine unusable.

I wonder what would have happened if it got it wrong...

-- Luca Berra
Re: Time to deprecate old RAID formats?
On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
> On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
> > On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> > > In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And its matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed.
> >
> > Maybe we need a 2.0 superblock that contains the physical size of every component, not just the logical size that is used for RAID. That way, if the size read from the superblock does not match the size of the device, you know that this device should be ignored.
>
> In my case that wouldn't have helped. What actually happened was I created a two disk raid1 device using whole devices and a version 1.0 superblock. I knew a version 1.1 wouldn't work because it would be where the boot sector needed to be, and wasn't sure if a 1.2 would work either. Then I tried to make the whole disk raid device a partitioned device. This obviously put a partition table right where the BIOS and the kernel would look for it whether the raid was up or not.

the only reason i can think of for the above setup not working is udev mucking with your device too early.

> I also tried doing an lvm setup to split the raid up into chunks, and that didn't work either. So, then I redid the partition table and created individual raid devices from the partitions. But I didn't think to zero the old whole disk superblock. When I made the individual raid devices, I used all 1.1 superblocks. So, when it was all said and done, I had a bunch of partitions that looked like a valid set of partitions for the whole disk raid device and a whole disk raid superblock, but I also had superblocks in each partition with their own bitmaps and so on.

OK

> It was only because I wasn't using mdadm in the initrd and specifying uuids that it found the right devices to start and ignored the whole disk devices. But, when I later made some more devices and went to update the mdadm.conf file using mdadm -Eb, it found the whole disk devices and added them to the mdadm.conf. If I hadn't checked it before remaking my initrd, it would have hosed the system.

the above is not clear to me. afair redhat initrd still uses raidautorun, which iirc does not work with recent superblocks. so you used uuids on the kernel command line? or do you use something else for initrd? why would remaking the initrd break it?

> And it would have passed all the tests you can throw at it. Quite simply, there is no way to tell the difference between those two situations with 100% certainty. Mdadm tries to be smart and start the newest devices, but Luca's original suggestion of "skip the partition scanning in the kernel and figure it out from user space" would not have shown mdadm the new devices and would have gotten it wrong every time.

yes, in this particular case it would have. congratulations, you found a new creative way of shooting yourself in the foot.

maybe mdadm should do checks when creating a device to prevent this kind of mistake. i.e. if creating an array on a partition, check the whole device for a superblock and refuse in case it finds one; if creating an array on a whole device that has a partition table, either require --force, or check for superblocks in every possible partition.

L.
-- Luca Berra
Re: Time to deprecate old RAID formats?
On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:
> On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
> > On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
> > > Maybe we need a 2.0 superblock that contains the physical size of every component, not just the logical size that is used for RAID. [...]
> >
> > In my case that wouldn't have helped. What actually happened was I created a two disk raid1 device using whole devices and a version 1.0 superblock. I knew a version 1.1 wouldn't work because it would be where the boot sector needed to be, and wasn't sure if a 1.2 would work either. Then I tried to make the whole disk raid device a partitioned device. This obviously put a partition table right where the BIOS and the kernel would look for it whether the raid was up or not.
>
> the only reason i can think of for the above setup not working is udev mucking with your device too early.

It was a combination of boot loader issues and an inability to get this device partitioned up the way I needed. I went with a totally different setup in the end, because I essentially started out with a two drive raid1 for the OS and another 2 drive raid1 for data, but I wanted to span them, and I was attempting to do so with a mixture of md raid and lvm physical volume striping. Didn't work.

> > I also tried doing an lvm setup to split the raid up into chunks, and that didn't work either. So, then I redid the partition table and created individual raid devices from the partitions. But I didn't think to zero the old whole disk superblock. When I made the individual raid devices, I used all 1.1 superblocks. So, when it was all said and done, I had a bunch of partitions that looked like a valid set of partitions for the whole disk raid device and a whole disk raid superblock, but I also had superblocks in each partition with their own bitmaps and so on.
>
> OK
>
> > It was only because I wasn't using mdadm in the initrd and specifying uuids that it found the right devices to start and ignored the whole disk devices. But, when I later made some more devices and went to update the mdadm.conf file using mdadm -Eb, it found the whole disk devices and added them to the mdadm.conf. If I hadn't checked it before remaking my initrd, it would have hosed the system.
>
> the above is not clear to me. afair redhat initrd still uses raidautorun,

RHEL does, but this is on a personal machine I installed Fedora on, and the latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and starts the needed devices using the UUID. My first sentence above should have read that I *was* using mdadm.

> which iirc does not work with recent superblocks. so you used uuids on the kernel command line? or do you use something else for initrd? why would remaking the initrd break it?

Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and their UUID. Therein would have been the problem.

> > And it would have passed all the tests you can throw at it. Quite simply, there is no way to tell the difference between those two situations with 100% certainty. Mdadm tries to be smart and start the newest devices, but Luca's original suggestion of "skip the partition scanning in the kernel and figure it out from user space" would not have shown mdadm the new devices and would have gotten it wrong every time.
>
> yes, in this particular case it would have. congratulations, you found a new creative way of shooting yourself in the foot.

Creative, not so much. I just backed out of what I started and tried something else. Lots of people do that.

> maybe mdadm should do checks when creating a device to prevent this kind of mistake. i.e. if creating an array on a partition, check the whole device for a superblock and refuse in case it finds one; if creating an array on a whole device that has a partition table, either require --force, or check for superblocks in every possible partition.

What happens if you add the partition table *after* you make the whole disk device and there are stale superblocks in the partitions? This still isn't infallible.

-- Doug Ledford
Re: Time to deprecate old RAID formats?
Doug Ledford wrote:
> On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
> > Actually, after doing some research, here's what I've found:
>
> I should note that both the lvm code and raid code are simplistic at the moment. For example, the raid5 mapping only supports the default raid5 layout. If you use any other layout, game over. Getting it to work with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but getting it to the point where it handles all the relevant setups properly would require a reasonable amount of coding.

My first thought is that after the /boot partition is read (assuming you use one) the restrictions go away. Performance of /boot is not much of an issue, for me at least, but more complex setups are sometimes needed for the rest of the system.

Thanks for the research.

-- bill davidsen
Re: Time to deprecate old RAID formats?
On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
> > just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device.
>
> In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And its matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed.

then just ignore the device and log a warning, instead of doing a random choice.

L.
-- Luca Berra
Re: Time to deprecate old RAID formats?
On Fri, Oct 26, 2007 at 07:06:46PM +0200, Gabor Gombas wrote:
> On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:
> > You got the ordering wrong. You should get userspace support ready and accepted _first_, and then you can start the flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code configurable.

sorry, i did not intend to start a flamewar.

> Oh wait, that is possible even today. So you can build your own kernel without any partition table format support - problem solved.

yes, i can build my own; i just thought it could be useful for someone besides myself. maybe even Doug's enterprise customers.

L.
-- Luca Berra
Re: Time to deprecate old RAID formats?
On Sat, Oct 27, 2007 at 12:20:12AM +0200, Gabor Gombas wrote:
> On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:
> > * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This cannot be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise.
>
> Huh? I have several machines that boot with LILO and the root is on RAID1. All install LILO to the boot sector of the mdX device (having boot=/dev/mdX in lilo.conf), while the MBR is installed by install-mbr. Since install-mbr has its own prompt that is displayed before LILO's prompt on boot, I can be pretty sure that LILO did not write anything to the MBR...

the behaviour is documented in the lilo man page, under the raid-extra-boot option.

-- Luca Berra
Re: Time to deprecate old RAID formats?
Doug Ledford wrote:
> On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
> [snip]
>
> Actually, after doing some research, here's what I've found:
>
> * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This cannot be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise.

I'm reasonably sure that's wrong. I used to set up dual boot machines by putting LILO in the partition and making that the boot partition; by changing the active partition flag I could just have the machine boot Windows, to keep people from getting confused.

> * When using grub to boot from a raid device, only 0.90 and 1.0 superblocks are supported[1] (because grub is ignorant of the raid and it requires the fs to start at the start of the partition). You can use either MBR or partition based installs of grub. However, partition based installs require that all bootable partitions be at exactly the same logical block address across all devices. This limitation can be an extremely hazardous one in the event a drive dies and you have to replace it with a new drive, as newer drives may not share the older drive's geometry and will require starting your boot partition in an odd location to make the logical block addresses match.
>
> * When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today. If you tell grub2 to install to an md device, it searches out all constituent devices and installs to the MBR on each device[2]. This can't be changed (at least right now, probably not ever though).

That sounds like a good reason to avoid grub2, frankly. Software which decides that it knows what to do better than the user isn't my preference. If I wanted software which forces me to do things their way I'd be running Windows.

> So, given the above situations, really, superblock format 1.2 is likely to never be needed. None of the shipping boot loaders work with 1.2 regardless, and the boot loader under development won't install to the partition in the event of an md device and therefore doesn't need that 4k buffer that 1.2 provides.

Sounds right, although it may have other uses for clever people.

> [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment. A person could probably hack it to work, but since grub development has stopped in preference to the still under development grub2, they won't take the patches upstream unless they are bug fixes, not new features.

If the patches were available, "doesn't work with existing raid formats" would probably qualify as a bug.

> [2] There are two ways to install to a master boot record. The first is to use the first 512 bytes *only* and hardcode the location of the remainder of the boot loader into those 512 bytes. The second way is to use the free space between the MBR and the start of the first partition to embed the remainder of the boot loader. When you point grub2 at an md device, they automatically only use the second method of boot loader installation. This gives them the freedom to be able to modify the second stage boot loader on a boot-disk by boot-disk basis. The downside to this is that they need lots of room after the MBR and before the first partition in order to put their core.img file in place.
>
> I *think*, and I'll know for sure later today, that the core.img file is generated during grub install from the list of optional modules you specify during setup. E.g., the pc module gives partition table support, the lvm module lvm support, etc. You list the modules you need, and grub then builds a core.img out of all those modules. The normal amount of space between the MBR and the first partition is (sectors_per_track - 1). For standard disk geometries, that basically leaves 254 sectors, or 127k of space. This might not be enough for your particular needs if you have a complex boot environment. In that case, you would need to bump at least the starting track of your first partition to make room for your boot loader. Unfortunately, how is a person to know how much room their setup needs until after they've installed and it's too late to bump the partition table start? They can't. So, that's another thing I think I will check out today: what the maximum size of grub2 might be with all modules included, and what a common size might be.

Based on your description, it sounds as if grub2 may not have given adequate thought to what users other than the authors might need (that may be a premature conclusion). I have multiple installs on several of my machines, and I assume that the grub2 for 32 and 64 bit will be different.

Thanks for the research.

-- bill davidsen
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:
> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> > On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
> > > just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device.
> >
> > In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And its matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed.
>
> then just ignore the device and log a warning, instead of doing a random choice.

It also happened to be my OS drive pair. Ignoring it would have rendered the machine unusable.

-- Doug Ledford
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
> Actually, after doing some research, here's what I've found:
>
> * When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today.

It does not include support for any version 1 superblocks. It's noted in the code that it should, but doesn't yet.

However, the interesting bit is that they rearchitected grub so that any reads from a device during boot are filtered through the stack that provides the device. So, when you tell grub2 to set root=md0, all reads from md0 are filtered through the raid module, and the raid module then calls the reads from the IO module, which then does the actual int 13 call. This allows the raid module to read superblocks, detect the raid level and layout, and actually attempt to work on raid0/1/5 devices (at the moment). It also means that all the calls from the ext2 module, when it attempts to read from the md device, are filtered through the md module, and therefore it would be simple for it to implement an offset into the real device to get past the version 1.1/1.2 superblocks.

In terms of resilience, the raid module actually tries to utilize the raid itself during any failure. On raid1 devices, if it gets a read failure on any block it attempts to read, then it goes to the next device in the raid1 array and attempts to read from it. So, in the event that your normal boot disk suffers a sector failure in your actual kernel image, but the raid disk is otherwise fine, grub2 should be able to boot from the kernel image on the next raid device. Similarly, on raid5 it will attempt to recover from a block read failure by using the parity to generate the missing data, unless the array is already in degraded mode, at which point it will bail on any read failure.

The lvm module attempts to properly map extents to physical volumes and allows you to have your bootable files in an lvm logical volume. In that case you set root=<logical-volume-name-as-it-appears-in-/dev/mapper> and the lvm module then figures out what physical volumes contain that logical volume and where the extents are mapped, and goes from there.

I should note that both the lvm code and raid code are simplistic at the moment. For example, the raid5 mapping only supports the default raid5 layout. If you use any other layout, game over. Getting it to work with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but getting it to the point where it handles all the relevant setups properly would require a reasonable amount of coding.

-- Doug Ledford
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-27 at 11:20 -0400, Bill Davidsen wrote:
> > * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This cannot be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise.
>
> I'm reasonably sure that's wrong. I used to set up dual boot machines by putting LILO in the partition and making that the boot partition; by changing the active partition flag I could just have the machine boot Windows, to keep people from getting confused.

Yeah, someone else pointed this out too. The original patch to lilo *did* do as I suggest, so they must have improved on the patch later.

> > * When using grub to boot from a raid device, only 0.90 and 1.0 superblocks are supported[1] [...]
> >
> > * When using grub2, there is supposedly already support for raid/lvm devices. [...] If you tell grub2 to install to an md device, it searches out all constituent devices and installs to the MBR on each device[2]. This can't be changed (at least right now, probably not ever though).
>
> That sounds like a good reason to avoid grub2, frankly. Software which decides that it knows what to do better than the user isn't my preference. If I wanted software which forces me to do things their way I'd be running Windows.

It's not really all that unreasonable of a restriction. Most people aren't aware that when you put a boot sector at the beginning of a partition, you only have 512 bytes of space, so the boot loader you put there is basically nothing more than code to read the remainder of the boot loader from the filesystem space. Now, traditionally, most boot loaders have had to hard code the block addresses of certain key components into these second stage boot loaders. If a user isn't aware that the boot loader does this at install time (or at kernel selection update time, in the case of lilo), then they aren't aware that the files must reside at exactly the same logical block address on all devices. Without that knowledge, they can easily create an unbootable setup by having the various boot partitions in slightly different locations on the disks. And intelligent partition editors like parted can compound the problem: as they insulate the user from having to pick which partition number is used for what partition, etc., they can end up placing the various boot partitions in different areas of different drives. The requirement above is a means of making sure that users aren't surprised by a non-working setup. The whole element of least surprise thing.

Of course, if they keep that requirement, then I would expect it to be well documented so that people know this going into putting the boot loader in place, but I would argue that this is at least better than finding out when a drive dies that your system isn't bootable.

> > So, given the above situations, really, superblock format 1.2 is likely to never be needed. None of the shipping boot loaders work with 1.2 regardless, and the boot loader under development won't install to the partition in the event of an md device and therefore doesn't need that 4k buffer that 1.2 provides.
>
> Sounds right, although it may have other uses for clever people.
>
> > [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment. A person could probably hack it to work, but since grub development has stopped in preference to the still under development grub2, they won't take the patches upstream unless they are bug fixes, not new features.
>
> If the patches were available, "doesn't work with existing raid formats" would probably qualify as a bug.

Possibly. I'm a bit overbooked on other work at the moment, but I may try to squeeze in some work on grub/grub2 to support version 1.1 or 1.2 superblocks.

> > [2] There are two ways to install to a master boot record. The first is to use the first 512 bytes *only* and hardcode the location of the remainder of the boot loader into those 512 bytes. The second way is to use the free space between the MBR and the start of the first partition to embed the remainder of the boot loader. [...]

-- Doug Ledford
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
> On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
> > In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And its matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed.
>
> Maybe we need a 2.0 superblock that contains the physical size of every component, not just the logical size that is used for RAID. That way, if the size read from the superblock does not match the size of the device, you know that this device should be ignored.

In my case that wouldn't have helped. What actually happened was I created a two disk raid1 device using whole devices and a version 1.0 superblock. I knew a version 1.1 wouldn't work because it would be where the boot sector needed to be, and wasn't sure if a 1.2 would work either. Then I tried to make the whole disk raid device a partitioned device. This obviously put a partition table right where the BIOS and the kernel would look for it whether the raid was up or not. I also tried doing an lvm setup to split the raid up into chunks, and that didn't work either. So, then I redid the partition table and created individual raid devices from the partitions. But I didn't think to zero the old whole disk superblock. When I made the individual raid devices, I used all 1.1 superblocks. So, when it was all said and done, I had a bunch of partitions that looked like a valid set of partitions for the whole disk raid device and a whole disk raid superblock, but I also had superblocks in each partition with their own bitmaps and so on.

It was only because I wasn't using mdadm in the initrd and specifying uuids that it found the right devices to start and ignored the whole disk devices. But, when I later made some more devices and went to update the mdadm.conf file using mdadm -Eb, it found the whole disk devices and added them to the mdadm.conf. If I hadn't checked it before remaking my initrd, it would have hosed the system. And it would have passed all the tests you can throw at it. Quite simply, there is no way to tell the difference between those two situations with 100% certainty. Mdadm tries to be smart and start the newest devices, but Luca's original suggestion of "skip the partition scanning in the kernel and figure it out from user space" would not have shown mdadm the new devices and would have gotten it wrong every time.

-- Doug Ledford
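The missed step in the story above, wiping the old whole-disk superblock before reusing the disks, is a single command (device name illustrative; run it only on a device that is not part of an active array):

    mdadm --zero-superblock /dev/sda   # erase any md superblock mdadm finds on the device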
Re: Time to deprecate old RAID formats?
On Thursday October 25, [EMAIL PROTECTED] wrote:
> I didn't get a reply to my suggestion of separating the data and location...

No. Sorry.

> ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? This would certainly make things a lot clearer to new (and old!) users:
>
>     mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
> or
>     mdadm --create /dev/md0 --metadata 1.0 --meta-location start
> or
>     mdadm --create /dev/md0 --metadata 1.0 --meta-location end

I'm happy to support synonyms. How about

    --metadata 1-end
    --metadata 1-start

??

> resulting in:
>     mdadm --detail /dev/md0
>     /dev/md0:
>           Version : 01.0
>     Metadata-locn : End-of-device

It already lists the superblock location as a sector offset, but I don't have a problem with reporting:

    Version : 1.0 (metadata at end of device)
    Version : 1.1 (metadata at start of device)

Would that help?

>      Creation Time : Fri Aug 4 23:05:02 2006
>         Raid Level : raid0
>
> You provide rational defaults for mortals, and this approach allows people like Doug to do wacky HA things explicitly. I'm not sure you need any changes to the kernel code - probably just the docs and mdadm.

True. It is conceivable that I could change the default, though that would require a decision as to what the new default would be. I think it would have to be 1.0 or it would cause too much confusion.

> A newer default would be nice. I also suspect that a *lot* of people will assume that the highest superblock version is the best and should be used for new installs etc.

Grumble... why can't people expect what I want them to expect? So if you make 1.0 the default, then how many users will try 'the bleeding edge' and use 1.2? So then you have 1.3, which is the same as 1.0? Hmmm? So to quote from an old Soap: "Confused? You will be..." :-)

NeilBrown
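For reference, the three placements being debated map onto mdadm invocations like these (a sketch; -e/--metadata is mdadm's existing option, and the devices shown are only examples):

    mdadm --create /dev/md0 -l 1 -n 2 -e 1.0 /dev/sda1 /dev/sdb1   # superblock at end of device
    mdadm --create /dev/md0 -l 1 -n 2 -e 1.1 /dev/sda1 /dev/sdb1   # superblock at start of device
    mdadm --create /dev/md0 -l 1 -n 2 -e 1.2 /dev/sda1 /dev/sdb1   # superblock 4K from the start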
Re: Time to deprecate old RAID formats?
On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
> On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote:
> > Honestly, I don't see how a properly configured system would start looking at the physical device by mistake. I suppose it's possible, but I didn't have this issue.
>
> Mount by label support scans all devices in /proc/partitions looking for the filesystem superblock that has the label you are trying to mount.

it could probably be smarter, but in any case there is no point in mounting an md device by label.

> LVM (unless told not to) scans all devices in /proc/partitions looking for valid LVM superblocks.

yes, but lvm, unless told to, will ignore devices having a valid md superblock.

> In fact, you can't build a linux system that is resilient to device name changes without doing that.

i dislike labels, especially for devices that contain the os. we should take great care that these are identified correctly, and mount-by-label does not (usb drives that migrate from one system to another are so common that you can't ignore them). you forgot udev ;)

but the fix is easy: remove the partition detection code from the kernel and start working on a smart userspace replacement for device detection. we already have vol_id from udev and blkid from ext3 which support detection of many device formats. just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device.

> And you can with the superblock at the front. You can create a new single disk raid1 over the existing superblock, or you can munge the partition table to have it point at the start of your data. There are options,

Please don't do that; use device-mapper to set the device up, without mucking with partition tables.

L.
-- Luca Berra
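The userspace probes mentioned above, as they were typically invoked at the time (the vol_id path varies by distribution, and vol_id was later absorbed into blkid; treat these as examples):

    blkid /dev/sda1              # ext3's probe: prints TYPE=, UUID=, LABEL= for recognized formats
    /lib/udev/vol_id /dev/sda1   # udev's probe: prints ID_FS_TYPE=, ID_FS_UUID=, and friends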
Re: Time to deprecate old RAID formats?
Neil Brown wrote:
> On Thursday October 25, [EMAIL PROTECTED] wrote:
> > I didn't get a reply to my suggestion of separating the data and location... ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? [...]
>
> I'm happy to support synonyms. How about
>
>     --metadata 1-end
>     --metadata 1-start
>
> ??

Offset? Do you like 1-offset4k, or maybe 1-start4k, or even 1-start+4k for that? The last is most intuitive, but I don't know how you feel about the + in there.

> It already lists the superblock location as a sector offset, but I don't have a problem with reporting:
>
>     Version : 1.0 (metadata at end of device)
>     Version : 1.1 (metadata at start of device)
>
> Would that help?

Same comments on the reporting: "metadata at block 4k" or something.

> True. It is conceivable that I could change the default, though that would require a decision as to what the new default would be. I think it would have to be 1.0 or it would cause too much confusion.
>
> > A newer default would be nice. I also suspect that a *lot* of people will assume that the highest superblock version is the best and should be used for new installs etc.
>
> Grumble... why can't people expect what I want them to expect?

I confess that I thought 1.x was a series of solutions reflecting your evolving opinion on what was best, so maybe in retrospect you made a non-intuitive choice of nomenclature. Or, bluntly, you picked confusing names for this and confused people. If 1.0 meant start, 1.1 meant 4k, and 1.2 meant end, at least it would be easy to remember for people who only create a new array a few times a year, or once in the lifetime of a new computer.

> So if you make 1.0 the default, then how many users will try 'the bleeding edge' and use 1.2? So then you have 1.3, which is the same as 1.0? Hmmm? So to quote from an old Soap: "Confused? You will be..."

Perhaps you could have called them 1.start, 1.end, and 1.4k in the beginning? Isn't hindsight wonderful?

-- bill davidsen
Re: Time to deprecate old RAID formats?
On Fri, Oct 26, 2007 at 11:54:18AM +0200, Luca Berra wrote:
> but the fix is easy: remove the partition detection code from the kernel and start working on a smart userspace replacement for device detection. we already have vol_id from udev and blkid from ext3 which support detection of many device formats.

You got the ordering wrong. You should get userspace support ready and accepted _first_, and then you can start the flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code configurable.

But even if you have the perfect userspace solution ready today, removing partitioning support from the kernel is a pretty invasive ABI change, so it will take many years if it ever happens at all. I saw the "let's move partition detection to user space" argument several times on l-k in the past years, but it never gained support... So if you want to make it happen, stop talking and start coding, and persuade all major distros to accept your changes. _Then_ you can start arguing to remove partition detection from the kernel, and even then it won't be easy.

Gabor
-- MTA SZTAKI Computer and Automation Research Institute, Hungarian Academy of Sciences
Re: Time to deprecate old RAID formats?
On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:
> You got the ordering wrong. You should get userspace support ready and accepted _first_, and then you can start the flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code configurable.

Oh wait, that is possible even today. So you can build your own kernel without any partition table format support - problem solved.

Gabor
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
> Neil Brown wrote:
> > On Thursday October 25, [EMAIL PROTECTED] wrote:
> > > I didn't get a reply to my suggestion of separating the data and location... ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? [...]
> > I'm happy to support synonyms. How about --metadata 1-end --metadata 1-start ??
>
> Offset? Do you like 1-offset4k, or maybe 1-start4k, or even 1-start+4k for that? The last is most intuitive, but I don't know how you feel about the + in there.

Actually, after doing some research, here's what I've found:

* When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This cannot be changed. Only 0.90 and 1.0 superblock types are supported, because lilo doesn't understand the offset to the beginning of the fs otherwise.

* When using grub to boot from a raid device, only 0.90 and 1.0 superblocks are supported[1] (because grub is ignorant of the raid and it requires the fs to start at the start of the partition). You can use either MBR or partition based installs of grub. However, partition based installs require that all bootable partitions be at exactly the same logical block address across all devices. This limitation can be an extremely hazardous one in the event a drive dies and you have to replace it with a new drive, as newer drives may not share the older drive's geometry and will require starting your boot partition in an odd location to make the logical block addresses match.

* When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today. If you tell grub2 to install to an md device, it searches out all constituent devices and installs to the MBR on each device[2]. This can't be changed (at least right now, probably not ever though).

So, given the above situations, really, superblock format 1.2 is likely to never be needed. None of the shipping boot loaders work with 1.2 regardless, and the boot loader under development won't install to the partition in the event of an md device and therefore doesn't need that 4k buffer that 1.2 provides.

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment. A person could probably hack it to work, but since grub development has stopped in preference to the still under development grub2, they won't take the patches upstream unless they are bug fixes, not new features.

[2] There are two ways to install to a master boot record. The first is to use the first 512 bytes *only* and hardcode the location of the remainder of the boot loader into those 512 bytes. The second way is to use the free space between the MBR and the start of the first partition to embed the remainder of the boot loader. When you point grub2 at an md device, they automatically only use the second method of boot loader installation. This gives them the freedom to be able to modify the second stage boot loader on a boot-disk by boot-disk basis. The downside to this is that they need lots of room after the MBR and before the first partition in order to put their core.img file in place.
I *think*, and I'll know for sure later today, that the core.img file is generated during grub install from the list of optional modules you specify during setup. E.g., the pc module gives partition table support, the lvm module lvm support, etc. You list the modules you need, and grub then builds a core.img out of all those modules. The normal amount of space between the MBR and the first partition is (sectors_per_track - 1) sectors. For standard disk geometries (63 sectors per track), that leaves 62 sectors, or 31k of space. This might not be enough for your particular needs if you have a complex boot environment. In that case, you would need to bump at least the starting track of your first partition to make room for your boot loader. Unfortunately, how is a person to know how much room their setup needs until after they've installed, when it's too late to bump the partition table start? They can't. So that's another thing I think I will check out today: what the maximum size of grub2 might be with all modules included, and what a common size might be. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
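To make the space question concrete: the embedding area runs from sector 1 up to the start of the first partition, so its size can be read straight off the partition table. A minimal sketch, assuming a disk at /dev/sda and the fdisk of this era:

  # List partitions with start/end expressed in 512-byte sectors
  fdisk -lu /dev/sda

  # If the first partition starts at sector 63, the embedding area is
  # sectors 1-62: 62 * 512 = 31744 bytes (~31k) for grub2's core.img

Whether ~31k is enough depends, as noted above, on how many modules get built into core.img.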
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote: On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote: just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device. In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And its matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
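A hedged sketch of the ambiguity being described; the device name is illustrative, and the point is that both stackings leave identical evidence on disk:

  # Evidence 1: a valid partition table in sector 0 of the disk
  fdisk -l /dev/sdb

  # Evidence 2: an md superblock found at the end of the device
  mdadm --examine /dev/sdb

  # Layout A: a plain partitioned disk whose last partition holds (or
  #           once held) a 0.90/1.0 raid member ending near the end of
  #           the disk.
  # Layout B: a partitioned md raid1 device: the array's superblock is
  #           at the end, and the partition table lives inside the array.
  # No rule of the form "read both and decide" can tell A from B.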
Re: Time to deprecate old RAID formats?
On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote: * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This can not be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise. Huh? I have several machines that boot with LILO and the root is on RAID1. All install LILO to the boot sector of the mdX device (having boot=/dev/mdX in lilo.conf), while the MBR is installed by install-mbr. Since install-mbr has its own prompt that is displayed before LILO's prompt on boot, I can be pretty sure that LILO did not write anything to the MBR... What you say is only true for skewed RAID setups, but I always considered such a setup too risky for anything critical (not because of LILO, but because of the increased administrative complexity). Gabor -- - MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences - - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
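For reference, a minimal sketch of the split setup Gabor describes, with illustrative device names and paths; boot=/dev/mdX makes LILO write to the md device's boot sector, while the MBR is handled separately by the mbr package's installer:

  # /etc/lilo.conf
  boot=/dev/md0
  root=/dev/md0
  image=/vmlinuz
      label=linux
      read-only

  # MBR written independently of LILO
  install-mbr /dev/sda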
Re: Time to deprecate old RAID formats?
On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote: In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And it's matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed. Maybe we need a 2.0 superblock that contains the physical size of every component, not just the logical size that is used for RAID. That way if the size read from the superblock does not match the size of the device, you know that this device should be ignored. Gabor -- - MTA SZTAKI Computer and Automation Research Institute Hungarian Academy of Sciences - - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
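Something close to the proposed check can be approximated with today's tools; a hedged sketch, assuming the suspect device is /dev/sdb:

  # Physical size of the device, in 512-byte sectors
  blockdev --getsize /dev/sdb

  # Size recorded in whatever superblock --examine finds there
  mdadm --examine /dev/sdb | grep -i size

  # Caveat: current superblocks record the size *used* for the array,
  # which may legitimately be smaller than the device, so a mismatch is
  # only a hint - hence the proposal to record the physical size too.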
Re: Time to deprecate old RAID formats?
On Thu, 2007-10-25 at 09:55 +1000, Neil Brown wrote: As for where the metadata should be placed, it is interesting to observe that the SNIA's DDFv1.2 puts it at the end of the device. And as DDF is an industry standard sponsored by multiple companies it must be .. Sorry. I had intended to say correct, but when it came to it, my fingers refused to type that word in that context. DDF is in a somewhat different situation though. It assumes that the components are whole devices, and that the controller has exclusive access - there is no way another controller could interpret the devices differently before the DDF controller has a chance. Putting a superblock at the end of a device works around OS compatibility issues and other things related to transitioning the device from being part of an array to not being one. But it works if and only if you have the guarantee you mention. Long, long ago I tinkered with the idea of md multipath devices using an end-of-device superblock on the whole device to allow reliable multipath detection and autostart, failover of all partitions on a device when a command to any partition failed, the ability to use standard partition tables, etc., while being 100% transparent to the rest of the OS. The second you considered FC-connected devices and multi-OS access, that fell apart in a big way. Very analogous. So, I wouldn't necessarily call it wrong, but it's fragile. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
Jeff Garzik wrote: Neil Brown wrote: As for where the metadata should be placed, it is interesting to observe that the SNIA's DDFv1.2 puts it at the end of the device. And as DDF is an industry standard sponsored by multiple companies it must be .. Sorry. I had intended to say correct, but when it came to it, my fingers refused to type that word in that context. For the record, I have no intention of deprecating any of the metadata formats, not even 0.90. strongly agreed I didn't get a reply to my suggestion of separating the data and location... ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? This would certainly make things a lot clearer to new (and old!) users:

  mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
  or mdadm --create /dev/md0 --metadata 1.0 --meta-location start
  or mdadm --create /dev/md0 --metadata 1.0 --meta-location end

resulting in:

  mdadm --detail /dev/md0
  /dev/md0:
          Version : 01.0
    Metadata-locn : End-of-device
    Creation Time : Fri Aug 4 23:05:02 2006
       Raid Level : raid0

You provide rational defaults for mortals, and this approach allows people like Doug to do wacky HA things explicitly. I'm not sure you need any changes to the kernel code - probably just the docs and mdadm. It is conceivable that I could change the default, though that would require a decision as to what the new default would be. I think it would have to be 1.0 or it would cause too much confusion. A newer default would be nice. I also suspect that a *lot* of people will assume that the highest superblock version is the best and should be used for new installs etc. So if you make 1.0 the default, then how many users will try 'the bleeding edge' and use 1.2? So then you have a 1.3 which is the same as 1.0? Hmm? So to quote from an old Soap: Confused, you will be... David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
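A hedged aside on syntax: --meta-location is a proposal in this thread, not an existing flag; with the mdadm of this era the location is chosen by the metadata version itself. A sketch, with illustrative device names:

  # superblock at the end of the device
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 /dev/sda1 /dev/sdb1

  # superblock at the start of the device
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.1 /dev/sda1 /dev/sdb1

  # superblock 4k from the start, leaving room for a boot sector
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.2 /dev/sda1 /dev/sdb1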
Re: Time to deprecate old RAID formats?
Neil Brown wrote: I certainly accept that the documentation is probably less than perfect (by a large margin). I am more than happy to accept patches or concrete suggestions on how to improve that. I always think it is best if a non-developer writes documentation (and a developer reviews it) as then it is more likely to address the issues that a non-developer will want to read about, and in a way that will make sense to a non-developer. (i.e. I'm too close to the subject to write good doco). Patches against what's in mdadm 2.6.4, I assume? I can't promise to write anything which pleases even me, but I will take a look at it. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Bill Davidsen wrote: Neil Brown wrote: I certainly accept that the documentation is probably less than perfect (by a large margin). I am more than happy to accept patches or concrete suggestions on how to improve that. I always think it is best if a non-developer writes documentation (and a developer reviews it) as then it is more likely to address the issues that a non-developer will want to read about, and in a way that will make sense to a non-developer. (i.e. I'm too close to the subject to write good doco). Patches against what's in mdadm 2.6.4, I assume? I can't promise to write anything which pleases even me, but I will take a look at it. The man page is a great place for describing, e.g., the superblock location; but don't forget we have http://linux-raid.osdl.org/index.php/Main_Page which is probably a better place for *discussions* (or essays) about the superblock location (e.g. the LVM / v1.1 comment Janek picked up on). In fact I was going to take some of the writings from this thread and put them up there. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Wed, 2007-10-24 at 16:22 -0400, Bill Davidsen wrote: Doug Ledford wrote: On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. It might be possible if you use static device names, but our users *long* ago complained very loudly when adding a new disk or removing a bad disk caused their setup to fail to boot. So, along came mount by label and auto scans for superblocks. Once you do that, you *really* need all the superblocks at the same end of a device so when you stack things, it always works properly. Let me be devil's advocate: I noted in another post that location might be raid level dependent. For raid-1 putting the superblock at the end allows the BIOS to treat a single partition as a bootable unit. This is true for both the 1.0 and 1.2 superblock formats. The BIOS couldn't care less if there is an offset to the filesystem because it doesn't try to read from the filesystem. It just jumps to the first 512 byte sector and that's it. Grub/Lilo are the ones that have to know about the offset, and they would be made aware of the offset at install time. So, we are back to the exact same thing I was talking about. With the superblock at the beginning of the device, you don't hinder bootability with or without the raid working; the raid would be bootable regardless, as long as you made it bootable. It only hinders accessing the filesystem via a running linux installation without bringing up the raid. For all other arrangements the end location puts the superblock where it is slightly more likely to be overwritten, and where it must be moved if the partition grows or whatever. There really may be no right answer. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Thursday October 25, [EMAIL PROTECTED] wrote: Neil Brown wrote: I certainly accept that the documentation is probably less than perfect (by a large margin). I am more than happy to accept patches or concrete suggestions on how to improve that. I always think it is best if a non-developer writes documentation (and a developer reviews it) as then it is more likely to address the issues that a non-developer will want to read about, and in a way that will make sense to a non-developer. (i.e. I'm too close to the subject to write good doco). Patches against what's in mdadm 2.6.4, I assume? I can't promise to write anything which pleases even me, but I will take a look at it. Any text at all would be welcome, but yes; patches against mdadm 2.6.4 would be easiest. Thanks NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. I wonder if we should not really be talking about superblock versions 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end, start, offset4k)? This would certainly make things a lot clearer to new users:

  mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k

  mdadm --detail /dev/md0
  /dev/md0:
          Version : 01.0
    Metadata-locn : End-of-device
    Creation Time : Fri Aug 4 23:05:02 2006
       Raid Level : raid0

And there you have the deprecation... only two superblock versions and no real changes to code etc. David - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Bill == Bill Davidsen [EMAIL PROTECTED] writes: Bill John Stoffel wrote: Why do we have three different positions for storing the superblock? Bill Why do you suggest changing anything until you get the answer to Bill this question? If you don't understand why there are three Bill locations, perhaps that would be a good initial investigation. Because I've asked this question before and not gotten an answer, nor is it answered in the man page for mdadm on why we have this setup. Bill Clearly the short answer is that they reflect three stages of Bill Neil's thinking on the topic, and I would bet that he had a good Bill reason for moving the superblock when he did it. So let's hear Neil's thinking about all this? Or should I just work up a patch to do what I suggest and see how that flies? Bill Since you have to support all of them or break existing arrays, Bill and they all use the same format so there's no saving of code Bill size to mention, why even bring this up? Because of the confusion factor. Again, since no one has been able to articulate a reason why we have three different versions of the 1.x superblock, nor have I seen any good reasons for why we should have them, I'm going by the KISS principle to reduce the options to the best one. And no, I'm not advocating getting rid of legacy support, but I AM advocating that we settle on ONE standard format going forward as the default for all new RAID superblocks. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On 10/24/07, John Stoffel [EMAIL PROTECTED] wrote: Bill == Bill Davidsen [EMAIL PROTECTED] writes: Bill John Stoffel wrote: Why do we have three different positions for storing the superblock? Bill Why do you suggest changing anything until you get the answer to Bill this question? If you don't understand why there are three Bill locations, perhaps that would be a good initial investigation. Because I've asked this question before and not gotten an answer, nor is it answered in the man page for mdadm on why we have this setup. Bill Clearly the short answer is that they reflect three stages of Bill Neil's thinking on the topic, and I would bet that he had a good Bill reason for moving the superblock when he did it. So let's hear Neil's thinking about all this? Or should I just work up a patch to do what I suggest and see how that flies? Bill Since you have to support all of them or break existing arrays, Bill and they all use the same format so there's no saving of code Bill size to mention, why even bring this up? Because of the confusion factor. Again, since no one has been able to articulate a reason why we have three different versions of the 1.x superblock, nor have I seen any good reasons for why we should have them, I'm going by the KISS principle to reduce the options to the best one. And no, I'm not advocating getting rid of legacy support, but I AM advocating that we settle on ONE standard format going forward as the default for all new RAID superblocks. Why exactly are you on this crusade to find the one best v1 superblock location? Giving people the freedom to place the superblock where they choose isn't a bad thing. Would adding something like "If in doubt, 1.1 is the safest choice." to the mdadm man page give you the KISS warm-fuzzies you're pining for? The fact that, after you read the manpage, you didn't even know that the only difference between the v1.x variants is the location at which the superblock is placed indicates that you're not in a position to be so tremendously evangelical about effecting code changes that limit existing options. Mike - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Bill == Bill Davidsen [EMAIL PROTECTED] writes: Bill John Stoffel wrote: Why do we have three different positions for storing the superblock? Bill Why do you suggest changing anything until you get the answer to Bill this question? If you don't understand why there are three Bill locations, perhaps that would be a good initial investigation. Because I've asked this question before and not gotten an answer, nor is it answered in the man page for mdadm on why we have this setup. Bill Clearly the short answer is that they reflect three stages of Bill Neil's thinking on the topic, and I would bet that he had a good Bill reason for moving the superblock when he did it. So let's hear Neil's thinking about all this? Or should I just work up a patch to do what I suggest and see how that flies? If you are only going to change the default, I think you're done, since people report problems with bootloaders starting versions other than 0.90. And until I hear Neil's thinking on this, I'm not sure that I know what the default location and type should be. In fact, reading the discussion I suspect it should be different for RAID-1 (should be at the end) and all other types (should be near the front). That retains the ability to mount one part of the mirror as a single partition, while minimizing the possibility of bad applications seeing something which looks like a filesystem at the start of a partition and trying to run fsck on it. Bill Since you have to support all of them or break existing arrays, Bill and they all use the same format so there's no saving of code Bill size to mention, why even bring this up? Because of the confusion factor. Again, since no one has been able to articulate a reason why we have three different versions of the 1.x superblock, nor have I seen any good reasons for why we should have them, I'm going by the KISS principle to reduce the options to the best one. And no, I'm not advocating getting rid of legacy support, but I AM advocating that we settle on ONE standard format going forward as the default for all new RAID superblocks. Unfortunately the solution can't be any simpler than the problem, and that's why I'm dubious that anything but the documentation should be changed, or an additional metadata target added per the discussion above, perhaps best1 for best 1.x format based on the raid level. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. It might be possible if you use static device names, but our users *long* ago complained very loudly when adding a new disk or removing a bad disk caused their setup to fail to boot. So, along came mount by label and auto scans for superblocks. Once you do that, you *really* need all the superblocks at the same end of a device so when you stack things, it always works properly. Let me be devil's advocate: I noted in another post that location might be raid level dependent. For raid-1 putting the superblock at the end allows the BIOS to treat a single partition as a bootable unit. For all other arrangements the end location puts the superblock where it is slightly more likely to be overwritten, and where it must be moved if the partition grows or whatever. There really may be no right answer. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Tuesday October 23, [EMAIL PROTECTED] wrote: On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote: John Stoffel wrote: Why do we have three different positions for storing the superblock? Why do you suggest changing anything until you get the answer to this question? If you don't understand why there are three locations, perhaps that would be a good initial investigation. Clearly the short answer is that they reflect three stages of Neil's thinking on the topic, and I would bet that he had a good reason for moving the superblock when he did it. I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of the device) is to satisfy people that want to get at their raid1 data without bringing up the device or using a loop mount with an offset. Version 1.1, at the beginning of the device, is to prevent accidental access to a device when the raid array doesn't come up. And version 1.2 (4k from the beginning of the device) would be suitable for those times when you want to embed a boot sector at the very beginning of the device (which really only needs 512 bytes, but a 4k offset is as easy to deal with as anything else). From the standpoint of wanting to make sure an array is suitable for embedding a boot sector, the 1.2 superblock may be the best default. Exactly correct. Another perspective is that I chickened out of making a decision and chose to support all the credible possibilities that I could think of. And showed that I didn't have enough imagination. The other possibility that I should have included (as has been suggested in this conversation, and previously on this list) is to store the superblock both at the beginning and the end for redundancy. However I cannot decide whether to combine the 1.0 and 1.1 locations, or the 1.0 and 1.2. And I don't think I want to support both (maybe I've learned my lesson). As for where the metadata should be placed, it is interesting to observe that the SNIA's DDFv1.2 puts it at the end of the device. And as DDF is an industry standard sponsored by multiple companies it must be .. Sorry. I had intended to say correct, but when it came to it, my fingers refused to type that word in that context. DDF is in a somewhat different situation though. It assumes that the components are whole devices, and that the controller has exclusive access - there is no way another controller could interpret the devices differently before the DDF controller has a chance. DDF is also interesting in that it uses 512 byte alignment for metadata. The 'anchor' block is in the last sector of the device. This contrasts with current md metadata which is all 4K aligned. Given that the drive manufacturers seem to be telling us that 4096 is the new 512, I think 4K alignment was a good idea. It could be that DDF actually specifies the anchor to reside in the last block rather than the last sector, and it could be that the spec allows for block size to be device specific - I'd have to hunt through the spec again to be sure. For the record, I have no intention of deprecating any of the metadata formats, not even 0.90. It is conceivable that I could change the default, though that would require a decision as to what the new default would be. I think it would have to be 1.0 or it would cause too much confusion. I think it would be entirely appropriate for a distro (especially an 'enterprise' distro) to choose a format and location that it was going to standardise on and support, and make that the default on that distro (by using a CREATE line in mdadm.conf). 
Debian has already done this by making 1.0 the default. I certainly accept that the documentation is probably less than perfect (by a large margin). I am more than happy to accept patches or concrete suggestions on how to improve that. I always think it is best if a non-developer writes documentation (and a developer reviews it) as then it is more likely to address the issues that a non-developer will want to read about, and in a way that will make sense to a non-developer. (i.e. I'm too close to the subject to write good doco). NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
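The mechanism Neil refers to is the CREATE line; a minimal sketch of a distro-style default, using the Debian choice mentioned above as the illustrative value:

  # /etc/mdadm.conf (Debian: /etc/mdadm/mdadm.conf)
  # Default metadata version for arrays created without an explicit --metadata:
  CREATE metadata=1.0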
Re: Time to deprecate old RAID formats?
Neil Brown wrote: On Tuesday October 23, [EMAIL PROTECTED] wrote: As for where the metadata should be placed, it is interesting to observe that the SNIA's DDFv1.2 puts it at the end of the device. And as DDF is an industry standard sponsored by multiple companies it must be .. Sorry. I had intended to say correct, but when it came to it, my fingers refused to type that word in that context. DDF is in a somewhat different situation though. It assumes that the components are whole devices, and that the controller has exclusive access - there is no way another controller could interpret the devices differently before the DDF controller has a chance. grin agreed. DDF is also interesting in that it uses 512 byte alignment for metadata. The 'anchor' block is in the last sector of the device. This contrasts with current md metadata which is all 4K aligned. Given that the drive manufacturers seem to be telling us that 4096 is the new 512, I think 4K alignment was a good idea. It could be that DDF actually specifies the anchor to reside in the last block rather than the last sector, and it could be that the spec allows for block size to be device specific - I'd have to hunt through the spec again to be sure. It's a bit of a mess. Yes, with 1K and 4K sector devices starting to appear, as long as the underlying partitioning gets the initial partition alignment correct, this /should/ continue functioning as normal. If for whatever reason you wind up with an odd-aligned 1K sector device and your data winds up aligned to even numbered [hard] sectors, performance will definitely suffer. Mostly this is out of MD's hands, and up to the sysadmin and partitioning tools to get hard-sector alignment right. For the record, I have no intention of deprecating any of the metadata formats, not even 0.90. strongly agreed It is conceivable that I could change the default, though that would require a decision as to what the new default would be. I think it would have to be 1.0 or it would cause too much confusion. A newer default would be nice. Jeff - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Why do we have three different positions for storing the superblock? Why do you suggest changing anything until you get the answer to this question? If you don't understand why there are three locations, perhaps that would be a good initial investigation. Clearly the short answer is that they reflect three stages of Neil's thinking on the topic, and I would bet that he had a good reason for moving the superblock when he did it. Since you have to support all of them or break existing arrays, and they all use the same format so there's no saving of code size to mention, why even bring this up? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote: On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote: And if putting the superblock at the end is problematic, why is it the default? Shouldn't version 1.1 be the default? In my opinion, having the superblock *only* at the end (e.g. the 0.90 format) is the best option. It allows one to mount the disk separately (in case of RAID 1), if the MD superblock is corrupt or you just want to get easily at the raw data. Bad reasoning. It's the reason that the default is at the end of the device, but that was a bad decision made by Ingo long, long ago in a galaxy far, far away. The simple fact of the matter is there are only two types of raid devices for the purpose of this issue: those that fragment data (raid0/4/5/6/10) and those that don't (raid1, linear). For the purposes of this issue, there are only two states we care about: the raid array works or doesn't work. If the raid array works, then you *only* want the system to access the data via the raid array. If the raid array doesn't work, then for the fragmented case you *never* want the system to see any of the data from the raid array (such as an ext3 superblock) or a subsequent fsck could see a valid superblock and actually start a filesystem scan on the raw device, and end up hosing the filesystem beyond all repair after it hits the first chunk size break (although in practice this is usually a situation where fsck declares the filesystem so corrupt that it refuses to touch it, that's leaving an awful lot to chance; you really don't want fsck to *ever* see that superblock). If the raid array is raid1, then the raid array should *never* fail to start unless all disks are missing (in which case there is no raw device to access anyway). The very few failure types that will cause the raid array to not start automatically *and* still have an intact copy of the data usually happen when the raid array is perfectly healthy, in which case automatically finding a constituent device when the raid array failed to start is exactly the *wrong* thing to do (for instance, you enable SELinux on a machine and it hasn't been relabeled and the raid array fails to start because /dev/mdblah can't be created because of an SELinux denial...all the raid1 members are still there, but if you touch a single one of them, then you run the risk of creating silent data corruption). It really boils down to this: for any reason that a raid array might fail to start, you *never* want to touch the underlying data until someone has taken manual measures to figure out why it didn't start and corrected the problem. Putting the superblock in front of the data does not prevent manual measures (such as recreating superblocks) from getting at the data. But, putting superblocks at the end leaves the door open for accidental access via constituent devices when you *really* don't want that to happen. You didn't mention some ill-behaved application using the raw device (i.e. a database) writing just a little more than it should and destroying the superblock. So, no, the default should *not* be at the end of the device. You make a convincing argument. As to the people who complained exactly because of this feature, LVM has two mechanisms to protect from accessing PVs on the raw disks (the ignore raid components option and the filter - I always set filters when using LVM on top of MD).
regards, iustin -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
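For reference, a sketch of the two LVM guards Iustin mentions; both settings live in the devices section of lvm.conf, and the filter shown is one illustrative policy, not the only sane one:

  # /etc/lvm/lvm.conf
  devices {
      # Skip any device that carries an md superblock
      md_component_detection = 1

      # Accept PVs only on md devices; reject everything else
      filter = [ "a|^/dev/md|", "r|.*|" ]
  }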
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: Michael Doug Ledford wrote: Michael [] 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array). So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for back compatibility with thousands of existing raid arrays. Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case. There have certainly been hacks of various RAID systems on other OSes such as Solaris where the VxVM and/or Solstice DiskSuite allowed you to encapsulate an existing partition into a RAID array. But in my experience (and I'm a professional sysadm... :-) it's not really all that useful, and can lead to problems like those described by Doug. If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. Once that is done, you can then add the original partition into the RAID1 array. As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. And this leads to the heart of my initial post on this matter, that the confusion of having four different variations of RAID superblocks is bad. We should deprecate them down to just two, the old 0.90 format, and the new 1.x format at the start of the RAID volume. Perhaps I am misreading you here, but when you say "deprecate them down" do you mean the Adrian Bunk method of putting in a printk scolding the administrator, and then remove the feature a version later, or did you mean deprecate all but two, which clearly doesn't suggest removing the capability at all? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Justin Piszcz wrote: On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John I am sure, I submitted a bug report to the LILO developer, he acknowledged the bug but I don't know if it was fixed. I have not tried GRUB with a RAID1 setup yet. Works fine. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote: John Stoffel wrote: Why do we have three different positions for storing the superblock? Why do you suggest changing anything until you get the answer to this question? If you don't understand why there are three locations, perhaps that would be a good initial investigation. Clearly the short answer is that they reflect three stages of Neil's thinking on the topic, and I would bet that he had a good reason for moving the superblock when he did it. I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of the device) is to satisfy people that want to get at their raid1 data without bringing up the device or using a loop mount with an offset. Version 1.1, at the beginning of the device, is to prevent accidental access to a device when the raid array doesn't come up. And version 1.2 (4k from the beginning of the device) would be suitable for those times when you want to embed a boot sector at the very beginning of the device (which really only needs 512 bytes, but a 4k offset is as easy to deal with as anything else). From the standpoint of wanting to make sure an array is suitable for embedding a boot sector, the 1.2 superblock may be the best default. Since you have to support all of them or break existing arrays, and they all use the same format so there's no saving of code size to mention, why even bring this up? -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: chunk size (was Re: Time to deprecate old RAID formats?)
On Tue, 2007-10-23 at 21:21 +0200, Michal Soltys wrote: Doug Ledford wrote: Well, first I was thinking of files in the few hundreds of megabytes each to gigabytes each, and when they are streamed, they are streamed at a rate much lower than the full speed of the array, but still at a fast rate. How parallel the reads are then would tend to be a function of chunk size versus streaming rate. Ahh, I see now. Thanks for the explanation. I wonder, though, if setting a large readahead would help if you used a larger chunk size. Assuming other options are not possible - i.e. streaming from a larger buffer, while reading into it in a full stripe width at least. Probably not. All my trial and error in the past with raid5 arrays and various situations that would cause pathological worst case behavior showed that once reads themselves reach 16k in size, and are sequential in nature, then the disk firmware's read ahead kicks in and your performance stays about the same regardless of increasing your OS read ahead. In a nutshell, once you've convinced the disk firmware that you are going to be reading some data sequentially, it does the rest. With a large stripe size (say 256k+), you'll trigger this firmware read ahead fairly early on in reading any given stripe, so you really don't buy much by reading the next stripe before you need it, and in fact can end up wasting a lot of RAM trying to do so, hurting overall performance. I'm not familiar with the benchmark you are referring to. I was thinking about http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html with a small discussion that happened after that. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
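For completeness, a sketch of the OS readahead knob under discussion, with an illustrative value; per the observation above, it buys little once the drive firmware's own read ahead has kicked in:

  # Read-ahead is specified in 512-byte sectors; 4096 sectors = 2MB
  blockdev --setra 4096 /dev/md0

  # Verify the current setting
  blockdev --getra /dev/md0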
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 22:24 +0400, Michael Tokarev wrote: John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. And There are different points of view, and different settings etc. Indeed, there are different points of view. And with that in mind, I'll just point out that my point of view is that of an engineer who is responsible for all the legitimate md bugs in our products once tech support has weeded out the you tried to do what? cases. From that point of view, I deal with *every* user's preferred use case, not any single use case. For example, I once dealt with a linux user who was unable to use his disk partition, because his system (it was RedHat if I remember correctly) recognized some LVM volume on his disk (it was previously used with Windows) and tried to automatically activate it, thus making it busy. Yep, that can still happen today under certain circumstances. What I'm talking about here is that any automatic activation of anything should be done with extreme care, using smart logic in the startup scripts if at all. We do. Unfortunately, there is no logic smart enough to recognize all the possible user use cases that we've seen given the way things are created now. Doug's example - in my opinion anyway - shows wrong tools or bad logic in the startup sequence, not a general flaw in superblock location. Well, one of the problems is that you can both use an md device as an LVM physical volume and use an LVM logical volume as an md constituent device. Users have done both. For example, when one drive was almost dead, and mdadm tried to bring the array up, the machine just hung for an unknown amount of time. An inexperienced operator was there. Instead of trying to teach him how to pass a parameter to the initramfs to stop it trying to assemble the root array and then assembling it manually, I told him to pass root=/dev/sda1 to the kernel. Root mounts read-only, so it should be a safe thing to do - I only needed the root fs and a minimal set of services (which are even in initramfs) just for it to boot up to SOME state where I can log in remotely and fix things later. Umm, no. Generally speaking (I can't speak for other distros), both Fedora and RHEL remount root rw even when coming up in single user mode. The only time the fs is left in ro mode is when it drops to a shell during rc.sysinit as a result of a failed fs check. And if you are using an ext3 filesystem and things didn't go down clean, then you also get a journal replay. So, then what happens when you think you've fixed things, and you reboot, and then due to random chance, the ext3 fs check gets the journal off the drive that wasn't mounted and replays things again? Will this overwrite your fixes possibly? Yep. It could do all sorts of bad things. In fact, unless you do a full binary compare of your constituent devices, you could have silent data corruption and just never know about it. You may get off lucky and never *see* the corruption, but it could well be there. The only safe way to reintegrate your raid after doing what you suggest is to kick the unmounted drive out of the array before rebooting by using mdadm to zero its superblock, boot up with a degraded raid1 array, and re-add the kicked device back in.
So, while you list several more examples of times when it was convenient to do as you suggest, these times can be handled in other ways (although it may mean keeping a rescue CD handy at each location just for situations like this) that are far safer IMO. Now, putting all this back into the point of view I have to take, which is what's the best default action to take for my customers, I'm sure you can understand how a default setup and recommendation of use that leaves silent data corruption is simply a non-starter for me. If someone wants to do this manually, then go right ahead. But as for what we do by default when the user asks us to create a raid array, we really need to be on superblock 1.1 or 1.2 (although we aren't yet, we've waited for the version 1 superblock issues to iron out and will do so in a future release). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
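A sketch of the safe reintegration sequence Doug describes, assuming a two-disk raid1 /dev/md0 whose unmounted half was /dev/sdb1 (both names are illustrative):

  # Before rebooting: make the stale half unassemblable
  mdadm --zero-superblock /dev/sdb1

  # After booting with the degraded array: re-add the wiped member,
  # forcing a full resync from the half that carries your fixes
  mdadm /dev/md0 --add /dev/sdb1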
Re: Time to deprecate old RAID formats?
On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. It might be possible if you use static device names, but our users *long* ago complained very loudly when adding a new disk or removing a bad disk caused their setup to fail to boot. So, along came mount by label and auto scans for superblocks. Once you do that, you *really* need all the superblocks at the same end of a device so when you stack things, it always works properly. Michael Another example is ext[234]fs - it does not touch first 512 Michael bytes of the device, so if there was an msdos filesystem Michael there before, it will be recognized as such by many tools, Michael and an attempt to mount it automatically will lead to at Michael least scary output and nothing mounted, or in fsck doing Michael fatal things to it in worst scenario. Sure thing the first Michael 512 bytes should be just cleared.. but that's another topic. I would argue that ext[234] should be clearing those 512 bytes. Why aren't they cleared? Actually, I didn't think msdos used the first 512 bytes for the same reason ext3 doesn't: space for a boot sector. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
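The clearing being argued for is a one-liner; a hedged sketch with an illustrative device name (destructive to any boot sector already present):

  # Zero the first 512 bytes so no stale msdos/boot signature survives
  dd if=/dev/zero of=/dev/sdb1 bs=512 count=1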
Re: Time to deprecate old RAID formats?
[ I was going to reply to this earlier, but the Red Sox and good weather got in the way this weekend. ;-] Michael == Michael Tokarev [EMAIL PROTECTED] writes: Michael I've been doing sysadmin work for about 15 or 20 years. Welcome to the club! It's a fun career, always something new to learn. If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. Michael And you have to copy the data twice as a result, instead of Michael copying it only once to the second disk. So? Why is this such a big deal? As I see it, there are two separate ways to set up a RAID1 setup on an OS. 1. The mirror is built ahead of time and you install onto the mirror. And twice as much data gets written, half to each disk. *grin* 2. You are encapsulating an existing OS install and you need to do a reboot from the un-mirrored OS to the mirrored setup. So yes, you do have to copy the data from the orig to the mirror, reboot, then resync back onto the original disk which has been added into the RAID set. Neither case is really that big a deal. And with the RAID super block at the front of the disk, you don't have to worry about mixing up which disk is which. It's not fun when you boot one disk, thinking it's the RAID disk, but end up booting the original disk. As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. And Michael There are different points of view, and different settings Michael etc. For example, I once dealt with a linux user who was Michael unable to use his disk partition, because his system (it was Michael RedHat if I remember correctly) recognized some LVM volume on Michael his disk (it was previously used with Windows) and tried to Michael automatically activate it, thus making it busy. What I'm Michael talking about here is that any automatic activation of Michael anything should be done with extreme care, using smart logic Michael in the startup scripts if at all. Ah... but you can also de-activate LVM partitions as well if you like. Michael Doug's example - in my opinion anyway - shows wrong tools Michael or bad logic in the startup sequence, not a general flaw in Michael superblock location. I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is really true when you use the full disk for the mirror, because then you don't have the partition table to base some initial guesstimates on. Since there is an explicit Linux RAID partition type, as well as an explicit linux filesystem (filesystem is then decoded from the first Nk of the partition), you have a modicum of safety. If ext3 has the superblock in the first 4k of the disk, but you've set up the disk to use RAID1 with the LVM superblock at the end of the disk, you now need to be careful about how the disk is detected and then mounted. To the ext3 detection logic, it looks like an ext3 filesystem, to LVM, it looks like a RAID partition. Which is correct? Which is wrong? How do you tell programmatically?
That's why I think all superblocks should be in the SAME location on the disk and/or partitions if used. It keeps down problems like this. Michael Another example is ext[234]fs - it does not touch first 512 Michael bytes of the device, so if there was an msdos filesystem Michael there before, it will be recognized as such by many tools, Michael and an attempt to mount it automatically will lead to at Michael least scary output and nothing mounted, or in fsck doing Michael fatal things to it in worst scenario. Sure thing the first Michael 512 bytes should be just cleared.. but that's another topic. I would argue that ext[234] should be clearing those 512 bytes. Why aren't they cleared? Michael Speaking of cases where it was really helpful to have an Michael ability to mount individual raid components directly without Michael the raid level - most of them was due to one or another Michael operator errors, usually together with bugs and/or omissions Michael in software. I don't remember exact scenarios anymore (last Michael time it was more than 2 years ago). Most of the time it was Michael one or another sort of system recovery. In this case, you're only talking about RAID1 mirrors, no other RAID configuration fits this scenario. And while this might look to be helpful, I would strongly argue that it's not, because it's a special case of the RAID code and can lead to all kinds of bugs and problems if it's not
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. Michael And you have to copy the data twice as a result, instead of Michael copying it only once to the second disk. So? Why is this such a big deal? As I see it, there are two separate ways to set up a RAID1 setup on an OS. [..] that was just a tiny nitpick, so to say, about a particular way to convert an existing system to raid1 - not something that's done every day anyway. Still, double the time for copying your terabyte-sized drive is something to consider. [] Michael automatically activate it, thus making it busy. What I'm Michael talking about here is that any automatic activation of Michael anything should be done with extreme care, using smart logic Michael in the startup scripts if at all. Ah... but you can also de-activate LVM partitions as well if you like. Yes, esp. being a newbie user who first installed linux on his PC just to see that he can't use his disk.. ;) That was a real situation - I helped someone who had never heard of LVM and did little of anything with filesystems/disks before. Michael Doug's example - in my opinion anyway - shows wrong tools Michael or bad logic in the startup sequence, not a general flaw in Michael superblock location. I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. Superblock location does not depend on the filesystem. Raid exports the inside space only, excluding superblocks, to the next level (filesystem or otherwise). This is really true when you use the full disk for the mirror, because then you don't have the partition table to base some initial guesstimates on. Since there is an explicit Linux RAID partition type, as well as an explicit linux filesystem (filesystem is then decoded from the first Nk of the partition), you have a modicum of safety. Speaking of whole disks - first, don't do that (for reasons suitable for another topic), and second, using the whole disk or partitions makes no real difference whatsoever to the topic being discussed. There's just no need for the guesswork, except for the first install (to automatically recognize existing devices, and to use them after confirmation), and maybe for rescue systems, which again is a different topic. In any case, for a tool that does guesswork (like libvolume-id, to create /dev/ symlinks), it's as easy to look at the end of the device as to the beginning or to any other fixed place - since the tool has to know the superblock format, it knows the superblock location as well. Maybe manual guesswork, based on a hexdump of the first several kilobytes of data, is a bit more difficult in the case where the superblock is located at the end. But if one has to analyze a hexdump, he doesn't care about raid anymore. If ext3 has the superblock in the first 4k of the disk, but you've set up the disk to use RAID1 with the LVM superblock at the end of the disk, you now need to be careful about how the disk is detected and then mounted. See above.
For tools, it's trivial to distinguish a component of a raid volume from the volume itself, by looking for a superblock at whatever location. Including stuff like mkfs, which - like mdadm does - may warn one about previous filesystem/volume information on the device in question. Michael Speaking of cases where it was really helpful to have an Michael ability to mount individual raid components directly without Michael the raid level - most of them were due to one or another Michael operator error, usually together with bugs and/or omissions Michael in software. I don't remember the exact scenarios anymore (the last Michael time was more than 2 years ago). Most of the time it was Michael one or another sort of system recovery. In this case, you're only talking about RAID1 mirrors, no other RAID configuration fits this scenario. And while this might look to be Definitely. However, linear - to some extent - can be used partially, but surely with much less usefulness. However, raid1 is a much more common setup than anything else - IMHO anyway. It's the cheapest and the most reliable thing for an average user anyway - it's cheaper to get 2 large drives than, say, 3 somewhat smaller drives. Yes, raid1 has 1/2 the space wasted, compared with, say, raid5 on top of 3 drives (only 1/3 wasted), but still 3 smallish drives cost more than 2 larger drives. helpful, I would strongly argue that it's not, because it's a special
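A sketch of the check Michael describes, assuming a hypothetical member device /dev/sdb1: mdadm prints the superblock if it finds one at the location implied by each metadata version, and a filesystem prober like blkid reports what it would auto-detect on the same device:

    # Reports the md superblock (and its version) if one exists, else fails
    mdadm --examine /dev/sdb1
    # Shows what filesystem-probing tools would see on the same device
    blkid /dev/sdb1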
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote: Honestly, I don't see how a properly configured system would start looking at the physical device by mistake. I suppose it's possible, but I didn't have this issue. Mount by label support scans all devices in /proc/partitions looking for the filesystem superblock that has the label you are trying to mount. LVM (unless told not to) scans all devices in /proc/partitions looking for valid LVM superblocks. In fact, you can't build a linux system that is resilient to device name changes without doing that. It's not only about the activation of the array. I'm mostly talking about RAID1, but the fact that migrating between RAID1 and a plain disk is just a matter of a few hundred K at the end increases the flexibility very much. Flexibility, no. Convenience, yes. You can do all the things with the superblock at the front that you can with it at the end, it just takes a little more effort. Also, sometimes you want to recover as much as possible from a copy of the data that is not intact... And you can with the superblock at the front. You can create a new single-disk raid1 over the existing superblock, or you can munge the partition table to have it point at the start of your data. There are options, they just require manual intervention. But if you are trying to rescue data off of a seriously broken device, you are already doing manual intervention anyway. Of course, different people have different priorities, but as I said, I like that this conversion is possible, and I never had the case of a tool saying hmm, /dev/mdsomething is not there, let's look at /dev/sdc instead. mount, pvscan. thanks, iustin -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
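One flavour of the manual intervention Doug mentions, sketched with a hypothetical surviving member /dev/sdc1: rather than touching the raw component, the degraded array can be force-started and mounted read-only:

    # --run starts the array even though it is missing members
    mdadm --assemble --run /dev/md0 /dev/sdc1
    mount -o ro /dev/md0 /mnt/rescue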
Re: chunk size (was Re: Time to deprecate old RAID formats?)
On Sat, 2007-10-20 at 00:43 +0200, Michal Soltys wrote: Doug Ledford wrote: course, this comes at the expense of peak throughput on the device. Let's say you were building a mondo movie server, where you were streaming out digital movie files. In that case, you very well may care more about throughput than seek performance since I suspect you wouldn't have many small, random reads. Then I would use a small chunk size, sacrifice the seek performance, and get the throughput bonus of parallel reads from the same stripe on multiple disks. On the other hand, if I Out of curiosity though - why wouldn't a large chunk work well here? If you stream video (I assume large files, so a good few MBs at least), the reads are parallel either way. Well, first I was thinking of files in the few hundreds of megabytes each to gigabytes each, and when they are streamed, they are streamed at a rate much lower than the full speed of the array, but still at a fast rate. How parallel the reads are then would tend to be a function of chunk size versus streaming rate. I guess I should clarify what I'm talking about anyway. To me, a large chunk size is 1 to 2MB or so, a small chunk size is in the 64k to 256k range. If you have a 10 disk raid5 array with a 2mb chunk size, and you aren't just copying files around, then it's hard to ever get that to do full speed parallel reads because you simply won't access the data fast enough. Yes, the amount of data read from each of the disks will be in less perfect proportion than in the small chunk size scenario, but it's pretty negligible. Benchmarks I've seen (like Justin's) seem not to care much about chunk size in sequential read/write scenarios (and often favor larger chunks). Some of my own tests I did a few months ago confirmed that as well. I'm not familiar with the benchmark you are referring to. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
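For reference, the chunk size under discussion is fixed at creation time; a sketch with hypothetical devices, where --chunk takes KiB (so 2048 = 2MB):

    # Large chunks: most small/random reads stay inside one chunk on one disk
    mdadm --create /dev/md0 --level=5 --raid-devices=10 --chunk=2048 /dev/sd[b-k]1
    # A streaming-throughput build would use --chunk=64 (or 128/256) instead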
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: [] 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array.) So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. Well, I strongly, completely disagree. You described a real-world situation, and that's unfortunate, BUT: for at least raid1, there ARE cases, pretty valid ones, when one NEEDS to mount the filesystem without bringing up raid. Raid1 allows that. /mjt
Re: Time to deprecate old RAID formats?
Michael == Michael Tokarev [EMAIL PROTECTED] writes: Michael Doug Ledford wrote: Michael [] 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array.) So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case. There have certainly been hacks of various RAID systems on other OSes such as Solaris, where VxVM and/or Solstice DiskSuite allowed you to encapsulate an existing partition into a RAID array. But in my experience (and I'm a professional sysadm... :-) it's not really all that useful, and can lead to problems like those described by Doug. If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. Once that is done, you can then add the original partition into the RAID1 array. As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. And this leads to the heart of my initial post on this matter, that the confusion of having four different variations of RAID superblocks is bad. We should deprecate them down to just two: the old 0.90 format, and the new 1.x format at the start of the RAID volume. John
Re: Time to deprecate old RAID formats?
On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote: Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case please. Boot from a raid1 array, such that everything - including the partition table itself - is mirrored. iustin - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 17:07 +0200, Iustin Pop wrote: On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote: Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case please. Boot from a raid1 array, such that everything - including the partition table itself - is mirrored. That's a *really* bad idea. If you want to subpartition a raid array, you really should either run lvm on top of raid or use a partitionable raid array embedded in a raid partition. If you don't, there are a whole slew of failure cases that would result in the same sort of accidental access and data corruption that I talked about. For instance, if you ever ran fdisk on the disk itself instead of the raid array, fdisk would happily create a partition that runs off the end of the raid device and into the superblock area. The raid subsystem autodetect only works on partitions labeled as type 0xfd, so it would never search for a raid superblock at the end of the actual device, and that means that if you boot from a rescue CD that doesn't contain an mdadm.conf file that specifies the whole disk device as a search device, then it is guaranteed to not start the device and possibly try and modify the underlying constituent devices. All around, it's just a *really* bad idea. I've heard several descriptions of things you *could* do with the superblock at the end, but as of yet, not one of them is a good idea if you really care about your data. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
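The rescue-CD failure mode Doug describes comes down to mdadm.conf contents; a sketch, with hypothetical disk names and a placeholder UUID, of what a rescue system would need before it could ever find a whole-disk array:

    # Without the raw disks in DEVICE, mdadm never probes them for superblocks
    DEVICE /dev/sda /dev/sdb
    # A real ARRAY line can be generated with: mdadm --examine --scan
    ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx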
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: [] Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case. There have certainly been hacks of various RAID systems on other OSes such as Solaris, where VxVM and/or Solstice DiskSuite allowed you to encapsulate an existing partition into a RAID array. But in my experience (and I'm a professional sysadm... :-) it's not really all that useful, and can lead to problems like those described by Doug. I've been doing sysadmin work for about 15 or 20 years. If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. [...] And you have to copy the data twice as a result, instead of copying it only once to the second disk. As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. There are different points of view, different settings, etc. For example, I once dealt with a linux user who was unable to use his disk partition, because his system (it was RedHat if I remember correctly) recognized some LVM volume on his disk (it was previously used with Windows) and tried to automatically activate it, thus making it busy. What I'm talking about here is that any automatic activation of anything should be done with extreme care, using smart logic in the startup scripts if at all. Doug's example - in my opinion anyway - shows wrong tools or bad logic in the startup sequence, not a general flaw in superblock location. Another example is ext[234]fs - it does not touch the first 512 bytes of the device, so if there was an msdos filesystem there before, it will be recognized as such by many tools, and an attempt to mount it automatically will lead to at least scary output and nothing mounted, or to fsck doing fatal things to it in the worst scenario. Sure thing the first 512 bytes should just be cleared.. but that's another topic. Speaking of cases where it was really helpful to have an ability to mount individual raid components directly without the raid level - most of them were due to one or another operator error, usually together with bugs and/or omissions in software. I don't remember the exact scenarios anymore (the last time was more than 2 years ago). Most of the time it was one or another sort of system recovery. On almost all machines I maintain, there's a raid1 for the root filesystem built of all the drives (be it 2 or 4 or even 6 of them) - the key point is to be able to boot off any of them in case some cable/drive/controller rearrangement has to be done. The root filesystem is quite small (256 or 512 Mb here), and it's not too dynamic either -- so it's not a big deal to waste space for it. Problems occur - obviously - when something goes wrong. And most of the time the issues we had happened at a remote site, where there was no experienced operator/sysadmin handy. For example, when one drive was almost dead and mdadm tried to bring the array up, the machine just hung for an unknown amount of time. An inexperienced operator was there.
Instead of trying to teach him how to pass a parameter to the initramfs to stop it from trying to assemble the root array, and then assembling it manually, I told him to pass root=/dev/sda1 to the kernel. Root mounts read-only, so it should be a safe thing to do - I only needed the root fs and a minimal set of services (which are even in the initramfs) just for it to boot up to SOME state where I could log in remotely and fix things later. (No, I didn't want to remove the drive yet, I wanted to examine it first, and it turned out to be a good idea, because the hang was happening only at the beginning of the drive, and while we tried to install the replacement and fill it up with data, an unreadable sector was found on another drive, so this old but not-yet-removed drive was really handy.) Another situation - after some weird crash I had to examine the filesystems found on both components - I wanted to look at the filesystems and compare them, WITHOUT messing with the raid superblocks (later on I wrote a tiny program to save/restore 0.90 superblocks), and without attempting reconstruction. In fact, this very case - examining the contents - is something I've done many times for one or another reason. There's just no need to involve the raid layer here at all, but it doesn't disturb things either (in some cases anyway). Yet another - many times we had to copy an old system to a new one - the new machine boots with 3 drives
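The save/restore program mentioned above can be approximated in shell, assuming the standard 0.90 layout (a 4KiB superblock at the start of the last 64KiB-aligned 64KiB block of the device); the device and file names are hypothetical:

    #!/bin/sh
    # save-sb.sh DEVICE FILE - copy out a v0.90 md superblock
    DEV=$1; OUT=$2
    BYTES=$(blockdev --getsize64 "$DEV")
    # Last 64KiB-aligned 64KiB block holds the superblock
    OFFSET=$(( (BYTES / 65536 - 1) * 65536 ))
    dd if="$DEV" of="$OUT" bs=4096 count=1 skip=$(( OFFSET / 4096 ))
    # Restore would be the same dd with if/of swapped, seek instead of
    # skip, and conv=notrunc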
Re: Time to deprecate old RAID formats?
Justin Piszcz wrote: On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: [] Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format; however, lilo expects the raid1 device to start at the beginning of the physical partition. In other words, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO, switched back to 00.90.03 and it worked fine. There are different 1.x formats - and the difference is exactly this -- the location of the superblock. In 1.0, the superblock is located at the end, just like with 0.90, and lilo works just fine with it. It gets confused somehow (however I don't see how really, because it uses bmap() to get a list of physical blocks for the files it wants to access - those should be in absolute numbers, regardless of the superblock location) when the superblock is at the beginning (v 1.1 or 1.2). /mjt
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 22:38 +0400, Michael Tokarev wrote: Justin Piszcz wrote: On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: [] Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format; however, lilo expects the raid1 device to start at the beginning of the physical partition. In other words, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO, switched back to 00.90.03 and it worked fine. There are different 1.x formats - and the difference is exactly this -- the location of the superblock. In 1.0, the superblock is located at the end, just like with 0.90, and lilo works just fine with it. It gets confused somehow (however I don't see how really, because it uses bmap() to get a list of physical blocks for the files it wants to access - those should be in absolute numbers, regardless of the superblock location) when the superblock is at the beginning (v 1.1 or 1.2). /mjt It's been a *long* time since I looked at the lilo raid1 support (I wrote the original patch that Red Hat used, I have no idea if that's what the lilo maintainer integrated though). However, IIRC, it uses bmap on the file, which implies it goes through the filesystem mounted on the raid device. And I don't think the numbers are absolute, except with respect to the filesystem. So, I think the situation could be made to work if you just taught lilo that on version 1.1 or version 1.2 superblock raids it should add the data offset of the raid to the bmap numbers (which I think are already added to the partition offset numbers). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
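The offset Doug refers to is recorded in the version-1 superblock itself and can be read back; a sketch, assuming a hypothetical member device /dev/sda1:

    # v1.1/1.2 members report a non-zero "Data Offset"; a bmap-based loader
    # like lilo would need to add it to the block numbers it computes
    mdadm --examine /dev/sda1 | grep -i offset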
Time to deprecate old RAID formats?
So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: I'm sure an internal bitmap would. On RAID1 arrays, reads/writes are never split up by a chunk size for stripes. A 2mb read is a single read, whereas on a raid4/5/6 array, a 2mb read will end up hitting a series of stripes across all disks. That means that on raid1 arrays, total disk seeks roughly equal total reads/writes, whereas on raid4/5/6, total disk seeks are usually much greater than total reads/writes. That in turn implies that in a raid1 setup, disk seek time is important to performance, but not necessarily paramount. For raid456, disk seek time is paramount because of how many more seeks that format uses. When you then use an internal bitmap, you are adding writes to every member of the raid456 array, which adds more seeks. The same is true for raid1, but since raid1 doesn't have the same level of dependency on seek rates that raid456 has, it doesn't show the same performance hit that raid456 does. Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format; however, lilo expects the raid1 device to start at the beginning of the physical partition. In other words, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO, switched back to 00.90.03 and it worked fine. (for those who use LILO) but for RAID4/5/6, keep the bitmaps away :) I still use an internal bitmap regardless ;-) To help mitigate the cost of seeks on raid456, you can specify a huge chunk size (like 256k to 2MB or somewhere in that range). As long as you can get 90%+ of your reads/writes to fall into the space of a single chunk, then you start performing more like a raid1 device without the extra seek overhead. Of course, this comes at the expense of peak throughput on the device. Let's say you were building a mondo movie server, where you were streaming out digital movie files. In that case, you very well may care more about throughput than seek performance since I suspect you wouldn't have many small, random reads. Then I would use a small chunk size, sacrifice the seek performance, and get the throughput bonus of parallel reads from the same stripe on multiple disks. On the other hand, if I was setting up a mail server then I would go with a large chunk size because the filesystem activities themselves are going to produce lots of random seeks, and you don't want your raid setup to make that problem worse. Plus, most mail doesn't come in or go out at any sort of massive streaming speed, so you don't need the parallel reads from multiple disks to perform well. It all depends on your particular use scenario. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: Time to deprecate old RAID formats?
On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote: And if putting the superblock at the end is problematic, why is it the default? Shouldn't version 1.1 be the default? In my opinion, having the superblock *only* at the end (e.g. the 0.90 format) is the best option. It allows one to mount the disk separately (in the case of RAID 1), if the MD superblock is corrupt or you just want to get easily at the raw data. As to the people who complained exactly because of this feature, LVM has two mechanisms to protect from accessing PVs on the raw disks (the ignore raid components option and the filter - I always set filters when using LVM on top of MD). regards, iustin
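A sketch of the two LVM guards Iustin mentions, as lvm.conf settings (the file path and the exact filter pattern are illustrative):

    # /etc/lvm/lvm.conf excerpt
    devices {
        # Skip any device that carries an md superblock
        md_component_detection = 1
        # Accept only md devices as PVs, reject everything else
        filter = [ "a|^/dev/md|", "r|.*|" ]
    }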
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote: On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote: And if putting the superblock at the end is problematic, why is it the default? Shouldn't version 1.1 be the default? In my opinion, having the superblock *only* at the end (e.g. the 0.90 format) is the best option. It allows one to mount the disk separately (in case of RAID 1), if the MD superblock is corrupt or you just want to get easily at the raw data. Bad reasoning. It's the reason that the default is at the end of the device, but that was a bad decision made by Ingo long, long ago in a galaxy far, far away. The simple fact of the matter is there are only two types of raid devices for the purpose of this issue: those that fragment data (raid0/4/5/6/10) and those that don't (raid1, linear). For the purposes of this issue, there are only two states we care about: the raid array works or doesn't work. If the raid array works, then you *only* want the system to access the data via the raid array. If the raid array doesn't work, then for the fragmented case you *never* want the system to see any of the data from the raid array (such as an ext3 superblock), or a subsequent fsck could see a valid superblock and actually start a filesystem scan on the raw device, and end up hosing the filesystem beyond all repair after it hits the first chunk size break (although in practice this is usually a situation where fsck declares the filesystem so corrupt that it refuses to touch it, that's leaving an awful lot to chance; you really don't want fsck to *ever* see that superblock). If the raid array is raid1, then the raid array should *never* fail to start unless all disks are missing (in which case there is no raw device to access anyway). The very few failure types that will cause the raid array to not start automatically *and* still have an intact copy of the data usually happen when the raid array is perfectly healthy, in which case automatically finding a constituent device when the raid array failed to start is exactly the *wrong* thing to do (for instance, you enable SELinux on a machine and it hasn't been relabeled, and the raid array fails to start because /dev/mdblah can't be created because of an SELinux denial... all the raid1 members are still there, but if you touch a single one of them, then you run the risk of creating silent data corruption). It really boils down to this: for any reason that a raid array might fail to start, you *never* want to touch the underlying data until someone has taken manual measures to figure out why it didn't start and corrected the problem. Putting the superblock in front of the data does not prevent manual measures (such as recreating superblocks) from getting at the data. But putting superblocks at the end leaves the door open for accidental access via constituent devices when you *really* don't want that to happen. So, no, the default should *not* be at the end of the device. As to the people who complained exactly because of this feature, LVM has two mechanisms to protect from accessing PVs on the raw disks (the ignore raid components option and the filter - I always set filters when using LVM on top of MD). regards, iustin -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote: 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me that the 1.1, combined with the 1.0, should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* You're somewhat misreading the man page. You *can't* combine 1.0 with 1.1. All of the above options: 1, 1.0, 1.1, 1.2; specifically mean to use a version 1 superblock. 1.0 means use a version 1 superblock at the end of the disk. 1.1 means a version 1 superblock at the beginning of the disk. 1.2 means version 1 at a 4k offset from the beginning of the disk. There really is no actual version 1.1 or 1.2; the .0, .1, and .2 parts of the version *only* mean where to put the version 1 superblock on the disk. If you just say version 1, then it goes to the default location for version 1 superblocks, and last I checked that was the end of the disk (aka 1.0). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
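So the location is chosen with the --metadata option at creation time; a sketch with hypothetical partitions:

    # Version-1 superblock at the start of each member (format 1.1)
    mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    # --metadata=1.0 would place the same superblock at the end instead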
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 12:45 -0400, Justin Piszcz wrote: On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John The performance AFTER a crash, yes, but in general usage - I remember someone here doing benchmarks - it had a negative effect on performance. I'm sure an internal bitmap would. On RAID1 arrays, reads/writes are never split up by a chunk size for stripes. A 2mb read is a single read, whereas on a raid4/5/6 array, a 2mb read will end up hitting a series of stripes across all disks. That means that on raid1 arrays, total disk seeks roughly equal total reads/writes, whereas on raid4/5/6, total disk seeks are usually much greater than total reads/writes. That in turn implies that in a raid1 setup, disk seek time is important to performance, but not necessarily paramount. For raid456, disk seek time is paramount because of how many more seeks that format uses. When you then use an internal bitmap, you are adding writes to every member of the raid456 array, which adds more seeks. The same is true for raid1, but since raid1 doesn't have the same level of dependency on seek rates that raid456 has, it doesn't show the same performance hit that raid456 does. Justin. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) (for those who use LILO) but for RAID4/5/6, keep the bitmaps away :) Justin.
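For completeness, the bitmap under discussion can be added to (or removed from) an existing array after the fact; the device name is hypothetical:

    # Add a write-intent bitmap stored in the superblock's reserved space
    mdadm --grow /dev/md0 --bitmap=internal
    # Remove it again if the extra seeks hurt (e.g. on raid5)
    mdadm --grow /dev/md0 --bitmap=none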
Re: Time to deprecate old RAID formats?
Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Doug == Doug Ledford [EMAIL PROTECTED] writes: Doug On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? Doug 1.0, 1.1, and 1.2 are the same format, just in different positions on Doug the disk. Of the three, the 1.1 format is the safest to use since it Doug won't allow you to accidentally have some sort of metadata between the Doug beginning of the disk and the raid superblock (such as an lvm2 Doug superblock), and hence whenever the raid array isn't up, you won't be Doug able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst Doug case situations, I've seen lvm2 find a superblock on one RAID1 array Doug member when the RAID1 array was down, the system came up, you used the Doug system, the two copies of the raid array were made drastically Doug inconsistent, then at the next reboot, the situation that prevented the Doug RAID1 from starting was resolved, and it never knew it failed to start Doug last time, and the two inconsistent members were put back into a clean Doug array). So, deprecating any of these is not really helpful. And you Doug need to keep the old 0.90 format around for backward compatibility with Doug thousands of existing raid arrays. This is a great case for making the 1.1 format be the default. So what are the advantages of the 1.0 and 1.2 formats then? Or should we be thinking about making two copies of the data on each RAID member, one at the beginning and one at the end, for resiliency? I just hate seeing this in the man page: Declare the style of superblock (raid metadata) to be used. The default is 0.90 for --create, and to guess for other operations. The default can be overridden by setting the metadata value for the CREATE keyword in mdadm.conf. Options are: 0, 0.90, default Use the original 0.90 format superblock. This format limits arrays to 28 component devices and limits component devices of levels 1 and greater to 2 terabytes. 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me like the 1.1, combined with the 1.0, should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* So at this point I'm not arguing to get rid of the 0.9 format, though I think it should NOT be the default any more; we should be using the 1.1 combined with 1.0 format. Is a bitmap created by default with 1.x? I remember seeing reports of 15-30% performance degradation using a bitmap on a RAID5 with 1.x. John
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array.) So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. Agree, what is the benefit in deprecating them? Is there that much old code, or...? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available. Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available. Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John I am sure - I submitted a bug report to the LILO developer; he acknowledged the bug, but I don't know if it was fixed. I have not tried GRUB with a RAID1 setup yet. Justin.
Re: Time to deprecate old RAID formats?
Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available. Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html I hope 00.90.03 is not deprecated, LILO cannot boot off of anything else! Justin. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
Doug == Doug Ledford [EMAIL PROTECTED] writes: Doug On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote: 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me that the 1.1, combined with the 1.0 should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* Doug You're somewhat misreading the man page. The man page is somewhat misleading then. It's not clear from reading it that the version 1 RAID superblock can be in one of three different positions in the volume. Doug You *can't* combine 1.0 with 1.1. All of the above options: 1, Doug 1.0, 1.1, 1.2; specifically mean to use a version 1 superblock. Doug 1.0 means use a version 1 superblock at the end of the disk. Doug 1.1 means version 1 superblock at beginning of disk. `1.2 means Doug version 1 at 4k offset from beginning of the disk. There really Doug is no actual version 1.1, or 1.2, the .0, .1, and .2 part of the Doug version *only* means where to put the version 1 superblock on Doug the disk. If you just say version 1, then it goes to the Doug default location for version 1 superblocks, and last I checked Doug that was the end of disk (aka, 1.0). So why not get rid of (deprecate) the version 1.0 and version 1.2 blocks, and only support the 1.1 version? Why do we have three different positions for storing the superblock? And if putting the superblock at the end is problematic, why is it the default? Shouldn't version 1.1 be the default? Or, alternatively, update the code so that we support RAID superblocks at BOTH the beginning and end 4k of the disk, for maximum redundancy. I guess I need to go and read the code to figure out the placement of 0.90 and 1.0 blocks to see how they are different. It's just not clear to me why we have such a muddle of 1.x formats to choose from and what the advantages and tradeoffs are between them. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Time to deprecate old RAID formats?
On Fri, 19 Oct 2007, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin Is a bitmap created by default with 1.x? I remember seeing Justin reports of 15-30% performance degradation using a bitmap on a Justin RAID5 with 1.x. Not according to the mdadm man page. I'd probably give up that performance if it meant that re-syncing an array went much faster after a crash. I certainly use it on my RAID1 setup on my home machine. John The performance AFTER a crash, yes, but in general usage - I remember someone here doing benchmarks - it had a negative effect on performance. Justin.
Re: Time to deprecate old RAID formats?
Doug == Doug Ledford [EMAIL PROTECTED] writes: Doug On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? Doug 1.0, 1.1, and 1.2 are the same format, just in different positions on Doug the disk. Of the three, the 1.1 format is the safest to use since it Doug won't allow you to accidentally have some sort of metadata between the Doug beginning of the disk and the raid superblock (such as an lvm2 Doug superblock), and hence whenever the raid array isn't up, you won't be Doug able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst Doug case situations, I've seen lvm2 find a superblock on one RAID1 array Doug member when the RAID1 array was down, the system came up, you used the Doug system, the two copies of the raid array were made drastically Doug inconsistent, then at the next reboot, the situation that prevented the Doug RAID1 from starting was resolved, and it never knew it failed to start Doug last time, and the two inconsistent members were put back into a clean Doug array). So, deprecating any of these is not really helpful. And you Doug need to keep the old 0.90 format around for backward compatibility with Doug thousands of existing raid arrays. This is a great case for making the 1.1 format be the default. So what are the advantages of the 1.0 and 1.2 formats then? Or should we be thinking about making two copies of the data on each RAID member, one at the beginning and one at the end, for resiliency? I just hate seeing this in the man page: Declare the style of superblock (raid metadata) to be used. The default is 0.90 for --create, and to guess for other operations. The default can be overridden by setting the metadata value for the CREATE keyword in mdadm.conf. Options are: 0, 0.90, default Use the original 0.90 format superblock. This format limits arrays to 28 component devices and limits component devices of levels 1 and greater to 2 terabytes. 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me like the 1.1, combined with the 1.0, should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* So at this point I'm not arguing to get rid of the 0.9 format, though I think it should NOT be the default any more; we should be using the 1.1 combined with 1.0 format. John
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array.) So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available. Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and set up than LILO these days. But hey, just dropping down to support 00.90.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
chunk size (was Re: Time to deprecate old RAID formats?)
Doug Ledford wrote: course, this comes at the expense of peak throughput on the device. Let's say you were building a mondo movie server, where you were streaming out digital movie files. In that case, you very well may care more about throughput than seek performance since I suspect you wouldn't have many small, random reads. Then I would use a small chunk size, sacrifice the seek performance, and get the throughput bonus of parallel reads from the same stripe on multiple disks. On the other hand, if I Out of curiosity though - why wouldn't large chunk work well here ? If you stream video (I assume large files, so like a good few MBs at least), the reads are parallel either way. Yes, the amount of data read from each of the disks will be in less perfect proportion than in small chunk size scenario, but it's pretty neglible. Benchamrks I've seen (like Justin's one) seem not to care much about chunk size in sequential read/write scenarios (and often favors larger chunks). Some of my own tests I did few months ago confirmed that as well. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html