Re: stride / stripe alignment on LVM ?
On Fri, 2007-11-02 at 23:16 +0100, Janek Kozicki wrote:

Bill Davidsen said: (by the date of Fri, 02 Nov 2007 09:01:05 -0400)
So I would expect this to make a very large performance difference, so even if it worked it would do so slowly.

I was trying to find out the stripe layout for a few hours, using hexedit and dd. And I'm baffled:

md1 : active raid5 hda3[0] sda3[1]
      969907968 blocks super 1.1 level 5, 128k chunk, algorithm 2 [3/2] [UU_]
                             ^^^

You have the raid superblock in the front of hda3 and sda3, as well as a bitmap I assume, which means that any data you write to md1 will actually be written to sda3/hda3 *after* the superblock and bitmap. If you run mdadm -D /dev/md1 it will tell you the data offset (in sectors IIRC). When you hexedit hda3, you need to jump forward the same number of sectors to get at the beginning of the actual md1 data. Of course, being raid5 with one disk missing, the data may or may not be present in its non-parity format depending on exactly which block you are looking at.

However, you don't really need to do anything to figure out the stripe size on your array; you have it already. All the talk about figuring out stripe layouts is for external raid devices that hide the raid layout from you. When you are talking about your own raid device that you created with mdadm, you specified the stripe layout when you created the array. In your case, the chunk size is 128K, and since you have a 3 disk raid5 array and one chunk in each stripe of a raid5 array is parity, the amount of data stored per stripe is chunk size * (number of disks - 1), so 256K in your case. But you don't even have to align the lvm to the stripe, just to a chunk, so you really only need to align the lvm superblock so that data starts at a 128K offset into the raid array.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
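[Editor's sketch, not from the original thread: the device names and the 128K chunk size are taken from the array above, and the --dataalignment option assumes a reasonably recent LVM2.]

    # Confirm the chunk size and the data offset introduced by the 1.1 superblock/bitmap:
    mdadm -D /dev/md1 | grep -i 'chunk size'
    mdadm -E /dev/hda3 | grep -i 'data offset'
    # Create the LVM physical volume so its data area starts on a chunk (128K) boundary:
    pvcreate --dataalignment 128k /dev/md1
    pvs -o +pe_start /dev/md1        # verify where the first physical extent begins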
Re: stride / stripe alignment on LVM ?
On Sat, 2007-11-03 at 21:21 +0100, Janek Kozicki wrote:

If you run mdadm -D /dev/md1 it will tell you the data offset (in sectors IIRC).

Uh, I don't see it:

Sorry, it's part of mdadm -E instead:

[EMAIL PROTECTED] ~]# mdadm -E /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x1
     Array UUID : c746e4f5:b015ffac:7216dbbd:48d973a7
           Name : firewall:home:2
  Creation Time : Mon May 28 20:47:07 2007
     Raid Level : raid1
   Raid Devices : 2
  Used Dev Size : 625137018 (298.09 GiB 320.07 GB)
     Array Size : 625137018 (298.09 GiB 320.07 GB)
    Data Offset : 264 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 7efd05d5:dd921536:1d1a1750:6ba49303
Internal Bitmap : 8 sectors from superblock
    Update Time : Sat Nov 3 21:01:24 2007
       Checksum : 27b3958f - correct
         Events : 2
     Array Slot : 0 (0, 1)
    Array State : Uu
[EMAIL PROTECTED] ~]#

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
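[Editor's illustration, not from the original mail: the 264-sector offset is the value reported above, and the command only shows meaningful data on a device with a 1.x superblock at the front.]

    # Skip past the superblock/bitmap area to look at the start of the actual array data:
    dd if=/dev/sdc1 bs=512 skip=264 count=16 2>/dev/null | hexdump -C | head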
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 03:41 -0500, Alberto Alonso wrote: On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote: Not in the older kernel versions you were running, no. These old versions (specially the RHEL) are supposed to be the official versions supported by Redhat and the hardware vendors, as they were very specific as to what versions of Linux were supported. The key word here being supported. That means if you run across a problem, we fix it. It doesn't mean there will never be any problems. Of all people, I would think you would appreciate that. Sorry if I sound frustrated and upset, but it is clearly a result of what supported and tested really means in this case. I'm sorry, but given the specially the RHEL case you cited, it is clear I can't help you. No one can. You were running first gen software on first gen hardware. You show me *any* software company who's first gen software never has to be updated to fix bugs, and I'll show you a software company that went out of business they day after they released their software. Our RHEL3 update kernels contained *significant* updates to the SATA stack after our GA release, replete with hardware driver updates and bug fixes. I don't know *when* that RHEL3 system failed, but I would venture a guess that it wasn't prior to RHEL3 Update 1. So, I'm guessing you didn't take advantage of those bug fixes. And I would hardly call once a quarter continuously updating your kernel. In any case, given your insistence on running first gen software on first gen hardware and not taking advantage of the support we *did* provide to protect you against that failure, I say again that I can't help you. I don't want to go into a discussion of commercial distros, which are supported as this is nor the time nor the place but I don't want to open the door to the excuse of its an old kernel, it wasn't when it got installed. I *really* can't help you. Outside of the rejected suggestion, I just want to figure out when software raid works and when it doesn't. With SATA, my experience is that it doesn't. So far I've only received one response stating success (they were using the 3ware and Areca product lines). No, your experience, as you listed it, is that SATA/usb-storage/Serverworks PATA failed you. The software raid never failed to perform as designed. However, one of the things you are doing here is drawing sweeping generalizations that are totally invalid. You are saying your experience is that SATA doesn't work, but you aren't qualifying it with the key factor: SATA doesn't work in what kernel version? It is pointless to try and establish whether or not something like SATA works in a global, all kernel inclusive fashion because the answer to the question varies depending on the kernel version. And the same is true of pretty much every driver you can name. This is why commercial companies don't just certify hardware, but the software version that actually works as opposed to all versions. In truth, you have *no idea* if SATA works today, because you haven't tried. As David pointed out, there was a significant overhaul of the SATA error recovery that took place *after* the kernel versions that failed you which totally invalidates your experiences and requires retesting of the later software to see if it performs differently. Anyway, this thread just posed the question, and as Neil pointed out, it isn't feasible/worth to implement timeouts within the md code. 
I think most of the points/discussions raised beyond that original question really belong to the thread "Software RAID when it works and when it doesn't". I do appreciate all comments and suggestions and I hope to keep them coming. I would hope however to hear more about success stories with specific hardware details. It would be helpful to have a list of tested configurations that are known to work.

I've had *lots* of success with software RAID as I've been running it for years. I've had old PATA drives fail, SCSI drives fail, FC drives fail, and I've had SATA drives that got kicked from the array due to read errors but not out and out drive failures. But I keep at least reasonably up to date with my kernels.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: switching root fs '/' to boot from RAID1 with grub
On Thu, 2007-11-01 at 11:57 -0700, H. Peter Anvin wrote: Doug Ledford wrote: Correct, and that's what you want. The alternative is that if the BIOS can see the first disk but it's broken and can't be used, and if you have the boot sector on the second disk set to read from BIOS disk 0x81 because you ASSuMEd the first disk would be broken but still present in the BIOS tables, then your machine won't boot unless that first dead but preset disk is present. If you remove the disk entirely, thereby bumping disk 0x81 to 0x80, then you are screwed. If you have any drive failure that prevents the first disk from being recognized (blown fuse, blown electronics, etc), you are screwed until you get a new disk to replace it. What you want is for it to use the drive number that BIOS passes into it (register DL), not a hard-coded number. That was my (only) point -- you're obviously right that hard-coding a number to 0x81 would be worse than useless. Oh, and I forgot to mention that in grub2, the DL register is ignored for RAID1 devices. Well, maybe not ignored, but once grub2 has determined that the intended boot partition is a raid partition, the raid code takes over and the raid code doesn't care about the DL register. Instead, it scans for all the other members of the raid array and utilizes whichever drives it needs to in order to complete the boot process. And since it does reads a sector (or a small group of sectors) at a time, it doesn't need any member of a raid1 array to be perfect, it will attempt a round robin read on all the sectors and only fail if all drives return an error for a given read. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote: Doug Ledford wrote: I would argue that ext[234] should be clearing those 512 bytes. Why aren't they cleared Actually, I didn't think msdos used the first 512 bytes for the same reason ext3 doesn't: space for a boot sector. The creators of MS-DOS put the superblock in the bootsector, so that the BIOS loads them both. It made sense in some diseased Microsoft programmer's mind. Either way, for RAID-1 booting, the boot sector really should be part of the protected area (and go through the MD stack.) It depends on what you are calling the protected area. If by that you mean outside the filesystem itself, and in a non-replicated area like where the superblock and internal bitmaps go, then yes, that would be ideal. If you mean in the file system proper, then that depends on the boot loader. The bootloader should deal with the offset problem by storing partition/filesystem-relative pointers, not absolute ones. Grub2 is on the way to this, but it isn't there yet. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 13:21 -0500, Alberto Alonso wrote: On Fri, 2007-11-02 at 11:45 -0400, Doug Ledford wrote: The key word here being supported. That means if you run across a problem, we fix it. It doesn't mean there will never be any problems. On hardware specs I normally read supported as tested within that OS version to work within specs. I may be expecting too much. It was tested, it simply obviously had a bug you hit. Assuming that your particular failure situation is the only possible outcome for all the other people that used it would be an invalid assumption. There are lots of code paths in an error handler routine, and lots of different hardware failure scenarios, and they each have their own independent outcome should they ever be experienced. I'm sorry, but given the specially the RHEL case you cited, it is clear I can't help you. No one can. You were running first gen software on first gen hardware. You show me *any* software company who's first gen software never has to be updated to fix bugs, and I'll show you a software company that went out of business they day after they released their software. I only pointed to RHEL as an example since that was a particular distro that I use and exhibited the problem. I probably could of replaced it with Suse, Ubuntu, etc. I may have called the early versions back in 94 first gen but not today's versions. I know I didn't expect the SLS distro to work reliably back then. Then you didn't pay attention to what I said before: RHEL3 was the first ever RHEL product that had support for SATA hardware. The SATA drivers in RHEL3 *were* first gen. Can you provide specific chipsets that you used (specially for SATA)? All of the Adaptec SCSI chipsets through the 7899, Intel PATA, QLogic FC, and nVidia and winbond based SATA. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: switching root fs '/' to boot from RAID1 with grub
On Thu, 2007-11-01 at 10:31 -0700, H. Peter Anvin wrote:
Doug Ledford wrote:

device (hd0) /dev/sda
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
device (hd0) /dev/hdc
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst

That will install grub on the master boot record of hdc and sda, and in both cases grub will look to whatever drive it is running on for the files to boot instead of going to a specific drive.

No, it won't... it'll look for the first drive in the system (BIOS drive 80h).

Yes, and except for some fantastic BIOS I've never heard of, the drive that the BIOS reads the boot sector from is always the 0x80 drive. This is either because the drive truly is the first drive, or because the BIOS is remapping a later drive to 0x80 for boot purposes. In either case, what I said is still true: the boot sector will look to read the data files from the drive the boot sector itself was read from.

This means that if the BIOS can see the bad drive, but it doesn't work, you're still screwed.

Correct, and that's what you want. The alternative is that if the BIOS can see the first disk but it's broken and can't be used, and if you have the boot sector on the second disk set to read from BIOS disk 0x81 because you ASSuMEd the first disk would be broken but still present in the BIOS tables, then your machine won't boot unless that first dead but still present disk stays installed. If you remove the disk entirely, thereby bumping disk 0x81 to 0x80, then you are screwed. If you have any drive failure that prevents the first disk from being recognized (blown fuse, blown electronics, etc.), you are screwed until you get a new disk to replace it.

Follow these simple rules when setting up boot sectors and you'll be OK:

1) If you are using RAID1, then a boot sector should *never* try and read data from anything other than the disk the boot sector is on. To do otherwise defeats the whole purpose of RAID1, which is that you only need 1 disk to survive in order for the array to survive.

2) If the BIOS runs any given MBR in the RAID array, then that MBR will be on the disk the BIOS has mapped to 0x80.

3) While there are failure scenarios that would leave a disk unusable but still visible to the OS, there are no magic BIOS switches to fake a totally dead device. So, since you can remove a defunct but present disk in order to allow disk B to become disk A, but you can't magic a new disk A out of thin air should it fail to the point of not being recognized, set all your raid boot sectors to think they are the first disk in the system and you will always be able to start your machine.

So, what I said is true, the MBR will search on the disk it is being run from for the files it needs: 0x80.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: Implementing low level timeouts within MD
On Thu, 2007-11-01 at 00:08 -0500, Alberto Alonso wrote: On Tue, 2007-10-30 at 13:39 -0400, Doug Ledford wrote: Really, you've only been bitten by three so far. Serverworks PATA (which I tend to agree with the other person, I would probably chock 3 types of bugs is too many, it basically affected all my customers with multi-terabyte arrays. Heck, we can also oversimplify things and say that it is really just one type and define everything as kernel type problems (or as some other kernel used to say... general protection error). I am sorry for not having hundreds of RAID servers from which to draw statistical analysis. As I have clearly stated in the past I am trying to come up with a list of known combinations that work. I think my data points are worth something to some people, specially those considering SATA drives and software RAID for their file servers. If you don't consider them important for you that's fine, but please don't belittle them just because they don't match your needs. I wasn't belittling them. I was trying to isolate the likely culprit in the situations. You seem to want the md stack to time things out. As has already been commented by several people, myself included, that's a band-aid and not a fix in the right place. The linux kernel community in general is pretty hard lined when it comes to fixing the bug in the wrong way. this up to Serverworks, not PATA), USB storage, and SATA (the SATA stack is arranged similar to the SCSI stack with a core library that all the drivers use, and then hardware dependent driver modules...I suspect that since you got bit on three different hardware versions that you were in fact hitting a core library bug, but that's just a suspicion and I could well be wrong). What you haven't tried is any of the SCSI/SAS/FC stuff, and generally that's what I've always used and had good things to say about. I've only used SATA for my home systems or workstations, not any production servers. The USB array was never meant to be a full production system, just to buy some time until the budget was allocated to buy a real array. Having said that, the raid code is written to withstand the USB disks getting disconnected as far as the driver reports it properly. Since it doesn't, I consider it another case that shows when not to use software RAID thinking that it will work. As for SCSI I think it is a greatly proved and reliable technology, I've dealt with it extensively and have always had great results. I know deal with it mostly on non Linux based systems. But I don't think it is affordable to most SMBs that need multi-terabyte arrays. I'll repeat my plea one more time. Is there a published list of tested combinations that respond well to hardware failures and fully signals the md code so that nothing hangs? I don't know of one, but like I said, I've not used a lot of the SATA stuff for production. I would make this one suggestion though, SATA is still an evolving driver stack to a certain extent, and as such, keeping with more current kernels than you have been using is likely to be a big factor in whether or not these sorts of things happen. OK, so based on this it seems that you would not recommend the use of SATA for production systems due to its immaturity, correct? Not in the older kernel versions you were running, no. Keep in mind that production systems are not able to be brought down just to keep up with kernel changes. We have some tru64 production servers with 1500 to 2500 days uptime, that's not uncommon in industry. 
And I guarantee not a single one of those systems even knows what SATA is. They all use tried and true SCSI/FC technology. In any case, if Neil is so inclined to do so, he can add timeout code into the md stack, it's not my decision to make. However, I would say that the current RAID subsystem relies on the underlying disk subsystem to report errors when they occur instead of hanging infinitely, which implies that the raid subsystem relies upon a bug free low level driver. It is intended to deal with hardware failure, in as much as possible, and a driver bug isn't a hardware failure. You are asking the RAID subsystem to be extended to deal with software errors as well. Even though you may have thought it should handle this type of failure when you put those systems together, it in fact was not designed to do so. For that reason, choice of hardware and status of drivers for specific versions of hardware is important, and therefore it is also important to keep up to date with driver updates. It's highly likely that had you been keeping up to date with kernels, several of those failures might not have happened. One of the benefits of having many people running a software setup is that when one person hits a bug and you fix it, and then distribute that fix to everyone else, you save everyone else from also hitting that bug. You have
Re: switching root fs '/' to boot from RAID1 with grub
On Thu, 2007-11-01 at 20:04 +0100, Janek Kozicki wrote: Doug Ledford said: (by the date of Thu, 01 Nov 2007 14:30:58 -0400) So, what I said is true, the MBR will search on the disk it is being run from for the files it needs: 0x80. my motherboard allows to pick a boot device if I press F11 during boot. Do you mean, that no matter which HDD I will choose it will have 0x80 number? All the motherboard BIOS drive mapping things I've seen will do exactly that. In order to boot from say drive sda when hda is present, they map BIOS device 0x80 to sda instead of hda. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: switching root fs '/' to boot from RAID1 with grub
On Wed, 2007-10-31 at 16:01 +0100, Janek Kozicki wrote:
And now I have a full RAID1 array. Now just two questions:

1. when I `shutdown -r now` I see a worrying message at the end:

Stopping array md2   done (stopped)
Stopping array md1   done (stopped)
Stopping array md0   failed (busy)
Will now reboot
md: Stopping all md devices
md: md0 still in use
(reboots)

Is that ok?

Yes. That's typical for the root device because root is never unmounted prior to shutdown. The messages are probably more worrying than they need to be. The system should have successfully switched the array to read only mode at the first attempt to stop the array. Neil, any chance of getting the messages for the / device to be less worrisome?

2. Will grub update all drives automatically, for instance when I upgrade the kernel by 'aptitude upgrade'? Or do I need to repeat your grub instructions each time a new kernel is installed?

Now that grub's installed, you won't have to do anything manual again. The only time you might have to repeat that grub install procedure is if you lose a drive and need to add a new one back in, then the new one will need it.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
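[Editor's sketch of that rebuild case. The device names are assumptions, and the grub commands simply mirror the ones given elsewhere in this thread.]

    # Re-add the replacement disk's partition to the mirror and let it resync:
    mdadm /dev/md0 --add /dev/sdb1
    # Then, from the grub shell, put a boot sector on the new disk as well:
    grub
    device (hd0) /dev/sdb
    root (hd0,0)
    install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
    quit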
Re: Time to deprecate old RAID formats?
On Tue, 2007-10-30 at 07:55 +0100, Luca Berra wrote:
Well it might be a matter of personal preference, but I would prefer an initrd doing just the minimum necessary to mount the root filesystem (and/or activating resume from a swap device), and leaving all the rest to initscripts, rather than an initrd that tries to do everything.

The initrd does exactly that. The rescan for superblocks does not happen in initrd or mkinitrd; it must be done manually. The code in mkinitrd uses the mdadm.conf file as it stands, but in the initrd image it doesn't start all the arrays, just the arrays needed to get booted into your / partition.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: Implementing low level timeouts within MD
On Tue, 2007-10-30 at 00:19 -0500, Alberto Alonso wrote: On Sat, 2007-10-27 at 12:33 +0200, Samuel Tardieu wrote: I agree with Doug: nothing prevents you from using md above very slow drivers (such as remote disks or even a filesystem implemented over a tape device to make it extreme). Only the low-level drivers know when it is appropriate to timeout or fail. Sam The problem is when some of these drivers are just not smart enough to keep themselves out of trouble. Unfortunately I've been bitten by apparently too many of them. Really, you've only been bitten by three so far. Serverworks PATA (which I tend to agree with the other person, I would probably chock this up to Serverworks, not PATA), USB storage, and SATA (the SATA stack is arranged similar to the SCSI stack with a core library that all the drivers use, and then hardware dependent driver modules...I suspect that since you got bit on three different hardware versions that you were in fact hitting a core library bug, but that's just a suspicion and I could well be wrong). What you haven't tried is any of the SCSI/SAS/FC stuff, and generally that's what I've always used and had good things to say about. I've only used SATA for my home systems or workstations, not any production servers. I'll repeat my plea one more time. Is there a published list of tested combinations that respond well to hardware failures and fully signals the md code so that nothing hangs? I don't know of one, but like I said, I've not used a lot of the SATA stuff for production. I would make this one suggestion though, SATA is still an evolving driver stack to a certain extent, and as such, keeping with more current kernels than you have been using is likely to be a big factor in whether or not these sorts of things happen. If not, I would like to see what people that have experienced hardware failures and survived them are using so that such a list can be compiled. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Implementing low level timeouts within MD
On Tue, 2007-10-30 at 00:08 -0500, Alberto Alonso wrote: On Mon, 2007-10-29 at 13:22 -0400, Doug Ledford wrote: OK, these you don't get to count. If you run raid over USB...well...you get what you get. IDE never really was a proper server interface, and SATA is much better, but USB was never anything other than a means to connect simple devices without having to put a card in your PC, it was never intended to be a raid transport. I still count them ;-) I guess I just would of hoped for software raid to really don't care about the lower layers. The job of software raid is to help protect your data. In order to do that, the raid needs to be run over something that *at least* provides a minimum level of reliability itself. The entire USB spec is written under the assumption that a USB device can disappear at any time and the stack must accept that (and it can, just trip on a cable some time and watch your raid device get all pissy). So, yes, software raid can run over any block device, but putting it over an unreliable connection medium is like telling a gladiator that he has to face the lion with no sword, no shield, and his hands tied behind his back. He might survive, but you have so seriously handicapped him that it's all but over. * Supermicro MB with ICH5/ICH5R controller and 2 RAID5 arrays of 3 disks each. (only one drive on one array went bad) * VIA VT6420 built into the MB with RAID1 across 2 SATA drives. * And the most complex is this week's server with 4 PCI/PCI-X cards. But the one that hanged the server was a 4 disk RAID5 array on a RocketRAID1540 card. And 3 SATA failures, right? I'm assuming the Supermicro is SATA or else it has more PATA ports than I've ever seen. Was the RocketRAID card in hardware or software raid mode? It sounds like it could be a combination of both, something like hardware on the card, and software across the different cards or something like that. What kernels were these under? Yes, these 3 were all SATA. The kernels (in the same order as above) are: * 2.4.21-4.ELsmp #1 (Basically RHEL v3) *Really* old kernel. RHEL3 is in maintenance mode already, and that was the GA kernel. It was also the first RHEL release with SATA support. So, first gen driver on first gen kernel. * 2.6.18-4-686 #1 SMP on a Fedora Core release 2 * 2.6.17.13 (compiled from vanilla sources) The RocketRAID was configured for all drives as legacy/normal and software RAID5 across all drives. I wasn't using hardware raid on the last described system when it crashed. So, the system that died *just this week* was running 2.6.17.13? Like I said in my last email, the SATA stack has been evolving over the last few years, and that's quite a few revisions behind. My basic advice is this: if you are going to use the latest and greatest hardware options, then you should either make sure you are using an up to date distro kernel of some sort or you need to watch the kernel update announcements for fixes related to that hardware and update your kernels/drivers as appropriate. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: switching root fs '/' to boot from RAID1 with grub
On Tue, 2007-10-30 at 21:07 +0100, Janek Kozicki wrote:
Hello, I have an old HDD and two new HDDs:
- hda1 - my current root filesystem '/'
- sda1 - part of raid1 /dev/md0 [U_U]
- hdc1 - part of raid1 /dev/md0 [U_U]
I want all of hda1, sda1, hdc1 to be a raid1. I remounted hda1 readonly, then I did 'dd if=/dev/hda1 of=/dev/md0'. I carefully checked that the partition sizes match exactly. So now md0 contains the same thing as hda1. But hda1 is still outside of the array. I want to add it to the array. But before I do this I think that I should boot from /dev/md0? Otherwise I might hose this system. I tried `grub-install /dev/sda1` (assuming that grub would see no problem with reading a raid1 partition, and boot from it, until mdadm detects an array). I tried `grub-install /dev/sda` as well as on /dev/hdc and /dev/hdc1. I turned off the 'active' flag for partition hda1 and turned it on for hdc1 and sda1. But still grub is booting from hda1.

Well, without going into a lot of detail, you would have to boot from /dev/hda1 and specify a root=/dev/md0 option to the kernel to actually boot to the new / filesystem before grub-install will do what you are expecting. The fact that hda1 is mounted as / and that hda1 contains /boot with all your kernels and initrd images means that when you run grub-install it looks up the current location of the /boot files, sees they are on /dev/hda1, and regardless of where you put the boot sector (sda1, hdc1), those sectors point to the files grub found in /boot, which are on hda1.

I did all this with version 1.1

Which won't work, and you'll see that as soon as you have md0 mounted as / and try to run grub-install again.

mdadm --create --verbose /dev/md0 --chunk=64 --level=raid1 \
    --metadata=1.1 --bitmap=internal --raid-devices=3 /dev/sda1 \
    missing /dev/hdc1

I'm NOT using LVM here. Can someone tell me how I should switch grub to boot from /dev/md0? After the boot I will add hda1 to the array, and all three partitions should become a raid1.

Grub doesn't work with version 1.1 superblocks at the moment. It could be made to work quick and dirty in a short period of time; making it work properly would take longer. So, here's what I would do in your case. Scrap the current /dev/md0 setup. Make a new /dev/md0 using a version 1.0 superblock with all the other options the same as before. BTW, your partition sizes don't need to match exactly. If the new device is larger than your /dev/hda1, then no big deal, just do the dd like you did before, and when you are done you can resize the fs to fit the device. If the new device is slightly smaller than /dev/hda1, then just run resizefs to shrink the filesystem on /dev/hda1 to the size of /dev/md0 *before* you do the dd from hda1 to md0. Once you have the data copied to /dev/md0, you'll need to reboot the system and this time specify /dev/md0 as your root device (you may need to remake your initrd before you reboot; I don't know if your initrd starts /dev/md0, but it needs to). Once you are running with md0 as your root partition, you need to run grub to install the boot sectors on the md0 devices. You can't use grub-install though, it gets it wrong.
Run the grub command, then enter these commands:

device (hd0) /dev/sda
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
device (hd0) /dev/hdc
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst

That will install grub on the master boot record of hdc and sda, and in both cases grub will look to whatever drive it is running on for the files to boot instead of going to a specific drive.

Next you need to modify /etc/grub.conf and change all the root= lines to be root=/dev/md0, and you need to modify /etc/fstab the same way. Then you probably need to remake all the initrd images so that they contain the update. Once you've done that, shut the system down, remove /dev/hda from the machine entirely, move /dev/hdc to /dev/hda, then reboot. The system should boot up to your raid array just fine. If it doesn't work, you can always put your old hda back in and boot up from it. If it does work, shut the machine down one more time, put the old hda in as hdc, boot back up (which should boot from hda to the md0 root; it should not touch hdc), add hdc to the raid array, let it resync, and then the final step is to run the grub install on hdc to make it match the other two disks. After that, you have a fully functional and booting raid1 array.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
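[Editor's sketch of those follow-up edits. The paths, the old root device name, and the use of Fedora/RHEL-style grub.conf and mkinitrd are assumptions; adjust for your distro.]

    # Point both the bootloader entries and fstab at the raid device:
    sed -i 's|root=/dev/hda1|root=/dev/md0|g' /boot/grub/grub.conf
    sed -i 's|^/dev/hda1|/dev/md0|' /etc/fstab
    # Rebuild the initrd so it assembles /dev/md0 before mounting /:
    mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)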
Re: Raid-10 mount at startup always has problem
On Sun, 2007-10-28 at 20:21 -0400, Bill Davidsen wrote: Doug Ledford wrote: On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote: On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote: The partition table is the single, (mostly) universally recognized arbiter of what possible data might be on the disk. Having a partition table may not make mdadm recognize the md superblock any better, but it keeps all that other stuff from even trying to access data that it doesn't have a need to access and prevents random luck from turning your day bad. on a pc maybe, but that is 20 years old design. So? Unix is 35+ year old design, I suppose you want to switch to Vista then? partition table design is limited because it is still based on C/H/S, which do not exist anymore. Put a partition table on a big storage, say a DMX, and enjoy a 20% performance decrease. Because you didn't stripe align the partition, your bad. Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID you're about to create), or ??? I don't notice my FC6 or FC7 install programs using any special partition location to start, I have only run (tried to run) FC8-test3 for the live CD, so I can't say what it might do. CentOS4 didn't do anything obvious, either, so unless I really misunderstand your position at redhat, that would be your bad. ;-) If you mean start a partition on a pseudo-CHS boundary, fdisk seems to use what it thinks are cylinders for that. Please clarify what alignment provides a performance benefit. Luca was specifically talking about the big multi-terabyte to petabyte hardware arrays on the market. DMX, DDN, and others. When they export a volume to the OS, there is an underlying stripe layout to that volume. If you don't use any partition table at all, you are automatically aligned with their stripes. However, if you do, then you have to align your partition on a chunk boundary or else performance drops pretty dramatically as a result of more writes than not crossing chunk boundaries unnecessarily. It's only relevant when you are talking about a raid device that shows the OS a single logical disk made from lots of other disks. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
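[Editor's illustration of what "align the partition" means in practice. This is a sketch only: the device name is invented, a reasonably current parted is assumed, and 1 MiB is chosen simply because it is a multiple of common chunk sizes such as 64K, 128K and 256K.]

    # Start the first (and only) partition of the exported LUN on a 1 MiB boundary:
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB 100%
    parted /dev/sdb unit s print    # verify the start sector is a multiple of the chunk size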
Re: Raid-10 mount at startup always has problem
On Mon, 2007-10-29 at 09:22 -0400, Bill Davidsen wrote:

Consider a storage with 64 spt, an io size of 4k and a partition starting at sector 63. The first io request will require two ios from the storage (one for sector 63, and one for sectors 64 to 70). The next 7 ios (71-78, 79-86, 87-94, 95-102, 103-110, 111-118, 119-126) will be on the same track; the 8th will again require to be split, and so on. This causes the storage to do 1 unnecessary io every 8. YMMV.

No one makes drives with fixed spt any more. Your assumptions are a decade out of date.

You're missing the point: it's not about drive tracks, it's about array tracks, aka chunks. A 64k write, that should write to one and only one chunk, ends up spanning two. That increases the amount of writing the array has to do and the number of disks it busies for a typical single I/O operation.

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
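[Editor's worked example of the chunk-spanning effect, with assumed numbers: a 64 KiB chunk array and a partition that starts at the traditional sector 63.]

    # Partition data begins at byte 63 * 512 = 32256.
    # A 64 KiB (65536-byte) write at file offset 0 therefore covers bytes 32256..97791.
    # The array's first chunk boundary is at byte 65536, so that write touches two chunks
    # (and, on raid5, potentially two parity updates) instead of one.
    # Starting the partition at sector 128 (byte 65536) keeps such writes inside one chunk.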
Re: Time to deprecate old RAID formats?
On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:

Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and its UUID. Therein would have been the problem.

Yes, I read the patch. I don't like that code, as I don't like most of what has been put in mkinitrd from 5.0 onward. Imho the correct thing here would not have been copying the existing mdadm.conf but generating a safe one from the output of mdadm -D (note -D, not -E).

I'm not sure I'd want that. Besides, what makes you say -D is safer than -E?

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband
Re: Raid-10 mount at startup always has problem
On Mon, 2007-10-29 at 09:18 +0100, Luca Berra wrote: On Sun, Oct 28, 2007 at 10:59:01PM -0700, Daniel L. Miller wrote: Doug Ledford wrote: Anyway, I happen to *like* the idea of using full disk devices, but the reality is that the md subsystem doesn't have exclusive ownership of the disks at all times, and without that it really needs to stake a claim on the space instead of leaving things to chance IMO. I've been re-reading this post numerous times - trying to ignore the burgeoning flame war :) - and this last sentence finally clicked with me. I am sorry Daniel, when i read Doug and Bill, stating that your issue was not having a partition table, i immediately took the bait and forgot about your original issue. I never said *his* issue was lack of partition table, I just said I don't recommend that because it's flaky. The last statement I made about his issue was to ask about whether the problem was happening during initrd time or sysinit time to try and identify if it was failing before or after / was mounted to try and determine where the issue might lay. Then we got off on the tangent about partitions, and at the same time Neil started asking about udev, at which point it came out that he's running ubuntu, and as much as I would like to help, the fact of the matter is that I've never touched ubuntu and wouldn't have the faintest clue, so I let Neil handle it. At which point he found that the udev scripts in ubuntu are being stupid, and from the looks of it are the cause of the problem. So, I've considered the initial issue root caused for a bit now. like udev/hal that believes it knows better than you about what you have on your disks. but _NEITHER OF THESE IS YOUR PROBLEM_ imho Actually, it looks like udev *is* the problem, but not because of partition tables. I am also sorry to say that i fail to identify what the source of your problem is, we should try harder instead of flaming between us. We can do both, or at least I can :-P Is it possible to reproduce it on the live system e.g. unmount, stop array, start it again and mount. I bet it will work flawlessly in this case. then i would disable starting this array at boot, and start it manually when the system is up (stracing mdadm, so we can see what it does) I am also wondering about this: md: md0: raid array is not clean -- starting background reconstruction does your system shut down properly? do you see the message about stopping md at the very end of the reboot/halt process? The root cause is that as udev adds his sata devices one at a time, on each add of the sata device it invokes mdadm to see if there is an array to start, and it doesn't use incremental mode on mdadm. As a result, as soon as there are 3 out of the 4 disks present, mdadm starts the array in degraded mode. It's probably a race between the mdadm started on the third disk and mdadm started on the fourth disk that results in the message about being unable to set the array info. The one loosing the race gets the error as the other one has already manipulated the array (for example, the 4th disk mdadm could be trying to add the first disk to the array, but it's already there, so it gets this error and bails). So, as much as you might dislike mkinitrd since 5.0 Luca, it doesn't have this particular problem ;-) In the initrd we produce, it loads all the SCSI/SATA/etc drivers first, then calls mkblkdevs which forces all of the devices to appear in /dev, and only then does it start the mdadm/lvm configuration. 
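[Editor's sketch of the incremental assembly mode mentioned above, driven per-device. The device names are invented, and in practice a distro would invoke this from its udev rules rather than by hand.]

    # Each hotplug event feeds one member device to mdadm; nothing is force-started early:
    mdadm --incremental /dev/sda1
    mdadm --incremental /dev/sdb1
    mdadm --incremental /dev/sdc1
    mdadm --incremental /dev/sdd1   # only once all members are present does the array start clean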
Daniel, I make no promises what so ever that this will even work at all as it may fail to load modules or all other sorts of weirdness, but if you want to test the theory, you can download the latest mkinitrd from fedoraproject.org, then use it to create an initrd image under some other name than your default image name, then manually edit your boot to have an extra stanza that uses the mkinitrd generated initrd image instead of the ubuntu image, and then just see if it brings the md device up cleanly instead of in degraded mode. That should be a fairly quick and easy way to test if Neil's analysis of the udev script was right. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Raid-10 mount at startup always has problem
On Sun, 2007-10-28 at 22:59 -0700, Daniel L. Miller wrote: Doug Ledford wrote: Anyway, I happen to *like* the idea of using full disk devices, but the reality is that the md subsystem doesn't have exclusive ownership of the disks at all times, and without that it really needs to stake a claim on the space instead of leaving things to chance IMO. I've been re-reading this post numerous times - trying to ignore the burgeoning flame war :) - and this last sentence finally clicked with me. As I'm a novice Linux user - and not involved in development at all - bear with me if I'm stating something obvious. And if I'm wrong - please be gentle! 1. md devices are not native to the kernel - they are created/assembled/activated/whatever by a userspace program. My real point was that md doesn't own the disks, meaning that during startup, and at other points in time, software other than the md stack can attempt to use the disk directly. That software may be the linux file system code, linux lvm code, or in some case entirely different OS software. Given that these situations can arise, using a partition table to mark the space as in use by linux is what I meant by staking a claim. It doesn't keep the linux kernel from using it because it thinks it owns it, but it does stop other software from attempting to use it. 2. Because md devices are non-native devices, and are composed of native devices, the kernel may try to use those components directly without going through md. In the case of superblocks at the end, yes. The kernel may see the underlying file system or lvm disk label even if the md device is not started. 3. Creating a partition table somehow (I'm still not clear how/why) reduces the chance the kernel will access the drive directly without md. The partition table is more to tell other software that linux owns the space and to avoid mistakes where someone runs fdisk on a disk accidentally and wipes out your array because they added a partition table on what they thought was a new disk (more likely when you have large arrays of disks attached via fiber channel or such than in a single system). Putting the superblock at the beginning of the md device is the main thing that guarantees the kernel will never try to use what's inside the md device without the md device running. These concepts suddenly have me terrified over my data integrity. Is the md system so delicate that BOOT sequence can corrupt it? If you have your superblocks at the end of the devices, then there are certain failure modes that can cause data inconsistencies. Generally speaking they won't harm the array itself, it's just that the different disks in a raid1 array might contain different data. If you don't use partitions, then the majority of failure scenarios involve things like accidental use of fdisk on the unpartitioned device, access of the device by other OSes, that sort of thing. How is it more reliable AFTER the completed boot sequence? Once the array is up and running, the constituent disks are marked as busy in the operating system, which prevents other portions of the linux kernel and other software in general from getting at the md owned disks. Nothing in the documentation (that I read - granted I don't always read everything) stated that partitioning prior to md creation was necessary - in fact references were provided on how to use complete disks. Is there an official position on, To Partition, or Not To Partition? Particularly for my application - dedicated Linux server, RAID-10 configuration, identical drives. 
And if partitioning is the answer - what do I need to do with my live dataset? Drop one drive, partition, then add the partition as a new drive to the set - and repeat for each drive after the rebuild finishes? You *probably*, and I emphasize probably, don't need to do anything. I emphasize it because I don't know enough about your situation to say so with 100% certainty. If I'm wrong, it's not my fault. Now, that said, here's the gist of the situation. There are specific failure cases that can corrupt data in an md raid1 array mainly related to superblocks at the end of devices. There are specific failure cases where an unpartitioned device can be accidentally partitioned or where a partitioned md array in combination with superblocks at the end and using a whole disk device can be misrecognized as a partitioned normal drive. There are, on the other hand, cases where it's perfectly safe to use unpartitioned devices, or superblocks at the end of devices. My recommendation when someone asks what to do is to use partitions, and to use superblocks at the beginning of the devices (except for /boot since that isn't supported at the moment). The reason I give that advice is that I assume if a person knows enough to know when it's safe to use unpartitioned devices, like Luca, then they wouldn't be asking me for advice. So since
Re: Implementing low level timeouts within MD
On Sun, 2007-10-28 at 01:27 -0500, Alberto Alonso wrote: On Sat, 2007-10-27 at 19:55 -0400, Doug Ledford wrote: On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote: Regardless of the fact that it is not MD's fault, it does make software raid an invalid choice when combined with those drivers. A single disk failure within a RAID5 array bringing a file server down is not a valid option under most situations. Without knowing the exact controller you have and driver you use, I certainly can't tell the situation. However, I will note that there are times when no matter how well the driver is written, the wrong type of drive failure *will* take down the entire machine. For example, on an SPI SCSI bus, a single drive failure that involves a blown terminator will cause the electrical signaling on the bus to go dead no matter what the driver does to try and work around it. Sorry I thought I copied the list with the info that I sent to Richard. Here is the main hardware combinations. --- Excerpt Start Certainly. The times when I had good results (ie. failed drives with properly degraded arrays have been with old PATA based IDE controllers built in the motherboard and the Highpoint PATA cards). The failures (ie. single disk failure bringing the whole server down) have been with the following: * External disks on USB enclosures, both RAID1 and RAID5 (two different systems) Don't know the actual controller for these. I assume it is related to usb-storage, but can probably research the actual chipset, if it is needed. OK, these you don't get to count. If you run raid over USB...well...you get what you get. IDE never really was a proper server interface, and SATA is much better, but USB was never anything other than a means to connect simple devices without having to put a card in your PC, it was never intended to be a raid transport. * Internal serverworks PATA controller on a netengine server. The server if off waiting to get picked up, so I can't get the important details. 1 PATA failure. * Supermicro MB with ICH5/ICH5R controller and 2 RAID5 arrays of 3 disks each. (only one drive on one array went bad) * VIA VT6420 built into the MB with RAID1 across 2 SATA drives. * And the most complex is this week's server with 4 PCI/PCI-X cards. But the one that hanged the server was a 4 disk RAID5 array on a RocketRAID1540 card. And 3 SATA failures, right? I'm assuming the Supermicro is SATA or else it has more PATA ports than I've ever seen. Was the RocketRAID card in hardware or software raid mode? It sounds like it could be a combination of both, something like hardware on the card, and software across the different cards or something like that. What kernels were these under? --- Excerpt End I wasn't even asking as to whether or not it should, I was asking if it could. It could, but without careful control of timeouts for differing types of devices, you could end up making the software raid less reliable instead of more reliable overall. Even if the default timeout was really long (ie. 1 minute) and then configurable on a per device (or class) via /proc it would really help. It's a band-aid. It's working around other bugs in the kernel instead of fixing the real problem. Generally speaking, most modern drivers will work well. It's easier to maintain a list of known bad drivers than known good drivers. That's what has been so frustrating. The old PATA IDE hardware always worked and the new stuff is what has crashed. In all fairness, the SATA core is still relatively young. 
IDE was around for eons, where as Jeff started the SATA code just a few years back. In that time I know he's had to deal with both software bugs and hardware bugs that would lock a SATA port up solid with no return. What it sounds like to me is you found some of those. Be careful which hardware raid you choose, as in the past several brands have been known to have the exact same problem you are having with software raid, so you may not end up buying yourself anything. (I'm not naming names because it's been long enough since I paid attention to hardware raid driver issues that the issues I knew of could have been solved by now and I don't want to improperly accuse a currently well working driver of being broken) I have settled for 3ware. All my tests showed that it performed quite well and kicked drives out when needed. Of course, I haven't had a bad drive on a 3ware production server yet, so I may end up pulling the little bit of hair I have left. I am now rushing the RocketRAID 2220 into production without testing due to it being the only thing I could get my hands on. I'll report any experiences as they happen. Thanks for all the info, Alberto -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford
Re: Time to deprecate old RAID formats?
On Mon, 2007-10-29 at 22:44 +0100, Luca Berra wrote: On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote: On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote: Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and it's UUID. There in would have been the problem. yes, i read the patch, i don't like that code, as i don't like most of what has been put in mkinitrd from 5.0 onward. in case you wonder i am referring to things like emit dm create $1 $UUID $(/sbin/dmsetup table $1) I make no judgments on the dm setup stuff, I know too little about the dm stack to be qualified. Imho the correct thing here would not have been copying the existing mdadm.conf but generating a safe one from output of mdadm -D (note -D, not -E) I'm not sure I'd want that. Besides, what makes you say -D is safer than -E? mdadm -D /dev/mdX works on an active md device, so i strongly doubt the information gathered from there would be stale while mdadm -Es will scan disk devices for md superblock, thus possibly even finding stale superblocks or leftovers. I would strongly recommend against blindly doing mdadm -Es /etc/mdadm.conf and not supervising the result. Well, I agree that blindly doing mdadm -Esb mdadm.conf would be bad, but that's not what mkinitrd is doing, it's using the mdadm.conf that's in place so you can update the mdadm.conf whenever you find it appropriate. And I agree -D has less chance of finding a stale superblock, but it's also true that it has no chance of finding non-stale superblocks on devices that aren't even started. So, as a method of getting all the right information in the event of system failure and rescuecd boot, it leaves something to be desired ;-) In other words, I'd rather use a mode that finds everything and lets me remove the stale than a mode that might miss something. But, that's a matter of personal choice. Considering that we only ever update mdadm.conf automatically during installs, after that the user makes manual mdadm.conf changes themselves, they are free to use whichever they prefer. The one thing I *do* like about mdadm -E above -D is it includes the superblock format in its output. The one thing I don't like, is it almost universally gets the name wrong. What I really want is a brief query format that both gives me the right name (-D) and the superblock format (-E). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
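[Editor's note to make the -D / -E distinction concrete; both scan forms are standard mdadm modes, and the redirect target is just an example.]

    mdadm --detail --scan              # -D: reports only arrays that are currently assembled
    mdadm --examine --scan             # -E: scans member devices for superblocks, assembled or not
    mdadm --examine --scan > /tmp/mdadm.conf.proposed   # review before replacing /etc/mdadm.conf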
Re: Raid-10 mount at startup always has problem
On Mon, 2007-10-29 at 22:29 +0100, Luca Berra wrote: At which point he found that the udev scripts in ubuntu are being stupid, and from the looks of it are the cause of the problem. So, I've considered the initial issue root caused for a bit now. It seems i made an idiot of myself by missing half of the thread, and i even knew ubuntu was braindead in their use of udev at startup, since a similar discussion came up on the lvm or the dm-devel mailing list (that time iirc it was about lvm over multipath) Nah. Even if we had concluded that udev was to blame here, I'm not entirely certain that we hadn't left Daniel with the impression that we suspected it versus blamed it, so reiterating it doesn't hurt. And I'm sure no one has given him a fix for the problem (although Neil did request a change that will give debug output, but not solve the problem), so not dropping it entirely would seem appropriate as well. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote: On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote: On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote: On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote: In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And it's matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed. Maybe we need a 2.0 superblock that contains the physical size of every component, not just the logical size that is used for RAID. That way if the size read from the superblock does not match the size of the device, you know that this device should be ignored. In my case that wouldn't have helped. What actually happened was I create a two disk raid1 device using whole devices and a version 1.0 superblock. I know a version 1.1 wouldn't work because it would be where the boot sector needed to be, and wasn't sure if a 1.2 would work either. Then I tried to make the whole disk raid device a partitioned device. This obviously put a partition table right where the BIOS and the kernel would look for it whether the raid was up or not. I also the only reason i can think for the above setup not working is udev mucking with your device too early. It was a combination of boot loader issues and an inability to get this device partitioned up the way I needed. I went with a totally different setup in the end because I essentially started out with a two drive raid1 for the OS and another 2 drive raid1 for data, but I wanted to span them and I was attempting to do so with a mixture of md raid and lvm physical volume striping. Didn't work. tried doing an lvm setup to split the raid up into chunks and that didn't work either. So, then I redid the partition table and created individual raid devices from the partitions. But, I didn't think to zero the old whole disk superblock. When I made the individual raid devices, I used all 1.1 superblocks. So, when it was all said and done, I had a bunch of partitions that looked like a valid set of partitions for the whole disk raid device and a whole disk raid superblock, but I also had superblocks in each partition with their own bitmaps and so on. OK It was only because I wasn't using mdadm in the initrd and specifying uuids that it found the right devices to start and ignored the whole disk devices. But, when I later made some more devices and went to update the mdadm.conf file using mdadm -Eb, it found the devices and added it to the mdadm.conf. If I hadn't checked it before remaking my initrd, it would have hosed the system. And it would have passed all the above is not clear to me, afair redhat initrd still uses raidautorun, RHEL does, but this is on a personal machine I installed Fedora an and latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and starts the needed devices using the UUID. My first sentence above should have read that I *was* using mdadm. which iirc does not works with recent superblocks, so you used uuids on kernel command line? or you use something else for initrd? why would remaking the initrd break it? Remaking the initrd installs the new mdadm.conf file, which would have then contained the whole disk devices and it's UUID. There in would have been the problem. the tests you can throw at it. Quite simply, there is no way to tell the difference between those two situations with 100% certainty. 
Mdadm tries to be smart and start the newest devices, but Luca's original suggestion of skip the partition scanning in the kernel and figure it out from user space would not have shown mdadm the new devices and would have gotten it wrong every time. yes, in this particular case it would have, congratulation you found a new creative way of shooting yourself in the feet. Creative, not so much. I just backed out of what I started and tried something else. Lots of people do that. maybe mdadm should do checks when creating a device to prevent this kind of mistakes. i.e. if creating an array on a partition, check the whole device for a superblock and refuse in case it finds one if creating an array on a whole device that has a partition table, either require --force, or check for superblocks in every possible partition. What happens if you add the partition table *after* you make the whole disk device and there are stale superblocks in the partitions? This still isn't infallible. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
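Something along these lines would be the manual version of that first check today -- a sketch only, with example device names:

  # before creating an array on /dev/sdb1, look for a superblock on
  # the whole device:
  mdadm --examine /dev/sdb

  # if that reports a superblock you no longer want, clear it before
  # building anything on the partitions:
  mdadm --zero-superblock /dev/sdb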
Re: Raid-10 mount at startup always has problem
On Sun, 2007-10-28 at 14:37 +0100, Luca Berra wrote: On Sat, Oct 27, 2007 at 04:47:30PM -0400, Doug Ledford wrote: Most of the time it does. But those times where it can fail, the failure is due to not taking the precautions necessary to prevent it: aka labeling disk usage via some sort of partition table/disklabel/etc. I strongly disagree. the failure is badly designed software. Then you need to blame Ingo who made putting the superblock at the end of the device the standard. If the superblock were always at the beginning, then this whole argument would be moot. Things would be reliable the way you want. Using whole disk devices isn't a means of organizing space. It's a way to get a rather miniscule amount of space back by *not* organizing the space. if i am using, say lvm to organize disk space, a partition table is unnecessary to the organization, and it is natural not using them. If you are using straight lvm then you don't have this problem anyway. Lvm doesn't allow the underlying physical device to *look* like a valid, partitioned, single device. Md does when the superblock is at the end. This whole argument seems to boil down to you wanting to perfectly optimize your system for your use case which includes controlling the environment enough that you know it's safe to not partition your disks, where as I argue that although this works in controlled environments, it is known to have failure modes in other environments, and I would be totally remiss if I recommended to my customers that they should take the risk that you can ignore because of your controlled environment since I know a lot of my customers *don't* have a controlled environment such as you do. The whole argument to me boils down to the fact that not having a partition table on a device is possible, and software that do not consider this eventuality is flawed, It's simply not possible to 100% certain differentiate between an md whole disk partitioned device with a superblock at the end and a regular device. Period. You can try to be clever, but you can also get tripped up. The flaw is not with the software, it's with a design that allowed this to happen. and recommnding to work-around flawed software is just digging your head in the sand. If a design is broken but in place, I have no choice but to work around it. Anything else is just stupid. But i believe i did not convince you one ounce more than you convinced me, so i'll quit this thread which is getting too far. Regards, L. -- Luca Berra -- [EMAIL PROTECTED] Communication Media Services S.r.l. /\ \ / ASCII RIBBON CAMPAIGN XAGAINST HTML MAIL / \ - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote: On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote: On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote: On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote: just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device. In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And it's matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed. then just ignore the device and log a warning, instead of doing a random choice. L. It also happened to be my OS drive pair. Ignoring it would have rendered the machine unusable. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Raid-10 mount at startup always has problem
On Sat, 2007-10-27 at 09:50 +0200, Luca Berra wrote: On Fri, Oct 26, 2007 at 03:26:33PM -0400, Doug Ledford wrote: On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote: On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote: The partition table is the single, (mostly) universally recognized arbiter of what possible data might be on the disk. Having a partition table may not make mdadm recognize the md superblock any better, but it keeps all that other stuff from even trying to access data that it doesn't have a need to access and prevents random luck from turning your day bad. on a pc maybe, but that is 20 years old design. So? Unix is 35+ year old design, I suppose you want to switch to Vista then? unix is a 35+ year old design that evolved in time, some ideas were kept, some ditched. BSD disk labels are still in use, SunOS disk labels are still in use, partition tables are somewhat on the way out, but only because they are being replaced by the new EFI disk partitioning method. The only place where partitionless devices is common is in dedicated raid boxes where the raid controller is the only thing that will *ever* see that disk. Sometimes they do it on big SAN/NAS stuff because they don't want to align the partition table to the underlying device's stripe layout, but even then they do so in a tightly controlled environment where they know exactly which machines will be allowed to even try and access the device. partition table design is limited because it is still based on C/H/S, which do not exist anymore. Put a partition table on a big storage, say a DMX, and enjoy a 20% performance decrease. Because you didn't stripe align the partition, your bad. :) by default fdisk misalignes partition tables and aligning them is more complex than just doing without. So. You really need to take the time and to understand the alignment of the device because then and only then can you pass options to mke2fs to align the fs metadata with the stripes as well thereby buying you ever more performance than just leaving off the partition table (assuming that's what you use, I don't know if other mkfs programs have the same options for aligning metadata with stripes). And if you take the time to understand the underlying stripe layout for the mkfs stuff, then you can use the same information to align the partition table. Oh, and let's not go into what can happen if you're talking about a dual boot machine and what Windows might do to the disk if it doesn't think the disk space is already spoken for by a linux partition. Why the hell should the existance of windows limit the possibility of linux working properly. Linux works properly with a partition table, so this is a specious statement. It should also work properly without one. Most of the time it does. But those times where it can fail, the failure is due to not taking the precautions necessary to prevent it: aka labeling disk usage via some sort of partition table/disklabel/etc. If i have a pc that dualboots windows i will take care of using the common denominator of a partition table, if it is my big server i will probably not. since it won't boot anything else than Linux. Doesn't really gain you anything, but your choice. Besides, the question wasn't why shouldn't Luca Berra use whole disk devices, it was why I don't recommend using whole disk devices, and my recommendation wasn't based in the least bit upon a single person's use scenario. 
If i am the only person in the world that believes partition tables should not be required then i'll shut up. On the opposite, i once inserted an mmc memory card, which had been initialized on my mobile phone, into the mmc slot of my laptop, and was faced with a load of errors about mmcblk0 having an invalid partition table. So? The messages are just informative, feel free to ignore them. but did not anaconda propose to wipe unpartitioned disks? Did you stick your mmc card in there during the install of the OS? That's the only time anaconda ever runs, and therefore the only time it ever checks your devices. It makes sense that during the initial install, when the OS is only configured to see locally connected devices, or possibly iSCSI devices that you have specifically told it to probe, that it would then ask you the question about those devices. Other network attached or shared devices are generally added after the initial install. The phone dictates the format, only a moron would say otherwise. But, then again, the phone doesn't care about interoperability and many other issues on memory cards that it thinks it owns, so only a moron would argue that because a phone doesn't use a partition table that nothing else in the computer realm needs to either. i don't count myself as a moron, what i am trying to say is that partition tables are one way of organizing disk space, not the only one. Using whole disk devices isn't a means of organizing space. It's a way to get a rather miniscule amount of space back by *not* organizing the space.
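To put some numbers on the alignment point a few paragraphs up: for a 3-disk raid5 with a 128k chunk, the idea is something like the following (block size and device name are examples; older mke2fs only understands stride=, newer releases also take stripe-width=):

  # stride       = chunk size / fs block size   = 128k / 4k = 32
  # stripe-width = stride * nr of data disks    = 32 * 2    = 64
  mke2fs -j -b 4096 -E stride=32,stripe-width=64 /dev/md0

And if the underlying disks are partitioned, the same arithmetic says to start the first partition on a chunk boundary -- e.g. sector 256 rather than the traditional sector 63 for a 128k chunk with 512-byte sectors.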
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote: Actually, after doing some research, here's what I've found: * When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today. It does not include support for any version 1 superblocks. It's noted in the code that it should, but doesn't yet. However, the interesting bit is that they rearchitected grub so that any reads from a device during boot are filtered through the stack that provides the device. So, when you tell grub2 to set root=md0, then all reads from md0 are filtered through the raid module, and the raid module then calls the reads from the IO module, which then does the actual int 13 call. This allows the raid module to read superblocks, detect the raid level and layout, and actually attempt to work on raid0/1/5 devices (at the moment). It also means that all the calls from the ext2 module when it attempts to read from the md device are filtered through the md module and therefore it would be simple for it to implement an offset into the real device to get past the version 1.1/1.2 superblocks. In terms of resilience, the raid module actually tries to utilize the raid itself during any failure. On raid1 devices, if it gets a read failure on any block it attempts to read, then it goes to the next device in the raid1 array and attempts to read from it. So, in the event that your normal boot disk suffers a sector failure in your actual kernel image, but the raid disk is otherwise fine, grub2 should be able to boot from the kernel image on the next raid device. Similarly, on raid5 it will attempt to recover from a block read failure by using the parity to generate the missing data unless the array is already in degraded mode at which point it will bail on any read failure. The lvm module attempts to properly map extents to physical volumes and allows you to have your bootable files in lvm logical volume. In that case you set root=logical-volume-name-as-it-appears-in-/dev/mapper and the lvm module then figures out what physical volumes contain that logical volume and where the extents are mapped and goes from there. I should note that both the lvm code and raid code are simplistic at the moment. For example, the raid5 mapping only supports the default raid5 layout. If you use any other layout, game over. Getting it to work with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but getting it to the point where it handles all the relevant setups properly would require a reasonable amount of coding. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Implementing low level timeouts within MD
On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote: On Fri, 2007-10-26 at 15:00 -0400, Doug Ledford wrote: This isn't an md problem, this is a low level disk driver problem. Yell at the author of the disk driver in question. If that driver doesn't time things out and return errors up the stack in a reasonable time, then it's broken. Md should not, and realistically can not, take the place of a properly written low level driver. I am not arguing whether or not MD is at fault, I know it isn't. Regardless of the fact that it is not MD's fault, it does make software raid an invalid choice when combined with those drivers. A single disk failure within a RAID5 array bringing a file server down is not a valid option under most situations. Without knowing the exact controller you have and driver you use, I certainly can't tell the situation. However, I will note that there are times when no matter how well the driver is written, the wrong type of drive failure *will* take down the entire machine. For example, on an SPI SCSI bus, a single drive failure that involves a blown terminator will cause the electrical signaling on the bus to go dead no matter what the driver does to try and work around it. I wasn't even asking as to whether or not it should, I was asking if it could. It could, but without careful control of timeouts for differing types of devices, you could end up making the software raid less reliable instead of more reliable overall. Should is a relative term, could is not. If the MD code can not cope with poorly written drivers then a list of valid drivers and cards would be nice to have (that's why I posted my ... when it works and when it doesn't, I was trying to come up with such a list). Generally speaking, most modern drivers will work well. It's easier to maintain a list of known bad drivers than known good drivers. I only got 1 answer with brand specific information to figure out when it works and when it doesn't work. My recent experience is that too many drivers seem to have the problem so software raid is no longer an option for any new systems that I build, and as time and money permits I'll be switching to hardware/firmware raid all my legacy servers. Be careful which hardware raid you choose, as in the past several brands have been known to have the exact same problem you are having with software raid, so you may not end up buying yourself anything. (I'm not naming names because it's been long enough since I paid attention to hardware raid driver issues that the issues I knew of could have been solved by now and I don't want to improperly accuse a currently well working driver of being broken) -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
device, they automatically only use the second method of boot loader installation. This gives them the freedom to be able to modify the second stage boot loader on a boot disk by boot disk basis. The downside to this is that they need lots of room after the MBR and before the first partition in order to put their core.img file in place. I *think*, and I'll know for sure later today, that the core.img file is generated during grub install from the list of optional modules you specify during setup. Eg., the pc module gives partition table support, the lvm module lvm support, etc. You list the modules you need, and grub then builds a core.img out of all those modules. The normal amount of space between the MBR and the first partition is (sectors_per_track - 1). For standard disk geometries, that basically leaves 254 sectors, or 127k of space. This might not be enough for your particular needs if you have a complex boot environment. In that case, you would need to bump at least the starting track of your first partition to make room for your boot loader. Unfortunately, how is a person to know how much room their setup needs until after they've installed and it's too late to bump the partition table start? They can't. So, that's another thing I think I will check out today, what the maximum size of grub2 might be with all modules included, and what a common size might be. Based on your description, it sounds as if grub2 may not have given adequate thought to what users other than the authors might need (that may be a premature conclusion). I have multiple installs on several of my machines, and I assume that the grub2 for 32 and 64 bit will be different. Thanks for the research. No, not really. The grub command on the two is different, but they actually build the boot sector out of 16 bit non-protected mode code, just like DOS. So either one would build the same boot sector given the same config. And you can always use the same trick I've used in the past of creating a large /boot partition (say 250MB) and using that same partition as /boot in all of your installs. Then they share a single grub config (while the grub binaries are in the individual / partitions) and from the single grub instance you can boot to any of the installs, as well as a kernel update in any install updates that global grub config. The other option is to use separate /boot partitions and chain load the grub instances, but I find that clunky in comparison. Of course, in my case I also made /lib/modules its own partition and also shared it between all the installs so that I could manually edit the various kernel boot params to specify different root partitions and in so doing I could boot a RHEL5 kernel using a RHEL4 install and vice versa. But if you do that, you have to manually patch /etc/rc.d/rc.sysinit to mount the /lib/modules partition before ever trying to do anything with modules (and you have to mount it rw so they can do a depmod if needed), then remount it ro for the fsck, then it gets remounted rw again after the fs check. It was a pain in the ass to maintain because every update to initscripts would wipe out the patch and if you forgot to repatch the file, the system wouldn't boot and you'd have to boot into another install, mount the / partition of the broken install, patch the file, then it would work again in that install. 
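A shared grub.conf along the lines described might look like this (kernel versions, device and volume names are purely illustrative):

  default=0
  timeout=5
  title RHEL5 (2.6.18-8.el5)
          root (hd0,0)
          kernel /vmlinuz-2.6.18-8.el5 ro root=/dev/VolGroup00/rhel5root
          initrd /initrd-2.6.18-8.el5.img
  title RHEL4 (2.6.9-55.EL)
          root (hd0,0)
          kernel /vmlinuz-2.6.9-55.EL ro root=/dev/VolGroup00/rhel4root
          initrd /initrd-2.6.9-55.EL.img

Each install's kernel update then lands in the one file, which is the whole point of sharing /boot.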
-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote: On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote: In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And it's matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed. Maybe we need a 2.0 superblock that contains the physical size of every component, not just the logical size that is used for RAID. That way if the size read from the superblock does not match the size of the device, you know that this device should be ignored. In my case that wouldn't have helped. What actually happened was I create a two disk raid1 device using whole devices and a version 1.0 superblock. I know a version 1.1 wouldn't work because it would be where the boot sector needed to be, and wasn't sure if a 1.2 would work either. Then I tried to make the whole disk raid device a partitioned device. This obviously put a partition table right where the BIOS and the kernel would look for it whether the raid was up or not. I also tried doing an lvm setup to split the raid up into chunks and that didn't work either. So, then I redid the partition table and created individual raid devices from the partitions. But, I didn't think to zero the old whole disk superblock. When I made the individual raid devices, I used all 1.1 superblocks. So, when it was all said and done, I had a bunch of partitions that looked like a valid set of partitions for the whole disk raid device and a whole disk raid superblock, but I also had superblocks in each partition with their own bitmaps and so on. It was only because I wasn't using mdadm in the initrd and specifying uuids that it found the right devices to start and ignored the whole disk devices. But, when I later made some more devices and went to update the mdadm.conf file using mdadm -Eb, it found the devices and added it to the mdadm.conf. If I hadn't checked it before remaking my initrd, it would have hosed the system. And it would have passed all the tests you can throw at it. Quite simply, there is no way to tell the difference between those two situations with 100% certainty. Mdadm tries to be smart and start the newest devices, but Luca's original suggestion of skip the partition scanning in the kernel and figure it out from user space would not have shown mdadm the new devices and would have gotten it wrong every time. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
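For the record, avoiding that trap is one command per disk once the old whole-disk array is out of use -- device names here are just examples:

  # forget the old whole-disk superblocks so a later -E scan can
  # never resurrect them:
  mdadm --zero-superblock /dev/sda
  mdadm --zero-superblock /dev/sdb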
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote: Neil Brown wrote: On Thursday October 25, [EMAIL PROTECTED] wrote: I didn't get a reply to my suggestion of separating the data and location... No. Sorry. ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data format (0.9 vs 1.0) and a location (end,start,offset4k)? This would certainly make things a lot clearer to new (and old!) users: mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k or mdadm --create /dev/md0 --metadata 1.0 --meta-location start or mdadm --create /dev/md0 --metadata 1.0 --meta-location end I'm happy to support synonyms. How about --metadata 1-end --metadata 1-start ?? Offset? Do you like 1-offset4k or maybe 1-start4k or even 1-start+4k for that? The last is most intuitive but I don't know how you feel about the + in there. Actually, after doing some research, here's what I've found: * When using lilo to boot from a raid device, it automatically installs itself to the mbr, not to the partition. This can not be changed. Only 0.90 and 1.0 superblock types are supported because lilo doesn't understand the offset to the beginning of the fs otherwise. * When using grub to boot from a raid device, only 0.90 and 1.0 superblocks are supported[1] (because grub is ignorant of the raid and it requires the fs to start at the start of the partition). You can use either MBR or partition based installs of grub. However, partition based installs require that all bootable partitions be in exactly the same logical block address across all devices. This limitation can be an extremely hazardous limitation in the event a drive dies and you have to replace it with a new drive as newer drives may not share the older drive's geometry and will require starting your boot partition in an odd location to make the logical block addresses match. * When using grub2, there is supposedly already support for raid/lvm devices. However, I do not know if this includes version 1.0, 1.1, or 1.2 superblocks. I intend to find that out today. If you tell grub2 to install to an md device, it searches out all constituent devices and installs to the MBR on each device[2]. This can't be changed (at least right now, probably not ever though). So, given the above situations, really, superblock format 1.2 is likely to never be needed. None of the shipping boot loaders work with 1.2 regardless, and the boot loader under development won't install to the partition in the event of an md device and therefore doesn't need that 4k buffer that 1.2 provides. [1] Grub won't work with either 1.1 or 1.2 superblocks at the moment. A person could probably hack it to work, but since grub development has stopped in preference to the still under development grub2, they won't take the patches upstream unless they are bug fixes, not new features. [2] There are two ways to install to a master boot record. The first is to use the first 512 bytes *only* and hardcode the location of the remainder of the boot loader into those 512 bytes. The second way is to use the free space between the MBR and the start of the first partition to embed the remainder of the boot loader. When you point grub2 at an md device, they automatically only use the second method of boot loader installation. This gives them the freedom to be able to modify the second stage boot loader on a boot disk by boot disk basis. The downside to this is that they need lots of room after the MBR and before the first partition in order to put their core.img file in place. 
I *think*, and I'll know for sure later today, that the core.img file is generated during grub install from the list of optional modules you specify during setup. Eg., the pc module gives partition table support, the lvm module lvm support, etc. You list the modules you need, and grub then builds a core.img out of all those modules. The normal amount of space between the MBR and the first partition is (sectors_per_track - 1). For standard disk geometries, that basically leaves 254 sectors, or 127k of space. This might not be enough for your particular needs if you have a complex boot environment. In that case, you would need to bump at least the starting track of your first partition to make room for your boot loader. Unfortunately, how is a person to know how much room their setup needs until after they've installed and it's too late to bump the partition table start? They can't. So, that's another thing I think I will check out today, what the maximum size of grub2 might be with all modules included, and what a common size might be. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
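Coming back to the metadata/location question earlier in this message: the location is already selectable at create time via the metadata version. Roughly, with example devices:

  # superblock at the end of each member -- boot loader friendly:
  mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1

  # superblock at the start (1.1) or 4k in (1.2):
  mdadm --create /dev/md1 --metadata=1.1 --level=5 --raid-devices=3 \
        /dev/sdc1 /dev/sdd1 /dev/sde1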
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote: On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote: just apply some rules, so if you find a partition table _AND_ an md superblock at the end, read both and you can tell if it is an md on a partition or a partitioned md raid1 device. In fact, no you can't. I know, because I've created a device that had both but wasn't a raid device. And it's matching partner still existed too. What you are talking about would have misrecognized this situation, guaranteed. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Raid-10 mount at startup always has problem
On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote: On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote: The partition table is the single, (mostly) universally recognized arbiter of what possible data might be on the disk. Having a partition table may not make mdadm recognize the md superblock any better, but it keeps all that other stuff from even trying to access data that it doesn't have a need to access and prevents random luck from turning your day bad. on a pc maybe, but that is 20 years old design. So? Unix is 35+ year old design, I suppose you want to switch to Vista then? partition table design is limited because it is still based on C/H/S, which do not exist anymore. Put a partition table on a big storage, say a DMX, and enjoy a 20% performance decrease. Because you didn't stripe align the partition, your bad. Oh, and let's not go into what can happen if you're talking about a dual boot machine and what Windows might do to the disk if it doesn't think the disk space is already spoken for by a linux partition. Why the hell should the existance of windows limit the possibility of linux working properly. Linux works properly with a partition table, so this is a specious statement. If i have a pc that dualboots windows i will take care of using the common denominator of a partition table, if it is my big server i will probably not. since it won't boot anything else than Linux. Doesn't really gain you anything, but your choice. Besides, the question wasn't why shouldn't Luca Berra use whole disk devices, it was why I don't recommend using whole disk devices, and my recommendation wasn't based in the least bit upon a single person's use scenario. And, in particular with mdadm, I once created a full disk md raid array on a couple disks, then couldn't get things arranged like I wanted, so I just partitioned the disks and then created new arrays in the partitions (without first manually zeroing the superblock for the whole disk array). Since I used a version 1.0 superblock on the whole disk array, and then used version 1.1 superblocks in the partitions, the net result was that when I ran mdadm -Eb, mdadm would find both the 1.1 and 1.0 superblocks in the last partition on the disk. Confused both myself and mdadm for a while. yes, this is fun On the opposite, i once inserted an mmc memory card, which had been initialized on my mobile phone, into the mmc slot of my laptop, and was faced with a load of error about mmcblk0 having an invalid partition table. So? The messages are just informative, feel free to ignore them. Obviously it had none, it was a plain fat filesystem. Is the solution partitioning it? I don't think the phone would agree. The phone dictates the format, only a moron would say otherwise. But, then again, the phone doesn't care about interoperability and many other issues on memory cards that it thinks it owns, so only a moron would argue that because a phone doesn't use a partition table that nothing else in the computer realm needs to either. Anyway, I happen to *like* the idea of using full disk devices, but the reality is that the md subsystem doesn't have exclusive ownership of the disks at all times, and without that it really needs to stake a claim on the space instead of leaving things to chance IMO. Start removing the partition detection code from the blasted kernel and move it to userspace, which is already in place, but it is not the default. Which just moves where the work is done, not what work needs to be done. 
It's a change for no benefit and a waste of time. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Implementing low level timeouts within MD
On Fri, 2007-10-26 at 12:12 -0500, Alberto Alonso wrote: I've been asking on my other posts but haven't seen a direct reply to this question: Can MD implement timeouts so that it detects problems when drivers don't come back? For me this year shall be known as the year the array stood still (bad scifi reference :-) After 4 different array failures all due to a single drive failure I think it would really be helpful if the md code timed out the driver. This isn't an md problem, this is a low level disk driver problem. Yell at the author of the disk driver in question. If that driver doesn't time things out and return errors up the stack in a reasonable time, then it's broken. Md should not, and realistically can not, take the place of a properly written low level driver. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Raid-10 mount at startup always has problem
On Wed, 2007-10-24 at 22:43 -0700, Daniel L. Miller wrote: Bill Davidsen wrote: Daniel L. Miller wrote: Current mdadm.conf: DEVICE partitions ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4 UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a auto=part still have the problem where on boot one drive is not part of the array. Is there a log file I can check to find out WHY a drive is not being added? It's been a while since the reboot, but I did find some entries in dmesg - I'm appending both the md lines and the physical disk related lines. The bottom shows one disk not being added (this time is was sda) - and the disk that gets skipped on each boot seems to be random - there's no consistent failure: I suspect the base problem is that you are using whole disks instead of partitions, and the problem with the partition table below is probably an indication that you have something on that drive which looks like a partition table but isn't. That prevents the drive from being recognized as a whole drive. You're lucky, if the data looked enough like a partition table to be valid the o/s probably would have tried to do something with it. [...] This may be the rare case where you really do need to specify the actual devices to get reliable operation. OK - I'm officially confused now (I was just unofficially before). WHY is it a problem using whole drives as RAID components? I would have thought that building a RAID storage unit with identically sized drives - and using each drive's full capacity - is exactly the way you're supposed to! As much as anything else this can be summed up as you are thinking of how you are using the drives and not how unexpected software on your system might try and use your drives. Without a partition table, none of the software on your system can know what to do with the drives except mdadm when it finds an md superblock. That doesn't stop other software from *trying* to find out how to use your drives though. That includes the kernel trying to look for a valid partition table, mount possibly scanning the drive for a file system label, lvm scanning for an lvm superblock, mtools looking for a dos filesystem, etc. Under normal conditions, the random data on your drive will never look valid to these other pieces of software. But, once in a great while, it will look valid. And that's when all hell breaks loose. Or worse, you run a partition program such as fdisk on the device and it initializes the partition table (something that the Fedora/RHEL installers do to all disks without partition tables...well, the installer tells you there's no partition table and asks if you want to initialize it, but if someone is in a hurry and hits yes when they meant no, bye bye data). The partition table is the single, (mostly) universally recognized arbiter of what possible data might be on the disk. Having a partition table may not make mdadm recognize the md superblock any better, but it keeps all that other stuff from even trying to access data that it doesn't have a need to access and prevents random luck from turning your day bad. Oh, and let's not go into what can happen if you're talking about a dual boot machine and what Windows might do to the disk if it doesn't think the disk space is already spoken for by a linux partition. 
And, in particular with mdadm, I once created a full disk md raid array on a couple disks, then couldn't get things arranged like I wanted, so I just partitioned the disks and then created new arrays in the partitions (without first manually zeroing the superblock for the whole disk array). Since I used a version 1.0 superblock on the whole disk array, and then used version 1.1 superblocks in the partitions, the net result was that when I ran mdadm -Eb, mdadm would find both the 1.1 and 1.0 superblocks in the last partition on the disk. Confused both myself and mdadm for a while. Anyway, I happen to *like* the idea of using full disk devices, but the reality is that the md subsystem doesn't have exclusive ownership of the disks at all times, and without that it really needs to stake a claim on the space instead of leaving things to chance IMO. I should mention that the boot/system drive is IDE, and NOT part of the RAID. So I'm not worried about losing the system - but I AM concerned about the data. I'm using four drives in a RAID-10 configuration - I thought this would provide a good blend of safety and performance for a small fileserver. Because it's RAID-10 - I would ASSuME that I can drop one drive (after all, I keep booting one drive short), partition if necessary, and add it back in. But how would splitting these disks into partitions improve either stability or performance? -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http
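If you do decide to move to partitions, the usual approach is one member at a time with the array otherwise healthy -- a sketch only, device names are examples, and double-check which disk you are about to wipe:

  mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb
  mdadm --zero-superblock /dev/sdb     # drop the old whole-disk superblock
  echo ',,fd' | sfdisk /dev/sdb        # one full-size partition, type fd
  mdadm /dev/md0 --add /dev/sdb1
  cat /proc/mdstat                     # wait for the resync before touching the next disk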
Re: Raid-10 mount at startup always has problem
On Thu, 2007-10-25 at 16:12 +1000, Neil Brown wrote: md: md0 stopped. md: md0 stopped. md: bind<sdc> md: bind<sdd> md: bind<sdb> md: md0: raid array is not clean -- starting background reconstruction raid10: raid set md0 active with 3 out of 4 devices md: couldn't update array info. -22 ^^^ This is the most surprising line, and hence the one most likely to convey helpful information. This message is generated when a process calls SET_ARRAY_INFO on an array that is already running, and the changes implied by the new array_info are not supportable. The only way I can see this happening is if two copies of mdadm are running at exactly the same time and both are trying to assemble the same array. The first calls SET_ARRAY_INFO and assembles the (partial) array. The second calls SET_ARRAY_INFO and gets this error. Not all devices are included because while one mdadm went to look at a device, the other had it locked and so the first just ignored it. If mdadm copy A gets three of the devices, I wouldn't think mdadm copy B would have been able to get enough devices to decide to even try and assemble the array (assuming that once copy A locked the devices during open, that it then held the devices until time to assemble the array). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Thu, 2007-10-25 at 09:55 +1000, Neil Brown wrote: As for where the metadata should be placed, it is interesting to observe that the SNIA's DDFv1.2 puts it at the end of the device. And as DDF is an industry standard sponsored by multiple companies it must be .. Sorry. I had intended to say correct, but when it came to it, my fingers refused to type that word in that context. DDF is in a somewhat different situation though. It assumes that the components are whole devices, and that the controller has exclusive access - there is no way another controller could interpret the devices differently before the DDF controller has a chance. Putting a superblock at the end of a device works around OS compatibility issues and other things related to transitioning the device from part of an array to not, etc. But, it works if and only if you have the guarantee you mention. Long, long ago I tinkered with the idea of md multipath devices using an end of device superblock on the whole device to allow reliable multipath detection and autostart, failover of all partitions on a device when a command to any partition failed, ability to use standard partition tables, etc. while being 100% transparent to the rest of the OS. The second you considered FC connected devices and multi-OS access, that fell apart in a big way. Very analogous. So, I wouldn't necessarily call it wrong, but it's fragile. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Wed, 2007-10-24 at 16:22 -0400, Bill Davidsen wrote: Doug Ledford wrote: On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and they some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. It might be possible if you use static device names, but our users *long* ago complained very loudly when adding a new disk or removing a bad disk caused their setup to fail to boot. So, along came mount by label and auto scans for superblocks. Once you do that, you *really* need all the superblocks at the same end of a device so when you stack things, it always works properly. Let me be devil's advocate, I noted in another post that location might be raid level dependent. For raid-1 putting the superblock at the end allows the BIOS to treat a single partition as a bootable unit. This is true for both the 1.0 and 1.2 superblock formats. The BIOS couldn't care less if there is an offset to the filesystem because it doesn't try to read from the filesystem. It just jumps to the first 512 byte sector and that's it. Grub/Lilo are the ones that have to know about the offset, and they would be made aware of the offset at install time. So, we are back to the exact same thing I was talking about. With the superblock at the beginning of the device, you don't hinder bootability with or without the raid working, the raid would be bootable regardless as long as you made it bootable, it only hinders accessing the filesystem via a running linux installation without bringing up the raid. For all other arrangements the end location puts the superblock where it is slightly more likely to be overwritten, and where it must be moved if the partition grows or whatever. There really may be no right answer. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Raid-10 mount at startup always has problem
On Wed, 2007-10-24 at 07:22 -0700, Daniel L. Miller wrote: Daniel L. Miller wrote: Richard Scobie wrote: Daniel L. Miller wrote: And you didn't ask, but my mdadm.conf: DEVICE partitions ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4 UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a Try adding auto=part at the end of your mdadm.conf ARRAY line. Thanks - will see what happens on my next reboot. Current mdadm.conf: DEVICE partitions ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4 UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a auto=part still have the problem where on boot one drive is not part of the array. Is there a log file I can check to find out WHY a drive is not being added? It usually means either the device is busy at the time the raid startup happened, or the device wasn't created by udev yet at the time the startup happened. Is it failing to start the array properly in the initrd or is this happening after you've switched to the rootfs and are running the startup scripts? md: md0 stopped. md: md0 stopped. md: bind<sdc> md: bind<sdd> md: bind<sdb> Whole disk raid devices == bad. Lots of stuff can go wrong with that setup. md: md0: raid array is not clean -- starting background reconstruction raid10: raid set md0 active with 3 out of 4 devices md: couldn't update array info. -22 md: resync of RAID array md0 md: minimum _guaranteed_ speed: 1000 KB/sec/disk. md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for resync. md: using 128k window, over a total of 312581632 blocks. Filesystem "md0": Disabling barriers, not supported by the underlying device XFS mounting filesystem md0 Starting XFS recovery on filesystem: md0 (logdev: internal) Ending XFS recovery on filesystem: md0 (logdev: internal) -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote: John Stoffel wrote: Why do we have three different positions for storing the superblock? Why do you suggest changing anything until you get the answer to this question? If you don't understand why there are three locations, perhaps that would be a good initial investigation. Clearly the short answer is that they reflect three stages of Neil's thinking on the topic, and I would bet that he had a good reason for moving the superblock when he did it. I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of the device) is to satisfy people that want to get at their raid1 data without bringing up the device or using a loop mount with an offset. Version 1.1, at the beginning of the device, is to prevent accidental access to a device when the raid array doesn't come up. And version 1.2 (4k from the beginning of the device) would be suitable for those times when you want to embed a boot sector at the very beginning of the device (which really only needs 512 bytes, but a 4k offset is as easy to deal with as anything else). From the standpoint of wanting to make sure an array is suitable for embedding a boot sector, the 1.2 superblock may be the best default. Since you have to support all of them or break existing arrays, and they all use the same format so there's no saving of code size to mention, why even bring this up? -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
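As a concrete illustration of the difference (device names are examples, and read-only is not optional if you ever want the pair to resync cleanly afterwards):

  # 0.90/1.0: the filesystem starts at offset 0 of the member, so in a
  # pinch one raid1 half mounts directly:
  mount -o ro /dev/sdb1 /mnt/rescue

  # 1.1/1.2: the data is offset from the start of the member (mdadm -E
  # shows the data offset), so the clean way in is to assemble the
  # array, degraded if need be:
  mdadm --assemble --run /dev/md0 /dev/sdb1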
Re: chunk size (was Re: Time to deprecate old RAID formats?)
On Tue, 2007-10-23 at 21:21 +0200, Michal Soltys wrote: Doug Ledford wrote: Well, first I was thinking of files in the few hundreds of megabytes each to gigabytes each, and when they are streamed, they are streamed at a rate much lower than the full speed of the array, but still at a fast rate. How parallel the reads are then would tend to be a function of chunk size versus streaming rate. Ahh, I see now. Thanks for explanation. I wonder though, if setting large readahead would help, if you used larger chunk size. Assuming other options are not possible - i.e. streaming from larger buffer, while reading to it in a full stripe width at least. Probably not. All my trial and error in the past with raid5 arrays and various situations that would cause pathological worst case behavior showed that once reads themselves reach 16k in size, and are sequential in nature, then the disk firmware's read ahead kicks in and your performance stays about the same regardless of increasing your OS read ahead. In a nutshell, once you've convinced the disk firmware that you are going to be reading some data sequentially, it does the rest. With a large stripe size (say 256k+), you'll trigger this firmware read ahead fairly early on in reading any given stripe, so you really don't buy much by reading the next stripe before you need it, and in fact can end up wasting a lot of RAM trying to do so, hurting overall performance. I'm not familiar with the benchmark you are referring to. I was thinking about http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html with small discussion that happend after that. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
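For anyone who wants to experiment with the OS-side read ahead mentioned above, the knob is per block device and measured in 512-byte sectors (values are only examples):

  blockdev --getra /dev/md0         # current readahead
  blockdev --setra 4096 /dev/md0    # 2MB of readahead on the array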
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 22:24 +0400, Michael Tokarev wrote: John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. And There are different point of views, and different settings etc. Indeed, there are different points of view. And with that in mind, I'll just point out that my point of view is that of an engineer who is responsible for all the legitimate md bugs in our products once tech support has weeded out the you tried to do what? cases. From that point of view, I deal with *every* user's preferred use case, not any single use case. For example, I once dealt with a linux user who was unable to use his disk partition, because his system (it was RedHat if I remember correctly) recognized some LVM volume on his disk (it was previously used with Windows) and tried to automatically activate it, thus making it busy. Yep, that can still happen today under certain circumstances. What I'm talking about here is that any automatic activation of anything should be done with extreme care, using smart logic in the startup scripts if at all. We do. Unfortunately, there is no logic smart enough to recognize all the possible user use cases that we've seen given the way things are created now. The Doug's example - in my opinion anyway - shows wrong tools or bad logic in the startup sequence, not a general flaw in superblock location. Well, one of the problems is that you can both use an md device as an LVM physical volume and use an LVM logical volume as an md constituent device. Users have done both. For example, when one drive was almost dead, and mdadm tried to bring the array up, machine just hanged for unknown amount of time. An unexpirienced operator was there. Instead of trying to teach him how to pass parameter to the initramfs to stop trying to assemble root array and next assembling it manually, I told him to pass root=/dev/sda1 to the kernel. Root mounts read-only, so it should be a safe thing to do - I only needed root fs and minimal set of services (which are even in initramfs) just for it to boot up to SOME state where I can log in remotely and fix things later. Umm, no. Generally speaking (I can't speak for other distros) but both Fedora and RHEL remount root rw even when coming up in single user mode. The only time the fs is left in ro mode is when it drops to a shell during rc.sysinit as a result of a failed fs check. And if you are using an ext3 filesystem and things didn't go down clean, then you also get a journal replay. So, then what happens when you think you've fixed things, and you reboot, and then due to random chance, the ext3 fs check gets the journal off the drive that wasn't mounted and replays things again? Will this overwrite your fixes possibly? Yep. Could do all sorts of bad things. In fact, unless you do a full binary compare of your constituent devices, you could have silent data corruption and just never know about it. You may get off lucky and never *see* the corruption, but it could well be there. The only safe way to reintegrate your raid after doing what you suggest is to kick the unmounted drive out of the array before rebooting by using mdadm to zero its superblock, boot up with a degraded raid1 array, and readd the kicked device back in. 
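Spelled out with example device names (sdb1 being the half that was never mounted while running on root=/dev/sda1), that recovery is roughly:

  mdadm --zero-superblock /dev/sdb1   # forget its stale copy of the data
  reboot                              # md0 comes up degraded on sda1 only
  mdadm /dev/md0 --add /dev/sdb1      # full resync from the good half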
So, while you list several more examples of times when it was convenient to do as you suggest, these times can be handled in other ways (although it may mean keeping a rescue CD handy at each location just for situations like this) that are far safer IMO. Now, putting all this back into the point of view I have to take, which is what's the best default action to take for my customers, I'm sure you can understand how a default setup and recommendation of use that leaves silent data corruption is simply a non-starter for me. If someone wants to do this manually, then go right ahead. But as for what we do by default when the user asks us to create a raid array, we really need to be on superblock 1.1 or 1.2 (although we aren't yet, we've waited for the version 1 superblock issues to iron out and will do so in a future release). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote: I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. This is the heart of the matter. When you consider that each file system and each volume management stack has a superblock, and they some store their superblocks at the end of devices and some at the beginning, and they can be stacked, then it becomes next to impossible to make sure a stacked setup is never recognized incorrectly under any circumstance. It might be possible if you use static device names, but our users *long* ago complained very loudly when adding a new disk or removing a bad disk caused their setup to fail to boot. So, along came mount by label and auto scans for superblocks. Once you do that, you *really* need all the superblocks at the same end of a device so when you stack things, it always works properly. Michael Another example is ext[234]fs - it does not touch first 512 Michael bytes of the device, so if there was an msdos filesystem Michael there before, it will be recognized as such by many tools, Michael and an attempt to mount it automatically will lead to at Michael least scary output and nothing mounted, or in fsck doing Michael fatal things to it in worst scenario. Sure thing the first Michael 512 bytes should be just cleared.. but that's another topic. I would argue that ext[234] should be clearing those 512 bytes. Why aren't they cleared Actually, I didn't think msdos used the first 512 bytes for the same reason ext3 doesn't: space for a boot sector. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
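Since ext[234] leaves that first sector alone, it is easy to see whether a stale boot sector is still sitting there; a rough sketch with an illustrative device name (the second command is destructive and only makes sense on a partition you are about to mkfs anyway):

# Dump the first 512 bytes; a leftover DOS/FAT boot sector typically
# ends in the 55 aa signature that other tools key on:
dd if=/dev/sdb1 bs=512 count=1 2>/dev/null | hexdump -C | tail -4

# If it is stale and the partition is about to be reformatted, clearing
# it avoids the mis-detection described above (double-check the device!):
dd if=/dev/zero of=/dev/sdb1 bs=512 count=1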
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote: Honestly, I don't see how a properly configured system would start looking at the physical device by mistake. I suppose it's possible, but I didn't have this issue. Mount by label support scans all devices in /proc/partitions looking for the filesystem superblock that has the label you are trying to mount. LVM (unless told not to) scans all devices in /proc/partitions looking for valid LVM superblocks. In fact, you can't build a linux system that is resilient to device name changes without doing that. It's not only about the activation of the array. I'm mostly talking about RAID1, but the fact that migrating between RAID1 and plain disk is just a few hundred K at the end increases the flexibility very much. Flexibility, no. Convenience, yes. You can do all the things with superblock at the front that you can with it at the end, it just takes a little more effort. Also, sometime you want to recover as much as possible from a not intact copy of the data... And you can with superblock at the front. You can create a new single disk raid1 over the existing superblock or you can munge the partition table to have it point at the start of your data. There are options, they just require manual intervention. But if you are trying to rescue data off of a seriously broken device, you are already doing manual intervention anyway. Of course, different people have different priorities, but as I said, I like that this conversion is possible, and I never had the case of a tool saying hmm, /dev/mdsomething is not there, let's look at /dev/sdc instead. mount, pvscan. thanks, iustin -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
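For completeness, the filter Iustin mentions is an lvm.conf setting; a hedged sketch of the sort of entry involved, with illustrative patterns that would need adjusting to the real device naming:

# In /etc/lvm/lvm.conf, something along these lines keeps the LVM scan
# away from raw md members even when the array failed to start:
#
#   devices {
#       filter = [ "a|^/dev/md.*|", "r|.*|" ]   # accept md devices, reject everything else
#       md_component_detection = 1              # skip devices carrying an md superblock
#   }
#
# Afterwards, confirm what LVM is now willing to look at:
pvscan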
Re: chunk size (was Re: Time to deprecate old RAID formats?)
On Sat, 2007-10-20 at 00:43 +0200, Michal Soltys wrote: Doug Ledford wrote: course, this comes at the expense of peak throughput on the device. Let's say you were building a mondo movie server, where you were streaming out digital movie files. In that case, you very well may care more about throughput than seek performance since I suspect you wouldn't have many small, random reads. Then I would use a small chunk size, sacrifice the seek performance, and get the throughput bonus of parallel reads from the same stripe on multiple disks. On the other hand, if I Out of curiosity though - why wouldn't large chunk work well here ? If you stream video (I assume large files, so like a good few MBs at least), the reads are parallel either way. Well, first I was thinking of files in the few hundreds of megabytes each to gigabytes each, and when they are streamed, they are streamed at a rate much lower than the full speed of the array, but still at a fast rate. How parallel the reads are then would tend to be a function of chunk size versus streaming rate. I guess I should clarify what I'm talking about anyway. To me, a large chunk size is 1 to 2MB or so, a small chunk size is in the 64k to 256k range. If you have a 10 disk raid5 array with a 2mb chunk size, and you aren't just copying files around, then it's hard to ever get that to do full speed parallel reads because you simply won't access the data fast enough. Yes, the amount of data read from each of the disks will be in less perfect proportion than in small chunk size scenario, but it's pretty neglible. Benchamrks I've seen (like Justin's one) seem not to care much about chunk size in sequential read/write scenarios (and often favors larger chunks). Some of my own tests I did few months ago confirmed that as well. I'm not familiar with the benchmark you are referring to. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
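One rough way to see why a very large chunk rarely fans a single moderate-rate stream out across the spindles is to compare the chunk size against the readahead window; the numbers below are assumptions chosen purely for illustration:

readahead_kb=512     # assumed readahead window per stream
chunk_kb=2048        # a "large" 2MB chunk
echo "members touched per window: $(( (readahead_kb + chunk_kb - 1) / chunk_kb ))"   # -> 1

chunk_kb=128         # a "small" chunk
echo "members touched per window: $(( (readahead_kb + chunk_kb - 1) / chunk_kb ))"   # -> 4
# With the large chunk, one stream sits on a single disk at any instant;
# with the small chunk, the same window spreads across four disks.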
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 17:07 +0200, Iustin Pop wrote: On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote: Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case please. Boot from a raid1 array, such that everything - including the partition table itself - is mirrored. That's a *really* bad idea. If you want to subpartition a raid array, you really should either run lvm on top of raid or use a partitionable raid array embedded in a raid partition. If you don't, there are a whole slew of failure cases that would result in the same sort of accidental access and data corruption that I talked about. For instance, if you ever ran fdisk on the disk itself instead of the raid array, fdisk would happily create a partition that runs off the end of the raid device and into the superblock area. The raid subsystem autodetect only works on partitions labeled as type 0xfd, so it would never search for a raid superblock at the end of the actual device, and that means that if you boot from a rescue CD that doesn't contain an mdadm.conf file that specifies the whole disk device as a search device, then it is guaranteed to not start the device and possibly try and modify the underlying constituent devices. All around, it's just a *really* bad idea. I've heard several descriptions of things you *could* do with the superblock at the end, but as of yet, not one of them is a good idea if you really care about your data. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
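The two layouts recommended above look roughly like the following; this is a hedged sketch with made-up device names, and you would pick one approach or the other:

# Either: LVM on top of a plain raid1, and carve up the space inside LVM...
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
pvcreate /dev/md0
vgcreate vg0 /dev/md0

# ...or: a partitionable array built inside partitions, so running fdisk
# on the raw disk can never walk over the md superblock:
mdadm --create /dev/md_d0 --auto=mdp --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
fdisk /dev/md_d0     # partition the array itself, not /dev/sda or /dev/sdb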
Re: Time to deprecate old RAID formats?
On Sat, 2007-10-20 at 22:38 +0400, Michael Tokarev wrote: Justin Piszcz wrote: On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: [] Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format, however, lilo expects the raid1 device to start at the beginning of the physical partition. In otherwords, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO, switched back to 00.90.03 and it worked fine. There are different 1.x - and the difference is exactly this -- location of the superblock. In 1.0, superblock is located at the end, just like with 0.90, and lilo works just fine with it. It gets confused somehow (however I don't see how really, because it uses bmap() to get a list of physical blocks for the files it wants to access - those should be in absolute numbers, regardless of the superblock locaction) when the superblock is at the beginning (v 1.1 or 1.2). /mjt It's been a *long* time since I looked at the lilo raid1 support (I wrote the original patch that Red Hat used, I have no idea if that's what the lilo maintainer integrated though). However, IIRC, it uses bmap on the file, which implies it's via the filesystem mounted on the raid device. And the numbers are not absolute I don't think except with respect to the file system. So, I think the situation could be made to work if you just taught lilo that on version 1.1 or version 1.2 superblock raids that it should add the data offset of the raid to the bmap numbers (which I think are already added to the partition offset numbers). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote: On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote: And if putting the superblock at the end is problematic, why is it the default? Shouldn't version 1.1 be the default? In my opinion, having the superblock *only* at the end (e.g. the 0.90 format) is the best option. It allows one to mount the disk separately (in case of RAID 1), if the MD superblock is corrupt or you just want to get easily at the raw data. Bad reasoning. It's the reason that the default is at the end of the device, but that was a bad decision made by Ingo long, long ago in a galaxy far, far away. The simple fact of the matter is there are only two type of raid devices for the purpose of this issue: those that fragment data (raid0/4/5/6/10) and those that don't (raid1, linear). For the purposes of this issue, there are only two states we care about: the raid array works or doesn't work. If the raid array works, then you *only* want the system to access the data via the raid array. If the raid array doesn't work, then for the fragmented case you *never* want the system to see any of the data from the raid array (such as an ext3 superblock) or a subsequent fsck could see a valid superblock and actually start a filesystem scan on the raw device, and end up hosing the filesystem beyond all repair after it hits the first chunk size break (although in practice this is usually a situation where fsck declares the filesystem so corrupt that it refuses to touch it, that's leaving an awful lot to chance, you really don't want fsck to *ever* see that superblock). If the raid array is raid1, then the raid array should *never* fail to start unless all disks are missing (in which case there is no raw device to access anyway). The very few failure types that will cause the raid array to not start automatically *and* still have an intact copy of the data usually happen when the raid array is perfectly healthy, in which case automatically finding a constituent device when the raid array failed to start is exactly the *wrong* thing to do (for instance, you enable SELinux on a machine and it hasn't been relabeled and the raid array fails to start because /dev/mdblah can't be created because of an SELinux denial...all the raid1 members are still there, but if you touch a single one of them, then you run the risk of creating silent data corruption). It really boils down to this: for any reason that a raid array might fail to start, you *never* want to touch the underlying data until someone has taken manual measures to figure out why it didn't start and corrected the problem. Putting the superblock in front of the data does not prevent manual measures (such as recreating superblocks) from getting at the data. But, putting superblocks at the end leaves the door open for accidental access via constituent devices when you *really* don't want that to happen. So, no, the default should *not* be at the end of the device. As to the people who complained exactly because of this feature, LVM has two mechanisms to protect from accessing PVs on the raw disks (the ignore raid components option and the filter - I always set filters when using LVM ontop of MD). regards, iustin -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote: 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). It looks to me that the 1.1, combined with the 1.0 should be what we use, with the 1.2 format nuked. Maybe call it 1.3? *grin* You're somewhat misreading the man page. You *can't* combine 1.0 with 1.1. All of the above options: 1, 1.0, 1.1, 1.2; specifically mean to use a version 1 superblock. 1.0 means use a version 1 superblock at the end of the disk. 1.1 means version 1 superblock at beginning of disk. `1.2 means version 1 at 4k offset from beginning of the disk. There really is no actual version 1.1, or 1.2, the .0, .1, and .2 part of the version *only* means where to put the version 1 superblock on the disk. If you just say version 1, then it goes to the default location for version 1 superblocks, and last I checked that was the end of disk (aka, 1.0). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
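In mdadm terms, the sub-version is simply the --metadata (or -e) argument given at creation time; an illustrative example, not taken from the thread:

# Put the single version-1 superblock at the start of each member:
mdadm --create /dev/md0 --metadata=1.1 --level=5 --raid-devices=3 \
      /dev/sda3 /dev/sdb3 /dev/sdc3

# --metadata=1.0  superblock at the end of the device (like 0.90)
# --metadata=1.2  superblock 4K from the start
# --metadata=1    "a version 1 superblock, wherever the current default location is"

mdadm -E /dev/sda3 | grep -i version    # confirm what was actually written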
Re: Time to deprecate old RAID formats?
On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote: Justin == Justin Piszcz [EMAIL PROTECTED] writes: Justin On Fri, 19 Oct 2007, John Stoffel wrote: So, Is it time to start thinking about deprecating the old 0.9, 1.0 and 1.1 formats to just standardize on the 1.2 format? What are the issues surrounding this? 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worse case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never know it failed to start last time, and the two inconsistent members we put back into a clean array). So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for back compatibility with thousands of existing raid arrays. It's certainly easy enough to change mdadm to default to the 1.2 format and to require a --force switch to allow use of the older formats. I keep seeing that we support these old formats, and it's never been clear to me why we have four different ones available? Why can't we start defining the canonical format for Linux RAID metadata? Thanks, John [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of Justin anything else! Are you sure? I find that GRUB is much easier to use and setup than LILO these days. But hey, just dropping down to support 00.09.03 and 1.2 formats would be fine too. Let's just lessen the confusion if at all possible. John - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Partitioned arrays initially missing from /proc/partitions
Neil Brown wrote: Yes, but it should not be needed, and I'd like to understand why it is. One of the last things do_md_run does is mddev->changed = 1; When you next open /dev/md_d0, md_open is called which calls check_disk_change(). This will call into md_fops->md_media_changed which will return the value of mddev->changed, which will be '1'. So check_disk_change will then call md_fops->revalidate_disk which will set mddev->changed to 0, and will then set bd_invalidated to 1 (as bd_disk->minors > 1 (being 64)). md_open will then return into do_open (in fs/block_dev.c) and because bd_invalidated is true, it will call rescan_partitions and the partitions will appear. Yuck. The md stack should populate the partition information on device creation *without* needing someone to open the resulting device. That you can tweak mdadm to open the device after creation is fine, but unless no other program is allowed to use the ioctls to start devices, and unless this is a documented part of the API, waiting until second open to populate the device info is just flat wrong. It breaks all sorts of expectations people have regarding things like mount by label, etc. Hmmm... there is room for a race there. If some other process opens /dev/md_d0 before mdadm gets to close it, it will call rescan_partitions before first calling bd_set_size to update the size of the bdev. So when we try to read the partition table, it will appear to be reading past the EOF, and will not actually read anything.. I guess udev must be opening the block device at exactly the wrong time. I can simulate this by holding /dev/md_d0 open while assembling the array. If I do that, the partitions don't get created. Yuck. Maybe I could call bd_set_size in md_open before calling check_disk_change.. Yep, this patch seems to fix it. Could you confirm? Thanks, NeilBrown

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c       2007-04-17 11:42:15.0 +1000
+++ ./drivers/md/md.c   2007-04-24 21:29:51.0 +1000
@@ -4485,6 +4485,8 @@ static int md_open(struct inode *inode,
        mddev_get(mddev);
        mddev_unlock(mddev);

+       if (mddev->changed)
+               bd_set_size(inode->i_bdev, mddev->array_size << 1);
        check_disk_change(inode->i_bdev);
  out:
        return err;

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford [EMAIL PROTECTED] http://people.redhat.com/dledford Infiniband specific RPMs can be found at http://people.redhat.com/dledford/Infiniband - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
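Until a kernel carrying that fix is in place, the practical workaround is to open the partitioned device once after assembling it, which is essentially what the mdadm tweak mentioned earlier amounts to; a hedged sketch with illustrative device names, assuming the array actually contains a partition table:

mdadm --assemble /dev/md_d0 --auto=part /dev/sda1 /dev/sdb1
blockdev --rereadpt /dev/md_d0     # opening the node forces the partition rescan
grep md_d0 /proc/partitions        # md_d0p1, md_d0p2, ... should now be listed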
Re: [RFC: 2.6 patch] simplify drivers/md/md.c:update_size()
On Fri, 2006-12-15 at 01:19 +0100, Adrian Bunk wrote: While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e, I got the impression that this commit couldn't fix anything, since the size variable can't be changed before fit gets used. Is there any big thinko, or is the patch below that slightly simplifies update_size() semantically equivalent to the current code? No, this patch is broken. Where it fails is specifically the case where you want to autofit the largest possible size, you have different size devices, and the first device is not the smallest. When you hit the first device, you will set size, then as you repeat the ITERATE_RDEV loop, when you hit the smaller device, size will be non-0 and you'll then trigger the later if and return -ENOSPC. In the case of autofit, you have to preserve the fit variable instead of looking at size so you know whether or not to modify the size when you hit a smaller device later in the list.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
---
 drivers/md/md.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-2.6.19-mm1/drivers/md/md.c.old        2006-12-15 00:57:05.0 +0100
+++ linux-2.6.19-mm1/drivers/md/md.c    2006-12-15 00:57:42.0 +0100
@@ -4039,57 +4039,56 @@
         * Generate a 128 bit UUID
         */
        get_random_bytes(mddev->uuid, 16);

        mddev->new_level = mddev->level;
        mddev->new_chunk = mddev->chunk_size;
        mddev->new_layout = mddev->layout;
        mddev->delta_disks = 0;
        mddev->dead = 0;

        return 0;
 }

 static int update_size(mddev_t *mddev, unsigned long size)
 {
        mdk_rdev_t * rdev;
        int rv;
        struct list_head *tmp;
-       int fit = (size == 0);

        if (mddev->pers->resize == NULL)
                return -EINVAL;
        /* The size is the amount of each device that is used.
         * This can only make sense for arrays with redundancy.
         * linear and raid0 always use whatever space is available
         * We can only consider changing the size if no resync
         * or reconstruction is happening, and if the new size
         * is acceptable. It must fit before the sb_offset or,
         * if that is data_offset, it must fit before the
         * size of each device.
         * If size is zero, we find the largest size that fits.
         */
        if (mddev->sync_thread)
                return -EBUSY;
        ITERATE_RDEV(mddev,rdev,tmp) {
                sector_t avail;
                avail = rdev->size * 2;

-               if (fit && (size == 0 || size > avail/2))
+               if (size == 0)
                        size = avail/2;
                if (avail < ((sector_t)size << 1))
                        return -ENOSPC;
        }
        rv = mddev->pers->resize(mddev, (sector_t)size *2);
        if (!rv) {
                struct block_device *bdev;

                bdev = bdget_disk(mddev->gendisk, 0);
                if (bdev) {
                        mutex_lock(&bdev->bd_inode->i_mutex);
                        i_size_write(bdev->bd_inode, (loff_t)mddev->array_size << 10);
                        mutex_unlock(&bdev->bd_inode->i_mutex);
                        bdput(bdev);
                }
        }
        return rv;
 }

-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: Recovering from default FC6 install
On Sun, 2006-11-12 at 01:00 -0500, Bill Davidsen wrote: I tried something new on a test system, using the install partitioning tools to partition the disk. I had three drives and went with RAID-1 for boot, and RAID-5+LVM for the rest. After the install was complete I noted that it was solid busy on the drives, and found that the base RAID appears to have been created (a) with no superblock and (b) with no bitmap. That last is an issue, as a test system it WILL be getting hung and rebooted, and recovering the 1.5TB took hours. Is there an easy way to recover this? The LVM dropped on it has a lot of partitions, and there is a lot of data in them asfter several hours of feeding with GigE, so I can't readily back up and recreate by hand. Suggestions? First, the Fedora installer *always* creates persistent arrays, so I'm not sure what is making you say it didn't, but they should be persistent. So, assuming that they are persistent, just recreate the arrays in place as version 1.0 superblocks with internal bitmap. I did that exact thing on my FC6 machine I was testing with (raid1, not raid5, but no biggie there) and it worked fine. The detailed list of instructions: Reboot with a rescue CD, skip the finding of the installation, when you are at a prompt, use mdadm to examine the raid superblocks so you get all the pertinent data such as chunk size on the raid5 and ordering of constituent drives in the raid5 right. Then recreate the arrays as version 1.0 superblocks with internal write intent bitmaps. Then mount the partitions manually, bind mount things like /dev and /proc into wherever you mounted the root filesystem, edit the mdadm.conf on the root filesystem and remove the ARRAY lines (the GUIDs will be wrong now), use mdadm -Db or mdadm -Eb to get new ARRAY lines and append them to the mdadm.conf (possibly altering the device names for the arrays, and if you use -E remember to correct the printout of the GUID in the array line, it's 10:8:8:6 instead of 8:8:8:8), patch mkinitrd with something like the attached patch, patch /etc/rc.d/rc.sysinit with something like the other attached patch (or leave this patch out but manually add the correct auto= parameter to your ARRAY lines in the mdadm.conf), chroot into the root filesystem, remake your initrd image, fdisk the drives and switch the linux partition types from raid autodetect to plain linux, reboot, and you are done. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband --- /sbin/mkinitrd 2006-09-28 12:51:28.0 -0400 +++ mkinitrd 2006-11-12 10:28:31.0 -0500 @@ -1096,6 +1096,13 @@ mknod $MNTIMAGE/dev/efirtc c 10 136 fi +if [ -n $raiddevices ]; then +inst /sbin/mdadm.static $MNTIMAGE/bin/mdadm +if [ -f /etc/mdadm.conf ]; then +cp $verbose /etc/mdadm.conf $MNTIMAGE/etc/mdadm.conf +fi +fi + # FIXME -- this can really go poorly with clvm or duplicate vg names. # nash should do lvm probing for us and write its own configs. 
 if [ -n $vg_list ]; then
@@ -1234,8 +1241,7 @@
 if [ -n $raiddevices ]; then
     for dev in $raiddevices; do
-        cp -a /dev/${dev} $MNTIMAGE/dev
-        emit raidautorun /dev/${dev}
+        emit mdadm -As --auto=yes /dev/${dev}
     done
 fi

--- /etc/rc.d/rc.sysinit        2006-10-04 18:14:53.0 -0400
+++ rc.sysinit  2006-11-12 10:29:03.0 -0500
@@ -403,7 +403,7 @@
 update_boot_stage RCraid
 [ -x /sbin/nash ] && echo raidautorun /dev/md0 | nash --quiet
 if [ -f /etc/mdadm.conf ]; then
-    /sbin/mdadm -A -s
+    /sbin/mdadm -A -s --auto=yes
 fi
 # Device mapper related initialization

signature.asc Description: This is a digitally signed message part
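Condensed into commands, and under the assumption of a three-disk raid5 with a 128k chunk, the rescue procedure above looks roughly like this; every device name, mount point, and value is a placeholder, and the real level, chunk size, and device order must come from what mdadm -E reports on the machine being rescued:

# From the rescue CD, record the existing geometry first:
mdadm -E /dev/sda3        # note level, chunk size, and device order

# Recreate in place with a 1.0 superblock and internal bitmap, using
# exactly the level/chunk/order just recorded (a resync will run):
mdadm --create /dev/md1 --metadata=1.0 --bitmap=internal \
      --level=5 --raid-devices=3 --chunk=128 /dev/sda3 /dev/sdb3 /dev/sdc3

# Mount the installed system, bind-mount /dev and /proc, fix mdadm.conf:
mount /dev/md1 /mnt/sysimage
mount --bind /dev  /mnt/sysimage/dev
mount --bind /proc /mnt/sysimage/proc
mdadm -Eb /dev/sda3 >> /mnt/sysimage/etc/mdadm.conf   # then hand-edit the ARRAY lines as described

# Chroot and rebuild the initrd for the installed kernel; finally use
# fdisk to switch the member partition types from fd to plain 83:
chroot /mnt/sysimage
mkinitrd -f /boot/initrd-<installed-kernel>.img <installed-kernel>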
Re: mdadm-2.5.4 issues and 2.6.18.1 kernel md issues
it was partitioned or not. So, for example, if it's not a partitioned array, you would have to teach grub that, let's say you have your boot data on (hd0,0), then if (hd0,0) is part of a raid array with certain superblock types (probably have to read /proc/mdstat to know), then the start of (hd0,0) is not the start of the partition, but instead something like partition size in blocks - whole md device size in blocks = offset into partition to start of md device, and consequently the ext filesystem that /boot is comprised of. If it is partitioned, then you could teach it the notion of (hd0,0,0), aka chained partition tables, where you use the same offset calculation above to get to the chained partition table, then read that partition table to get the offset to the filesystem. I don't think it would be too difficult for grub, but it would have to be added. This does, however, point out that the md stack's decision to use a geometry on it's devices that is totally different than the real constituent device geometry means that grub would have to perform conversions on that chained partition table to get from md offset to real device offset. That may not matter much in the end, but it will have to be done. The difference in geometry also precludes doing a whole device md array with the superblock at the end and the partition table where the normal device partition table would be. Although that sort of setup is risky in terms of failure to assemble like I pointed out above, it does have it's appeal for certain situations like multipath or the ability to have a partitioned raid1 device with /boot in the array without needing to modify grub, especially on machines that don't have built in SATA raid that dm-raid could make use of. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
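The offset arithmetic described above can be sanity-checked from a running system; a rough sketch with illustrative device names, bearing in mind that mdadm -E remains the authoritative source for the data offset:

part_kib=$(awk '$4 == "sda1" {print $3}' /proc/partitions)   # partition size, 1K blocks
md_kib=$(awk  '$4 == "md0"  {print $3}' /proc/partitions)    # exported array size, 1K blocks
echo "filesystem starts roughly $(( (part_kib - md_kib) * 2 )) sectors into /dev/sda1"

mdadm -E /dev/sda1 | grep -i 'data offset'   # the number a bootloader would actually need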
mdadm-2.5.4 issues and 2.6.18.1 kernel md issues
If I use mdadm 2.5.4 to create a version 1 superblock raid1 device, it starts a resync. If I then reboot the computer part way through, when it boots back up, the resync gets cancelled and the array is considered clean. This is against a 2.6.18.1 kernel. If I create a version 1 superblock raid1 array, mdadm -D constituent device says that the device is not part of a raid array (and likewise the kernel autorun facility fails to find the device). If I create a version 1 superblock raid1 array, mdadm -E constituent device sees the superblock. If I then run mdadm -E --brief on that same device, it prints out the 1 line ARRAY line, but it misprints the UUID such that is a 10 digit hex number: 8 digit hex number: 8 digit hex number: 6 digit hex number. It also prints the mdadm device in the ARRAY line as /dev/md/# where as mdadm -D --brief prints the device as /dev/md#. Consistency would be nice. Does the superblock still not store any information about whether or not the array is a single device or partitionable? Would be nice if the superblock gave some clue as to that fact so that it could be used to set the auto= param on an mdadm -E --brief line to the right mode. Mdadm assumes that the --name option passed in to create an array means something in particular to the md array name and modifies subsequent mdadm -D --brief and mdadm -E --brief outputs to include the name option minus the hostname. Aka, if I set the name to firewall:/boot, mdadm -E --brief will then print out the ARRAY line device as /dev/md//boot. I don't think this is documented anywhere. This also raises the question of how partitionable md devices will be handled in regards to their name component. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: future hardware
On Fri, 2006-10-27 at 17:18 -0500, Daniel Korstad wrote: leaving me one 5.25 left for the fan. In addition to the fan in the item above, I have the exhaust fan on the Power Supply, another 12mm exhaust fan and a 12mm intake that blows across the other HDs. Sorry, I too much of a hurry, those are 120cm exhaust and 120cm intake Hehehe, I'll burn in hell for pointing this out, but as 10mm == 1cm, a 120*mm* fan or 12*cm* fan would be correct. I'm pretty sure your fans are neither 12mm nor 120cm (or if you do have a 120cm fan...damn...that's a lot of cooling)... -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
RE: why partition arrays?
On Thu, 2006-10-19 at 12:25 +0100, Ken Walker wrote: So is LVM better for partitions on a large raid5, or any raid, than separate partitions on that array. In some ways yes, although it introduces a certain amount of uncertainty in tuning of block devices. I'm still in my learning curve :) for example, if one has Linux running on a two disk mirror array, raid1, and the first disk is partitioned, say 5 partitions, with those partitions mirrored on the second disk, and each identical partition is then run as a mirror raid1. What your saying is that, if a single partition fails, to remove the drive you have to fail all the array partitions on the drive your taking out, then rebuild the partitions and then add to the dirty raid the new partitions one at a time. Yep. Will LVM remove all this, so if you have a mirror as a single raid partition, and use LVM to create the partitions on that mirror, if a disk goes down, can it be removed, replaced, and then just added to the single raid, with LVM having had no idea what was going on in the background and just plod along merrily. Yep. In addition, with LVM, if you added two new disks, also in a raid1 array, then you could add that to your current volume group as another physical volume, and the LVM code would happily extend your volume to span both RAID1 arrays and increase the size. Since the md code can now grow things, this isn't as impressive as it used to be, but it's probably a little easier to handle the lvm stuff than the md growth stuff if for no other reason than they have graphical LVM tools that you can do this with. Is LVM stable, or can it cause more problems than separate raids on a array. Current incarnations are very stable. I mentioned earlier that it can introduce some tuning issues. If you are dealing with a raid device directly, then it's relatively straight forward to set the stripe size, chunk size, etc. according to the number of raid disks and then set the elevator and possibly things like read ahead values to optimize the raid array's performance for different needs. When you introduce LVM on top of raid, there is the possibility that there will be interactions between the two that have a detrimental impact on performance (this may not always be the case, and it may not be unfixable, I'm just saying it's an additional layer you have to deal with). -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
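The grow-by-adding-another-raid1 path mentioned above looks roughly like this; the volume group and logical volume names are invented for the example:

# Build the new mirror and hand it to LVM as a second physical volume:
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
pvcreate /dev/md1
vgextend vg0 /dev/md1

# Grow a logical volume into the new space, then the filesystem on it
# (resize2fs for ext3; online growth needs a recent enough kernel):
lvextend -L +200G /dev/vg0/data
resize2fs /dev/vg0/data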
Re: Propose of enhancement of raid1 driver
On Thu, 2006-10-19 at 13:28 +1000, Neil Brown wrote: On Tuesday October 17, [EMAIL PROTECTED] wrote: I would like to propose an enhancement of raid 1 driver in linux kernel. The enhancement would be speedup of data reading on mirrored partitions. The idea is easy. If we have mirrored partition over 2 disks, and these disk are in sync, there is possibility of simultaneous reading of the data from both disks on the same way as in raid 0. So it would be chunk1 read from master, chunk2 read from slave at the same time. As result it would give significant speedup of read operation (comparable with speed of raid 0 disks). This is not as easy as it sounds. Skipping over blocks within a track is no faster than reading blocks in the track, so you would need to make sure that your chunk size is larger than one track - probably it would need to be several tracks. Raid1 already does some read-balancing, though it is possible (even likely) that it doesn't balance very effectively. Working out how best to do the balancing in general in a non-trivial task, but would be worth spending time on. The raid10 module in linux supports a layout described as 'far=2'. In this layout, with two drives, the first half of the drives is used for a raid0, and the second half is used for a mirrored raid0 with the data on the other disk. In this layout reads should certainly go at raid0 speeds, though there is cost in the speed of writes. Maybe you would like to experiment. Write a program that reads from two drives in parallel, reading all the 'odd' chunks from one drive and the 'even' chunks from the other, and find out how fast it is. Maybe you could get it to try lots of different chunk sizes and see which is the fastest. Too artificial. The results of this sort of test would not translate well to real world usage. That might be quite helpful in understanding how to get read-balancing working well. Doing *good* read balancing is hard, especially given things like FC attached storage, iSCSI/iSER, etc. If I wanted to do this right, I'd start by teaching the md code to look more deeply into block devices, possibly even with a self tuning series of reads at startup to test things like close seek sequential operation times versus maximum seek throughput which would clue you in as to whether the device you are talking to might have more than 1 physical spindle which would impact the cost you associate to seek requiring operations relative to bandwidth heavy operations, I might even go so far as to look into the SCSI transport classes for clues about data throughput at bus bandwidth versus command startup/teardown costs on the bus so you have an accurate idea if lots of outstanding small commands are likely to cause your device to suffer bus starvation issues from overhead. Then I'd use that data to help me numerically quantify the load on a device, updated both when a command is added to the block layer queue (the queued load) and when the command is actually removed from the block queue and sent to the device (the active load) and updated again when the command is received back. Then, I'd basically look at what an incoming command *would* do to each constituent disk's load values to see whether it should go to one or the other. But, that's just off the top of my head and I may be on crack...I didn't check what my wife handed me this morning. 
-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
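As an aside, the far layout Neil describes above can be tried directly with mdadm rather than simulated with a custom reader program; a hedged sketch, where the device names and the dd sizing are purely illustrative:

# Two-disk raid10 with the 'far' layout: raid0-like reads, mirrored writes.
mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 /dev/sda1 /dev/sdb1

# Crude sequential read comparison against a plain raid1 built the same way:
dd if=/dev/md0 of=/dev/null bs=1M count=2048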
Re: why partition arrays?
On Wed, 2006-10-18 at 15:43 +0200, martin f krafft wrote: also sprach Doug Ledford [EMAIL PROTECTED] [2006.10.18.1526 +0200]: There are a couple reasons I can think. Thanks for your elaborate response. If you don't mind, I shall link to it from the FAQ. Sure. I have one other question: do partitionable and traditional arrays actually differ in format? Put differently: can I assemble a traditional array as a partitionable one simply by specifying: mdadm --create ... /dev/md0 ... mdadm --stop /dev/md0 mdadm --assemble --auto=part ... /dev/md0 ... ? Or do the superblocks actually differ? Neil would be more authoritative about what would differ in the superblocks, but yes, it is possible to do as you listed above. In fact, if you create a partitioned array, and your mkinitrd doesn't restart it as a partitioned array, you'll wonder how to mount your filesystems since the system will happily start that originally partitioned array as non partitioned. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
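Filled in with concrete but made-up devices, the sequence Martin asks about is:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm --stop /dev/md0
mdadm --assemble --auto=part /dev/md_d0 /dev/sda1 /dev/sdb1
grep md_d0 /proc/partitions    # the same members, now exposed as a partitionable array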
Re: avoiding the initial resync on --create
On Tue, 2006-10-10 at 11:55 +0200, Gabor Gombas wrote: On Mon, Oct 09, 2006 at 12:32:00PM -0400, Doug Ledford wrote: You don't really need to. After a clean install, the operating system has no business reading any block it didn't write to during the install unless you are just reading disk blocks for the fun of it. What happens if you have a crash, and fsck for some reason tries to read into that uninitialized area? This may happen even years after the install if the array was never resynced and the filesystem was never 100% full... What happens, if fsck tries to read the same area twice but gets different data, because the second time the read went to a different disk? And yes, fsck is exactly an application that reads blocks just for the fun of it when it tries to find all the pieces of the filesystem, esp. for filesystems that (unlike e.g. ext3) do not keep metadata at fixed locations. Not at all true. Every filesystem, no matter where it stores its metadata blocks, still writes to every single metadata block it allocates to initialize that metadata block. The same is true for directory blocks...they are created with a . and .. entry and nothing else. What exactly do you think mke2fs is doing when it's writing out the inode groups, block groups, bitmaps, etc.? Every metadata block needed by fsck is written either during mkfs or during use as the filesystem data is grown. So, like my original email said, fsck has no business reading any block that hasn't been written to either by the install or since the install when the filesystem was filled up more. It certainly does *not* read blocks just for the fun of it, nor does it rely on anything the filesystem didn't specifically write. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: avoiding the initial resync on --create
On Tue, 2006-10-10 at 22:37 +0200, Gabor Gombas wrote: You don't get my point. I'm not talking about normal operation, but about the case when the filesystem becomes corrupt, and fsck has to glue together the pieces. Consider reiserfs: See my other on list mail about the fallacy of the idea that consistency of garbage data blocks is any better than inconsistency. As I mentioned in it, even if it's a deleted file, a lost metadata block, etc., it will always be consistent if it's a valid block to consider during rebuild because *at some point in time* since the filesystem was created, it will have been written. Reiserfsck is just as susceptible to random garbage on a single disk not part of any raid array as it is to inconsistent blocks in a raid1 as it is to a fully synced raid1 array with garbage that looks like a reiserfs. That's a shortcoming of that filesystem and there is no one to blame but Hans Reiser for that. -- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: avoiding the initial resync on --create
On Mon, 2006-10-09 at 15:10 -0400, Rob Bray wrote: On Mon, 2006-10-09 at 15:49 +0200, Erik Mouw wrote: There is no way to figure out what exactly is correct data and what is not. It might work right after creation and during the initial install, but after the next reboot there is no way to figure out what blocks to believe. You don't really need to. After a clean install, the operating system has no business reading any block it didn't write to during the install unless you are just reading disk blocks for the fun of it. And any program that depends on data that hasn't first been written to disk is just wrong and stupid anyway. I suppose a partial-stripe write would read back junk data on the other disks, xor with your write, and update the parity block. The original email was about raid1 and the fact that reads from different disks could return different data. For that scenario, my comments are accurate. For the parity based raids, you never have two disks with the same block, so you would only ever get different results if you had a disk fail and the parity was never initialized. For that situation, you would need to init the parity on any stripe that has been even partially written to. Totally unwritten stripes could have any parity you want since the data is undefined anyway, so who cares if it changes when a disk fails and you are reconstructing from parity. If you benchmark the disk, you're going to be reading blocks you didn't necessarily write, which could kick out consistency errors. The only benchmarks I know of that give a rats ass about the data integrity are ones that write a pattern first and then read it back. In that case, parity would have been init'ed during the write. A whole-array consistency check would puke on the out-of-whack parity data. Or a whole array consistency check on an array that hasn't had a whole array parity init makes no sense. You could create the array without touching the parity, update parity on all stripes that are written, leave a flag in the superblock indicating the array has never been init'ed, and in the event of failure you can use the parity safe in the knowledge that all stripes that have been written to have valid parity and all other stripes we don't care about. The main problem here is that if we *did* need a consistency check, we couldn't tell errors from uninit'ed stripes. You could also make it so that the first time you run a consistency check with the uninit'ed flag in the superblock set, you calculate all parity and then clear the flag in the superblock and on all subsequent runs you would then know when you have an error as opposed to an uninit'ed block. Probably the best thing to do would be on create of the array, setup a large all 0 block of mem and repeatedly write that to all blocks in the array devices except parity blocks and use a large all 1 block for that. Then you could just write the entire array at blinding speed. You could call that the quick-init option or something. You wouldn't be able to use the array until it was done, but it would be quick. If you wanted to be *really* fast, at least for SCSI drives you could write one large chunk of 0's and one large chunk of 1's at the first parity block, then use the SCSI COPY command to copy the 0 chunk everywhere it needs to go, and likewise for the parity chunk, and avoid transferring the data over the SCSI bus more than once. 
-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
Re: avoiding the initial resync on --create
On Tue, 2006-10-10 at 07:33 +1000, Neil Brown wrote: On Monday October 9, [EMAIL PROTECTED] wrote: The original email was about raid1 and the fact that reads from different disks could return different data. To be fair, the original mail didn't mention raid1 at all. It did mention raid5 and raid6 as a possible contrast so you could reasonably get the impression that it was talking about raid1. But that wasn't stated. OK, well I got that impression from the contrast ;-) Otherwise I agree. There is no real need to perform the sync of a raid1 at creation. However it seems to be a good idea to regularly 'check' an array to make sure that all blocks on all disks get read to find sleeping bad blocks early. If you didn't sync first, then every check will find lots of errors. Ofcourse you could 'repair' instead of 'check'. Or do that once. Or something. For raid6 it is also safe to not sync first, though with the same caveat as raid1. Raid6 always updates parity by reading all blocks in the stripe that aren't known and calculating P and Q. So the first write to a stripe will make P and Q correct for that stripe. This is current behaviour. I don't think I can guarantee it will never changed. For raid5 it is NOT safe to skip the initial sync. It is possible for all updates to be read-modify-write updates which assume the parity is correct. If it is wrong, it stays wrong. Then when you lose a drive, the parity blocks are wrong so the data you recover using them is wrong. superblock-init_flag == FALSE then make all writes a parity generating not updating write (less efficient, so you would want to resync the array and clear this up soon, but possible). In summary, it is safe to use --assume-clean on a raid1 or raid1o, though I would recommend a repair before too long. For other raid levels it is best avoided. Probably the best thing to do would be on create of the array, setup a large all 0 block of mem and repeatedly write that to all blocks in the array devices except parity blocks and use a large all 1 block for that. No, you would want 0 for the parity block too. 0 + 0 = 0. Sorry, I was thinking odd parity. Then you could just write the entire array at blinding speed. You could call that the quick-init option or something. You wouldn't be able to use the array until it was done, but it would be quick. I doubt you would notice it being faster than the current resync/recovery that happens on creation. We go at device-speed - either the buss device or the storage device depending on which is slower. There's memory overhead though. That can impact other operations the cpu might do while in the process of recovering. If you wanted to be *really* fast, at least for SCSI drives you could write one large chunk of 0's and one large chunk of 1's at the first parity block, then use the SCSI COPY command to copy the 0 chunk everywhere it needs to go, and likewise for the parity chunk, and avoid transferring the data over the SCSI bus more than once. Yes, that might be measurably faster. It is the sort of thing you might do in a hardware RAID controller but I doubt it would ever get done in md (there is a price for being very general). Bleh...sometimes I really dislike always making things cater to the lowest common denominator...you're never as good as you could be and you are always as bad as the worst case... 
-- Doug Ledford [EMAIL PROTECTED] GPG KeyID: CFBFF194 http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband signature.asc Description: This is a digitally signed message part
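For the raid1 case Neil calls acceptable above, the shortcut plus the follow-up scrub look roughly like this; device names are illustrative and the sysfs sync_action interface assumes a reasonably recent 2.6 kernel:

# Skip the initial sync on a mirror:
mdadm --create /dev/md0 --level=1 --raid-devices=2 --assume-clean /dev/sda1 /dev/sdb1

# Do a one-off repair fairly soon so both members really are identical:
echo repair > /sys/block/md0/md/sync_action

# Later, periodic checks will then report real problems instead of the
# leftover differences an unsynced mirror would show:
echo check > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt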
Re: Strange intermittant errors + RAID doesn't fail the disk.
On Fri, 2006-07-07 at 00:29 +0200, Christian Pernegger wrote: I don't know exactly how the driver was responding to the bad cable, but it clearly wasn't returning an error, so md didn't fail it. There were a lot of errors in dmesg -- seems like they did not get passed up to md? I find it surprising that the md layer doesn't have its own timeouts, but then I know nothing about such things :) Thanks for clearing this up for me, C. [...] ata2: port reset, p_is 800 is 2 pis 0 cmd 44017 tf d0 ss 123 se 0 ata2: status=0x50 { DriveReady SeekComplete } sdc: Current: sense key: No Sense Additional sense: No additional sense information ata2: handling error/timeout ata2: port reset, p_is 0 is 0 pis 0 cmd 44017 tf 150 ss 123 se 0 ata2: status=0x50 { DriveReady SeekComplete } ata2: error=0x01 { AddrMarkNotFound } sdc: Current: sense key: No Sense Additional sense: No additional sense information [repeat] This looks like a bad sd/sata lld interaction problem. Specifically, the sata driver wasn't filling in a suitable sense code block to simulate auto-sense on the command, and the scsi disk driver was either trying to get sense or retrying the same command. Anyway, not an md issue, a sata/scsi issue in terms of why it wasn't getting out of the reset loop eventually. I would send your bad cable to Jeff Garzik for further analysis of the problem ;-) -- Doug Ledford [EMAIL PROTECTED] http://people.redhat.com/dledford Infiniband specific RPMs available at http://people.redhat.com/dledford/Infiniband - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AW: RAID1 and data safety?
writing the end of journal entry), then you basically wait for all your journal transactions to complete before sending the end of journal transaction. You don't have to wait for *all* writes to the drive to complete, just the journal writes. This is why performance isn't killed by journaling. The filesystem proper writes for previous journal transactions can be taking place while you are doing this waiting. --- You mentioned data journaling, and it sounded like it is reliable working. Which one of the existing journaling fs did you have in your mind? I use ext3 personally. But that's as much because it's the default filesystem and I know Stephen Tweedie will fix it if it's broken ;-) --- Afaik a read only reads from *one* HD (in raid1). So how to be sure, that *both* HDs are still perfectly o.k.? Am I am fine to do a cat /dev/hda2 /dev/null ; cat /dev/hdb2 /dev/null even *during* the md is active and getting used r/w? It's ok to do this. However, reads happen from both hard drives in a raid1 array in a sort of round robin fashion. You don't really know which reads are going to go where, but each drive will get read from. Doing what you suggest will get you a full read check on each drive and do so safely. Of course, if it's supported on your system, you could also just enable the SMART daemon and have it tell the drives to do continuous background media checks to detect sectors that are either already bad or getting ready to go bad (corrected error conditions). -- Doug Ledford [EMAIL PROTECTED] http://www.xsintricity.com/dledford http://www.livejournal.com/users/deerslayer AIM: DeerObliterator - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
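Spelled out with explicit redirection, the whole-member read check and the SMART alternative look like this; the smartctl invocations are illustrative of the idea rather than a prescription:

# Read every sector of both raid1 members (safe while the array is in use):
cat /dev/hda2 > /dev/null
cat /dev/hdb2 > /dev/null

# Or let the drives scan their own media in the background:
smartctl -t long /dev/hda      # start a long self-test
smartctl -l selftest /dev/hda  # check the result once it completes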
Re: raid5 - failed disks - i'm confusing
On Fri, 2005-04-01 at 03:22 -0800, Alvin Oga wrote: On Fri, 1 Apr 2005, Gordon Henderson wrote: On Fri, 1 Apr 2005, Alvin Oga wrote: - ambient temp should be 65F or less and disk operating temp ( hddtemp ) should be 35 or less Are we confusing F and C here? 65F was for normal server room environment ( some folks use 72F for office ) and i changed units to 35C for hd operating temp vs 25C - most of my ide disks run at under 30C - p4-2.xG cpu temps under 40C hddtemp typically reports temperatures in C. 35F is bloody cold! nah ... i like my disks cold to the touch ... ( 2 fans per disks ) Just for the record, second guessing mechanical engineers with thermodynamics background training and an eye towards differing material expansion rates and the like can be risky. This is like saying Nah, I like the engine in my car to run cold, so I use no thermostat and two fans on the radiator. It might sound like a good idea to you, but proper cylinder to piston wall clearance is obtained at a specific temperature (cylinder sleeves are typically some sort of iron or steel compound and expand in diameter slower than the aluminum pistons when heated to operating temperature, so the pistons are made smaller in diameter at room temperature so that when both the sleeve and the piston are at operating temperature the clearance will be correct). Running an engine at a lower temperature increases that clearance and can result in premature piston failure. As far as hard drive internals are concerned, I'm not positive whether or not they are subject to the same sort of thermal considerations, but just looking at the outside of a hard drive shows a very common case of an aluminum cast frame and some sort of iron/steel based top plate. These are going to expand at different rates with temperature and for all I know if you run the drive overly cool, you may be placing undue stress on the seal between these two parts of the drive (consider the case of both the aluminum frame and the top plate having a channel for a rubber o-ring, and until the drive reaches operating temp. the channels may not line up perfectly, resulting in stress on the o-ring). Anyway, it might or might not hurt the drives to run them well below their designed operating temperature, I don't have schematics and materials lists in front of me to tell for sure. But second guessing mechanical engineers that likely have compensated for thermal issues at a given, specific common operating temperature is usually risky. Most people think Heat kills and therefore like to keep things as cool as possible. For mechanical devices anyway, it's not so much that heat kills, as it is operating outside of the designed temperature range, either above or below, that reduces overall life expectancy. Keep your drives from overheating, but don't try to freeze them would be my advice. -- Doug Ledford [EMAIL PROTECTED] http://people.redhat.com/dledford - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid1-diseaster on reboot: old version overwrites new version
On Sat, 2005-04-02 at 09:35 -0800, Tim Moore wrote: peter pilsl wrote: The only explantion to me is, that I had the wrong entry in my lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2 So maybe root was always mounted as /dev/hda6 and never as /dev/md2, which was started, but never had any data written to it. Is this a possible explanation? No. The lilo.conf entry just tells the kernel where root is located. Yes, as Neil posted, this exactly explains the issue. If /dev/hda6 is part of a raid1 array, and you write to it instead of /dev/md2, then those writes are never sent to /dev/hdc6 and the two devices get out of sync. Plus, standard initrd setups and the like are written to accommodate users passing in arbitrary root= options on the kernel command line to over ride the default root partition, and in those situations the root partition must be taken from the command line and not from fstab in order for this to work. So, whether it's lilo or grub or whatever, the root= line on your kernel command line is *the* authority when it comes to what will be mounted as the root partition you actually use. Can you publish your /etc/fstab and fdisk -l output? Keep in mind the root partitions is already mounted in ro mode by the time fstab is available and the rc.sysinit script merely remounts it rw. Again, the command line is the authority. -- Doug Ledford [EMAIL PROTECTED] http://people.redhat.com/dledford - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 and data safety?
Drive A is busy completing some reads at the moment, and drive B isn't and completes the end-of-journal write quickly. The patch Peter posted (or at least talked about, I can't remember which) would then return a completion event to the ext3 journal code. The ext3 code would then assume the journal was all complete and start issuing the writes related to that journal transaction en masse. These writes will then go to drives A and B. Since drive A was busy with some reads, it gets these writes prior to completing the end-of-transaction write it already had in its queue. Being a nice, smart SCSI disk with tagged queuing enabled, it then proceeds to complete the whole queue of writes in whatever order is most efficient for it. It completes two of the writes that were issued by the ext3 filesystem after the ext3 filesystem thought the journal entry was complete, and then the machine has a power supply failure and nothing else gets written.

As it turns out, drive A is the first drive in the rdev array, so on reboot it's selected as the master for resync. Now, that means that all the data, journal and everything else, is going to be copied from drive A to drive B. And guess what: we never completed that end-of-journal write on drive A, so when the ext3 filesystem is mounted, that journal transaction is going to be considered incomplete and *not* get replayed. But we've also written a couple of the updates from that transaction to disk A already. Well, there you go, data corruption. So, Peter, if you are still toying with that patch, it's a *BAD* idea.

That's what using a journaling filesystem on top of an md device gets you in terms of what problems the journaling solves for the md device. In turn, a weakness of any journaling filesystem is that it is inherently vulnerable to hard disk failures. A drive failure takes out the filesystem and your machine becomes unusable. Obviously, this very problem is what md solves for filesystems. Whether talking about the journal or the rest of the filesystem, if you let a hard drive error percolate up to the filesystem, then you've failed in the goal of software raid.

I remember talk once about how putting the journal on the raid device was bad because it would cause the media in that area of the drive to wear out faster. The proper response to that is: So. I don't care. If that section of media wears out faster, fine by me, because I'm smart and put both my journal and my filesystem on a software raid device that allows me to replace the worn-out device with a fresh one without ever losing any data or suffering a crash. The goal of the md layer is not to prevent drive wear-out; the goal is to make us tolerant of drive failures so we don't care when they happen, we simply replace the bad drive and go on. Since drive failures happen on a fairly regular basis without md, if the price of not suffering problems as a result of those failures is that we slightly increase the failure rate due to excessive writing in the journal area, then fine by me.

In addition, if you use raid5 arrays like I do, then putting the journal on the raid array is a huge win because of the outrageously high sequential throughput of a raid5 array. Journals are preallocated at filesystem creation time and occupy a more or less sequential area on the disks. Journals are also more or less a ring buffer.
You can tune the journal size to a reasonable multiple of a full stripe size on the raid5 array (say something like 1 to 10 MB per disk, so in a 5-disk raid5 array I'd use between a 4 and 40 MB journal, depending on whether I thought I would be doing a lot of large writes of sufficient size to utilize a large journal), turn on journaling of not just meta-data but all data, and then benefit from the fact that the journal writes take place as more or less sequential writes as seen by things like tiobench benchmark runs. Because typical filesystem writes are usually much more random in nature, the journaling overhead can be reduced to no more than, say, a 25% performance loss while getting the benefit of having both meta-data and regular data journaled. It's certainly *far* faster than sticking the journal on some other device unless it's another very fast raid array.

Anyway, I think the situation can be summed up as this: See Peter try to admin lots of machines. See Peter imagine problems that don't exist. See Peter disable features that would make his life easier as Peter takes steps to circumvent his imaginary problems. See Peter stay at work over the New Year's holiday fixing problems that were likely a result of his own efforts to avoid problems. Don't be a Peter; listen to Neil.

-- Doug Ledford [EMAIL PROTECTED]
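As a rough illustration of the sizing described above, and strictly a sketch: assume a hypothetical 5-disk raid5 at /dev/md0 with a 64k chunk and 4k filesystem blocks, giving a 256k data stripe (4 data disks x 64k), and a reasonably recent e2fsprogs:

  # stride = chunk size / block size = 64k / 4k = 16 blocks
  # a 32MB journal is a whole multiple of the 256k data stripe and falls in the 4-40MB range above
  mke2fs -j -b 4096 -E stride=16 -J size=32 /dev/md0

  # journal all data, not just meta-data
  mount -t ext3 -o data=journal /dev/md0 /mnt/data

The chunk size, journal size and mount point here are made-up example values; the point is only that the journal comes out to a whole number of full stripes and that data=journal turns on full data journaling as discussed.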
Re: raid5 - failed disks - i'm confusing
On Mon, 2005-04-04 at 15:51 -0700, Alvin Oga wrote: On Mon, 4 Apr 2005, Doug Ledford wrote: Anyway, it might or might not hurt the drives to run them well below their designed operating temperature, I don't have schematics and materials lists in front of me to tell for sure. ez enough to do ... its called specs on the various manufacturers websites ... similarly for the operating temp of the ICs on the disk controllers .. you're welcome to run your disks hot ...

I didn't say to run them hot, just at design temp. Overheating is bad, just like you mentioned.

i prefer to run it cool to the finger touch test as the server room to be 65F and its a known fact for 40+ years ... heat kills electromechanical items, car engines is a different animal for different reasons

Yes it does, and my point wasn't to say that it doesn't, just to say that for the mechanical portion of electromechanical devices, excessive cooling can be bad as well.

-- Doug Ledford [EMAIL PROTECTED] http://people.redhat.com/dledford