Re: stride / stripe alignment on LVM ?
Bill Davidsen said: (by the date of Fri, 02 Nov 2007 09:01:05 -0400)

> So I would expect this to make a very large performance difference, so
> even if it worked it would do so slowly.

I was trying to find out the stripe layout for a few hours, using hexedit
and dd. And I'm baffled:

md1 : active raid5 hda3[0] sda3[1]
      969907968 blocks super 1.1 level 5, 128k chunk, algorithm 2 [3/2] [UU_]
      bitmap: 8/8 pages [32KB], 32768KB chunk

I fill md1 with random data:

# dd bs=128k count=64 if=/dev/urandom of=/dev/md1
# hexedit /dev/md1

I copy/paste (and remove formatting) the first 32 bytes of /dev/md1. Now I
search for those 32 bytes in /dev/hda3 and in /dev/sda3:

# hexedit /dev/hda3
# hexedit /dev/sda3

And no luck! I'd expect the first bytes of /dev/md1 to be at the beginning
of the first drive (hda3).

I pick the next 20 bytes from /dev/md1 and I can find them on /dev/hda3
starting just after address 0x1. The bytes before and after those 20 bytes
are similar to those on /dev/md1. So now I hexedit /dev/md1 and write by
hand 32 bytes of 0xAA. Then I look at address 0x1 on /dev/hda3 - and there
is no 0xAA at all.

Well.. it's not critical for me, so you can just ignore my mumbling, I was
just wondering what obvious thing I missed. There seems to be more XORing
(or something else) involved than I expected. Maybe the disc did not flush
writes, and what I see on /dev/md1 is not yet present on /dev/hda3 (how's
that possible?)

Nevertheless, I think that I will give up on LVM and just put ext3 on
/dev/md1, to avoid this stripe misalignment. I wanted LVM here only
because I might want to use lvm-snapshot, but I can live without that. I
can already grow /dev/md1 without LVM, using mdadm --grow.

best regards
--
Janek Kozicki |
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
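[A note on the experiment above: with a 1.1 superblock the metadata sits at
the *start* of each component device, so the array's byte 0 only begins at
the per-device data offset (`mdadm -E` reports it as "Data Offset"). Also,
since the array is degraded ([3/2] [UU_]), chunks belonging to the missing
slot exist only as XOR parity on the surviving disks, and writes to
/dev/md1 may still be sitting in the page cache when you hexedit the raw
components. A rough sketch of the expected mapping follows; the
DATA_OFFSET value is purely hypothetical (check yours with mdadm -E), and
the layout formula is for raid5 left-symmetric, i.e. "algorithm 2":]

```python
CHUNK = 128 * 1024          # 128k chunk, from /proc/mdstat
NDISKS = 3                  # 3-device raid5 (one member currently missing)
# Hypothetical value: the real number is the "Data Offset" reported by
# `mdadm -E` on a component; it is nonzero for 1.1 superblocks because
# the superblock (and bitmap) live at the start of each device.
DATA_OFFSET = 264 * 512

def locate(array_byte):
    """Map a byte offset on /dev/md1 to (component slot, byte offset on
    that component) for RAID5 left-symmetric (algorithm 2)."""
    data_disks = NDISKS - 1
    chunk_no, in_chunk = divmod(array_byte, CHUNK)
    stripe, d = divmod(chunk_no, data_disks)
    parity = data_disks - (stripe % NDISKS)   # parity chunk rotates left
    disk = (parity + 1 + d) % NDISKS          # data chunks follow parity
    return disk, DATA_OFFSET + stripe * CHUNK + in_chunk

# Byte 0 of the array IS on the first component, but DATA_OFFSET bytes in:
print(locate(0))            # → (0, 135168) with the assumed offset
print(locate(CHUNK))        # second chunk lands on the next slot
```

So searching at offset 0 of /dev/hda3 can only ever find superblock/bitmap
data, never the first bytes of /dev/md1.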
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 15:15 -0400, Doug Ledford wrote:
> It was tested, it simply obviously had a bug you hit.  Assuming that
> your particular failure situation is the only possible outcome for all
> the other people that used it would be an invalid assumption.  There are
> lots of code paths in an error handler routine, and lots of different
> hardware failure scenarios, and they each have their own independent
> outcome should they ever be experienced.

This is the kind of statement that made me say you were belittling my
experiences. And to think that since I've hit it on three different
machines with different hardware and different kernel versions it won't
affect others is something else. I thought I was helping, but don't worry,
I learned my lesson; it won't happen again. I asked people for their
experiences; clearly not everybody is as lucky as I am.

> Then you didn't pay attention to what I said before: RHEL3 was the first
> ever RHEL product that had support for SATA hardware.  The SATA drivers
> in RHEL3 *were* first gen.

Oh, I paid attention alright. It is my fault for assuming that things not
marked as experimental are not experimental.

Alberto
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 13:21 -0500, Alberto Alonso wrote:
> On Fri, 2007-11-02 at 11:45 -0400, Doug Ledford wrote:
>
> > The key word here being "supported".  That means if you run across a
> > problem, we fix it.  It doesn't mean there will never be any problems.
>
> On hardware specs I normally read "supported" as "tested within that
> OS version to work within specs". I may be expecting too much.

It was tested, it simply obviously had a bug you hit.  Assuming that your
particular failure situation is the only possible outcome for all the
other people that used it would be an invalid assumption.  There are lots
of code paths in an error handler routine, and lots of different hardware
failure scenarios, and they each have their own independent outcome should
they ever be experienced.

> > I'm sorry, but given the "specially the RHEL" case you cited, it is
> > clear I can't help you.  No one can.  You were running first gen
> > software on first gen hardware.  You show me *any* software company
> > whose first gen software never has to be updated to fix bugs, and I'll
> > show you a software company that went out of business the day after
> > they released their software.
>
> I only pointed to RHEL as an example since that was a particular
> distro that I use and exhibited the problem. I probably could have
> replaced it with Suse, Ubuntu, etc. I may have called the early
> versions back in 94 first gen, but not today's versions. I know I
> didn't expect the SLS distro to work reliably back then.

Then you didn't pay attention to what I said before: RHEL3 was the first
ever RHEL product that had support for SATA hardware.  The SATA drivers in
RHEL3 *were* first gen.

> Can you provide specific chipsets that you used (specially for SATA)?

All of the Adaptec SCSI chipsets through the 7899, Intel PATA, QLogic FC,
and nVidia and winbond based SATA.

--
Doug Ledford <[EMAIL PROTECTED]>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

signature.asc
Description: This is a digitally signed message part
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 11:45 -0400, Doug Ledford wrote:
> The key word here being "supported".  That means if you run across a
> problem, we fix it.  It doesn't mean there will never be any problems.

On hardware specs I normally read "supported" as "tested within that OS
version to work within specs". I may be expecting too much.

> I'm sorry, but given the "specially the RHEL" case you cited, it is
> clear I can't help you.  No one can.  You were running first gen
> software on first gen hardware.  You show me *any* software company
> whose first gen software never has to be updated to fix bugs, and I'll
> show you a software company that went out of business the day after
> they released their software.

I only pointed to RHEL as an example since that was a particular distro
that I use and exhibited the problem. I probably could have replaced it
with Suse, Ubuntu, etc. I may have called the early versions back in 94
first gen, but not today's versions. I know I didn't expect the SLS
distro to work reliably back then.

Thanks for reminding me on what I should and shouldn't consider first
gen. I guess I should always wait for a couple of updates prior to
considering a distro stable; I'll keep that in mind in the future.

> I *really* can't help you.

And I never expected you to. None of my posts asked for support to get my
specific hardware and kernels working. I did ask for help identifying
combinations that work and those that don't. The thread on low level
timeouts within MD was meant as a forward thinking question to see if it
could solve some of these problems. It has been settled that it can't, so
that's that. I am really not trying to push the issue with MD timeouts.

> No, your experience, as you listed it, is that
> SATA/usb-storage/Serverworks PATA failed you.  The software raid never
> failed to perform as designed.

And I never said that software raid did anything outside what it was
designed to do. I did state that when the goal is to keep the server from
hanging (a reasonable goal if you ask me), the combination of
SATA/usb-storage/Serverworks PATA with software raid is not a working
solution (nor is it without software raid, for that matter).

> However, one of the things you are doing here is drawing sweeping
> generalizations that are totally invalid.  You are saying your
> experience is that SATA doesn't work, but you aren't qualifying it with
> the key factor: SATA doesn't work in what kernel version?  It is
> pointless to try and establish whether or not something like SATA works
> in a global, all kernel inclusive fashion because the answer to the
> question varies depending on the kernel version.  And the same is true
> of pretty much every driver you can name.  This is why commercial
> companies don't just certify hardware, but the software version that
> actually works as opposed to all versions.

At time of purchase the hardware vendor (Supermicro for those interested)
listed RHEL v3, which is what got installed.

> In truth, you have *no idea*
> if SATA works today, because you haven't tried.  As David pointed out,
> there was a significant overhaul of the SATA error recovery that took
> place *after* the kernel versions that failed you which totally
> invalidates your experiences and requires retesting of the later
> software to see if it performs differently.

I completely agree that retesting is needed based on the improvements
stated. I don't think it invalidates my experiences though; it does date
them, but that's fine. And yes, I see your point on always listing
specific kernel versions. I will do better with the details in the
future.

> I've had *lots* of success with software RAID as I've been running it
> for years.  I've had old PATA drives fail, SCSI drives fail, FC drives
> fail, and I've had SATA drives that got kicked from the array due to
> read errors but not out and out drive failures.  But I keep at least
> reasonably up to date with my kernels.

> Can you provide specific chipsets that you used (specially for SATA)?

Thanks,

Alberto
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 11:09 +, David Greaves wrote:
> David
> PS I can't really contribute to your list - I'm only using cheap desktop
> hardware.

If you had failures and it properly handled them, then you can contribute
to the good combinations; so far that's the list that is kind of
empty :-(

Thanks,

Alberto
Re: switching root fs '/' to boot from RAID1 with grub
H. Peter Anvin wrote:

> Doug Ledford wrote:
>
>> device /dev/sda (hd0)
>> root (hd0,0)
>> install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
>>
>> device /dev/hdc (hd0)
>> root (hd0,0)
>> install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
>>
>> That will install grub on the master boot record of hdc and sda, and in
>> both cases grub will look to whatever drive it is running on for the
>> files to boot instead of going to a specific drive.
>
> No, it won't... it'll look for the first drive in the system (BIOS
> drive 80h).  This means that if the BIOS can see the bad drive, but it
> doesn't work, you're still screwed.
>
> 	-hpa

Depends how "bad" the drive is. Just to align the thread on this - if the
boot sector is bad, the BIOS on newer boxes will skip to the next one.
But if it is "good" and you boot into garbage - could be Windows... does
it crash?

b
Re: switching root fs '/' to boot from RAID1 with grub
On Thu, 2007-11-01 at 11:57 -0700, H. Peter Anvin wrote:
> Doug Ledford wrote:
> >
> > Correct, and that's what you want.  The alternative is that if the
> > BIOS can see the first disk but it's broken and can't be used, and if
> > you have the boot sector on the second disk set to read from BIOS
> > disk 0x81 because you ASSuMEd the first disk would be broken but
> > still present in the BIOS tables, then your machine won't boot unless
> > that first dead disk is still present.  If you remove the disk
> > entirely, thereby bumping disk 0x81 to 0x80, then you are screwed.
> > If you have any drive failure that prevents the first disk from being
> > recognized (blown fuse, blown electronics, etc), you are screwed
> > until you get a new disk to replace it.
>
> What you want is for it to use the drive number that BIOS passes into
> it (register DL), not a hard-coded number.  That was my (only) point --
> you're obviously right that hard-coding a number to 0x81 would be worse
> than useless.

Oh, and I forgot to mention that in grub2, the DL register is ignored for
RAID1 devices. Well, maybe not ignored, but once grub2 has determined
that the intended boot partition is a raid partition, the raid code takes
over and the raid code doesn't care about the DL register. Instead, it
scans for all the other members of the raid array and utilizes whichever
drives it needs to in order to complete the boot process. And since it
reads a sector (or a small group of sectors) at a time, it doesn't need
any member of a raid1 array to be perfect; it will attempt a round robin
read on all the sectors and only fail if all drives return an error for a
given read.

--
Doug Ledford <[EMAIL PROTECTED]>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
Re: Time to deprecate old RAID formats?
On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote:
> Doug Ledford wrote:
> >>
> >> I would argue that ext[234] should be clearing those 512 bytes.  Why
> >> aren't they cleared
> >
> > Actually, I didn't think msdos used the first 512 bytes for the same
> > reason ext3 doesn't: space for a boot sector.
>
> The creators of MS-DOS put the superblock in the bootsector, so that
> the BIOS loads them both.  It made sense in some diseased Microsoft
> programmer's mind.
>
> Either way, for RAID-1 booting, the boot sector really should be part
> of the protected area (and go through the MD stack.)

It depends on what you are calling the protected area. If by that you
mean outside the filesystem itself, and in a non-replicated area like
where the superblock and internal bitmaps go, then yes, that would be
ideal. If you mean in the file system proper, then that depends on the
boot loader.

> The bootloader should
> deal with the offset problem by storing partition/filesystem-relative
> pointers, not absolute ones.

Grub2 is on the way to this, but it isn't there yet.

--
Doug Ledford <[EMAIL PROTECTED]>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
Re: Implementing low level timeouts within MD
On Fri, 2007-11-02 at 03:41 -0500, Alberto Alonso wrote:
> On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
> > Not in the older kernel versions you were running, no.
>
> These "old versions" (specially the RHEL) are supposed to be
> the official versions supported by Redhat and the hardware
> vendors, as they were very specific as to what versions of
> Linux were supported.

The key word here being "supported".  That means if you run across a
problem, we fix it.  It doesn't mean there will never be any problems.

> Of all people, I would think you would
> appreciate that. Sorry if I sound frustrated and upset, but
> it is clearly a result of what "supported and tested" really
> means in this case.

I'm sorry, but given the "specially the RHEL" case you cited, it is clear
I can't help you.  No one can.  You were running first gen software on
first gen hardware.  You show me *any* software company whose first gen
software never has to be updated to fix bugs, and I'll show you a
software company that went out of business the day after they released
their software.

Our RHEL3 update kernels contained *significant* updates to the SATA
stack after our GA release, replete with hardware driver updates and bug
fixes.  I don't know *when* that RHEL3 system failed, but I would venture
a guess that it wasn't prior to RHEL3 Update 1.  So, I'm guessing you
didn't take advantage of those bug fixes.  And I would hardly call once a
quarter "continuously updating" your kernel.  In any case, given your
insistence on running first gen software on first gen hardware and not
taking advantage of the support we *did* provide to protect you against
that failure, I say again that I can't help you.

> I don't want to go into a discussion of
> commercial distros, which are "supported", as this is neither the
> time nor the place, but I don't want to open the door to the
> excuse of "it's an old kernel"; it wasn't when it got installed.

I *really* can't help you.

> Outside of the rejected suggestion, I just want to figure out
> when software raid works and when it doesn't. With SATA, my
> experience is that it doesn't. So far I've only received one
> response stating success (they were using the 3ware and Areca
> product lines).

No, your experience, as you listed it, is that
SATA/usb-storage/Serverworks PATA failed you.  The software raid never
failed to perform as designed.

However, one of the things you are doing here is drawing sweeping
generalizations that are totally invalid.  You are saying your experience
is that SATA doesn't work, but you aren't qualifying it with the key
factor: SATA doesn't work in what kernel version?  It is pointless to try
and establish whether or not something like SATA works in a global, all
kernel inclusive fashion because the answer to the question varies
depending on the kernel version.  And the same is true of pretty much
every driver you can name.  This is why commercial companies don't just
certify hardware, but the software version that actually works as opposed
to all versions.  In truth, you have *no idea* if SATA works today,
because you haven't tried.  As David pointed out, there was a significant
overhaul of the SATA error recovery that took place *after* the kernel
versions that failed you, which totally invalidates your experiences and
requires retesting of the later software to see if it performs
differently.

> Anyway, this thread just posed the question, and as Neil pointed
> out, it isn't feasible/worth it to implement timeouts within the md
> code. I think most of the points/discussions raised beyond that
> original question really belong to the thread "Software RAID when
> it works and when it doesn't"
>
> I do appreciate all comments and suggestions and I hope to keep
> them coming. I would hope however to hear more about success
> stories with specific hardware details. It would be helpful
> to have a list of tested configurations that are known to work.

I've had *lots* of success with software RAID as I've been running it for
years.  I've had old PATA drives fail, SCSI drives fail, FC drives fail,
and I've had SATA drives that got kicked from the array due to read
errors but not out and out drive failures.  But I keep at least
reasonably up to date with my kernels.

--
Doug Ledford <[EMAIL PROTECTED]>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford
Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
Re: Superblocks
Any reason 0.9 is the default? Should I be worried about using 1.0
superblocks? And can I "upgrade" my array from 0.9 to 1.0 superblocks?

Thanks,
Greg

On 11/1/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > Which is the default type of superblock? 0.90 or 1.0?
>
> The default default is 0.90.
> However a local default can be set in mdadm.conf with e.g.
>    CREATE metadata=1.0
>
> NeilBrown
Re: does mdadm try to use fastest HDD ?
Janek Kozicki wrote:

> Hello,
>
> My three HDDs have the following speeds:
>
>   hda - speed 70 MB/sec
>   hdc - speed 27 MB/sec
>   sda - speed 60 MB/sec
>
> They form a raid1 /dev/md0 and a raid5 /dev/md1 array. I wanted to ask
> if mdadm tries to pick the fastest HDD during operation? Maybe I can
> "tell" which HDD is preferred?

If you are doing raid-1 between hdc and some faster drive, you could try
using write-mostly and see how that works for you. For raid-5, it's
faster to read the data off the slow drive than reconstruct it with
multiple reads to multiple other faster drives.

> This came to my mind when I saw this:
>
> # mdadm --query --detail /dev/md1 | grep Prefer
>     Preferred Minor : 1
>
> And also in the manual:
>
> -W, --write-mostly [...] "can be useful if mirroring over a slow link."
>
> many thanks for all your help!

I have two thoughts on this:

1 - if performance is critical, replace the slow drive
2 - for most things you do, I would expect seek to be more important than
    transfer rate

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
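[For the archives, a sketch of actually setting write-mostly on the slow
mirror leg. Device names here (md0, hdc1) are purely illustrative, and the
sysfs path assumes a kernel new enough to expose per-device md state
files:]

```shell
# Re-add the slow leg with the write-mostly flag (newer mdadm releases
# accept --write-mostly in manage mode so it applies to the added device):
mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1
mdadm /dev/md0 --add --write-mostly /dev/hdc1

# Alternatively, on kernels with md sysfs support, toggle it in place:
echo writemostly > /sys/block/md0/md/dev-hdc1/state

# Verify: the device should now be marked (W) in /proc/mdstat
grep -A1 '^md0' /proc/mdstat
```

With the flag set, reads are directed to the other mirror member(s)
whenever possible, which is exactly what you want with a 27 MB/sec drive
mirrored against a 70 MB/sec one.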
Re: stride / stripe alignment on LVM ?
Neil Brown wrote:

> On Thursday November 1, [EMAIL PROTECTED] wrote:
> > Hello,
> >
> > I have raid5 /dev/md1, --chunk=128 --metadata=1.1. On it I have
> > created LVM volume called 'raid5', and finally a logical volume
> > 'backup'. Then I formatted it with command:
> >
> > mkfs.ext3 -b 4096 -E stride=32 -E resize=550292480 /dev/raid5/backup
> >
> > And because LVM is putting its own metadata on /dev/md1, the ext3
> > partition is shifted by some (unknown for me) amount of bytes from
> > the beginning of /dev/md1.
> >
> > I was wondering, how big is the shift, and would it hurt the
> > performance/safety if the `ext3 stride=32` didn't align perfectly
> > with the physical stripes on HDD?
>
> It is probably better to ask this question on an ext3 list as people
> there might know exactly what 'stride' does.
> I *think* it causes the inode tables to be offset in different
> block-groups so that they are not all on the same drive.  If that is
> the case, then an offset caused by LVM isn't going to make any
> difference at all.

Actually, I think that all of the performance evil Doug was mentioning
will apply to LVM as well. So if things are poorly aligned, they will be
poorly handled: a stripe-sized write will not go in a stripe, but will
overlap chunks and cause all the data from all chunks to be read back for
a new raid-5 calculation. So I would expect this to make a very large
performance difference; even if it worked, it would do so slowly.

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
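[A quick sanity check of the stride arithmetic in the mkfs.ext3 command
quoted above: stride is the md chunk size divided by the ext3 block size,
and a full data stripe spans (raid_disks - 1) such chunks on raid5. Later
e2fsprogs also grew a stripe-width extended option for exactly this
full-stripe figure:]

```python
CHUNK = 128 * 1024       # --chunk=128 (KiB) on the md array
BLOCK = 4096             # -b 4096 on mkfs.ext3
RAID_DISKS = 3           # 3-disk raid5: 2 data chunks + 1 parity per stripe

stride = CHUNK // BLOCK                   # fs blocks per md chunk
stripe_width = stride * (RAID_DISKS - 1)  # fs blocks per full data stripe

print(stride)        # → 32, matching -E stride=32 in the command above
print(stripe_width)  # → 64 blocks in a full stripe-sized write
```

So the stride value in the original command is self-consistent; the open
question in the thread is only whether LVM's metadata shifts that
64-block stripe off the physical chunk boundaries.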
Re: Very small internal bitmap after recreate
On 02.11.2007, at 12:43, Neil Brown wrote:

> For now, you will have to live with a smallish bitmap, which probably
> isn't a real problem.

Ok then.

> > Array Slot : 3 (0, 1, failed, 2, 3, 4)
> > Array State : uuUuu 1 failed
> >
> > This time I'm getting nervous - Array State failed doesn't sound
> > good!
>
> This is nothing to worry about - just a bad message from mdadm.
> The superblock has recorded that there was once a device in position 2
> which is now failed (see the list in "Array Slot").  This summarises
> as "1 failed" in "Array State".
> But the array is definitely working OK now.

Good to know.

Thanks a lot
Ralf
Re: Implementing low level timeouts within MD
Alberto Alonso wrote:

> On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
> > Not in the older kernel versions you were running, no.
>
> These "old versions" (specially the RHEL) are supposed to be
> the official versions supported by Redhat and the hardware
> vendors, as they were very specific as to what versions of
> Linux were supported.

So the vendors of the failing drives claimed that these kernels were
supported? That's great; most vendors don't even consider Linux
supported. What response did you get when you reported the problem to
Redhat on your RHEL support contract? Did they agree that this hardware,
and its use for software raid, was supported and intended?

> Of all people, I would think you would
> appreciate that. Sorry if I sound frustrated and upset, but
> it is clearly a result of what "supported and tested" really
> means in this case. I don't want to go into a discussion of
> commercial distros, which are "supported", as this is neither the
> time nor the place, but I don't want to open the door to the
> excuse of "it's an old kernel"; it wasn't when it got installed.

The problem is in the time travel module. It didn't properly cope with
future hardware, and since you have very long uptimes, I'm reasonably
sure you haven't updated the kernel to get fixes installed.

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: Superblocks
Neil Brown wrote:

> On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > Which is the default type of superblock? 0.90 or 1.0?
>
> The default default is 0.90.
> However a local default can be set in mdadm.conf with e.g.
>    CREATE metadata=1.0

If you change to 1.start, 1.end, 1.4k names for clarity, they need to be
accepted here as well.

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: Time to deprecate old RAID formats?
Neil Brown wrote:

> On Friday October 26, [EMAIL PROTECTED] wrote:
> > Perhaps you could have called them 1.start, 1.end, and 1.4k in the
> > beginning? Isn't hindsight wonderful?
>
> Those names seem good to me.  I wonder if it is safe to generate them
> in "-Eb" output

If you agree that they are better, using them in the obvious places would
be better now than later. Are you going to put them in the metadata
options as well? Let me know; looking at the documentation is on my list
for next week, and I could include some text.

Maybe the key confusion here is between "version" numbers and "revision"
numbers. When you have multiple versions, there is no implicit assumption
that one is better than another. "Here is my version of what happened,
now let's hear yours." When you have multiple revisions, you do assume
ongoing improvement. v1.0, v1.1 and v1.2 are different versions of the v1
superblock, which itself is a revision of the v0... Like kernel releases,
people assume that the first number means *big* changes, the second
incremental change.

--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
does mdadm try to use fastest HDD ?
Hello,

My three HDDs have the following speeds:

  hda - speed 70 MB/sec
  hdc - speed 27 MB/sec
  sda - speed 60 MB/sec

They form a raid1 /dev/md0 and a raid5 /dev/md1 array. I wanted to ask if
mdadm tries to pick the fastest HDD during operation? Maybe I can "tell"
which HDD is preferred?

This came to my mind when I saw this:

# mdadm --query --detail /dev/md1 | grep Prefer
    Preferred Minor : 1

And also in the manual:

-W, --write-mostly [...] "can be useful if mirroring over a slow link."

many thanks for all your help!
--
Janek Kozicki |
Re: Very small internal bitmap after recreate
On Friday November 2, [EMAIL PROTECTED] wrote:
>
> Am 02.11.2007 um 10:22 schrieb Neil Brown:
>
> > On Friday November 2, [EMAIL PROTECTED] wrote:
> >> I have a 5 disk version 1.0 superblock RAID5 which had an internal
> >> bitmap that has been reported to have a size of 299 pages in /proc/
> >> mdstat. For whatever reason I removed this bitmap (mdadm --grow --
> >> bitmap=none) and recreated it afterwards (mdadm --grow --
> >> bitmap=internal). Now it has a reported size of 10 pages.
> >>
> >> Do I have a problem?
> >
> > Not a big problem, but possibly a small problem.
> > Can you send
> >    mdadm -E /dev/sdg1
> > as well?
>
> Sure:
>
> # mdadm -E /dev/sdg1
> /dev/sdg1:
>           Magic : a92b4efc
>         Version : 01
>     Feature Map : 0x1
>      Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
>            Name : 1
>   Creation Time : Wed Oct 31 14:30:55 2007
>      Raid Level : raid5
>    Raid Devices : 5
>
>   Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
>      Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
>       Used Size : 625136896 (298.09 GiB 320.07 GB)
>    Super Offset : 625137264 sectors

So there are 256 sectors before the superblock where a bitmap could go,
or about 6 sectors afterwards.

>           State : clean
>     Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d
>
> Internal Bitmap : 2 sectors from superblock

And the '6 sectors afterwards' was chosen.  6 sectors has room for
5*512*8 = 20480 bits, and from your previous email:

> Bitmap : 19078 bits (chunks), 0 dirty (0.0%)

you have 19078 bits, which is about right (as the bitmap chunk size must
be a power of 2).

So the problem is that "mdadm -G" is putting the bitmap after the
superblock rather than considering the space before (checks code).

Ahh, I remember now.  There is currently no interface to tell the kernel
where to put the bitmap when creating one on an active array, so it
always puts it in the 'safe' place.  Another enhancement waiting for
time.

For now, you will have to live with a smallish bitmap, which probably
isn't a real problem.  With 19078 bits, you will still get a
several-thousand-fold increase in resync speed after a crash (i.e. hours
become seconds), and to some extent fewer bits are better as you have to
update them less.  I haven't made any measurements to see what size
bitmap is ideal... maybe someone should :-)

>     Update Time : Fri Nov  2 07:46:38 2007
>        Checksum : 4ee307b3 - correct
>          Events : 408088
>
>          Layout : left-symmetric
>      Chunk Size : 128K
>
>      Array Slot : 3 (0, 1, failed, 2, 3, 4)
>     Array State : uuUuu 1 failed
>
> This time I'm getting nervous - Array State failed doesn't sound good!

This is nothing to worry about - just a bad message from mdadm.
The superblock has recorded that there was once a device in position 2
which is now failed (see the list in "Array Slot").  This summarises as
"1 failed" in "Array State".
But the array is definitely working OK now.

NeilBrown
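[The bit count in Neil's explanation can be reproduced directly from the
mdadm -E numbers quoted in the thread: one bit per bitmap chunk over the
per-device data area. A small check, using only figures that appear
above:]

```python
import math

# Values from the mdadm -E / mdstat output quoted in this thread.
USED_SIZE_SECTORS = 625136896   # "Used Size", in 512-byte sectors
BITMAP_CHUNK = 16384 * 1024     # "16384KB chunk" from /proc/mdstat, bytes

# One bitmap bit covers one bitmap chunk of the device's data area.
bits = math.ceil(USED_SIZE_SECTORS * 512 / BITMAP_CHUNK)
print(bits)                     # → 19078, matching "19078 bits (chunks)"

# The ~6 spare sectors after the superblock hold at most 5*512*8 bits,
# so 16384KB is the smallest power-of-2 chunk whose bitmap still fits:
assert bits <= 5 * 512 * 8
assert math.ceil(USED_SIZE_SECTORS * 512 / (BITMAP_CHUNK // 2)) > 5 * 512 * 8
```

This is why halving the bitmap chunk (and doubling the resolution) would
no longer fit in the space left after a 1.0 superblock.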
Re: stride / stripe alignment on LVM ?
Janek Kozicki wrote:

> And because LVM is putting its own metadata on /dev/md1, the ext3
> partition is shifted by some (unknown for me) amount of bytes from the
> beginning of /dev/md1.

It seems to be a multiple of 64 KiB. You can specify it during pvcreate,
with the --metadatasize option. It will be rounded to a multiple of
64 KiB, and will add another 64 KiB on its own. Extents will follow
directly after that. The 4 sectors mentioned in pvcreate's man page are
covered by that option as well.

So e.g. if you have a 1 MiB chunk, then

pvcreate ... --metadatasize 960K ...

should give you chunk-aligned logical volumes, assuming you have the
actual extent size set appropriately as well. If you use the default
chunk size, you shouldn't need any extra options.

Make sure it really is this way after pv/vg/first lv creation. I found it
experimentally, so ymmv.
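[The arithmetic behind that 960K suggestion can be checked on paper. Note
this encodes the poster's experimentally observed behaviour (round up to
64 KiB, plus another 64 KiB before the first extent), not a documented
LVM guarantee, so verify against your own pvcreate output:]

```python
KIB = 1024

def first_extent_offset(metadatasize_kib):
    """Offset of the first LVM extent, per the observation above:
    --metadatasize rounded up to a 64 KiB multiple, plus 64 KiB."""
    rounded = -(-metadatasize_kib // 64) * 64   # ceil to 64 KiB multiple
    return (rounded + 64) * KIB

chunk = 1024 * KIB                              # 1 MiB raid chunk
offset = first_extent_offset(960)               # --metadatasize 960K
print(offset, offset % chunk == 0)              # → 1048576 True

# The default --metadatasize lands on a 64 KiB-aligned offset, which is
# why default md chunk sizes need no extra options.
```

So 960K of metadata plus the 64 KiB of label area makes the first extent
start exactly 1 MiB in, on a stripe boundary of the example array.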
Re: Implementing low level timeouts within MD
Alberto Alonso wrote:
> On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
>> Not in the older kernel versions you were running, no.
>
> These "old versions" (especially the RHEL ones) are supposed to be
> the official versions supported by Redhat and the hardware vendors,
> as they were very specific about which versions of Linux were
> supported. Of all people, I would think you would appreciate that.
> Sorry if I sound frustrated and upset, but it is clearly a result of
> what "supported and tested" really means in this case. I don't want
> to go into a discussion of which commercial distros are "supported",
> as this is neither the time nor the place, but I don't want to open
> the door to the excuse of "it's an old kernel" - it wasn't when it
> got installed.

It may be worth noting that the context of this email is the upstream linux-raid list. In my time watching the list it is mainly focused on 'current' code and development (but hugely supportive of older environments). In general, discussions in this context will have a certain mindset - and it's not going to be the same as the one you'd find in an enterprise product support list.

> Outside of the rejected suggestion, I just want to figure out
> when software raid works and when it doesn't. With SATA, my
> experience is that it doesn't.

SATA, or more precisely error handling in SATA, has recently been significantly overhauled by Tejun Heo (IIRC). We're talking post 2.6.18 though (again IIRC) - so as far as SATA EH goes, older kernels bear no relation to the new ones. And the initial SATA EH code was, of course, beta :)

David

PS I can't really contribute to your list - I'm only using cheap desktop hardware.
Re: Very small internal bitmap after recreate
On 02.11.2007 at 11:22, Ralf Müller wrote:

> # mdadm -E /dev/sdg1
> /dev/sdg1:
>           Magic : a92b4efc
>         Version : 01
>     Feature Map : 0x1
>      Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
>            Name : 1
>   Creation Time : Wed Oct 31 14:30:55 2007
>      Raid Level : raid5
>    Raid Devices : 5
>   Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
>      Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
>       Used Size : 625136896 (298.09 GiB 320.07 GB)
>    Super Offset : 625137264 sectors
>           State : clean
>     Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d
> Internal Bitmap : 2 sectors from superblock
>     Update Time : Fri Nov  2 07:46:38 2007
>        Checksum : 4ee307b3 - correct
>          Events : 408088
>
>          Layout : left-symmetric
>      Chunk Size : 128K
>
>      Array Slot : 3 (0, 1, failed, 2, 3, 4)
>     Array State : uuUuu 1 failed
>
> This time I'm getting nervous - Array State failed doesn't sound good!

Just to make it clear - the array is still reported active in /proc/mdstat and behaves well - no failed devices:

md1 : active raid5 sdd1[0] sdh1[5] sdf1[4] sdg1[3] sde1[1]
      1250273792 blocks super 1.0 level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/10 pages [0KB], 16384KB chunk

Regards
Ralf
Re: Very small internal bitmap after recreate
On 02.11.2007 at 10:22, Neil Brown wrote:

> On Friday November 2, [EMAIL PROTECTED] wrote:
>> I have a 5 disk version 1.0 superblock RAID5 which had an internal
>> bitmap that was reported to have a size of 299 pages in /proc/mdstat.
>> For whatever reason I removed this bitmap (mdadm --grow --bitmap=none)
>> and recreated it afterwards (mdadm --grow --bitmap=internal). Now it
>> has a reported size of 10 pages. Do I have a problem?
>
> Not a big problem, but possibly a small problem.
>
> Can you send mdadm -E /dev/sdg1 as well?

Sure:

# mdadm -E /dev/sdg1
/dev/sdg1:
          Magic : a92b4efc
        Version : 01
    Feature Map : 0x1
     Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
           Name : 1
  Creation Time : Wed Oct 31 14:30:55 2007
     Raid Level : raid5
   Raid Devices : 5
  Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
     Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
      Used Size : 625136896 (298.09 GiB 320.07 GB)
   Super Offset : 625137264 sectors
          State : clean
    Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d
Internal Bitmap : 2 sectors from superblock
    Update Time : Fri Nov  2 07:46:38 2007
       Checksum : 4ee307b3 - correct
         Events : 408088

         Layout : left-symmetric
     Chunk Size : 128K

     Array Slot : 3 (0, 1, failed, 2, 3, 4)
    Array State : uuUuu 1 failed

This time I'm getting nervous - Array State failed doesn't sound good!

Regards
Ralf
Very small internal bitmap after recreate
I have a 5 disk version 1.0 superblock RAID5 which had an internal bitmap that was reported to have a size of 299 pages in /proc/mdstat. For whatever reason I removed this bitmap (mdadm --grow --bitmap=none) and recreated it afterwards (mdadm --grow --bitmap=internal). Now it has a reported size of 10 pages. Do I have a problem?

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid5 sdd1[0] sdh1[5] sdf1[4] sdg1[3] sde1[1]
      1250273792 blocks super 1.0 level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 0/10 pages [0KB], 16384KB chunk

# mdadm -X /dev/sdg1
        Filename : /dev/sdg1
           Magic : 6d746962
         Version : 4
            UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
          Events : 408088
  Events Cleared : 408088
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 312568448 (298.09 GiB 320.07 GB)
          Bitmap : 19078 bits (chunks), 0 dirty (0.0%)

# mdadm --version
mdadm - v2.6.2 - 21st May 2007

# uname -a
Linux DatenGrab 2.6.22.9-0.4-default #1 SMP 2007/10/05 21:32:04 UTC i686 i686 i386 GNU/Linux

Regards
Ralf
Re: Software RAID when it works and when it doesn't
On Sat, 2007-10-27 at 11:26 -0400, Bill Davidsen wrote:
> Alberto Alonso wrote:
> > On Fri, 2007-10-26 at 18:12 +0200, Goswin von Brederlow wrote:
> >
> >> Depending on the hardware you can still access a different disk while
> >> another one is resetting. But since there is no timeout in md it won't
> >> try to use any other disk while one is stuck.
> >>
> >> That is exactly what I miss.
> >>
> >> MfG
> >> Goswin
> >
> > That is exactly what I've been talking about. Can md implement
> > timeouts and not just leave it to the drivers?
> >
> > I can't believe it but last night another array hit the dust when
> > 1 of the 12 drives went bad. This year is just a nightmare for
> > me. It brought the whole network down until I was able to mark it
> > failed and reboot to remove it from the array.
>
> I'm not sure what kind of drives and drivers you use, but I certainly
> have drives go bad and they get marked as failed. Both on old PATA
> drives and newer SATA. All the SCSI I currently use is on IBM hardware
> RAID (ServeRAID), so I can only assume that failure would be noted.

--
Alberto Alonso
Global Gate Systems LLC.
(512) 351-7233
http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups
Re: Implementing low level timeouts within MD
On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
> I wasn't belittling them. I was trying to isolate the likely culprit in
> the situations. You seem to want the md stack to time things out. As
> has already been commented by several people, myself included, that's a
> band-aid and not a fix in the right place. The linux kernel community
> in general is pretty hard lined when it comes to fixing the bug in the
> wrong way.

It did sound as if I was complaining about nothing and that I shouldn't bother the linux-raid people, and should instead just continuously update the kernel and stop raising issues. If I misunderstood you I'm sorry, but somehow I still think that belittling my problems was implied in your responses.

> Not in the older kernel versions you were running, no.

These "old versions" (especially the RHEL ones) are supposed to be the official versions supported by Redhat and the hardware vendors, as they were very specific about which versions of Linux were supported. Of all people, I would think you would appreciate that. Sorry if I sound frustrated and upset, but it is clearly a result of what "supported and tested" really means in this case. I don't want to go into a discussion of which commercial distros are "supported", as this is neither the time nor the place, but I don't want to open the door to the excuse of "it's an old kernel" - it wasn't when it got installed.

> And I guarantee not a single one of those systems even knows what SATA
> is. They all use tried and true SCSI/FC technology.

Sure, the Tru64 units I talked about don't use SATA (although some did use PATA), I'll concede that point.

> In any case, if Neil is so inclined to do so, he can add timeout code
> into the md stack, it's not my decision to make.

The timeout was nothing more than a suggestion based on what I consider a reasonable expectation of usability. Neil said no and I respect that. If I didn't, I could always write my own as per the open source model :-) But I am not inclined to do so.

Outside of the rejected suggestion, I just want to figure out when software raid works and when it doesn't. With SATA, my experience is that it doesn't. So far I've only received one response reporting success (they were using the 3ware and Areca product lines).

Anyway, this thread just posed the question, and as Neil pointed out, it isn't feasible/worthwhile to implement timeouts within the md code. I think most of the points/discussions raised beyond that original question really belong in the thread "Software RAID when it works and when it doesn't".

I do appreciate all comments and suggestions and I hope to keep them coming. I would however hope to hear more about success stories with specific hardware details. It would be helpful to have a list of tested configurations that are known to work.

Alberto
Re: Bug in processing dependencies by async_tx_submit() ?
Hi Dan,

On Friday 02 November 2007 03:36, Dan Williams wrote:
> > This happens because of the specific implementation of
> > dma_wait_for_async_tx().
>
> So I take it you are not implementing interrupt based callbacks in
> your driver?

Why not? I have interrupt based callbacks in my driver. An INTERRUPT descriptor, implemented for both (COPY and XOR) channels, does the callback upon its completion.

Here is an example where your implementation of dma_wait_for_async_tx() will not work as expected. Say we have OP1 <--depends on-- OP2 <--depends on-- OP3, where

OP1: cookie = -EBUSY, channel = DMA0; <- not submitted
OP2: cookie = -EBUSY, channel = DMA0; <- not submitted
OP3: cookie = 101,    channel = DMA1; <- submitted, but not linked to h/w

where cookie == 101 is some valid, positive cookie; this means that OP3 *was submitted* to the DMA1 channel but *perhaps was not linked* into the h/w chain, for example because the threshold for DMA1 had not been reached yet.

With your implementation of dma_wait_for_async_tx() we do dma_sync_wait(OP2). And I propose to do dma_sync_wait(OP3), because in your case we may never wait for OP2 completion: dma_sync_wait() flushes the chains of DMA0 to h/w, but OP3 in DMA1 remains unlinked to h/w and it blocks the whole chain of dependencies.

> > The "iter" we finally wait for there corresponds to the last
> > allocated but not-yet-submitted descriptor. But if the "iter" we are
> > waiting for depends on another descriptor which has cookie > 0, but
> > is not yet submitted to the h/w channel because the threshold has not
> > been reached by this moment, then we may wait in
> > dma_wait_for_async_tx() infinitely. I think that it makes more sense
> > to get the first descriptor which was submitted to the channel but
> > probably is not put into the h/w chain, i.e. with cookie > 0, and do
> > dma_sync_wait() on this descriptor.
> >
> > When I modified dma_wait_for_async_tx() in this way, the kernel
> > lockup disappeared. But nevertheless the mkfs process hangs after
> > some time. So it looks like something is still missing in the
> > support for the chained-dependencies feature...
>
> I am preparing a new patch that replaces ASYNC_TX_DEP_ACK with
> ASYNC_TX_CHAIN_ACK. The plan is to make the entire chain of
> dependencies available up until the last transaction is submitted.
> This allows the entire dependency chain to be walked at
> async_tx_submit time so that we can properly handle these multiple
> dependency cases. I'll send it out when it passes my internal
> tests...

Fine. I guess this replacement assumes some modifications to the RAID-5 driver as well, right?

--
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com
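The descriptor-selection logic Yuri proposes — follow the dependency chain past every unsubmitted descriptor and sync-wait on the first one that actually has a valid cookie — can be modelled outside the kernel. The following is a toy Python sketch of only that selection step (the class, the function name and the -EBUSY constant are stand-ins mirroring the OP1/OP2/OP3 example above, reading the arrows so that OP1's parent chain leads to OP3; it is not the actual dmaengine code):

```python
EBUSY = 16  # kernel EBUSY errno; cookies <= 0 mean "not submitted"

class Desc:
    """Toy stand-in for struct dma_async_tx_descriptor."""
    def __init__(self, name, cookie, parent=None):
        self.name = name
        self.cookie = cookie
        self.parent = parent  # the descriptor this one depends on

def wait_target(tx):
    """Proposed selection: skip past every descriptor without a valid
    (positive) cookie and return the first ancestor that was actually
    submitted -- the one a sync-wait can make progress on."""
    it = tx
    while it.cookie <= 0 and it.parent is not None:
        it = it.parent
    return it

# The example from the message: only OP3 has been submitted (cookie 101);
# OP1 and OP2 are still unsubmitted on DMA0.
op3 = Desc("OP3", cookie=101)                 # submitted to DMA1
op2 = Desc("OP2", cookie=-EBUSY, parent=op3)  # not submitted
op1 = Desc("OP1", cookie=-EBUSY, parent=op2)  # not submitted

print(wait_target(op1).name)  # OP3, not the unsubmitted OP2
```

With the original behaviour the wait lands on OP2, whose -EBUSY cookie can never complete while OP3 sits unlinked; selecting OP3 instead lets the chain drain.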