Re: limits on raid
On Fri, 22 Jun 2007, David Greaves wrote: That's not a bad thing - until you look at the complexity it brings - and then consider the impact and exceptions when you do, eg hardware acceleration? md information fed up to the fs layer for xfs? simple long term maintenance? Often these problems are well worth the benefits of the feature. I _wonder_ if this is one where the right thing is to just say no :) In this case I think the advantage of a higher-level system knowing what block sizes are efficient for reads and writes is potentially HUGE. If the upper levels know that you have a 6-disk raid6 array with a 64K chunk size, then reads and writes in 256K chunks (aligned) should be able to be done at basically the speed of a 4-disk raid0 array. What's even more impressive is that this could be done even if the array is degraded (if you know which drives have failed you don't even try to read from them, and you only have to reconstruct the missing info once per stripe). The current approach doesn't give the upper levels any chance to operate in this mode; they just don't have enough information to do so. The part about wanting to know the raid0 chunk size, so that the upper layers can be sure that data that's supposed to be redundant ends up on separate drives, is also possible. Storage technology is headed in the direction of having the system make more and more of the layout decisions, and re-stripe the array as conditions change (similar to what md can already do with enlarging raid5/6 arrays), but unless you want to eventually put all that decision logic into the md layer, you should make it possible for other layers to query what's what; then they can give directions for what they want to have happen.
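A quick sketch of the arithmetic in that example (the 6-disk/64K geometry is taken from the post above; this is illustrative, not md code):

```python
# Back-of-envelope: a 6-disk RAID6 with a 64 KiB chunk has 4 data chunks
# per stripe, so a full, aligned 256 KiB read or write touches every data
# disk exactly once -- the same I/O pattern as a 4-disk RAID0 stripe.

CHUNK_KIB = 64   # chunk size from the example
DISKS = 6        # total disks in the raid6 array
PARITY = 2       # raid6 stores P and Q parity per stripe

data_disks = DISKS - PARITY
full_stripe_kib = data_disks * CHUNK_KIB

print(f"data disks per stripe: {data_disks}")   # 4
print(f"full stripe size: {full_stripe_kib}K")  # 256K
```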
so for several reasons I don't see this as something that's deserving of an automatic 'no'. David Lang - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: limits on raid
Bill Davidsen wrote: David Greaves wrote: [EMAIL PROTECTED] wrote: On Fri, 22 Jun 2007, David Greaves wrote: If you end up 'fiddling' in md because someone specified --assume-clean on a raid5 [in this case just to save a few minutes *testing time* on a system with a heavily choked bus!] then that adds *even more* complexity and exception cases into all the stuff you described. A few minutes? Are you reading the times people are seeing with multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days. Yes. But we are talking initial creation here. And as soon as you believe that the array is actually usable you cut that rebuild rate, perhaps in half, and get dog-slow performance from the array. It's usable in the sense that reads and writes work, but for useful work it's pretty painful. You either fail to understand the magnitude of the problem or wish to trivialize it for some reason. I do understand the problem and I'm not trying to trivialise it :) I _suggested_ that it's worth thinking about things rather than jumping in to say oh, we can code up a clever algorithm that keeps track of which stripes have valid parity and which don't, and we can optimise the read/copy/write for valid stripes and use the raid6-type read-all/write-all for invalid stripes, and then we can write a bit extra in the check code to set the bitmaps. Phew - and that lets us run the array at semi-degraded performance (raid6-like) for 3 days rather than either waiting before we put it into production or running it very slowly. Now we run this system for 3 years and we saved 3 days - hmmm IS IT WORTH IT? What happens in those 3 years when we have a disk fail? The solution doesn't apply then - it's 3 days to rebuild - like it or not. By delaying parity computation until the first write to a stripe, only the growth of a filesystem is slowed, and all data are protected without waiting for the lengthy check. 
The rebuild speed can be set very low, because on-demand rebuild will do most of the work. I am not saying you are wrong. I ask merely if the balance of benefit outweighs the balance of complexity. If the benefit were 24x7 then sure - eg using hardware assist in the raid calcs - very useful indeed. I'm very much for the fs layer reading the lower block structure so I don't have to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning! Keeping life as straightforward as possible low down makes the upwards interface more manageable and that goal more realistic... Those two paragraphs are mutually exclusive. The fs can be simple because it rests on a simple device, even if the simple device is provided by LVM or md. And LVM and md can stay simple because they rest on simple devices, even if they are provided by PATA, SATA, nbd, etc. Independent layers make each layer more robust. If you want to compromise the layer separation, some approach like ZFS with full integration would seem to be promising. Note that layers allow specialized features at each point, trading integration for flexibility. That's a simplistic summary. You *can* loosely couple the layers. But you can enrich the interface and tightly couple them too - XFS is capable (I guess) of understanding md more fully than say ext2. XFS would still work on a less 'talkative' block device where performance wasn't as important (USB flash maybe, dunno). My feeling is that full integration and independent layers each have benefits, as you connect the layers to expose operational details you need to handle changes in those details, which would seem to make layers more complex. Agreed. What I'm looking for here is better performance in one particular layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel that the current performance suggests room for improvement. I agree there is room for improvement. 
I suggest that it may be more fruitful to write a tool called raid5prepare that writes zeroes/ones as appropriate to all component devices and then you can use --assume-clean without concern. That could look to see if the devices are scsi or whatever and take advantage of the hyperfast block writes that can be done. David
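For reference, the rebuild-time figure quoted in this exchange works out as follows (a back-of-envelope check, assuming decimal drive-vendor units):

```python
# Sanity check of "5TB at a rebuild rate of 20MB/s... three days".
ARRAY_BYTES = 5 * 10**12     # 5 TB
REBUILD_RATE = 20 * 10**6    # 20 MB/s

seconds = ARRAY_BYTES / REBUILD_RATE
days = seconds / 86400
print(f"{days:.1f} days")    # ~2.9 days -- "three days", as stated
```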
Re: limits on raid
On Fri, 22 Jun 2007, Bill Davidsen wrote: By delaying parity computation until the first write to a stripe, only the growth of a filesystem is slowed, and all data are protected without waiting for the lengthy check. The rebuild speed can be set very low, because on-demand rebuild will do most of the work. I'm very much for the fs layer reading the lower block structure so I don't have to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning! Keeping life as straightforward as possible low down makes the upwards interface more manageable and that goal more realistic... Those two paragraphs are mutually exclusive. The fs can be simple because it rests on a simple device, even if the simple device is provided by LVM or md. And LVM and md can stay simple because they rest on simple devices, even if they are provided by PATA, SATA, nbd, etc. Independent layers make each layer more robust. If you want to compromise the layer separation, some approach like ZFS with full integration would seem to be promising. Note that layers allow specialized features at each point, trading integration for flexibility. My feeling is that full integration and independent layers each have benefits; as you connect the layers to expose operational details you need to handle changes in those details, which would seem to make layers more complex. What I'm looking for here is better performance in one particular layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel that the current performance suggests room for improvement. 
they both have benefits, but it shouldn't have to be either-or. if you build the separate layers and provide ways for the upper layers to query the lower layers to find out what's efficient, then you can have some upper layers that don't care about this and treat the lower layer as a simple block device, while other upper layers find out what sorts of things are more efficient to do and use the same lower layer in a more complex manner. the alternative is to duplicate effort (and code) and have two codebases that try to do the same thing, one stand-alone, and one as part of an integrated solution (and it gets even worse if there end up being multiple integrated solutions). David Lang
Re: limits on raid
Neil Brown wrote: This isn't quite right. Thanks :) Firstly, it is mdadm which decided to make one drive a 'spare' for raid5, not the kernel. Secondly, it only applies to raid5, not raid6 or raid1 or raid10. For raid6, the initial resync (just like the resync after an unclean shutdown) reads all the data blocks, and writes all the P and Q blocks. raid5 can do that, but it is faster to read all but one disk, and write to that one disk. How about this: Initial Creation When mdadm asks the kernel to create a raid array the most noticeable activity is what's called the initial resync. Raid level 0 doesn't have any redundancy so there is no initial resync. For raid levels 1, 4, 6 and 10, mdadm creates the array and starts a resync. The raid algorithm then reads the data blocks and writes the appropriate parity/mirror (P+Q) blocks across all the relevant disks. There is some sample output in a section below... For raid5 there is an optimisation: mdadm takes one of the disks and marks it as 'spare'; it then creates the array in degraded mode. The kernel marks the spare disk as 'rebuilding', starts to read from the 'good' disks, calculates the parity to determine what should be on the spare disk, and then just writes to it. Once all this is done the array is clean and all disks are active. This can take quite a time and the array is not fully resilient whilst this is happening (it is however fully usable). Also, is raid4 like raid5 or raid6 in this respect?
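As a toy illustration of the raid5 create optimisation described above (not md's actual implementation): since RAID5 parity is defined so that the XOR across a full stripe is zero, XOR-ing all the remaining members reconstructs whatever belongs on the 'rebuilding' spare, whether that chunk holds data or parity.

```python
# Three 'good' members of a degraded 4-disk raid5, one stripe each
# (tiny made-up byte strings, purely for illustration).
good_disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]

def xor_members(members):
    """XOR equal-length chunks together, byte by byte."""
    out = bytearray(len(members[0]))
    for m in members:
        for i, b in enumerate(m):
            out[i] ^= b
    return bytes(out)

# Read the good disks, XOR them, and just write the result to the spare.
spare = xor_members(good_disks)

# Invariant: XOR across *all* members (good disks + rebuilt spare) is zero.
assert xor_members(good_disks + [spare]) == bytes(len(spare))
```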
Re: limits on raid
[EMAIL PROTECTED] wrote: On Thu, 21 Jun 2007, David Chinner wrote: On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote: I have that - apparently naive - idea that drives use strong checksums, and will never return bad data, only good data or an error. If this isn't right, then it would really help to understand what the causes of other failures are before working out how to handle them. The drive is not the only source of errors, though. You could have a path problem that is corrupting random bits between the drive and the filesystem. So the data on the disk might be fine, and reading it via a redundant path might be all that is needed. one of the 'killer features' of zfs is that it does checksums of every file on disk. so many people don't consider the disk infallible. several other filesystems also do checksums; both bitkeeper and git do checksums of files to detect disk corruption. No, all of those checksums are to detect *filesystem* corruption, not device corruption (a mere side-effect). as david C points out, there are many points in the path where the data could get corrupted besides on the platter. Yup, that too. But drives either return good data, or an error. Cheers
Re: limits on raid
On 21 Jun 2007, Neil Brown stated: I have that - apparently naive - idea that drives use strong checksums, and will never return bad data, only good data or an error. If this isn't right, then it would really help to understand what the causes of other failures are before working out how to handle them. Look at the section `Disks and errors' in Val Henson's excellent report on last year's filesystems workshop: http://lwn.net/Articles/190223/. Most of the error modes given there lead to valid checksums and wrong data... (while you're there, read the first part too :) ) -- `... in the sense that dragons logically follow evolution so they would be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep furiously
Re: limits on raid
I didn't get a comment on my suggestion for a quick and dirty fix for --assume-clean issues... Bill Davidsen wrote: Neil Brown wrote: On Thursday June 14, [EMAIL PROTECTED] wrote: it's now churning away 'rebuilding' the brand new array. a few questions/thoughts. why does it need to do a rebuild when making a new array? couldn't it just zero all the drives instead? (or better still just record most of the space as 'unused' and initialize it as it starts using it?) Yes, it could zero all the drives first. But that would take the same length of time (unless p/q generation was very very slow), and you wouldn't be able to start writing data until it had finished. You can dd /dev/zero onto all drives and then create the array with --assume-clean if you want to. You could even write a shell script to do it for you. Yes, you could record which space is used vs unused, but I really don't think the complexity is worth it. How about a simple solution which would get an array on line and still be safe? All it would take is a flag which forced reconstruct writes for RAID-5. You could set it with an option, or automatically if someone puts --assume-clean with --create, and leave it in the superblock until the first repair runs to completion. And for repair you could make some assumptions about bad parity not being caused by error but just unwritten. Thought 2: I think the unwritten bit is easier than you think, you only need it on parity blocks for RAID5, not on data blocks. When a write is done, if the bit is set do a reconstruct, write the parity block, and clear the bit. Keeping a bit per data block is madness, and appears to be unnecessary as well. 
while I consider zfs to be ~80% hype, one advantage it could have (but I don't know if it has) is that since the filesystem and raid are integrated into one layer they can optimize the case where files are being written onto unallocated space: instead of reading blocks from disk to calculate the parity they could just put zeros in the unallocated space, potentially speeding up the system by reducing the amount of disk I/O. Certainly. But the raid doesn't need to be tightly integrated into the filesystem to achieve this. The filesystem need only know the geometry of the RAID and, when it comes to write, it tries to write full stripes at a time. If that means writing some extra blocks full of zeros, it can try to do that. This would require a little bit better communication between filesystem and raid, but not much. If anyone has a filesystem that they want to be able to talk to raid better, they need only ask... is there any way that linux would be able to do this sort of thing? or is it impossible due to the layering preventing the necessary knowledge from being in the right place? Linux can do anything we want it to. Interfaces can be changed. All it takes is a fairly well defined requirement, and the will to make it happen (and some technical expertise, and lots of time and coffee?). Well, I gave you two thoughts, one which would be slow until a repair but sounds easy to do, and one which is slightly harder but works better and minimizes performance impact. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979
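The I/O saving being argued about here can be sketched with simple counts (these numbers are illustrative, not from md or zfs): a small RAID5 write needs a read-modify-write cycle, while a zero-padded full-stripe write computes parity from data already in memory and reads nothing.

```python
# I/O counts for a RAID5 write of k data chunks in an n-disk array.

def rmw_ios(k):
    """read-modify-write: read k old data chunks + old parity,
    then write k new data chunks + new parity."""
    return {"reads": k + 1, "writes": k + 1}

def full_stripe_ios(n, k):
    """pad the remaining data chunks with zeros and write the whole
    stripe: no reads at all, n-1 data chunks + 1 parity written."""
    return {"reads": 0, "writes": n}

print(rmw_ios(2))              # {'reads': 3, 'writes': 3}
print(full_stripe_ios(6, 2))   # {'reads': 0, 'writes': 6}
```

The full-stripe path writes more blocks but eliminates the synchronous reads, which is where the latency win comes from.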
Re: limits on raid
[EMAIL PROTECTED] wrote: one channel, 2 OS drives plus the 45 drives in the array. Huh? You can only have 16 devices on a scsi bus, counting the host adapter. And I don't think you can even manage that much reliably with the newer higher speed versions, at least not without some very special cables. yes I realize that there will be bottlenecks with this, the large capacity is to handle longer history (it's going to be a 30TB circular buffer being fed by a pair of OC-12 links) Building one of those nice packet sniffers for the NSA to install on ATTs network eh? ;)
Re: limits on raid
On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote: yes, I'm using promise drive shelves, I have them configured to export the 15 drives as 15 LUNs on a single ID. I'm going to be using this as a huge circular buffer that will just be overwritten eventually 99% of the time, but once in a while I will need to go back into the buffer and extract and process the data. I would guess that if you ran 15 drives per channel on 3 different channels, you would resync in 1/3 the time. Well, unless you end up saturating the PCI bus instead. Hardware raid of course has an advantage there in that it doesn't have to go across the bus to do the work (although if you put 45 drives on one scsi channel on hardware raid, it will still be limited). -- Len Sorensen
Re: limits on raid
On Tue, 19 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote: yes, I'm using promise drive shelves, I have them configured to export the 15 drives as 15 LUNs on a single ID. I'm going to be using this as a huge circular buffer that will just be overwritten eventually 99% of the time, but once in a while I will need to go back into the buffer and extract and process the data. I would guess that if you ran 15 drives per channel on 3 different channels, you would resync in 1/3 the time. Well, unless you end up saturating the PCI bus instead. Hardware raid of course has an advantage there in that it doesn't have to go across the bus to do the work (although if you put 45 drives on one scsi channel on hardware raid, it will still be limited). I fully realize that the channel will be the bottleneck, I just didn't understand what /proc/mdstat was telling me. I thought that it was telling me that the resync was processing 5M/sec, not that it was writing 5M/sec on each of the two parity locations. David Lang
Re: limits on raid
[EMAIL PROTECTED] wrote: in my case it takes 2+ days to resync the array before I can do any performance testing with it. for some reason it's only doing the rebuild at ~5M/sec (even though I've increased the min and max rebuild speeds and a dd to the array seems to be ~44M/sec, even during the rebuild) With performance like that, it sounds like you're saturating a bus somewhere along the line. If you're using scsi, for instance, it's very easy for a long chain of drives to overwhelm a channel. You might also want to consider some other RAID layouts like 1+0 or 5+0 depending upon your space vs. reliability needs. -- Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: in my case it takes 2+ days to resync the array before I can do any performance testing with it. for some reason it's only doing the rebuild at ~5M/sec (even though I've increased the min and max rebuild speeds and a dd to the array seems to be ~44M/sec, even during the rebuild) With performance like that, it sounds like you're saturating a bus somewhere along the line. If you're using scsi, for instance, it's very easy for a long chain of drives to overwhelm a channel. You might also want to consider some other RAID layouts like 1+0 or 5+0 depending upon your space vs. reliability needs. I plan to test the different configurations. however, if I was saturating the bus with the reconstruct, how can I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point; it would seem to prove that it's not the bus that's saturated. David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote: I plan to test the different configurations. however, if I was saturating the bus with the reconstruct, how can I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point; it would seem to prove that it's not the bus that's saturated. dd 45MB/s from the raid sounds reasonable. If you have 45 drives, doing a resync of raid5 or raid6 should probably involve reading all the disks, and writing new parity data to one drive. So if you are writing 5MB/s, then you are reading 44*5MB/s from the other drives, which is 220MB/s. If your resync drops to 4MB/s when doing dd, then you have 44*4MB/s which is 176MB/s, or 44MB/s less read capacity, which surprisingly seems to match the dd speed you are getting. Seems like you are indeed very much saturating a bus somewhere. The numbers certainly agree with that theory. What kind of setup are the drives connected to? simple ultra-wide SCSI to a single controller. I didn't realize that the rate reported by /proc/mdstat was the write speed that was taking place, I thought it was the total data rate (reads + writes). the next time this message gets changed it would be a good thing to clarify this. David Lang
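The arithmetic in that reply, spelled out (a sketch of the reasoning, using the figures quoted in the thread):

```python
# With 45 drives, a resync reads the other 44 members for every stripe
# it writes, so the per-drive rate shown by /proc/mdstat implies a much
# larger total bus load.
DRIVES = 45

read_load = (DRIVES - 1) * 5   # 5 MB/s resync -> 220 MB/s of reads
slowed = (DRIVES - 1) * 4      # resync dropped to 4 MB/s during the dd

print(read_load)               # 220 MB/s behind a "5 MB/s" resync
print(read_load - slowed)      # 44 MB/s freed -- matching the ~45 MB/s dd
```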
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: I plan to test the different configurations. however, if I was saturating the bus with the reconstruct, how can I fire off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the reconstruct to ~4M/sec? I'm putting 10x as much data through the bus at that point; it would seem to prove that it's not the bus that's saturated. I am unconvinced. If you take ~1MB/s for each active drive, add in SCSI overhead, 45M/sec seems reasonable. Have you looked at a running iostat while all this is going on? Try it out - add up the kb/s from each drive and see how close you are to your maximum theoretical IO. I didn't try iostat, I did look at vmstat, and there the numbers look even worse: the bo column is ~500 for the resync by itself, but with the dd it's ~50,000. when I get access to the box again I'll try iostat to get more details. Also, how's your CPU utilization? ~30% of one cpu for the raid6 thread, ~5% of one cpu for the resync thread David Lang
Re: limits on raid
On Mon, 18 Jun 2007, Lennart Sorensen wrote: On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote: simple ultra-wide SCSI to a single controller. Hmm, isn't ultra-wide limited to 40MB/s? Is it Ultra320 wide? That could do a lot more, and 220MB/s sounds plausible for 320 scsi. yes, sorry, ultra 320 wide. I didn't realize that the rate reported by /proc/mdstat was the write speed that was taking place, I thought it was the total data rate (reads + writes). the next time this message gets changed it would be a good thing to clarify this. Well, I suppose it could make sense to show the rate of rebuild, which you can then compare against the total size of the raid, or you can have the rate of write, which you then compare against the size of the drive being synced. Certainly I would expect much higher speeds if it was the overall raid size, while the numbers seem pretty reasonable as a write speed. 4MB/s would take forever if it was the overall raid resync speed. I usually see SATA raid1 resync at 50 to 60MB/s or so, which matches the read and write speeds of the drives in the raid. as I read it right now what happens is the worst of the options: you show the total size of the array for the amount of work that needs to be done, but then show only the write speed for the rate of progress being made through the job. total rebuild time was estimated at ~3200 min David Lang
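A cross-check of that ~3200 minute estimate (back-of-envelope, assuming decimal drive units): since the /proc/mdstat rate is the write speed to the member being synced, the time is one member's capacity divided by that rate, not the whole array size divided by it.

```python
# One 750 GB member written at the ~4 MB/s observed during the dd.
MEMBER_BYTES = 750 * 10**9
WRITE_RATE = 4 * 10**6

minutes = MEMBER_BYTES / WRITE_RATE / 60
print(f"{minutes:.0f} min")   # ~3125 min, in line with the ~3200 quoted
```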
Re: limits on raid
[EMAIL PROTECTED] wrote: yes, sorry, ultra 320 wide. Exactly how many channels and drives? -- Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
Re: limits on raid
On Mon, 18 Jun 2007, Brendan Conoboy wrote: [EMAIL PROTECTED] wrote: yes, sorry, ultra 320 wide. Exactly how many channels and drives? one channel, 2 OS drives plus the 45 drives in the array. yes I realize that there will be bottlenecks with this, the large capacity is to handle longer history (it's going to be a 30TB circular buffer being fed by a pair of OC-12 links) it appears that my big mistake was not understanding what /proc/mdstat is telling me. David Lang
Re: limits on raid
Neil Brown [EMAIL PROTECTED] writes: Having the filesystem duplicate data, store checksums, and be able to find a different copy if the first one it chose was bad is very sensible and cannot be done by just putting the filesystem on RAID. Apropos checksums: since RAID5 copies/xors anyway, it would be nice to combine that with the file system. During the xor a simple checksum could be computed in parallel and stored in the file system. And the copy/checksum passes will hopefully at some point be combined. -Andi
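A minimal sketch of that suggestion (a toy illustration, not md code; crc32 stands in for whatever per-stripe checksum a filesystem might store): fold a checksum over the data in the same pass that computes the RAID5 parity xor, so both cost only one walk over memory.

```python
import zlib

def xor_and_checksum(chunks):
    """One pass over the stripe: accumulate the parity xor and a
    running crc32 over the same bytes."""
    parity = bytearray(len(chunks[0]))
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # checksum folded into the pass
        for i, b in enumerate(chunk):
            parity[i] ^= b             # the usual RAID5 parity xor
    return bytes(parity), crc

parity, crc = xor_and_checksum([b"\x01\x02", b"\x03\x04"])
print(parity.hex())   # '0206'
```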
Re: limits on raid
dean gaudet wrote: On Sat, 16 Jun 2007, Wakko Warner wrote: When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's always slowed the system down when booting up. Quite significantly I must say. I wait until I can login and change the rebuild max speed to slow it down while I'm using it. But that is another thing. i use an external write-intent bitmap on a raid1 to avoid this... you could use internal bitmap but that slows down i/o too much for my tastes. i also use an external xfs journal for the same reason. 2 disk raid1 for root/journal/bitmap, N disk raid5 for bulk storage. no spindles in common. I must remember this if I have to rebuild the array. Although I'm considering moving to a hardware raid solution when I upgrade my storage. -- Lab tests show that use of micro$oft causes cancer in lab animals Got Gas???
Re: limits on raid
Neil Brown wrote: On Thursday June 14, [EMAIL PROTECTED] wrote: On Fri, 15 Jun 2007, Neil Brown wrote: On Thursday June 14, [EMAIL PROTECTED] wrote: what is the limit for the number of devices that can be in a single array? I'm trying to build a 45x750G array and want to experiment with the different configurations. I'm trying to start with raid6, but mdadm is complaining about an invalid number of drives David Lang man mdadm search for limits. (forgive typos). thanks. why does it still default to the old format after so many new versions? (by the way, the documentation said 28 devices, but I couldn't get it to accept more than 27) Dunno - maybe I can't count... it's now churning away 'rebuilding' the brand new array. a few questions/thoughts. why does it need to do a rebuild when making a new array? couldn't it just zero all the drives instead? (or better still just record most of the space as 'unused' and initialize it as it starts using it?) Yes, it could zero all the drives first. But that would take the same length of time (unless p/q generation was very very slow), and you wouldn't be able to start writing data until it had finished. You can dd /dev/zero onto all drives and then create the array with --assume-clean if you want to. You could even write a shell script to do it for you. Yes, you could record which space is used vs unused, but I really don't think the complexity is worth it. How about a simple solution which would get an array on line and still be safe? All it would take is a flag which forced reconstruct writes for RAID-5. You could set it with an option, or automatically if someone puts --assume-clean with --create, and leave it in the superblock until the first repair runs to completion. And for repair you could make some assumptions about bad parity not being caused by error but just unwritten. Thought 2: I think the unwritten bit is easier than you think, you only need it on parity blocks for RAID5, not on data blocks. 
When a write is done, if the bit is set do a reconstruct, write the parity block, and clear the bit. Keeping a bit per data block is madness, and appears to be unnecessary as well. while I consider zfs to be ~80% hype, one advantage it could have (but I don't know if it has) is that since the filesystem and raid are integrated into one layer they can optimize the case where files are being written onto unallocated space: instead of reading blocks from disk to calculate the parity they could just put zeros in the unallocated space, potentially speeding up the system by reducing the amount of disk I/O. Certainly. But the raid doesn't need to be tightly integrated into the filesystem to achieve this. The filesystem need only know the geometry of the RAID and, when it comes to write, it tries to write full stripes at a time. If that means writing some extra blocks full of zeros, it can try to do that. This would require a little bit better communication between filesystem and raid, but not much. If anyone has a filesystem that they want to be able to talk to raid better, they need only ask... is there any way that linux would be able to do this sort of thing? or is it impossible due to the layering preventing the necessary knowledge from being in the right place? Linux can do anything we want it to. Interfaces can be changed. All it takes is a fairly well defined requirement, and the will to make it happen (and some technical expertise, and lots of time and coffee?). Well, I gave you two thoughts, one which would be slow until a repair but sounds easy to do, and one which is slightly harder but works better and minimizes performance impact. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: limits on raid
dean gaudet wrote:
> On Sun, 17 Jun 2007, Wakko Warner wrote:
>>> i use an external write-intent bitmap on a raid1 to avoid this...
>>> you could use internal bitmap but that slows down i/o too much for
>>> my tastes. i also use an external xfs journal for the same reason.
>>> 2 disk raid1 for root/journal/bitmap, N disk raid5 for bulk storage.
>>> no spindles in common.
>> I must remember this if I have to rebuild the array. Although I'm
>> considering moving to a hardware raid solution when I upgrade my
>> storage.
> you can do it without a rebuild -- that's in fact how i did it the
> first time. to add an external bitmap:
>
>     mdadm --grow --bitmap /bitmapfile /dev/mdX
>
> plus add bitmap=/bitmapfile to mdadm.conf... as in:
>
>     ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

I used evms to set up mine. I have used mdadm in the past. I use lvm on
top of it, which evms makes a little easier to maintain. I have 3 arrays
total (only the raid5 was configured by evms; the other 2 raid1s were
done by hand).

> you can also easily move an ext3 journal to an external journal with
> tune2fs (see man page).

I only have 2 ext3 file systems (one of which is mounted R/O since it's
full); all my others are reiserfs (v3). What benefit would I gain by
using an external journal, and how big would it need to be?

> if you use XFS it's a bit more of a challenge to convert from internal
> to external, but see this thread:

I specifically didn't use XFS (or JFS) since neither one at the time
could be shrunk.

--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???
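For the ext3 conversion dean mentions, the tune2fs route looks roughly
like the following. This is a hedged sketch: the device names are
hypothetical, the journal device must use the same block size as the
filesystem, and the commands are echoed as a dry run since they modify
filesystems.

```shell
# Sketch of moving an ext3 journal to an external device, per dean's
# suggestion. Device names are hypothetical; remove "echo" to run.
journal_dev=/dev/md1   # small raid1, no spindles shared with the raid5
fs_dev=/dev/md2

echo mke2fs -O journal_dev -b 4096 $journal_dev   # format as a journal
echo tune2fs -O ^has_journal $fs_dev              # drop internal journal
echo tune2fs -j -J device=$journal_dev $fs_dev    # attach external one

# and dean's bitmap command, for an existing array:
echo mdadm --grow --bitmap /bitmap.md4 /dev/md4
```

The filesystem should be unmounted (or mounted read-only) while the
journal is switched.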
Re: limits on raid
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> Combining these thoughts, it would make a lot of sense for the
> filesystem to be able to say to the block device "That block looks
> wrong - can you find me another copy to try?". That is an example of
> the sort of closer integration between filesystem and RAID that would
> make sense.

I think that this would only be useful on devices that store discrete
copies of the blocks on different devices - i.e. mirrors. If it's an
XOR-based RAID, you don't have another copy you can retrieve.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: limits on raid
On Sat, 16 Jun 2007, Neil Brown wrote:
> It would be possible to have a 'this is not initialised' flag on the
> array, and if that is not set, always do a reconstruct-write rather
> than a read-modify-write. But the first time you have an unclean
> shutdown you are going to resync all the parity anyway (unless you
> have a bitmap) so you may as well resync at the start.
>
> And why is it such a big deal anyway? The initial resync doesn't stop
> you from using the array. I guess if you wanted to put an array into
> production instantly and couldn't afford any slowdown due to resync,
> then you might want to skip the initial resync, but is that really
> likely?

in my case it takes 2+ days to resync the array before I can do any
performance testing with it. for some reason it's only doing the rebuild
at ~5M/sec (even though I've increased the min and max rebuild speeds,
and a dd to the array seems to be ~44M/sec, even during the rebuild)

I want to test several configurations, from a 45 disk raid6 to a 45 disk
raid0. at 2-3 days per test (or longer, depending on the tests) this
becomes a very slow process.

also, when a rebuild is slow enough (and has enough of a performance
impact) it's not uncommon to want to operate in degraded mode just long
enough to get to a maintenance window, and then recreate the array and
reload from backup.

David Lang
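For reference, the rebuild-speed knobs David mentions are the md sysctl
and sysfs limits, in KB/s. A sketch (the array name is hypothetical, and
the writes are echoed as a dry run since they need root):

```shell
# Raising the md resync floor and ceiling (values in KB/s).
# Echoed as a dry run; remove the outer "echo"/quotes to apply.
min_kbs=50000
max_kbs=200000

echo "echo $min_kbs > /proc/sys/dev/raid/speed_limit_min"
echo "echo $max_kbs > /proc/sys/dev/raid/speed_limit_max"

# per-array equivalents on kernels with md sysfs support:
echo "echo $min_kbs > /sys/block/md0/md/sync_speed_min"
```

If the resync still crawls at ~5M/sec after raising speed_limit_min,
the bottleneck is likely elsewhere (bus/controller contention across 45
drives), which raising the limits won't fix.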
Re: limits on raid
On Friday June 15, [EMAIL PROTECTED] wrote:
> As I understand the way raid works, when you write a block to the
> array, it will have to read all the other blocks in the stripe and
> recalculate the parity and write it out.

Your understanding is incomplete. For raid5 on an array with more than
3 drives, if you attempt to write a single block, it will:
 - read the current value of the block, and the parity block.
 - subtract the old value of the block from the parity, and add the
   new value.
 - write out the new data and the new parity.

If the parity was wrong before, it will still be wrong. If you then
lose a drive, you lose your data.

With the current implementation in md, this only affects RAID5. RAID6
will always behave as you describe. But I don't promise that won't
change with time.

It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write. But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you have
a bitmap) so you may as well resync at the start.

And why is it such a big deal anyway? The initial resync doesn't stop
you from using the array. I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync, but is that really
likely?

NeilBrown
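The "subtract old, add new" in Neil's steps is just XOR, since XOR is
its own inverse. A toy sketch with single bytes (not md's code, just the
algebra) shows both the read-modify-write update and why pre-existing
bad parity survives it:

```shell
# Toy model of the three steps above: 3 data "disks" plus one parity
# "disk", one byte each.
d1=$(( 0x0f )); d2=$(( 0x33 )); d3=$(( 0x55 ))
p=$(( d1 ^ d2 ^ d3 ))                 # parity = XOR of the data

# read-modify-write of d2: "subtract" old data, "add" new -- both XOR
new=$(( 0x3c ))
p=$(( p ^ d2 ^ new ))
d2=$new
echo $(( p == (d1 ^ d2 ^ d3) ))       # prints 1: parity still consistent

# but if the parity started out wrong, the same update keeps it wrong:
bad_p=$(( (d1 ^ d2 ^ d3) ^ 0x01 ))    # off by one bit
old=$d2; d2=$(( 0x77 ))               # another single-block write
bad_p=$(( bad_p ^ old ^ d2 ))
echo $(( bad_p == (d1 ^ d2 ^ d3) ))   # prints 0: still off by that bit
```

So the read-modify-write touches only two disks instead of the whole
stripe, but it can never repair parity it never reads in full.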