Hi Simon,

On Sat, Nov 07, 2009 at 07:38:03AM +1300, Simon Baxter wrote:
> Hi
>
> I've been running logical volume management (LVMs) on my production VDR
> box for years, but recently had a drive failure. To be honest, in the
> ~20 years I've had PCs in the house, this is the first time a drive
> failed!
>
> Anyway, I've bought 3x 1.5 TB SATA disks which I'd like to put into a
> software (mdadm) raid 5 array.
> ...
>
> I regularly record 3 and sometimes 4 channels simultaneously, while
> watching a recording. Under regular LVM, this sometimes seemed to cause
> some "slow downs".

I know I risk a flame war here but I feel obliged to say it: Avoid
raid5 if you can! It is fun to play with, but if you care for your
data, buy a fourth drive and do raid1+0 (mirroring and striping)
instead.

Raid 5 is very fast on linear read operations because the load is
spread across all the available drives. But if you are going to run
vdr on that drive array, you are going to do a lot of write operations,
and raid5 is bad if you do a lot of writes, for a very simple reason.

Take a raid5 array with X devices. If you want to write just one block,
you need to read 2 blocks (the old data that you are going to overwrite
and the old parity) and you need to write 2 blocks (one with the actual
data and one with the new parity). In the best case, the data block
that you are going to overwrite is already in RAM, but the parity block
almost never will be. Only if you keep writing the same block over and
over will you have both the data and the parity block cached. In most
cases (and certainly when writing data streams to disk) you'll need to
read two blocks before you can calculate the new parity and write it
back to the disks along with your data.

So in short you do two reads and two writes for every write operation.
There goes your performance...

Now about drive failures... If one of X disks fails, you can still read
blocks on the OK drives with just one read operation, but you need X-1
read operations for every read operation on the failed drive. Writes on
OK drives need the same two reads/two writes as before (only if the
failed drive contained the parity for this block can you skip the
additional two reads and one write). If however you need to write to
the failed drive, then you need to read the other X-1 drives in the
array to first reconstruct the missing data, and then you can calculate
and write the new parity. (And then you throw away the actual data that
you were going to write, because the drive that you could write it to
is gone...)
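
To make the read/write counts above concrete, here is a minimal Python
sketch of the parity arithmetic. It assumes plain byte-wise XOR parity
and a made-up stripe of three data blocks plus one parity block; the
block contents and the helper name xor_blocks are only for
illustration and say nothing about how mdadm lays out data on disk.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*data)

# Small write to data[1]: first two reads (old data, old parity) ...
old_data, old_parity = data[1], parity
new_data = b"bbbb"
# ... then new parity = old parity XOR old data XOR new data ...
new_parity = xor_blocks(old_parity, old_data, new_data)
# ... and finally two writes (new data, new parity).
data[1], parity = new_data, new_parity

# The shortcut matches a full recomputation of the parity over the stripe.
assert parity == xor_blocks(*data)

# Degraded read: the drive holding data[2] is gone, so its block has to be
# rebuilt from every surviving block of the stripe (X-1 reads).
rebuilt = xor_blocks(data[0], data[1], parity)
assert rebuilt == b"CCCC"
```

The two reads in the small-write case are exactly the inputs to the
new_parity line, and the degraded read touches every surviving drive,
which is where the X-1 reads come from.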

Example: You have your three 1.5TB drives A, B and C in an array, and C
fails. In this situation you'd want to treat your drives as carefully
as possible, because one more failure and all your data is gone.
Unfortunately, continuing to operate in the failed state puts your
remaining drives under much more stress than usual.

Reading will cause twice the read operations on your remaining drives
(each letter is one read from that drive):

block    : n     n+1   n+2
OK state : a     b     c
Failstate: a     b     ab

Writing (on a small array) will produce the same load as in the OK
state: over the three blocks shown, each remaining drive still does two
reads and two writes (lower case = read, upper case = write):

block: n     n+1   n+2
OK   : acAC  baBA  cbCB
FAIL : A     baBA  baB

Confusingly enough, the read load per drive doesn't change if you have
more than three drives in your array. Reads will still produce on
average double the load in the failed state. Writes on a failed array
seem to produce the same load as on an OK array, but this is only true
for very small arrays. If you add more disks you'll see that the "read
penalty" grows for writing blocks where the data disk is missing and
you need to read all other drives in order to update the parity.

Reconstruction of your array after adding a new drive will take a long
time, and most complete array failures (i.e. data lost forever) occur
during the rebuilding phase, not during the failed state. That's simply
because you put a lot of stress on your drives (which probably come
from the same batch as the one that already failed).

Depending on the number and nature of your drives and the host
connection they have, the limiting factor can be read performance (you
need to read X-1 drives completely) or it can be write performance, if
your disk is slower on sustained writing than on reading. Remember that
you need to read and write a whole disk's worth of data, not just the
used parts.

Example: Your drives have 1.5TB each and we assume that you get a
whopping 100MB/s on read as well as on write (pretty much the fastest
there currently is). You need to read 3TB as well as write 1.5TB.
If your system can handle the load in parallel, you can treat it as
just writing one 1.5TB drive: 1,500,000MB / 100MB/s / 60s/min makes
250 minutes, or 4 hours and 10 minutes. I am curious whether you can
still use the system under such an I/O load. Anybody with experience on
this? Anyway, the reconstruction rate can be tuned via the proc fs
(/proc/sys/dev/raid/speed_limit_min and speed_limit_max).

Now for the raid 1+0 alternative. For the same resulting storage
capacity you'll need 4 instead of 3 drives.

In OK state one read command will result in one read operation, but the
operation can be completed on any drive that is part of the mirror set.
So seek performance will be much better, as the I/O scheduler will
select the drive that is currently not busy and/or whose head is closer
to the requested block. As you do mirroring and striping, you can use
all four drives' performance for linear reading. You end up with 33%
more read performance than with the raid5 setup (but hey, you paid 33%
more as well :-) )

Writing one block requires two write operations instead of two reads
and two writes. And since you don't need to read the old data before
writing the new stuff, you don't need to wait for the heads to move
around, for the disk to rotate to the right place and for the read
operation to get the data from the disk to RAM first. You can simply
write to the disk and let the disk's controller handle the rest. In
other words: your write performance will be much better than with
raid5.

In failed state (let's assume drive C of A=B+C=D fails), read
performance will drop by 33% as one drive is missing. The mirror drive
of C will have to handle the load by itself:

block: n     n+1   n+2   n+3
ok   : a     c     b     d
fail : a     d     b     d

This again assumes that the load is shared equally between the drives
of a mirror set, which is probably true for long sustained reads. In
reality the scheduler would select the drive that is currently not busy
and/or whose head is closer to the region you want to read.
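
As a rough check of the numbers so far, here is a small Python sketch
of the raid5 rebuild estimate and of the per-block write cost on a
healthy array. The 100MB/s rate is the assumption from the example
above, and the function names are made up for this illustration; real
rebuild times depend on what else the array is doing.

```python
DRIVE_MB = 1_500_000   # one 1.5 TB drive in decimal megabytes, as in the text
RATE_MB_S = 100        # assumed sustained rate for both reading and writing

def rebuild_minutes(drive_mb, rate_mb_s):
    """Minutes to stream one whole drive, assuming the reads from the
    surviving drives run in parallel with the write to the new drive."""
    return drive_mb / rate_mb_s / 60

def small_write_ops(level):
    """(reads, writes) for a single-block write on a healthy array."""
    if level == "raid5":
        return (2, 2)   # read old data + old parity, write new data + new parity
    if level == "raid10":
        return (0, 2)   # just write both copies of the mirror
    raise ValueError(f"unknown level: {level}")

minutes = rebuild_minutes(DRIVE_MB, RATE_MB_S)
hours, rest = divmod(int(minutes), 60)
print(f"raid5 rebuild: {minutes:.0f} min ({hours} h {rest} min)")

for level in ("raid5", "raid10"):
    reads, writes = small_write_ops(level)
    print(f"{level}: {reads} reads + {writes} writes per block written")
```

The write counts are simply the ones stated in the text, put side by
side; the sketch does not model seek times or caching.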

So if you are reading two streams of data that are stored in different
regions of the disk, the disks in a raid5 array would have to do a lot
of seek operations, while the raid1+0 would keep one head on each
stream's location and only quietly jump from one track to the next
(assuming your disk is not heavily fragmented). If one of the two disks
in a mirror set fails, you'll have the heads jumping again.

Writing on an array with a failed drive keeps the same load on each
individual drive, and the performance will also stay the same.

block: n     n+1   n+2   n+3
ok   : AB    CD    AB    CD
fail : AB    D     AB    D

Rebuilding requires reading the mirror drive and writing the new one.
So you'll need to read 1.5TB and write 1.5TB. It will take the same
time but produce less system load than in the raid5 example, and only
one old disk will be put under a lot of stress instead of all remaining
drives.

Btw: Your raid 1+0 array can handle two drive failures as long as they
don't occur in the same mirror set. So A and C or B and D could fail
and you'd still have all your data. Naturally Murphy's law applies, and
if you continue reading from that array you will stress that single
remaining drive more than the others and its chances of failing will
increase. But if you are worried about double faults you might as well
run raid6 on those 4 drives ... but don't ask for performance there.

In all this I assume that you have a backup on another drive of all
data that you care about. If you don't, WHAT THE F*** ARE YOU DOING?
You are trusting your data to microscopic particles of rotating rust...

Use two of the three drives as a raid1 device that will quickly get
your data in and out, and use the third as a backup device that will
hold copies of the data that you care about. That way you are safe
against single drive failure and against stupid users/software,
assuming that your backup drive is not mounted/accessible all the time.

If you have a lot of data that you don't really care about, you can use
two of the three drives as a raid0 device and use the third to back up
only the data that is important to you.

I know you could use LVM to create one big volume group to manage all
three disks, and create the logical volumes that you store important
data on with a "--mirrors" argument proportional to your paranoia, but
this would still only protect you from hardware failures. To have
protection against software/user failures you'd need to do snapshots as
well, and I don't like the way you have to do their growing and
shrinking manually. Plus it would still all be "online" and vulnerable
to typos in "dd" commands.

Enough time wasted... just one more thing: all those RAID thingies
assume that you trust your disks to fail cleanly, i.e. to return
nothing instead of returning wrong data. If you wanted to protect
against this you'd have to forget about improved performance and
instead be content with the performance of your slowest drive. For each
read you'd have to read a block from each of your X drives in a raid
array and compare the computed parity with the one read from disk, or
in the simple raid1+0 case you'd have to read both copies and compare
them.

cheers
-henrik

_______________________________________________
vdr mailing list
email@example.com
http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr