Re: raid5: two writing algorithms
On Fri, Feb 08, 2008 at 12:51:39PM +1100, Neil Brown wrote:
> On Friday February 8, [EMAIL PROTECTED] wrote:
> > On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> > > On Thursday February 7, [EMAIL PROTECTED] wrote:
> > > > So I hereby give the idea for inspiration to kernel hackers.
> > >
> > > and I hereby invite you to read the code ;-)
> >
> > I did some reading. Is there a description of it somewhere, especially of the raid code, or are the comments and the code the best documentation?
>
> No. If a description was written (and various people have tried to describe various parts) it would be out of date within a few months :-(

OK, I was under the impression that some of the code did not change much. E.g. you said that there had not been any work on optimizing raid10 for performance since the 2.6.12 kernel I was using. And for the raid5 code at least, the last copyright notice right at the top is Copyright (C) 2002, 2003 H. Peter Anvin. That is 5 years ago. And your name is not on it. So I did not look much into that code, thinking nothing had been done there for ages. Maybe you could add your name to it, that would only be fair. The same comment goes for other modules (where it is relevant).

> Look for READ_MODIFY_WRITE and RECONSTRUCT_WRITE... no, that only applies to the raid6 code now. Look instead for the 'rcw' and 'rmw' counters, and then at 'handle_write_operations5', which does different things based on the 'rcw' variable.
>
> It used to be a lot clearer before we implemented xor-offload. The xor-offload stuff is good, but it does make the code more complex.

OK, I think it is fairly well documented here, I can at least follow the logic, and I think it is a good approach to have the flow description/strategy included directly in the code. Given that there are many changes to the code, keeping code and description in different files could easily let the code and the documentation drift badly out of alignment.

> > Do you say that this is already implemented?
>
> Yes.

That is very good!

Do you know if other implementations of this, e.g. commercial controller code, have this facility? If not, we could list this as an advantage of linux raid. Anyway, it would be implicit in performance documentation. I do plan to write up something on performance, soonish. The howto is hopelessly outdated.

IMHO such code should make the performance of raid5 random writes not that bad. Better than the reputation that raid5 is hopelessly slow for database writing. I think raid5 would be less than twice as slow as raid1 for random writing.

> > Well, I do have a hack in mind, on the raid10,f2. I need to investigate some more, and possibly test out what really happens. But maybe the code already does what I want it to. You are possibly the one that knows the code best, so maybe you can tell me if raid10,f2 always does its reading in the first part of the disks?
>
> Yes, I know the code best.
>
> No, raid10,f2 doesn't always use the first part of the disk. Getting it to do that would be a fairly small change in 'read_balance' in md/raid10.c.
>
> I'm not at all convinced that the read balancing code in raid10 (or raid1) really does the best thing. So any improvements - backed up with broad testing - would be most welcome.

I think I know where to make my proposed changes, and how it could be done. So maybe in the not too distant future I will have done my first kernel hack!

Best regards
keld

- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5: two writing algorithms
On Thursday February 7, [EMAIL PROTECTED] wrote:
> As I understand it, there are 2 valid algorithms for writing in raid5.
>
> 1. Calculate the parity data by XOR'ing all data of the relevant data chunks.
>
> 2. Calculate the parity data by, in effect, XOR-subtracting the old data to be changed and then XOR-adding the new data. (XOR-subtract and XOR-add are actually the same operation.)
>
> There are situations where method 1 is the fastest, and situations where method 2 is the fastest. My idea is then that the raid5 code in the kernel can calculate which method is the faster.
>
> Method 1 is faster if all data is already available. I understand that this method is employed in the current kernel. This would e.g. be the case with sequential writes.
>
> Method 2 is faster if no data is available in core. It would require 2 reads and 2 writes, which will always be faster than n reads and 1 write, possibly except for n=2. Method 2 is thus normally faster for random writes.
>
> I think that method 2 is not used in the kernel today. Maybe I am wrong, but I did have a look in the kernel code.

It is very odd that you would think something about the behaviour of the kernel with actually having looked.

It also seems a little arrogant to have a clever idea and assume that no one else has thought of it before.

> So I hereby give the idea for inspiration to kernel hackers.

and I hereby invite you to read the code ;-)

Code reading is a good first step to being a

> Your kernel hacker wannabe
       ^
NeilBrown

> keld
raid5: two writing algorithms
As I understand it, there are 2 valid algorithms for writing in raid5.

1. Calculate the parity data by XOR'ing all data of the relevant data chunks.

2. Calculate the parity data by, in effect, XOR-subtracting the old data to be changed and then XOR-adding the new data. (XOR-subtract and XOR-add are actually the same operation.)

There are situations where method 1 is the fastest, and situations where method 2 is the fastest. My idea is then that the raid5 code in the kernel can calculate which method is the faster.

Method 1 is faster if all data is already available. I understand that this method is employed in the current kernel. This would e.g. be the case with sequential writes.

Method 2 is faster if no data is available in core. It would require 2 reads and 2 writes, which will always be faster than n reads and 1 write, possibly except for n=2. Method 2 is thus normally faster for random writes.

I think that method 2 is not used in the kernel today. Maybe I am wrong, but I did have a look in the kernel code.

So I hereby give the idea for inspiration to kernel hackers.

Your kernel hacker wannabe
keld
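The equivalence of the two methods described in the post can be checked with a small sketch. This is only an illustration in Python, not kernel code: the function names (`xor_blocks`, `parity_reconstruct`, `parity_read_modify`) and the toy 4-byte chunks are made up for the example, and all chunks are assumed to have equal length.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR byte blocks together, position by position (equal lengths assumed)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_reconstruct(data_chunks):
    """Method 1: parity is the XOR of all data chunks in the stripe."""
    return xor_blocks(*data_chunks)


def parity_read_modify(old_parity, old_data, new_data):
    """Method 2: XOR the old data out of the old parity and the new data in."""
    return xor_blocks(old_parity, old_data, new_data)


# A toy stripe with three data chunks (as on a 4-disk raid5).
stripe = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
old_parity = parity_reconstruct(stripe)

# Overwrite the middle chunk and compute the new parity both ways.
new_chunk = bytes([0xAA, 0xBB, 0xCC, 0xDD])
updated = [stripe[0], new_chunk, stripe[2]]

p1 = parity_reconstruct(updated)                            # needs every data chunk
p2 = parity_read_modify(old_parity, stripe[1], new_chunk)   # needs old data + old parity
print(p1 == p2)
```

Method 1 needs every data chunk of the stripe in memory, while method 2 needs only the old contents of the chunk being changed plus the old parity. That difference in what must be read is what makes one or the other cheaper in a given situation.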
Re: raid5: two writing algorithms
On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> On Thursday February 7, [EMAIL PROTECTED] wrote:
> > As I understand it, there are 2 valid algorithms for writing in raid5.
> >
> > 1. Calculate the parity data by XOR'ing all data of the relevant data chunks.
> >
> > 2. Calculate the parity data by, in effect, XOR-subtracting the old data to be changed and then XOR-adding the new data. (XOR-subtract and XOR-add are actually the same operation.)
> >
> > There are situations where method 1 is the fastest, and situations where method 2 is the fastest. My idea is then that the raid5 code in the kernel can calculate which method is the faster.
> >
> > Method 1 is faster if all data is already available. I understand that this method is employed in the current kernel. This would e.g. be the case with sequential writes.
> >
> > Method 2 is faster if no data is available in core. It would require 2 reads and 2 writes, which will always be faster than n reads and 1 write, possibly except for n=2. Method 2 is thus normally faster for random writes.
> >
> > I think that method 2 is not used in the kernel today. Maybe I am wrong, but I did have a look in the kernel code.
>
> It is very odd that you would think something about the behaviour of the kernel with actually having looked.
>
> It also seems a little arrogant to have a clever idea and assume that no one else has thought of it before.

Oh well, I have to admit that I do not understand the code fully. I am not a seasoned kernel hacker, as I also indicated in my ad hoc signature.

> > So I hereby give the idea for inspiration to kernel hackers.
>
> and I hereby invite you to read the code ;-)

I did some reading. Is there a description of it somewhere, especially of the raid code, or are the comments and the code the best documentation?

Do you say that this is already implemented?

I am sorry if you think I am mailing too much on the list. But I happen to think it is fun. And I do try to give something back.

> Code reading is a good first step to being a
>
> > Your kernel hacker wannabe
>        ^
> NeilBrown

Well, I do have a hack in mind, on the raid10,f2. I need to investigate some more, and possibly test out what really happens. But maybe the code already does what I want it to. You are possibly the one that knows the code best, so maybe you can tell me if raid10,f2 always does its reading in the first part of the disks?

Best regards
keld
Re: raid5: two writing algorithms
On Friday February 8, [EMAIL PROTECTED] wrote:
> On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> > On Thursday February 7, [EMAIL PROTECTED] wrote:
> > > So I hereby give the idea for inspiration to kernel hackers.
> >
> > and I hereby invite you to read the code ;-)
>
> I did some reading. Is there a description of it somewhere, especially of the raid code, or are the comments and the code the best documentation?

No. If a description was written (and various people have tried to describe various parts) it would be out of date within a few months :-(

Look for READ_MODIFY_WRITE and RECONSTRUCT_WRITE... no, that only applies to the raid6 code now. Look instead for the 'rcw' and 'rmw' counters, and then at 'handle_write_operations5', which does different things based on the 'rcw' variable.

It used to be a lot clearer before we implemented xor-offload. The xor-offload stuff is good, but it does make the code more complex.

> Do you say that this is already implemented?

Yes.

> I am sorry if you think I am mailing too much on the list.

You aren't.

> But I happen to think it is fun.

Good.

> And I do try to give something back.

We'll look forward to that.

> > Code reading is a good first step to being a
> >
> > > Your kernel hacker wannabe
> >        ^
> > NeilBrown
>
> Well, I do have a hack in mind, on the raid10,f2. I need to investigate some more, and possibly test out what really happens. But maybe the code already does what I want it to. You are possibly the one that knows the code best, so maybe you can tell me if raid10,f2 always does its reading in the first part of the disks?

Yes, I know the code best.

No, raid10,f2 doesn't always use the first part of the disk. Getting it to do that would be a fairly small change in 'read_balance' in md/raid10.c.

I'm not at all convinced that the read balancing code in raid10 (or raid1) really does the best thing. So any improvements - backed up with broad testing - would be most welcome.

NeilBrown
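The idea behind the 'rcw' and 'rmw' counters mentioned in this message can be sketched roughly as follows: count how many chunks each method would have to read first, and pick the cheaper one. This is a simplified model for illustration only, not the kernel's code. `Chunk`, `make_stripe` and `choose_write_method` are hypothetical names, the sketch assumes whole-chunk writes and an uncached parity chunk, and the tie-breaking rule is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    cached: bool             # contents already in memory (hypothetical flag)
    to_write: bool           # chunk is completely overwritten by this request
    is_parity: bool = False


def make_stripe(data_disks, writes, cached=()):
    """Build one stripe: data_disks data chunks plus one uncached parity chunk."""
    chunks = [Chunk(cached=i in cached, to_write=i in writes)
              for i in range(data_disks)]
    chunks.append(Chunk(cached=False, to_write=False, is_parity=True))
    return chunks


def choose_write_method(chunks):
    """Return (method, rmw, rcw), where rmw/rcw are the reads each method needs."""
    rmw = 0  # reads needed for read-modify-write
    rcw = 0  # reads needed for reconstruct-write
    for c in chunks:
        if (c.to_write or c.is_parity) and not c.cached:
            rmw += 1  # old data or old parity must be read
        if not c.to_write and not c.is_parity and not c.cached:
            rcw += 1  # every data chunk that is not being overwritten must be read
    # Assumption of this sketch: fewer reads wins, ties go to reconstruct-write.
    method = "rmw" if rmw < rcw else "rcw"
    return method, rmw, rcw


# Random write of one chunk on a 6-disk array (5 data + parity), nothing cached.
print(choose_write_method(make_stripe(5, writes={0})))
# Full-stripe (sequential) write: no reads are needed to reconstruct parity.
print(choose_write_method(make_stripe(5, writes={0, 1, 2, 3, 4})))
# Small 3-disk array (2 data + parity): reconstruct-write needs only one read.
print(choose_write_method(make_stripe(2, writes={0})))
```

In this model a single-chunk random write on a wide array favours read-modify-write (2 reads instead of 4), while a full-stripe write or a very narrow array favours reconstruct-write, which matches the intuition in the original post.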