Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
> http://www.cs.utk.edu/~plank/plank/papers/CS-96-332.html even describes
> an implementation _very_ similar to the current code, right down to
> using a table for the logarithm and inverse logarithm calculations.

We don't use a table for logarithm and inverse logarithm calculations. Any time you do a table lookup, you commit suicide from a performance standpoint.

	-hpa
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
On Sat, 2009-07-18 at 13:04 -0700, Dan Williams wrote:
> On Sat, Jul 18, 2009 at 11:42 AM, David Woodhouse <dw...@infradead.org> wrote:
>> On Sat, 18 Jul 2009, Dan Williams wrote:
>>> I was under the impression that btrfs wanted to leverage md's stripe
>>> handling logic as well, seems that is not the case?
>>
>> No. We do a bunch of the stuff you mention above, but entirely within
>> the file system so we don't have to invent a bunch of layering
>> violations just to work around a broken design.
>
> Sure, a layering violation for an existing filesystem. For btrfs, at
> LSF'09, we briefly talked about breaking out more than just the
> erasure codes from software-raid into a libraid. At some point in the
> i/o path a btrfs stripe operation becomes indistinguishable from a
> raid5,6 operation, so at first glance there appears to be room to
> share common infrastructure like portions of handle_stripe, for
> example.

At this point we've actually implemented the fundamental parts of RAID[56] support in btrfs, and it's looking like all we really want is the arithmetic routines. Perhaps that's because I'm insufficiently familiar with the handle_stripe() function to which you refer.

Would you like to take a look at http://git.infradead.org/users/dwmw2/btrfs-raid56.git and attempt to convince me that I should be reusing more? Criticism in 'diff -up' form is always welcome... :)

--
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
David Woodhouse wrote:
> At this point we've actually implemented the fundamental parts of
> RAID[56] support in btrfs, and it's looking like all we really want is
> the arithmetic routines.

Given that you have no legacy requirements, and that supporting more than two redundant disks may be interesting, it may very well be worth spending some time on new codes now rather than later. Part of that investigation, though, is going to have to be whether and how they can be accelerated.

	-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
David Woodhouse wrote:
> I'm only interested in what we can use directly within btrfs -- and
> ideally I do want something which gives me an _arbitrary_ number of
> redundant blocks, rather than limiting me to 2. But the legacy code is
> good enough for now¹.
>
> When I get round to wanting more, I was thinking of lifting something
> like http://git.infradead.org/mtd-utils.git?a=blob;f=fec.c to start
> with, and maybe hoping that someone cleverer will come up with
> something better. The less I have to deal with Galois Fields, the
> happier I'll be.

Well, if you want something with more than 2-block redundancy you need something other than the existing RAID-6 code, which, as you know, is a special case of general Reed-Solomon coding that I happen to have spent a lot of time optimizing. The FEC code is not optimized at all as far as I can tell, and certainly doesn't use SSE in any way -- never mind the GF accelerators that are starting to appear. That doesn't mean it *couldn't* be, just that no one has done the work to either implement it or prove it can't be done.

Either way, perhaps the Plank paper that Ric pointed to could be useful as a starting point; it's probably worth taking their performance numbers with a *major* grain of salt: their implementation of "RAID-6 RS-Opt", which is supposed to be equivalent to my code, performs at 400 MB/s, which is less than Pentium III-era performance of the real-world code (they compare not to real code but to their own implementation in Java, called Jerasure.) Implementability using real array instruction sets is key to decent performance.

	-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
On Sat, Jul 18, 2009 at 4:53 AM, David Woodhouse <dw...@infradead.org> wrote:
> On Fri, 2009-07-17 at 11:49 -0400, H. Peter Anvin wrote:
>> Cost, yes, of changing an on-disk format.
>
> Personally, I don't care about that -- I'm utterly uninterested in the
> legacy RAID-6 setup where it pretends to be a normal disk. I think
> that model is as fundamentally wrong as flash devices making the
> similar pretence.

I can understand the frustration of these details being irretrievably hidden behind a proprietary interface out of the filesystem's control. However, this is not the case with Linux software RAID. I suspect that there is room for more interaction with even legacy filesystems to communicate things like: "don't worry about initializing that region of the disk, it's all free space", "don't bother resyncing on dirty shutdown, if power-loss interrupts a write I guarantee I will replay the entire stripe to you at a later date", or "hey, that last block I read doesn't checksum, can you come up with a different version?"

I was under the impression that btrfs wanted to leverage md's stripe handling logic as well, seems that is not the case?

--
Dan

¹ Well, kind of. The xor_blocks() function will silently screw you over if you ask it to handle more than 5 blocks at a time. async_xor() handles arbitrary block counts.
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Alex Elsayed wrote:
> H. Peter Anvin wrote:
>> implementation in Java, called Jerasure.) Implementability using real
>> array instruction sets is key to decent performance.
>
> Actually, it is made clear in the paper that Jerasure is written as a
> C library, and Clearsafe is the only Java implementation. Don't let
> the name fool you. ;D

And again, I make a typo. s/Clearsafe/Cleversafe/.
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
On Sat, 18 Jul 2009, Dan Williams wrote:
> On Sat, Jul 18, 2009 at 4:53 AM, David Woodhouse <dw...@infradead.org> wrote:
>> On Fri, 2009-07-17 at 11:49 -0400, H. Peter Anvin wrote:
>>> Cost, yes, of changing an on-disk format.
>>
>> Personally, I don't care about that -- I'm utterly uninterested in
>> the legacy RAID-6 setup where it pretends to be a normal disk. I
>> think that model is as fundamentally wrong as flash devices making
>> the similar pretence.
>
> I can understand the frustration of these details being irretrievably
> hidden behind a proprietary interface out of the filesystem's control.
> However, this is not the case with Linux software RAID. I suspect that
> there is room for more interaction with even legacy filesystems to
> communicate things like: "don't worry about initializing that region
> of the disk, it's all free space", "don't bother resyncing on dirty
> shutdown, if power-loss interrupts a write I guarantee I will replay
> the entire stripe to you at a later date", or "hey, that last block I
> read doesn't checksum, can you come up with a different version?"
>
> I was under the impression that btrfs wanted to leverage md's stripe
> handling logic as well, seems that is not the case?

No. We do a bunch of the stuff you mention above, but entirely within the file system so we don't have to invent a bunch of layering violations just to work around a broken design.

> ¹ Well, kind of. The xor_blocks() function will silently screw you
> over if you ask it to handle more than 5 blocks at a time. async_xor()
> handles arbitrary block counts.

That's useful to know; thanks.

--
dwmw2
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
On Sat, Jul 18, 2009 at 11:42 AM, David Woodhouse <dw...@infradead.org> wrote:
> On Sat, 18 Jul 2009, Dan Williams wrote:
>> I was under the impression that btrfs wanted to leverage md's stripe
>> handling logic as well, seems that is not the case?
>
> No. We do a bunch of the stuff you mention above, but entirely within
> the file system so we don't have to invent a bunch of layering
> violations just to work around a broken design.

Sure, a layering violation for an existing filesystem. For btrfs, at LSF'09, we briefly talked about breaking out more than just the erasure codes from software-raid into a libraid. At some point in the i/o path a btrfs stripe operation becomes indistinguishable from a raid5,6 operation, so at first glance there appears to be room to share common infrastructure like portions of handle_stripe, for example.

--
Dan
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
On 07/16/2009 01:38 PM, H. Peter Anvin wrote:
> Dan Williams wrote:
>> On Mon, Jul 13, 2009 at 7:11 AM, David Woodhouse <dw...@infradead.org> wrote:
>>> We'll want to use these in btrfs too.
>>>
>>> Signed-off-by: David Woodhouse <david.woodho...@intel.com>
>>
>> Do you suspect that btrfs will also want to perform these operations
>> asynchronously? I am preparing an updated release of the raid6
>> offload patch kit, but the previous WIP release can be browsed at:
>>
>> http://git.kernel.org/?p=linux/kernel/git/djbw/async_tx.git;a=shortlog;h=raid6
>>
>> The routines are housed in crypto/async_tx/async_pq.c and
>> crypto/async_tx/async_raid6_recov.c. I also wonder if the raid6 algos
>> are a better fit under crypto/ alongside xor?
>
> I am also sitting on a set of synchronous (CPU) acceleration patches
> for RAID-6 recovery, just waiting for the APIs to stabilize.
>
> 	-hpa

Worth sharing a pointer to a really neat set of papers that describe open-source-friendly RAID6 and erasure encoding algorithms that were presented last year and this year at FAST:

http://www.cs.utk.edu/~plank/plank/papers/papers.html

If I remember correctly, James Plank's papers also implement and benchmark the various encodings.

Ric
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Ric Wheeler wrote:
> Worth sharing a pointer to a really neat set of papers that describe
> open-source-friendly RAID6 and erasure encoding algorithms that were
> presented last year and this year at FAST:
>
> http://www.cs.utk.edu/~plank/plank/papers/papers.html
>
> If I remember correctly, James Plank's papers also implement and
> benchmark the various encodings.

I have seen the papers; I'm not sure it really makes that much difference. One of the things that bugs me about these papers is that he compares to *his* implementation of my optimizations, but not to my code. In real-life implementations, on commodity hardware, we're limited by memory and disk performance, not by CPU utilization.

	-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
On 07/17/2009 11:40 AM, H. Peter Anvin wrote:
> I have seen the papers; I'm not sure it really makes that much
> difference. One of the things that bugs me about these papers is that
> he compares to *his* implementation of my optimizations, but not to my
> code. In real life implementations, on commodity hardware, we're
> limited by memory and disk performance, not by CPU utilization.

Fair enough - I thought that his coverage of the other open-source-friendly encodings beyond RAID6 was actually quite interesting. If you have specifics that you found unconvincing in his work, I am pretty sure that he would be delighted to hear from you first hand. James seemed to me to be very reasonable and very much a pro-Linux academic, so I would love to be able to get him and his grad students aligned in a useful way for us :-)

> The main flaw, as I said, is in the phrase "as implemented by the
> Jerasure library". He's comparing his own implementations of various
> algorithms, not optimized implementations.
>
> The bottom line is pretty much this: the cost of changing the encoding
> would appear to outweigh the benefit. I'm not trying to claim the
> Linux RAID-6 implementation is optimal, but it is simple and appears
> to be fast enough that the math isn't the bottleneck.
>
> 	-hpa

Cost? Think about how to get free grad student hours testing out things that you might or might not want to leverage on down the road :-)

ric
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Ric Wheeler wrote:
>> The bottom line is pretty much this: the cost of changing the
>> encoding would appear to outweigh the benefit. I'm not trying to
>> claim the Linux RAID-6 implementation is optimal, but it is simple
>> and appears to be fast enough that the math isn't the bottleneck.
>
> Cost? Think about how to get free grad student hours testing out
> things that you might or might not want to leverage on down the
> road :-)

Cost, yes, of changing an on-disk format.

	-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Ric Wheeler wrote:
>> The main flaw, as I said, is in the phrase "as implemented by the
>> Jerasure library". He's comparing his own implementations of various
>> algorithms, not optimized implementations.
>>
>> The bottom line is pretty much this: the cost of changing the
>> encoding would appear to outweigh the benefit. I'm not trying to
>> claim the Linux RAID-6 implementation is optimal, but it is simple
>> and appears to be fast enough that the math isn't the bottleneck.
>
> Cost? Think about how to get free grad student hours testing out
> things that you might or might not want to leverage on down the
> road :-)

Anyway... I don't really care too much. If someone wants to redesign the Linux RAID-6 and Neil decides to take it, I'm not going to object. I'm also not very likely to do any work on it.

	-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Alex Elsayed wrote:
> Ric Wheeler wrote:
>> On 07/17/2009 11:49 AM, H. Peter Anvin wrote:
>>> Ric Wheeler wrote:
>>>>> The bottom line is pretty much this: the cost of changing the
>>>>> encoding would appear to outweigh the benefit. I'm not trying to
>>>>> claim the Linux RAID-6 implementation is optimal, but it is simple
>>>>> and appears to be fast enough that the math isn't the bottleneck.
>>>>
>>>> Cost? Think about how to get free grad student hours testing out
>>>> things that you might or might not want to leverage on down the
>>>> road :-)
>>>
>>> Cost, yes, of changing an on-disk format.
>>>
>>> 	-hpa
>>
>> Putting RAID6 behind us, we still might be interested in the other
>> encodings that are in:
>>
>> "A Performance Evaluation and Examination of Open-Source Erasure
>> Coding Libraries For Storage"
>> http://www.cs.utk.edu/~plank/plank/papers/FAST-2009.html
>>
>> since they give us even more flexibility.
>
> Of course, there's also the fact that, using (essentially unchanged)
> the current code for Reed-Solomon coding, it's completely doable to
> have arbitrary NxM redundancy up to (N + M) < 256 disks (this limit is
> due to the current maximum of 8 for symsize [referred to as 'w' in the
> below paper] in rs_init. If increased to 16, the maximum number of
> disks would be 65535). It's also space-optimal for all combinations of
> N (checksum) and M (data).
>
> http://www.cs.utk.edu/~plank/plank/papers/CS-96-332.html even
> describes an implementation _very_ similar to the current code, right
> down to using a table for the logarithm and inverse logarithm
> calculations.
>
> Also (referencing the earlier-posted paper comparing open-source
> coding techniques), Cauchy Reed-Solomon codes seem to maintain most of
> the benefits of the current system (including the ability to provide
> NxM redundancy, while still retaining the property of being
> space-optimal), with significant performance gains. It also provides
> an optimization for the RAID6 case, so once again the common case
> would get a benefit over less common cases (as with Mr. Anvin's RAID6
> optimization in the current system).
>
> However, I will have to dispute that the other methods provide more
> flexibility - Cauchy Reed-Solomon codes are at best a horizontal move
> there, and the other systems are restricted to (or at very least, far
> more effective in) RAID6 systems.

Whoops, got N (data) and M (checksum) backwards.
[PATCH 1/4] md: Factor out RAID6 algorithms into lib/
We'll want to use these in btrfs too.

Signed-off-by: David Woodhouse <david.woodho...@intel.com>
---
 drivers/md/Kconfig                           |    5 +-
 drivers/md/Makefile                          |   76 -------------
 lib/Kconfig                                  |    3 +
 lib/Makefile                                 |    1 +
 lib/raid6/Makefile                           |   78 ++++++++++++
 {drivers/md => lib/raid6}/mktables.c         |    0
 {drivers/md => lib/raid6}/raid6algos.c       |    0
 {drivers/md => lib/raid6}/raid6altivec.uc    |    0
 {drivers/md => lib/raid6}/raid6int.uc        |    0
 {drivers/md => lib/raid6}/raid6mmx.c         |    0
 {drivers/md => lib/raid6}/raid6recov.c       |    0
 {drivers/md => lib/raid6}/raid6sse1.c        |    0
 {drivers/md => lib/raid6}/raid6sse2.c        |    0
 {drivers/md => lib/raid6}/raid6test/Makefile |    0
 {drivers/md => lib/raid6}/raid6test/test.c   |    0
 {drivers/md => lib/raid6}/raid6x86.h         |    0
 {drivers/md => lib/raid6}/unroll.pl          |    0
 17 files changed, 83 insertions(+), 80 deletions(-)
 create mode 100644 lib/raid6/Makefile
 rename {drivers/md => lib/raid6}/mktables.c (100%)
 rename {drivers/md => lib/raid6}/raid6algos.c (100%)
 rename {drivers/md => lib/raid6}/raid6altivec.uc (100%)
 rename {drivers/md => lib/raid6}/raid6int.uc (100%)
 rename {drivers/md => lib/raid6}/raid6mmx.c (100%)
 rename {drivers/md => lib/raid6}/raid6recov.c (100%)
 rename {drivers/md => lib/raid6}/raid6sse1.c (100%)
 rename {drivers/md => lib/raid6}/raid6sse2.c (100%)
 rename {drivers/md => lib/raid6}/raid6test/Makefile (100%)
 rename {drivers/md => lib/raid6}/raid6test/test.c (100%)
 rename {drivers/md => lib/raid6}/raid6x86.h (100%)
 rename {drivers/md => lib/raid6}/unroll.pl (100%)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 36e0675..42bf294 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -121,7 +121,7 @@ config MD_RAID10
 config MD_RAID456
 	tristate "RAID-4/RAID-5/RAID-6 mode"
 	depends on BLK_DEV_MD
-	select MD_RAID6_PQ
+	select RAID6_PQ
 	select ASYNC_MEMCPY
 	select ASYNC_XOR
 	---help---
@@ -152,9 +152,6 @@ config MD_RAID456
 	  If unsure, say Y.
 
-config MD_RAID6_PQ
-	tristate
-
 config MD_MULTIPATH
 	tristate "Multipath I/O support"
 	depends on BLK_DEV_MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 45cc595..2c0b697 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -10,13 +10,6 @@ dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 dm-mirror-y	+= dm-raid1.o
 md-mod-y	+= md.o bitmap.o
 raid456-y	+= raid5.o
-raid6_pq-y	+= raid6algos.o raid6recov.o raid6tables.o \
-		   raid6int1.o raid6int2.o raid6int4.o \
-		   raid6int8.o raid6int16.o raid6int32.o \
-		   raid6altivec1.o raid6altivec2.o raid6altivec4.o \
-		   raid6altivec8.o \
-		   raid6mmx.o raid6sse1.o raid6sse2.o
-hostprogs-y	+= mktables
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise
@@ -27,7 +20,6 @@ obj-$(CONFIG_MD_LINEAR)	+= linear.o
 obj-$(CONFIG_MD_RAID0)		+= raid0.o
 obj-$(CONFIG_MD_RAID1)		+= raid1.o
 obj-$(CONFIG_MD_RAID10)		+= raid10.o
-obj-$(CONFIG_MD_RAID6_PQ)	+= raid6_pq.o
 obj-$(CONFIG_MD_RAID456)	+= raid456.o
 obj-$(CONFIG_MD_MULTIPATH)	+= multipath.o
 obj-$(CONFIG_MD_FAULTY)		+= faulty.o
@@ -40,75 +32,7 @@ obj-$(CONFIG_DM_SNAPSHOT)	+= dm-snapshot.o
 obj-$(CONFIG_DM_MIRROR)		+= dm-mirror.o dm-log.o dm-region-hash.o
 obj-$(CONFIG_DM_ZERO)		+= dm-zero.o
 
-quiet_cmd_unroll = UNROLL  $@
-      cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \
-		   < $< > $@ || ( rm -f $@ && exit 1 )
-
-ifeq ($(CONFIG_ALTIVEC),y)
-altivec_flags := -maltivec -mabi=altivec
-endif
-
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs	+= dm-uevent.o
 endif
-
-targets += raid6int1.c
-$(obj)/raid6int1.c:   UNROLL := 1
-$(obj)/raid6int1.c:   $(src)/raid6int.uc $(src)/unroll.pl FORCE
-	$(call if_changed,unroll)
-
-targets += raid6int2.c
-$(obj)/raid6int2.c:   UNROLL := 2
-$(obj)/raid6int2.c:   $(src)/raid6int.uc $(src)/unroll.pl FORCE
-	$(call if_changed,unroll)
-
-targets += raid6int4.c
-$(obj)/raid6int4.c:   UNROLL := 4
-$(obj)/raid6int4.c:   $(src)/raid6int.uc $(src)/unroll.pl FORCE
-	$(call if_changed,unroll)
-
-targets += raid6int8.c
-$(obj)/raid6int8.c:   UNROLL := 8
-$(obj)/raid6int8.c:   $(src)/raid6int.uc $(src)/unroll.pl FORCE
-	$(call if_changed,unroll)
-
-targets += raid6int16.c
-$(obj)/raid6int16.c:  UNROLL := 16
-$(obj)/raid6int16.c:  $(src)/raid6int.uc $(src)/unroll.pl FORCE
-	$(call if_changed,unroll)
-
-targets += raid6int32.c
-$(obj)/raid6int32.c:  UNROLL := 32
-$(obj)/raid6int32.c:  $(src)/raid6int.uc $(src)/unroll.pl