Re: [PATCH] update xfs maintainers
Dave is on the other side of the international date line from those of us in the States. If my time zone math is correct, this thread began and continued *after* the end of his 'normal' Friday workday, during Dave's weekend. You think it might be possible he decided to unplug and actually live for a couple of days? Put this on hold until Monday.

-- Stan

On 11/9/2013 6:30 PM, Ben Myers wrote:
> Dave,
>
> On Sat, Nov 09, 2013 at 05:51:30PM -0600, Ben Myers wrote:
>> Hey Neil,
>>
>> On Sat, Nov 09, 2013 at 10:44:24AM +1100, NeilBrown wrote:
>>> On Sat, 9 Nov 2013 06:59:00 +0800 Zhi Yong Wu wrote:
>>> On Sat, Nov 9, 2013 at 6:03 AM, Ben Myers wrote:
> Hey Ric,
>
> On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote:
>> On 11/08/2013 03:46 PM, Ben Myers wrote:
>>> Hey Christoph,
>>>
>>> On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:
On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:
> Mark is replacing Alex as my backup because Alex is really busy at
> Linaro and asked to be taken off awhile ago. The holiday season is
> coming up and I fully intend to go off my meds, turn into Fonzy the
> bear, and eat my hat. I need someone to watch the shop while I'm off
> exploring on Mars. I trust Mark to do that because he is totally
> awesome.

Doing this as a unilateral decision is not something that will win you a fan base.

>>> It's posted for review.

While we never had anything resembling a democracy in Linux kernel development, making decisions without even contacting the major contributor is wrong, doubly so if the maintainer is a relatively minor contributor to start with. Just because it recently came up elsewhere, I'd like to recite the definition from Trond here again:

http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

By many of the creative roles listed there it's clear that Dave should be the maintainer.
He's been the main contributor and chief architect for XFS for many years, while the maintainers came and went at the mercy of SGI. This is not meant to bad-mouth either of you, as I think you're doing a reasonably good job compared to other maintainers, but at the same time the direction is set by other people that have a much longer involvement with the project, and having them officially in control would help us forward a lot. It would also avoid having to spend considerable resources to train every new generation of SGI maintainer.

Coming to an end, I would like to nominate Dave Chinner as the primary XFS maintainer for all the work he has done as biggest contributor and architect of XFS since longer than I can remember, and I would love to retain Ben Myers as a co-maintainer for all the good work he has done maintaining and reviewing patches since November 2011.

>>> I think we're doing a decent job too. So thanks for that much at
>>> least. ;)

I would also like to use this post as a public venue to condemn the unilateral smoky backroom decisions about XFS maintainership that SGI is trying to enforce on the community.

>>> That really didn't happen, Christoph. It's not in my tree or in a
>>> pull request.
>>>
>>> Linus, let me know what you want to do. I do think we're doing a
>>> fair job over here, and (geez) I'm just trying to add Mark as my
>>> backup since Alex is too busy. I know the RH people want more
>>> control, and that's understandable, but they really don't need to
>>> replace me to get their code in. Ouch.
>>>
>>> Thanks,
>>> Ben
>>
>> Christoph is not a Red Hat person.
>>
>> Jeff is from Oracle.
>>
>> This is not a Red Hat vs SGI thing,
>
> Sorry if my read on that was wrong.
>
>> Dave simply has earned the right
>> to take on the formal leadership role of maintainer.
>
> Then we're gonna need some Reviewed-bys. ;)
>
> From: Ben Myers
>
> xfs: update maintainers
>
> Add Dave as maintainer of XFS.
>
> Signed-off-by: Ben Myers
> ---
>  MAINTAINERS | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: b/MAINTAINERS
> ===
> --- a/MAINTAINERS	2013-11-08 15:20:18.935186245 -0600
> +++ b/MAINTAINERS	2013-11-08 15:22:50.685245977 -0600
> @@ -9387,8 +9387,8 @@ F:	drivers/xen/*swiotlb*
>
>  XFS FILESYSTEM
>  P:	Silicon Graphics Inc
> +M:	Dave Chinner

Use his personal
private mail account? I guess that you should ask for his opinion at first, or it is more appropriate that he submit this patch by himself.

If y'all don't mind, I'd like to have authored this one. ;)

Indeed. And does he even want the job? I heard Linus say in a recent interview that being a maintainer is a $#!+ job.

I've found that it can be a little bit stressful sometimes and it tends to crowd out feature work, so I guess I agree with him. It turns out to be an excellent weight loss plan. Is it really best for the most active developers to be burdened with that extra work? (hmm.. maybe I should
Re: high-speed disk I/O is CPU-bound?
On 5/16/2013 5:56 PM, Dave Chinner wrote:
> On Thu, May 16, 2013 at 11:35:08AM -0400, David Oostdyk wrote:
>> On 05/16/13 07:36, Stan Hoeppner wrote:
>>> On 5/15/2013 7:59 PM, Dave Chinner wrote:
>>>> [cc xfs list, seeing as that's where all the people who use XFS in
>>>> these sorts of configurations hang out. ]
>>>>
>>>> On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
>>>>> As a basic benchmark, I have an application that simply writes the
>>>>> same buffer (say, 128MB) to disk repeatedly. Alternatively you
>>>>> could use the "dd" utility. (For these benchmarks, I set
>>>>> /proc/sys/vm/dirty_bytes to 512M or lower, since these systems
>>>>> have a lot of RAM.)
>>>>>
>>>>> The basic observations are:
>>>>>
>>>>> 1. "single-threaded" writes, either to a file on the mounted
>>>>> filesystem or with a "dd" to the raw RAID device, seem to be
>>>>> limited to 1200-1400MB/sec. These numbers vary slightly based on
>>>>> whether TurboBoost is affecting the writing process or not. "top"
>>>>> will show this process running at 100% CPU.
>>>> Expected. You are using buffered IO. Write speed is limited by the
>>>> rate at which your user process can memcpy data into the page
>>>> cache.
>>>>
>>>>> 2. With two benchmarks running on the same device, I see aggregate
>>>>> write speeds of up to ~2.4GB/sec, which is closer to what I'd
>>>>> expect the drives to be capable of delivering. This can either be
>>>>> with two applications writing to separate files on the same
>>>>> mounted file system, or two separate "dd" applications writing to
>>>>> distinct locations on the raw device.
>>> 2.4GB/s is the interface limit of quad lane 6G SAS. Coincidence?
>>> If you've daisy chained the SAS expander backplanes within a server
>>> chassis (9266-8i/72405), or between external enclosures
>>> (9285-8e/71685), and have a single 4 lane cable
>>> (SFF-8087/8088/8643/8644) connected to your RAID card, this would
>>> fully explain the 2.4GB/s wall, regardless of how many parallel
>>> processes are writing, or any other software factor.
>>>
>>> But surely you already know this, and you're using more than one 4
>>> lane cable. Just covering all the bases here, due to seeing 2.4 GB/s
>>> as the stated wall. This number is just too coincidental to ignore.
>>
>> We definitely have two 4-lane cables being used, but this is an
>> interesting coincidence. I'd be surprised if anyone could really
>> achieve the theoretical throughput on one cable, though. We have one
>> JBOD that only takes a single 4-lane cable, and we seem to cap out at
>> closer to 1450MB/sec on that unit. (This is just a single point of
>> reference, and I don't have many tests where only one 4-lane cable
>> was in use.)
>
> You can get pretty close to the theoretical limit on the back end
> SAS cables - just like you can with FC.

Yep.

> What I'd suggest you do is look at the RAID card configuration -
> often they default to active/passive failover configurations when
> there are multiple channels to the same storage. Then they only use
> one of the cables for all traffic. Some RAID cards offer
> active/active or "load balanced" options where all back end paths are
> used in redundant configurations rather than just one.

Also read the docs for your JBOD chassis. Some have a single expander module with 2 host ports while some have two such expanders for redundancy and have 4 total host ports. The latter requires dual ported drives. In this config you'd use one host port on each expander and configure the RAID HBA for multipathing. (It may be possible to use all 4 host ports in this setup but this requires a RAID HBA with 4 external 4 lane connectors. I'm not aware of any at this time, but only two-port models. So you'd have to use two non-RAID HBAs, each with two 4 lane ports, SCSI multipath, and Linux md/RAID.) Most JBODs that use the LSI 2x36 expander ASIC will give you full b/w over two host ports in a single expander, single chassis config. Other JBODs may direct wire one of the two host ports to the expansion port, so you may only get full 8 lane host bandwidth with an expansion unit attached. There are likely other configurations I'm not aware of.

>> You guys hit the nail on the head! With O_DIRECT I can use a single
>> writer thread and easily see the same throughput that I _ever_ saw
>> in the multiple-writer case (~2.4GB
>> /sec), and top shows the writer at 10% CPU usage. I've modified my
>> application to use O_DIRECT and it makes a world of difference.
>
> Be aware that O_DIRECT is not a magic bullet. It can make your IO go
> a lot slower on some workloads and storage configs.
>
>> It's interesting that you see performance benefits for O_DIRECT even
>> with a single SATA drive.

The single SATA drive has little to do with it actually. It's the limited CPU/RAM bus b/w of the box. The reason O_DIRECT shows a 78% improvement in disk throughput is a direct result of dramatically decreased memory pressure, allowing full speed DMA from RAM to the HBA over the PCI bus. The pressure caused by the mem-mem copying of buffered IO causes every read in the CPU to be a cache miss, further exacerbating the load on the CPU/RAM buses. All the memory
Re: high-speed disk I/O is CPU-bound?
On 5/15/2013 7:59 PM, Dave Chinner wrote:
> [cc xfs list, seeing as that's where all the people who use XFS in
> these sorts of configurations hang out. ]
>
> On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
>> Hello,
>>
>> I have a few relatively high-end systems with hardware RAIDs which
>> are being used for recording systems, and I'm trying to get a better
>> understanding of contiguous write performance.
>>
>> The hardware that I've tested with includes two high-end Intel
>> E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
>> older Xeon 5600 system. The JBODs include a 45x3.5" JBOD, a 28x3.5"
>> JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
>> with 10kRPM drives. I've tried LSI controllers (9285-8e, 9266-8i,
>> as well as the integrated Intel LSI controllers) as well as Adaptec
>> Series 7 RAID controllers (72405 and 71685).

So, you have something like the following raw aggregate drive b/w, assuming average outer-inner track 120MB/s streaming write throughput per drive:

45 drives ~5.4 GB/s
28 drives ~3.4 GB/s
24 drives ~2.8 GB/s

The two LSI HBAs you mention are PCIe 2.0 devices. Note that PCIe 2.0 x8 is limited to ~4GB/s each way. If those 45 drives are connected to the 9285-8e via all 8 SAS lanes, you are still losing about 1/3rd of the aggregate drive b/w. If they're connected to the 71685 via 8 lanes and this HBA is in a PCIe 3.0 slot then you're only losing about 600MB/s.

>> Normally I'll setup the RAIDs as RAID60 and format them as XFS, but
>> the exact RAID level, filesystem type, and even RAID hardware don't
>> seem to matter very much from my observations (but I'm willing to
>> try any suggestions).

Lack of performance variability here tends to suggest your workloads are all streaming in nature, and/or your application profile isn't taking full advantage of the software stack and the hardware, i.e. insufficient parallelism, overlapping IOs, etc. Or, see down below for another possibility.
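Stan's back-of-the-envelope figures above can be reproduced with trivial arithmetic. A sketch, using his assumed 120MB/s per-drive streaming rate and the usual PCIe 2.0 figure of ~500MB/s of payload per lane each way after 8b/10b encoding (his table rounds 2.88 down to ~2.8):

```python
def aggregate_bw_gbs(drives, per_drive_mbs=120):
    """Raw streaming bandwidth of N drives in GB/s, assuming an
    average outer-inner track streaming rate per drive."""
    return drives * per_drive_mbs / 1000.0

def pcie2_bw_gbs(lanes=8):
    """PCIe 2.0: 5 GT/s per lane with 8b/10b encoding gives
    ~500 MB/s of payload per lane, each direction."""
    return lanes * 500 / 1000.0

for n in (45, 28, 24):
    print(f"{n} drives: ~{aggregate_bw_gbs(n):.2f} GB/s")
print(f"PCIe 2.0 x8 ceiling: ~{pcie2_bw_gbs():.0f} GB/s each way")
```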
These are all current generation HBAs with fast multi-core ASICs and big write cache. RAID6 parity writes even with high drive counts shouldn't significantly degrade large streaming write performance. RMW workloads will still suffer substantially as usual due to rotational latencies. Fast ASICs can't solve this problem.

> Document them. There's many ways to screw them up and get bad
> performance. More detailed info always helps.

>> As a basic benchmark, I have an application that simply writes the
>> same buffer (say, 128MB) to disk repeatedly. Alternatively you could
>> use the "dd" utility. (For these benchmarks, I set
>> /proc/sys/vm/dirty_bytes to 512M or lower, since these systems have
>> a lot of RAM.)
>>
>> The basic observations are:
>>
>> 1. "single-threaded" writes, either to a file on the mounted
>> filesystem or with a "dd" to the raw RAID device, seem to be limited
>> to 1200-1400MB/sec. These numbers vary slightly based on whether
>> TurboBoost is affecting the writing process or not. "top" will show
>> this process running at 100% CPU.
>
> Expected. You are using buffered IO. Write speed is limited by the
> rate at which your user process can memcpy data into the page cache.
>
>> 2. With two benchmarks running on the same device, I see aggregate
>> write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
>> the drives to be capable of delivering. This can either be with two
>> applications writing to separate files on the same mounted file
>> system, or two separate "dd" applications writing to distinct
>> locations on the raw device.

2.4GB/s is the interface limit of quad lane 6G SAS. Coincidence?
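That interface limit falls straight out of the SAS arithmetic: each 6G lane carries 6 Gbit/s with 8b/10b encoding, i.e. 10 bits on the wire per payload byte, so a 4 lane cable tops out at 4 × 600 MB/s. A quick sketch:

```python
def sas_limit_mbs(gbit_per_lane=6, lanes=4):
    """Payload bandwidth of a multi-lane SAS link in MB/s.
    8b/10b encoding puts 10 line bits on the wire per data byte."""
    payload_mbs_per_lane = gbit_per_lane * 1000 / 10
    return lanes * payload_mbs_per_lane

print(sas_limit_mbs())  # 2400.0 -> the 2.4GB/s wall
```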
If you've daisy chained the SAS expander backplanes within a server chassis (9266-8i/72405), or between external enclosures (9285-8e/71685), and have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your RAID card, this would fully explain the 2.4GB/s wall, regardless of how many parallel processes are writing, or any other software factor.

But surely you already know this, and you're using more than one 4 lane cable. Just covering all the bases here, due to seeing 2.4 GB/s as the stated wall. This number is just too coincidental to ignore.

>> (Increasing the number of writers beyond two does not seem to
>> increase aggregate performance; "top" will show both processes
>> running at perhaps 80% CPU).

So you're not referring to dd processes when you say "writers beyond two". Otherwise you'd say "four" or "eight" instead of "both" processes.

> Still using buffered IO, which means you are typically limited by
> the rate at which the flusher thread can do writeback.
>
>> 3. I haven't been able to find any tricks (lio_listio, multiple
>> threads writing to distinct file offsets, etc) that seem to deliver
>> higher write speeds when writing to a single file. (This might be
>> xfs-specific, though)
>
> How about using direct IO? Single threaded direct IO will be slower
> than buffered IO, but
> throughput should scale linearly with the number of threads if the IO
> size is large enough (e.g. 32MB). Try this quick/dirty
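Dave's actual test script was cut off by the archive. As a stand-in only (not his script), here is a minimal sketch of a direct IO writer in Python: O_DIRECT requires the buffer, file offset, and transfer size to be suitably aligned (a page-aligned anonymous mmap satisfies the buffer requirement), and the fallback path covers platforms and filesystems, such as tmpfs, that reject O_DIRECT outright:

```python
import mmap
import os
import tempfile

def direct_write(path, total_mb=16, block_mb=1):
    """Write total_mb of zeroed data to path, using O_DIRECT when the
    platform and filesystem support it, buffered IO otherwise.
    Returns the number of bytes written."""
    block = block_mb * 1024 * 1024
    buf = mmap.mmap(-1, block)  # anonymous map: page-aligned, zero-filled
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    try:
        fd = os.open(path, flags | os.O_DIRECT)  # AttributeError off-Linux
    except (OSError, AttributeError):
        fd = os.open(path, flags)  # filesystem rejected O_DIRECT
    written = 0
    try:
        for _ in range(total_mb // block_mb):
            written += os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
    return written

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(direct_write(os.path.join(d, "bench.bin")))
```

In a real benchmark you would run several of these writers in parallel processes or threads at large block sizes, which is exactly the scaling behaviour Dave describes.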
Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
On 9/19/2012 1:52 PM, Nix wrote:
> So I have this x86-64 server running Linux 3.5.1

When did you install 3.5.1 on this machine? If fairly recently, does it run without these errors when booted into the previous kernel?

> with a SATA-on-PCIe
> Areca 1210 hardware RAID-5 controller driven by libata which has been
> humming along happily for years -- but suddenly, today, the entire
> machine froze for a couple of minutes (or at least fs access froze),
> followed by this in the logs:
>
> Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1
> [... repeated a few times at intervals over the next five minutes,
> followed by a mass of them at 16:59:29, and...]
> Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset eh.num_resets = 0, num_aborts = 33
> Sep 19 16:59:25 spindle notice: [3447697.878386] arcmsr0: wait 'abort all outstanding command' timeout
> Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus reset .
> Sep 19 16:59:25 spindle err: [3447698.287054] irq 16: nobody cared (try booting with the "irqpoll" option)
> Sep 19 16:59:25 spindle warning: [3447698.287291] Pid: 0, comm: swapper/4 Not tainted 3.5.1-dirty #1
> Sep 19 16:59:25 spindle warning: [3447698.287522] Call Trace:
> Sep 19 16:59:25 spindle warning: [3447698.287754] IRQ [810af5ba] __report_bad_irq+0x31/0xc2
> Sep 19 16:59:25 spindle warning: [3447698.288031] [810af84e] note_interrupt+0x16a/0x1e8
> Sep 19 16:59:25 spindle warning: [3447698.288263] [810ad9d5] handle_irq_event_percpu+0x163/0x1a5
> Sep 19 16:59:25 spindle warning: [3447698.288497] [810ada4f] handle_irq_event+0x38/0x55
> Sep 19 16:59:25 spindle warning: [3447698.288727] [810b01a0] handle_fasteoi_irq+0x78/0xab
> Sep 19 16:59:25 spindle warning: [3447698.288960] [8103631c] handle_irq+0x24/0x2a
> Sep 19 16:59:25 spindle warning: [3447698.289189] [81036229] do_IRQ+0x4d/0xb4
> Sep 19 16:59:25 spindle warning: [3447698.289419] [815070e7] common_interrupt+0x67/0x67
> Sep 19 16:59:25 spindle warning: [3447698.289648] [] ? acpi_idle_enter_c1+0xcb/0xf2
> Sep 19 16:59:25 spindle warning: [3447698.289919] [] ? acpi_idle_enter_c1+0xa9/0xf2
> Sep 19 16:59:25 spindle warning: [3447698.290152] [] cpuidle_enter+0x12/0x14
> Sep 19 16:59:25 spindle warning: [3447698.290382] [] cpuidle_idle_call+0xc5/0x175
> Sep 19 16:59:25 spindle warning: [3447698.290614] [] cpu_idle+0x5b/0xa5
> Sep 19 16:59:25 spindle warning: [3447698.290844] [] start_secondary+0x1a2/0x1a6
> Sep 19 16:59:25 spindle err: [3447698.291074] handlers:
> Sep 19 16:59:25 spindle err: [3447698.291294] [] usb_hcd_irq
> Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16
> Sep 19 16:59:25 spindle err: [3447710.888187] arcmsr0: waiting for hw bus reset return, retry=0
> Sep 19 16:59:25 spindle err: [3447720.882155] arcmsr0: waiting for hw bus reset return, retry=1
> Sep 19 16:59:25 spindle notice: [3447730.896410] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi bus reset eh returns with success
>
> This is the first SCSI (that is, um, ATA) bus reset I have *ever* had
> on this machine, hence my concern. (The IRQ disable we can ignore: it
> was just bad luck that an interrupt destined for the Areca hit after
> the controller had briefly vanished from the PCI bus as part of
> resetting.)
>
> Now just last week another (surge-protected) machine on the same
> power main as it died without warning with a fried power supply which
> apparently roasted the BIOS and/or other motherboard components
> before it died (the ACPI DSDT was filled with rubbish, and other
> things must have been fried because even with ACPI off Linux wouldn't
> boot more than one time out of a hundred, freezing solid at different
> places in the boot each time). So my worry level when this SCSI bus
> reset turned up today is quite high. It's higher given that the
> controller logs (accessed via the Areca binary-only utility for this
> purpose) show no sign of any problem at all.
>
> EDAC shows no PCI bus problems and no memory problems, so this
> probably *is* the controller.
>
> So... is this a serious problem? Does anyone know if I'm about to
> lose this controller, or indeed machine as well? (I really, really
> hope not.)
>
> I'd write this off as a spurious problem and not report it at all,
> but I'm jittery as heck after the catastrophic hardware failure last
> week, and when this happens in close proximity, I worry.
Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
On 9/19/2012 1:52 PM, Nix wrote:
> So I have this x86-64 server running Linux 3.5.1

When did you install 3.5.1 on this machine? If fairly recently, does it
run without these errors when booted into the previous kernel?

> with a SATA-on-PCIe Areca 1210 hardware RAID-5 controller driven by
> libata which has been humming along happily for years -- but suddenly,
> today, the entire machine froze for a couple of minutes (or at least fs
> access froze), followed by this in the logs:
>
> Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1
>
> [... repeated a few times at intervals over the next five minutes,
> followed by a mass of them at 16:59:29, and...]
>
> Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset eh.num_resets = 0, num_aborts = 33
> Sep 19 16:59:25 spindle notice: [3447697.878386] arcmsr0: wait 'abort all outstanding command' timeout
> Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus reset .
> Sep 19 16:59:25 spindle err: [3447698.287054] irq 16: nobody cared (try booting with the "irqpoll" option)
> Sep 19 16:59:25 spindle warning: [3447698.287291] Pid: 0, comm: swapper/4 Not tainted 3.5.1-dirty #1
> Sep 19 16:59:25 spindle warning: [3447698.287522] Call Trace:
> Sep 19 16:59:25 spindle warning: [3447698.287754] <IRQ> [810af5ba] __report_bad_irq+0x31/0xc2
> Sep 19 16:59:25 spindle warning: [3447698.288031] [810af84e] note_interrupt+0x16a/0x1e8
> Sep 19 16:59:25 spindle warning: [3447698.288263] [810ad9d5] handle_irq_event_percpu+0x163/0x1a5
> Sep 19 16:59:25 spindle warning: [3447698.288497] [810ada4f] handle_irq_event+0x38/0x55
> Sep 19 16:59:25 spindle warning: [3447698.288727] [810b01a0] handle_fasteoi_irq+0x78/0xab
> Sep 19 16:59:25 spindle warning: [3447698.288960] [8103631c] handle_irq+0x24/0x2a
> Sep 19 16:59:25 spindle warning: [3447698.289189] [81036229] do_IRQ+0x4d/0xb4
> Sep 19 16:59:25 spindle warning: [3447698.289419] [815070e7] common_interrupt+0x67/0x67
> Sep 19 16:59:25 spindle warning: [3447698.289648] <EOI> [812ab174] ? acpi_idle_enter_c1+0xcb/0xf2
> Sep 19 16:59:25 spindle warning: [3447698.289919] [812ab152] ? acpi_idle_enter_c1+0xa9/0xf2
> Sep 19 16:59:25 spindle warning: [3447698.290152] [813c1446] cpuidle_enter+0x12/0x14
> Sep 19 16:59:25 spindle warning: [3447698.290382] [813c1902] cpuidle_idle_call+0xc5/0x175
> Sep 19 16:59:25 spindle warning: [3447698.290614] [8103c2da] cpu_idle+0x5b/0xa5
> Sep 19 16:59:25 spindle warning: [3447698.290844] [81ad4fcb] start_secondary+0x1a2/0x1a6
> Sep 19 16:59:25 spindle err: [3447698.291074] handlers:
> Sep 19 16:59:25 spindle err: [3447698.291294] [8133b9a3] usb_hcd_irq
> Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16
> Sep 19 16:59:25 spindle err: [3447710.888187] arcmsr0: waiting for hw bus reset return, retry=0
> Sep 19 16:59:25 spindle err: [3447720.882155] arcmsr0: waiting for hw bus reset return, retry=1
> Sep 19 16:59:25 spindle notice: [3447730.896410] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
> Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi bus reset eh returns with success
>
> This is the first SCSI (that is, um, ATA) bus reset I have *ever* had on
> this machine, hence my concern. (The IRQ disable we can ignore: it was
> just bad luck that an interrupt destined for the Areca hit after the
> controller had briefly vanished from the PCI bus as part of resetting.)
>
> Now just last week another (surge-protected) machine on the same power
> main as it died without warning with a fried power supply which
> apparently roasted the BIOS and/or other motherboard components before
> it died (the ACPI DSDT was filled with rubbish, and other things must
> have been fried because even with ACPI off Linux wouldn't boot more than
> one time out of a hundred, freezing solid at different places in the
> boot each time). So my worry level when this SCSI bus reset turned up
> today is quite high. It's higher given that the controller logs
> (accessed via the Areca binary-only utility for this purpose) show no
> sign of any problem at all.
>
> EDAC shows no PCI bus problems and no memory problems, so this probably
> *is* the controller.
>
> So... is this a serious problem? Does anyone know if I'm about to lose
> this controller, or indeed machine as well? (I really, really hope not.)
>
> I'd write this off as a spurious problem and not report it at all, but
> I'm jittery as heck after the catastrophic hardware failure last week,
> and when this happens in close proximity, I worry.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: O_DIRECT to md raid 6 is slow
On 8/21/2012 9:51 AM, Miquel van Smoorenburg wrote:
> On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
>> I'm glad you jumped in David. You made a critical statement of fact
>> below which clears some things up. If you had stated it early on,
>> before Miquel stole the thread and moved it to LKML proper, it would
>> have short circuited a lot of this discussion. Which is:
>
> I'm sorry about that; that's because of the software that I use to
> follow most mailing lists. I didn't notice that the discussion was
> cc'ed to both lkml and l-r. I should fix that.

Oh, my bad. I thought it was intentional. Don't feel too bad about it.
When I tried to copy lkml back in on the one message I screwed up as
well. I thought Tbird had filled in the full address, but it didn't.

>> Thus my original statement was correct, or at least half correct[1], as
>> it pertained to md/RAID6. Then Miquel switched the discussion to
>> md/RAID5 and stated I was all wet. I wasn't, and neither was Dave
>> Chinner. I was simply unaware of this md/RAID5 single block write RMW
>> shortcut.
>
> Well, all I tried to say is that a small write of, say, 4K, to a
> raid5/raid6 array does not need to re-write the whole stripe (i.e.
> chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

And I'm glad you did. Before that I didn't know about these efficiency
shortcuts and exactly how md does writeback on partial stripe updates.

Even with these optimizations, a default 512KB chunk is too big, for the
reasons I stated, the big one being the fact that you'll rarely fill a
full stripe, meaning nearly every write will incur an RMW cycle.

--
Stan
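[Archive note] Miquel's "(read|read) modify (write|write)" description boils down to a single XOR identity. A minimal sketch of the RAID5 single-block parity update (an illustration of the arithmetic only, not md's actual code):

```python
def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Recompute RAID5 parity for a single-block rewrite.

    Two reads (old data, old parity) and two writes (new data, new
    parity) suffice, because P_new = P_old XOR D_old XOR D_new --
    the other data disks never need to be touched.
    """
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))


# Sanity check against a full parity recompute over three data blocks:
blocks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = bytes(a ^ b ^ c for a, b, c in zip(*blocks))
new0 = bytes([42, 0, 0, 0])
assert raid5_small_write(blocks[0], parity, new0) == \
    bytes(a ^ b ^ c for a, b, c in zip(new0, blocks[1], blocks[2]))
```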
Re: O_DIRECT to md raid 6 is slow
On 8/19/2012 9:01 AM, David Brown wrote:
> I'm sort of jumping in to this thread, so my apologies if I repeat
> things other people have said already.

I'm glad you jumped in David. You made a critical statement of fact
below which clears some things up. If you had stated it early on, before
Miquel stole the thread and moved it to LKML proper, it would have short
circuited a lot of this discussion. Which is:

> AFAIK, there is scope for a few performance optimisations in raid6. One
> is that for small writes which only need to change one block, raid5 uses
> a "short-cut" RMW cycle (read the old data block, read the old parity
> block, calculate the new parity block, write the new data and parity
> blocks). A similar short-cut could be implemented in raid6, though it
> is not clear how much of a difference it would really make.

Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6. Then Miquel switched the discussion to
md/RAID5 and stated I was all wet. I wasn't, and neither was Dave
Chinner. I was simply unaware of this md/RAID5 single block write RMW
shortcut.

I'm copying lkml proper on this simply to set the record straight. Not
that anyone was paying attention, but it needs to be in the same thread
in the archives.

The takeaway: md/RAID6 must read all devices in a RMW cycle. md/RAID5
takes a shortcut for single block writes, and must only read one drive
for the RMW cycle.

[1] The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.

--
Stan
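[Archive note] The takeaway above can be written as a toy cost model (a summary of what this thread claims, not something derived from the md source):

```python
def rmw_device_reads(level: int, n_drives: int) -> int:
    """Device reads for a single-block RMW, per the thread's takeaway.

    RAID5 takes the short-cut: read only the old data block and the old
    parity block.  RAID6, as described in this thread, reads every
    device in the stripe.
    """
    if level == 5:
        return 2
    if level == 6:
        return n_drives
    raise ValueError(f"not a parity RAID level: {level}")


# For a 6-drive array: RAID5 reads 2 devices, RAID6 reads all 6.
assert rmw_device_reads(5, 6) == 2
assert rmw_device_reads(6, 6) == 6
```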
Re: O_DIRECT to md raid 6 is slow
On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
> On 16-08-12 1:05 PM, Stan Hoeppner wrote:
>> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>>> to read that 4K block, and the corresponding 4K block on the
>>> parity drive, recalculate parity, and write back 4K of data and 4K
>>> of parity. (read|read) modify (write|write). You do not have to
>>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>>
>> See: http://www.spinics.net/lists/xfs/msg12627.html
>>
>> Dave usually knows what he's talking about, and I didn't see Neil nor
>> anyone else correcting him on his description of md RMW behavior.
>
> Well he's wrong, or you're interpreting it incorrectly.
>
> I did a simple test:
>
> * created a 1G partition on 3 separate disks
> * created a md raid5 array with 512K chunksize:
>   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 /dev/sdd1
> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
> * wrote a single 4K block:
>   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
>
> Output from iostat over the period in which the 4K write was done. Look
> at kB read and kB written:
>
> Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb1      0.60         0.00         1.60          0          8
> sdc1      0.60         0.80         0.80          4          4
> sdd1      0.60         0.00         1.60          0          8
>
> As you can see, a single 4K read, and a few writes. You see a few blocks
> more written than you'd expect because the superblock is updated too.

I'm no dd expert, but it looks like you're simply writing a 4KB block to
a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state. So it doesn't appear this test is going to
trigger RMW. Don't you need to do another write in the same stripe to
trigger RMW? Maybe I'm just reading this wrong.

--
Stan
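[Archive note] For reference, here is the offset arithmetic for Miquel's dd command (a back-of-envelope calculation; chunk size and disk count are taken from his mdadm line): with bs=4K and seek=30 the write lands at byte 122880, inside the very first 512K chunk of the first stripe.

```python
CHUNK = 512 * 1024   # -c $((1024*512)) from the mdadm command above
DATA_DISKS = 2       # 3-disk RAID5: 2 data chunks + 1 parity chunk per stripe


def locate(offset: int, chunk: int = CHUNK, data_disks: int = DATA_DISKS):
    """Map a byte offset on the array to (stripe, data-chunk index, offset in chunk)."""
    stripe_bytes = chunk * data_disks
    return (offset // stripe_bytes,
            (offset % stripe_bytes) // chunk,
            offset % chunk)


# dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 writes at 30 * 4096:
assert locate(30 * 4096) == (0, 0, 122880)
```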
Re: O_DIRECT to md raid 6 is slow
On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article <xs4all.502c1c01.1040...@hardwarefreak.com> you write:
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
>> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata. Yes, insane.
>
> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See: http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior. What I
stated above is pretty much exactly what Dave stated, but for the fact
that I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive
md/RAID6 and 5MB/6MB for 12 drives.

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.

[snip]

> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K.

You're forgetting 3 very important things:

1. All filesystems have metadata
2. All (worth using) filesystems have a metadata journal
3. All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and 512KB chunk, especially as the drive count in the
array increases. Rarely does a filesystem pack enough journal operations
into a single writeout to fill a 512KB stripe, let alone a 4MB stripe.
With a 32KB chunk you see full stripe width journal writes frequently,
minimizing the number of RMW writes to the journal, even up to 16 data
spindle parity arrays (18 drive RAID6). Using a 512KB chunk will cause
most journal writes to be partial stripe writes, triggering RMW for most
journal writes. The same is true for directory metadata writes.

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata. With most/all other workloads, using a
large chunk size, such as the md metadata 1.2 default of 512KB, with
parity RAID simply makes it much worse, whether the RMW cycle affects
all disks or just one data disk and one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB. You'll be much happier with real
>> workloads.
>
> Aligning is a good idea,

Understatement of the century. Just as critical, if not more so, FS
stripe alignment is mandatory with parity RAID; otherwise even full
stripe writeout can/will trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard. A small chunk size is
optimal for nearly all workloads on a parity array, for the reasons I
stated above. It's the large chunk that is extremely workload dependent,
as again, it only fits well with low metadata streaming workloads.

--
Stan
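[Archive note] To put rough numbers on the journal argument above (a back-of-envelope sketch; the 128KB writeout size is a hypothetical illustration, not a figure from the thread): stripe width is chunk size times data spindles, so the same writeout that fills a small-chunk stripe covers only a sliver of a 512KB-chunk stripe and degrades to RMW.

```python
def stripe_kb(chunk_kb: int, data_disks: int) -> int:
    """Full stripe width in KB: chunk size times data spindles."""
    return chunk_kb * data_disks


# 6-drive RAID6 = 4 data spindles.
assert stripe_kb(32, 4) == 128     # a hypothetical 128KB journal writeout fills the stripe
assert stripe_kb(512, 4) == 2048   # the same writeout covers 1/16 of the stripe -> partial write -> RMW
# With a 512KB chunk, stripes reach multi-MB widths quickly as drives are added:
assert stripe_kb(512, 8) == 4096   # e.g. 10-drive RAID6
```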
Re: O_DIRECT to md raid 6 is slow
On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>
>>>> [...]
>>>>
>>>>> It looks like md isn't recognizing that I'm writing whole stripes
>>>>> when I'm in O_DIRECT mode.
>>>>
>>>> I see your md device is partitioned. Is the partition itself
>>>> stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>>       [6/6] [UUUUUU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust. So you consume 6MB of bandwidth to write less than
>> a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata. Yes, insane.
>
> Grr. I thought the bad old days of filesystem and related defaults
> sucking were over.

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads. I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default. Obviously something went horribly wrong here. 512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

> cryptsetup aligns sanely these days, xfs is sensible, etc.

XFS won't align with the 512KB chunk default of metadata 1.2. The
largest XFS journal stripe unit (su--chunk) is 256KB, and even that
isn't recommended. Thus mkfs.xfs throws an error due to the 512KB
stripe. See the md and xfs archives for more details, specifically Dave
Chinner's colorful comments on the md 512KB default.

> wtf? Why is there no sensible filesystem for huge disks? zfs can't
> cp --reflink and has all kinds of source availability and licensing
> issues, xfs can't dedupe at all, and btrfs isn't nearly stable enough.

Deduplication isn't a responsibility of a filesystem. TTBOMK there are
two, and only two, COW filesystems in existence: ZFS and BTRFS. And
these are the only two to offer a native dedupe capability. They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup. So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang. There's still a bug
> here...

Always one somewhere.

--
Stan
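[Archive note] Andy's alignment problem reduces to a divisibility check (a sketch of the arithmetic only, using the geometry from his mdstat output: 512K chunk, 4 data disks):

```python
def stripe_aligned(start_bytes: int, chunk_kb: int = 512, data_disks: int = 4) -> bool:
    """True if a partition start sits on a full-stripe boundary."""
    stripe = chunk_kb * 1024 * data_disks   # 2 MiB for this array
    return start_bytes % stripe == 0


# gdisk's default 2048-sector (1 MiB) partition start vs. the 2 MiB stripe:
assert not stripe_aligned(2048 * 512)
# A 4096-sector (2 MiB) start would have been stripe-aligned:
assert stripe_aligned(4096 * 512)
```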
Re: O_DIRECT to md raid 6 is slow
On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>> I see your md device is partitioned. Is the partition itself
>> stripe-aligned?
>
> Crud.
>
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>       [6/6] [UUUUUU]
>
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.

It's time to blow away the array and start over. You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust. So you consume 6MB of bandwidth to write less than
a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata. Yes, insane.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count. It seems people tend to use
large chunk sizes because array initialization is a bit faster, and
running block x-fer "tests" with dd buffered sequential reads/writes
makes their Levi's expand. Then they are confused when their actual
workloads are horribly slow.

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB. You'll be much happier with real
workloads.

--
Stan
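[Archive note] The 6MB and 12MB figures above come out of simple stripe arithmetic, reproduced here as stated, i.e. under the pessimistic whole-stripe RMW model (read every drive's chunk, write every drive's chunk back); later messages in this thread refine that model.

```python
def whole_stripe_rmw_kb(n_drives: int, chunk_kb: int = 512) -> int:
    """Bandwidth in KB for one RMW under the whole-stripe model:
    read the full stripe (one chunk per drive), then write it all back."""
    stripe_kb = n_drives * chunk_kb
    return 2 * stripe_kb


assert whole_stripe_rmw_kb(6) == 6 * 1024    # 3MB read + 3MB write = 6MB
assert whole_stripe_rmw_kb(12) == 12 * 1024  # 12MB to modify a few bytes
```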
Re: O_DIRECT to md raid 6 is slow
On 8/15/2012 12:57 PM, Andy Lutomirski wrote: On Wed, Aug 15, 2012 at 4:50 AM, John Robinson john.robin...@anonymous.org.uk wrote: On 15/08/2012 01:49, Andy Lutomirski wrote: If I do: # dd if=/dev/zero of=/dev/md0p1 bs=8M [...] It looks like md isn't recognizing that I'm writing whole stripes when I'm in O_DIRECT mode. I see your md device is partitioned. Is the partition itself stripe-aligned? Crud. md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0] 11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UU] IIUC this means that I/O should be aligned on 2MB boundaries (512k chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector (i.e. 1MB) boundary. It's time to blow away the array and start over. You're already misaligned, and a 512KB chunk is insanely unsuitable for parity RAID, but for a handful of niche all streaming workloads with little/no rewrite, such as video surveillance or DVR workloads. Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why: Deleting a single file changes only a few bytes of directory metadata. With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data, modify the directory block in question, calculate parity, then write out 3MB of data to rust. So you consume 6MB of bandwidth to write less than a dozen bytes. With a 12 drive RAID6 that's 12MB of bandwidth to modify a few bytes of metadata. Yes, insane. Parity RAID sucks in general because of RMW, but it is orders of magnitude worse when one chooses to use an insane chunk size to boot, and especially so with a large drive count. It seems people tend to use large chunk sizes because array initialization is a bit faster, and running block x-fer tests with dd buffered sequential reads/writes makes their Levi's expand. Then they are confused when their actual workloads are horribly slow. Recreate your array, partition aligned, and manually specify a sane chunk size of something like 32KB. 
You'll be much happier with real workloads.

-- Stan
Re: O_DIRECT to md raid 6 is slow
On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <s...@hardwarefreak.com> wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>> <john.robin...@anonymous.org.uk> wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>> [...]
>>>>> It looks like md isn't recognizing that I'm writing whole stripes
>>>>> when I'm in O_DIRECT mode.
>>>> I see your md device is partitioned. Is the partition itself
>>>> stripe-aligned?
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks). gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over. You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> except for a handful of niche, all-streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default. And yes, it is insane. Here's why:
>> deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with a 512KB chunk, you must read 3MB of
>> data, modify the directory block in question, recalculate parity, then
>> write 3MB of data back out to rust. So you consume 6MB of bandwidth to
>> write less than a dozen bytes. With a 12 drive RAID6 that's 12MB of
>> bandwidth to modify a few bytes of metadata. Yes, insane.
>
> Grr. I thought the bad old days of filesystem and related defaults
> sucking were over.

The previous md chunk default of 64KB wasn't horribly bad, though still maybe a bit high for a lot of common workloads. I didn't have eyes/ears on the discussion and/or testing process that led to the 'new' 512KB default. Obviously something went horribly wrong here. 512KB isn't a show stopper as a default for RAID 0/1/10, but it is 8-16 times too large for parity RAID.
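[Editor's note: Andy's alignment arithmetic (512k chunk x 4 data disks = 2MB stripe, vs. gdisk's 1MiB partition start) can be checked mechanically. A small sketch; the function names are my own, not from any tool.]

```python
SECTOR = 512  # bytes per sector

def stripe_width(chunk_kib: int, data_disks: int) -> int:
    """Full data stripe width in bytes (parity disks excluded)."""
    return chunk_kib * 1024 * data_disks

def partition_aligned(start_sector: int, chunk_kib: int, data_disks: int) -> bool:
    """True if the partition start falls on a stripe boundary."""
    return (start_sector * SECTOR) % stripe_width(chunk_kib, data_disks) == 0

# 6-drive RAID6 -> 4 data disks; gdisk default start at sector 2048 (1 MiB)
print(stripe_width(512, 4) // 2**20)      # 2 (MiB stripe width)
print(partition_aligned(2048, 512, 4))    # False: the 1 MiB start is misaligned
print(partition_aligned(2048, 64, 4))     # True with the old 64 KiB default
print(partition_aligned(2048, 32, 4))     # True with a 32 KiB chunk
```

This is why the 512KB default bites even before any RMW penalty: the standard 1MiB partition offset only lines up with stripes of 1MiB or smaller.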
> cryptsetup aligns sanely these days, xfs is sensible, etc.

XFS won't align with the 512KB chunk default of metadata 1.2. The largest XFS journal stripe unit (su == chunk) is 256KB, and even that isn't recommended. Thus mkfs.xfs throws an error due to the 512KB stripe. See the md and xfs list archives for more details, specifically Dave Chinner's colorful comments on the md 512KB default.

> wtf?
>
> <rant>Why is there no sensible filesystem for huge disks? zfs can't
> cp --reflink and has all kinds of source availability and licensing
> issues, xfs can't dedupe at all, and btrfs isn't nearly stable
> enough.</rant>

Deduplication isn't a responsibility of a filesystem. TTBOMK there are two, and only two, COW filesystems in existence: ZFS and BTRFS. And these are the only two to offer a native dedupe capability. They did it because they could, with COW, not necessarily because they *should*. There are dozens of other single node, cluster, and distributed filesystems in use today, and none of them support COW, thus none support dedupe. So to *expect* a 'sensible' filesystem to include dedupe is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang. There's still a bug here...

Always one somewhere.

-- Stan
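[Editor's note: the mkfs.xfs failure mode described above can be shown as a toy version of the check. The 256KB ceiling comes from the text; the fallback value and the names here are my assumptions for illustration, not code lifted from xfsprogs.]

```python
XFS_MAX_LOG_SUNIT = 256 * 1024  # largest journal stripe unit XFS accepts

def log_stripe_unit(md_chunk_bytes: int) -> int:
    """Pick a usable XFS log stripe unit for a given md chunk size.
    Assumes a 32 KiB fallback when the chunk exceeds the ceiling."""
    if md_chunk_bytes <= XFS_MAX_LOG_SUNIT:
        return md_chunk_bytes
    return 32 * 1024  # assumed fallback, for illustration only

print(log_stripe_unit(512 * 1024))  # 32768: the 512 KiB md default is rejected
print(log_stripe_unit(32 * 1024))   # 32768: a sane 32 KiB chunk is used as-is
```

The point: md's 512KB default is double what the XFS journal can even express as a stripe unit, so the log cannot be stripe-aligned at all with that chunk size.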
Re: An Andre To Remember
On 7/28/2012 7:11 PM, Nicholas A. Bellinger wrote: > On Fri, 2012-07-27 at 13:56 -0400, Jeff Garzik wrote: >> An Andre To Remember >> July 2012 >> >> Linux lost a friend and advocate this month. Though never a household >> name, Andre Hedrick had a positive impact on everyone today running >> Linux, or using a website, with any form of IDE (ATA) or SCSI storage >> -- that means millions upon millions of users today. >> >> For a time, Andre interacted with practically every relevant IDE >> drive and controller manufacturer, as well as the T13 standards >> committee through which IDE changes were made. He helped ensure >> Linux had near-universal IDE support in a hardware era when Linux >> support was a second thought if at all. As the Register article[1] >> noted, with CPRM and other efforts, Andre worked to keep storage a >> more open platform than it might otherwise have been. >> >> [1] http://www.theregister.co.uk/2012/07/26/andre_hedrick/ >> >> Andre also played a role in IDE technology coalescing around the idea >> of a "taskfile", which is IDE-speak for an RPC command issued to a >> disk drive, and the RPC response returned from the drive. It was >> very important to Andre that the kernel have a "taskfile ioctl", >> an API enabling full programmable access to the disk drive. At the >> time, a more limited "cmd ioctl" API was the best option available, >> but Linux's cmd ioctl did not give users full and complete access to >> their own disk drive. >> >> Andre's taskfile concept was a central component of the current, >> rewritten-from-scratch Linux IDE driver "libata." libata uses an >> "ata_taskfile" to communicate with all IDE drives, whether from a >> decade ago or built yesterday. The taskfile concept modernized >> IDE software, by forcing the industry to move away from a slow, >> signals-originated register API to a modern, packetized RPC messaging >> API, similar to where SCSI storage had already been moving. 
>>
>> I spent many hours on the phone with Andre, circa 2003, learning all
>> there was to know about ATA storage, while writing libata. Andre could
>> be considered one of the grandfathers of libata, along with Alan Cox.
>> I became friends with Andre during this time, and we talked a lot.
>>
>> Andre was unquestionably smart, driven and an advocate for Linux user
>> freedom.
>
> Hi Jeff,
>
> Thank you for sharing your thoughts + memories of Andre.
>
> As we grieve this extreme loss, I'd like to try to share some of my own
> experiences with Andre that will hopefully help others begin to
> understand the kind + generous type of person that Andre really was, and
> just some of his staggering technical feats + accomplishments that can
> be talked about publicly today.
>
> Along with Andre being involved in the history of libata and IDE/ATA
> development, as those of us in the Linux kernel storage development
> community also know, he was instrumental in the creation of the
> original out-of-tree PyX iSCSI target code that's now in mainline.
>
> In summer 2002, I was sitting next to Andre in Walnut Creek, California,
> when he coined the term 'IBLOCK' after drawing a rough sketch of the
> idea on a notebook, and the name ended up sticking.. The interesting
> development bits really started to unfold in the spring of 2004 when we
> finally managed to get drivers/ide/ export working with iscsi-target on
> x86 using 2.4.x code.
>
> That quickly unfolded into a Sony Playstation-2 (MIPS EE) port using IDE
> disk DMA mode + network PIO on 2.2.x era kernel code, capable of
> streaming multiple DVD quality streams to hungry iSCSI clients..
>
> Left to my own devices for hardware hacking, I managed to turn our first
> disassembled PS2 into a broken parts machine (whoops), but Andre was
> going to make sure that it was not going to happen again..
> I bought another PS2, and he was the person who soldered wires to the
> handful of tiny via pin-outs to access the one-way serial output for EE
> boot information late at night, while I worked on the necessary kernel
> bits needed for bring-up of the PS2 specific IDE backend target driver.
> (The PS2 IDE driver required contiguous memory for IDE DMA ops to
> function via a single struct buffer_head (TCQ=1) on the non-cache
> coherent MIPS based platform.)
>
> He carefully made physical space in the machine's cramped chassis, using
> sticky pads where necessary to hold the small PCB containing a simple
> ASIC doing the conversion of the signal into PC RS-232 serial output.
> He made it look completely flush, exactly how it was supposed to come
> from the factory. Or, you know, from the magical place near the old
> Bell Labs R&D center where new development kits for cutting edge tech
> are born.
>
> CBS Sunday Morning even did a story on Andre and his family in the
> summer of 2004 while all of this was going on.. Not for the PS2
> iscsi-target or any other code of course, but for the fact that he was
> chosen by EBay to represent California small business as part of a group
> that lobbied in Washington DC. The reason that EBay chose Andre is
> because he built PyX