Re: [PATCH] update xfs maintainers

2013-11-09 Thread Stan Hoeppner
Dave is on the other side of the international date line from those of
us in the States.  If my time zone math is correct, this thread began
and continued *after* the end of his 'normal' Friday workday, during
Dave's weekend.  You think it might be possible he decided to unplug and
actually live for a couple of days?

Put this on hold until Monday.

-- 
Stan



On 11/9/2013 6:30 PM, Ben Myers wrote:
 Dave,
 
 On Sat, Nov 09, 2013 at 05:51:30PM -0600, Ben Myers wrote:
 Hey Neil,

 On Sat, Nov 09, 2013 at 10:44:24AM +1100, NeilBrown wrote:
 On Sat, 9 Nov 2013 06:59:00 +0800 Zhi Yong Wu zwu.ker...@gmail.com wrote:

 On Sat, Nov 9, 2013 at 6:03 AM, Ben Myers b...@sgi.com wrote:
 Hey Ric,

 On Fri, Nov 08, 2013 at 03:50:21PM -0500, Ric Wheeler wrote:
 On 11/08/2013 03:46 PM, Ben Myers wrote:
 Hey Christoph,

 On Fri, Nov 08, 2013 at 11:34:24AM -0800, Christoph Hellwig wrote:
 On Fri, Nov 08, 2013 at 12:03:37PM -0600, Ben Myers wrote:
 Mark is replacing Alex as my backup because Alex is really busy at
 Linaro and asked to be taken off awhile ago.  The holiday season is
 coming up and I fully intend to go off my meds, turn into Fonzy the
 bear, and eat my hat.  I need someone to watch the shop while I'm off
 exploring on Mars.  I trust Mark to do that because he is totally
 awesome.

 Doing this as a unilateral decision is not something that will win you
 a fan base.
 It's posted for review.

 While we never had anything resembling a democracy in Linux kernel
 development, making decisions without even contacting the major
 contributor is wrong, doubly so if the maintainer is a relatively minor
 contributor to start with.

 Just because it recently came up elsewhere, I'd like to recite the
 definition from Trond here again:


 http://lists.linux-foundation.org/pipermail/ksummit-2012-discuss/2012-June/66.html

 By many of the creative roles listed there it's clear that Dave should
 be the maintainer.  He's been the main contributor and chief architect
 for XFS for many years, while the maintainers came and went at the mercy
 of SGI.  This is not meant to bad mouth either of you as I think you're
 doing a reasonably good job compared to other maintainers, but at the
 same time the direction is set by other people that have a much longer
 involvement with the project, and having them officially in control
 would help us forward a lot.  It would also avoid having to spend
 considerable resources to train every new generation of SGI maintainer.

 Coming to an end, I would like to nominate Dave Chinner as the primary
 XFS maintainer for all the work he has done as biggest contributor and
 architect of XFS since longer than I can remember, and I would love to
 retain Ben Myers as a co-maintainer for all the good work he has done
 maintaining and reviewing patches since November 2011.
 I think we're doing a decent job too.  So thanks for that much at least.  ;)
 I would also like to use this post as a public venue to condemn the
 unilateral smoky backroom decisions about XFS maintainership that SGI is
 trying to enforce on the community.
 That really didn't happen, Christoph.  It's not in my tree or in a pull
 request.

 Linus, let me know what you want to do.  I do think we're doing a fair
 job over here, and (geez) I'm just trying to add Mark as my backup since
 Alex is too busy.  I know the RH people want more control, and that's
 understandable, but they really don't need to replace me to get their
 code in.  Ouch.

 Thanks,
 Ben

 Christoph is not a Red Hat person.

 Jeff is from Oracle.

 This is not a Red Hat vs SGI thing,

 Sorry if my read on that was wrong.

 Dave simply has earned the right
 to take on the formal leadership role of maintainer.

 Then we're gonna need some Reviewed-bys.  ;)

 From: Ben Myers b...@sgi.com

 xfs: update maintainers

 Add Dave as maintainer of XFS.

 Signed-off-by: Ben Myers b...@sgi.com
 ---
  MAINTAINERS |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 Index: b/MAINTAINERS
 ===
 --- a/MAINTAINERS   2013-11-08 15:20:18.935186245 -0600
 +++ b/MAINTAINERS   2013-11-08 15:22:50.685245977 -0600
 @@ -9387,8 +9387,8 @@ F:drivers/xen/*swiotlb*

  XFS FILESYSTEM
  P: Silicon Graphics Inc
 +M: Dave Chinner dchin...@fromorbit.com
 Use his personal private mail account? I guess you should ask for his
 opinion first, or it would be more appropriate for him to submit this
 patch himself.

 If y'all don't mind, I'd like to have authored this one.  ;)
  
 Indeed.  And does he even want the job?  I heard Linus say in a recent
 interview that being a maintainer is a $#!+ job. 

 I've found that it can be a little bit stressful sometimes and it tends to
 crowd out feature work, so I guess I agree with him.  It turns out to be an
 excellent weight loss plan.

 Is it really best for the
 most active developers to be burdened with that extra work?

 (hmm.. maybe I should 

Re: high-speed disk I/O is CPU-bound?

2013-05-17 Thread Stan Hoeppner
On 5/16/2013 5:56 PM, Dave Chinner wrote:
 On Thu, May 16, 2013 at 11:35:08AM -0400, David Oostdyk wrote:
 On 05/16/13 07:36, Stan Hoeppner wrote:
 On 5/15/2013 7:59 PM, Dave Chinner wrote:
 [cc xfs list, seeing as that's where all the people who use XFS in
 these sorts of configurations hang out. ]

 On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
 As a basic benchmark, I have an application
 that simply writes the same buffer (say, 128MB) to disk repeatedly.
 Alternatively you could use the dd utility.  (For these
 benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
 these systems have a lot of RAM.)

 The basic observations are:

 1.  single-threaded writes, either a file on the mounted
 filesystem or with a dd to the raw RAID device, seem to be limited
 to 1200-1400MB/sec.  These numbers vary slightly based on whether
 TurboBoost is affecting the writing process or not.  top will show
 this process running at 100% CPU.
 Expected. You are using buffered IO. Write speed is limited by the
 rate at which your user process can memcpy data into the page cache.

 2.  With two benchmarks running on the same device, I see aggregate
 write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
 the drives of being able to deliver.  This can either be with two
 applications writing to separate files on the same mounted file
 system, or two separate dd applications writing to distinct
 locations on the raw device.
 2.4GB/s is the interface limit of quad lane 6G SAS.  Coincidence?  If
 you've daisy chained the SAS expander backplanes within a server chassis
 (9266-8i/72405), or between external enclosures (9285-8e/71685), and
 have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your
 RAID card, this would fully explain the 2.4GB/s wall, regardless of how
 many parallel processes are writing, or any other software factor.

 But surely you already know this, and you're using more than one 4 lane
 cable.  Just covering all the bases here, due to seeing 2.4 GB/s as the
 stated wall.  This number is just too coincidental to ignore.

 We definitely have two 4-lane cables being used, but this is an
 interesting coincidence.  I'd be surprised if anyone could really
 achieve the theoretical throughput on one cable, though.  We have
 one JBOD that only takes a single 4-lane cable, and we seem to cap
 out at closer to 1450MB/sec on that unit.  (This is just a single
 point of reference, and I don't have many tests where only one
 4-lane cable was in use.)
 
 You can get pretty close to the theoretical limit on the back end
 SAS cables - just like you can with FC.

Yep.

 What I'd suggest you do is look at the RAID card configuration -
 often they default to active/passive failover configurations when
 there are multiple channels to the same storage. Then they only use
 one of the cables for all traffic. Some RAID cards offer
 active/active or "load balanced" options where all back end paths are
 used in redundant configurations rather than just one.

Also read the docs for your JBOD chassis.  Some have a single expander
module with 2 host ports while some have two such expanders for
redundancy and have 4 total host ports.  The latter requires dual ported
drives.  In this config you'd use one host port on each expander and
configure the RAID HBA for multipathing.  (It may be possible to use all
4 host ports in this setup but this requires a RAID HBA with 4 external
4 lane connectors.  I'm not aware of any at this time, but only two-port
models.  So you'd have to use two non-RAID HBAs each with two 4 lane
ports, SCSI multipath, and Linux md/RAID.)

Most JBODs that use the LSI 2x36 expander ASIC will give you full b/w
over two host ports in a single expander single chassis config.  Other
JBODs may direct-wire one of the two host ports to the expansion port so
you may only get full 8 lane host bandwidth with an expansion unit
attached.  There are likely other configurations I'm not aware of.
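
As a rough sketch of that last configuration (two plain HBAs, dm-multipath
underneath, md on top); device names, drive count and the way multipathd is
started are all illustrative, and the multipath.conf details depend entirely
on the enclosure:

  # let dm-multipath coalesce the two paths to each dual-ported drive
  /etc/init.d/multipathd start        # or however your distro starts it
  multipath -ll                       # expect one mpathX per disk, two paths each
  # build the md array on the multipath devices, never on the raw sdX paths
  mdadm --create /dev/md0 --level=6 --raid-devices=12 /dev/mapper/mpath[a-l]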

 You guys hit the nail on the head!  With O_DIRECT I can use a single
 writer thread and easily see the same throughput that I _ever_ saw
 in the multiple-writer case (~2.4GB/sec), and top shows the writer
 at 10% CPU usage.  I've modified my application to use O_DIRECT and
 it makes a world of difference.
 
 Be aware that O_DIRECT is not a magic bullet. It can make your IO
 go a lot slower on some workloads and storage configs.

 [It's interesting that you see performance benefits for O_DIRECT
 even with a single SATA drive.  

The single SATA drive has little to do with it actually.  It's the
limited CPU/RAM bus b/w of the box.  The reason O_DIRECT shows a 78%
improvement in disk throughput is a direct result of dramatically
decreased memory pressure, allowing full speed DMA from RAM to the HBA
over the PCI bus.  The pressure caused by the mem-mem copying of
buffered IO causes every read in the CPU to be a cache miss, further
exacerbating the load on the CPU/RAM buses.  All the memory

Re: high-speed disk I/O is CPU-bound?

2013-05-16 Thread Stan Hoeppner
On 5/15/2013 7:59 PM, Dave Chinner wrote:
 [cc xfs list, seeing as that's where all the people who use XFS in
 these sorts of configurations hang out. ]
 
 On Fri, May 10, 2013 at 10:04:44AM -0400, David Oostdyk wrote:
 Hello,

 I have a few relatively high-end systems with hardware RAIDs which
 are being used for recording systems, and I'm trying to get a better
 understanding of contiguous write performance.

 The hardware that I've tested with includes two high-end Intel
 E5-2600 and E5-4600 (~3GHz) series systems, as well as a slightly
 older Xeon 5600 system.  The JBODs include a 45x3.5" JBOD, a 28x3.5"
 JBOD (with either 7200RPM or 10kRPM SAS drives), and a 24x2.5" JBOD
 with 10kRPM drives.  I've tried LSI controllers (9285-8e, 9266-8i,
 as well as the integrated Intel LSI controllers) as well as Adaptec
 Series 7 RAID controllers (72405 and 71685).

So, you have something like the following raw aggregate drive b/w,
assuming average outer-inner track 120MB/s streaming write throughput
per drive:

45 drives ~5.4 GB/s
28 drives ~3.4 GB/s
24 drives ~2.8 GB/s

The two LSI HBAs you mention are PCIe 2.0 devices.  Note that PCIe 2.0
x8 is limited to ~4GB/s each way.  If those 45 drives are connected to
the 9285-8e via all 8 SAS lanes, you are still losing about 1/3rd of the
aggregate drive b/w.  If they're connected to the 71685 via 8 lanes and
this HBA is in a PCIe 3.0 slot then you're only losing about 600MB/s.

 Normally I'll setup the RAIDs as RAID60 and format them as XFS, but
 the exact RAID level, filesystem type, and even RAID hardware don't
 seem to matter very much from my observations (but I'm willing to
 try any suggestions).

Lack of performance variability here tends to suggest your workloads are
all streaming in nature, and/or your application profile isn't taking
full advantage of the software stack and the hardware, i.e. insufficient
parallelism, overlapping IOs, etc.  Or, see down below for another
possibility.

These are all current generation HBAs with fast multi-core ASICs and big
write cache.  RAID6 parity writes even with high drive counts shouldn't
significantly degrade large streaming write performance.  RMW workloads
will still suffer substantially as usual due to rotational latencies.
Fast ASICs can't solve this problem.

 Document them. There's many ways to screw them up and get bad
 performance.

More detailed info always helps.

 As a basic benchmark, I have an application
 that simply writes the same buffer (say, 128MB) to disk repeatedly.
 Alternatively you could use the dd utility.  (For these
 benchmarks, I set /proc/sys/vm/dirty_bytes to 512M or lower, since
 these systems have a lot of RAM.)

 The basic observations are:

 1.  single-threaded writes, either a file on the mounted
 filesystem or with a dd to the raw RAID device, seem to be limited
 to 1200-1400MB/sec.  These numbers vary slightly based on whether
 TurboBoost is affecting the writing process or not.  top will show
 this process running at 100% CPU.
 
 Expected. You are using buffered IO. Write speed is limited by the
 rate at which your user process can memcpy data into the page cache.
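
A minimal way to reproduce that buffered single-writer case (a sketch; the
mount point, file size and dirty_bytes value are only illustrative):

  # cap dirty page cache as in the original test
  echo $((512 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
  # one buffered writer; on a fast array expect dd to sit near 100% CPU
  dd if=/dev/zero of=/mnt/bigraid/testfile bs=128M count=64 conv=fsync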
 
 2.  With two benchmarks running on the same device, I see aggregate
 write speeds of up to ~2.4GB/sec, which is closer to what I'd expect
 the drives of being able to deliver.  This can either be with two
 applications writing to separate files on the same mounted file
 system, or two separate dd applications writing to distinct
 locations on the raw device.  

2.4GB/s is the interface limit of quad lane 6G SAS.  Coincidence?  If
you've daisy chained the SAS expander backplanes within a server chassis
(9266-8i/72405), or between external enclosures (9285-8e/71685), and
have a single 4 lane cable (SFF-8087/8088/8643/8644) connected to your
RAID card, this would fully explain the 2.4GB/s wall, regardless of how
many parallel processes are writing, or any other software factor.

But surely you already know this, and you're using more than one 4 lane
cable.  Just covering all the bases here, due to seeing 2.4 GB/s as the
stated wall.  This number is just too coincidental to ignore.
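
For reference, a back-of-the-envelope check of that figure (my arithmetic,
assuming 8b/10b line coding on 6G SAS):

  # 4 lanes * 6 Gbit/s raw, * 0.8 payload efficiency, / 8 bits per byte
  echo "4 * 6 * 0.8 / 8" | bc -l      # ~2.4 GB/s per 4-lane cable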

 (Increasing the number of writers
 beyond two does not seem to increase aggregate performance; top
 will show both processes running at perhaps 80% CPU).

So you're not referring to dd processes when you say "writers beyond
two".  Otherwise you'd say "four" or "eight" instead of "both" processes.

 Still using buffered IO, which means you are typically limited by
 the rate at which the flusher thread can do writeback.

 3.  I haven't been able to find any tricks (lio_listio, multiple
 threads writing to distinct file offsets, etc) that seem to deliver
 higher write speeds when writing to a single file.  (This might be
 xfs-specific, though)
 
 How about using direct IO? Single threaded direct IO will be slower
 than buffered IO, but throughput should scale linearly with the
 number of threads if the IO size is large enough (e.g. 32MB).

Try this quick/dirty 
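
Dave's quick/dirty test is cut off in the archive; as a separate rough
illustration of the multi-threaded direct IO approach he describes (file
names, thread count and sizes are placeholders, not his script):

  # four concurrent direct-IO writers, 32MB IOs, distinct files on the XFS mount
  for i in 1 2 3 4; do
      dd if=/dev/zero of=/mnt/bigraid/dio.$i bs=32M count=256 oflag=direct &
  done
  wait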

Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?

2012-09-19 Thread Stan Hoeppner
On 9/19/2012 1:52 PM, Nix wrote:
 So I have this x86-64 server running Linux 3.5.1 

When did you install 3.5.1 on this machine?  If fairly recently, does it
run without these errors when booted into the previous kernel?

 with a SATA-on-PCIe
 Areca 1210 hardware RAID-5 controller driven by libata which has been
 humming along happily for years -- but suddenly, today, the entire
 machine froze for a couple of minutes (or at least fs access froze),
 followed by this in the logs:
 
 Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device 
 command of scsi id = 0 lun = 1 
 [... repeated a few times at intervals over the next five minutes,
  followed by a mass of them at 16:59:29, and...]
 Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset 
 eh.num_resets = 0, num_aborts = 33 
 Sep 19 16:59:25 spindle notice: [3447697.878386] arcmsr0: wait 'abort all 
 outstanding command' timeout 
 Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus 
 reset .
 Sep 19 16:59:25 spindle err: [3447698.287054] irq 16: nobody cared (try 
 booting with the irqpoll option)
 Sep 19 16:59:25 spindle warning: [3447698.287291] Pid: 0, comm: swapper/4 Not 
 tainted 3.5.1-dirty #1
 Sep 19 16:59:25 spindle warning: [3447698.287522] Call Trace:
 Sep 19 16:59:25 spindle warning: [3447698.287754]  IRQ  
 [810af5ba] __report_bad_irq+0x31/0xc2
 Sep 19 16:59:25 spindle warning: [3447698.288031]  [810af84e] 
 note_interrupt+0x16a/0x1e8
 Sep 19 16:59:25 spindle warning: [3447698.288263]  [810ad9d5] 
 handle_irq_event_percpu+0x163/0x1a5
 Sep 19 16:59:25 spindle warning: [3447698.288497]  [810ada4f] 
 handle_irq_event+0x38/0x55
 Sep 19 16:59:25 spindle warning: [3447698.288727]  [810b01a0] 
 handle_fasteoi_irq+0x78/0xab
 Sep 19 16:59:25 spindle warning: [3447698.288960]  [8103631c] 
 handle_irq+0x24/0x2a
 Sep 19 16:59:25 spindle warning: [3447698.289189]  [81036229] 
 do_IRQ+0x4d/0xb4
 Sep 19 16:59:25 spindle warning: [3447698.289419]  [815070e7] 
 common_interrupt+0x67/0x67
 Sep 19 16:59:25 spindle warning: [3447698.289648]  EOI  
 [812ab174] ? acpi_idle_enter_c1+0xcb/0xf2
 Sep 19 16:59:25 spindle warning: [3447698.289919]  [812ab152] ? 
 acpi_idle_enter_c1+0xa9/0xf2
 Sep 19 16:59:25 spindle warning: [3447698.290152]  [813c1446] 
 cpuidle_enter+0x12/0x14
 Sep 19 16:59:25 spindle warning: [3447698.290382]  [813c1902] 
 cpuidle_idle_call+0xc5/0x175
 Sep 19 16:59:25 spindle warning: [3447698.290614]  [8103c2da] 
 cpu_idle+0x5b/0xa5
 Sep 19 16:59:25 spindle warning: [3447698.290844]  [81ad4fcb] 
 start_secondary+0x1a2/0x1a6
 Sep 19 16:59:25 spindle err: [3447698.291074] handlers:
 Sep 19 16:59:25 spindle err: [3447698.291294] [8133b9a3] usb_hcd_irq
 Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16
 Sep 19 16:59:25 spindle err: [3447710.888187] arcmsr0: waiting for hw bus 
 reset return, retry=0
 Sep 19 16:59:25 spindle err: [3447720.882155] arcmsr0: waiting for hw bus 
 reset return, retry=1
 Sep 19 16:59:25 spindle notice: [3447730.896410] Areca RAID Controller0: F/W 
 V1.46 2009-01-06 & Model ARC-1210
 Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi  bus reset eh 
 returns with success
 
 This is the first SCSI (that is, um, ATA) bus reset I have *ever* had on
 this machine, hence my concern. (The IRQ disable we can ignore: it was
 just bad luck that an interrupt destined for the Areca hit after the
 controller had briefly vanished from the PCI bus as part of resetting.)
 
 Now just last week another (surge-protected) machine on the same power
 main as it died without warning with a fried power supply which
 apparently roasted the BIOS and/or other motherboard components before
 it died (the ACPI DSDT was filled with rubbish, and other things must
 have been fried because even with ACPI off Linux wouldn't boot more than
 one time out of a hundred (freezing solid at different places in the
 boot each time). So my worry level when this SCSI bus reset turned up
 today is quite high. It's higher given that the controller logs
 (accessed via the Areca binary-only utility for this purpose) show no
 sign of any problem at all.
 
 EDAC shows no PCI bus problems and no memory problems, so this probably
 *is* the controller.
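
For anyone wanting to re-check that, one possible way to read the EDAC
counters and PCI error bits (assuming an EDAC driver is loaded; the paths
are the standard sysfs ones):

  grep . /sys/devices/system/edac/mc/mc*/ce_count \
         /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null
  lspci -vv | grep -iE 'serr|perr|devsta'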
 
 So... is this a serious problem? Does anyone know if I'm about to lose
 this controller, or indeed machine as well? (I really, really hope not.)
 
 I'd write this off as a spurious problem and not report it at all, but
 I'm jittery as heck after the catastrophic hardware failure last week,
 and when this happens in close proximity, I worry.


Re: O_DIRECT to md raid 6 is slow

2012-08-21 Thread Stan Hoeppner
On 8/21/2012 9:51 AM, Miquel van Smoorenburg wrote:
> On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
>> I'm glad you jumped in David.  You made a critical statement of fact
>> below which clears some things up.  If you had stated it early on,
>> before Miquel stole the thread and moved it to LKML proper, it would
>> have short circuited a lot of this discussion.  Which is:
> 
> I'm sorry about that, that's because of the software that I use to
> follow most mailing lists. I didn't notice that the discussion was cc'ed
> to both lkml and l-r. I should fix that.

Oh, my bad.  I thought it was intentional.

Don't feel too bad about it.  When I tried to copy lkml back in on the
one message I screwed up as well.  I thought Tbird had filled in the full
address but it didn't.

>> Thus my original statement was correct, or at least half correct[1], as
>> it pertained to md/RAID6.  Then Miquel switched the discussion to
>> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
>> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
>> shortcut
> 
> Well, all I tried to say is that a small write of, say, 4K, to a
> raid5/raid6 array does not need to re-write the whole stripe (i.e.
> chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

And I'm glad you did.  Before that I didn't know about these efficiency
shortcuts and exactly how md does writeback on partial stripe updates.

Even with these optimizations, a default 512KB chunk is too big, for the
reasons I stated, the big one being the fact that you'll rarely fill a
full stripe, meaning nearly every write will incur an RMW cycle.
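
To put rough numbers on that (assuming a 6-drive RAID6, i.e. 4 data
spindles; a sketch, not from the thread):

  # full-stripe size = chunk * data_disks; anything smaller is an RMW candidate
  for chunk_kb in 512 32; do
      echo "chunk=${chunk_kb}K  full stripe=$(( chunk_kb * 4 ))K"
  done
  # -> 2048K vs 128K: small journal/metadata writeouts fill the latter far more often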

-- 
Stan


Re: O_DIRECT to md raid 6 is slow

2012-08-19 Thread Stan Hoeppner
On 8/19/2012 9:01 AM, David Brown wrote:
> I'm sort of jumping in to this thread, so my apologies if I repeat
> things other people have said already.

I'm glad you jumped in David.  You made a critical statement of fact
below which clears some things up.  If you had stated it early on,
before Miquel stole the thread and moved it to LKML proper, it would
have short circuited a lot of this discussion.  Which is:

> AFAIK, there is scope for a few performance optimisations in raid6.  One
> is that for small writes which only need to change one block, raid5 uses
> a "short-cut" RMW cycle (read the old data block, read the old parity
> block, calculate the new parity block, write the new data and parity
> blocks).  A similar short-cut could be implemented in raid6, though it
> is not clear how much a difference it would really make.

Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6.  Then Miquel switched the discussion to
md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
Chinner.  I was simply unaware of this md/RAID5 single block write RMW
shortcut.  I'm copying lkml proper on this simply to set the record
straight.  Not that anyone was paying attention, but it needs to be in
the same thread in the archives.  The takeaway:

md/RAID6 must read all devices in a RMW cycle.

md/RAID5 takes a shortcut for single block writes, and must only read
the one data drive plus the parity drive for the RMW cycle.

[1] The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.
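
The raid5 short-cut works because single parity is a plain XOR, so it can
be patched from the old data and old parity alone. A toy illustration with
made-up byte values (this covers RAID5's P parity only; RAID6's Q parity is
Reed-Solomon, not a simple XOR):

  old_data=0x3c; new_data=0x5a; old_parity=0x99
  new_parity=$(( old_parity ^ old_data ^ new_data ))    # P' = P ^ Dold ^ Dnew
  printf 'new parity byte: 0x%02x\n' "$new_parity"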

-- 
Stan


Re: O_DIRECT to md raid 6 is slow

2012-08-17 Thread Stan Hoeppner
On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
> On 16-08-12 1:05 PM, Stan Hoeppner wrote:
>> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>>> to read that 4K block, and the corresponding 4K block on the
>>> parity drive, recalculate parity, and write back 4K of data and 4K
>>> of parity. (read|read) modify (write|write). You do not have to
>>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>>
>> See:  http://www.spinics.net/lists/xfs/msg12627.html
>>
>> Dave usually knows what he's talking about, and I didn't see Neil nor
>> anyone else correcting him on his description of md RMW behavior.
> 
> Well he's wrong, or you're interpreting it incorrectly.
> 
> I did a simple test:
> 
> * created a 1G partition on 3 separate disks
> * created a md raid5 array with 512K chunksize:
>   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
> /dev/sdd1
> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
> * wrote a single 4K block:
>   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
> 
> Output from iostat over the period in which the 4K write was done. Look
> at kB read and kB written:
> 
> Device:           tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb1             0.60         0.00         1.60          0          8
> sdc1             0.60         0.80         0.80          4          4
> sdd1             0.60         0.00         1.60          0          8
> 
> As you can see, a single 4K read, and a few writes. You see a few blocks
> more written that you'd expect because the superblock is updated too.

I'm no dd expert, but this looks like you're simply writing a 4KB block
to a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state.  So it doesn't appear this test is going to
trigger RMW.  Don't you now need to do another write in the same
stripe to trigger RMW?  Maybe I'm just reading this wrong.
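
A possible follow-up along those lines (an untested sketch on Miquel's same
toy array): populate the stripe first, then rewrite the same block and watch
for the extra reads an RMW would add.

  # in another terminal, as before:  iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1
  dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0   # populate
  dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0   # rewrite same block
  # if a read-modify-write occurs, the second write should show a read on the
  # data disk holding that block and on the parity disk for that stripe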

-- 
Stan


Re: O_DIRECT to md raid 6 is slow

2012-08-16 Thread Stan Hoeppner
On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article  you write:
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See:  http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior.  What
I stated above is pretty much exactly what Dave stated, but for the fact
I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive md/RAID6
and 5MB/6MB for 12 drives.

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.
[snip]
> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K..

You're forgetting 3 very important things:

1.  All filesystems have metadata
2.  All (worth using) filesystems have a metadata journal
3.  All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and 512KB chunk especially as the drive count in the
array increases.  Rarely does a filesystem pack enough journal
operations into a single writeout to fill a 512KB stripe, let alone a
4MB stripe.  With a 32KB chunk you see full stripe width journal writes
frequently, minimizing the number of RMW writes to the journal, even up
to 16 data spindle parity arrays (18 drive RAID6).   Using a 512KB chunk
will cause most journal writes to be partial stripe writes, triggering
RMW for most journal writes.  The same is true for directory metadata
writes.

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata.  With most/all other workloads, using a
large chunk size, such as the md metadata 1.2 default of 512KB, with
parity RAID, simply makes it much worse, whether the RMW cycle affects
all disks or just one data disk and one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB.  You'll be much happier with real
>> workloads.
> 
> Aligning is a good idea, 

Understatement of the century.  Just as critical, if not more so: FS
stripe alignment is mandatory with parity RAID, otherwise even a full
stripe writeout can (and will) trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard.  A small chunk size is
optimal for nearly all workloads on a parity array for the reasons I
stated above.  It's the large chunk that is extremely workload
dependent, as again, it only fits well with low metadata streaming
workloads.
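
As a hedged sketch of what "aligned, small chunk" can look like in practice
for a 6-disk RAID6 (device names and the 32K figure are only illustrative):

  parted -s /dev/sdb mklabel gpt mkpart primary 1MiB 100%   # 1MiB-aligned partition
  # (repeat the partitioning for sdc..sdg)
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=32 /dev/sd[b-g]1
  # tell XFS the geometry: su = chunk size, sw = data disks (6 - 2 parity = 4)
  mkfs.xfs -d su=32k,sw=4 /dev/md0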

-- 
Stan


Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Stan Hoeppner
On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner  wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>>  wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>
>>>> [...]
>>>>
>>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>>> I'm in O_DIRECT mode.
>>>>
>>>>
>>>> I see your md device is partitioned. Is the partition itself 
>>>> stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>>   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [6/6] [UU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Grr.  I thought the bad old days of filesystem and related defaults
> sucking were over.  

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default.  Obviously something went horribly wrong here.  512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

> cryptsetup aligns sanely these days, xfs is
> sensible, etc.  

XFS won't align with the 512KB chunk default of metadata 1.2.  The
largest XFS journal stripe unit (log su, which normally follows the
chunk size) is 256KB, and even that isn't recommended.  Thus mkfs.xfs
complains about the 512KB stripe.  See the md and xfs archives for more
details, specifically Dave Chinner's colorful comments on the md 512KB
default.
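
If someone is stuck with the 512KB chunk, here's a sketch of spelling
out the data geometry while keeping the log stripe unit small (the
6-drive RAID6 and the device name are assumptions, hence sw=4):

  # data stripe unit matches the 512KB chunk; log stripe unit held to 32KB
  mkfs.xfs -d su=512k,sw=4 -l su=32k /dev/md0p1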

> wtf?  Why is there no sensible filesystem for
> huge disks?  zfs can't cp --reflink and has all kinds of source
> availability and licensing issues, xfs can't dedupe at all, and btrfs
> isn't nearly stable enough.

Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
these are the only two to offer a native dedupe capability.  They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

Always one somewhere.

-- 
Stan



Re: O_DIRECT to md raid 6 is slow

2012-08-15 Thread Stan Hoeppner
On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>  wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>>
>> I see your md device is partitioned. Is the partition itself stripe-aligned?
> 
> Crud.
> 
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>   11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [6/6] [UU]
> 
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.

It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count.

It seems people tend to use large chunk sizes because array
initialization is a bit faster, and running block x-fer "tests" with dd
buffered sequential reads/writes makes their Levi's expand.  Then they
are confused when their actual workloads are horribly slow.

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB.  You'll be much happier with real
workloads.
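
A sketch of what that looks like, reusing the 6-drive layout from your
/proc/mdstat above (this destroys the existing data, and the device
names are of course yours to adjust):

  # recreate the array with a 32KB chunk
  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=32 \
        /dev/sd[b-g]1
  # repartition /dev/md0 with the partition start aligned to the full
  # stripe, then tell XFS the geometry: su = chunk, sw = data spindles
  mkfs.xfs -d su=32k,sw=4 /dev/md0p1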

-- 
Stan


Re: An Andre To Remember

2012-07-29 Thread Stan Hoeppner
On 7/28/2012 7:11 PM, Nicholas A. Bellinger wrote:
> On Fri, 2012-07-27 at 13:56 -0400, Jeff Garzik wrote:
>>  An Andre To Remember
>>  July 2012
>>
>> Linux lost a friend and advocate this month.  Though never a household
>> name, Andre Hedrick had a positive impact on everyone today running
>> Linux, or using a website, with any form of IDE (ATA) or SCSI storage
>> -- that means millions upon millions of users today.
>>
>> For a time, Andre interacted with practically every relevant IDE
>> drive and controller manufacturer, as well as the T13 standards
>> committee through which IDE changes were made.  He helped ensure
>> Linux had near-universal IDE support in a hardware era when Linux
>> support was a second thought if at all.  As the Register article[1]
>> noted, with CPRM and other efforts, Andre worked to keep storage a
>> more open platform than it might otherwise have been.
>>
>> [1] http://www.theregister.co.uk/2012/07/26/andre_hedrick/
>>
>> Andre also played a role in IDE technology coalescing around the idea
>> of a "taskfile", which is IDE-speak for an RPC command issued to a
>> disk drive, and the RPC response returned from the drive.  It was
>> very important to Andre that the kernel have a "taskfile ioctl",
>> an API enabling full programmable access to the disk drive.  At the
>> time, a more limited "cmd ioctl" API was the best option available,
>> but Linux's cmd ioctl did not give users full and complete access to
>> their own disk drive.
>>
>> Andre's taskfile concept was a central component of the current,
>> rewritten-from-scratch Linux IDE driver "libata."  libata uses an
>> "ata_taskfile" to communicate with all IDE drives, whether from a
>> decade ago or built yesterday.  The taskfile concept modernized
>> IDE software, by forcing the industry to move away from a slow,
>> signals-originated register API to a modern, packetized RPC messaging
>> API, similar to where SCSI storage had already been moving.
>>
>> I spent many hours on the phone with Andre, circa 2003, learning all
>> there was to know about ATA storage, while writing libata.  Andre could
>> be considered one of the grandfathers of libata, along with Alan Cox.
>> I became friends with Andre during this time, and we talked a lot.
>>
>> Andre was unquestionably smart, driven and an advocate for Linux user
>> freedom.
>>
> 
> Hi Jeff,
> 
> Thank you for sharing your thoughts + memories of Andre.
> 
> As we grieve this extreme loss, I'd like to try to share some of my own
> experiences with Andre that will hopefully help others to begin to
> understand the kind + generous type of person that Andre really was, and
> just some of his staggering technical feats + accomplishments that can
> be talked about publicly today.
> 
> Along with Andre's involvement in the history of libata and IDE/ATA
> development, those of us in the Linux kernel storage development
> community also know that he was instrumental in the creation of the
> original out-of-tree PyX iSCSI target code that's now in mainline.
> 
> In summer 2002, I was sitting next to Andre when he coined the term 'IBLOCK'
> after drawing a rough sketch of an idea in a notebook in Walnut
> Creek, California, and the name ended up sticking..   The interesting
> development bits really started to unfold in the spring of 2004 when we
> finally managed to get drivers/ide/ export working with iscsi-target on
> x86 using 2.4.x code.  
> 
> That quickly unfolded into a Sony Playstation-2 (MIPS EE) port using IDE
> disk DMA mode + network PIO on 2.2.x era kernel code capable of
> streaming multiple DVD quality streams to hungry iSCSI clients..
> 
> Left to my own devices for hardware hacking, I managed to turn our first
> disassembled PS2 into a broken parts machine (whoops), but Andre made
> sure that it was not going to happen again..  I bought
> another PS2, and he was the person who soldered wires to the handful of
> tiny via pin-outs to access the one-way serial output for EE boot
> information late at night, while I worked on the necessary kernel bits
> needed for bring-up of the PS2 specific IDE backend target driver.  (The
> PS2 IDE driver required contiguous memory for IDE DMA ops to function
> via a single struct buffer_head (TCQ=1) on the non-cache coherent MIPS
> based platform.)
> 
> He carefully made physical space in the machine's cramped chassis, using
> sticky pads where necessary to hold the small PCB containing a simple
> ASIC doing the conversion of the signal into PC RS-232 serial output.
> He made it look completely flush, like exactly how it was supposed to
> come from the factory.  Or you know, from the magical place near the old
> Bell Labs R&D center where new development kits for cutting edge tech
> are born.
> 
> CBS Sunday Morning even did a story on Andre and his family in the
> summer of 2004 while all of this was going on..  Not for the PS2
> iscsi-target or any other code of course, but for the fact that he was
> chosen by EBay to represent California small business as part of a group
> that lobbied in Washington DC.  The reason that E-bay chose Andre is
> because he built PyX
