Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, Bill Davidsen wrote:

By delaying parity computation until the first write to a stripe, only the 
growth of a filesystem is slowed, and all data are protected without waiting 
for the lengthy check. The rebuild speed can be set very low, because 
on-demand rebuild will do most of the work.


 I'm very much for the fs layer reading the lower block structure so I
 don't have to fiddle with arcane tuning parameters - yes, *please* help
 make xfs self-tuning!

 Keeping life as straightforward as possible low down makes the upwards
 interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple because it 
rests on a simple device, even if the "simple device" is provided by LVM or 
md. And LVM and md can stay simple because they rest on simple devices, even 
if they are provided by PATA, SATA, nbd, etc. Independent layers make each 
layer more robust. If you want to compromise the layer separation, some 
approach like ZFS with full integration would seem to be promising. Note that 
layers allow specialized features at each point, trading integration for 
flexibility.


My feeling is that full integration and independent layers each have 
benefits, as you connect the layers to expose operational details you need to 
handle changes in those details, which would seem to make layers more 
complex. What I'm looking for here is better performance in one particular 
layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel 
that the current performance suggests room for improvement.


they both have benefits, but it shouldn't have to be either-or

if you build the separate layers and provide ways for the upper 
layers to query the lower layers to find out what's efficient, then you can 
have some upper layers that don't care about this and treat the lower 
layer as a simple block device, while other upper layers find out what 
sort of things are more efficient to do and use the same lower layer in a 
more complex manner


the alternative is to duplicate effort (and code) to have two codebases 
that try to do the same thing, one stand-alone, and one as a part of an 
integrated solution (and it gets even worse if there end up being multiple 
integrated solutions)


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-22 Thread David Greaves

Bill Davidsen wrote:

David Greaves wrote:

[EMAIL PROTECTED] wrote:

On Fri, 22 Jun 2007, David Greaves wrote:
If you end up 'fiddling' in md because someone specified 
--assume-clean on a raid5 [in this case just to save a few minutes' 
*testing time* on a system with a heavily choked bus!] then that adds 
*even more* complexity and exception cases into all the stuff you 
described.


A "few minutes?" Are you reading the times people are seeing with 
multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB... three days. 

Yes. But we are talking initial creation here.

And as soon as you believe that the array is actually "usable" you cut 
that rebuild rate, perhaps in half, and get dog-slow performance from 
the array. It's usable in the sense that reads and writes work, but for 
useful work it's pretty painful. You either fail to understand the 
magnitude of the problem or wish to trivialize it for some reason.

I do understand the problem and I'm not trying to trivialise it :)

I _suggested_ that it's worth thinking about things rather than jumping in to 
say "oh, we can code up a clever algorithm that keeps track of what stripes have 
valid parity and which don't and we can optimise the read/copy/write for valid 
stripes and use the raid6 type read-all/write-all for invalid stripes and then 
we can write a bit extra on the check code to set the bitmaps.."


Phew - and that lets us run the array at semi-degraded performance (raid6-like) 
for 3 days rather than either waiting before we put it into production or 
running it very slowly.

Now we run this system for 3 years and we saved 3 days - hmmm IS IT WORTH IT?

What happens in those 3 years when we have a disk fail? The solution doesn't 
apply then - it's 3 days to rebuild - like it or not.


By delaying parity computation until the first write to a stripe, only 
the growth of a filesystem is slowed, and all data are protected without 
waiting for the lengthy check. The rebuild speed can be set very low, 
because on-demand rebuild will do most of the work.

I am not saying you are wrong.
I ask merely if the balance of benefit outweighs the balance of complexity.

If the benefit were 24x7 then sure - eg using hardware assist in the raid calcs 
- very useful indeed.


I'm very much for the fs layer reading the lower block structure so I 
don't have to fiddle with arcane tuning parameters - yes, *please* 
help make xfs self-tuning!


Keeping life as straightforward as possible low down makes the upwards 
interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple 
because it rests on a simple device, even if the "simple device" is 
provided by LVM or md. And LVM and md can stay simple because they rest 
on simple devices, even if they are provided by PATA, SATA, nbd, etc. 
Independent layers make each layer more robust. If you want to 
compromise the layer separation, some approach like ZFS with full 
integration would seem to be promising. Note that layers allow 
specialized features at each point, trading integration for flexibility.


That's a simplistic summary.
You *can* loosely couple the layers. But you can enrich the interface and 
tightly couple them too - XFS is capable (I guess) of understanding md more 
fully than say ext2.
XFS would still work on a less 'talkative' block device where performance wasn't 
as important (USB flash maybe, dunno).



My feeling is that full integration and independent layers each have 
benefits, as you connect the layers to expose operational details you 
need to handle changes in those details, which would seem to make layers 
more complex.

Agreed.

What I'm looking for here is better performance in one 
particular layer, the md RAID5 layer. I like to avoid unnecessary 
complexity, but I feel that the current performance suggests room for 
improvement.


I agree there is room for improvement.
I suggest that it may be more fruitful to write a tool called "raid5prepare"
that writes zeroes/ones as appropriate to all component devices and then you can 
use --assume-clean without concern. That could look to see if the devices are 
scsi or whatever and take advantage of the hyperfast block writes that can be done.


David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-22 Thread Bill Davidsen

David Greaves wrote:

[EMAIL PROTECTED] wrote:

On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings 
- and then consider the impact and exceptions when you do, eg 
hardware acceleration? md information fed up to the fs layer for 
xfs? simple long term maintenance?


Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to "just say no" :)
so for several reasons I don't see this as something that's deserving 
of an automatic 'no'


David Lang


Err, re-read it, I hope you'll see that I agree with you - I actually 
just meant the --assume-clean workaround stuff :)


If you end up 'fiddling' in md because someone specified 
--assume-clean on a raid5 [in this case just to save a few minutes' 
*testing time* on a system with a heavily choked bus!] then that adds 
*even more* complexity and exception cases into all the stuff you 
described.


A "few minutes?" Are you reading the times people are seeing with 
multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB... three days. 
And as soon as you believe that the array is actually "usable" you cut 
that rebuild rate, perhaps in half, and get dog-slow performance from 
the array. It's usable in the sense that reads and writes work, but for 
useful work it's pretty painful. You either fail to understand the 
magnitude of the problem or wish to trivialize it for some reason.
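
To put rough numbers on that (taking the 20MB/s rebuild rate as sustained 
across the whole array):

    5 TB / (20 MB/s) = 5,000,000 MB / (20 MB/s) = 250,000 s, roughly 69 hours, about 2.9 days

and halving the rebuild rate so the array stays responsive pushes that out 
towards six days.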


By delaying parity computation until the first write to a stripe, only 
the growth of a filesystem is slowed, and all data are protected without 
waiting for the lengthy check. The rebuild speed can be set very low, 
because on-demand rebuild will do most of the work.


I'm very much for the fs layer reading the lower block structure so I 
don't have to fiddle with arcane tuning parameters - yes, *please* 
help make xfs self-tuning!


Keeping life as straightforward as possible low down makes the upwards 
interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple 
because it rests on a simple device, even if the "simple device" is 
provided by LVM or md. And LVM and md can stay simple because they rest 
on simple devices, even if they are provided by PATA, SATA, nbd, etc. 
Independent layers make each layer more robust. If you want to 
compromise the layer separation, some approach like ZFS with full 
integration would seem to be promising. Note that layers allow 
specialized features at each point, trading integration for flexibility.


My feeling is that full integration and independent layers each have 
benefits, as you connect the layers to expose operational details you 
need to handle changes in those details, which would seem to make layers 
more complex. What I'm looking for here is better performance in one 
particular layer, the md RAID5 layer. I like to avoid unnecessary 
complexity, but I feel that the current performance suggests room for 
improvement.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-22 Thread David Greaves

[EMAIL PROTECTED] wrote:

On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings - 
and then consider the impact and exceptions when you do, eg hardware 
acceleration? md information fed up to the fs layer for xfs? simple 
long term maintenance?


Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to "just say no" :)
so for several reasons I don't see this as something that's deserving of 
an automatic 'no'


David Lang


Err, re-read it, I hope you'll see that I agree with you - I actually just meant 
the --assume-clean workaround stuff :)


If you end up 'fiddling' in md because someone specified --assume-clean on a 
raid5 [in this case just to save a few minutes' *testing time* on a system with 
a heavily choked bus!] then that adds *even more* complexity and exception cases 
into all the stuff you described.


I'm very much for the fs layer reading the lower block structure so I don't have 
to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning!


Keeping life as straightforward as possible low down makes the upwards interface 
more manageable and that goal more realistic...


David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings - and 
then consider the impact and exceptions when you do, eg hardware 
acceleration? md information fed up to the fs layer for xfs? simple long term 
maintenance?


Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to "just say no" :)


In this case I think a higher level system knowing which block sizes are 
efficient for writes/reads could potentially be a HUGE advantage.


if the upper levels know that you have a 6 disk raid 6 array with a 64K 
chunk size, then reads and writes in 256K chunks (aligned) should be able 
to be done at basically the speed of a 4 disk raid 0 array.
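
to make that arithmetic concrete, here's a minimal user-space C sketch (the 
helper names and constants are made up for illustration, this is not md or 
filesystem code) that derives the 256K full stripe from the 6-disk/64K-chunk 
example and rounds an I/O request out to full-stripe boundaries:

#include <stdio.h>
#include <stdint.h>

/* For RAID6, two of the disks in each stripe hold P and Q, so a full
 * stripe spans (ndisks - 2) data chunks. */
static uint64_t full_stripe_bytes(unsigned int ndisks, uint64_t chunk_bytes)
{
        return (uint64_t)(ndisks - 2) * chunk_bytes;
}

/* Round [offset, offset+len) outward to full-stripe boundaries so the
 * RAID layer can write whole stripes without read-modify-write. */
static void stripe_align(uint64_t offset, uint64_t len, uint64_t stripe,
                         uint64_t *aligned_off, uint64_t *aligned_len)
{
        uint64_t start = (offset / stripe) * stripe;
        uint64_t end = ((offset + len + stripe - 1) / stripe) * stripe;

        *aligned_off = start;
        *aligned_len = end - start;
}

int main(void)
{
        uint64_t stripe = full_stripe_bytes(6, 64 * 1024);  /* 256K */
        uint64_t off, len;

        stripe_align(1052672, 300 * 1024, stripe, &off, &len);
        printf("full stripe %lluK, aligned I/O [%llu, +%llu)\n",
               (unsigned long long)(stripe / 1024),
               (unsigned long long)off, (unsigned long long)len);
        return 0;
}

an upper layer that issued its large I/Os on those boundaries would avoid 
the read-modify-write penalty entirely.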


what's even more impressive is that this could be done even if the array 
is degraded (if you know which drives have failed you don't even try to read 
from them, and you only have to reconstruct the missing info once per 
stripe)


the current approach doesn't give the upper levels any chance to operate 
in this mode; they just don't have enough information to do so.


the part about wanting to know the raid 0 chunk size, so that the upper layers 
can be sure that data that's supposed to be redundant ends up on separate 
drives, is also possible


storage technology is headed in the direction of having the system do more 
and more of the layout decisions, and re-stripe the array as conditions 
change (similar to what md can already do with enlarging raid5/6 arrays), 
but unless you want to eventually put all that decision logic into the md 
layer, you should make it possible for other layers to query what's 
what and then give directions for what they want to have happen.


so for several reasons I don't see this as something that's deserving of 
an automatic 'no'


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-22 Thread David Greaves

Neil Brown wrote:

On Thursday June 21, [EMAIL PROTECTED] wrote:
I didn't get a comment on my suggestion for a quick and dirty fix for 
--assume-clean issues...


Bill Davidsen wrote:
How about a simple solution which would get an array on line and still 
be safe? All it would take is a flag which forced reconstruct writes 
for RAID-5. You could set it with an option, or automatically if 
someone puts --assume-clean with --create, leave it in the superblock 
until the first "repair" runs to completion. And for repair you could 
make some assumptions about bad parity not being caused by error but 
just unwritten.


It is certainly possible, and probably not a lot of effort.  I'm not
really excited about it though.

So if someone were to submit a patch that did the right stuff, I would
probably accept it, but I am unlikely to do it myself.


Thought 2: I think the unwritten bit is easier than you think, you 
only need it on parity blocks for RAID5, not on data blocks. When a 
write is done, if the bit is set do a reconstruct, write the parity 
block, and clear the bit. Keeping a bit per data block is madness, and 
appears to be unnecessary as well.
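
For what it's worth, here is a tiny user-space toy model of the write path 
being proposed (the structure, constants and helpers are invented for 
illustration, none of this is md code): one "parity unwritten" bit per 
stripe, checked on the first write to that stripe to choose a reconstruct 
write over the normal read-modify-write:

#include <stdio.h>
#include <string.h>

#define NDATA 4        /* data disks in one RAID5 stripe (toy value) */
#define CHUNK 8        /* chunk size in bytes (toy value)            */

/* Toy model of one stripe: NDATA data chunks, one parity chunk, plus
 * the proposed "parity unwritten" bit. */
struct stripe {
        unsigned char data[NDATA][CHUNK];
        unsigned char parity[CHUNK];
        int parity_unwritten;
};

/* Parity is the XOR of all data chunks. */
static void reconstruct_parity(struct stripe *s)
{
        memset(s->parity, 0, CHUNK);
        for (int d = 0; d < NDATA; d++)
                for (int i = 0; i < CHUNK; i++)
                        s->parity[i] ^= s->data[d][i];
}

static void write_chunk(struct stripe *s, int disk, const unsigned char *buf)
{
        if (s->parity_unwritten) {
                /* First write to this stripe: don't trust the on-disk
                 * parity, reconstruct it from the data and clear the bit. */
                memcpy(s->data[disk], buf, CHUNK);
                reconstruct_parity(s);
                s->parity_unwritten = 0;
        } else {
                /* Normal RAID5 update: parity ^= old_data ^ new_data. */
                for (int i = 0; i < CHUNK; i++)
                        s->parity[i] ^= s->data[disk][i] ^ buf[i];
                memcpy(s->data[disk], buf, CHUNK);
        }
}

int main(void)
{
        struct stripe s = { .parity_unwritten = 1 };
        unsigned char buf[CHUNK] = "data123";

        write_chunk(&s, 0, buf);
        printf("parity_unwritten after first write: %d\n", s.parity_unwritten);
        return 0;
}

The open questions that follow (where to store the bits, how many to cache) 
are about the bookkeeping around this check, not the check itself.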


Where do you propose storing those bits?  And how many would you cache
in memory?  And what performance hit would you suffer for accessing
them?  And would it be worth it?


Sometimes I think one of the problems with Linux is that it tries to do 
everything for everyone.


That's not a bad thing - until you look at the complexity it brings - and then 
consider the impact and exceptions when you do, eg hardware acceleration? md 
information fed up to the fs layer for xfs? simple long term maintenance?


Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to "just say no" :)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Neil Brown
On Thursday June 21, [EMAIL PROTECTED] wrote:
> I didn't get a comment on my suggestion for a quick and dirty fix for 
> --assume-clean issues...
> 
> Bill Davidsen wrote:
> > How about a simple solution which would get an array on line and still 
> > be safe? All it would take is a flag which forced reconstruct writes 
> > for RAID-5. You could set it with an option, or automatically if 
> > someone puts --assume-clean with --create, leave it in the superblock 
> > until the first "repair" runs to completion. And for repair you could 
> > make some assumptions about bad parity not being caused by error but 
> > just unwritten.

It is certainly possible, and probably not a lot of effort.  I'm not
really excited about it though.

So if someone were to submit a patch that did the right stuff, I would
probably accept it, but I am unlikely to do it myself.


> >
> > Thought 2: I think the unwritten bit is easier than you think, you 
> > only need it on parity blocks for RAID5, not on data blocks. When a 
> > write is done, if the bit is set do a reconstruct, write the parity 
> > block, and clear the bit. Keeping a bit per data block is madness, and 
> > appears to be unnecessary as well.

Where do you propose storing those bits?  And how many would you cache
in memory?  And what performance hit would you suffer for accessing
them?  And would it be worth it?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Bill Davidsen
I didn't get a comment on my suggestion for a quick and dirty fix for 
--assume-clean issues...


Bill Davidsen wrote:

Neil Brown wrote:

On Thursday June 14, [EMAIL PROTECTED] wrote:
 

it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when making a new array? couldn't 
it just zero all the drives instead? (or better still just record 
most of the space as 'unused' and initialize it as it starts using 
it?)



Yes, it could zero all the drives first.  But that would take the same
length of time (unless p/q generation was very very slow), and you
wouldn't be able to start writing data until it had finished.
You can "dd" /dev/zero onto all drives and then create the array with
--assume-clean if you want to.  You could even write a shell script to
do it for you.

Yes, you could record which space is used vs unused, but I really
don't think the complexity is worth it.

  
How about a simple solution which would get an array on line and still 
be safe? All it would take is a flag which forced reconstruct writes 
for RAID-5. You could set it with an option, or automatically if 
someone puts --assume-clean with --create, leave it in the superblock 
until the first "repair" runs to completion. And for repair you could 
make some assumptions about bad parity not being caused by error but 
just unwritten.


Thought 2: I think the unwritten bit is easier than you think, you 
only need it on parity blocks for RAID5, not on data blocks. When a 
write is done, if the bit is set do a reconstruct, write the parity 
block, and clear the bit. Keeping a bit per data block is madness, and 
appears to be unnecessary as well.
while I consider zfs to be ~80% hype, one advantage it could have 
(but I don't know if it has) is that since the filesystem and raid 
are integrated into one layer they can optimize the case where files 
are being written onto unallocated space and instead of reading 
blocks from disk to calculate the parity they could just put zeros 
in the unallocated space, potentially speeding up the system by 
reducing the amount of disk I/O.



Certainly.  But the raid doesn't need to be tightly integrated
into the filesystem to achieve this.  The filesystem need only know
the geometry of the RAID and when it comes to write, it tries to write
full stripes at a time.  If that means writing some extra blocks full
of zeros, it can try to do that.  This would require a little bit
better communication between filesystem and raid, but not much.  If
anyone has a filesystem that they want to be able to talk to raid
better, they need only ask...
 
 
is there any way that linux would be able to do this sort of thing? 
or is it impossible due to the layering preventing the necessary 
knowledge from being in the right place?



Linux can do anything we want it to.  Interfaces can be changed.  All
it takes is a fairly well defined requirement, and the will to make it
happen (and some technical expertise, and lots of time  and
coffee?).
  
Well, I gave you two thoughts, one which would be slow until a repair 
but sounds easy to do, and one which is slightly harder but works 
better and minimizes performance impact.





--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Nix
On 21 Jun 2007, Neil Brown stated:
> I have that - apparently naive - idea that drives use strong checksum,
> and will never return bad data, only good data or an error.  If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them

Look at the section `Disks and errors' in Val Henson's excellent report
on last year's filesystems workshop: .
Most of the error modes given there lead to valid checksums and wrong
data...

(while you're there, read the first part too :) )

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Martin K. Petersen
> "Mattias" == Mattias Wadenstein <[EMAIL PROTECTED]> writes:

Mattias> In theory, that's how storage should work. In practice,
Mattias> silent data corruption does happen. If not from the disks
Mattias> themselves, somewhere along the path of cables, controllers,
Mattias> drivers, buses, etc. If you add in fcal, you'll get even more
Mattias> sources of failure, but usually you can avoid SANs (if you
Mattias> care about your data).

Oracle cares a lot about people's data 8).  And we've seen many cases
of silent data corruption.  Often the problem goes unnoticed for
months.  And by the time you find out about it you may have gone
through your backup cycle so the data is simply lost.

The Oracle database in combination with certain high-end arrays
supports a technology called HARD (Hardware Assisted Resilient Data)
which allows the array front end to verify the integrity of an I/O
before committing it to disk.  The downside to HARD is that it's
proprietary and only really high-end customers use it (many
enterprises actually mandate HARD).

A couple of years ago some changes started to trickle into the SCSI
Block Commands spec.  And as some of you know I've been working on
implementing support for this Data Integrity Field in Linux.

What DIF allows you to do is to attach some integrity metadata to an
I/O.  We can attach this metadata all the way up in the userland
application context where the risk of corruption is relatively small.
The metadata passes all the way through the I/O stack, gets verified
by the HBA firmware, through the fabric, gets verified by the array
front end and finally again by the disk drive before the change is
committed to platter.  Any discrepancy will cause the I/O to be
failed.  And thanks to the intermediate checks you also get fault
isolation.

The DIF integrity metadata contains a CRC of the data block as well as
a reference tag that (for Type 1) needs to match the target sector on
disk.  This way the common problem of misdirected writes can be
alleviated.
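
For reference, that integrity metadata is an 8-byte tuple appended to each 
512-byte sector, roughly as sketched below (field names are illustrative, 
not the kernel's; the on-wire fields are big-endian):

#include <stdint.h>

struct dif_tuple {
        uint16_t guard_tag;   /* CRC of the 512 bytes of sector data      */
        uint16_t app_tag;     /* application/owner defined                */
        uint32_t ref_tag;     /* Type 1: low 32 bits of the target LBA,
                                 so a misdirected write fails the check   */
};

The guard tag catches corruption of the data in flight; the reference tag is 
what catches the misdirected-write case described above.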

Initially, DIF is going to be offered in the FC/SAS space.  But I
encourage everybody to lean on their SATA drive manufacturer of choice
and encourage them to provide a similar functionality on consumer or
at the very least nearline drives.


Note there's a difference between FS checksums and DIF.  Filesystem
checksums (plug: http://oss.oracle.com/projects/btrfs/) allows the
filesystem to detect that it read something bad.  And as discussed
earlier we can potentially retry the read from another mirror or
reconstruct in the case of RAID5/6.

DIF, however, is a proactive technology.  It prevents bad stuff from
being written to disk in the first place.  You'll know right away when
corruption happens, not 4 months later when you try to read the data
back.

So DIF and filesystem checksumming go hand in hand in preventing data
corruption...

-- 
Martin K. Petersen  Oracle Linux Engineering

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Mark Lord

[EMAIL PROTECTED] wrote:

On Thu, 21 Jun 2007, David Chinner wrote:


On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.


one of the 'killer features' of zfs is that it does checksums of every 
file on disk. so many people don't consider the disk infallible.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption


No, all of those checksums are to detect *filesystem* corruption,
not device corruption (a mere side-effect).

as david C points out there are many points in the path where the data 
could get corrupted besides on the platter.


Yup, that too.

But drives either return good data, or an error.

Cheers
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread david

On Thu, 21 Jun 2007, Mattias Wadenstein wrote:


On Thu, 21 Jun 2007, Neil Brown wrote:


 I have that - apparently naive - idea that drives use strong checksum,
 and will never return bad data, only good data or an error.  If this
 isn't right, then it would really help to understand what the cause of
 other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along the 
path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll 
get even more sources of failure, but usually you can avoid SANs (if you care 
about your data).


heh, the pitch I get from the self-proclaimed experts is that if you care 
about your data you put it on the san (so you can take advantage of the 
more expensive disk arrays, various backup advantages, and replication 
features that tend to be focused on the san because it's a big target)


David Lang


Well, here are a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on disk. 
With no error condition at all. (I've also heard from a colleague about this 
on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've also 
had a few broken filesystems due to in-memory corruption due to bad ram, not 
sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Justin Piszcz



On Thu, 21 Jun 2007, Mattias Wadenstein wrote:


On Thu, 21 Jun 2007, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along the 
path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll 
get even more sources of failure, but usually you can avoid SANs (if you care 
about your data).


Well, here are a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on disk. 
With no error condition at all. (I've also heard from a colleague about this 
on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've also 
had a few broken filesystems due to in-memory corruption due to bad ram, not 
sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Very interesting slides/presentation, going to watch it shortly.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Mattias Wadenstein

On Thu, 21 Jun 2007, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along 
the path of cables, controllers, drivers, buses, etc. If you add in fcal, 
you'll get even more sources of failure, but usually you can avoid SANs 
(if you care about your data).


Well, here are a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on 
disk. With no error condition at all. (I've also heard from a colleague 
about this on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've 
also had a few broken filesystems due to in-memory corruption due to bad 
ram, not sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread David Chinner
On Thu, Jun 21, 2007 at 04:39:36PM +1000, David Chinner wrote:
> FWIW, I don't think this really removes the need for a filesystem to
> be able to keep multiple copies of stuff about. If the copy(s) on a
> device are gone, you've still got to have another copy somewhere
> else to get it back...

Speaking of knowing where you can safely put multiple copies, I'm in
the process of telling XFS about linear alignment of the underlying
array so that we can:

- spread out the load across it faster.
- provide workload isolation
- know where *not* to put duplicate or EDAC data

I'm aiming at identical subvolumes so it's simple to implement.  All
I need to know is the size of each subvolume. I can supply that at
mkfs time or in a mount option, but I want something that works
automatically so I need to query dm to find out the size of each
underlying device during mount.  We should also pass stripe
unit/width with the same interface while we are at it...

What's the correct and safe way to get this information from dm
both into the kernel and out to userspace (mkfs)?

FWIW, my end goal is to be able to map the underlying block device
address spaces directly into the filesystem so that the filesystem
is able to use the underlying devices intelligently and I can
logically separate caches and writeback for the separate subdevices.
A struct address_space per subdevice would be ideal - anyone got
ideas on how to get that?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread David Greaves

[EMAIL PROTECTED] wrote:

On Thu, 21 Jun 2007, David Chinner wrote:
one of the 'killer features' of zfs is that it does checksums of every 
file on disk. so many people don't consider the disk infallible.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption


How different is that to raid1/5/6 being set to a 'paranoid' "read-verify" mode 
(as per Dan's recent email) where a read reads from _all_ spindles and verifies 
(and with R6 maybe corrects) the stripe before returning it?


Doesn't solve DaveC's issue about the fs doing redundancy but isn't that 
essentially just fs level mirroring?


David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread David Greaves

Neil Brown wrote:


This isn't quite right.

Thanks :)


Firstly, it is mdadm which decided to make one drive a 'spare' for
raid5, not the kernel.
Secondly, it only applies to raid5, not raid6 or raid1 or raid10.

For raid6, the initial resync (just like the resync after an unclean
shutdown) reads all the data blocks, and writes all the P and Q
blocks.
raid5 can do that, but it is faster to read all but one disk, and
write to that one disk.


How about this:

Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity 
is what's called the "initial resync".


Raid level 0 doesn't have any redundancy so there is no initial resync.

For raid levels 1,4,6 and 10 mdadm creates the array and starts a resync. The 
raid algorithm then reads the data blocks and writes the appropriate 
parity/mirror (P+Q) blocks across all the relevant disks. There is some sample 
output in a section below...


For raid5 there is an optimisation: mdadm takes one of the disks and marks it as 
'spare'; it then creates the array in degraded mode. The kernel marks the spare 
disk as 'rebuilding' and starts to read from the 'good' disks, calculates the 
parity to determine what should be on the spare disk, and then just writes to it.


Once all this is done the array is clean and all disks are active.

This can take quite some time and the array is not fully resilient whilst this is 
happening (it is however fully usable).






Also is raid4 like raid5 or raid6 in this respect?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread david

On Thu, 21 Jun 2007, David Chinner wrote:


On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.


one of the 'killer features' of zfs is that it does checksums of every 
file on disk. so many people don't consider the disk infallible.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption

as david C points out there are many points in the path where the data 
could get corrupted besides on the platter.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread David Chinner
On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:
> On Monday June 18, [EMAIL PROTECTED] wrote:
> > On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> > > Combining these thoughts, it would make a lot of sense for the
> > > filesystem to be able to say to the block device "That block looks
> > > wrong - can you find me another copy to try?".  That is an example of
> > > the sort of closer integration between filesystem and RAID that would
> > > make sense.
> > 
> > I think that this would only be useful on devices that store
> > discrete copies of the blocks on different devices i.e. mirrors. If
> > it's an XOR based RAID, you don't have another copy you can
> > retrieve
> 
> You could reconstruct the block in question from all the other blocks
> (including parity) and see if that differs from the data block read
> from disk...  For RAID6, there would be a number of different ways to
> calculate alternate blocks.   Not convinced that it is actually
> something we want to do, but it is a possibility.

Agreed - it's not as straightforward as a mirror, and it kind of assumes
that you have software RAID.

/me had his head stuck in hw raid land ;)

> I have that - apparently naive - idea that drives use strong checksum,
> and will never return bad data, only good data or an error.  If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them

The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.

Yeah, so I can see how having a different retry semantic would be a
good idea. i.e. if we do a READ_VERIFY I/O, the underlying device
attempts to verify the data is good in as many ways as possible
before returning the verified data or an error.

I guess a filesystem read would become something like this:

	verified = 0
	error = read(block)
	if (error) {
read_verify:
		error = read_verify(block)
		if (error) {
			OMG THE SKY IS FALLING
			return error
		}
		verified = 1
	}
	/* check contents */
	if (contents are bad) {
		if (!verified)
			goto read_verify
		OMG THE SKY HAS FALLEN
		return -EIO
	}

Is this the sort of error handling and re-issuing of
I/O that you had in mind?

FWIW, I don't think this really removes the need for a filesystem to
be able to keep multiple copies of stuff about. If the copy(s) on a
device are gone, you've still got to have another copy somewhere
else to get it back...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread David Greaves

[EMAIL PROTECTED] wrote:

On Thu, 21 Jun 2007, David Chinner wrote:
one of the 'killer features' of zfs is that it does checksums of every 
file on disk. so many people don't consider the disk infallable.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption


How different is that to raid1/5/6 being set to a 'paranoid' read-verify mode 
(as per Dan's recent email) where a read reads from _all_ spindles and verifies 
(and with R6 maybe corrects) the stripe before returning it?
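
(To make that concrete: a rough sketch of the check I have in mind, assuming a 
plain XOR-parity raid5 stripe and ignoring md's internals entirely - with raid5 
alone a mismatch only tells you the stripe is inconsistent, not which chunk is 
wrong; raid6 has enough extra information to go further:)

#include <stddef.h>

/* Sketch only, not md code: check that the XOR of all the data chunks
 * in a raid5 stripe equals the parity chunk that was read from disk. */
static int stripe_parity_ok(const unsigned char *const *data, int ndata,
                            const unsigned char *parity, size_t chunk_bytes)
{
        size_t i;
        int d;

        for (i = 0; i < chunk_bytes; i++) {
                unsigned char x = 0;

                for (d = 0; d < ndata; d++)
                        x ^= data[d][i];
                if (x != parity[i])
                        return 0;       /* stripe inconsistent */
        }
        return 1;                       /* XOR identity holds */
}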


Doesn't solve DaveC's issue about the fs doing redundancy but isn't that 
essentially just fs level mirroring?


David
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread David Chinner
On Thu, Jun 21, 2007 at 04:39:36PM +1000, David Chinner wrote:
 FWIW, I don't think this really removes the need for a filesystem to
 be able to keep multiple copies of stuff about. If the copy(s) on a
 device are gone, you've still got to have another copy somewhere
 else to get it back...

Speaking of knowing where you can safely put multiple copies, I'm in
the process of telling XFS about linear alignment of the underlying
array so that we can:

- spread out the load across it faster.
- provide workload isolation
- know where *not* to put duplicate or EDAC data

I'm aiming at identical subvolumes so it's simple to implement.  All
I need to know is the size of each subvolume. I can supply that at
mkfs time or in a mount option, but I want something that can works
automatically so I need to query dm to find out the size of each
underlying device during mount.  We should also pass stripe
unit/width with the same interface while we are at it...
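
(Purely as a strawman for the shape of the answer I'm after - these names don't 
exist anywhere, it's just the information I want dm to hand up:)

/* Strawman only - made-up names.  What the filesystem wants to learn
 * about the underlying dm/md device at mkfs and mount time. */
struct subvol_geometry {
        unsigned int       nr_subvols;   /* number of identical subvolumes */
        unsigned long long subvol_bytes; /* size of each subvolume, in bytes */
        unsigned int       stripe_unit;  /* chunk size, in bytes */
        unsigned int       stripe_width; /* stripe unit * data disks, bytes */
};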

What's the correct and safe way to get this information from dm
both into the kernel and out to userspace (mkfs)?

FWIW, my end goal is to be able to map the underlying block device
address spaces directly into the filesystem so that the filesystem
is able to use the underlying devices intelligently and I can
logically separate caches and writeback for the separate subdevices.
A struct address_space per subdevice would be ideal - anyone got
ideas on how to get that?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Mattias Wadenstein

On Thu, 21 Jun 2007, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along 
the path of cables, controllers, drivers, buses, etc. If you add in fcal, 
you'll get even more sources of failure, but usually you can avoid SANs 
(if you care about your data).


Well, here are a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on 
disk. With no error condition at all. (I've also heard from a colleague 
about this on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've 
also had a few broken filesystems due to in-memory corruption due to bad 
ram, not sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Justin Piszcz



On Thu, 21 Jun 2007, Mattias Wadenstein wrote:


On Thu, 21 Jun 2007, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along the 
path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll 
get even more sources of failure, but usually you can avoid SANs (if you care 
about your data).


Well, here is a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on disk. 
With no error condition at all. (I've also heard from a collegue about this 
on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've also 
had a few broken filesystems due to in-memory corruption due to bad ram, not 
sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



Very interesting slides/presentation, going to watch it shortly.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread david

On Thu, 21 Jun 2007, Mattias Wadenstein wrote:


On Thu, 21 Jun 2007, Neil Brown wrote:


 I have that - apparently naive - idea that drives use strong checksum,
 and will never return bad data, only good data or an error.  If this
 isn't right, then it would really help to understand what the cause of
 other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along the 
path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll 
get even more sources of failure, but usually you can avoid SANs (if you care 
about your data).


heh, the pitch I get from the self-proclaimed experts is that if you care 
about your data you put it on the san (so you can take advantage of the 
more expensive disk arrays, various backup advantages, and replication 
features that tend to be focused on the san because it's a big target)


David Lang


Well, here is a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on disk. 
With no error condition at all. (I've also heard from a collegue about this 
on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've also 
had a few broken filesystems due to in-memory corruption due to bad ram, not 
sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Mark Lord

[EMAIL PROTECTED] wrote:

On Thu, 21 Jun 2007, David Chinner wrote:


On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.


one of the 'killer features' of zfs is that it does checksums of every 
file on disk. so many people don't consider the disk infallable.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption


No, all of those checksums are to detect *filesystem* corruption,
not device corruption (a mere side-effect).

as david C points out there are many points in the path where the data 
could get corrupted besides on the platter.


Yup, that too.

But drives either return good data, or an error.

Cheers
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Martin K. Petersen
>>>>> "Mattias" == Mattias Wadenstein <[EMAIL PROTECTED]> writes:

Mattias> In theory, that's how storage should work. In practice,
Mattias> silent data corruption does happen. If not from the disks
Mattias> themselves, somewhere along the path of cables, controllers,
Mattias> drivers, buses, etc. If you add in fcal, you'll get even more
Mattias> sources of failure, but usually you can avoid SANs (if you
Mattias> care about your data).

Oracle cares a lot about people's data 8).  And we've seen many cases
of silent data corruption.  Often the problem goes unnoticed for
months.  And by the time you find out about it you may have gone
through your backup cycle so the data is simply lost.

The Oracle database in combination with certain high-end arrays
supports a technology called HARD (Hardware Assisted Resilient Data)
which allows the array front end to verify the integrity of an I/O
before committing it to disk.  The downside to HARD is that it's
proprietary and only really high-end customers use it (many
enterprises actually mandate HARD).

A couple of years ago some changes started to trickle into the SCSI
Block Commands spec.  And as some of you know I've been working on
implementing support for this Data Integrity Field in Linux.

What DIF allows you to do is to attach some integrity metadata to an
I/O.  We can attach this metadata all the way up in the userland
application context where the risk of corruption is relatively small.
The metadata passes all the way through the I/O stack, gets verified
by the HBA firmware, through the fabric, gets verified by the array
front end and finally again by the disk drive before the change is
committed to platter.  Any discrepancy will cause the I/O to be
failed.  And thanks to the intermediate checks you also get fault
isolation.

The DIF integrity metadata contains a CRC of the data block as well as
a reference tag that (for Type 1) needs to match the target sector on
disk.  This way the common problem of misdirected writes can be
alleviated.
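
(Concretely, the protection information is an 8-byte tuple that rides along with 
each 512-byte sector; the field names below are just for illustration:)

#include <stdint.h>

/* The 8 bytes of DIF protection information per 512-byte sector
 * (big-endian on the wire; names here are illustrative only). */
struct dif_tuple {
        uint16_t guard_tag;     /* CRC of the 512 bytes of sector data */
        uint16_t app_tag;       /* application tag, owned by the initiator */
        uint32_t ref_tag;       /* Type 1: low 32 bits of the target LBA */
};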

Initially, DIF is going to be offered in the FC/SAS space.  But I
encourage everybody to lean on their SATA drive manufacturer of choice
and encourage them to provide a similar functionality on consumer or
at the very least nearline drives.


Note there's a difference between FS checksums and DIF.  Filesystem
checksums (plug: http://oss.oracle.com/projects/btrfs/) allows the
filesystem to detect that it read something bad.  And as discussed
earlier we can potentially retry the read from another mirror or
reconstruct in the case of RAID5/6.

DIF, however, is a proactive technology.  It prevents bad stuff from
being written to disk in the first place.  You'll know right away when
corruption happens, not 4 months later when you try to read the data
back.

So DIF and filesystem checksumming go hand in hand in preventing data
corruption...

-- 
Martin K. Petersen  Oracle Linux Engineering

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Nix
On 21 Jun 2007, Neil Brown stated:
 I have that - apparently naive - idea that drives use strong checksum,
 and will never return bad data, only good data or an error.  If this
 isn't right, then it would really help to understand what the cause of
 other failures are before working out how to handle them

Look at the section `Disks and errors' in Val Henson's excellent report
on last year's filesystems workshop: http://lwn.net/Articles/190223/.
Most of the error modes given there lead to valid checksums and wrong
data...

(while you're there, read the first part too :) )

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Bill Davidsen
I didn't get a comment on my suggestion for a quick and dirty fix for 
--assume-clean issues...


Bill Davidsen wrote:

Neil Brown wrote:

On Thursday June 14, [EMAIL PROTECTED] wrote:
 

it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when makeing a new array? couldn't 
it just zero all the drives instead? (or better still just record 
most of the space as 'unused' and initialize it as it starts useing 
it?)



Yes, it could zero all the drives first.  But that would take the same
length of time (unless p/q generation was very very slow), and you
wouldn't be able to start writing data until it had finished.
You can dd /dev/zero onto all drives and then create the array with
--assume-clean if you want to.  You could even write a shell script to
do it for you.

Yes, you could record which space is used vs unused, but I really
don't think the complexity is worth it.

  
How about a simple solution which would get an array on line and still 
be safe? All it would take is a flag which forced reconstruct writes 
for RAID-5. You could set it with an option, or automatically if 
someone puts --assume-clean with --create, leave it in the superblock 
until the first repair runs to completion. And for repair you could 
make some assumptions about bad parity not being caused by error but 
just unwritten.


Thought 2: I think the unwritten bit is easier than you think, you 
only need it on parity blocks for RAID5, not on data blocks. When a 
write is done, if the bit is set do a reconstruct, write the parity 
block, and clear the bit. Keeping a bit per data block is madness, and 
appears to be unnecessary as well.
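
(A toy model of what I mean - not md code, and hand-waving where the bit 
actually lives and how it is made persistent:)

#include <string.h>

#define NDATA    4              /* data chunks per stripe (toy value) */
#define CHUNK_SZ 4096

struct stripe {
        unsigned char data[NDATA][CHUNK_SZ];
        unsigned char parity[CHUNK_SZ];
        int parity_unwritten;   /* the proposed "parity not yet valid" bit */
};

static void recompute_parity(struct stripe *sh)
{
        int d;
        size_t i;

        memset(sh->parity, 0, CHUNK_SZ);
        for (d = 0; d < NDATA; d++)
                for (i = 0; i < CHUNK_SZ; i++)
                        sh->parity[i] ^= sh->data[d][i];
}

/* Write one chunk's worth of new data into slot 'slot' of the stripe.
 * While the bit is set the on-disk parity can't be trusted, so force a
 * reconstruct-write and clear the bit; afterwards normal read-modify-
 * write is safe again. */
void stripe_write(struct stripe *sh, int slot, const unsigned char *buf)
{
        size_t i;

        if (sh->parity_unwritten) {
                memcpy(sh->data[slot], buf, CHUNK_SZ);
                recompute_parity(sh);           /* ignore old parity */
                sh->parity_unwritten = 0;
        } else {
                /* read-modify-write: xor out old data, xor in new */
                for (i = 0; i < CHUNK_SZ; i++) {
                        sh->parity[i] ^= sh->data[slot][i] ^ buf[i];
                        sh->data[slot][i] = buf[i];
                }
        }
}
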
while I consider zfs to be ~80% hype, one advantage it could have 
(but I don't know if it has) is that since the filesystem an raid 
are integrated into one layer they can optimize the case where files 
are being written onto unallocated space and instead of reading 
blocks from disk to calculate the parity they could just put zeros 
in the unallocated space, potentially speeding up the system by 
reducing the amount of disk I/O.



Certainly.  But the raid doesn't need to be tightly integrated
into the filesystem to achieve this.  The filesystem need only know
the geometry of the RAID and when it comes to write, it tries to write
full stripes at a time.  If that means writing some extra blocks full
of zeros, it can try to do that.  This would require a little bit
better communication between filesystem and raid, but not much.  If
anyone has a filesystem that they want to be able to talk to raid
better, they need only ask...
 
 
is there any way that linux would be able to do this sort of thing? 
or is it impossible due to the layering preventing the nessasary 
knowledge from being in the right place?



Linux can do anything we want it to.  Interfaces can be changed.  All
it takes is a fairly well defined requirement, and the will to make it
happen (and some technical expertise, and lots of time  and
coffee?).
  
Well, I gave you two thoughts, one which would be slow until a repair 
but sounds easy to do, and one which is slightly harder but works 
better and minimizes performance impact.





--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-21 Thread Neil Brown
On Thursday June 21, [EMAIL PROTECTED] wrote:
 I didn't get a comment on my suggestion for a quick and dirty fix for 
 -assume-clean issues...
 
 Bill Davidsen wrote:
  How about a simple solution which would get an array on line and still 
  be safe? All it would take is a flag which forced reconstruct writes 
  for RAID-5. You could set it with an option, or automatically if 
  someone puts --assume-clean with --create, leave it in the superblock 
  until the first repair runs to completion. And for repair you could 
  make some assumptions about bad parity not being caused by error but 
  just unwritten.

It is certainly possible, and probably not a lot of effort.  I'm not
really excited about it though.

So if someone were to submit a patch that did the right stuff, I would
probably accept it, but I am unlikely to do it myself.


 
  Thought 2: I think the unwritten bit is easier than you think, you 
  only need it on parity blocks for RAID5, not on data blocks. When a 
  write is done, if the bit is set do a reconstruct, write the parity 
  block, and clear the bit. Keeping a bit per data block is madness, and 
  appears to be unnecessary as well.

Where do you propose storing those bits?  And how many would you cache
in memory?  And what performance hit would you suffer for accessing
them?  And would it be worth it?

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-20 Thread Neil Brown
On Saturday June 16, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > On Friday June 15, [EMAIL PROTECTED] wrote:
> >  
> >>   As I understand the way
> >> raid works, when you write a block to the array, it will have to read all
> >> the other blocks in the stripe and recalculate the parity and write it out.
> > 
> > Your understanding is incomplete.
> 
> Does this help?
> [for future reference so you can paste a url and save the typing for code :) ]
> 
> http://linux-raid.osdl.org/index.php/Initial_Array_Creation
> 
> David
> 
> 
> 
> Initial Creation
> 
> When mdadm asks the kernel to create a raid array the most noticeable 
> activity 
> is what's called the "initial resync".
> 
> The kernel takes one (or two for raid6) disks and marks them as 'spare'; it 
> then 
> creates the array in degraded mode. It then marks spare disks as 'rebuilding' 
> and starts to read from the 'good' disks, calculate the parity and determines 
> what should be on any spare disks and then writes it. Once all this is done 
> the 
> array is clean and all disks are active.

This isn't quite right.
Firstly, it is mdadm which decided to make one drive a 'spare' for
raid5, not the kernel.
Secondly, it only applies to raid5, not raid6 or raid1 or raid10.

For raid6, the initial resync (just like the resync after an unclean
shutdown) reads all the data blocks, and writes all the P and Q
blocks.
raid5 can do that, but it is faster to read all but one disk, and
write to that one disk.
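
(To illustrate, treating the initial resync exactly like a rebuild of one
member: the XOR of the chunks read from the other members is, by definition,
what the remaining member should contain, whether that chunk happens to hold
data or parity.  A toy sketch, not the real md code:)

#include <string.h>

#define NDISKS   5              /* toy values */
#define CHUNK_SZ 4096

/* in[] holds the corresponding chunk from each of the NDISKS-1 members
 * that were read; out receives what gets written to the member being
 * "rebuilt".  raid6 can't take this shortcut: its resync reads the data
 * chunks and writes freshly computed P and Q instead. */
static void raid5_resync_chunk(const unsigned char in[NDISKS - 1][CHUNK_SZ],
                               unsigned char out[CHUNK_SZ])
{
        int d;
        size_t i;

        memcpy(out, in[0], CHUNK_SZ);
        for (d = 1; d < NDISKS - 1; d++)
                for (i = 0; i < CHUNK_SZ; i++)
                        out[i] ^= in[d][i];
}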

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-20 Thread Neil Brown
On Monday June 18, [EMAIL PROTECTED] wrote:
> On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> > Combining these thoughts, it would make a lot of sense for the
> > filesystem to be able to say to the block device "That blocks looks
> > wrong - can you find me another copy to try?".  That is an example of
> > the sort of closer integration between filesystem and RAID that would
> > make sense.
> 
> I think that this would only be useful on devices that store
> discrete copies of the blocks on different devices i.e. mirrors. If
> it's an XOR based RAID, you don't have another copy you can
> retreive

You could reconstruct the block in question from all the other blocks
(including parity) and see if that differs from the data block read
from disk...  For RAID6, there would be a number of different ways to
calculate alternate blocks.   Not convinced that it is actually
something we want to do, but it is a possibility.
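
(As a sketch of what I mean for the raid5 case - purely illustrative, none of
this is md code - the suspect block is just the XOR of the parity block and the
other data blocks, so you can rebuild it and compare:)

#include <stddef.h>
#include <string.h>

/* others[] holds every other chunk in the stripe (parity included),
 * suspect is the chunk the filesystem complained about as read from
 * disk, and recon is scratch space of the same size.  Returns 1 if the
 * reconstructed copy matches what the disk returned. */
static int suspect_block_matches(const unsigned char *const *others,
                                 int nothers, const unsigned char *suspect,
                                 unsigned char *recon, size_t chunk_bytes)
{
        int d;
        size_t i;

        memset(recon, 0, chunk_bytes);
        for (d = 0; d < nothers; d++)
                for (i = 0; i < chunk_bytes; i++)
                        recon[i] ^= others[d][i];

        return memcmp(recon, suspect, chunk_bytes) == 0;
}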

I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-19 Thread david

On Tue, 19 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote:

yes, I'm useing promise drive shelves, I have them configured to export
the 15 drives as 15 LUNs on a single ID.

I'm going to be useing this as a huge circular buffer that will just be
overwritten eventually 99% of the time, but once in a while I will need to
go back into the buffer and extract and process the data.


I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time.  Well unless you end up
saturating the PCI bus instead.

hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one scsi channel on hardware raid, it will still be limited).


I fully realize that the channel will be the bottleneck, I just didn't 
understand what /proc/mdstat was telling me. I thought that it was telling 
me that the resync was processing 5M/sec, not that it was writing 5M/sec 
on each of the two parity locations.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-19 Thread Lennart Sorensen
On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote:
> yes, I'm useing promise drive shelves, I have them configured to export 
> the 15 drives as 15 LUNs on a single ID.
> 
> I'm going to be useing this as a huge circular buffer that will just be 
> overwritten eventually 99% of the time, but once in a while I will need to 
> go back into the buffer and extract and process the data.

I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time.  Well unless you end up
saturating the PCI bus instead.

hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one scsi channel on hardware raid, it will still be limited).

--
Len Sorensen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-19 Thread david

On Tue, 19 Jun 2007, Phillip Susi wrote:


[EMAIL PROTECTED] wrote:

 one channel, 2 OS drives plus the 45 drives in the array.


Huh?  You can only have 16 devices on a scsi bus, counting the host adapter. 
And I don't think you can even manage that much reliably with the newer 
higher speed versions, at least not without some very special cables.


6 devices on the bus (2 OS drives, 3 promise drive shelves, controller 
card)



 yes I realize that there will be bottlenecks with this, the large capacity
 is to handle longer history (it's going to be a 30TB circular buffer being
 fed by a pair of OC-12 links)


Building one of those nice packet sniffers for the NSA to install on AT&T's 
network eh? ;)


just for going back in time to track hacker actions at a bank.

I'm hoping that once I figure out the drives, the rest of the software 
will basically boil down to tcpdump with the right options to write to a 
circular buffer of files.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-19 Thread Phillip Susi

[EMAIL PROTECTED] wrote:

one channel, 2 OS drives plus the 45 drives in the array.


Huh?  You can only have 16 devices on a scsi bus, counting the host 
adapter.  And I don't think you can even manage that much reliably with 
the newer higher speed versions, at least not without some very special 
cables.


yes I realize that there will be bottlenecks with this, the large 
capacity is to handle longer history (it's going to be a 30TB circular 
buffer being fed by a pair of OC-12 links)


Building one of those nice packet sniffers for the NSA to install on 
AT&T's network eh? ;)



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:
yes, I'm useing promise drive shelves, I have them configured to export 
the 15 drives as 15 LUNs on a single ID.


Well, that would account for it.  Your bus is very, very saturated.  If 
all your drives are active, you can't get more than ~7MB/s per disk 
under perfect conditions.


--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Wakko Warner wrote:


Subject: Re: limits on raid

[EMAIL PROTECTED] wrote:

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.


Given that the drives only have 4 ID bits, how can you have 47 drives on 1
cable?  You'd need a minimum of 3 channels for 47 drives.  Do you have some
sort of external box that holds X number of drives and only uses a single
ID?


yes, I'm using promise drive shelves, I have them configured to export 
the 15 drives as 15 LUNs on a single ID.


I'm going to be using this as a huge circular buffer that will just be 
overwritten eventually 99% of the time, but once in a while I will need to 
go back into the buffer and extract and process the data.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Wakko Warner
[EMAIL PROTECTED] wrote:
> On Mon, 18 Jun 2007, Brendan Conoboy wrote:
> 
> >[EMAIL PROTECTED] wrote:
> >> yes, sorry, ultra 320 wide.
> >
> >Exactly how many channels and drives?
> 
> one channel, 2 OS drives plus the 45 drives in the array.

Given that the drives only have 4 ID bits, how can you have 47 drives on 1
cable?  You'd need a minimum of 3 channels for 47 drives.  Do you have some
sort of external box that holds X number of drives and only uses a single
ID?

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
 Got Gas???
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.

yes I realize that there will be bottlenecks with this, the large capacity 
is to handle longer history (it's going to be a 30TB circular buffer being 
fed by a pair of OC-12 links)


it appears that my big mistake was not understanding what /proc/mdstat is 
telling me.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:

yes, sorry, ultra 320 wide.


Exactly how many channels and drives?

--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:

simple ultra-wide SCSI to a single controller.


Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausable for 320 scsi.


yes, sorry, ultra 320 wide.


I didn't realize that the rate reported by /proc/mdstat was the write
speed that was takeing place, I thought it was the total data rate (reads
+ writes). the next time this message gets changed it would be a good
thing to clarify this.


Well I suppose itcould make sense to show rate of rebuild which you can
then compare against the total size of tha raid, or you can have rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higer speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take for ever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.


as I read it right now, what happens is the worst of the options: you show 
the total size of the array for the amount of work that needs to be done, 
but then show only the write speed for the rate of progress being made 
through the job.


total rebuild time was estimated at ~3200 min

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Lennart Sorensen
On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:
> simple ultra-wide SCSI to a single controller.

Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausible for 320 scsi.

> I didn't realize that the rate reported by /proc/mdstat was the write 
> speed that was takeing place, I thought it was the total data rate (reads 
> + writes). the next time this message gets changed it would be a good 
> thing to clarify this.

Well I suppose it could make sense to show the rate of rebuild, which you can
then compare against the total size of the raid, or you can have the rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higher speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take forever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.

--
Len Sorensen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 I plan to test the different configurations.

 however, if I was saturating the bus with the reconstruct how can I fire
 off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing the
 reconstruct to ~4M/sec?

 I'm putting 10x as much data through the bus at that point, it would seem
 to proove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you look at a running iostat while 
all this is going on?  Try it out- add up the kb/s from each drive and see 
how close you are to your maximum theoretical IO.


I didn't try iostat, but I did look at vmstat, and there the numbers look even 
worse: the bo column is ~500 for the resync by itself, but with the dd 
it's ~50,000. When I get access to the box again I'll try iostat to get 
more details.



Also, how's your CPU utilization?


~30% of one cpu for the raid 6 thread, ~5% of one cpu for the resync 
thread


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing the
reconstruct to ~4M/sec?

I'm putting 10x as much data through the bus at that point, it would seem
to proove that it's not the bus that's saturated.


dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or radi6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s.  If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting.  Seems like you are indeed very much saturating a bus
somewhere.  The numbers certainly agree with that theory.

What kind of setup is the drives connected to?


simple ultra-wide SCSI to a single controller.

I didn't realize that the rate reported by /proc/mdstat was the write 
speed that was taking place, I thought it was the total data rate (reads 
+ writes). The next time this message gets changed it would be a good 
thing to clarify this.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire 
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing 
the reconstruct to ~4M/sec?


I'm putting 10x as much data through the bus at that point, it would 
seem to proove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you looked at a running iostat 
while all this is going on?  Try it out- add up the kb/s from each drive 
and see how close you are to your maximum theoretical IO.


Also, how's your CPU utilization?

--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Lennart Sorensen
On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote:
> I plan to test the different configurations.
> 
> however, if I was saturating the bus with the reconstruct how can I fire 
> off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing the 
> reconstruct to ~4M/sec?
> 
> I'm putting 10x as much data through the bus at that point, it would seem 
> to proove that it's not the bus that's saturated.

dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or raid6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s.  If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting.  Seems like you are indeed very much saturating a bus
somewhere.  The numbers certainly agree with that theory.
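
(Rough arithmetic, nothing more - a toy program just to show the numbers line
up with one Ultra320 channel being the limit; the rates are the ones quoted
above:)

#include <stdio.h>

int main(void)
{
        const int readers = 44;          /* members being read during resync */
        const double ultra320 = 320.0;   /* theoretical channel limit, MB/s */

        double resync_alone = readers * 5.0;    /* resync writing at 5 MB/s */
        double resync_with_dd = readers * 4.0;  /* resync drops to 4 MB/s   */

        printf("reads, resync alone   : ~%.0f MB/s\n", resync_alone);
        printf("reads, resync with dd : ~%.0f MB/s\n", resync_with_dd);
        printf("freed up for the dd   : ~%.0f MB/s (vs the ~45 MB/s observed)\n",
               resync_alone - resync_with_dd);
        printf("Ultra320 ceiling      : %.0f MB/s (theoretical)\n", ultra320);
        return 0;
}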

What kind of setup is the drives connected to?

--
Len Sorensen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 in my case it takes 2+ days to resync the array before I can do any
 performance testing with it. for some reason it's only doing the rebuild
 at ~5M/sec (even though I've increased the min and max rebuild speeds and
 a dd to the array seems to be ~44M/sec, even during the rebuild)


With performance like that, it sounds like you're saturating a bus somewhere 
along the line.  If you're using scsi, for instance, it's very easy for a 
long chain of drives to overwhelm a channel.  You might also want to consider 
some other RAID layouts like 1+0 or 5+0 depending upon your space vs. 
reliability needs.


I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire 
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the 
reconstruct to ~4M/sec?


I'm putting 10x as much data through the bus at that point, it would seem 
to prove that it's not the bus that's saturated.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:
in my case it takes 2+ days to resync the array before I can do any 
performance testing with it. for some reason it's only doing the rebuild 
at ~5M/sec (even though I've increased the min and max rebuild speeds 
and a dd to the array seems to be ~44M/sec, even during the rebuild)


With performance like that, it sounds like you're saturating a bus 
somewhere along the line.  If you're using scsi, for instance, it's very 
easy for a long chain of drives to overwhelm a channel.  You might also 
want to consider some other RAID layouts like 1+0 or 5+0 depending upon 
your space vs. reliability needs.


--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire 
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing 
the reconstruct to ~4M/sec?


I'm putting 10x as much data through the bus at that point, it would 
seem to prove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you looked at a running iostat 
while all this is going on?  Try it out - add up the kb/s from each drive 
and see how close you are to your maximum theoretical IO.
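
Something along these lines (a sketch only; it assumes sysstat's iostat and
member disks named sd*, and the first report is just the average since boot):

iostat -d -k 5       # per-disk kB/s, refreshed every 5 seconds
iostat -d -k | awk '/^sd/ { kb += $3 + $4 } END { print kb " kB/s total" }'   # sum read+write rates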


Also, how's your CPU utilization?

--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
reconstruct to ~4M/sec?

I'm putting 10x as much data through the bus at that point, it would seem
to prove that it's not the bus that's saturated.


dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or raid6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s.  If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting.  Seems like you are indeed very much saturating a bus
somewhere.  The numbers certainly agree with that theory.

What kind of setup are the drives connected to?


simple ultra-wide SCSI to a single controller.

I didn't realize that the rate reported by /proc/mdstat was the write 
speed that was taking place, I thought it was the total data rate (reads 
+ writes). the next time this message gets changed it would be a good 
thing to clarify this.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 I plan to test the different configurations.

 however, if I was saturating the bus with the reconstruct how can I fire
 off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
 reconstruct to ~4M/sec?

 I'm putting 10x as much data through the bus at that point, it would seem
 to prove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you looked at a running iostat while 
all this is going on?  Try it out - add up the kb/s from each drive and see 
how close you are to your maximum theoretical IO.


I didn't try iostat, but I did look at vmstat, and there the numbers look even 
worse: the bo column is ~500 for the resync by itself, but with the dd 
it's ~50,000. When I get access to the box again I'll try iostat to get 
more details



Also, how's your CPU utilization?


~30% of one cpu for the raid 6 thread, ~5% of one cpu for the resync 
thread


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Lennart Sorensen
On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:
 simple ultra-wide SCSI to a single controller.

Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausible for 320 scsi.

 I didn't realize that the rate reported by /proc/mdstat was the write 
 speed that was taking place, I thought it was the total data rate (reads 
 + writes). the next time this message gets changed it would be a good 
 thing to clarify this.

Well, I suppose it could make sense to show rate of rebuild which you can
then compare against the total size of the raid, or you can have rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higher speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take forever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.

--
Len Sorensen
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:

simple ultra-wide SCSI to a single controller.


Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausible for 320 scsi.


yes, sorry, ultra 320 wide.


I didn't realize that the rate reported by /proc/mdstat was the write
speed that was taking place, I thought it was the total data rate (reads
+ writes). the next time this message gets changed it would be a good
thing to clarify this.


Well, I suppose it could make sense to show rate of rebuild which you can
then compare against the total size of the raid, or you can have rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higher speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take forever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.


As I read it right now, what happens is the worst of the options: you show 
the total size of the array for the amount of work that needs to be done, 
but then show only the write speed for the rate of progress being made 
through the job.


total rebuild time was estimated at ~3200 min
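
For reference, the knobs involved look roughly like this (a sketch; the 50000
figure is just a placeholder, and the limits are in KB/s per device):

cat /proc/mdstat                                  # per-device rebuild rate and ETA
cat /proc/sys/dev/raid/speed_limit_min            # current floor
echo 50000 > /proc/sys/dev/raid/speed_limit_min   # raise the floor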

David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:

yes, sorry, ultra 320 wide.


Exactly how many channels and drives?

--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.

yes, I realize that there will be bottlenecks with this; the large capacity 
is to handle longer history (it's going to be a 30TB circular buffer being 
fed by a pair of OC-12 links)


it appears that my big mistake was not understanding what /proc/mdstat is 
telling me.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Wakko Warner
[EMAIL PROTECTED] wrote:
 On Mon, 18 Jun 2007, Brendan Conoboy wrote:
 
 [EMAIL PROTECTED] wrote:
  yes, sorry, ultra 320 wide.
 
 Exactly how many channels and drives?
 
 one channel, 2 OS drives plus the 45 drives in the array.

Given that the drives only have 4 ID bits, how can you have 47 drives on 1
cable?  You'd need a minimum of 3 channels for 47 drives.  Do you have some
sort of external box that holds X number of drives and only uses a single
ID?

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
 Got Gas???
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Wakko Warner wrote:


Subject: Re: limits on raid

[EMAIL PROTECTED] wrote:

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.


Given that the drives only have 4 ID bits, how can you have 47 drives on 1
cable?  You'd need a minimum of 3 channels for 47 drives.  Do you have some
sort of external box that holds X number of drives and only uses a single
ID?


yes, I'm using Promise drive shelves; I have them configured to export 
the 15 drives as 15 LUNs on a single ID.
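
A quick way to see how the shelves are being presented (just a sketch; each
shelf should appear as one target ID with LUNs 0 through 14):

cat /proc/scsi/scsi                  # one Host/Channel/Id/Lun block per disk
grep -c ^Host /proc/scsi/scsi        # total number of SCSI devices the kernel sees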


I'm going to be using this as a huge circular buffer that will just be 
overwritten eventually 99% of the time, but once in a while I will need to 
go back into the buffer and extract and process the data.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:
yes, I'm using Promise drive shelves; I have them configured to export 
the 15 drives as 15 LUNs on a single ID.


Well, that would account for it.  Your bus is very, very saturated.  If 
all your drives are active, you can't get more than ~7MB/s per disk 
under perfect conditions.
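
(That figure is just the shared bus ceiling divided by the drive count,
assuming the nominal 320MB/s of Ultra320:)

echo "scale=1; 320 / 45" | bc    # ~7.1 MB/s per drive with all 45 drives busy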


--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread David Chinner
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> Combining these thoughts, it would make a lot of sense for the
> filesystem to be able to say to the block device "That blocks looks
> wrong - can you find me another copy to try?".  That is an example of
> the sort of closer integration between filesystem and RAID that would
> make sense.

I think that this would only be useful on devices that store
discrete copies of the blocks on different devices, i.e. mirrors. If
it's an XOR based RAID, you don't have another copy you can
retrieve.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread david

On Sun, 17 Jun 2007, dean gaudet wrote:


On Sun, 17 Jun 2007, Wakko Warner wrote:


What benefit would I gain by using an external journal and how big would it
need to be?


i don't know how big the journal needs to be... i'm limited by xfs'
maximum journal size of 128MiB.

i don't have much benchmark data -- but here are some rough notes i took
when i was evaluating a umem NVRAM card.  since the pata disks in the
raid1 have write caching enabled it's somewhat of an unfair comparison,
but the important info is the 88 seconds for internal journal vs. 81
seconds for external journal.


if you turn on disk write caching the difference will be much larger.


-dean

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'


I know that sync will force everything to get as far as the journal; will 
it force the journal to be flushed?


David Lang



xfs journal   raid5 bitmap   times
internal      none           0.18s user 2.14s system 2% cpu 1:27.95 total
internal      internal       0.16s user 2.16s system 1% cpu 2:01.12 total
raid1         none           0.07s user 2.02s system 2% cpu 1:20.62 total
raid1         internal       0.14s user 2.01s system 1% cpu 1:55.18 total
raid1         raid1          0.14s user 2.03s system 2% cpu 1:20.61 total
umem          none           0.13s user 2.07s system 2% cpu 1:20.77 total
umem          internal       0.15s user 2.16s system 2% cpu 1:51.28 total
umem          umem           0.12s user 2.13s system 2% cpu 1:20.50 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 
/dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 
/dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l 
logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1]

umem:
- 512MiB Micro Memory MM-5415CN
- 2 partitions similar to the raid1 setup


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread david

On Sun, 17 Jun 2007, Wakko Warner wrote:


you can also easily move an ext3 journal to an external journal with
tune2fs (see man page).


I only have 2 ext3 file systems (One of which is mounted R/O since it's
full), all my others are reiserfs (v3).

What benefit would I gain by using an external journal and how big would it
need to be?


if you have the journal on a drive by itself you end up doing (almost) 
sequential reads and writes to the journal and the disk head doesn't need 
to move much.


this can greatly increase your write speeds since

1. the journal gets written faster (completing the write as far as your 
software is concerned)


2. the heads don't need to seek back and forth from the journal to the 
final location that the data gets written.


as for how large it should be, it all depends on the volume of your 
writes; once the journal fills up, all writes stall until space is freed in 
the journal. IIRC Ext3 is limited to 128M, and with today's drive sizes I don't 
see any reason to make it any smaller.
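
A minimal sketch of what that looks like for ext3 with the journal on its own
device (device names here are placeholders, borrowed from dean's layout):

mke2fs -O journal_dev /dev/md1        # turn the small raid1 into a journal device
tune2fs -O ^has_journal /dev/md4      # drop the internal journal (filesystem unmounted)
tune2fs -J device=/dev/md1 /dev/md4   # attach the external journal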


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread dean gaudet
On Sun, 17 Jun 2007, Wakko Warner wrote:

> What benefit would I gain by using an external journal and how big would it
> need to be?

i don't know how big the journal needs to be... i'm limited by xfs'
maximum journal size of 128MiB.

i don't have much benchmark data -- but here are some rough notes i took
when i was evaluating a umem NVRAM card.  since the pata disks in the
raid1 have write caching enabled it's somewhat of an unfair comparison,
but the important info is the 88 seconds for internal journal vs. 81
seconds for external journal.

-dean

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

xfs journal   raid5 bitmap   times
internal      none           0.18s user 2.14s system 2% cpu 1:27.95 total
internal      internal       0.16s user 2.16s system 1% cpu 2:01.12 total
raid1         none           0.07s user 2.02s system 2% cpu 1:20.62 total
raid1         internal       0.14s user 2.01s system 1% cpu 1:55.18 total
raid1         raid1          0.14s user 2.03s system 2% cpu 1:20.61 total
umem          none           0.13s user 2.07s system 2% cpu 1:20.77 total
umem          internal       0.15s user 2.16s system 2% cpu 1:51.28 total
umem          umem           0.12s user 2.13s system 2% cpu 1:20.50 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 
/dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 
/dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l 
logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1] 

umem:
- 512MiB Micro Memory MM-5415CN
- 2 partitions similar to the raid1 setup
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread Wakko Warner
dean gaudet wrote:
> On Sun, 17 Jun 2007, Wakko Warner wrote:
> 
> > > i use an external write-intent bitmap on a raid1 to avoid this... you 
> > > could use internal bitmap but that slows down i/o too much for my tastes. 
> > >  
> > > i also use an external xfs journal for the same reason.  2 disk raid1 for 
> > > root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
> > > common.
> > 
> > I must remember this if I have to rebuild the array.  Although I'm
> > considering moving to a hardware raid solution when I upgrade my storage.
> 
> you can do it without a rebuild -- that's in fact how i did it the first 
> time.
> 
> to add an external bitmap:
> 
> mdadm --grow --bitmap /bitmapfile /dev/mdX
> 
> plus add "bitmap=/bitmapfile" to mdadm.conf... as in:
> 
> ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

I used evms to set up mine.  I have used mdadm in the past.  I use lvm on top
of it, which evms makes a little easier to maintain.  I have 3 arrays
total (only the raid5 was configured by evms; the other 2 raid1s were done
by hand)

> you can also easily move an ext3 journal to an external journal with 
> tune2fs (see man page).

I only have 2 ext3 file systems (One of which is mounted R/O since it's
full), all my others are reiserfs (v3).

What benefit would I gain by using an external journal and how big would it
need to be?

> if you use XFS it's a bit more of a challenge to convert from internal to 
> external, but see this thread:

I specifically didn't use XFS (or JFS) since neither one at the time could
be shrunk.

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
 Got Gas???
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread dean gaudet
On Sun, 17 Jun 2007, Wakko Warner wrote:

> dean gaudet wrote:
> > On Sat, 16 Jun 2007, Wakko Warner wrote:
> > 
> > > When I've had an unclean shutdown on one of my systems (10x 50gb raid5) 
> > > it's
> > > always slowed the system down when booting up.  Quite significantly I must
> > > say.  I wait until I can login and change the rebuild max speed to slow it
> > > down while I'm using it.   But that is another thing.
> > 
> > i use an external write-intent bitmap on a raid1 to avoid this... you 
> > could use internal bitmap but that slows down i/o too much for my tastes.  
> > i also use an external xfs journal for the same reason.  2 disk raid1 for 
> > root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
> > common.
> 
> I must remember this if I have to rebuild the array.  Although I'm
> considering moving to a hardware raid solution when I upgrade my storage.

you can do it without a rebuild -- that's in fact how i did it the first 
time.

to add an external bitmap:

mdadm --grow --bitmap /bitmapfile /dev/mdX

plus add "bitmap=/bitmapfile" to mdadm.conf... as in:

ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

you can also easily move an ext3 journal to an external journal with 
tune2fs (see man page).

if you use XFS it's a bit more of a challenge to convert from internal to 
external, but see this thread:

http://marc.theaimsgroup.com/?l=linux-xfs&m=106929781232520&w=2

i found that i had to do "sb 1", "sb 2", ..., "sb N" for all sb rather 
than just the "sb 0" that the email instructed me to do.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread Bill Davidsen

[EMAIL PROTECTED] wrote:

On Sat, 16 Jun 2007, Neil Brown wrote:


It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write.  But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you
have a bitmap) so you may as well resync at the start.

And why is it such a big deal anyway?  The initial resync doesn't stop
you from using the array.  I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync but is that really
likely?


in my case it takes 2+ days to resync the array before I can do any 
performance testing with it. for some reason it's only doing the 
rebuild at ~5M/sec (even though I've increased the min and max rebuild 
speeds and a dd to the array seems to be ~44M/sec, even during the 
rebuild)


I want to test several configurations, from a 45 disk raid6 to a 45 
disk raid0. at 2-3 days per test (or longer, depending on the tests) 
this becomes a very slow process.


I've been doing stuff like this, but I just build the array on a 
partition per drive so the init is livable. For the stuff I'm doing a 
total of 500-100GB is ample to do performance testing.
also, when a rebuild is slow enough (and has enough of a performance 
impact) it's not uncommon to want to operate in degraded mode just 
long enough to get to a maintenance window and then recreate the array 
and reload from backup.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread Bill Davidsen

Neil Brown wrote:

On Thursday June 14, [EMAIL PROTECTED] wrote:
  

On Fri, 15 Jun 2007, Neil Brown wrote:



On Thursday June 14, [EMAIL PROTECTED] wrote:
  

what is the limit for the number of devices that can be in a single array?

I'm trying to build a 45x750G array and want to experiment with the
different configurations. I'm trying to start with raid6, but mdadm is
complaining about an invalid number of drives

David Lang


"man mdadm"  search for "limits".  (forgive typos).
  

thanks.

why does it still default to the old format after so many new versions? 
(by the way, the documentation said 28 devices, but I couldn't get it to 
accept more than 27)



Dunno - maybe I can't count...

  

it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when making a new array? couldn't it 
just zero all the drives instead? (or better still just record most of the 
space as 'unused' and initialize it as it starts using it?)



Yes, it could zero all the drives first.  But that would take the same
length of time (unless p/q generation was very very slow), and you
wouldn't be able to start writing data until it had finished.
You can "dd" /dev/zero onto all drives and then create the array with
--assume-clean if you want to.  You could even write a shell script to
do it for you.
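
Such a script could be as simple as this (a sketch only; the device list,
level and count are placeholders, and zeroing the disks still takes roughly
as long as the resync it replaces):

for d in /dev/sd[b-f]; do
    dd if=/dev/zero of=$d bs=1M &    # zero each member; dd stops when the disk is full
done
wait
mdadm --create /dev/md0 --level=6 --raid-devices=5 --assume-clean /dev/sd[b-f]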

Yes, you could record which space is used vs unused, but I really
don't think the complexity is worth it.

  
How about a simple solution which would get an array online and still 
be safe? All it would take is a flag which forced reconstruct writes for 
RAID-5. You could set it with an option, or automatically if someone 
puts --assume-clean with --create, and leave it in the superblock until the 
first "repair" runs to completion. And for repair you could make some 
assumptions about bad parity not being caused by error but just being unwritten.


Thought 2: I think the unwritten bit is easier than you think, you only 
need it on parity blocks for RAID5, not on data blocks. When a write is 
done, if the bit is set do a reconstruct, write the parity block, and 
clear the bit. Keeping a bit per data block is madness, and appears to 
be unnecessary as well.
while I consider zfs to be ~80% hype, one advantage it could have (but I 
don't know if it has) is that since the filesystem and raid are integrated 
into one layer they can optimize the case where files are being written 
onto unallocated space and instead of reading blocks from disk to 
calculate the parity they could just put zeros in the unallocated space, 
potentially speeding up the system by reducing the amount of disk I/O.



Certainly.  But the raid doesn't need to be tightly integrated
into the filesystem to achieve this.  The filesystem need only know
the geometry of the RAID and when it comes to write, it tries to write
full stripes at a time.  If that means writing some extra blocks full
of zeros, it can try to do that.  This would require a little bit
better communication between filesystem and raid, but not much.  If
anyone has a filesystem that they want to be able to talk to raid
better, they need only ask...
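
(Today that geometry is passed by hand at mkfs time; a sketch, with the chunk
size and data-disk count as placeholders:)

mkfs.xfs -d su=64k,sw=4 /dev/md4    # stripe unit = md chunk size, stripe width = data disks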
 
  
is there any way that linux would be able to do this sort of thing? or is 
it impossible due to the layering preventing the necessary knowledge from 
being in the right place?



Linux can do anything we want it to.  Interfaces can be changed.  All
it takes is a fairly well defined requirement, and the will to make it
happen (and some technical expertise, and lots of time  and
coffee?).
  
Well, I gave you two thoughts, one which would be slow until a repair 
but sounds easy to do, and one which is slightly harder but works better 
and minimizes performance impact.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread Wakko Warner
dean gaudet wrote:
> On Sat, 16 Jun 2007, Wakko Warner wrote:
> 
> > When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
> > always slowed the system down when booting up.  Quite significantly I must
> > say.  I wait until I can login and change the rebuild max speed to slow it
> > down while I'm using it.   But that is another thing.
> 
> i use an external write-intent bitmap on a raid1 to avoid this... you 
> could use internal bitmap but that slows down i/o too much for my tastes.  
> i also use an external xfs journal for the same reason.  2 disk raid1 for 
> root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
> common.

I must remember this if I have to rebuild the array.  Although I'm
considering moving to a hardware raid solution when I upgrade my storage.

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
 Got Gas???
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-17 Thread Andi Kleen
Neil Brown <[EMAIL PROTECTED]> writes:
> 
> Having the filesystem duplicate data, store checksums, and be able to
> find a different copy if the first one it chose was bad is very
> sensible and cannot be done by just putting the filesystem on RAID.

Apropos checksums: since RAID5 copies/xors anyways it would
be nice to combine that with the file system. During the xor
a simple checksum could be computed in parallel and stored
in the file system.

And the copy/checksum passes will hopefully at some
point be combined.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-16 Thread dean gaudet
On Sat, 16 Jun 2007, Wakko Warner wrote:

> When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
> always slowed the system down when booting up.  Quite significantly I must
> say.  I wait until I can login and change the rebuild max speed to slow it
> down while I'm using it.   But that is another thing.

i use an external write-intent bitmap on a raid1 to avoid this... you 
could use internal bitmap but that slows down i/o too much for my tastes.  
i also use an external xfs journal for the same reason.  2 disk raid1 for 
root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
common.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-16 Thread dean gaudet
On Sat, 16 Jun 2007, David Greaves wrote:

> Neil Brown wrote:
> > On Friday June 15, [EMAIL PROTECTED] wrote:
> >  
> > >   As I understand the way
> > > raid works, when you write a block to the array, it will have to read all
> > > the other blocks in the stripe and recalculate the parity and write it
> > > out.
> > 
> > Your understanding is incomplete.
> 
> Does this help?
> [for future reference so you can paste a url and save the typing for code :) ]
> 
> http://linux-raid.osdl.org/index.php/Initial_Array_Creation

i fixed a typo and added one more note which i think is quite fair:

It is also safe to use --assume-clean if you are performing
performance measurements of different raid configurations. Just
be sure to rebuild your array without --assume-clean when you
decide on your final configuration.
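
In mdadm terms the note amounts to something like this (a sketch; the level,
sizes and device names are placeholders):

mdadm --create /dev/md0 --level=5 --raid-devices=4 --assume-clean /dev/sd[bcde]1   # benchmarks only; parity is not valid
# ... run the performance tests here ...
mdadm --stop /dev/md0
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[bcde]1                  # the real array, resynced properly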

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-16 Thread Avi Kivity
Neil Brown wrote:
>>>   
>>>   
>> Some things are not achievable with block-level raid.  For example, with
>> redundancy integrated into the filesystem, you can have three copies for
>> metadata, two copies for small files, and parity blocks for large files,
>> effectively using different raid levels for different types of data on
>> the same filesystem.
>> 
>
> Absolutely.  And doing that is a very good idea quite independent of
> underlying RAID.  Even ext2 stores multiple copies of the superblock.
>
> Having the filesystem duplicate data, store checksums, and be able to
> find a different copy if the first one it chose was bad is very
> sensible and cannot be done by just putting the filesystem on RAID.
>   

It would need to know a lot about the RAID geometry in order not to put
the copies on the same disks.

> Having the filesystem keep multiple copies of each data block so that
> when one drive dies, another block is used does not excite me quite so
> much.  If you are going to do that, then you want to be able to
> reconstruct the data that should be on a failed drive onto a new
> drive.
> For a RAID system, that reconstruction can go at the full speed of the
> drive subsystem - but needs to copy every block, whether used or not.
> For in-filesystem duplication, it is easy to imagine that being quite
> slow and complex.  It would depend a lot on how you arrange data,
> and maybe there is some clever approach to data layout that I haven't
> thought of.  But I think that sort of thing is much easier to do in a
> RAID layer below the filesystem.
>   

You'd need a reverse mapping of extents to files.  While maintaining
that is expensive, it brings a lot of benefits:

- rebuild a failed drive, without rebuilding free space
- evacuate a drive in anticipation of taking it offline
- efficient defragmentation

Reverse mapping storage could serve as free space store too.

> Combining these thoughts, it would make a lot of sense for the
> filesystem to be able to say to the block device "That blocks looks
> wrong - can you find me another copy to try?".  That is an example of
> the sort of closer integration between filesystem and RAID that would
> make sense.
>   

It's a step forward, but still quite limited compared to combining the
two layers together.  Sticking with the example above, you still can't
have a mix of parity-protected files and mirror-protected files; the
RAID decides that for you.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: limits on raid

2007-06-16 Thread david

On Sat, 16 Jun 2007, David Greaves wrote:


[EMAIL PROTECTED] wrote:

 On Sat, 16 Jun 2007, Neil Brown wrote:

 I want to test several configurations, from a 45 disk raid6 to a 45 disk
 raid0. at 2-3 days per test (or longer, depending on the tests) this
 becomes a very slow process.
Are you suggesting the code that is written to enhance data integrity is 
optimised (or even touched) to support this kind of test scenario?

Seriously? :)


actually, if it can be done without a huge impact on the maintainability 
of the code, I think it would be a good idea, for the simple reason that I 
think the increased experimentation would result in people finding out 
what raid level is really appropriate for their needs.


there is a _lot_ of confusion out there about what the performance 
implications of different raid levels are (especially when you consider 
things like raid 10/50/60 where you have two layers combined) and anything 
that encourages experimentation would be a good thing.



 also, when a rebuild is slow enough (and has enough of a performance
 impact) it's not uncommon to want to operate in degraded mode just long
 enough to get to a maintenance window and then recreate the array and
 reload from backup.


so would mdadm --remove the rebuilding disk help?


no. let me try again

drive fails monday morning

scenario 1

replace the failed drive, start the rebuild. system will be slow (degraded 
mode + rebuild) for the next three days.


scenario 2

leave it in degraded mode until monday night (accepting the speed penalty 
for degraded mode, but not the rebuild penalty)


monday night, shut down the system, put in the new drive, reinitialize the 
array, reload the system from backup.


system is back to full speed tuesday morning.

scenario 2 isn't supported with md today, although it sounds as if the 
skip rebuild could do this except for raid 5


on my test system, the rebuild says it's running at 5M/s while a dd to a file on 
the array says it's doing 45M/s (even while the rebuild is running), so it 
seems to me that there may be value in this approach.


David Lang

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

