Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings - and 
then consider the impact and exceptions when you do, eg hardware 
acceleration? md information fed up to the fs layer for xfs? simple long term 
maintenance?


Often the benefits of the feature are well worth these problems.

I _wonder_ if this is one where the right thing is to just say no :)


In this case I think a higher-level system knowing what block sizes are 
efficient to do writes/reads in can potentially be a HUGE 
advantage.


if the upper levels know that you have a 6 disk raid 6 array with a 64K 
chunk size, then reads and writes in 256K chunks (aligned) should be able 
to be done at basically the speed of a 4 disk raid 0 array.
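
(to spell out the arithmetic: 6 drives of raid 6 means 4 data + 2 parity per 
stripe, so a full stripe is 4 x 64K = 256K. an aligned 256K write fills every 
data chunk in the stripe, so the parity can be computed from the new data alone 
with no read-modify-write, and an aligned 256K read fans out across the 4 data 
disks just as it would on a 4 disk raid 0.)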


what's even more impressive is that this could be done even if the array 
is degraded (if you know which drives have failed you don't even try to read 
from them, and you only have to reconstruct the missing info once per 
stripe)


the current approach doesn't give the upper levels any chance to operate 
in this mode; they just don't have enough information to do so.


the part about wanting to know the raid 0 chunk size, so that the upper layers 
can be sure that data that's supposed to be redundant ends up on separate 
drives, is also possible


storage technology is headed in the direction of having the system do more 
and more of the layout decisions, and re-stripe the array as conditions 
change (similar to what md can already do when enlarging raid5/6 arrays). 
but unless you want to eventually put all that decision logic into the md 
layer, you should make it possible for other layers to query it to find out 
what's what, and then they can give directions for what they want to have 
happen.
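
as a concrete sketch of the sort of query that's already possible by hand today 
(device names and numbers here are just examples, and I'm assuming a reasonably 
current kernel and mdadm): md exposes its chunk size, and a filesystem like xfs 
can be handed the stripe geometry at mkfs time:

  cat /sys/block/md0/md/chunk_size         # chunk size in bytes
  mdadm --detail /dev/md0 | grep -i chunk  # same info, human readable
  mkfs.xfs -d su=64k,sw=4 /dev/md0         # 64K chunk, 4 data disks (6 disk raid 6)

something automatic would just be the layers doing this handshake themselves 
instead of the admin copying the numbers across.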


so for several reasons I don't see this as something that's deserving of 
an automatic 'no'


David Lang


Re: limits on raid

2007-06-22 Thread David Greaves

Bill Davidsen wrote:

David Greaves wrote:

[EMAIL PROTECTED] wrote:

On Fri, 22 Jun 2007, David Greaves wrote:
If you end up 'fiddling' in md because someone specified 
--assume-clean on a raid5 [in this case just to save a few minutes 
*testing time* on a system with a heavily choked bus!] then that adds 
*even more* complexity and exception cases into all the stuff you 
described.


A few minutes? Are you reading the times people are seeing with 
multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days. 

Yes. But we are talking initial creation here.

And as soon as you believe that the array is actually usable you cut 
that rebuild rate, perhaps in half, and get dog-slow performance from 
the array. It's usable in the sense that reads and writes work, but for 
useful work it's pretty painful. You either fail to understand the 
magnitude of the problem or wish to trivialize it for some reason.

I do understand the problem and I'm not trying to trivialise it :)

I _suggested_ that it's worth thinking about things rather than jumping in to 
say "oh, we can code up a clever algorithm that keeps track of which stripes have 
valid parity and which don't, and we can optimise the read/copy/write for valid 
stripes and use the raid6-type read-all/write-all for invalid stripes, and then 
we can write a bit extra on the check code to set the bitmaps..."


Phew - and that lets us run the array at semi-degraded performance (raid6-like) 
for 3 days rather than either waiting before we put it into production or 
running it very slowly.

Now we run this system for 3 years and we saved 3 days - hmmm IS IT WORTH IT?

What happens in those 3 years when we have a disk fail? The solution doesn't 
apply then - it's 3 days to rebuild - like it or not.


By delaying parity computation until the first write to a stripe, only 
the growth of a filesystem is slowed, and all data are protected without 
waiting for the lengthy check. The rebuild speed can be set very low, 
because on-demand rebuild will do most of the work.

I am not saying you are wrong.
I ask merely if the balance of benefit outweighs the balance of complexity.

If the benefit were 24x7 then sure - eg using hardware assist in the raid calcs 
- very useful indeed.


I'm very much for the fs layer reading the lower block structure so I 
don't have to fiddle with arcane tuning parameters - yes, *please* 
help make xfs self-tuning!


Keeping life as straightforward as possible low down makes the upwards 
interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple 
because it rests on a simple device, even if the simple device is 
provided by LVM or md. And LVM and md can stay simple because they rest 
on simple devices, even if they are provided by PATA, SATA, nbd, etc. 
Independent layers make each layer more robust. If you want to 
compromise the layer separation, some approach like ZFS with full 
integration would seem to be promising. Note that layers allow 
specialized features at each point, trading integration for flexibility.


That's a simplistic summary.
You *can* loosely couple the layers. But you can enrich the interface and 
tightly couple them too - XFS is capable (I guess) of understanding md more 
fully than say ext2.
XFS would still work on a less 'talkative' block device where performance wasn't 
as important (USB flash maybe, dunno).



My feeling is that full integration and independent layers each have 
benefits; as you connect the layers to expose operational details you 
need to handle changes in those details, which would seem to make layers 
more complex.

Agreed.

What I'm looking for here is better performance in one 
particular layer, the md RAID5 layer. I like to avoid unnecessary 
complexity, but I feel that the current performance suggests room for 
improvement.


I agree there is room for improvement.
I suggest that it may be more fruitful to write a tool called raid5prepare
that writes zeroes/ones as appropriate to all component devices, after which you can 
use --assume-clean without concern. That tool could check whether the devices are 
scsi or whatever and take advantage of the hyperfast block writes that can be done.
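
A rough sketch of what that boils down to with today's tools (raid5prepare itself 
is hypothetical, the device names are examples, and this just uses plain dd rather 
than any clever SCSI fast-write commands):

  # zero every component in parallel, then assemble with the parity assumed valid
  for d in /dev/sd[b-g]1; do dd if=/dev/zero of=$d bs=1M & done; wait
  mdadm --create /dev/md0 --level=5 --raid-devices=6 --assume-clean /dev/sd[b-g]1

Since the parity of all-zero data is zero, the array genuinely is clean at that point.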


David


Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, Bill Davidsen wrote:

By delaying parity computation until the first write to a stripe, only the 
growth of a filesystem is slowed, and all data are protected without waiting 
for the lengthy check. The rebuild speed can be set very low, because 
on-demand rebuild will do most of the work.


 I'm very much for the fs layer reading the lower block structure so I
 don't have to fiddle with arcane tuning parameters - yes, *please* help
 make xfs self-tuning!

 Keeping life as straightforward as possible low down makes the upwards
 interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple because it 
rests on a simple device, even if the simple device is provided by LVM or 
md. And LVM and md can stay simple because they rest on simple devices, even 
if they are provided by PATA, SATA, nbd, etc. Independent layers make each 
layer more robust. If you want to compromise the layer separation, some 
approach like ZFS with full integration would seem to be promising. Note that 
layers allow specialized features at each point, trading integration for 
flexibility.


My feeling is that full integration and independent layers each have 
benefits; as you connect the layers to expose operational details you need to 
handle changes in those details, which would seem to make layers more 
complex. What I'm looking for here is better performance in one particular 
layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel 
that the current performance suggests room for improvement.


they both have benefits, but it shouldn't have to be either-or

if you build the separate layers and provide ways for the upper 
layers to query the lower layers to find out what's efficient, then you can 
have some upper layers that don't care about this and treat the lower 
layer as a simple block device, while other upper layers find out what 
sort of things are more efficient to do and use the same lower layer in a 
more complex manner


the alternative is to duplicate effort (and code) and have two codebases 
that try to do the same thing, one stand-alone and one as part of an 
integrated solution (and it gets even worse if there end up being multiple 
integrated solutions)


David Lang


Re: limits on raid

2007-06-21 Thread David Greaves

Neil Brown wrote:


This isn't quite right.

Thanks :)


Firstly, it is mdadm which decided to make one drive a 'spare' for
raid5, not the kernel.
Secondly, it only applies to raid5, not raid6 or raid1 or raid10.

For raid6, the initial resync (just like the resync after an unclean
shutdown) reads all the data blocks, and writes all the P and Q
blocks.
raid5 can do that, but it is faster to read all but one disk, and
write to that one disk.


How about this:

Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity 
is what's called the initial resync.


Raid level 0 doesn't have any redundancy so there is no initial resync.

For raid levels 1,4,6 and 10 mdadm creates the array and starts a resync. The 
raid algorithm then reads the data blocks and writes the appropriate 
parity/mirror (P+Q) blocks across all the relevant disks. There is some sample 
output in a section below...


For raid5 there is an optimisation: mdadm takes one of the disks and marks it as 
'spare'; it then creates the array in degraded mode. The kernel marks the spare 
disk as 'rebuilding' and starts to read from the 'good' disks, calculates the 
parity to determine what should be on the spare disk, and then just writes it out.


Once all this is done the array is clean and all disks are active.

This can take quite a time and the array is not fully resilient whilst this is 
happening (it is however fully useable).
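
For example (hypothetical device names), creating a 4-disk raid5 and watching the 
kernel rebuild the 'spare':

  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]1
  cat /proc/mdstat   # shows three active disks plus one being recovered, with a % and ETA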






Also is raid4 like raid5 or raid6 in this respect?


Re: limits on raid

2007-06-21 Thread Mark Lord

[EMAIL PROTECTED] wrote:

On Thu, 21 Jun 2007, David Chinner wrote:


On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksums,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the causes of
other failures are before working out how to handle them


The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.


one of the 'killer features' of zfs is that it does checksums of every 
file on disk, so clearly many people don't consider the disk infallible.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption


No, all of those checksums are to detect *filesystem* corruption,
not device corruption (a mere side-effect).

as david C points out there are many points in the path where the data 
could get corrupted besides on the platter.


Yup, that too.

But drives either return good data, or an error.

Cheers


Re: limits on raid

2007-06-21 Thread Nix
On 21 Jun 2007, Neil Brown stated:
 I have that - apparently naive - idea that drives use strong checksums,
 and will never return bad data, only good data or an error.  If this
 isn't right, then it would really help to understand what the causes of
 other failures are before working out how to handle them

Look at the section `Disks and errors' in Val Henson's excellent report
on last year's filesystems workshop: http://lwn.net/Articles/190223/.
Most of the error modes given there lead to valid checksums and wrong
data...

(while you're there, read the first part too :) )

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously


Re: limits on raid

2007-06-21 Thread Bill Davidsen
I didn't get a comment on my suggestion for a quick and dirty fix for 
--assume-clean issues...


Bill Davidsen wrote:

Neil Brown wrote:

On Thursday June 14, [EMAIL PROTECTED] wrote:
 

it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when making a new array? couldn't 
it just zero all the drives instead? (or better still just record 
most of the space as 'unused' and initialize it as it starts using 
it?)



Yes, it could zero all the drives first.  But that would take the same
length of time (unless p/q generation was very very slow), and you
wouldn't be able to start writing data until it had finished.
You can dd /dev/zero onto all drives and then create the array with
--assume-clean if you want to.  You could even write a shell script to
do it for you.

Yes, you could record which space is used vs unused, but I really
don't think the complexity is worth it.

  
How about a simple solution which would get an array on line and still 
be safe? All it would take is a flag which forced reconstruct writes 
for RAID-5. You could set it with an option, or automatically if 
someone puts --assume-clean with --create, and leave it in the superblock 
until the first repair runs to completion. And for repair you could 
make some assumptions about bad parity not being caused by error but 
just being unwritten.


Thought 2: I think the unwritten bit is easier than you think: you 
only need it on parity blocks for RAID5, not on data blocks. When a 
write is done, if the bit is set do a reconstruct, write the parity 
block, and clear the bit. Keeping a bit per data block is madness, and 
appears to be unnecessary as well.
while I consider zfs to be ~80% hype, one advantage it could have 
(but I don't know if it has) is that since the filesystem and raid 
are integrated into one layer they can optimize the case where files 
are being written onto unallocated space, and instead of reading 
blocks from disk to calculate the parity they could just put zeros 
in the unallocated space, potentially speeding up the system by 
reducing the amount of disk I/O.



Certainly.  But the raid doesn't need to be tightly integrated
into the filesystem to achieve this.  The filesystem need only know
the geometry of the RAID and when it comes to write, it tries to write
full stripes at a time.  If that means writing some extra blocks full
of zeros, it can try to do that.  This would require a little bit
better communication between filesystem and raid, but not much.  If
anyone has a filesystem that they want to be able to talk to raid
better, they need only ask...
 
 
is there any way that linux would be able to do this sort of thing? 
or is it impossible due to the layering preventing the necessary 
knowledge from being in the right place?



Linux can do anything we want it to.  Interfaces can be changed.  All
it takes is a fairly well defined requirement, and the will to make it
happen (and some technical expertise, and lots of time  and
coffee?).
  
Well, I gave you two thoughts, one which would be slow until a repair 
but sounds easy to do, and one which is slightly harder but works 
better and minimizes performance impact.





--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: limits on raid

2007-06-19 Thread Phillip Susi

[EMAIL PROTECTED] wrote:

one channel, 2 OS drives plus the 45 drives in the array.


Huh?  You can only have 16 devices on a scsi bus, counting the host 
adapter.  And I don't think you can even manage that many reliably with 
the newer higher speed versions, at least not without some very special 
cables.


yes I realize that there will be bottlenecks with this, the large 
capacity is to handle longer history (it's going to be a 30TB circular 
buffer being fed by a pair of OC-12 links)


Building one of those nice packet sniffers for the NSA to install on 
AT&T's network, eh? ;)





Re: limits on raid

2007-06-19 Thread Lennart Sorensen
On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote:
 yes, I'm using promise drive shelves, I have them configured to export 
 the 15 drives as 15 LUNs on a single ID.
 
 I'm going to be using this as a huge circular buffer that will just be 
 overwritten eventually 99% of the time, but once in a while I will need to 
 go back into the buffer and extract and process the data.

I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time.  Well unless you end up
saturating the PCI bus instead.

hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one scsi channel on hardware raid, it will still be limited).

--
Len Sorensen


Re: limits on raid

2007-06-19 Thread david

On Tue, 19 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote:

yes, I'm using promise drive shelves, I have them configured to export
the 15 drives as 15 LUNs on a single ID.

I'm going to be using this as a huge circular buffer that will just be
overwritten eventually 99% of the time, but once in a while I will need to
go back into the buffer and extract and process the data.


I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time.  Well unless you end up
saturating the PCI bus instead.

hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one scsi channel on hardware raid, it will still be limited).


I fully realize that the channel will be the bottleneck; I just didn't 
understand what /proc/mdstat was telling me. I thought that it was telling 
me that the resync was processing 5M/sec, not that it was writing 5M/sec 
on each of the two parity locations.


David Lang


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:
in my case it takes 2+ days to resync the array before I can do any 
performance testing with it. for some reason it's only doing the rebuild 
at ~5M/sec (even though I've increased the min and max rebuild speeds 
and a dd to the array seems to be ~44M/sec, even during the rebuild)


With performance like that, it sounds like you're saturating a bus 
somewhere along the line.  If you're using scsi, for instance, it's very 
easy for a long chain of drives to overwhelm a channel.  You might also 
want to consider some other RAID layouts like 1+0 or 5+0 depending upon 
your space vs. reliability needs.


--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 in my case it takes 2+ days to resync the array before I can do any
 performance testing with it. for some reason it's only doing the rebuild
 at ~5M/sec (even though I've increased the min and max rebuild speeds and
 a dd to the array seems to be ~44M/sec, even during the rebuild)


With performance like that, it sounds like you're saturating a bus somewhere 
along the line.  If you're using scsi, for instance, it's very easy for a 
long chain of drives to overwhelm a channel.  You might also want to consider 
some other RAID layouts like 1+0 or 5+0 depending upon your space vs. 
reliability needs.


I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct, how can I fire 
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the 
reconstruct to ~4M/sec?


I'm putting 10x as much data through the bus at that point, which would seem 
to prove that it's not the bus that's saturated.


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct, how can I fire
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
reconstruct to ~4M/sec?

I'm putting 10x as much data through the bus at that point, which would seem
to prove that it's not the bus that's saturated.


dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or raid6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s.  If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting.  Seems like you are indeed very much saturating a bus
somewhere.  The numbers certainly agree with that theory.

What kind of setup are the drives connected to?


simple ultra-wide SCSI to a single controller.

I didn't realize that the rate reported by /proc/mdstat was the write 
speed that was taking place; I thought it was the total data rate (reads 
+ writes). the next time this message gets changed it would be a good 
thing to clarify this.


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 I plan to test the different configurations.

 however, if I was saturating the bus with the reconstruct, how can I fire
 off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
 reconstruct to ~4M/sec?

 I'm putting 10x as much data through the bus at that point, which would seem
 to prove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you looked at a running iostat while 
all this is going on?  Try it out - add up the kb/s from each drive and see 
how close you are to your maximum theoretical IO.


I didn't try iostat, but I did look at vmstat, and there the numbers look even 
worse: the bo column is ~500 for the resync by itself, but with the dd 
it's ~50,000. when I get access to the box again I'll try iostat to get 
more details



Also, how's your CPU utilization?


~30% of one cpu for the raid 6 thread, ~5% of one cpu for the resync 
thread


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:

simple ultra-wide SCSI to a single controller.


Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausible for 320 scsi.


yes, sorry, ultra 320 wide.


I didn't realize that the rate reported by /proc/mdstat was the write
speed that was taking place; I thought it was the total data rate (reads
+ writes). the next time this message gets changed it would be a good
thing to clarify this.


Well I suppose it could make sense to show the rate of rebuild, which you can
then compare against the total size of the raid, or you can have the rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higher speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take forever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.


as I read it right now what happens is the worst of the options: you show 
the total size of the array for the amount of work that needs to be done, 
but then show only the write speed for the rate of progress being made 
through the job.


total rebuild time was estimated at ~3200 min
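
(as a rough cross-check, assuming the ~750G members from earlier in the thread: 
750,000MB / 4MB/sec is about 187,000 seconds, or roughly 3100 minutes - right in 
line with the ~3200 min estimate, so the number shown really is the per-device 
write rate and not the aggregate read+write throughput)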

David Lang


Re: limits on raid

2007-06-18 Thread Brendan Conoboy

[EMAIL PROTECTED] wrote:

yes, sorry, ultra 320 wide.


Exactly how many channels and drives?

--
Brendan Conoboy / Red Hat, Inc. / [EMAIL PROTECTED]


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.

yes I realize that there will be bottlenecks with this, the large capacity 
is to handle longer history (it's going to be a 30TB circular buffer being 
fed by a pair of OC-12 links)


it appears that my big mistake was not understanding what /proc/mdstat is 
telling me.


David Lang


Re: limits on raid

2007-06-17 Thread Andi Kleen
Neil Brown [EMAIL PROTECTED] writes:
 
 Having the filesystem duplicate data, store checksums, and be able to
 find a different copy if the first one it chose was bad is very
 sensible and cannot be done by just putting the filesystem on RAID.

Apropos checksums: since RAID5 copies/xors anyway, it would
be nice to combine that with the file system. During the xor
a simple checksum could be computed in parallel and stored
in the file system.

And the copy/checksum passes will hopefully at some
point be combined.
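
Roughly (notation mine, purely illustrative): the parity pass already touches
every data chunk,

  P = D_1 xor D_2 xor ... xor D_(n-1)

so while each D_i is in the cache a per-chunk checksum (crc32(D_i), say) could
be accumulated in the same loop and handed up to the filesystem, instead of
needing a separate read pass later.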

-Andi


Re: limits on raid

2007-06-17 Thread Wakko Warner
dean gaudet wrote:
 On Sat, 16 Jun 2007, Wakko Warner wrote:
 
  When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
  always slowed the system down when booting up.  Quite significantly I must
  say.  I wait until I can login and change the rebuild max speed to slow it
  down while I'm using it.   But that is another thing.
 
 i use an external write-intent bitmap on a raid1 to avoid this... you 
 could use internal bitmap but that slows down i/o too much for my tastes.  
 i also use an external xfs journal for the same reason.  2 disk raid1 for 
 root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
 common.

I must remember this if I have to rebuild the array.  Although I'm
considering moving to a hardware raid solution when I upgrade my storage.

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
 Got Gas???


Re: limits on raid

2007-06-17 Thread Bill Davidsen

Neil Brown wrote:

On Thursday June 14, [EMAIL PROTECTED] wrote:
  

On Fri, 15 Jun 2007, Neil Brown wrote:



On Thursday June 14, [EMAIL PROTECTED] wrote:
  

what is the limit for the number of devices that can be in a single array?

I'm trying to build a 45x750G array and want to experiment with the
different configurations. I'm trying to start with raid6, but mdadm is
complaining about an invalid number of drives

David Lang


man mdadm  search for limits.  (forgive typos).
  

thanks.

why does it still default to the old format after so many new versions? 
(by the way, the documentation said 28 devices, but I couldn't get it to 
accept more than 27)



Dunno - maybe I can't count...

  

it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when making a new array? couldn't it 
just zero all the drives instead? (or better still just record most of the 
space as 'unused' and initialize it as it starts using it?)



Yes, it could zero all the drives first.  But that would take the same
length of time (unless p/q generation was very very slow), and you
wouldn't be able to start writing data until it had finished.
You can dd /dev/zero onto all drives and then create the array with
--assume-clean if you want to.  You could even write a shell script to
do it for you.

Yes, you could record which space is used vs unused, but I really
don't think the complexity is worth it.

  
How about a simple solution which would get an array on line and still 
be safe? All it would take is a flag which forced reconstruct writes for 
RAID-5. You could set it with an option, or automatically if someone 
puts --assume-clean with --create, and leave it in the superblock until the 
first repair runs to completion. And for repair you could make some 
assumptions about bad parity not being caused by error but just being unwritten.


Thought 2: I think the unwritten bit is easier than you think: you only 
need it on parity blocks for RAID5, not on data blocks. When a write is 
done, if the bit is set do a reconstruct, write the parity block, and 
clear the bit. Keeping a bit per data block is madness, and appears to 
be unnecessary as well.
while I consider zfs to be ~80% hype, one advantage it could have (but I 
don't know if it has) is that since the filesystem and raid are integrated 
into one layer they can optimize the case where files are being written 
onto unallocated space, and instead of reading blocks from disk to 
calculate the parity they could just put zeros in the unallocated space, 
potentially speeding up the system by reducing the amount of disk I/O.



Certainly.  But the raid doesn't need to be tightly integrated
into the filesystem to achieve this.  The filesystem need only know
the geometry of the RAID and when it comes to write, it tries to write
full stripes at a time.  If that means writing some extra blocks full
of zeros, it can try to do that.  This would require a little bit
better communication between filesystem and raid, but not much.  If
anyone has a filesystem that they want to be able to talk to raid
better, they need only ask...
 
  
is there any way that linux would be able to do this sort of thing? or is 
it impossible due to the layering preventing the necessary knowledge from 
being in the right place?



Linux can do anything we want it to.  Interfaces can be changed.  All
it takes is a fairly well defined requirement, and the will to make it
happen (and some technical expertise, and lots of time  and
coffee?).
  
Well, I gave you two thoughts, one which would be slow until a repair 
but sounds easy to do, and one which is slightly harder but works better 
and minimizes performance impact.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: limits on raid

2007-06-17 Thread Wakko Warner
dean gaudet wrote:
 On Sun, 17 Jun 2007, Wakko Warner wrote:
 
   i use an external write-intent bitmap on a raid1 to avoid this... you 
   could use internal bitmap but that slows down i/o too much for my tastes. 

   i also use an external xfs journal for the same reason.  2 disk raid1 for 
   root/journal/bitmap, N disk raid5 for bulk storage.  no spindles in 
   common.
  
  I must remember this if I have to rebuild the array.  Although I'm
  considering moving to a hardware raid solution when I upgrade my storage.
 
 you can do it without a rebuild -- that's in fact how i did it the first 
 time.
 
 to add an external bitmap:
 
 mdadm --grow --bitmap /bitmapfile /dev/mdX
 
 plus add bitmap=/bitmapfile to mdadm.conf... as in:
 
 ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

I used evms to set up mine.  I have used mdadm in the past.  I use lvm on top
of it, which evms makes a little easier to maintain.  I have 3 arrays
total (only the raid5 was configured by evms, the other 2 raid1s were done
by hand)

 you can also easily move an ext3 journal to an external journal with 
 tune2fs (see man page).
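
 (roughly, with hypothetical device names, the sequence is:

  mke2fs -O journal_dev /dev/md1           # dedicate a small device to the journal
  tune2fs -O ^has_journal /dev/md4         # drop the internal journal (fs unmounted)
  tune2fs -j -J device=/dev/md1 /dev/md4   # attach the external journal

 -- see the tune2fs/mke2fs man pages for the details)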

I only have 2 ext3 file systems (One of which is mounted R/O since it's
full), all my others are reiserfs (v3).

What benefit would I gain by using an external journal and how big would it
need to be?

 if you use XFS it's a bit more of a challenge to convert from internal to 
 external, but see this thread:

I specifically didn't use XFS (or JFS) since neither one at the time could
be shrunk.

-- 
 Lab tests show that use of micro$oft causes cancer in lab animals
 Got Gas???


Re: limits on raid

2007-06-17 Thread David Chinner
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
 Combining these thoughts, it would make a lot of sense for the
 filesystem to be able to say to the block device "That block looks
 wrong - can you find me another copy to try?".  That is an example of
 the sort of closer integration between filesystem and RAID that would
 make sense.

I think that this would only be useful on devices that store
discrete copies of the blocks on different devices, i.e. mirrors. If
it's an XOR-based RAID, you don't have another copy you can
retrieve

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: limits on raid

2007-06-16 Thread david

On Sat, 16 Jun 2007, Neil Brown wrote:


It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write.  But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you
have a bitmap) so you may as well resync at the start.

And why is it such a big deal anyway?  The initial resync doesn't stop
you from using the array.  I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync but is that really
likely?


in my case it takes 2+ days to resync the array before I can do any 
performance testing with it. for some reason it's only doing the rebuild 
at ~5M/sec (even though I've increased the min and max rebuild speeds and 
a dd to the array seems to be ~44M/sec, even during the rebuild)


I want to test several configurations, from a 45 disk raid6 to a 45 disk 
raid0. at 2-3 days per test (or longer, depending on the tests) this 
becomes a very slow process.


also, when a rebuild is slow enough (and has enough of a performance 
impact) it's not uncommon to want to operate in degraded mode just long 
enough to get to a maintenance window and then recreate the array and 
reload from backup.


David Lang


Re: limits on raid

2007-06-15 Thread Neil Brown
On Friday June 15, [EMAIL PROTECTED] wrote:
 
   As I understand the way
 raid works, when you write a block to the array, it will have to read all
 the other blocks in the stripe and recalculate the parity and write it out.

Your understanding is incomplete.
For raid5 on an array with more than 3 drives, if you attempt to write
a single block, it will:

 - read the current value of the block, and the parity block.
 - subtract the old value of the block from the parity, and add
   the new value.
 - write out the new data and the new parity.
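
Spelled out (illustrative notation), the single-block update is just

   new_parity = old_parity xor old_data xor new_data

i.e. two reads and two writes regardless of how wide the array is, whereas a
reconstruct-write recomputes parity as the xor of every data block in the
stripe.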

If the parity was wrong before, it will still be wrong.  If you then
lose a drive, you lose your data.

With the current implementation in md, this only affect RAID5.  RAID6
will always behave as you describe.  But I don't promise that won't
change with time.

It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write.  But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you
have a bitmap) so you may as well resync at the start.

And why is it such a big deal anyway?  The initial resync doesn't stop
you from using the array.  I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync but is that really
likely?

NeilBrown