Re: Software RAID when it works and when it doesn't

2007-10-23 Thread Alberto Alonso
On Tue, 2007-10-23 at 18:45 -0400, Bill Davidsen wrote:

> I'm not sure the timeouts are the problem, even if md did its own 
> timeout, it then needs a way to tell the driver (or device) to stop 
> retrying. I don't believe that's available, certainly not everywhere, 
> and anything other than everywhere would turn the md code into a nest of 
> exceptions.
> 

If we lose the ability to communicate with that drive I don't see it
as a problem (that's the whole point: we kick it out of the array). So
even if we can't tell the driver about the failure we are still OK; md
could successfully deal with misbehaved drivers.

Alberto




Re: [lvm-devel] [PATCH] lvm2 support for detecting v1.x MD superblocks

2007-10-23 Thread Mike Snitzer
On 10/23/07, Alasdair G Kergon <[EMAIL PROTECTED]> wrote:
> On Tue, Oct 23, 2007 at 11:32:56AM -0400, Mike Snitzer wrote:
> > I've tested the attached patch to work on MDs with v0.90.0, v1.0,
> > v1.1, and v1.2 superblocks.
>
> I'll apply this, thanks, but need to add comments (or reference) to explain
> what the hard-coded numbers are:
>
> sb_offset = (size - 8 * 2) & ~(4 * 2 - 1);
> etc.

All values are in terms of sectors; that is where the * 2 comes
from.  The v1.0 case follows the same model as MD_NEW_SIZE_SECTORS,
which is used for v0.90.0.  The difference is that the v1.0 superblock
is found "at least 8K, but less than 12K, from the end of the device".

The same switch statement is used in mdadm and is accompanied by the
following comment:

/*
 * Calculate the position of the superblock.
 * It is always aligned to a 4K boundary and
 * depending on minor_version, it can be:
 * 0: At least 8K, but less than 12K, from end of device
 * 1: At start of device
 * 2: 4K from start of device.
 */

Would it be sufficient to add that comment block above
v1_sb_offset()'s switch statement?
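
Just to make the arithmetic concrete (a throwaway illustration, not part
of the patch; the device sizes are made up):

#include <stdio.h>
#include <stdint.h>

/* reproduce the minor_version 0 (v1.0) math; everything is in 512-byte sectors */
static uint64_t v1_0_offset(uint64_t size)
{
    /* back off 16 sectors (8K) from the end, then round down to a 4K boundary */
    return (size - 8 * 2) & ~(uint64_t)(4 * 2 - 1);
}

int main(void)
{
    uint64_t sizes[] = { 1000000, 1000007 };
    for (int i = 0; i < 2; i++) {
        uint64_t off = v1_0_offset(sizes[i]);
        printf("size=%llu  sb_offset=%llu  (%llu sectors from end)\n",
               (unsigned long long)sizes[i],
               (unsigned long long)off,
               (unsigned long long)(sizes[i] - off));
    }
    return 0;
}

That prints 16 sectors (8K) from the end for the first size and 23 sectors
(11.5K) for the second, i.e. always at least 8K but less than 12K from the
end of the device, on a 4K boundary.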

thanks,
Mike


Re: [lvm-devel] [PATCH] lvm2 support for detecting v1.x MD superblocks

2007-10-23 Thread Alasdair G Kergon
On Tue, Oct 23, 2007 at 11:32:56AM -0400, Mike Snitzer wrote:
> I've tested the attached patch to work on MDs with v0.90.0, v1.0,
> v1.1, and v1.2 superblocks.
 
I'll apply this, thanks, but need to add comments (or reference) to explain
what the hard-coded numbers are:

sb_offset = (size - 8 * 2) & ~(4 * 2 - 1);
etc.

Alasdair
-- 
[EMAIL PROTECTED]


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:

> I don't agree completely.  I think the superblock location is a key
> issue, because if you have a superblock location which moves depending
> the filesystem or LVM you use to look at the partition (or full disk)
> then you need to be even more careful about how to poke at things.

This is the heart of the matter.  When you consider that each file
system and each volume management stack has a superblock, that some
store their superblocks at the end of devices and some at the beginning,
and that they can be stacked, then it becomes next to impossible to make sure
a stacked setup is never recognized incorrectly under any circumstance.
It might be possible if you use static device names, but our users
*long* ago complained very loudly when adding a new disk or removing a
bad disk caused their setup to fail to boot.  So, along came mount by
label and auto scans for superblocks.  Once you do that, you *really*
need all the superblocks at the same end of a device so when you stack
things, it always works properly.

> Michael> Another example is ext[234]fs - it does not touch first 512
> Michael> bytes of the device, so if there was an msdos filesystem
> Michael> there before, it will be recognized as such by many tools,
> Michael> and an attempt to mount it automatically will lead to at
> Michael> least scary output and nothing mounted, or in fsck doing
> Michael> fatal things to it in worst scenario.  Sure thing the first
> Michael> 512 bytes should be just cleared.. but that's another topic.
> 
> I would argue that ext[234] should be clearing those 512 bytes.  Why
> aren't they cleared?

Actually, I didn't think msdos used the first 512 bytes for the same
reason ext3 doesn't: space for a boot sector.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Sat, 2007-10-20 at 22:24 +0400, Michael Tokarev wrote:
> John Stoffel wrote:
> >> "Michael" == Michael Tokarev <[EMAIL PROTECTED]> writes:

> > As Doug says, and I agree strongly, you DO NOT want to have the
> > possibility of confusion and data loss, especially on bootup.  And
> 
> There are different point of views, and different settings etc.

Indeed, there are different points of view.  And with that in mind, I'll
just point out that my point of view is that of an engineer who is
responsible for all the legitimate md bugs in our products once tech
support has weeded out the "you tried to do what?" cases.  From that
point of view, I deal with *every* user's preferred use case, not any
single use case.

> For example, I once dealt with a linux user who was unable to
> use his disk partition, because his system (it was RedHat if I
> remember correctly) recognized some LVM volume on his disk (it
> was previously used with Windows) and tried to automatically
> activate it, thus making it "busy".

Yep, that can still happen today under certain circumstances.

>   What I'm talking about here
> is that any automatic activation of anything should be done with
> extreme care, using smart logic in the startup scripts if at
> all.

We do.  Unfortunately, there is no logic smart enough to recognize all
the possible user use cases that we've seen given the way things are
created now.

> The Doug's example - in my opinion anyway - shows wrong tools
> or bad logic in the startup sequence, not a general flaw in
> superblock location.

Well, one of the problems is that you can both use an md device as an
LVM physical volume and use an LVM logical volume as an md constituent
device.  Users have done both.

> For example, when one drive was almost dead and mdadm tried
> to bring the array up, the machine just hung for an unknown amount
> of time.  An inexperienced operator was there.  Instead of
> trying to teach him how to pass a parameter to the initramfs
> to stop it from trying to assemble the root array and then assemble
> it manually, I told him to pass "root=/dev/sda1" to the
> kernel.  Root mounts read-only, so it should be a safe thing
> to do - I only needed the root fs and a minimal set of services
> (which are even in the initramfs) just for it to boot up to SOME
> state where I could log in remotely and fix things later.

Umm, no.  Generally speaking (I can't speak for other distros), both
Fedora and RHEL remount root rw even when coming up in single user mode.
The only time the fs is left in ro mode is when it drops to a shell
during rc.sysinit as a result of a failed fs check.  And if you are
using an ext3 filesystem and things didn't go down clean, then you also
get a journal replay.  So, what happens when you think you've fixed
things, you reboot, and then, due to random chance, the ext3 fs check
gets the journal off the drive that wasn't mounted and replays things
again?  Could this overwrite your fixes?  Yep.  It could do all
sorts of bad things.  In fact, unless you do a full binary compare of
your constituent devices, you could have silent data corruption and just
never know about it.  You may get lucky and never *see* the
corruption, but it could well be there.  The only safe way to
reintegrate your raid after doing what you suggest is to kick the
unmounted drive out of the array before rebooting by using mdadm to zero
its superblock, boot up with a degraded raid1 array, and re-add the
kicked device back in.
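
In mdadm terms that amounts to something like this (untested sketch, device
names are just examples; sda1 is the member you mounted directly, sdb1 the
one you didn't):

  mdadm --zero-superblock /dev/sdb1   # kick the member that was NOT mounted raw
  # reboot; the array assembles degraded on /dev/sda1 alone
  mdadm /dev/md0 --add /dev/sdb1      # re-add it; the resync copies sda1's data back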

So, while you list several more examples of times when it was convenient
to do as you suggest, those situations can be handled in other ways
(although it may mean keeping a rescue CD handy at each location just for
cases like this) that are far safer IMO.

Now, putting all this back into the point of view I have to take, which
is what's the best default action for my customers, I'm sure you
can understand how a default setup and recommended usage that can leave
silent data corruption behind is simply a non-starter for me.  If someone
wants to do this manually, then go right ahead.  But as for what we do by
default when the user asks us to create a raid array, we really need to
be on superblock 1.1 or 1.2 (although we aren't yet; we've waited for
the version 1 superblock issues to be ironed out and will switch in a
future release).
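
(For anyone following along: mdadm lets you pick the superblock format
explicitly at creation time, so nothing here depends on the distro
default.  Something along these lines, with placeholder device names:

  mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1

does it today; only the default is at issue.)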

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-23 Thread Doug Ledford
On Tue, 2007-10-23 at 21:21 +0200, Michal Soltys wrote:
> Doug Ledford wrote:
> > 
> > Well, first I was thinking of files in the few hundreds of megabytes
> > each to gigabytes each, and when they are streamed, they are streamed at
> > a rate much lower than the full speed of the array, but still at a fast
> > rate.  How parallel the reads are then would tend to be a function of
> > chunk size versus streaming rate. 
> 
> Ahh, I see now. Thanks for the explanation.
> 
> I wonder though, if setting large readahead would help, if you used larger 
> chunk size. Assuming other options are not possible - i.e. streaming from 
> larger buffer, while reading to it in a full stripe width at least.

Probably not.  All my trial and error in the past with raid5 arrays and
various situations that would cause pathological worst case behavior
showed that once reads themselves reach 16k in size, and are sequential
in nature, then the disk firmware's read ahead kicks in and your
performance stays about the same regardless of increasing your OS read
ahead.  In a nutshell, once you've convinced the disk firmware that you
are going to be reading some data sequentially, it does the rest.  With
a large stripe size (say 256k+), you'll trigger this firmware read ahead
fairly early on in reading any given stripe, so you really don't buy
much by reading the next stripe before you need it, and in fact can end
up wasting a lot of RAM trying to do so, hurting overall performance.
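
(If anyone wants to repeat the experiment, the knob is cheap to poke at;
/dev/md0 is just an example device:

  blockdev --getra /dev/md0       # current read-ahead, in 512-byte sectors
  blockdev --setra 8192 /dev/md0  # try 4MB of read-ahead

In my testing it simply stopped mattering once the sequential reads were
big enough for the drive firmware's read ahead to take over.)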

> > 
> > I'm not familiar with the benchmark you are referring to.
> > 
> 
> I was thinking about 
> http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html
> 
> with the small discussion that happened after that.
-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote:
> John Stoffel wrote:
> > Why do we have three different positions for storing the superblock?  
> >   
> Why do you suggest changing anything until you get the answer to this 
> question? If you don't understand why there are three locations, perhaps 
> that would be a good initial investigation.
> 
> Clearly the short answer is that they reflect three stages of Neil's 
> thinking on the topic, and I would bet that he had a good reason for 
> moving the superblock when he did it.

I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of
the device) is to satisfy people that want to get at their raid1 data
without bringing up the device or using a loop mount with an offset.
Version 1.1, at the beginning of the device, is to prevent accidental
access to a device when the raid array doesn't come up.  And version 1.2
(4k from the beginning of the device) would be suitable for those times
when you want to embed a boot sector at the very beginning of the device
(which really only needs 512 bytes, but a 4k offset is as easy to deal
with as anything else).  From the standpoint of wanting to make sure an
array is suitable for embedding a boot sector, the 1.2 superblock may be
the best default.
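
(As an aside, you can check which variant an existing array uses with
mdadm; /dev/md0 and /dev/sda1 are just placeholder names:

  mdadm --detail /dev/md0     # the "Version :" line shows the superblock format
  mdadm --examine /dev/sda1   # the same, read from a member device's superblock
)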

> Since you have to support all of them or break existing arrays, and they 
> all use the same format so there's no saving of code size to mention, 
> why even bring this up?
> 
-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

Justin Piszcz wrote:



On Fri, 19 Oct 2007, John Stoffel wrote:


"Justin" == Justin Piszcz <[EMAIL PROTECTED]> writes:


Justin> On Fri, 19 Oct 2007, John Stoffel wrote:



So,

Is it time to start thinking about deprecating the old 0.9, 1.0 and
1.1 formats to just standardize on the 1.2 format?  What are the
issues surrounding this?

It's certainly easy enough to change mdadm to default to the 1.2
format and to require a --force switch to  allow use of the older
formats.

I keep seeing that we support these old formats, and it's never been
clear to me why we have four different ones available.  Why can't we
start defining the canonical format for Linux RAID metadata?

Thanks,
John
[EMAIL PROTECTED]



Justin> I hope 00.90.03 is not deprecated, LILO cannot boot off of
Justin> anything else!

Are you sure?  I find that GRUB is much easier to use and set up than
LILO these days.  But hey, just dropping down to support 00.90.03 and
1.2 formats would be fine too.  Let's just lessen the confusion if at
all possible.

John



I am sure; I submitted a bug report to the LILO developer, and he
acknowledged the bug, but I don't know if it was fixed.


I have not tried GRUB with a RAID1 setup yet.


Works fine.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

John Stoffel wrote:

"Michael" == Michael Tokarev <[EMAIL PROTECTED]> writes:



Michael> Doug Ledford wrote:
Michael> []
  

1.0, 1.1, and 1.2 are the same format, just in different positions on
the disk.  Of the three, the 1.1 format is the safest to use since it
won't allow you to accidentally have some sort of metadata between the
beginning of the disk and the raid superblock (such as an lvm2
superblock), and hence whenever the raid array isn't up, you won't be
able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worst-case
situations, I've seen lvm2 find a superblock on one RAID1 array
member when the RAID1 array was down; the system came up, you used the
system, the two copies of the raid array were made drastically
inconsistent, then at the next reboot the situation that prevented the
RAID1 from starting was resolved, and it never knew it had failed to start
last time, and the two inconsistent members were put back into a clean
array.)  So, deprecating any of these is not really helpful.  And you
need to keep the old 0.90 format around for back compatibility with
thousands of existing raid arrays.
  


Michael> Well, I strongly, completely disagree.  You described a
Michael> real-world situation, and that's unfortunate, BUT: for at
Michael> least raid1, there ARE cases, pretty valid ones, when one
Michael> NEEDS to mount the filesystem without bringing up raid.
Michael> Raid1 allows that.

Please describe one such case.  There have certainly been hacks
of various RAID systems on other OSes such as Solaris where the VxVM
and/or Solstice DiskSuite allowed you to encapsulate an existing
partition into a RAID array.  


But in my experience (and I'm a professional sysadm... :-) it's not
really all that useful, and can lead to problems like those described
by Doug.  


If you are going to mirror an existing filesystem, then by definition
you have a second disk or partition available for the purpose.  So you
would merely set up the new RAID1, in degraded mode, using the new
partition as the base.  Then you copy the data over to the new RAID1
device, change your boot setup, and reboot.

Once that is done, you can then add the original partition into the
RAID1 array.  
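
Roughly, and only as a sketch (device names, filesystem and boot loader
details are whatever applies to your setup):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
  mkfs.ext3 /dev/md0                 # new filesystem on the degraded mirror
  # mount it, copy the data over, update fstab and the boot loader, reboot
  mdadm /dev/md0 --add /dev/sda1     # finally fold the original partition in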


As Doug says, and I agree strongly, you DO NOT want to have the
possibility of confusion and data loss, especially on bootup.  And
this leads to the heart of my initial post on this matter, that the
confusion of having four different variations of RAID superblocks is
bad.  We should deprecate them down to just two, the old 0.90 format,
and the new 1.x format at the start of the RAID volume.
  


Perhaps I am misreading you here: when you say "deprecate them down", do
you mean the Adrian Bunk method of putting in a printk scolding the
administrator and then removing the feature a version later, or did you
mean "deprecate all but two", which clearly doesn't suggest removing the
capability at all?


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

Doug Ledford wrote:

On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote:
  

On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:


And if putting the superblock at the end is problematic, why is it the
default?  Shouldn't version 1.1 be the default?  
  

In my opinion, having the superblock *only* at the end (e.g. the 0.90
format) is the best option.

It allows one to mount the disk separately (in case of RAID 1), if the
MD superblock is corrupt or you just want to get easily at the raw data.



Bad reasoning.  It's the reason that the default is at the end of the
device, but that was a bad decision made by Ingo long, long ago in a
galaxy far, far away.

The simple fact of the matter is there are only two types of raid devices
for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
and those that don't (raid1, linear).

For the purposes of this issue, there are only two states we care about:
the raid array works or doesn't work.

If the raid array works, then you *only* want the system to access the
data via the raid array.  If the raid array doesn't work, then for the
fragmented case you *never* want the system to see any of the data from
the raid array (such as an ext3 superblock) or a subsequent fsck could
see a valid superblock and actually start a filesystem scan on the raw
device, and end up hosing the filesystem beyond all repair after it hits
the first chunk size break (although in practice this is usually a
situation where fsck declares the filesystem so corrupt that it refuses
to touch it; that's leaving an awful lot to chance, and you really don't
want fsck to *ever* see that superblock).

If the raid array is raid1, then the raid array should *never* fail to
start unless all disks are missing (in which case there is no raw device
to access anyway).  The very few failure types that will cause the raid
array to not start automatically *and* still have an intact copy of the
data usually happen when the raid array is perfectly healthy, in which
case automatically finding a constituent device when the raid array
failed to start is exactly the *wrong* thing to do (for instance, you
enable SELinux on a machine and it hasn't been relabeled and the raid
array fails to start because /dev/md can't be created because of
an SELinux denial...all the raid1 members are still there, but if you
touch a single one of them, then you run the risk of creating silent
data corruption).

It really boils down to this: for any reason that a raid array might
fail to start, you *never* want to touch the underlying data until
someone has taken manual measures to figure out why it didn't start and
corrected the problem.  Putting the superblock in front of the data does
not prevent manual measures (such as recreating superblocks) from
getting at the data.  But, putting superblocks at the end leaves the
door open for accidental access via constituent devices when you
*really* don't want that to happen.
  


You didn't mention some ill-behaved application using the raw device
(i.e. a database) writing just a little more than it should and destroying
the superblock.

So, no, the default should *not* be at the end of the device.

  

You make a convincing argument.

As to the people who complained exactly because of this feature, LVM has
two mechanisms to protect against accessing PVs on the raw disks (the
ignore raid components option and the filter - I always set filters when
using LVM on top of MD).

regards,
iustin




--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-10-23 Thread Bill Davidsen

John Stoffel wrote:
Why do we have three different positions for storing the superblock?  
  
Why do you suggest changing anything until you get the answer to this 
question? If you don't understand why there are three locations, perhaps 
that would be a good initial investigation.


Clearly the short answer is that they reflect three stages of Neil's 
thinking on the topic, and I would bet that he had a good reason for 
moving the superblock when he did it.


Since you have to support all of them or break existing arrays, and they 
all use the same format so there's no saving of code size to mention, 
why even bring this up?


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Software RAID when it works and when it doesn't

2007-10-23 Thread Bill Davidsen

Alberto Alonso wrote:

On Thu, 2007-10-18 at 17:26 +0200, Goswin von Brederlow wrote:
  

Mike Accetta <[EMAIL PROTECTED]> writes:



  

What I would like to see is a timeout driven fallback mechanism. If
one mirror does not return the requested data within a certain time
(say 1 second) then the request should be duplicated on the other
mirror. If the first mirror later unchokes then it remains in the
raid, if it fails it gets removed. But (at least reads) should not
have to wait for that process.

Even better would be if some write delay could also be used. The still
working mirror would get an increase in its serial (so on reboot you
know one disk is newer). If the choking mirror unchokes then it can
write back all the delayed data and also increase its serial to
match. Otherwise it gets really failed. But you might have to use
bitmaps for this or the cache size would limit its usefulness.

MfG
Goswin



I think a timeout on both reads and writes is a must.  Basically, I
believe that all of the problems I've encountered using software
raid would have been resolved by a timeout within the md code.

This will keep a server from crashing/hanging when the underlying 
driver doesn't properly handle hard drive problems. MD can be 
smarter than the "dumb" drivers.


Just my thoughts though, as I've never got an answer as to whether or
not md can implement its own timeouts.


I'm not sure the timeouts are the problem, even if md did its own 
timeout, it then needs a way to tell the driver (or device) to stop 
retrying. I don't believe that's available, certainly not everywhere, 
and anything other than everywhere would turn the md code into a nest of 
exceptions.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-23 Thread Bill Davidsen

Louis-David Mitterrand wrote:

On Tue, Oct 09, 2007 at 01:48:50PM +0400, Michael Tokarev wrote:
  

There still is - at least for ext[23].  Even offline resizers
can't do resizes from any size to any size; extfs developers recommend
recreating the filesystem anyway if the size changes significantly.
I'm too lazy to find a reference now, but it has been mentioned here
on linux-raid at least this year.  It's sorta like fat (yea, that
ms-dog filesystem) - when you resize it from, say, 501Mb to 999Mb,
everything is ok, but if you want to go from 501Mb to 1Gb+1, you
have to recreate almost all the data structures because the sizes of
all the internal fields change - and there it's much safer to just
re-create it from scratch than to try to modify it in place.
Sure it's much better for extfs, but the point is still the same.



I'll just mention that I once resized a multi-Tera ext3 filesystem and
it took 8 hours+; a comparable XFS online resize lasted all of 10
seconds!


Because of the different way these file systems do things, there is no 
comparable resize, at least in terms of work to be done. For many 
systems R/W operations are more common than resize, so the F/S type is 
selected to optimize that. ;-)


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-23 Thread Michal Soltys

Doug Ledford wrote:


Well, first I was thinking of files in the few hundreds of megabytes
each to gigabytes each, and when they are streamed, they are streamed at
a rate much lower than the full speed of the array, but still at a fast
rate.  How parallel the reads are then would tend to be a function of
chunk size versus streaming rate. 


Ahh, I see now. Thanks for the explanation.

I wonder though, if setting large readahead would help, if you used larger 
chunk size. Assuming other options are not possible - i.e. streaming from 
larger buffer, while reading to it in a full stripe width at least.




I'm not familiar with the benchmark you are referring to.



I was thinking about 
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html


with the small discussion that happened after that.


Re: async_tx: get best channel

2007-10-23 Thread Dan Williams
On Fri, 2007-10-19 at 05:23 -0700, Yuri Tikhonov wrote:
> 
>  Hello Dan,

Hi Yuri, sorry it has taken me so long to get back to you...
> 
>  I have a suggestion regarding the async_tx_find_channel() procedure.
> 
>  First, a little introduction. Some processors (e.g. ppc440spe) have
> several DMA engines (say DMA1 and DMA2) which are capable of performing
> the same type of operation, say XOR. The DMA2 engine may process the XOR
> operation faster than the DMA1 engine, but DMA2 (which is faster) has
> some restrictions for the source operand addresses, whereas there are no
> such restrictions for DMA1 (which is slower).  So the question is, how
> may ASYNC_TX select the DMA engine which will be the most effective for
> the given tx operation?

>  In the example just described this means: if the faster engine, DMA2,
> may process the tx operation with the given source operand addresses,
> then we select DMA2; if the given source operand addresses cannot be
> processed with DMA2, then we select the slower engine, DMA1.
> 
>  I see the following way for introducing such functionality.
> 
>  We may introduce an additional method in struct dma_device (let's call
> it device_estimate()) which would take the following as the arguments:
> --- the list of sources to be processed during the given tx,
> --- the type of operation (XOR, COPY, ...),
> --- perhaps something else,
>  and then estimate the effectiveness of processing this tx on the given
> channel.
>  The async_tx_find_channel() function should call the device_estimate()
> method for each registered dma channel and then select the most
> effective one.
>  The architecture specific ADMA driver will be responsible for returning
> the greatest value from the device_estimate() method for the channel
> which will be the most effective for this given tx.
> 
>  What are your thoughts regarding this? Do you see any other effective
> ways for enhancing ASYNC_TX with such functionality?

The problem with moving this test to async_tx_find_channel() is that it
imposes extra overhead in the fast path.  It would be best if we could
keep all these decisions in the slow path, or at least hide it from
architectures that do not need to implement it.  The thing that makes
this tricky is the fact that the speed is based on the source address...

One question: what are the source address restrictions - are they related
to high memory?  My thought is that MD usually only operates on GFP_KERNEL
memory but sometimes sees high memory when copying data into and out of
the cache.  You might be able to achieve your use case by disabling
(hiding) the XOR capability on the channels used for copying.  This will
cause async_tx to switch the operation from the high-memory-capable copy
channel to the fast low-memory XOR channel.

Another way to approach this would be to implement architecture specific
definitions of dma_channel_add_remove() and async_tx_rebalance().  This
will bypass the default allocation scheme and allow you to assign the
fastest channel to an operation, but it still does not allow for dynamic
selection based on source/destination address...
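
Just so we are picturing the same shape of hook, I read your proposal as
roughly the following (purely hypothetical, nothing like this exists in
the current API):

  /* hypothetical addition to struct dma_device: higher score = better fit */
  int (*device_estimate)(struct dma_chan *chan,
                         enum dma_transaction_type type,
                         struct page **src_list, int src_cnt);

with async_tx_find_channel() walking the registered channels, calling it,
and keeping the highest scorer.  That is exactly the per-operation work in
the fast path that worries me.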

> 
>  Regards, Yuri
> 
Regards,
Dan


[PATCH] lvm2 support for detecting v1.x MD superblocks

2007-10-23 Thread Mike Snitzer
lvm2's MD v1.0 superblock detection doesn't work at all (because it
doesn't use v1 sb offsets).

I've tested the attached patch to work on MDs with v0.90.0, v1.0,
v1.1, and v1.2 superblocks.

please advise, thanks.
Mike
Index: lib/device/dev-md.c
===================================================================
RCS file: /cvs/lvm2/LVM2/lib/device/dev-md.c,v
retrieving revision 1.5
diff -u -r1.5 dev-md.c
--- lib/device/dev-md.c	20 Aug 2007 20:55:25 -	1.5
+++ lib/device/dev-md.c	23 Oct 2007 15:17:57 -
@@ -25,6 +25,40 @@
 #define MD_NEW_SIZE_SECTORS(x) ((x & ~(MD_RESERVED_SECTORS - 1)) \
 - MD_RESERVED_SECTORS)
 
+int dev_has_md_sb(struct device *dev, uint64_t sb_offset, uint64_t *sb)
+{
+	int ret = 0;	
+	uint32_t md_magic;
+	/* Version 1 is little endian; version 0.90.0 is machine endian */
+	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
+	((md_magic == xlate32(MD_SB_MAGIC)) ||
+	 (md_magic == MD_SB_MAGIC))) {
+		if (sb)
+			*sb = sb_offset;
+		ret = 1;
+	}
+	return ret;
+}
+
+uint64_t v1_sb_offset(uint64_t size, int minor_version) {
+	uint64_t sb_offset;
+	switch(minor_version) {
+	case 0:
+		sb_offset = size;
+		sb_offset -= 8*2;
+		sb_offset &= ~(4*2-1);
+		break;
+	case 1:
+		sb_offset = 0;
+		break;
+	case 2:
+		sb_offset = 4*2;
+		break;
+	}
+	sb_offset <<= SECTOR_SHIFT;
+	return sb_offset;
+}
+
 /*
  * Returns -1 on error
  */
@@ -35,7 +69,6 @@
 #ifdef linux
 
 	uint64_t size, sb_offset;
-	uint32_t md_magic;
 
 	if (!dev_get_size(dev, &size)) {
 		stack;
@@ -50,16 +83,20 @@
 		return -1;
 	}
 
-	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
-
 	/* Check if it is an md component device. */
-	/* Version 1 is little endian; version 0.90.0 is machine endian */
-	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
-	((md_magic == xlate32(MD_SB_MAGIC)) ||
-	 (md_magic == MD_SB_MAGIC))) {
-		if (sb)
-			*sb = sb_offset;
+	/* Version 0.90.0 */
+	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
+	if (dev_has_md_sb(dev, sb_offset, sb)) {
 		ret = 1;
+	} else {
+		/* Version 1, try v1.0 -> v1.2 */
+		int minor;
+		for (minor = 0; minor <= 2; minor++) {
+			if (dev_has_md_sb(dev, v1_sb_offset(size, minor), sb)) {
+				ret = 1;
+				break;
+			}
+		}
 	}
 
 	if (!dev_close(dev))