Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-13 Thread Ric Wheeler



Guy Watkins wrote:

} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper
} development; [EMAIL PROTECTED]; [EMAIL PROTECTED];
} linux-raid@vger.kernel.org; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for
} devices, filesystems, and dm/md.
} 
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

}  [EMAIL PROTECTED] wrote:
}   On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
}  
}   All of the high end arrays have non-volatile cache (read, on power loss,
}   it is a promise that it will get all of your data out to permanent
}   storage). You don't need to ask this kind of array to drain the cache.
}   In fact, it might just ignore you if you send it that kind of request ;-)
}  
}   OK, I'll bite - how does the kernel know whether the other end of that
}   fiberchannel cable is attached to a DMX-3 or to some no-name product
}   that may not have the same assurances?  Is there an "I'm a high-end
}   array" bit in the sense data that I'm unaware of?
}  
} 
}  There are ways to query devices (think of hdparm -I in S-ATA/P-ATA
}  drives, SCSI has similar queries) to see what kind of device you are
}  talking to. I am not sure it is worth the trouble to do any automatic
}  detection/handling of this.
} 
}  In this specific case, it is more a case of when you attach a high end
}  (or mid-tier) device to a server, you should configure it without
}  barriers for its exported LUNs.
} 
} I don't have a problem with the sysadmin *telling* the system the other
} end of that fiber cable has characteristics X, Y and Z.  What worried me
} was that it looked like conflating "device reported writeback cache" with
} "device actually has enough battery/hamster/whatever backup to flush
} everything on a power loss".
} (My back-of-envelope calculation shows, for a worst-case of needing a 1ms
} seek for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.
} That's a lot of battery..)

Most hardware RAID devices I know of use the battery to preserve the cache while
the power is off.  When the power is restored, the array flushes the cache to disk.
If the power failure lasts longer than the batteries, then the cache data is
lost, but the batteries last 24+ hours, I believe.


Most mid-range and high end arrays actually use that battery to ensure that the data 
is all written out to permanent media when the power is lost. I won't go into 
how that is done, but it clearly would not be safe to assume that your power 
outage will only last a certain length of time (and if it lasted longer, you 
would lose data).




A big EMC array we had had enough battery power to power about 400 disks
while the 16 Gig of cache was flushed.  I think EMC told me the batteries
would last about 20 minutes.  I don't recall if the array was usable during
the 20 minutes.  We never tested a power failure.

Guy


I worked on the team that designed that big array.

At one point, we had an array on loan to a partner who tried to put it in a very 
small data center. A few weeks later, they brought in an electrician who needed 
to run more power into the center.  It was pretty funny - he tried to find a 
power button to turn it off and, failing that, just walked over and dropped 
power to try to get the Symm to shut down.  When that didn't work, he was 
really, really confused ;-)


ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Ric Wheeler



[EMAIL PROTECTED] wrote:

On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)

OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there an "I'm a high-end array" bit
in the sense data that I'm unaware of?

There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not
sure it is worth the trouble to do any automatic detection/handling of this.

In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its
exported LUNs.


I don't have a problem with the sysadmin *telling* the system the other end of
that fiber cable has characteristics X, Y and Z.  What worried me was that it
looked like conflating "device reported writeback cache" with "device actually
has enough battery/hamster/whatever backup to flush everything on a power loss".
(My back-of-envelope calculation shows, for a worst-case of needing a 1ms seek
for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
a lot of battery..)


I think that we are on the same page here - just let the sysadmin mount without 
barriers for big arrays.
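
For ext3 and reiserfs that is just a mount option; a sketch (the device, mount 
point and fstab line are only examples - check the options your kernel and 
filesystem actually support):

   # mount -o barrier=0 /dev/sdb1 /data       (ext3: barrier=0 disables barriers)
   # mount -o barrier=none /dev/sdb1 /data    (reiserfs uses barrier=none/flush)

   # /etc/fstab
   /dev/sdb1  /data  ext3  defaults,barrier=0  0 2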


1GB of cache, by the way, is really small for some of us ;-)
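
(For scale, that back-of-envelope number works out to: 1 GB / 4 KB = 262,144 
blocks, and at 1 ms per block that is ~262 seconds, a bit over 4 minutes. It 
scales linearly, so under the same worst-case assumption a 16 GB cache would 
need over an hour of battery to destage.)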

ric



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-11 Thread Ric Wheeler


[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there an "I'm a high-end array" bit
in the sense data that I'm unaware of?



There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not 
sure it is worth the trouble to do any automatic detection/handling of this.
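
For example, the write cache setting itself is easy to query by hand (the device 
name is just an example; sdparm may need to be installed separately):

   # hdparm -I /dev/sda | grep -i 'write cache'     (ATA/S-ATA IDENTIFY data)
   # sdparm --get=WCE /dev/sda                      (SCSI caching mode page)

Neither of those tells you whether the cache is battery backed, though, which 
is really the bit of information we would want here.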


In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its 
exported LUNs.


ric


Re: parity check for read?

2007-04-04 Thread Ric Wheeler



Mirko Benz wrote:

Neil,

Exactly what I had in mind.

Some vendors claim they do parity checking for reads. Technically it 
should be possible for Linux RAID as well but is not implemented – correct?


Reliability data for unrecoverable read errors:
- enterprise SAS drive (ST3300655SS): 1 in 10^16 bits transferred, ~ 1 
error in 1.1 PB
- enterprise SATA drive (ST3500630NS): 1 in 10^14 bits transferred, ~ 1 
error in 11 TB


For a single SATA drive @ 50 MB/s it takes, on average, about 2.7 days to 
encounter an error.
For a large RAID with several drives this becomes much lower - or am I 
viewing this wrong?


Regards,
Mirko


One note is that if the drive itself notices the unrecoverable read error, MD 
will see this as an IO error and rebuild the stripe.


What you need the parity check on read for is to catch errors that arise not at 
the disk sector level, but rather ones that sneak in from DRAM, HBA errors or 
uncorrected wire-level errors.
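
As a rough sanity check on the numbers quoted above (treating the spec sheet 
rate as a mean): 10^14 bits is about 1.25 x 10^13 bytes (~11 TiB), and at a 
sustained 50 MB/s that is roughly 2.5 x 10^5 seconds, i.e. on the order of 
3 days - consistent with the 2.7 day figure. With N such drives all busy in an 
array, the expected time to the first unrecoverable read error drops by roughly 
a factor of N, so yes, for a large array the window shrinks accordingly.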


ric


Re: end to end error recovery musings

2007-02-27 Thread Ric Wheeler

Martin K. Petersen wrote:

"Eric" == Moore, Eric [EMAIL PROTECTED] writes:


Eric Martin K. Petersen on Data Integrity Feature, which is also
Eric called EEDP (End to End Data Protection), which he presented some
Eric ideas/suggestions of adding an API in Linux for this.

T10 DIF is interesting for a few things: 


 - Ensuring that the data integrity is preserved when writing a buffer
   to disk

 - Ensuring that the write ends up on the right hardware sector

These features make the most sense in terms of WRITE.  Disks already
have plenty of CRC on the data so if a READ fails on a regular drive
we already know about it.


There are paths through a read that could still benefit from the extra 
data integrity.  The CRC gets validated on the physical sector, but we 
don't have the same level of strict data checking once it is read into 
the disk's write cache or being transferred out of cache on the way to 
the transport...




We can, however, leverage DIF with my proposal to expose the
protection data to host memory.  This will allow us to verify the data
integrity information before passing it to the filesystem or
application.  We can say "this is really the information the disk
sent - it hasn't been mangled along the way."

And by using the APP tag we can mark a sector as - say - metadata or
data to ease putting the recovery puzzle back together.

It would be great if the app tag was more than 16 bits.  Ted mentioned
that ideally he'd like to store the inode number in the app tag.  But
as it stands there isn't room.

In any case this is all slightly orthogonal to Ric's original post
about finding the right persistence heuristics in the error handling
path...



Still all a very relevant discussion - I agree that we could really use 
more than just 16 bits...


ric



Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler



Alan wrote:

the new location.  I believe this should always be true, so presumably
with all modern disk drives a write error should mean something very
serious has happened. 


Not quite that simple.


I think that write errors are normally quite serious, but there are exceptions 
which can sometimes be worked around with retries.  To Ted's point, in 
general, a write to a bad spot on the media will cause a remapping which should 
be transparent (if a bit slow) to us.




If you write a block-aligned request the same size as the physical media
block size, maybe this is true. If you write a sector on a device with a
physical sector size larger than the logical block size (as allowed by, say,
ATA7) then it's less clear what happens. I don't know if the drive
firmware implements multiple tails in this case.

On a read error it is worth trying the other parts of the I/O.



I think that this is mostly true, but we also need to balance this against the 
need for higher levels to get a timely response.  In a really large IO, a naive 
retry of a very large write could lead to a non-responsive system for a very 
long time...


ric





Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler


Alan wrote:
I think that this is mostly true, but we also need to balance this against the 
need for higher levels to get a timely response.  In a really large IO, a naive 
retry of a very large write could lead to a non-responsive system for a very 
long time...


And losing the I/O could result in a system that is non-responsive until
the tape restore completes two days later...


Which brings us back to a recent discussion at the file system workshop on being 
more repair oriented in file system design so we can survive situations like 
this a bit more reliably ;-)


ric


Re: end to end error recovery musings

2007-02-26 Thread Ric Wheeler



Jeff Garzik wrote:

Theodore Tso wrote:

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to
the new location.  I believe this should always be true, so presumably
with all modern disk drives a write error should mean something very
serious has happened.



This is what will /probably/ happen.  The drive should indeed find a 
spare sector and remap it, if the write attempt encounters a bad spot on 
the media.


However, with a large enough write, a large enough bad spot on the media, and 
firmware programmed to never take more than X seconds to complete its 
enterprise customers' I/O, it might just fail.



IMO, somewhere in the kernel, when we receive a read-op or write-op 
media error, we should immediately try to plaster that area with small 
writes.  Sure, if it's a read-op you lost data, but this method will 
maximize the chance that you can refresh/reuse the logical sectors in 
question.
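
(Today that is roughly what an admin does by hand - something like the 
following, where the device and LBA are placeholders and the data in that 
range is deliberately overwritten:

   # dd if=/dev/zero of=/dev/sdb bs=512 seek=12345678 count=8 oflag=direct

which gives the drive a chance to remap the flaky sectors on the write path.)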


Jeff


One interesting counter-example is a write smaller than a full page - say 512 
bytes out of 4k.


If we need to do a read-modify-write and it just so happens that 1 of the 7 
sectors we need to read back is flaky, will this look like a write failure?


ric


end to end error recovery musings

2007-02-23 Thread Ric Wheeler
In the IO/FS workshop, one idea we kicked around is the need to provide 
better and more specific error messages between the IO stack and the 
file system layer.


My group has been working to stabilize a relatively up-to-date libata + 
MD based box, so I can try to lay out at least one typical appliance-like 
configuration to help frame the issue. We are working on a 
relatively large appliance, but you can buy similar home appliances (or 
build them) that use Linux to provide a NAS in a box for end users.


The use case that we have is on an ICH6R/AHCI box with 4 large (500+ GB) 
drives, with some of the small system partitions on a 4-way RAID1 
device. The libata version we have is a backport of 2.6.18 onto SLES10, 
so the error handling at the libata level is a huge improvement over 
what we had before.


Each box has a watchdog timer that can be set to fire after at most 2 
minutes.


(We have a second flavor of this box with an ICH5 and P-ATA drives using 
the non-libata drivers that has a similar use case).


Using the patches that Mark sent around recently for error injection, we 
inject media errors into one or more drives and try to see how smoothly 
error handling runs and, importantly, whether or not the error handling 
will complete before the watchdog fires and reboots the box.  If you 
want to be especially mean, inject errors into the RAID superblocks on 3 
out of the 4 drives.


We still have the following challenges:

   (1) read-ahead often means that we will retry every bad sector at 
least twice from the file system level. The first time, the fs read-ahead 
request triggers a speculative read that includes the bad sector 
(triggering the error handling mechanisms) right before the real 
application read does the same thing.  Not sure what the 
answer is here since read-ahead is obviously a huge win in the normal case.


   (2) the patches that were floating around on how to make sure that 
we effectively handle single sector errors in a large IO request are 
critical. On one hand, we want to combine adjacent IO requests into 
larger IO's whenever possible. On the other hand, when the combined IO 
fails, we need to isolate the error to the correct range, avoid 
reissuing a request that touches that sector again and communicate up 
the stack to file system/MD what really failed.  All of this needs to 
complete in tens of seconds, not multiple minutes.


   (3) The timeout values on the failed IO's need to be tuned well (as 
was discussed in an earlier linux-ide thread; see the example after this 
list). We cannot afford to hang for 30 seconds, especially in the MD case, 
since you might need to fail more than one device for a single IO.  Prompt 
error propagation (say that 4 times quickly!) can allow MD to mask the 
underlying errors as you would hope; hanging on too long will almost 
certainly cause a watchdog reboot...


   (4) The newish libata+SCSI stack is pretty good at handling disk 
errors, but adding in MD actually can reduce the reliability of your 
system unless you tune the error handling correctly.
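
A concrete example of the knob referred to in (3): the per-device SCSI command 
timeout is exposed through sysfs (device name and values here are only 
illustrative; the default on these kernels is 30 seconds):

   # cat /sys/block/sda/device/timeout
   30
   # echo 10 > /sys/block/sda/device/timeout

Lowering it trades slower-but-successful recoveries for faster failure back up 
to MD, so it has to be tuned together with the watchdog interval.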


We will follow up with specific issues as they arise, but I wanted to 
lay out a use case that can help frame part of the discussion.  I also 
want to encourage people to inject real disk errors with Mark's 
patches so we can share the pain ;-)


ric





Re: [patch] latency problem in md driver

2006-12-22 Thread Ric Wheeler

Jeff Garzik wrote:

Lars Ellenberg wrote:

md raidX make_request functions strip off the BIO_RW_SYNC flag,
thus introducing additional latency.

Below is a suggested patch for raid1.c.
Other suggested solutions would be to let bio_clone do its work
and not reassign (thereby stripping off all flags), or
at most strip off known unwanted flags (the BARRIER flag).


It sounds like a major bug to strip the barrier flag.  I quite 
understand that a barrier to a RAID device as a whole behaves 
differently from a barrier to an ATA or SCSI device, but that's no 
excuse to avoid the problem.


If MD does not pass barriers, it is unilaterally dropping the "data made 
it to the media" guarantee.


Jeff


Exactly right - if we do not pass the barrier request down to the 
members of the RAID group, then we lose the data integrity we need.


Of course, in a RAID group, this will introduce latency, but that is 
the correct behavior.


ric


Re: libata hotplug and md raid?

2006-09-13 Thread Ric Wheeler

(Adding Tejun & Greg KH to this thread)

Leon Woestenberg wrote:


Hello all,

I am testing the (work-in-progress / upcoming) libata SATA hotplug.
Hotplugging alone seems to work, but not well in combination with md
RAID.

Here is my report and a question about intended behaviour. Mainstream
2.6.17.11 kernel patched with libata-tj-2.6.17.4-20060710.tar.bz2 from
http://home-tj.org/files/libata-tj-stable/.

Supermicro P8SCT motherboard with Intel ICH6R, using AHCI libata driver.

In short, I use ext3 over /dev/md0 over 4 SATA drives /dev/sd[a-d]
each driven by libata ahci. I unplug then replug the drive that is
rebuilding in RAID-5.

When I unplug a drive, /dev/sda is removed, hotplug seems to work to
the point where proc/mdstat shows the drive failed, but not removed.

Every other notion of the drive (in kernel and udev /dev namespace)
seems to be gone after unplugging. I cannot manually remove the drive
using mdadm, because it tells me the drive does not exist.

Replugging the drive brings it back as /dev/sde, md0 will not pick it up.


I have a similar setup, AHCI + 4 drives but using a RAID-1 group.  The 
thing that you are looking for is persistent device naming, which should 
work properly if you can tweak udev/hotplug correctly.
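
For example, the udev persistent storage rules already create stable, 
serial-number-based links that survive a replug; a sketch (the exact link 
names vary by udev version and distribution):

   # ls -l /dev/disk/by-id/
   lrwxrwxrwx 1 root root 9 ... scsi-SATA_Maxtor_7L320S0_L616D6YH -> ../../sdb

Referring to drives through /dev/disk/by-id/ (or by-path), or adding a rule 
keyed on the drive serial, avoids depending on the sda/sde probe order when 
scripting the re-add.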


I have verified that a drive pull/drive reinsert on a mainline kernel 
with a SLES10 base does provide this (first insertion gives me sdb, pull 
followed by reinsert still is sdb), but have not tested interaction with 
RAID since I am focused on the bad block handling at the moment.  I will 
add this to my list ;-)




The expected behaviour (from me) is that the drive re-appears as 
/dev/sda.


What is the intended behaviour of md in this case?

Should some user-space application fail-remove a drive as a pre-action
of the unplug event from udev, or should md fully remove the drive
within kernel space??

See kernel/udev/userspace messages in chronological order,
with my actions marked between  , at this web
page:

http://pastebin.ca/168798

Thanks,
--
Leon





Re: Failed Hard Disk... help!

2006-06-10 Thread Ric Wheeler



David M. Strang wrote:


/Patrick wrote:


pretty sure smartctl -d ata -a /dev/sdwhatever will tell you the
serial number. (Hopefully the kernel is new enough that it supports
SATA/smart, otherwise you need a kernel patch which won't be any 
better...)



Yep... 2.6.15 or better... I need the magical patch =\.

Any other options?


If you have an updated copy of hdparm, you can use it against libata 
SCSI drives to get the serial number:


   # hdparm -V
   hdparm v5.7

   # hdparm -I /dev/sda


   /dev/sda:

   ATA device, with non-removable media
   Model Number:   Maxtor 7L320S0
   Serial Number:  L616D6YH
   Firmware Revision:  BACE1G70
   (and so on)






Re: md: Change ENOTSUPP to EOPNOTSUPP

2006-04-29 Thread Ric Wheeler



Molle Bestefich wrote:


Ric Wheeler wrote:


You are absolutely right - if you do not have a validated, working
barrier for your low level devices (or a high end, battery backed array
or JBOD), you should disable the write cache on your RAIDed partitions
and on your normal file systems ;-)

There is working support for SCSI (or libata S-ATA) barrier operations
in mainline, but they conflict with queue-enabled targets, which ends up
leaving queuing on and disabling the barriers.



Thank you very much for the information!

How can I check that I have a validated, working barrier with my
particular kernel version etc.?
(Do I just assume that since it's not SCSI, it doesn't work?)


The support is in for all drive types now, but you do have to check.

You should look in /var/log/messages and see that you have something 
like this:


 Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: warning: reiserfs: 
option skip_busy is set
 Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: warning: allocator 
options = [0020] 
 Mar 29 16:07:19 localhost kernel:
 Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: found reiserfs 
format 3.6 with standard journal
 Mar 29 16:07:19 localhost kernel: ReiserFS: sdc14: Using r5 hash to 
sort names

 Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: using ordered data mode
 Mar 29 16:07:20 localhost kernel: reiserfs: using flush barriers

You can also do a sanity check on the number of synchronous IO's/second 
and make sure that it seems sane for your class of drive.  For example, 
I use a simple test which creates files, fills them and then fsync's 
each file before close. 
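
A minimal sketch of that sort of test (not the exact tool used here; the file 
count, size and target path are arbitrary, and GNU dd's conv=fsync forces the 
fsync before each file is closed):

   # time sh -c 'for i in $(seq 1 100); do
         dd if=/dev/zero of=/mnt/test/f$i bs=10k count=1 conv=fsync 2>/dev/null
     done'

Dividing the number of files by the elapsed time gives the files/sec numbers 
below.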

With the barrier on and write cache active, I can write about 30 (10k) 
files/sec to a new file system.  I get the same number with no barrier 
and write cache off, which is what you would expect.


If you manually mount with barriers off and the write cache on, however, 
your numbers will jump up to about 852 (10k) files/sec.  This is the one 
to look out for ;-)




I find it, hmm... stupefying?  horrendous?  completely brain dead?  I
don't know..  that no one warns users about this.  I bet there are a
million people out there, happily using MD (probably installed and
initialized it with Fedora Core / anaconda) and thinking their data is
safe, while in fact it is anything but.  Damn, this is not a good
situation..


The widespread use of the write barrier is pretty new stuff.  In 
fairness, the accepted wisdom is (and has been for a long time) to 
always run with write cache off if you care about your data integrity 
(again, regardless of MD or native file system). Think of the write 
barrier support as a great performance boost (I can see almost a 50% win 
in some cases), but getting it well understood and routinely tested is 
still a challenge.




(Any suggestions for a good place to fix this?  Better really really
really late than never...)


Good test suites and lots of user testing...

ric



Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP

2006-04-28 Thread Ric Wheeler



Molle Bestefich wrote:


NeilBrown wrote:


Change ENOTSUPP to EOPNOTSUPP
Because that is what you get if a BIO_RW_BARRIER isn't supported !



Dumb question, hope someone can answer it :).

Does this mean that any version of MD up till now won't know that SATA
disks do not support barriers, and therefore won't flush SATA disks,
and therefore I need to disable the disks' write cache if I want to
be 100% sure that RAID arrays are not corrupted?

Or am I way off :-).

You are absolutely right - if you do not have a validated, working 
barrier for your low level devices (or a high end, battery backed array 
or JBOD), you should disable the write cache on your RAIDed partitions 
and on your normal file systems ;-)
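
Turning the write cache off is a per-drive setting; a sketch (device names are 
examples, and whether hdparm can reach a libata drive depends on the hdparm 
version - sdparm talks to the SCSI layer directly):

   # hdparm -W 0 /dev/hda          (P-ATA via the IDE driver)
   # sdparm --clear=WCE /dev/sda   (SCSI or libata S-ATA)

Note that the drive may revert to its default cache setting on a power cycle, 
so this usually wants to go into an init script (sdparm has a --save option 
for drives that support saved mode pages).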


There is working support for SCSI (or libata S-ATA) barrier operations 
in mainline, but they conflict with queue-enabled targets, which ends up 
leaving queuing on and disabling the barriers.


ric



Re: md faster than h/w?

2006-01-16 Thread Ric Wheeler

Max Waterman wrote:


Mark Hahn wrote:


I've written a fairly
simple bandwidth-reporting tool:
http://www.sharcnet.ca/~hahn/iorate.c

it prints incremental bandwidth, which I find helpful because it shows
recording zones, like this slightly odd Samsung:
http://www.sharcnet.ca/~hahn/sp0812c.png



Using iorate.c, I get somewhat different numbers for the 2.6.15 kernel than
for the 2.6.8 kernel - the 2.6.15 kernel starts off at 105MB/s and heads down
to 94MB/s, while 2.6.8 starts at 140MB/s and heads down to 128MB/s.

That seems like a significant difference to me?

What to do?

Max.

Keep in mind that disk performance is very dependent on exactly what 
your IO pattern looks like and which part of the disk you are reading.


For example, you should be able to consistently max out the bus if you 
write a relatively small (say 8MB) block of data to a disk and then 
(avoiding the buffer cache) do direct IO reads to read it back.  This 
test is useful for figuring out if we have introduced any IO performance 
bumps as all of the data read should come directly from the disk cache 
and not require any head movement, platter reads, etc.  You can repeat 
this test for each of the independent drives in your system.
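
A sketch of that check with a reasonably recent GNU dd (the device is an 
example and is written to raw, so use a scratch disk only):

   # dd if=/dev/zero of=/dev/sdb bs=1M count=8 oflag=direct
   # dd if=/dev/sdb of=/dev/null bs=1M count=8 iflag=direct

The read-back should run at close to interface speed since the drive can serve 
it from its own cache; repeat per drive and compare.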


It is also important to keep in mind that different parts of your disk 
platter have different maximum throughput rates.  For example, reading 
from the outer sectors on a platter will give you a significantly 
different profile than reading from the inner sectors on a platter.


We have some tests that we use to measure raw disk performance that try 
to get through these hurdles to measure performance in a consistent and 
reproducible way...


ric



Re: disk light remains on

2006-01-03 Thread Ric Wheeler

[EMAIL PROTECTED] wrote:

Thanks for the reply.

On Mon, Jan 02, 2006 at 11:49:14PM -0500, Ross Vandegrift wrote:


I just began using RAID-1 (in 2.6.12) on a pair of SATA drives, and
now the hard disk drive light comes on during booting--about when the
RAID system is loaded--and stays on.




There was a bug in libata that did not clear the register that set the 
drive light - you should be able to get rid of this issue with an 
updated kernel, or simply ignore it.


ric


4-way RAID-1 group never finishes md_do_sync()?

2005-01-31 Thread Ric Wheeler
We have a setup where the system partitions (/, /usr, /var) are all 
mirrored across 4-way RAID-1 devices.

On some set of our nodes running both a SLES-based kernel and 2.6.10, we 
have a condition where a device gets stuck in the md_do_sync() code, 
never makes progress, and spews into /var/log/messages.

I added a bunch of printk's to md_do_sync and got the following:
   Jan 31 15:05:57 centera kernel: md: waking up MD thread md5_raid1.
   Jan 31 15:05:57 centera kernel: md: recovery thread got woken up
   (sb_dirty 0 recovery 0x13 sync_thread 0xf6312380 REC_NEEDED 0 
   REC_DONE 1 ) ...
   Jan 31 15:05:57 centera kernel: interrupting MD-thread pid 16530
   Jan 31 15:05:57 centera kernel: raid1_spare_active start loop -
   working disks 4 degraded 0
   Jan 31 15:05:57 centera kernel: raid1_spare_active end loop -
   working disks 4 degraded 0
   Jan 31 15:05:57 centera kernel: raid1_spare_active - no spares made
   active
   Jan 31 15:05:57 centera kernel: md: updating md5 RAID superblock on
   device (in sync 1)
   Jan 31 15:05:57 centera kernel: md: sdd5 6(write) sdd5's sb
   offset: 305152
   Jan 31 15:05:57 centera kernel: md: sdc5 6(write) sdc5's sb
   offset: 305152
   Jan 31 15:05:57 centera kernel: md: sdb5 6(write) sdb5's sb
   offset: 305152
   Jan 31 15:05:57 centera kernel: md: sda5 6(write) sda5's sb
   offset: 305152
   Jan 31 15:05:57 centera kernel: md: waking up MD thread md5_raid1.
   Jan 31 15:05:57 centera kernel: md: recovery thread got woken up
   (sb_dirty 0 recovery 0x20 sync_thread 0x0 REC_NEEDED 1 REC_DONE 0 ) ...
   Jan 31 15:05:57 centera kernel: md: recovery (2) raid_disk 0 faulty
   0 nr_pending 0 degraded 0...
   Jan 31 15:05:57 centera kernel: md: recovery (2) raid_disk 2 faulty
   0 nr_pending 0 degraded 0...
   Jan 31 15:05:57 centera kernel: md: recovery (2) raid_disk 1 faulty
   0 nr_pending 0 degraded 0...
   Jan 31 15:05:57 centera kernel: md: recovery (2) raid_disk 3 faulty
   0 nr_pending 0 degraded 0...
   Jan 31 15:05:57 centera kernel: md: waking up MD thread md5_resync.
   Jan 31 15:05:57 centera kernel: md: waking up MD thread md5_raid1.
   Jan 31 15:05:57 centera kernel: md: recovery thread got woken up
   (sb_dirty 0 recovery 0x3 sync_thread 0xf6312380 REC_NEEDED 0
   REC_DONE 0 ) ...
   Jan 31 15:05:57 centera kernel: md_do_sync ITERATE_MDDEV: mddev md5
   mddev->curr_resync 2 mddev2 md3 mddev2->curr_resync 0
   Jan 31 15:05:58 centera kernel: md_do_sync ITERATE_MDDEV: mddev md5
   mddev->curr_resync 2 mddev2 md6 mddev2->curr_resync 0
   Jan 31 15:05:58 centera kernel: md_do_sync ITERATE_MDDEV: mddev md5
   mddev->curr_resync 2 mddev2 md7 mddev2->curr_resync 0
   Jan 31 15:05:58 centera kernel: md_do_sync ITERATE_MDDEV: mddev md5
   mddev->curr_resync 2 mddev2 md8 mddev2->curr_resync 0
   Jan 31 15:05:58 centera kernel: md_do_sync ITERATE_MDDEV: mddev md5
   mddev->curr_resync 2 mddev2 md0 mddev2->curr_resync 0
   Jan 31 15:05:58 centera kernel: md: syncing RAID array md5 recovery
   0x3 resync_mark 26725 resync_mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md: minimum _guaranteed_
   reconstruction speed: 1000 KB/sec/disc.
   Jan 31 15:05:58 centera kernel: md: using maximum available idle IO
   bandwith (but not more than 20 KB/sec) for reconstruction.
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 j 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 0 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 1 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 2 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 3 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 4 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 5 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 6 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 7 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 8 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md_do_sync: md5 step 9 mark 26726
   mark_cnt 610304
   Jan 31 15:05:58 centera kernel: md: using 128k window, over a total
   of 305152 blocks.
   Jan 31 15:05:58 centera kernel: md: resuming recovery of md5 from
   checkpoint.
   Jan 31 15:05:58 centera kernel: md: md5: sync done.

If you look at the sync part of md_do_sync, it is comparing the 
max_sectors (610304) to the value of recovery_cp (stored in j) and they 
are equal.

In the out: path of md_do_sync, there is code that is setting recovery_cp:
out:
   mddev->queue->unplug_fn(mddev->queue);
   wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
   /* tell personality that we are finished */
   mddev->pers->sync_request(mddev, max_sectors, 1);
   if (!test_bit(MD_RECOVERY_ERR, &mddev->recovery) &&
   mddev->curr_resync > 2