Re: split RAID1 during backups?

2005-10-25 Thread Bill Davidsen
On Mon, 24 Oct 2005, Jeff Breidenbach wrote:

 
 Ok... thanks everyone!
 
 David, you said you are worried about failure scenarios
 involved with RAID splitting. Could you please elaborate?
 My biggest concern is I'm going to accidentally trigger
 a rebuild no matter what I try but maybe you have something
 more serious in mind.
 
 Brad, your suggestion about kernel 2.6.13 and intent logging and
 having mdadm pull a disk sounds like a winner. I'm going to try it
 if the software looks mature enough. Should I be scared?
 
 Dean, the comment about write-mostly is confusing to me.  Let's say
 I somehow marked one of the component drives write-mostly to quiet it
 down. How do I get at it? Linux will not let me mount the component
 partition if md0 is also mounted. Do you think write-mostly or
 write-behind are likely enough to be magic bullets that I should
 learn all about them?
 
 Bill, thanks for the suggestion to use nbd instead of netcat.  Netcat
 is solid software and very fast, but does feel a little like duct
 tape. You also suggested putting a third drive (local or nbd remote)
 temporarily in the RAID1. What does that buy versus the current
 practice of using dd_rescue to copy the data off md0? I'm not
 imagining any I/O savings over the current approach.

As a paranoid admin, you (a) reduce read-only time, (b) never run with 
unmirrored data, and (c) can send from an unused drive (or you can get a 
real hot-swap bay and put the mirrored drive in the safe).

I have one other thought. If you want to just stream this to another drive 
and can stand long r/o mounts (or play with the intent-logging stuff, 
carefully), it is to:

 open a socket to something on the other machine which is going to write a 
single BIG data file, or to a partition of the same size;

 open the partition as a file (open /dev/md0);

 use the sendfile() system call to blast the whole device to the socket 
without using user memory.

Based on my vast experience with one test program, this should work ;-) It 
will be limited by how fast you can write it at the other end, I suspect.
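
For what it's worth, here is a minimal sketch of the stream-the-raw-device 
idea using the netcat approach already discussed in this thread (the 
zero-copy sendfile() variant needs a small helper program); host name, port 
and paths are made up:

  # on the receiving machine (hypothetical host "backuphost"):
  nc -l -p 3333 > /backup/md0.img    # or redirect to a partition of the same size

  # on the web server, with the filesystem quiet or mounted read-only:
  dd if=/dev/md0 bs=1M | nc backuphost 3333

Either way it still comes down to how fast the far end can write.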



I still think you should be able to do incrementals. If that's Reiser3 
you're using, it may not be performing as well as ext3 with hashing would, 
but I lack the time to test that properly.

 
 John, I'm using 4KB blocks in reiserfs with tail packing. All sorts of
 other details are in the dmesg output [1]. I agree seeks are a major
 bottleneck, and I like your suggestion about putting extra spindles
 in. Master-slave won't work because the data is continuously changing.
 I'm not going to argue about the optimality of millions of tiny files
 (go talk to Hans Reiser about that one!) but I definitely don't foresee
 major application redesign any time soon.
 
 Most importantly, thanks for the encouragement. So far it sounds like
 there might be some ninja magic required, but I'm becoming
 increasingly optimistic that it will be - somehow - possible manage
 disk contention in order to dramatically raise backup speeds.
 
 Cheers,
 Jeff
 
 [1] http://www.jab.org/dmesg
 

-- 
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
Doing interesting things with little computers since 1979



Re: split RAID1 during backups?

2005-10-27 Thread Bill Davidsen
On Wed, 26 Oct 2005, Jeff Breidenbach wrote:

 
 Norman What you should be able to do with software raid1 is the
 Norman following: Stop the raid, mount both underlying devices
 Norman instead of the raid device, but of course READ ONLY. Both
 Norman contain the complete data and filesystem, and in addition to
 Norman that the md superblock at the end. Both should be identical
 Norman copies of that.  Thus, you do not have to resync
 Norman afterwards. You then can backup the one disk while serving the
 Norman web server from the other. When you are done, unmount,
 Norman assemble the raid, mount it and go on.
 
 I tried both variants of Norman's suggestion on a test machine and
 they worked great. Shutting down and restarting md0 did not trigger a
 rebuild. Perfect! And I could mount component partitions
 read-only at any time. However on the production machine the
 component partitions refused to mount, claiming to be already
 mounted. Despite the fact that the component drives do not show up
 anywhere in lsof or mtab. When I saw this, I got nervous and did not
 even try stopping md0 on the production machine.

As long as md0 is running I suspect the partition will be marked as in 
use. So you have to stop it. If the 2.4 kernel didn't detect that, I would 
call it a bug.
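
In case it helps, a rough sketch of Norman's procedure with made-up device 
and mount-point names:

  umount /data                     # or quiesce the web server first
  mdadm --stop /dev/md0
  mount -o ro /dev/sdc1 /backup    # back this one up
  mount -o ro /dev/sdd1 /data      # serve (read-only) from this one
  # ... run the backup ...
  umount /backup /data
  mdadm --assemble /dev/md0 /dev/sdc1 /dev/sdd1
  mount /dev/md0 /data

Since nothing is written to either component while the array is down, the 
superblocks should still match and no resync should be triggered.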

 
 # mount -o ro /dev/sdc1 backup
 mount: /dev/sdc1 already mounted or backup busy
 
 The two machines hardly match. The test machine has a 2.4.27 kernel
 and JBOD drives hanging off a 3ware 7xxx controller. The production
 machine has a 2.6.12 kernel and Intel SATA controllers. Both machines
 have mdadm 1.9.0, and the discrepancy in behavior seems weird to
 me. Any insights?

[___snip___]
 
 Bill If you want to try something which used to work see nbd,
 Bill export 500GB from another machine, add the network block device
 Bill to the mirror, let it sync, break the mirror. Haven't tried
 Bill since 2.4.19 or so.
 
 Wow, nbd (network block device) sounds really useful. I wonder if it
 is a good way to provide more spindles to a hungry webserver.  Plus
 they had a major release yesterday. While I've been focusing on
 managing disk contention, if there's an easy way to reduce it, that's
 definitely fair game.
 
 Some of the other suggestions I'm going to hold off on. For example,
 sendfile() doesn't really address the bottleneck of disk contention.

sendfile() bypasses the copy to a user buffer, which in turn avoids copying 
back into system buffers and eliminates contention for buffer space. Use 
vmstat to check: if you see a lot of system time and a lot of memory tied 
up in buffers of various kinds, there's a good possibility that the problem 
is there.
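
A quick way to look at that, as a sketch (assumes the sysstat package for 
iostat):

  vmstat 5       # high "sy" and churning buffer/cache numbers suggest copy overhead
  iostat -x 5    # per-disk utilisation shows whether it really is seek contention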

 I'm also not so anxious to switch filesystems. That's a two week
 endeavor that doesn't really address the contention issue. And it's
 also a little hard for me to imagine that someone is going to beat the
 pants off of reiserfs, especially since reiserfs was specifically
 designed to deal with lots of small files efficiently. Finally, I'm
 not going to focus on incremental backups if there's any prayer of
 getting a 500GB full backup in 3 hours.  Full backups provide a LOT of
 warm fuzzies.
 
 Again, thank you all very much.


-- 
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
Doing interesting things with little computers since 1979



Re: udev and mdadm -- I'm lost

2005-11-16 Thread Bill Davidsen

Konstantin Olchanski wrote:


On Tue, Nov 08, 2005 at 04:40:23PM -0700, Dave Jiang wrote:
 


I see there is an mdadm  --auto option now...
 

Just use --auto or --auto=yes and it should take care of the device node 
creations. I have a /etc/mdadm.conf so I just do:

mdadm -As --auto=yes and it brings everything up over udev.
   



I face a similar problem each time I use KNOPPIX to revive a non-booting
server. It always takes me 10-20 minutes to figure out the right
mdadm incantation to start the md devices. It does not help that mdadm
wants an /etc/mdadm.conf which is on an md device itself, inaccessible.
 


Hate to say it, but that's the reward of a poor config choice and should be fixed.


I wish there were a simple command to find and start all md devices, something
along the lines of:

mdadm --please-find-and-start-all-my-md-devices-please-please-please
or
mdadm --start-all, or whatever.
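
For what it's worth, something close to that already exists; a sketch of 
the usual rescue-mode incantation (the temporary path is an example):

  mdadm --examine --scan --config=partitions > /tmp/mdadm.conf
  mdadm --assemble --scan --config=/tmp/mdadm.conf --auto=yes

That builds ARRAY lines from the superblocks on whatever partitions the 
rescue kernel can see, so no on-disk mdadm.conf is needed.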

 




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



RAID-6

2005-11-16 Thread Bill Davidsen
Based on some Google searching on RAID-6, I find that the term seems to be 
used to describe two different things. One is very similar to RAID-5, but 
with two redundancy blocks per stripe: one ordinary XOR parity and a second 
syndrome computed by a different method (or at any rate two methods are 
employed). Other sources define RAID-6 as RAID-5 with a distributed hot 
spare, a.k.a. RAID-5E, which spreads head motion across all drives for 
performance.


Any clarification on this?

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Poor Software RAID-0 performance with 2.6.14.2

2005-11-22 Thread Bill Davidsen

Lars Roland wrote:

I have created a stripe across two 500Gb disks located on separate IDE
channels using:

mdadm -Cv /dev/md0 -c32 -n2 -l0 /dev/hdb /dev/hdd

the performance is awful on both kernel 2.6.12.5 and 2.6.14.2 (even
with hdparm and blockdev tuning), both bonnie++ and hdparm (included
below) shows a single disk operating faster than the stripe:


In looking at this I found something interesting, even though you 
identified your problem before I was able to use the data for its intended 
purpose. So other than suggesting that the chunk size is too small, I have 
nothing to add on that; your hardware is the issue.


I have two ATA drives connected, and each has two partitions. The first 
partition of each is mirrored for reliability with default 64k chunks, 
and the second is striped, with 512k chunks (I write a lot of 100MB 
files to this f/s).


Reading the individual devices with dd, I saw a transfer rate of about 
60MB/s, while the striped md1 device gave just under 120MB/s (60.36 and 
119.65 MB/s, to be exact). However, the mirrored md0 also gave just 60MB/s 
read speed.


One of the advantages of mirroring is that if there is heavy read load 
when one drive is busy there is another copy of the data on the other 
drive(s). But doing 1MB reads on the mirrored device did not show that 
the kernel took advantage of this in any way. In fact, it looks as if 
all the reads are going to the first device, even with multiple 
processes running. Does the md code now set write-mostly by default 
and only go to the redundant drives if the first fails?


I won't be able to do a lot of testing until Thursday, or perhaps 
Wednesday night, but that is not what I expected and not what I want. I do 
mirroring on web and news servers to spread the head motion, so now I will 
be looking at the stats to see if that's actually happening.


I added the raid M/L to the addresses, since this is getting to be a 
general RAID question.


--
   -bill davidsen ([EMAIL PROTECTED])
The secret to procrastination is to put things off until the
 last possible moment - but no longer  -me


Re: Poor Software RAID-0 performance with 2.6.14.2

2005-11-23 Thread Bill Davidsen

Paul Clements wrote:

Bill Davidsen wrote:

One of the advantages of mirroring is that if there is heavy read load 
when one drive is busy there is another copy of the data on the other 
drive(s). But doing 1MB reads on the mirrored device did not show that 
the kernel took advantage of this in any way. In fact, it looks as if 
all the reads are going to the first device, even with multiple 
processes running. Does the md code now set write-mostly by default 
and only go to the redundant drives if the first fails?



No, it doesn't use write-mostly by default. The way raid1 read balancing 
works (in recent kernels) is this:


- sequential reads continue to go to the first disk

- for non-sequential reads, the code tries to pick the disk whose head 
is closest to the sector that needs to be read


So even if the reads aren't exactly sequential, you probably still end 
up reading from the first disk most of the time. I imagine with a more 
random read pattern you'd see the second disk getting used.


Thanks for the clarification. I think the current method is best for 
most cases. I have to think about how large a file you would need before 
there is any saving in transfer time, given that you have to consider the 
slowest seek, drives doing other things on a busy system, etc.
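
A simple way to see the balancing behaviour, as a sketch with made-up 
device names:

  dd if=/dev/md0 of=/dev/null bs=1M count=1024 &
  dd if=/dev/md0 of=/dev/null bs=1M skip=8192 count=1024 &
  iostat -x 5

With two readers at widely separated offsets both mirror members should 
show read traffic, while a single sequential reader will mostly hit the 
first disk, as described above.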


--
   -bill davidsen ([EMAIL PROTECTED])
The secret to procrastination is to put things off until the
 last possible moment - but no longer  -me



Re: Booting from raid1 -- md: invalid raid superblock magic on sdb1

2005-12-03 Thread Bill Davidsen

David M. Strang wrote:

I've read, and read, and read -- and I'm still not having ANY luck 
booting completely from a raid1 device.


This is my setup...

sda1 is booting, working great. I'm attempting to transition to a 
bootable raid1.


sdb1 is a 400GB partition -- it is type FD.

Disk /dev/sdb: 400.0 GB, 400088457216 bytes
2 heads, 4 sectors/track, 97677846 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

  Device Boot  Start End  Blocks   Id  System
/dev/sdb1   197677846   390711382   fd  Linux raid 
autodetect


I have created my raid1 mirror with the following command:

mdadm --create /dev/md_d0 -e1 -ap --level=1 --raid-devices=2 
missing,/dev/sdb1


The raid created correctly, I then partitioned md_d0 to match sda1. 


I hope that's a typo... you need to partition sdb, not the md_d0 raid 
device.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: building a disk server

2005-12-03 Thread Bill Davidsen

Brad Dameron wrote:


Might look at an Areca SATA RAID controller. They support up to 24 ports
and have hardware-level RAID capacity expansion. Or if you want to go
cheaper, look at the 3ware controller and use it in JBOD. That way you
can get the SMART monitoring and hotplug. 


Here is the server I built with 3TB usable for pretty cheap.
All this was from www.8anet.com

Supermicro SC933T-R760 3u or SC932T-R760 rackmount Chassis with 15 SATA
Hot-Swap drive trays and triple redundant power supplies.
Any motherboard and CPU will do. I would recommend a AMD64 CPU with a
motherboard that has a PCI-X slot on it if possible. I used a Tyan S2468
with dual Athlon 2800's and 2GB. 
A 3ware 9500S-12. Not the 9500S-12MI with this case. Or the new

9550SX-12 which is much faster now.
12 - 300GB Maxtor MaXLine III drives
2 - Western Digital 36GB 10k drives


I use the 2 36GB drives mirrored for the OS since I had the extra slots.
Could have gone with an Areca 16-port card instead, but I already had the
3ware lying around. I went with the 300GB Maxtor drives because at the
time they were the ones that had SATAII NCQ (Native Command Queuing) and
16MB cache. This setup is very fast and I use it as a NFS server for
backing up my main servers. I currently have about 20% left out of 3TB.
Time to add another one. 
 



The only part of the hardware I would change is the CPU setup; a single 
dual-core setup seems more cost- and heat-effective now. The controller 
is fine, but that just gets better with time as new stuff comes out.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: stripe_cache_size ?

2005-12-12 Thread Bill Davidsen

Neil Brown wrote:


On Friday December 9, [EMAIL PROTECTED] wrote:
 


On Fri, 9 Dec 2005, Neil Brown wrote:

   


On Friday December 9, [EMAIL PROTECTED] wrote:
 


Hi,

I found that there's a new sysfs stripe_cache_size variable. I want to
know how does it affect RAID5 read / write performance (if any) ?
Please cc to me if possible, thanks.
   


Would you like to try it out and see?
Any value from about 10 to a few thousand should be perfectly safe,
though very large values may cause the system to run short of memory.

The memory used is approximately
  stripe_cache_size * 4K * number-of-drives
 


What??? I hope that's a typo...
1 - there's no use of the sysfs variable?
   



'stripe_cache_size' is the sysfs variable.  Yes, it is used.

 


2 - that's going to be huge, 128k * 4k * 10 = 5.1GB !!!
   



That is why I warned to limit it to a few thousand (128k is more than
a few thousand!).
 

Sorry, for some reason I read that as being in stripes instead of bytes, 
which would make it 128k for a size of only 2. My misread.
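
To make the units concrete, a sketch (the md device name is an example):

  cat /sys/block/md0/md/stripe_cache_size     # value is in 4K pages per member device
  echo 2048 > /sys/block/md0/md/stripe_cache_size

With 5 drives, 2048 costs roughly 2048 * 4K * 5 = 40MB of memory, per the 
formula above.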



I just ran bonnie over a 5-drive raid5 with stripe_cache_size varying
from 256 to 4096 in an exponential sequence. (Numbers below 256
cause problems - I'll fix that.)

Results:
256   
cage,8G,42594,93,151807,38,50660,18,38610,91,172056,38,912.8,2,16,4356,99,+,+++,+,+++,4389,99,+,+++,14091,100
512   
cage,8G,42145,92,186535,44,60659,21,42249,96,172057,37,971.9,2,16,4407,99,+,+++,+,+++,4452,99,+,+++,13909,99
1024   
cage,8G,42250,92,210407,50,61254,21,42106,96,172575,37,903.1,2,16,4370,99,+,+++,+,+++,4395,99,+,+++,13809,100
2048   
cage,8G,42458,92,229577,55,61762,21,41965,96,168950,36,837.9,2,16,4373,99,+,+++,+,+++,4460,99,+,+++,14084,100
4096   
cage,8G,42305,92,250318,62,62192,21,42156,96,170692,38,981.8,3,16,4380,99,+,+++,+,+++,4426,99,+,+++,13723,99

Seq write speed increases substantially; seq read doesn't vary much;
seq rewrite improves a bit.

So for that limited test, write speed is helped a lot, read speed
isn't.

Maybe I should try iozone...

NeilBrown

 




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: First RAID Setup

2005-12-22 Thread Bill Davidsen

Brad Campbell wrote:


Callahan, Tom wrote:


It is always wise to build in a spare, however; that holds for all raid
levels. In your configuration, if a disk fails in your RAID5, your
array will go down. RAID5 is usually 3+ disks with distributed parity, so
you should have 3 disks at minimum, and then a 4th as a spare.



/me wonders in the days of reliable RAID-6 why we use RAID-5 + spare?

RAID-6 has saved me twice now from dual drive failures on a 15 disk 
array.
It's schweett 



It's also a lot more overhead... RAID-5 needs to update just one parity 
block beyond the data written. As I understand the Q syndrome in RAID-6, 
and from watching disk access rates, each write requires the entire stripe 
to be read, then P and Q calculated, then written. You could do the P 
update with a read+write, but since you have to read the entire stripe for 
Q anyway, you save a read by recalculating P from the data.


Did I say that right, Neil?

If you are seeing dual drive failures, I suspect your hardware has 
problems. We run multiple 3 and 6 TB databases, and over a dozen 1 TB 
data caching servers, all using a lot of small fast disks, and I haven't 
seen a real dual drive failure in about 8 years.


We did see some cases which looked like dual failures; it turned out to 
be a firmware limitation, the controller not waiting for the bus to settle 
after a real failure and thinking the next I/O had failed (or similar, in 
any case a false fail on the transaction after the real fail). If you 
run two PATA drives on the same cable in master/slave, it's at least 
possible that this could happen with consumer grade hardware as well. 
Just a thought: dual failures are VERY unlikely unless one triggers the 
other in some way, like failing the bus or the cabinet power supply.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: First RAID Setup

2005-12-22 Thread Bill Davidsen

Andargor The Wise wrote:


Yet another thing, someone has suggested that I should
increase the chunk size for my RAID5 from 32 to either
64 or 128.

Is it worth it, considering that the system doesn't
normally run on a heavy load? Mail for a few users,
some read-only database applications, website, etc.
Mostly a development machine.
 



Can't think of a case where it's not worth having better performance... 
I should write a white paper on stripe size and what happens as you change 
it with given loads.



Would this alleviate the pauses during large file
transfers/copies that I have indicated in my previous
post?

I'm asking because backing up ~176 GB, reconfiguring
the RAID, and restoring it properly so the machine
boots (the RAID5 is /) is quite a PITA.



You may have some special case, but I would never put data that large in 
/, just as a system admin issue. It makes backups and restores, as well 
as upgrades, quite unpleasant. I guess you may have noticed that by now 
;-) Good luck!


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Adding Reed-Solomon Personality to MD, need help/advice

2006-01-08 Thread Bill Davidsen

Bailey, Scott wrote:


Interestingly, I was just browsing this paper
http://www.cs.utk.edu/%7Eplank/plank/papers/CS-05-569.html which appears
to be quite on-topic for this discussion. I admit my eyes glaze over
during intensive math discussions but it appears tuned RS might not be
as horrible as you'd think since apparently state-of-the-art now
provides tricks to avoid the Galois Field operations that used to be
required.

The thought that came to my mind was how does md's RAID-6 personality
compare to EVENODD coding?

Wondering if my home server will ever have enough storage for these
discussions to become non-academic for me, :-) 

The problem is not having storage, it's having backup. The properties of 
backup are

- able to be moved to off-site storage
- cheap and fast enough to use regularly

Making storage more reliable is a desirable end, but it doesn't guard 
against many common failures such as controllers going bad and writing 
unreadable sectors all over before total failure, fire, flood, and 
software errors in the kernel code. While none of these is common in the 
everyday sense, they are all common in the sense of drawing an "I never 
heard of that happening" response.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Raid sync observations

2006-01-08 Thread Bill Davidsen

Gordon Henderson wrote:


On Wed, 21 Dec 2005, Sebastian Kuzminsky wrote:

 


But how does the performance for read and write compare?
 


Good question!  I'll post some performance numbers of the RAID-6
configuration when I have it up and running.
   



Post your hardware config too if you don't mind. I have one server with 8
drives and for swap (Which it never does!) I created 2 x 4 disk RAID 6
arrays (same partition on all disks) and gave them to the kernel with
equal priority

Filename        Type        Size     Used  Priority
/dev/md10       partition   1991800  0     1
/dev/md11       partition   1991800  0     1

md10 : active raid6 sdd2[3] sdc2[2] sdb2[1] sda2[0]
 1991808 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

md11 : active raid6 sdh2[3] sdg2[2] sdf2[1] sde2[0]
 1991808 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

/dev/md10:
Timing buffered disk reads:  64 MB in  0.66 seconds = 97.28 MB/sec

/dev/md11:
Timing buffered disk reads:  64 MB in  0.95 seconds = 67.59 MB/sec

md10 is an on-board 4-port SII SATA controller, md11 is 2 x 2-port SII
PCI cards. (Server is currently moderately loaded, so results are a bit
lower than usual.)

Cue the must/must not swap on RAID arguments ;-)



I wouldn't swap on RAID-6... performance is important, and swap is tiny 
compared to disk size. I would go to 2GB partitions and four-way RAID-1, 
since fast swap-in seems to make for better feel, and writes are cached 
somewhat.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: more info on the hang with 2.6.15-rc5

2006-01-08 Thread Bill Davidsen

Sebastian Kuzminsky wrote:


Now it works, but I don't trust it one bit.

I had been seeing almost immediate, perfectly repeatable hard lockups
in 2.6.15-rc5 and 2.6.15-rc5-mm3, when using sata_mv, RAID, and LVM
together.  Nothing in the syslog or on the console, and the system is
totally unresponsive to the keyboard and network.

My hardware setup is: four Seagate Barracuda 500 GB disks, on a Marvell
MV88SX6081 8-port SATA-II PCI-X controller, on a PCI-X bus (64/66).

The disks work great when accessed directly.  They work great when used
as four PVs for LVM, and when assembled into a 4-disk RAID-6.

But when I make a RAID-6 array out of them, and use the array as a PV,
the system would hang completely, within seconds.  (This is with LVM
2.02.01, libdevicemapper 1.02.02, and dm-driver 4.5.0.)

I turned on all the debugging options in the kernel config hoping to get
some insight, but this debug kernel doesn't crash.  It's running fine,
and I'm pounding on it.  A timing problem in the interaction between
LVM and RAID?  Some kind of weird heisenbug?


I'd be happy to do any debugging tests people suggest.


 


I've been waiting for more info on this; did it get fixed in 2.6.15?

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: mdadm-2.2 SEGFAULT: mdadm --assemble --scan

2006-01-08 Thread Bill Davidsen

Andreas Haumer wrote:


Hi!

Andre Noll schrieb:
 


sorry if this is already known/fixed: Assemble() is called from mdadm.c with
the update argument equal to NULL:

Assemble(ss, array_list->devname, mdfd, array_list, configfile,
         NULL, readonly, runstop, NULL, verbose-quiet, force);

But in Assemble.c we have

if (ident->uuid_set && (!update && strcmp(update, "uuid")!= 0) && ...

which yields a segfault in glibc's strcmp().

   


I just found the same problem after upgrading to mdadm-2.2
The logic to test for update not being NULL seems to be
reversed.

I created a small patch which seems to cure the problem
(see attached file)

HTH

- andreas

--
Andreas Haumer | mailto:[EMAIL PROTECTED]
*x Software + Systeme  | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
 




Index: mdadm/Assemble.c
===================================================================
RCS file: /home/cvs/repository/distribution/Utilities/mdadm/Assemble.c,v
retrieving revision 1.1.1.7
diff -u -r1.1.1.7 Assemble.c
--- mdadm/Assemble.c    5 Dec 2005 05:56:20 -0000   1.1.1.7
+++ mdadm/Assemble.c    31 Dec 2005 15:01:34 -0000
@@ -219,7 +219,7 @@
 		}
 		if (dfd >= 0) close(dfd);
 
-		if (ident->uuid_set && (!update && strcmp(update, "uuid")!= 0) &&
+		if (ident->uuid_set && (update && strcmp(update, "uuid")!= 0) &&
 		    (!super || same_uuid(info.uuid, ident->uuid, tst->ss->swapuuid)==0)) {
 			if ((inargv && verbose >= 0) || verbose > 0)
 				fprintf(stderr, Name ": %s has wrong uuid.\n",
 

Is that right now? Because && evaluates to zero or one, left to right, 
the parens and the !=0 are not needed, and I assume they're in for a 
reason (other than to make the code hard to understand). A comment 
before that if would make the intention clear; I originally thought the 
(!update was intended to be !(update, which would explain the parens, 
but that seems wrong.


If it actually works as intended with the patch, perhaps a comment and 
cleanup in 2.3?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: RAID 16?

2006-02-02 Thread Bill Davidsen
On Thu, 2 Feb 2006, J. Ryan Earl wrote:

  
 
 Gordon Henderson wrote:
 
 I've actually had very good results hot swapping  SCSI drives on a live
 linux system though.
 
 Anyone tried SATA drives yet?
 
 Yes, and it does NOT work yet.  libata does not support hotplugging of 
 harddrives yet: http://linux.yyz.us/sata/features.html 
 
 It supports hotplugging of the PCI controller itself, but not harddrives.

I can add controllers but not devices? I have to think the norm is exactly 
the opposite... think of one of the SATA controllers with the plugs on the 
back for your little external box.

A work in progress, I realize.

-- 
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
Doing interesting things with little computers since 1979



Re: with the latest mdadm

2006-02-07 Thread Bill Davidsen
On Tue, 7 Feb 2006, Neil Brown wrote:

 On Tuesday February 7, [EMAIL PROTECTED] wrote:
  hi,
  with the latest mdadm-2.3.1 we've got the following message:
  -
  md: md4: sync done.
  RAID1 conf printout:
   --- wd:2 rd:2
   disk 0, wo:0, o:1, dev:sda2
   disk 1, wo:0, o:1, dev:sdb2
  md: mdadm(pid 8003) used obsolete MD ioctl, upgrade your software to use
  new ioctls.
  -
  this is just a warning or some kind of problem with mdadm?
 
 This is with a 2.4 kernel, isn't it.
 
 The md driver is incorrectly interpreting an ioctl that it doesn't
 recognise as an obsolete ioctl.  In fact it is a new ioctl that 2.4
 doesn't know about (I suspect it is GET_BITMAP_FILE).
 
 The message should probably be removed from 2.4, but as 2.4 is in
 deep-maintenance mode, I suspect that is unlikely

More to the point, why is mdadm trying to use an ioctl the kernel doesn't 
recognize instead of checking the kernel version and disabling features 
which won't work? Users are trusting their data to this software, and 
"try it and see if it works" ioctls don't give me a warm fuzzy feeling 
about correct operation.

-- 
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
Doing interesting things with little computers since 1979



RAID 106 array

2006-02-08 Thread Bill Davidsen
I was doing some advanced planning for storage, and it came to me that 
there might be benefit from thinking outside the box on RAID config. 
Therefore I offer this for comment.

The object is reliability with performance. The means is to set up two 
arrays on physical devices, one RAID-0 for performance, one RAID-6 for 
reliability. Let's call them four and seven drives for the RAID-0 and 
RAID-6+Spare. Then configure RAID-1 over the two arrays, marking the 
RAID-6 as write-mostly.

The intention is that under heavy load the RAID-6 would have a lot of head 
motion going on writing two parity blocks for every one data write, while 
the RAID-0 would be doing as little work as possible for writes and would 
therefore have more ability to handle reads quickly.

Just a thought experiment at the moment.
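
A sketch of what that might look like with mdadm, all device names 
hypothetical:

  # four-drive RAID-0 for speed
  mdadm --create /dev/md1 --level=0 --raid-devices=4 /dev/sd[abcd]1
  # six-drive RAID-6 plus a spare for safety
  mdadm --create /dev/md2 --level=6 --raid-devices=6 --spare-devices=1 /dev/sd[efghij]1 /dev/sdk1
  # mirror the two, with the RAID-6 side marked write-mostly
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/md1 --write-mostly /dev/md2

Reads should then be served from the stripe set while the RAID-6 quietly 
absorbs the write traffic.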

-- 
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
Doing interesting things with little computers since 1979



Re: NVRAM support

2006-02-10 Thread Bill Davidsen

Erik Mouw wrote:


On Fri, Feb 10, 2006 at 10:01:09AM +0100, Mirko Benz wrote:
 

Does a high speed NVRAM device makes sense for Linux SW RAID? E.g. a PCI 
card that exports battery backed memory.
   



Unless it's very large (i.e.: as large as one of your disks), it
doesn't make sense. It will probably break less often, but it doesn't
help you in case a disk really breaks. It also won't speed up an MD
device much.

 

Could that significantly improve write speed for RAID 5/6 (e.g. via an 
external journal, asynchronous operation and write caching)?
   



You could use it for an external journal, or you could use it as a swap
device.
 



Let me concur: I used an external journal on SSD a decade ago with jfs 
(AIX). If you do a lot of operations which generate journal entries (file 
create, delete, etc.), then it will double your performance in some cases. 
Otherwise it really doesn't help much; use as a swap device might be more 
helpful depending on your config.
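
A sketch of the ext3 external-journal setup Erik mentions below, with a 
hypothetical /dev/nvram0 block device:

  mke2fs -O journal_dev /dev/nvram0           # format the NVRAM card as a journal device
  mke2fs -j -J device=/dev/nvram0 /dev/md0    # create the data filesystem pointing at it

An existing filesystem can be switched over, while unmounted, with 
tune2fs -O ^has_journal followed by tune2fs -j -J device=/dev/nvram0.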


 


What changes would be required?
   



None, ext3 supports external journals. Look for the -O option in the
mke2fs manual page. Using the NVRAM device as swap is not different
from a using normal swap partition.


Erik

 




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: 2.6.15: mdrun, udev -- who creates nodes?

2006-02-10 Thread Bill Davidsen

linas wrote:


On Tue, Jan 31, 2006 at 04:40:46PM +, Jason Lunz was heard to remark:
 


[EMAIL PROTECTED] said:
   


-- kernel scans /dev/hda1, looking for md superblock
-- kernel assembles devices according to info found in the superblocks
-- udev creates /dev/md0, etc.
   


The problem is that some users and distributions build the drivers as
modules and/or disable in-kernel auto-assembly.
 


Not only that, the raid developers themselves consider autoassembly
deprecated.

http://article.gmane.org/gmane.linux.kernel/373620
   



Hmm. My knee-jerk, didn't-stop-to-think-about-it reaction is that 
this is one of the finest features of linux raid, so why remove it?


Speaking as a real-life sysadmin, with actual servers and actual failed
disks, disk cables and disk controllers, this is a life-saving feature. 
Persistent naming of devices in Linux has long been a problem, and in

this case, it seemed to work.

story
I once had an ide controller fail on an x86 board. I bought a new 
controller at the local store, recabled the disks, and booted. 
I was alarmed to find that the system was trying to mount /home 
as /usr, and /usr as /lib, etc. Turned out that /dev/hdc had  
gotten renamed as /dev/hde, etc., and I had to go through a long,

painful, rocket-science (yes, I *do* have a PhD) boot-floppy rescue
to restore the system to working order.

I shudder to think what would have happened if RAID reconstruction 
had started based on faulty device names. Worse, as part of my rescue

ops, I had to make multiple copies of /etc/fstab, which resided on
different disks (my root volume was raided), as well as the boot 
floppy, and each contained inconsistent info (needed to bootstrap 
my way back). Along the way, I made multiple errors in editing 
the /etc/fstab since I could not keep them straight; twiddling 
BIOS settings added to the confusion.  If this had been /etc/raid.conf 
instead, with reconstruction triggered off of it, this could have 
been an absolute disaster.

/story

Based on the above, real-life experience, my gut reaction is 
raid assembly based on config files is a bad idea. I don't 
understand how innocent, minor errors made by the sysadmin 
won't result in catastrophic data loss.




I fear you don't understand how the auto detect and assemble works, or 
more to the point what it does, since how it works is somewhat more 
complex. If you use partitions and UUIDs, you can just plug in the drives 
any old place and they will be found and recognised in spite of that. As 
long as you have a boot drive where the BIOS will use it, mdadm will 
find your stuff and put it together correctly. Neil does more magic than 
Harry Potter!


I know someone who gave this a real-life test, although I won't say who.
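
For reference, a sketch of what that looks like in practice (the UUID 
shown is just an example):

  mdadm --examine --scan >> /etc/mdadm.conf
  # produces lines like:  ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
  mdadm --assemble --scan

Assembly then keys off the superblock UUIDs, not off which cable or 
controller port a drive happens to land on.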

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



shared spare

2006-02-10 Thread Bill Davidsen
One of the things I like about the IBM ServeRAID controller is a spare 
drive shared between two RAID groups: first to fail gets it. For 
software RAID, is this at all in the future?
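
mdadm's monitor mode already gets close to this: put the arrays in the 
same spare-group in mdadm.conf and run the monitor, and it will move the 
spare to whichever array has a failed drive. A sketch (UUIDs elided):

  # /etc/mdadm.conf
  #   ARRAY /dev/md0 UUID=xxxxxxxx:... spare-group=shelf1
  #   ARRAY /dev/md1 UUID=xxxxxxxx:... spare-group=shelf1
  mdadm --monitor --scan --daemonise

Whether that counts as truly shared (the spare still lives in one array 
until it is needed) is a fair question.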


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: silent corruption with RAID1

2006-02-25 Thread Bill Davidsen

Moses Leslie wrote:


Hi,

I have a machine that currently has 4 drives in it (currently running
2.6.15.4). The first two drives are on the onboard SATA controller (VIA)
in a RAID-1.  I haven't had any issues with these.

The other two drives were added recently, along with an SiL PCI SATA card
to put them on.  lspci reports this card as:

:00:0a.0 Unknown mass storage controller: Silicon Image, Inc.
(formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA
Controller (rev 02)

I initially used mdadm to create a new RAID1 of the two new drives, and
added them into the LVM group that the other ones were in to expand the
drive, but pretty quickly noticed (via rsync -c) that all new files were
corrupted.

I've since pulled the 2nd set of drives out of the LVM to test.  It's only
when using a RAID-1 that I get occasional corruption.  I split the drives
(each 300GB) into 4 75GB partitions each, and created 3 md devices.   One
75GB raid1, one 150GB raid0, and 1 225GB raid5.

I used a script that newfs'd each one, dd'd multiple copies of files (one
run with a 1GB, one with 3GB, one with 6GB), md5'd those files, then
umounted.

At least once in each test run, there was a file with the wrong checksum
when on the RAID-1 part of the test.

After completing all the tests, I redid the md devices such that none
of them used any of the same partitions that they had used in the first
test (IE the RAID1 was sda1 and sdb1 in the first one, and was sda4 and
sdb4 in the second one).

I also did the same test using each of the regular partitions as well
(sda1-4 and sdb1-4).

I was never able to duplicate any corruption any other time than with the
RAID1.

There's never any error messages in dmesg or syslog.

Is there anything I can do to help track down where the problem is?



Based on my own experience, I would suspect hardware. I can't swear that 
you don't have buggy software of some kind, but I've been running for 
over a year on RAID-1 with critical data on the volume, and haven't seen 
any indication of problems. Because of the data, the files get checked 
against md5sums daily and sha1sums monthly. Some files are old, some are 
added almost every day, files seldom are updated, but it does happen, 
and they are moved to new directories on a fairly frequent (2-3 
times/mo) basis. The checkfiles are run against an archival copy on 
another system about once a month, so I'm pretty sure there is no 
corruption happening.
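
A sketch of that kind of checking, with made-up paths:

  # build the manifest
  find /data -type f -print0 | xargs -0 md5sum > /var/lib/verify/data.md5
  # later, report anything that no longer matches
  md5sum -c /var/lib/verify/data.md5 | grep -v ': OK$'

It won't tell you whether the disk, the controller or the RAM corrupted a 
file, but it tells you quickly that something did.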


Cables are my favorite source of intermittent evil; memory problems are 
next, but those usually show up everywhere if you look hard. Hope some of 
this is useful.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: block level vs. file level

2006-02-25 Thread Bill Davidsen

Molle Bestefich wrote:


it wrote:
 


Ouch.

How does hardware raid deal with this? Does it?
   



Hardware RAID controllers deal with this by rounding the size of
participant devices down to nearest GB, on the assumption that no
drive manufacturers would have the guts to actually sell eg. a 250 GB
drive with less than exactly 250.000.000.000 bytes of space on it.

(It would be nice if the various flavors of Linux fdisk had an option
to do this. It would be very nice if anaconda had an option to do
this.)

I guess if you care you specify the size of the partition instead of 
using it all. I use fdisk usually, cfdisk when installing; both let me 
set the size, and fdisk lets me set the starting track and even play with 
the partition table's idea of geometry. What kind of an option did you 
have in mind?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: 4 disks: RAID-6 or RAID-10 ..

2006-03-05 Thread Bill Davidsen

Gordon Henderson wrote:


On Fri, 17 Feb 2006, Francois Barre wrote:

 


2006/2/17, Gordon Henderson [EMAIL PROTECTED]:
   


On Fri, 17 Feb 2006, berk walker wrote:

 


RAID-6 *will* give you your required 2-drive redundancy.
   


Anyway, if you wish to resize your setup to 5 drives one day or
another, I guess raid 6 would be preferable, because one day or
another, a patch will popup and make raid6 resizing possible. Or won't
it ?
   



Resizing isn't something I really care for. This particular box will be
sent away to a data centre where it'll stay for 3 years until I replace
it.

(And if I really do need more disk space in the meantime, I'll just build
another :)

Still scratching my head, trying to work out if raid-10 can withstand
(any) 2 disks of failure though, although after reading md(4) a few times
now, I'm begining to think it can't (unless you are lucky!) So maybe I'll
just stick with Raid-6 as I know that!



With only four drives you can just enumerate the possible failure cases; 
there are only six two-drive combinations... In RAID-10, once any one 
drive has failed you can only survive the failure of two of the three 
remaining drives (losing the mirror partner is fatal), so it is not what 
you wanted. How reliable you NEED to be here is the real question.

It isn't too hard to make the drives more reliable than the case they're 
in; how many fans and power supplies can you survive losing?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Does grub support sw raid1?

2006-03-22 Thread Bill Davidsen

Mike Hardy wrote:


This works for me, there are several pages out there (I recall using the
commands from a gentoo one most recently) that show the exact sequence
of grub things you should do to get grub in the MBR of both disks.

It sounds like your machine may not be set to boot off of anything other
than that one disk though? Is that maybe a BIOS thing?

I dunno, but I have definitely pulled a primary drive out of the system
completely and booted off the second one, then had linux come up with
(correctly) degraded arrays
 



Frequently a BIOS will boot a 2nd drive if the first is missing/dead, 
but not if it returns bad data (CRC error). A soft fail is not handled 
correctly by all BIOSes in use, possibly a majority of them.
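
For reference, the usual trick for getting grub onto the second disk of 
the pair (device names are examples) is to remap it from the grub shell:

  grub
  grub> device (hd0) /dev/sdb
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit

That way the boot code written to sdb refers to itself as hd0, which is 
what the BIOS will call it once the first disk is gone.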



-Mike

Herta Van den Eynde wrote:
 


(crossposted to linux-raid@vger.kernel.org and redhat-list@redhat.com)
(apologies for this, but this should have be operational last week)

I installed Red Hat EL AS 4 on a HP Proliant DL380, and configured all
system devices in software RAID 1.  I added an entry to grub.conf to
fallback to the second disk in case the first entry fails.  At boottime,
booting from hd0 works fine.  As does booting from hd1.

Until I physically remove hd0 from the system.

I tried manually installing grub on hd1,
I added hd1 to the device.map and subsequently re-installed grub on it,
I remapped hd0 to /dev/cciss/c0d1 and subsequently re-installed grub
all to no avail.

I previously installed this while the devices were in slots 2 and 3. The
system wouldn't even boot then.  It looks as though booting from sw
RAID1 will only work when there's a valid device in slot 0.  Still
preferable over hw RAID1, but even better would be if this worked all
the way.

Is this working for anyone?  Any idea what I may have overlooked?  Any
suggestions on how to debug this?

Kind regards,

Herta

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm



--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: software raid to Hardware raid

2006-03-27 Thread Bill Davidsen

Ken wrote:


Hello,
  We have a Red Hat 7.1 box running kernel 2.4.17-SMP.   We have 
raidtools installed.   We had a hardware RAID that got converted to a 
software raid by mistake.   Is there any way to go back without 
losing the data?   The device is showing up as /dev/md0.  Thank you. 


1 - back up your data
2 - convert to hw raid
3 - test carefully to be sure it works
4 - reload your data

And I left out step zero: consider why you want to do this if what you 
have is working reliably. Also, between 2 and 3, consider upgrading to a 
distribution and kernel written this millennium.


Given the state of hardware raid when that kernel was new, and the state 
of the drivers, I would be very sure I had a GOOD reason to change anything.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Partitioning md devices versus partitioining underlying devices

2006-04-06 Thread Bill Davidsen
On Thu, 6 Apr 2006, andy liebman wrote:

 Hi,
 
 I have a fundamental question about WHERE it is best to do partitioning.
 
 Here's a concrete example. I have two 3ware RAID-5 arrays, each made up 
 of twelve 500 GB drives. When presented to Linux, these are /dev/sda and 
 /dev/sdb -- each 5.5 TB in size.
 
 I want to stripe the two arrays together, so that 24 drives are all 
 operating as one unit. However, I don't want an 11 TB filesystem. I want 
 to keep my filesystems down below 6 TB.
 
 It seems I have two choices:
 
 1)  partition the 3ware devices to make /dev/sda1, /dev/sda2, /dev/sdb1 
 and /dev/sdb2.  Then I can create TWO md RAID-0 devices -- /dev/sda1 + 
 /dev/sdb1 = /dev/md1, /dev/sda2 + /dev/sdb2 = /dev/md2
 
 OR
 
 2) create /dev/md1 from the entire 3ware devices -- /dev/sda + /dev/sdb 
 = /dev/md1 -- and then partition /dev/md1 into two devices.
 
 The question is, are these essentially equivalent alternatives? Is there 
 any theoretical reason why one choice would be better than the other -- 
 in terms of security, performance, memory usage, etc.
 
 A knowledgeable answer would be appreciated. Thanks in advance.

There is one advantage to partitioning sda and sdb and then building 
devices using the partitions... you can use different stripe sizes on each 
md drive built on the partition. *IF* you have different things going on 
in the filesystems, you may be able to improve performance and spread head 
motion by using tuned stripe sizes.

I did this for an application which had an index of 128-byte index 
records pointing to a bunch of 500-1000k data records. I used a small stripe size 
on the index and large on the data, and was able to reduce time from 
request to data delivery by more than 20%. I was doing RAID-0 over six 
SCSI drives.

Assuming that you do the same thing on both filesystems, I see no benefit 
to one way over the other, I was just answering your question as to a 
possible benefit. Use of LVM or dm to do the same thing might allow you to 
change f/s sizes and such after the fact, I have only tried that as a 
learning exercise, so I can't say how well it works in practice, or which 
is better for you.
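
As a sketch of that layout, with hypothetical partitions and chunk sizes:

  # small chunks for the index filesystem, large chunks for the bulky data
  mdadm --create /dev/md1 --level=0 --chunk=32  --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=0 --chunk=512 --raid-devices=2 /dev/sda2 /dev/sdb2

(My original test was RAID-0 over six SCSI drives, but the idea is the same.)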

-- 
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
Doing interesting things with little computers since 1979



Re: Softraid controllers and Linux

2006-04-16 Thread Bill Davidsen

Jim Klimov wrote:


Hello linux-raid,

 I have tried several cheap RAID controllers recently (namely,
 VIA VT6421, Intel 6300ESB and Adaptec/Marvell 885X6081).
 
 VIA one is a PCI card, the second two are built in a Supermicro

 motherboard (E7520/X6DHT-G).

 The intent was to let the BIOS of the controllers make a RAID1
 mirror of two disks independently of an OS to make redundant
 multi-OS booting transparent. While DOS and Windows saw their
 mirrors as a singular block device, Linux (FC5) accessed the
 two drives separately on all adapters.

 Is this a bug or a feature of the kernel driver support? (I did
 not try vendors' binary drivers, if there are any).

 

If I understand how Linux uses the drives, you have to make them raid 
manually. However, the nice thing about BIOS RAID is that it will boot 
the system if the first boot drive fails. If the drive fails hard the 
BIOS will go to the first functional drive and boot. But if you get a 
CRC error, some BIOS will try another and some will just fail.


Vendor dependent.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: RHEL3 kernel panic with md

2006-04-16 Thread Bill Davidsen

Colin McDonald wrote:


I appear to have a corrupt file system and now it is mirrored. LOL.

I am running Redhat Enterprise 3 and using mdtools.

I booted from the install media iso and went into rescue mode. RH was
unable to find the partitions automatically but after exiting into
bash I can run fdisk -l and I see all of the partitions.

I know this is sparse info but would any of the group be able to give
the best approach to getting them  mounted and fsck'd?



I'm happy to say I haven't tried this, but I would think that you can 
start the raid array manually and then run fsck (also manually). I would 
worry about why the rescue mode didn't find your data, though.
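
Roughly, and hedged because I haven't walked through it on the RHEL3 
rescue image, something like this (device names are examples):

  mdadm --assemble /dev/md0 /dev/hda1 /dev/hdb1
  fsck -n /dev/md0         # look first, read-only
  fsck /dev/md0            # then repair
  mount /dev/md0 /mnt/sysimage

Assemble the mirror first so fsck repairs both halves consistently; 
fsck'ing a single component while the other copy goes stale is asking for 
trouble.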


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: disks becoming slow but not explicitly failing anyone?

2006-05-04 Thread Bill Davidsen

Nix wrote:


On 23 Apr 2006, Mark Hahn stipulated:
 


I've seen a lot of cheap disks say (generally deep in the data sheet
that's only available online after much searching and that nobody ever
reads) that they are only reliable if used for a maximum of twelve hours
a day, or 90 hours a week, or something of that nature. Even server
 

I haven't, and I read lots of specs.  they _will_ sometimes say that 
non-enterprise drives are intended or designed for an 8x5 desktop-like

usage pattern.
   



That's the phrasing, yes: foolish me assumed that meant `if you leave it
on for much longer than that, things will go wrong'.

 

   to the normal way of thinking about reliability, this would 
simply mean a factor of 4.2x lower reliability - say from 1M to 250K hours
MTBF.  that's still many times lower rate of failure than power supplies or 
fans.
   



Ah, right, it's not a drastic change.

 


It still stuns me that anyone would ever voluntarily buy drives that
can't be left switched on (which is perhaps why the manufacturers hide
 

I've definitely never seen any spec that stated that the drive had to be 
switched off.  the issue is really just what is the designed duty-cycle?
   



I see. So it's just `we didn't try to push the MTBF up as far as we would
on other sorts of disks'.

 


I run a number of servers which are used as compute clusters.  load is
definitely 24x7, since my users always keep the queues full.  but the servers
are not maxed out 24x7, and do work quite nicely with desktop drives
for years at a time.  it's certainly also significant that these are in a 
decent machineroom environment.
   



Yeah; i.e., cooled. I don't have a cleanroom in my house so the RAID
array I run there is necessarily uncooled, and the alleged aircon in the
room housing work's array is permanently on the verge of total collapse
(I think it lowers the temperature, but not by much).

 


it's unfortunate that disk vendors aren't more forthcoming with their drive
stats.  for instance, it's obvious that wear in MTBF terms would depend 
nonlinearly on the duty cycle.  it's important for a customer to know where 
that curve bends, and to try to stay in the low-wear zone.  similarly, disk
   



Agreed! I tend to assume that non-laptop disks hate being turned on and
hate temperature changes, so just keep them running 24x7. This seems to be OK,
with the only disks this has ever killed being Hitachi server-class disks in
a very expensive Sun server which was itself meant for 24x7 operation; the
cheaper disks in my home systems were quite happy. (Go figure...)

 

specs often just give a max operating temperature (often 60C!), which is 
almost disingenuous, since temperature has a superlinear effect on reliability.
   



I'll say. I'm somewhat twitchy about the uncooled 37C disks in one of my
machines: but one of the other disks ran at well above 60C for *years*
without incident: it was an old one with no onboard temperature sensing,
and it was perhaps five years after startup that I opened that machine
for the first time in years and noticed that the disk housing nearly
burned me when I touched it. The guy who installed it said that yes, it
had always run that hot, and was that important? *gah*

I got a cooler for that disk in short order.

 


a system designer needs to evaluate the expected duty cycle when choosing
disks, as well as many other factors which are probably more important.
for instance, an earlier thread concerned a vast amount of read traffic 
to disks resulting from atime updates.
   



Oddly, I see a steady pulse of write traffic, ~100Kb/s, to one dm device
(translating into read+write on the underlying disks) even when the
system is quiescient, all daemons killed, and all fsen mounted with
noatime. One of these days I must fish out blktrace and see what's
causing it (but that machine is hard to quiesce like that: it's in heavy
use).

 

simply using more disks also decreases the load per disk, though this is 
clearly only a win if it's the difference in staying out of the disks' 
duty-cycle danger zone (since more disks divide system MTBF).
   



Well, yes, but if you have enough more you can make some of them spares
and push up the MTBF again (and the cooling requirements, and the power
consumption: I wish there was a way to spin down spares until they were
needed, but non-laptop controllers don't often seem to provide a way to
spin anything down at all that I know of).

 

hdparm will let you set the spindown time. I have all mine set that way 
for power and heat reasons; they tend to be in burst use. Dropped the CR 
temp by enough to notice, but I need some more local cooling for that 
room still.
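For anyone who hasn't played with it, the incantation is roughly this; the 
timeout encoding is from memory (1-240 are units of 5 seconds, 241-251 are 
half-hours), so check the man page before trusting it:

  hdparm -S 120 /dev/hdg      # spin down after ~10 minutes idle (device name is an example)
  hdparm -y /dev/hdg          # force an immediate spindown to check the drive honours it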


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Two-disk RAID5?

2006-05-04 Thread Bill Davidsen

Erik Mouw wrote:


On Wed, Apr 26, 2006 at 03:22:38PM -0400, Jon Lewis wrote:
 


On Wed, 26 Apr 2006, Jansen, Frank wrote:

   


It is not possible to flip a bit to change a set of disks from RAID 1 to
RAID 5, as the physical layout is different.
 

As Tuomas pointed out though, a 2 disk RAID5 is kind of a special case 
where all you have is data and parity which is actually also just data. 
   



No, the other way around: RAID1 is a special case of RAID5.

No it isn't. If you have N drives in RAID1 you have N independent copies 
of the data and no parity, there's just no corresponding thing in RAID5, 
which has one copy of the data, plus parity. There is no special case, 
it just doesn't work that way. Set N>2 and report back.


Sorry, I couldn't find a diplomatic way to say you're completely wrong.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Two-disk RAID5?

2006-05-04 Thread Bill Davidsen

John Rowe wrote:


I'm about to create a RAID1 file system and a strange thought occurs to
me: if I create a two-disk RAID5 array then I can grow it later by the
simple expedient of adding a third disk and hence doubling its size.

Is there any real down-side to this, such as performance? Alternatively
is it likely that mdadm will be able to convert a RAID1 pair to
RAID5 any time soon? (Just how different are they anyway? Isn't the
RAID4/5 checksum just an XOR?)

I think it works; I just set up a little test case with two 20MB files 
on loopback devices. The mdadm seems to work, the mke2fs seems to work, 
the f/s is there. Please verify, this system is a bit (okay a bunch) hacked.
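Roughly what I did, from memory, so treat it as a sketch rather than a 
recipe (file and device names are just examples):

  # two 20MB scratch files on loop devices
  dd if=/dev/zero of=/tmp/r5a bs=1M count=20
  dd if=/dev/zero of=/tmp/r5b bs=1M count=20
  losetup /dev/loop0 /tmp/r5a
  losetup /dev/loop1 /tmp/r5b
  # two-device raid5, then a filesystem on top
  mdadm --create /dev/md9 --level=5 --raid-devices=2 /dev/loop0 /dev/loop1
  mke2fs -j /dev/md9
  mount /dev/md9 /mnt/test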


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4 disks in raid 5: 33MB/s read performance?

2006-05-24 Thread Bill Davidsen

Mark Hahn wrote:


I just dd'ed a 700MB iso to /dev/null, dd returned 33MB/s.
Isn't that a little slow?
   



what bs parameter did you give to dd?  it should be at least 3*chunk
(probably 3*64k if you used defaults.)



I would expect readahead to make this unproductive. Mind you, I didn't 
say it is, but I can't see why not. There was a problem with data going 
through stripe cache when it didn't need to, but I thought that was fixed.


Neil? Am I an optimist?

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 kicks non-fresh drives

2006-05-26 Thread Bill Davidsen

Mikael Abrahamsson wrote:


On Thu, 25 May 2006, Craig Hollabaugh wrote:

That did it! I set the partition FS Types from 'Linux' to 'Linux raid 
autodetect' after my last re-sync completed. Manually stopped and 
started the array. Things looked good, so I crossed my fingers and 
rebooted. The kernel found all the drives and all is happy here in 
Colorado.



Would it make sense for the raid code to somehow warn in the log when 
a device in a raid set doesn't have Linux raid autodetect partition 
type? If this was in dmesg, would you have spotted the problem before?


As long as it is written where logwatch will see it, not recognize it, 
and report it... People who don't read their logwatch reports get no 
sympathy from me.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can't get drives containing spare devices to spindown

2006-05-30 Thread Bill Davidsen
Did I miss an answer to this? As the weather gets hotter I'm doing all I 
can to reduce heat.


Marc L. de Bruin wrote:


Lo,

Situation: /dev/md0, type raid1, containing 2 active devices 
(/dev/hda1 and /dev/hdc1) and 2 spare devices (/dev/hde1 and /dev/hdg1).


Those two spare 'partitions' are the only partitions on those disks 
and therefore I'd like to spin down those disks using hdparm for 
obvious reasons (noise, heat). Specifically, 'hdparm -S <value> 
<device>' sets the standby (spindown) timeout for a drive; the value 
is used by the drive to determine how long to wait (with no disk 
activity) before turning off the spindle motor to save power.


However, it turns out that md actually sort-of prevents those spare 
disks from spinning down. I can get them off for about 3 to 4 seconds, after 
which they immediately spin up again. Removing the spare devices from 
/dev/md0 (mdadm /dev/md0 --remove /dev/hd[eg]1) actually solves this, 
but I have no intention of actually removing those devices.


How can I make sure that I'm actually able to spin down those two 
spare drives? 




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with raid=noautodetect

2006-05-30 Thread Bill Davidsen

Neil Brown wrote:


On Friday May 26, [EMAIL PROTECTED] wrote:
 


On Tue, May 23, 2006 at 08:39:26AM +1000, Neil Brown wrote:
   


Presumably you have a 'DEVICE' line in mdadm.conf too?  What is it.
My first guess is that it isn't listing /dev/sdd? somehow.
 


Neil,
i am seeing a lot of people falling into this same error, and i would
propose a way of avoiding this problem:

1) make DEVICE partitions the default if no device line is specified.
   



As you note, we think alike on this :-)

 


2) deprecate the DEVICE keyword issuing a warning when it is found in
the configuration file
   



Not sure I'm so keen on that, at least not in the near term.

Let's not start warning about and deprecating powerful features because they 
can be misused... If I wanted someone to make decisions for me I would not 
be using this software at all.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with raid=noautodetect

2006-05-31 Thread Bill Davidsen

Luca Berra wrote:


On Tue, May 30, 2006 at 01:10:24PM -0400, Bill Davidsen wrote:

2) deprecate the DEVICE keyword issuing a warning when it is 
found in

the configuration file



Not sure I'm so keen on that, at least not in the near term.

Let's not start warning about and deprecating powerful features because 
they can be misused... If I wanted someone to make decisions for me I 
would not be using this software at all.



you cut the rest of the mail.


Trimming the part about which I make no comment is usually a good thing.


i did not propose to deprecate the feature,
just the keyword.


A rose by any other name would still smell as sweet. In other words, 
the capability is still able to be misused, and changing the name or 
generating error messages will only cause work and concern for people 
using the feature.




but, ok,
just go on writing DEVICE /dev/sda1
DEVICE /dev/sdb1
ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1

then come on the list and complain when it stops working. 



What I suggest is that the feature keep working, and no one will 
complain. If there is a missing partition the error messages are clear. 
The feature is mainly used when there are partitions or drives which 
should not be examined, and stops working only when a hardware config 
has changed.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RAID5E

2006-05-31 Thread Bill Davidsen
Where I was working most recently some systems were using RAID5E (RAID5 
with both the parity and hot spare distributed). This seems to be highly 
desirable for small arrays, where spreading head motion over one more 
drive will improve performance, and in all cases where a rebuild to the 
hot spare will avoid a bottleneck on a single drive.


Is there any plan to add this capability?

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: which CPU for XOR?

2006-06-13 Thread Bill Davidsen

Dexter Filmore wrote:


What type of operation is XOR anyway? Should be ALU, right?
So - what CPU is best for software raid? One with high integer processing 
power? 

Unless you're running really low on CPU, it probably doesn't matter... 
you run out of memory bandwidth on large data (larger than cache) 
anyway. That's my take on it.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: to understand the logic of raid0_make_request

2006-06-13 Thread Bill Davidsen

Neil Brown wrote:


On Tuesday June 13, [EMAIL PROTECTED] wrote:
 


hello,everyone.
I am studying the code of raid0.But I find that the logic of
raid0_make_request is a little difficult to understand.
Who can tell me what the function of raid0_make_request will do eventually?
   



One of two possibilities.

Most often it will update bio->bi_bdev and bio->bi_sector to refer to
the correct location on the correct underlying device, and then 
will return '1'.

The fact that it returns '1' is noticed by generic_make_request in
block/ll_rw_blk.c and generic_make_request will loop around and
retry the request on the new device at the new offset.

However, in the unusual case that the request crosses a chunk boundary
and so needs to be sent to two different devices, raid0_make_request
will split the bio in two (using bio_split), will submit each of the
two bios directly down to the appropriate devices - and will then
return '0', so that generic_make_request doesn't loop around.

I hope that helps.

Helps me, anyway, thanks! I wish the comments on stuff like that in 
general were this clear; you can see what the code *does*, but you have to 
hope that it's what the coder *intended*. And if you're looking for a 
bug it may not be, so this is not an idle complaint.


Some of the kernel coders think if it was hard to write it should be 
hard to understand.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ANNOUNCE: mdadm 2.5.1 - A tool for managing Soft RAID under Linux

2006-06-16 Thread Bill Davidsen

Paul Clements wrote:


Neil Brown wrote:



I am pleased to announce the availability of
   mdadm version 2.5.1



Hi Neil,

Here's a small patch to allow compilation on gcc 2.x. It looks like 
gcc 3.x allows variable declarations that are not at the start of a 
block of code (I don't know if there's some standard that allows that 
in C code now, but it doesn't work with all C compilers). 


Even if valid, having the declaration at the top of the block in which 
it's used makes the program more readable.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is shrinking raid5 possible?

2006-06-22 Thread Bill Davidsen

Neil Brown wrote:


On Monday June 19, [EMAIL PROTECTED] wrote:
 


Hi,

I'd like to shrink the size of a RAID5 array - is this
possible? My first attempt shrinking 1.4Tb to 600Gb,

mdadm --grow /dev/md5 --size=629145600

gives

mdadm: Cannot set device size/shape for /dev/md5: No space left on device
   



Yep.
The '--size' option refers to:
 Amount  (in  Kibibytes)  of  space  to  use  from  each drive in
 RAID1/4/5/6.  This must be a multiple of  the  chunk  size,  and
 must  leave about 128Kb of space at the end of the drive for the
 RAID superblock.  
(from the man page).


So you were telling md to use the first 600GB of each device in the
array, and it told you there wasn't that much room.
If your array has N drives, you need to divide the target array size
by N-1 to find the target device size.
So if you have a 5 drive array, then you want
 --size=157286400



May I say in all honesty that making people do that math instead of the 
computer is a really bad user interface? Good, consider it said. A 
means to just set the target size of the resulting raid device would be 
a LOT less likely to cause bad user input, and while I'm complaining it 
should understand the suffixes 'k', 'm', and 'g'.


Far easier to use for the case where you need, for instance, 10G of 
storage for a database: tell mdadm what devices to use and what you need 
(and the level, of course) and let the computer figure out the details - 
rounding up, leaving 128k, and the phase of the moon if you decide to use it.
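To make the complaint concrete, here is the arithmetic the user is being 
asked to do by hand for the 1.4Tb-to-600Gb case above (assuming, just for 
the example, a 4-drive array and the default 64k chunk):

  # target array size: ~600GB = 629145600 KiB
  # per-device --size = array size / (drives - 1), rounded to a chunk multiple
  echo $(( 629145600 / (4 - 1) ))      # 209715200, already a multiple of 64 KiB
  mdadm --grow /dev/md5 --size=209715200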


Sorry, I think the current approach is baaad human interface.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New FAQ entry? (was IBM xSeries stop responding during RAID1 reconstruction)

2006-06-22 Thread Bill Davidsen

Niccolo Rigacci wrote:


personally, I don't see any point to worrying about the default,
compile-time or boot time:

for f in `find /sys/block/* -name scheduler`; do echo cfq > $f; done
 



I tested this case:

- reboot as per power failure (RAID goes dirty)
- RAID start resyncing as soon as the kernel assemble it
- every disk activity is blocked, even DHCP failed!
- host services are unavailable

This is why I changed the kernel default.

 

Changing on the command line assumes that you built all of the 
schedulers in... but making that assumption, perhaps the correct 
fail-safe is to have cfq as the default, and at the end of rc.local 
check for a rebuild; if everything is clean, change to whatever works 
best at the end of the boot. If the raid is not clean stay with cfq.
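Something like this at the end of rc.local would do it; untested, and 
'deadline' below is just my example of the "something else":

  # stay on cfq while any md array is resyncing/recovering, otherwise switch
  if grep -qE 'resync|recovery' /proc/mdstat; then
      :   # rebuild in progress - leave cfq alone
  else
      for f in /sys/block/sd*/queue/scheduler; do
          echo deadline > "$f"
      done
  fi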


Has anyone tried deadline for this? I think I had this as default and 
didn't hang on a raid5 fail/rebuild.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ok to go ahead with this setup?

2006-06-22 Thread Bill Davidsen

Christian Pernegger wrote:


Hi list!

Having experienced firsthand the pain that hardware RAID controllers
can be -- my 3ware 7500-8 died and it took me a week to find even a
7508-8 -- I would like to switch to kernel software RAID.

Here's a tentative setup:

Intel SE7230NH1-E mainboard
Pentium D 930
2x1GB Crucial 533 DDR2 ECC
Intel SC5295-E enclosure

Promise Ultra133 TX2 (2ch PATA)
  - 2x Maxtor 6B300R0 (300GB, DiamondMax 10) in RAID1

Onboard Intel ICH7R (4ch SATA)
  - 4x Western Digital WD5000YS (500GB, Caviar RE2) in RAID5

* Does this hardware work flawlessly with Linux?

* Is it advisable to boot from the mirror?
 Would the box still boot with only one of the disks?



Let me say this about firmware and mirrors: while virtually every BIOS will 
boot the next disk if the first fails, some will not fail over if the 
first drive is returning a parity error but still returning data. Take that 
data any way you want; drive failure at power cycle is somewhat more 
likely than failure while running.
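As for whether the box still boots with one disk gone: the usual 
belt-and-braces step is to install the boot loader on both halves of the 
mirror, e.g. (disk name is just an example; lilo users do the equivalent 
in lilo.conf):

# put a boot loader on the second half of the mirror as well
grub --batch <<EOF
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
EOF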




* Can I use EVMS as a frontend?
 Does it even use md or is EVMS's RAID something else entirely?

* Should I use the 300s as a single mirror, or span multiple ones over
the two disks?

* Am I even correct in assuming that I could stick an array in another
box and have it work?

Comments welcome

Thanks,

Chris
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ok to go ahead with this setup?

2006-06-22 Thread Bill Davidsen

Molle Bestefich wrote:


Christian Pernegger wrote:


Anything specific wrong with the Maxtors?



No.  I've used Maxtor for a long time and I'm generally happy with them.

They break now and then, but their online warranty system is great.
I've also been treated kindly by their help desk - talked to a cute
gal from Maxtor in Ireland over the phone just yesterday ;-).

Then again, they've just been acquired by Seagate, or so, so things
may change for the worse, who knows.

I'd watch out regarding the Western Digital disks, apparently they
have a bad habit of turning themselves off when used in RAID mode, for
some reason:
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/1980/ 



Based on three trials in five years, I'm happy with WD and Seagate. WD 
didn't ask when I bought it, just the serial for manufacturing date.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-23 Thread Bill Davidsen

Martin Schröder wrote:


2006/6/23, Francois Barre [EMAIL PROTECTED]:


Losing data is worse than losing anything else. You can buy you



That's why RAID is no excuse for backups. 



The problem is that there is no cost-effective backup available. When a 
tape was the same size as a disk and 10% the cost, backups were 
practical. Today anything larger than a hobby-size disk is just not easy 
to back up. Anything large enough to be useful is expensive; small media, 
or anything you can't take off-site and lock in a vault, aren't backups 
so much as copies, which may protect against some problems but 
provide little to no protection against site disasters.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-25 Thread Bill Davidsen

Adam Talbot wrote:


OK, this topic I really need to get in on.
I have spent the last few weeks benchmarking my new 1.2TB, 6 disk, RAID6
array. I wanted real numbers, not "This FS is faster because..." I have
moved over 100TB of data on my new array running the benchmark
testing.  I have yet to have any major problems with ReiserFS, EXT2/3,
JFS, or XFS.  I have done extensive testing on all, including just
trying to break the file system with billions of 1k files, or a 1TB
file. Was able to cause some problems with EXT3 and ReiserFS with the 1KB
and 1TB tests, respectively, but both were fixed with a fsck. My basic
test is to move all data from my old server to my new server
(whitequeen2) and clock the transfer time.  Whitequeen2 has very little
storage.  The NAS's 1.2TB of storage is attached via iSCSI and a cross
over cable to the back of whitequeen2.  The data is 100GB of user's
files (1KB~2MB), 50GB of MP3's (1MB~5MB) and the rest is movies and
system backups (600MB~2GB).  Here is a copy of my current data sheet,
including specs on the servers and copy times; my numbers are not
perfect, but they should give you a clue about speeds...  XFS wins.
 



In many (most?) cases I'm a lot more concerned about filesystem 
stability than performance. That is, I want the fastest reliable 
filesystem. With ext2 and ext3 I've run multiple multi-TB machines 
spread over four time zones, and not had a f/s problem updating ~1TB/day.



The computer: whitequeen2
AMD Athlon64 3200 (2.0GHz)
1GB Corsair DDR 400 (2X 512MB's running in dual DDR mode)
Foxconn 6150K8MA-8EKRS motherboard
Off brand case/power supply
2X os disks, software raid array, RAID 1, Maxtor 51369U3, FW DA620CQ0
Intel pro/1000 NIC
CentOS 4.3 X86_64 2.6.9
   Main app server, Apache, Samba, NFS, NIS

The computer: nas
AMD Athlon64 3000 (1.8GHz)
256MB Corsair DDR 400 (2X 128MB's running in dual DDR mode)
Foxconn 6150K8MA-8EKRS motherboard
Off brand case/power supply and drive cages
2X os disks, software raid array, RAID 1, Maxtor 51369U3, FW DA620CQ0
6X software raid array, RAID 6, Maxtor 7V300F0, FW VA111900
Gentoo linux. X86_64 2.6.16-gentoo-r9
  System built very lite, only built as an iSCSI based NAS.

EXT3
Config=APP+NFS--NAS+iSCSI
RAID6 64K chunk
[EMAIL PROTECTED] tmp]# time tar cf - . | (cd /data ; tar xf - )
real    371m29.802s
user    1m28.492s
sys     46m48.947s
/dev/sdb1 1.1T  371G  674G  36% /data
6.192 hours @ 61,262M/hour or 1021M/min or 17.02M/sec


EXT2
Config=APP+NFS--NAS+iSCSI
RAID6 64K chunk
[EMAIL PROTECTED] tmp]# time tar cf - . | ( cd /data/ ; tar xf - )
real    401m48.702s
user    1m25.599s
sys     30m22.620s
/dev/sdb1 1.1T  371G  674G  36% /data
6.692 hours @ 56,684M/hour or 945M/min or 15.75M/sec


Did you tune the extN filesystems to the stripe size of the raid?

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-25 Thread Bill Davidsen

Justin Piszcz wrote:



On Sat, 24 Jun 2006, Neil Brown wrote:


On Friday June 23, [EMAIL PROTECTED] wrote:


The problem is that there is no cost effective backup available.



One-liner questions :
- How does Google make backups ?



No, Google ARE the backups :-)


- Aren't tapes dead yet ?



LTO-3 does 300Gig, and LTO-4 is planned.
They may not cope with tera-byte arrays in one hit, but they still
have real value.


- What about a NUMA principle applied to storage ?



You mean an Hierarchical Storage Manager?  Yep, they exist.  I'm sure
SGI, EMC and assorted other TLAs could sell you one.



LTO3 is 400GB native and we've seen very good compression, so 
800GB-1TB per tape. 


The problem is in small business use: LTO3 is costly in the 1-10TB 
range, and takes a lot of media changes as well. A TB of RAID-5 is 
~$500, and at that small size the cost of drives and media is 
disproportionately high. Using more drives is cost effective, but they 
are not good for long-term off-site storage, because they're large and 
fragile.


No obvious solutions in that price and application range that I see.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-26 Thread Bill Davidsen

Adam Talbot wrote:

Not exactly sure how to tune for stripe size. 
What would you advise?

-Adam
 



See the -R option of mke2fs. I don't have a number for the performance 
impact of this, but I bet someone else on the list will. Depending on 
what posts you read, reports range from measurable to significant, 
without quantifying.


Note, next month I will set up either a 2x750 RAID-1 or 4x250 RAID-5 
array, and if I go RAID-5 I will have the chance to run some metrics 
before putting the hardware into production service. I'll report on the 
-R option if I have any data.
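For the archives, the sort of thing I mean, assuming a 64k chunk and 4k 
blocks so stride = 64/4 = 16 (newer mke2fs spells it -E stride=16, if 
memory serves):

  mke2fs -j -b 4096 -R stride=16 /dev/md0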




Bill Davidsen wrote:
 


Adam Talbot wrote:

   


OK, this topic I really need to get in on.
I have spent the last few weeks benchmarking my new 1.2TB, 6 disk, RAID6
array. I wanted real numbers, not "This FS is faster because..." I have
moved over 100TB of data on my new array running the benchmark
testing.  I have yet to have any major problems with ReiserFS, EXT2/3,
JFS, or XFS.  I have done extensive testing on all, including just
trying to break the file system with billions of 1k files, or a 1TB
file. Was able to cause some problems with EXT3 and ReiserFS with the 1KB
and 1TB tests, respectively, but both were fixed with a fsck. My basic
test is to move all data from my old server to my new server
(whitequeen2) and clock the transfer time.  Whitequeen2 has very little
storage.  The NAS's 1.2TB of storage is attached via iSCSI and a cross
over cable to the back of whitequeen2.  The data is 100GB of user's
files (1KB~2MB), 50GB of MP3's (1MB~5MB) and the rest is movies and
system backups (600MB~2GB).  Here is a copy of my current data sheet,
including specs on the servers and copy times; my numbers are not
perfect, but they should give you a clue about speeds...  XFS wins.


 


In many (most?) cases I'm a lot more concerned about filesystem
stability than performance. That is, I want the fastest reliable
filesystem. With ext2 and ext3 I've run multiple multi-TB machines
spread over four time zones, and not had a f/s problem updating ~1TB/day.

   


Did you tune the extN filesystems to the stripe size of the raid?

   




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 degraded after mdadm -S, mdadm --assemble (everytime)

2006-06-26 Thread Bill Davidsen

Ronald Lembcke wrote:


Hi!

I set up a RAID5 array of 4 disks. I initially created a degraded array
and added the fourth disk (sda1) later.

The array is clean, but when I do  
 mdadm -S /dev/md0 
 mdadm --assemble /dev/md0 /dev/sd[abcd]1

it won't start. It always says sda1 is failed.

When I remove sda1 and add it again everything seems to be fine until I
stop the array. 


Below is the output of /proc/mdstat, mdadm -D -Q, mdadm -E and a piece of the
kernel log.
The output of mdadm -E looks strange for /dev/sd[bcd]1, saying 1 failed.

What can I do about this?
How could this happen? I mixed up the syntax when adding the fourth disk and
tried these two commands (at least one didn't yield an error message):
mdadm --manage -a /dev/md0 /dev/sda1
mdadm --manage -a /dev/sda1 /dev/md0


Thanks in advance ...
 Roni



ganges:~# cat /proc/mdstat 
Personalities : [raid5] [raid4] 
md0 : active raid5 sda1[4] sdc1[0] sdb1[2] sdd1[1]
      691404864 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>


I will just comment that the 0 1 2 4 numbering on the devices is 
unusual. When you created this did you do something which made md think 
there was another device, failed or missing, which was device [3]? I just 
looked at a bunch of my arrays and found no similar examples.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IBM xSeries stop responding during RAID1 reconstruction

2006-06-26 Thread Bill Davidsen

Mr. James W. Laferriere wrote:


Hello Gabor ,

On Tue, 20 Jun 2006, Gabor Gombas wrote:


On Tue, Jun 20, 2006 at 03:08:59PM +0200, Niccolo Rigacci wrote:


Do you know if it is possible to switch the scheduler at runtime?


echo cfq > /sys/block/<disk>/queue/scheduler



	At least one can do an ls of the /sys/block area & then do an 
automated echo cfq down the tree.  Does anyone know of a method to set a 
default scheduler?  Scanning down a list or manually maintaining a list 
seems to be a bug in the waiting.  Tia, JimL


Thought I posted this... it can be set in the kernel build or on the boot 
parameters from grub/lilo.
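For the record, the boot-parameter route is just a matter of tacking 
elevator= onto the kernel line; an illustrative grub.conf fragment (kernel 
version and root device are examples):

  # /boot/grub/grub.conf - pick the default I/O scheduler at boot
  kernel /vmlinuz-2.6.17 ro root=/dev/md0 elevator=cfq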


2nd thought: set it to cfq by default, then at the END of rc.local, if 
there are no arrays rebuilding, change to something else if you like.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: I need a PCI V2.1 4 port SATA card

2006-06-29 Thread Bill Davidsen

Gordon Henderson wrote:


On Wed, 28 Jun 2006, Christian Pernegger wrote:

 


I also subscribe to the almost commodity hardware philosophy,
however I've not been able to find a case that comfortably takes even
8 drives. (The Stacker is an absolute nightmare ...) Even most
rackable cases stop at 6 3.5" drive bays -- either that or they are
dedicated storage racks with integrated hw RAID and fiber SCSI
interconnect -- definitely not commodity.
   



I've used these:

 http://www.acme-technology.co.uk/acm338.htm

(8 drives in a 3U case), and their variants

eg:

 http://www.acme-technology.co.uk/acm312.htm

(12 disks in a 3U case)
 



Interesting ad, with a masonic emblem, and a picture of a white case 
with a note saying it's only available in black. Of course the hardware 
may be perfectly fine, but I wouldn't count on color.



for several years with good results. Not the cheapest on the block though,
but never had any real issues with them.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cutting power without breaking RAID

2006-06-29 Thread Bill Davidsen

Niccolo Rigacci wrote:


On Thu, Jun 29, 2006 at 02:00:09PM +1000, Neil Brown wrote:
 

With 2.6, 
  killall -9 md0_raid1


should do the trick (assuming root is on /dev/md0.  If it is elsewhere,
choose a different process name).
   



Thanks, this is what I was looking for!

I will try remounting read-only and killing the md0_raid1.
I will keep you informed.

 

Why should this trickery be needed? When an array is mounted r/o it 
should be clean. How can it be dirty? I assume readonly implies noatime; 
I mount physically readonly devices without explicitly saying noatime 
and nothing whines.
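For the record, the whole sequence being discussed boils down to something 
like this (assuming root really is on /dev/md0):

  mount -o remount,ro /       # flush and stop further writes
  killall -9 md0_raid1        # Neil's trick: lets md mark the array clean
  # then cut the power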


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Random Seek on Array as slow as on single disk

2006-07-17 Thread Bill Davidsen

A. Liemen wrote:


Hardware Raid.

http://www.areca.com.tw/products/html/pcix-sata.htm



You should ask the vendor, this isn't a software RAID issue, and the 
usual path to improving bad hardware is in upgrading. You may be able to 
get better firmware if you're lucky.




Alex

Jeff Breidenbach schrieb:


Controller: Areca ARC 1160 PCI-X 1GB Cache



Those numbers are for Areca hardware raid or linux software raid?
--Jeff
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Hardware assisted parity computation - is it now worth it?

2006-07-17 Thread Bill Davidsen

Burn Alting wrote:


Last year, there were discussions on this list about the possible
use of a 'co-processor' (Intel's IOP333) to compute raid 5/6's
parity data.

We are about to see low cost, multi core cpu chips with very
high speed memory bandwidth. In light of this, is there any
effective benefit to such devices as the IOP333?
 



Was there ever? Unless you're running on a really slow CPU, like a 386, 
with a TB of RAID attached and heavy CPU load, could anyone ever see a 
measurable performance gain? I haven't seen any such benchmarks, 
although I haven't looked beyond reading several related mailing lists.



Or in other words, is a cheaper (power, heat, etc) cpu with
higher memory access speeds, more cost effective than a
bridge/bus device (ie hardware) solution (which typically
has much lower memory access speeds)?

An additional device is always more complex, and less tunable than a CPU 
based solution. Except in the case above where there is very little CPU 
available, I don't see much hope for a cost (money and complexity) 
effective non-CPU solution.


Obviously my opinion only.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Array will not assemble

2006-07-17 Thread Bill Davidsen

Richard Scobie wrote:


Neil Brown wrote:


Add
  DEVICE /dev/sd?
or similar on a separate line.
Remove
  devices=/dev/sdc,/dev/sdd



Thanks.

My mistake, I thought after having assembled the arrays initially, 
that the output of:


 mdadm --detail --scan > mdadm.conf

could be used directly.

I'm using Centos 4.3, which I believe is the latest RHEL 4 and they 
are only on mdadm 1.6  :( 



Do understand that the whole purpose of RHEL is to have a stable system. 
Upgrades are not done; instead bugs are fixed in the original version to 
correct security or stability issues. However, feature changes are not 
provided, because new versions mean new issues. Stability has its price.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: second controller: what will my discs be called, and does it matter?

2006-07-17 Thread Bill Davidsen

Dexter Filmore wrote:

Currently I have 4 discs on a 4 channel sata controller which does its job 
quite well for 20 bucks. 
Now, if I wanted to grow the array I'd probably go for another one of these.


How can I tell if the discs on the new controller will become sd[e-h] or if 
they'll be the new a-d and push the existing ones back?
 


For software RAID you shouldn't care, for other things you might.


Next question: assembling by UUID, does that matter at all?
 


No. There's the beauty of it.

(And while talking UUID - can I safely migrate to a udev-kernel? Someone on 
this list recently ran into trouble because of such an issue.)


You shouldn't lose data unless you panic at the first learning 
experience and do something without thinking of the results. I would 
convert to UUID first, obviously.
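The conversion itself is only a couple of lines; something like this, 
where the UUID shown is obviously just a placeholder:

  mdadm --detail /dev/md0 | grep -i uuid
  # then in /etc/mdadm.conf:
  #   DEVICE partitions
  #   ARRAY /dev/md0 UUID=e6c3e6e1:7a6f0e2b:12345678:9abcdef0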


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: issue with internal bitmaps

2006-07-17 Thread Bill Davidsen

Neil Brown wrote:


On Thursday July 6, [EMAIL PROTECTED] wrote:
 


hello, i just realized that internal bitmaps do not seem to work
anymore.
   



I cannot imagine why.  Nothing you have listed show anything wrong
with md...

Maybe you were expecting
  mdadm -X /dev/md100
to do something useful.  Like -E, -X must be applied to a component
device.  Try
  mdadm -X /dev/sda1

To take this from the other end, why should -X apply to a component? 
Since the components can and do change names, and you frequently mention 
assembly by UUID, why aren't the component names determined from the 
invariant array name when mdadm wants them, instead of having a user or 
script check the array to get the components?


Between udev and dynamic reconfiguration, component names have 
become less and less relevant; perhaps they can be used less in the future.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] enable auto=yes by default when using udev

2006-07-17 Thread Bill Davidsen

Michael Tokarev wrote:


Neil Brown wrote:
 


On Monday July 3, [EMAIL PROTECTED] wrote:
   


Hello,
the following patch aims at solving an issue that is confusing a lot of
users.
when using udev, device files are created only when devices are
registered with the kernel, and md devices are registered only when
started.
mdadm needs the device file _before_ starting the array.
so when using udev you must add --auto=yes to the mdadm commandline or
to the ARRAY line in mdadm.conf

following patch makes auto=yes the default when using udev
 


The principle I'm reasonably happy with, though you can now make this
the default with a line like

 CREATE auto=yes
in mdadm.conf.

However

   


+
+	/* if we are using udev and auto is not set, mdadm will almost
+	 * certainly fail, so we force it here.
+	 */
+	if (autof == 0 && access("/dev/.udevdb", F_OK) == 0)
+		autof = 2;
+
 


I'm worried that this test is not very robust.
On my Debian/unstable system running udev, there is no
/dev/.udevdb
though there is a
/dev/.udev/db

I guess I could test for both, but then udev might change
again. I'd really like a more robust check.
   



Why test for udev at all?  If the device does not exist, regardless of
whether udev is running or not, it might be a good idea to try to create it.
Because IT IS NEEDED, period.  Whether the operation fails or not, and
whether we fail when it fails - that's another question, and I think
that w/o explicit auto=yes, we may ignore a create error and try to continue,
and with auto=yes, we fail on a create error.
 

I have to agree here, I can't think of a case where creation of the 
device name would not be desirable, udev or no. But to be cautious, 
perhaps the default should be to create the device if the path starts 
with /dev/ or /tmp/ unless auto creation is explicitly off. I don't 
think udev or mount points come into the default decision at all, there 
are just too many options on naming.



Note that /dev might be managed by some other tool as well, like mudev
from busybox, or just a tiny shell /sbin/hotplug script.

Note also that the whole root filesystem might be on tmpfs (like in
initramfs), so /dev will not be a mountpoint.
 


Agree with both points.


Also, I think mdadm should stop creating strange temporary nodes somewhere
as it does now.  If /dev/whatever exists, use it. If not, create it (unless,
perhaps, auto=no is specified) directly with a proper mknod(/dev/mdX), but
don't try to use some temporary names in /dev or elsewhere.
 

True, but I don't see a case where this would be useful. And if it is, 
then add an auto=obscure_names option for the case where you really want 
that behaviour.



In case of nfs-mounted read-only root filesystem, if someone will ever need
to assemble raid arrays in that case.. well, he can either prepare proper
/dev on the nfs server, or use tmpfs-based /dev, or just specify /tmp/mdXX
instead of /dev/mdXX - whatever suits their needs better.

Because /dev and /tmp are well known special cases, I would default auto 
for them. In other cases explicit behavior could be specified.


Feel free to point out something bad which occurs by using this default 
behavior.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Test feedback 2.6.17.4+libata-tj-stable (EH, hotplug)

2006-07-17 Thread Bill Davidsen

Christian Pernegger wrote:


I finally got around to testing 2.6.17.4 with libata-tj-stable-20060710.

Hardware: ICH7R in ahci mode + WD5000YS's.

EH: much, much better. Before the patch it seemed like errors were
only printed to dmesg but never handed up to any layer above. Now md
actually fails the disk when I pull the (power) plug. I'll try my bad
cable once I can find it.

Hotplug: Unplugging was fine, took about 15s until the driver gave up
on the disk. After re-plugging the driver had to hard-reset the port
once to get the disk back, though that might be by design.

The fact that the disk had changed minor numbers after it was plugged
back in bugs me a bit. (was sdc before, sde after). Additionally udev
removed the sdc device file, so I had to manually recreate it to be
able to remove the 'faulty' disk from its md array.

Thanks for a great patch! I just hope it doesn't eat my data :) 


And thank you for testing!

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: which disk the the one that data is on?

2006-07-18 Thread Bill Davidsen

Shai wrote:


Hi,

I rebooted my server today to find out that one of the arrays is being
re-synced (see output below).
1. What does the (S) to the right of hdh1[5](S) mean?
2. How do I know, from this output, which disk is the one holding the
most current data and from which all the other drives are syncing?
Or do they all contain the data, and this sync process is
something else? Maybe I'm just not understanding what is being done
exactly?


In addition to what you have already been told: if you find that 
the array is in rebuild, I would be a lot more concerned with finding out why. 
If it was from an unclean shutdown you really should look into a bitmap if 
you don't have one.
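Adding one after the fact is painless with a recent mdadm and kernel; 
/dev/md0 below is just an example name:

  mdadm --grow /dev/md0 --bitmap=internal
  cat /proc/mdstat        # should now show a bitmap line for the array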


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: issue with internal bitmaps

2006-07-18 Thread Bill Davidsen

Bill Davidsen wrote:


Neil Brown wrote:


On Thursday July 6, [EMAIL PROTECTED] wrote:
 


hello, i just realized that internal bitmaps do not seem to work
anymore.
  



I cannot imagine why.  Nothing you have listed show anything wrong
with md...

Maybe you were expecting
  mdadm -X /dev/md100
to do something useful.  Like -E, -X must be applied to a component
device.  Try
  mdadm -X /dev/sda1

To take this from the other end, why should -X apply to a component? 
Since the components can and do change names, and you frequently 
mention assembly by UUID, why aren't the component names determined 
from the invariant array name when mdadm wants them, instead of having 
a user or script check the array to get the components?


Boy, I didn't say that well... what I meant to suggest is that when -E 
or -X are applied to the array as a whole, would it not be useful to 
iterate them over all of the components rather than looking for 
non-existent data in the array itself?
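Until something like that exists, a small wrapper does the job; a rough 
sketch (it scrapes /proc/mdstat, so treat it as illustrative only):

  # run mdadm -X on every component of md0
  for part in $(awk '/^md0 : / { for (i=5; i<=NF; i++) { sub(/\[.*/, "", $i); print $i } }' /proc/mdstat); do
      mdadm -X /dev/$part
  done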


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 005 of 9] md: Replace magic numbers in sb_dirty with well defined bit flags

2006-08-01 Thread Bill Davidsen

Ingo Oeser wrote:


Hi Neil,

I think the names in this patch don't match the description at all.
May I suggest different ones?

On Monday, 31. July 2006 09:32, NeilBrown wrote:
 


Instead of magic numbers (0,1,2,3) in sb_dirty, we have
some flags instead:
MD_CHANGE_DEVS
  Some device state has changed requiring superblock update
  on all devices.
   



MD_SB_STALE or MD_SB_NEED_UPDATE
 


I think STALE is better; it is unambiguous.

 


MD_CHANGE_CLEAN
  The array has transitions from 'clean' to 'dirty' or back,
  requiring a superblock update on active devices, but possibly
  not on spares
   



Maybe split this into MD_SB_DIRTY and MD_SB_CLEAN ?
 

I don't think the split is beneficial, but I don't care for the name 
much. Some name like SB_UPDATE_NEEDED or the like might be better.


 


MD_CHANGE_PENDING
  A superblock update is underway.  
   



MD_SB_PENDING_UPDATE

 

I would have said UPDATE_PENDING, but either is more descriptive than 
the original.


Neil - the logic in this code is pretty complex; all the help you can 
give the occasional reader by using very descriptive names for things 
is welcome, and reduces your load of questions caused by 
misunderstanding.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interesting RAID checking observations - I'm getting it too

2006-09-04 Thread Bill Davidsen




Mark Smith wrote:


Just a note, I've noticed this problem too. I run a RAID1 check once
every 24 hours, and while developing the script to do it, noticed that
the machine became virtually unusable - mouse was jumpy, typing lagged.

I run this check every morning at 4.00am so I'm usually asleep and
don't notice it, so it hasn't been a big bother to me.
 

Interesting, but do you run other stuff at that time? Several 
distributions run various things in the middle of the night which really 
bog down the machine.



The data may be a bit coarse, however here is what sysstat/sar says my
machine does during the check. Let me know if you want any more or other
sar data.


[ data dropped, not relevant to my suggestion ]

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Feature Request/Suggestion - Drive Linking

2006-09-04 Thread Bill Davidsen

Michael Tokarev wrote:


Tuomas Leikola wrote:
[]
 


Here's an alternate description. On first 'unrecoverable' error, the
disk is marked as FAILING, which means that a spare is immediately
taken into use to replace the failing one. The disk is not kicked, and
readable blocks can still be used to rebuild other blocks (from other
FAILING disks).

The rebuild can be more like a ddrescue type operation, which is
probably a lot faster in the case of raid6, and the disk can be
automatically kicked after the sync is done. If there is no read
access to the FAILING disk, the rebuild will be faster just because
seeks are avoided in a busy system.
   



It's not that simple.  The issue is with writes.  If there's a failing
disk, md code will need to keep track of up-to-date, or good, sectors
of it vs obsolete ones.  I.e., when a write fails, the data in that block
is either unreadable (but can become readable on the next try, say, after
a temperature change or whatnot), or readable but contains old data, or
is readable but contains some random garbage.  So at least that block(s)
of the disk should not be copied to the spare during resync, and should
not be read at all, to avoid returning wrong data to userspace.  In short,
if the array isn't stopped (or changed to read-only), we should watch for
writes, and remember which ones have failed.  Which is some non-trivial
change.  Yes, bitmaps somewhat help here.
 

It would seem that much of the code needed is already there. When doing 
the recovery the spare can be treated as a RAID1 copy of the failing 
drive, with all sectors out of date. Then the sectors from the failing 
drive can be copied, using reconstruction if needed, until there is a 
valid copy on the new drive.


There are several decision points during this process:
- do writes get tried to the failing drive, or just the spare?
- do you mark the failing drive as failed after the good copy is created?

But I think most of the logic exists, the hardest part would be deciding 
what to do. The existing code looks as if it could be hooked to do this 
far more easily than writing new. In fact, several suggested recovery 
schemes involve stopping the RAID5, replacing the failing drive with a 
created RAID1, etc. So the method is valid, it would just be nice to 
have it happen without human intervention.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: system crash on AMD64 with 2.6.17.11 while accessing 3TB Software-RAID5

2006-09-04 Thread Bill Davidsen

Ralf Herrmann wrote:


Dear Mr.Brown,


Yes.. you are hitting some pretty serious BUGs.  And this is in
code that is not specific to RAID at all, so if there really were bugs
there, we would expect to have seen them well before now.



You are absolutely right, it doesn't seem to be in RAID at all,
but as of now, it only happened when doing something with /dev/md0.



It really looks to me like a hardware problem.  Somehow various bits
of memory sometimes have bad values and cause a problem.

How long did you run memtest?  I would suggest running it for at
least 24 hours, because my best guess is that it is bad memory, even
though your tests so far don't show that.



I ran it for about 16h, with all tests enabled, no error occured.

I was always wondering why it worked before the change and
not now. The only difference was the larger drives. And i've read so 
many reports

of people running much larger RAID5 partitions than we do,
so why should it fail in this case?

So my best bet at the moment, would be a hardware problem, too.
I continued looking at the kernel oops messages and sometimes
disassembly of the code where it broke gave invalid opcodes.
This also looks pretty much like a hardware issue.
But tests of single components did not reveal any error.

It seems to me, that it only happens, when many system components
are involved, several HDDs, the whole RAM, the NIC and so on.
That leads me to another idea i'm currently testing.

It could very well be a bad power supply.
Maybe this box was running at full load of the power supply before,
and now with the new drives consumes more power than the supply can deliver,
if all system components are used at once.
I switched to a better power supply, tests are running as i write this.

I'm sorry if i wasted your time, i should have checked this before
writing to the list. But power supply problems are pretty odd
and hard to identify. Anyways, i'm not sure if that solves the problem.

Ok, i'll write the results of current tests, when they are finished.

Thanks for your consideration. 


It certainly is a legitimate question, and marginal power would have 
been at the end of my list as well... However, if all else fails, try 
formatting the new drives to use only the size of the old drive capacity 
(RAID on small partitions) and see if that works. If so you may have 
found some rare size-related bug.
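If it comes to that, there is no need even to repartition; --create will 
take a --size to limit how much of each member is used. A sketch, with 
names and sizes purely illustrative:

  # use only ~250GiB of each new drive, roughly the old drive capacity
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --size=262144000 /dev/sd[bcde]1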


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid5 reads and cpu

2006-09-04 Thread Bill Davidsen

Rob Bray wrote:


This might be a dumb question, but what causes md to use a large amount of
cpu resources when reading a large amount of data from a raid5 array?
Examples are on a 2.4GHz AMD64, 2GB, 2.6.15.1 (I realize there are md
enhancements to later versions; I had some other unrelated issues and
rolled back to one I've run on for several months).

A given 7-disk raid0 array can read 450MB/s (using cat to /dev/null) and use
virtually no CPU resources. (Although cat and kswapd use quite a bit [60%]
munching on the data)

A raid5 array on the same drive set pulls in at 250MB/s, but md uses
roughly 50% of the CPU (the other 50% is spent dealing with the data,
saturating the processor).

A consistency check on the raid5 array uses roughly 3% of the cpu. It is
otherwise ~97% idle.
md11 : active raid5 sdi2[5] sdh2[4] sdf2[3] sde2[2] sdd2[1] sdc2[6] sdb2[0]
      248974848 blocks level 5, 256k chunk, algorithm 2 [7/7] [UUUUUUU]
      [==============>......]  resync = 72.2% (29976960/41495808)
finish=3.7min speed=51460K/sec
(~350MB/s aggregate throughput, 50MB/s on each device)

Just a friendly question as to why CPU utilization is significantly
different between a check and a real-world read on raid5? I feel like if
there was vm overhead getting the data into userland, the slowdown would
be present in raid0 as well. I assume parity calculations aren't done on a
read of the array, which leaves me at my question.
 


What are your stripe and cache sizes?
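By which I mean the readahead on the md device and the raid5 stripe cache, 
both easy to check and bump if your kernel exposes them (md11 as in your 
mdstat above; the numbers are only examples):

  blockdev --getra /dev/md11                          # current readahead, in 512-byte sectors
  blockdev --setra 8192 /dev/md11                     # try a larger readahead
  cat /sys/block/md11/md/stripe_cache_size            # default is 256
  echo 4096 > /sys/block/md11/md/stripe_cache_size    # bigger stripe cache for raid5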

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interesting RAID checking observations

2006-09-04 Thread Bill Davidsen

[EMAIL PROTECTED] wrote:


I don't think the processor is saturating.  I've seen reports of this
sort of thing before and until recently had no idea what was happening,
couldn't reproduce it, and couldn't think of any more useful data to
collect.
   



Well I can reproduce it easily enough.  It's a production server, but
I can do low-risk experiments after hours.

I'd like to note that the symptoms include not even being
able to *type* at the console, which I thought was all in-kernel
code, not subject to being swapped out.  But whatever.

Really? Or is it just that you can type but the characters don't get 
echoed? The typing part is in the kernel, but the display involves X 
unless you run a direct console.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux: Why software RAID?

2006-09-04 Thread Bill Davidsen

Gordon Henderson wrote:


On Thu, 24 Aug 2006, Adam Kropelin wrote:

 


Generally speaking the channels on onboard ATA are independent with any
vaguely modern card.
 


Ahh, I did not know that. Does this apply to master/slave connections on
the same PATA cable as well? I know zero about PATA, but I assumed from
the terminology that master and slave needed to cooperate rather closely.
   



I don't know much about co-operation between master & slave, but I do know
that a failing PATA IDE drive can take out the other one on the same bus -
or in my case, render it unusable until I removed the dead drive,
whereupon (to my relief) it sprang back into life.

This was many many moons ago before I started to use s/w RAID, but it's
one thing that would kill a multi-disk array, so I've never done it since.

I guess the same could happen on SCSI, but I suspect the interface is a
little better designed...

Until recently I was working with 38 systems using SCSI RAID controllers 
(IBM ServeRAID Ultra320). With several types of SCSI drives I saw 
failures where one drive failed, hung the bus, and caused the next 
command to another drive to fail. At that point I had to force the 
controller to think the second drive that failed was okay, and then it would 
recover. I'm told this happens with other hardware; I just haven't 
personally seen it.


From that standpoint, the SATA on the MB looks pretty good!

--

bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID over Firewire

2006-09-04 Thread Bill Davidsen

Richard Scobie wrote:

Has anyone had any experience or comment regarding linux RAID over 
ieee1394?


As a budget backup solution, I am considering using a pair of 500GB 
drives, each connected to a firewire 400 port, configured as a linear 
array, to which the contents of an onboard array will be rsynced weekly.


In theory, throughput performance should not be an issue, but it would 
be great to hear from someone who has done this.


It should work, but I don't like it... it leaves you with a lot of 
exposure between backups.


Unless your data change a lot, you might consider a good incremental 
dump program to DVD or similar.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 producing fake partition table on single drive

2006-09-04 Thread Bill Davidsen

Doug Ledford wrote:


On Mon, 2006-08-21 at 17:35 +1000, Neil Brown wrote:

 


Buffer I/O error on device sde3, logical block 1793
 


This, on the other hand, might be a problem - though possibly only a
small one.
Who is trying to access sde3 I wonder.  I'm fairly sure the kernel
wouldn't do that directly.
   



It's the mount program collecting possible LABEL= data on the partitions
listed in /proc/partitions; sde3 is outside the valid range for
the drive.

 

May I belatedly say that this is sort-of a kernel issue, since 
/proc/partitions reflects invalid data? Perhaps a boot option like 
nopart=sda,sdb or similar would be in order?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Can you IMAGE Mirrored OS Drives?

2006-09-04 Thread Bill Davidsen
Alternatives would be to have two such backup devices and configure 
them as andy liebman wrote:




I may not have been clear what I was asking. I wanted to know if you 
can make DISK IMAGES -- for example, with a program like Norton Ghost 
or Acronis True Image (better) -- of EACH of the two OS drives from a 
mirrored pair. Then restore Image A to one new disk, Image B to 
another disk. And then have a new working mirrored pair.


May I say belatedly (I've been flat out since July 1) that if I were 
making a significant number of these clones, I'd write a script so that 
I could clone one drive, drop it in another machine, and let the script 
run on the other machine to finish the job. I have no idea how many of 
these you are doing, but automation is nice to avoid finger checks.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 3ware glitches cause softraid rebuilds

2006-09-04 Thread Bill Davidsen

adam radford wrote:


Jim,

Can you try the attached (and below) patch for 2.6.17.11?



Don't you want the sleep BEFORE setting the new value? I.e., waiting 
for the status to change before checking it again?




Also, please make sure you are running the latest firmware.

Thanks,

-Adam

diff -Naur linux-2.6.17.11/drivers/scsi/3w-9xxx.c linux-2.6.17.12/drivers/scsi/3w-9xxx.c
--- linux-2.6.17.11/drivers/scsi/3w-9xxx.c    2006-08-23 14:16:33.000000000 -0700
+++ linux-2.6.17.12/drivers/scsi/3w-9xxx.c    2006-08-28 17:48:29.000000000 -0700
@@ -943,6 +943,7 @@
 	before = jiffies;
 	while ((response_que_value & TW_9550SX_DRAIN_COMPLETED) != TW_9550SX_DRAIN_COMPLETED) {
 		response_que_value = readl(TW_RESPONSE_QUEUE_REG_ADDR_LARGE(tw_dev));
+		msleep(1);
 		if (time_after(jiffies, before + HZ * 30))
 			goto out;
 	}







--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux: Why software RAID?

2006-09-04 Thread Bill Davidsen

Alan Cox wrote:


On Thu, 2006-08-24 at 07:31 -0700, Marc Perkel wrote:
 


So - the bottom line answer to my question is that unless you are
running RAID5 and you have a high-powered RAID card with cache and
battery backup, there is no significant speed increase from using
hardware RAID. For RAID0 there is no advantage.

   


If your raid is entirely on PCI plug in cards and you are doing RAID1
there is a speed up using hardware assisted raid because of the PCI bus
contention.



I would expect to see this with RAID5 as well, for the same reason...

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID over Firewire

2006-09-05 Thread Bill Davidsen

Richard Scobie wrote:


Bill Davidsen wrote:



It should work, but I don't like it... it leaves you with a lot of 
exposure between backups.


Unless your data change a lot, you might consider a good incremental 
dump program to DVD or similar.



Thanks. I have abandoned this option for various reasons, including 
people randomly unplugging the drives.


Rsync to another machine is the current plan. 


At one time I was evaluating doing RAID1 to an NBD on another machine, 
using write-mostly to make it a one-way process. I had to redeploy the 
hardware before I reached a conclusion, and it was with an older kernel, 
so I simply throw it out for discussion.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raidhotadd works, mdadm --add doesn't

2006-09-14 Thread Bill Davidsen




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: system crash on AMD64 with 2.6.17.11 while accessing 3TB Software-RAID5

2006-09-14 Thread Bill Davidsen

Ralf Herrmann wrote:


Dear Mr. Davidsen and Mr. Brown,

It certainly is a legitimate question, and marginal power would have 
been at the end of my list as well... However, if all else fails, try 
formatting the new drives to use only the size of the old drive 
capacity (RAID on small partitions) and see if that works. If so you 
may have found some rare size-related bug.



Seems as if the new power supply did the trick. The box has been
running smoothly for about two days now.
I'm currently not in the office, but since I didn't get emergency
calls from my colleagues I assume it still works.

Thanks again for your time, 



Thanks for letting us know what it was, even if it was not the first 
thing we suggested. ;-)


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 producing fake partition table on single drive

2006-09-14 Thread Bill Davidsen

Lem wrote:


On Mon, 2006-09-04 at 13:55 -0400, Bill Davidsen wrote:

 

May I belatedly say that this is sort-of a kernel issue, since 
/proc/partitions reflects invalid data? Perhaps a boot option like 
nopart=sda,sdb or similar would be in order?
   



Is this an argument to be passed to the kernel at boot time? It didn't
work for me.



My suggestion was to Neil or other kernel maintainers. If they agree 
that this is worth fixing, the option could be added to the kernel. It 
isn't there now; I was soliciting responses on whether this was desirable.


Unfortunately I see no way to keep data in the partition table 
location, which looks like a partition table, from being used.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 fill up?

2006-09-16 Thread Bill Davidsen

Mr. James W. Laferriere wrote:

   
Kuca, thank you for posting this snippet.

Neil, might changing

    can be given as max which means to choose the largest size that

to

    can be given as 'max' which means to choose the largest size that

help those reading this be aware that this is a 'string' to add to the
end of --size= ?  Also, if there are other keywords not quoted ('') this
might be a good opportunity.  ;-)  Tia, JimL

Definitely a good idea! If I hadn't seen an example posted using that 
feature, I would have assumed that it just meant someone was too lazy to 
type 'maximum' at the time.
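
For anyone searching the archive later, the usage being discussed looks roughly like this (a sketch; the device name is a placeholder):

  mdadm --grow /dev/md0 --size=max

i.e. max really is a literal keyword handed to --size, not a number.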


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: proactive-raid-disk-replacement

2006-09-16 Thread Bill Davidsen

Tuomas Leikola wrote:


On 9/10/06, Bodo Thiesen [EMAIL PROTECTED] wrote:

So we need a way to feed back the redundancy from the raid5 to the 
raid1.


snip long explanation

Sounds awfully complicated to me. Perhaps this is how it internally
works, but my 2 cents go to the option to gracefully remove a device
(migrating to a spare without losing redundancy) in the kernel (or
mdadm).

I'm thinking

mdadm /dev/raid-device -a /dev/new-disk
mdadm /dev/raid-device --graceful-remove /dev/failing-disk

also hopefully a path to do this instead of kicking (multiple) disks
when bad blocks occur. 



Actually, an internal implementation is really needed if this is to be 
generally useful to a non-guru. And it has other possible uses as well. 
If there were just a --migrate command:

 mdadm --migrate /dev/md0 /dev/sda /dev/sdf

as an example for discussion, the whole process of not only moving the 
data but also getting recovered information from the RAID array could be 
done by software which does the right thing: creating superblocks, copying 
the UUID, etc. And as a last step it could invalidate the superblock on the 
failing drive (so reboots would work right) and leave the array running 
on the new drive.


But wait, there's more! Assume that I want to upgrade from a set of 
250GB drives to 400GB drives. Using this feature I could replace one drive 
at a time, then --grow the array. The process for doing that is currently 
complex, and the many manual steps invite errors.
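
For contrast, a rough sketch of what the manual process looks like today, with placeholder device names (and note the array runs degraded during every rebuild):

  mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
  # physically swap in the larger drive, partition it, then:
  mdadm /dev/md0 --add /dev/sda1        # wait for the rebuild to finish
  # repeat for each member, then claim the new space:
  mdadm --grow /dev/md0 --size=max

A --migrate command would collapse the first three steps and never drop redundancy.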


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Correct way to create multiple RAID volumes with hot-spare?

2006-09-16 Thread Bill Davidsen

Steve Cousins wrote:




Ruth Ivimey-Cook wrote:


Steve,

The recent Messed up creating new array... thread has someone who 
started by using the whole drives but she now wants to use 
partitions because the array is not starting automatically on boot 
(I think that was the symptom).  I'm guessing this is because there 
is no partition ID of fd since there isn't even a partition.




Yes, that's right.



Thanks Ruth.


Neil (or others), what is the recommended way to have the array start 
up if you use whole drives instead of partitions?  Do you put mdadm -A 
etc. in rc.local? 



I think you want it earlier than that, unless you want to do the whole 
mounting process by hand. It's distribution dependent, but doing it 
early allows the array to be handled like any other block device.
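
As a minimal sketch of doing it early (exact placement is distribution dependent), record the arrays in mdadm.conf and assemble from an init script before local filesystems are mounted:

  mdadm --detail --scan >> /etc/mdadm.conf
  # then, early in the boot sequence:
  mdadm --assemble --scan

The kernel's fd-partition autodetect only helps for partitioned members, so whole-drive arrays really do need something like this.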


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ATA cables and drives

2006-09-16 Thread Bill Davidsen

Molle Bestefich wrote:


I'm looking for new harddrives.

This is my experience so far.


SATA cables:
=

I have zero good experiences with any SATA cables.
They've all been crap so far.


3.5 ATA harddrives buyable where I live:
==

(All drives are 7200rpm, for some reason.)



Unless you live where delivery services don't go, you can get 10k SATA 
or 15k Ultra320 drives from many vendors. I checked newegg just as a 
reference; there are many others.

The rest of your questions sound like you are running the drives very 
hot, and drive life is inversely proportional to temperature. There are 
mil-spec drives which will live at 100C, but they are not readily 
available and cost way more than keeping drives cool. I have no idea 
what kind of SATA cable failure you are seeing; I have some in machines 
I take to give presentations, and if 10k miles in the back of an SUV 
didn't cause problems, I doubt any normal operation would. I've had them 
arrive bad, but if they work once they keep working, in my 
experience.





I've tried Maxtor and IBM (now Hitachi) harddrives.
Both makes have failed on me, but most of the time due to horrible 
packaging.


I don't care a split-second whether one kind is marginally faster than
the other, so all the reviews on AnandTech etc. are utterly useless to
me.  There's an infinite number of more effective ways to get better
performance than to buy a slightly faster harddrive.

I DO care about quality, namely:
* How often the drives has catastrophic failure,
* How they handle heat (dissipation  acceptance - how hot before it 
fails?),

* How big the spare area is,
* How often they have single-sector failures,
* How long the manufacturer warranty lasts,
* How easy the manufacturer is to work with wrt. warranty.

I haven't been able to figure the spare area size, heat properties,
etc. for any drives.
Thus my only criteria so far has been manufacturer warranty: How much
bitching do I get when I tell them my drive doesn't work.

My main experience is with Maxtor.
Maxtor has been none less than superb wrt. warranty!
Download an ISO with a diag tool, burn the CD, boot the CD, type in
the fault code it prints on Maxtor's site, and a day or two later
you've got a new drive in the mail and packaging to ship the old one
back in.  If something odd happens, call them up and they're extremely
helpful.

Unfortunately, I lack thorough experience with the other brands.


Questions:
===

A.) Does anyone have experience with returning Hitachi, Seagate or WD
drives to the manufacturer?
   Do they have manufacturer warranty at all?
   How much/little trouble did you have with Hitachi, Seagate or WD?

B.) Can anyone *prove* (to a reasonable degree) that drives from
manufacturer H, M, S or WD is of better quality?
   Has anyone seen a review that heat/shock/stress test drives?

C.) Do good SATA cables exist?
   Eg. cables that lock on to the drives, or backplanes which lock
the entire disk in place?


Thanks for reading, and thanks in advance for answers (if any) :-).
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slackware and RAID

2006-09-16 Thread Bill Davidsen

Dexter Filmore wrote:


Is anyone here who runs a soft raid on Slackware?
Out of the box there are no raid scripts, the ones I made myself seem a little 
rawish, barely more than mdadm --assemble/--stop.


 

I'm pretty much off Slack now, but I have run it. The scripts you describe 
are about 2/3 of what you need; see the thread(s) here about monitoring. 
mdadm doesn't need a lot of direction...
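
By monitoring I mean something along these lines (a sketch; flags per mdadm 2.x, adjust the address and delay to taste):

  mdadm --monitor --scan --daemonise --mail=root --delay=300

started from the same rc script that assembles the arrays.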


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slackware and RAID

2006-09-18 Thread Bill Davidsen

Dexter Filmore wrote:


Am Samstag, 16. September 2006 19:26 schrieb Bill Davidsen:
 


Dexter Filmore wrote:
   


Is anyone here who runs a soft raid on Slackware?
Out of the box there are no raid scripts, the ones I made myself seem a
little rawish, barely more than mdadm --assemble/--stop.
 


I'm pretty much off Slack now, but I have run, the scripts you describe
are about 2/3 of what you need, see the thread(s) here about monitoring.
mdadm doesn't need a lot of direction...
   



What's the remaining third?
 


Monitoring... where I pointed you.

I fumbled it into rc.S and rc.6. The reason I ask is that the array has degraded about 
6 times in the few months I've run it and I can't figure out why. The only thing I know 
is that it degrades somewhere in the reboot process, so I suspect it might 
not shut down properly.




Since I haven't had problems I'll pass on trying to guess what's 
happening. When I have any problem, usually a whole drive hits the floor, 
and I know what to fix and how. I assume you look at mdstat after boot? 
If it's clean before you shut down and dirty on boot, something isn't 
shutting down cleanly, OR the autodetect stuff isn't working as you want 
it to. I do NOT run LVM; I avoid stuff like that unless I really need it.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new features time-line

2006-10-17 Thread Bill Davidsen

Neil Brown wrote:


On Friday October 13, [EMAIL PROTECTED] wrote:
 


I am curious if there are plans for either of the following;
-RAID6 reshape
-RAID5 to RAID6 migration
   



No concrete plans with timelines and milestones and such, no.
I would like to implement both of these but I really don't know when I
will find/make time.  Probably by the end of 2007, but that is not a
promise.

We talked about RAID5E a while ago; is there any thought that this would 
actually happen, or is it one of the would-be-nice features? With 
larger drives I suspect the number of drives in arrays is going down, 
and anything which offers performance benefits for smaller arrays would 
be useful.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why partition arrays?

2006-10-24 Thread Bill Davidsen

Henrik Holst wrote:


Bodo Thiesen wrote:
 


Ken Walker [EMAIL PROTECTED] wrote:

   


Is LVM stable, or can it cause more problems than separate raids on an array?
 



[description of street smart raid setup]

(The same function could probably be achieved with logical partitions
and ordinary software raid levels.)

 


So, now decide for your own, if you consider LVM stable - I would ;)

Regards, Bodo
   



Have you lost any discs (i.e. physical volumes) since February? Or lost
the metadata?

I would not recommend LVM to anyone who is less than an expert on
Linux systems. Setting up an LVM system is easy; administering and
salvaging the same was much more work. (I used it ~3 years ago.)

My read on LVM is that (a) it's one more thing for the admin to learn, 
(b) because it's seldom used the admin will be working from 
documentation if it has a problem, and (c) there is no bug-free 
software, therefore the use of LVM on top of RAID will be less reliable 
than a RAID-only solution. I can't quantify that, the net effect may be 
too small to measure. However, the cost and chance of a finger check 
from (a) and (b) are significant.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bug with RAID1 hot spares?

2006-10-24 Thread Bill Davidsen

Chase Venters wrote:


Greetings,
	I was just testing a server I was about to send into production on kernel 
2.6.18.1. The server has three SCSI disks with md1 set to a RAID1 with 2 
mirrors and 1 spare.


I have to ask, why? If the array is mostly written you might save a bit 
of bus time, but for reads, having another copy of the data to read 
(usually) helps performance by reducing waits for reads.
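
In other words, if the third disk is going to spin anyway, I'd be tempted to make it a third active mirror rather than a spare; roughly (placeholder partitions):

  mdadm --create /dev/md1 --level=1 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2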


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: future hardware

2006-10-27 Thread Bill Davidsen

Dan wrote:


I have been using an older 64bit system, socket 754 for a while now.  It has
the old PCI bus 33Mhz.  I have two low cost (no HW RAID) PCI SATA I cards
each with 4 ports to give me an eight disk RAID 6.  I also have a Gig NIC,
on the PCI bus.  I have Gig switches with clients connecting to it at Gig
speed.

As many know you get a peak transfer rate of 133 MB/s or 1064Mb/s from that
PCI bus http://en.wikipedia.org/wiki/Peripheral_Component_Interconnect

The transfer rate is not bad across the network but my bottle neck it the
PCI bus.  I have been shopping around for new MB and PCI-express cards.  I
have been using mdadm for a long time and would like to stay with it.  I am
having trouble finding an eight port PCI-express card that does not have all
the fancy HW RAID which jacks up the cost.  I am now considering using a MB
with eight SATA II slots onboard.  GIGABYTE GA-M59SLI-S5 Socket AM2 NVIDIA
nForce 590 SLI MCP ATX.

What are other users of mdadm using with the PCI-express cards, most cost
effective solution?

There may still be m/b available with multiple PCI busses. Don't know if 
you are interested in a low budget solution, but that would address 
bandwidth and use existing hardware.


Idle curiosity: what kind of case are you using for the drives? I will 
need to spec a machine with eight drives in the December-January timeframe.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Question of PCI bandwidth affect on SW RAID arrays.

2006-10-27 Thread Bill Davidsen

Justin Piszcz wrote:


Quick question,

On older systems, regular PCI motherboards, no PCI-e-- which is faster?

1) One 4 port SATA150 card in 1 PCI slot which has 4 drives connected.
2) Four SATA150 cards in 4 PCI slots, which each have 1 drive connected?

I'd assume 2 would stress the PCI bus more and thus would be slower, but I 
am curious what other people know/think/etc?


You are sending the same amount of data over the bus in either case, 
assuming you are using s/w RAID. This is where (real) hardware RAID has 
an advantage in performance: the parity doesn't go over the bus. If you 
have a server board it may have multiple PCI busses, and therefore a 
higher max bandwidth with multiple cards. I would still expect PCI-e to 
be faster.
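
The back-of-envelope arithmetic, assuming ~50 MB/s per modern drive:

  echo $((4 * 50))   # 200 MB/s of drive traffic vs ~133 MB/s on a shared 32-bit/33MHz PCI bus

so the shared bus saturates in either layout.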


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New features?

2006-10-31 Thread Bill Davidsen

John Rowe wrote:


All this discussion has led me to wonder if we users of linux RAID have
a clear consensus of what our priorities are, ie what are the things we
really want to see soon as opposed to the many things that would be nice
but not worth delaying the important things for. FWIW, here are mine, in
order although the first two are roughly equal priority.

1 Warm swap - replacing drives without taking down the array but maybe
having to type in a few commands. Presumably a sata or sata/raid
interface issue. (True hot swap is nice but not worth delaying warm-
swap.)
 

That seems to work now. It does assume that you have hardware hot swap 
capability.



2 Adding new disks to arrays. Allows incremental upgrades and to take
advantage of the hard disk equivalent of Moore's law.
 


Also seems to work.
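
A sketch of what that looks like today with a recent kernel and mdadm 2.x (placeholder names; the backup file covers the critical section during the reshape):

  mdadm /dev/md0 --add /dev/sdf1
  mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow.bak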


3. RAID level conversion (1 to 5, 5 to 6, with single-disk to RAID 1 a
lower priority).
 

Single to RAID-N is possible, but involves a good bit of magic with 
leaving room for superblocks, etc.
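
The usual (careful!) recipe, sketched with placeholder names and assuming you have a second disk to start from:

  mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt && cp -ax / /mnt     # copy the existing data over
  mdadm /dev/md0 --add /dev/sda1           # only after booting from the new md0

Leaving room for the superblock on the original disk is exactly the part that bites people.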



4. Uneven disk sizes, eg adding a 400GB disk to a 2x200GB mirror to
create a 400GB mirror. Together with 2 and 3, allows me to continuously
expand a disk array.
 


???

--

bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new array not starting

2006-11-08 Thread Bill Davidsen

Robin Bowes wrote:


Robin Bowes wrote:
 


Robin Bowes wrote:
   


This worked:

# mdadm --assemble --auto=yes  /dev/md2 /dev/sdc /dev/sdd /dev/sde
/dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
mdadm: /dev/md2 has been started with 8 drives.

However, I'm not sure why it didn't start automatically at boot. Do I
need to put it in /etc/mdadm.conf for it to start automatically? I
thought md started all arrays it found at startup?
 


OK, I put /dev/md2 in /etc/mdadm.conf and it didn't make any difference.

This is mdadm.conf (uuids are on same line as ARRAY):

DEVICE partitions
ARRAY /dev/md1 level=raid1 num-devices=2
uuid=300c1309:53d26470:64ac883f:2e3de671
ARRAY /dev/md0 level=raid1 num-devices=2
uuid=89649359:d89365a6:0192407d:e0e399a3
ARRAY /dev/md2 level=raid6 num-devices=8
UUID=68c2ea69:a30c3cb0:9af9f0b8:1300276b

I saw an error fly by as the server was booting saying /dev/md2 not found.

Do I need to create this device manually?
   



Well, at the risk of having a complete conversation with myself, I've
created partitions of type fd on each disk and re-created the array
out of the partitions instead of the whole disk.

mdadm --create /dev/md2 --auto=yes --raid-devices=8 --level=6 /dev/sdc1
/dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1

I'm hoping this will enable the array to be auto-detected and started at
boot.
 

I'm guessing that whole devices don't get scanned when DEVICE partitions is 
used. There was a fix for incorrect partition tables being used on whole 
drives, and perhaps that makes the whole device get ignored, or perhaps 
it never worked. Perhaps there's an interaction with LVM; the more 
complex you make your setup, the greater the chance for learning experiences.
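
One quick sanity check worth running is to see whether a scan finds the whole-disk members at all:

  mdadm --examine --scan

If the array shows up there but not at boot, the problem is in the init/autodetect path rather than the superblocks.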


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is my RAID broken?

2006-11-08 Thread Bill Davidsen

Neil Brown wrote:


On Sunday November 5, [EMAIL PROTECTED] wrote:
 


If its resyncing that means it detected an error, right?
   



Not a disk error.  'resyncing' means that at startup it looked like
the array hadn't been shutdown properly so it is making sure that all
the redundancy in the array is consistent.

So it looks like your machine recently crashed (power failure?) and
it is restarting.

There is always the possibility that shutdown scripts don't do the right 
thing, as well. I believe one of the major distros showed this problem 
within the last few months, depending on the RAID options used.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 array showing as degraded after motherboard replacement

2006-11-08 Thread Bill Davidsen

James Lee wrote:


Hi there,

I'm running a 5-drive software RAID5 array across two controllers.
The motherboard in that PC recently died - I sent the board back for
RMA.  When I refitted the motherboard, connected up all the drives,
and booted up I found that the array was being reported as degraded
(though all the data on it is intact).  I have 4 drives on the on
board controller and 1 drive on an XFX Revo 64 SATA controller card.
The drive which is being reported as not being in the array is the one
connected to the XFX controller.

The OS can see that drive fine, and mdadm --examine on that drive
shows that it is part of the array and that there are 5 active devices
in the array.  Doing mdadm --examine on one of the other four drives
shows that the array has 4 active drives and one failed.  mdadm
--detail for the array also shows 4 active and one failed.

Now I haven't lost any data here and I know I can just force a resync
of the array which is fine.  However I'm concerned about how this has
happened.  One worry is that the XFX SATA controller is doing
something funny to the drive.  I've noticed that its BIOS has
defaulted to RAID0 mode (even though there's only one drive on it) - I
can't see how this would cause any particular problems here though.  I
guess it's possible that some data on the drive got corrupted when the
motherboard failed... 


I notice in your later post that the driver thinks this is a JBOD setup; 
can you either tell the controller to do JBOD or force the driver to 
consider this a RAID0 single-disk setup? I don't know what RAID0 on one 
drive means, but I suspect that having the controller in the mode you 
want is desirable. That might have been changed by the hardware failure.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Recovering from default FC6 install

2006-11-11 Thread Bill Davidsen
I tried something new on a test system, using the install partitioning 
tools to partition the disk. I had three drives and went with RAID-1 for 
boot, and RAID-5+LVM for the rest. After the install was complete I 
noted that it was solid busy on the drives, and found that the base RAID 
appears to have been created (a) with no superblock and (b) with no 
bitmap. That last is an issue; as a test system it WILL be getting hung 
and rebooted, and recovering the 1.5TB took hours.


Is there an easy way to recover this? The LVM dropped on it has a lot of 
partitions, and there is a lot of data in them after several hours of 
feeding with GigE, so I can't readily back up and recreate by hand.


Suggestions?

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recovering from default FC6 install

2006-11-14 Thread Bill Davidsen

Doug Ledford wrote:


On Sun, 2006-11-12 at 01:00 -0500, Bill Davidsen wrote:
 

I tried something new on a test system, using the install partitioning 
tools to partition the disk. I had three drives and went with RAID-1 for 
boot, and RAID-5+LVM for the rest. After the install was complete I 
noted that it was solid busy on the drives, and found that the base RAID 
appears to have been created (a) with no superblock and (b) with no 
bitmap. That last is an issue, as a test system it WILL be getting hung 
and rebooted, and recovering the 1.5TB took hours.


Is there an easy way to recover this? The LVM dropped on it has a lot of 
partitions, and there is a lot of data in them after several hours of 
feeding with GigE, so I can't readily back up and recreate by hand.


Suggestions?
   



First, the Fedora installer *always* creates persistent arrays, so I'm
not sure what is making you say it didn't, but they should be
persistent.
 

I got the detail on the md device, then -E on the components, and got a 
no superblock found message, which made me think it wasn't there. 
Given that, I didn't have much hope for the part which starts by assuming 
that they are persistent, but I do thank you for the information; I'm 
sure it will be useful.


I did try recreating, from the running FC6 rather than the rescue, since 
the large data was on its own RAID and I could umount the f/s and stop 
the array. Alas, I think a grow is needed somewhere; after 
configuration, start, and mount of the f/s on RAID-5, e2fsck told me my 
data was toast. The shortest time to solution was to recreate the f/s and 
reload the data.


The RAID-1 stuff is small, a total rebuild is acceptable in the case of 
a failure.


FC install suggestion: more optional control over the RAID features 
during creation. Maybe there's an advanced features button in the 
install and I just missed it, but there should be, since the non-average 
user might be able to do useful things with the chunk size, and specify 
a bitmap. I would think that a bitmap would be the default on large 
arrays, assuming that 1TB is still large for the moment.
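
As a sketch of the sort of thing I mean, the command-line equivalent (placeholder partitions) is a one-liner:

  mdadm --create /dev/md2 --level=5 --raid-devices=3 --chunk=256 --bitmap=internal /dev/sda3 /dev/sdb3 /dev/sdc3

so exposing chunk size and bitmap in the installer shouldn't be a big ask.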


Instructions and attachments saved for future use, trimmed here.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md manpage of mdadm 2.5.6

2006-11-19 Thread Bill Davidsen

Joachim Wagner wrote:

Hi Neil,

In man -l mdadm-2.5.6/md.4 I read

Firstly, after an unclear shutdown, the resync process will consult the 
bitmap and only resync those blocks that correspond to bits in the bitmap 
that are set. This can dramatically increase resync time.


IMHO, increase should be changed to decrease or time to speed. 
  


Probably better to say unclean rather than unclear shutdown as well.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm RAID5 Grow

2006-11-28 Thread Bill Davidsen

mickg wrote:

Neil Brown wrote:

On Thursday October 5, [EMAIL PROTECTED] wrote:

Neil Brown wrote:

On Wednesday October 4, [EMAIL PROTECTED] wrote:
I have been trying to run: mdadm --grow /dev/md0 --raid-devices=6 
--backup-file /backup_raid_grow

I get:
mdadm: Need to backup 1280K of critical section..
mdadm: /dev/md0: Cannot get array details from sysfs
It shouldn't do that. Can you run
  strace -o /tmp/trace -s 300 mdadm --grow ...

...
open("/sys/block/md0/md/component_size", O_RDONLY) = -1 ENOENT (No 
such file or directory)


So it couldn't open .../component_size.  That was added prior to the
release of 2.6.16, and you are running 2.6.17.13 so the kernel
certainly supports it.  Most likely explanation is that /sys isn't 
mounted.

Do you have a /sys?
Is it mounted?
Can you ls -l /sys/block/md0/md ??

Maybe you need to
  mkdir /sys
  mount -t sysfs sysfs /sys

and try again.


Worked like a charm!

Thank you!

There is a
  sysfs   /sys   sysfs   noauto   0 0
line in /etc/fstab.
I am assuming noauto is the culprit?

Should it be made to automount?

mickg 
I will belatedly add that while experience shows /proc and /sys are 
optional (and can in theory be mounted in other places), in practice a lot 
of software depends on them being present and in the usual place.
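
For reference, the usual entries look something like this; dropping noauto gets them mounted at every boot:

  proc    /proc    proc     defaults   0 0
  sysfs   /sys     sysfs    defaults   0 0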


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Odd (slow) RAID performance

2006-11-30 Thread Bill Davidsen

Pardon if you see this twice, I sent it last night and it never showed up...

I was seeing some bad disk performance on a new install of Fedora Core 
6, so I did some measurements of write speed, and it would appear that 
write performance is so slow it can't write my data as fast as it is 
generated :-(


The method: I wrote 2GB of data to various configurations with

sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=X; sync"

where X was a raw partition, raw RAID device, or ext2 filesystem 
over a RAID device. I recorded the time reported by dd, which doesn't 
include a final sync, and total time from start of write to end of sync, 
which I believe represents the true effective performance. All tests 
were run on a dedicated system, with the RAID devices or filesystem 
freshly created.


For a baseline, I wrote to a single drive, single raw partition, which 
gave about 50MB/s transfer. Then I created a RAID-0 device, striped over 
three test drives. As expected this gave a speed of about 147 MB/s. Then 
I created an ext2 filesystem over that device, and the test showed 139 
MB/s speed. This was as expected.


Then I stopped and deleted the RAID-0 and built a RAID-5 on the same 
partitions. A write to this raw RAID device showed only 37.5 MB/s!! 
Putting an ext2 f/s over that device dropped the speed to 35 MB/s. Since 
I am trying to write bursts at 60MB/s, this is a serious problem for me.


Then I recreated a new RAID-10 array on the same partitions. This showed 
a write speed of 75.8 MB/s, double the speed even though I was 
(presumably) writing twice the data. And an ext2 f/s on that array 
showed 74 MB/s write speed. I didn't use /proc/diskstats to gather 
actual counts, nor do I know if they show actual transfer data below all 
the levels of o/s magic, but that sounds as if RAID-5 is not working 
right. I don't have enough space to use RAID-10 for incoming data, so 
that's not an option.


Then I thought that perhaps my chunk size, which defaulted to 64k, was too 
small. So I created an array with a 256k chunk size. That showed about 36 
MB/s to the raw array, and 32.4 MB/s to an ext2 f/s using the array. 
Finally I decided to create a new f/s using the stride= option and 
see if that would work better. I had 256k chunks, two data and a parity 
per stripe, so I used the data size, 512k, for the calculation. The man page 
says to use the f/s block size, 4k in this case, for the calculation, so 
512/4 gave a stride size of 128, and I used that. The increase was below the 
noise, about 50KB/s faster.
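
Roughly what I ran, for reference (the option is spelled -R stride=N on older e2fsprogs, -E stride=N on newer releases):

  # chunk 256k / 4k blocks = 64; 128 comes from the full 512k data width,
  # which is the number actually used here
  mke2fs -b 4096 -R stride=128 /dev/md0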


Any thoughts on this gratefully accepted. I may try the motherboard RAID 
if I can't make this work, and it certainly explains why my swapping is 
so slow; that I can switch to RAID-1, since it's used mainly for test, big 
data sets, and suspend. If I can't make this fast I'd like to understand 
why it's slow.


I did make the raw results available at 
http://www.tmr.com/%7Edavidsen/RAID_speed.html if people 
want to see more info.


--
Bill Davidsen [EMAIL PROTECTED]
 We have more to fear from the bungling of the incompetent than from
the machinations of the wicked.  - from Slashdot

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Odd (slow) RAID performance

2006-11-30 Thread Bill Davidsen

Roger Lucas wrote:

-Original Message-
From: [EMAIL PROTECTED] [mailto:linux-raid-
[EMAIL PROTECTED] On Behalf Of Bill Davidsen
Sent: 30 November 2006 14:13
To: linux-raid@vger.kernel.org
Subject: Odd (slow) RAID performance

Pardon if you see this twice, I sent it last night and it never showed
up...

I was seeing some bad disk performance on a new install of Fedora Core
6, so I did some measurements of write speed, and it would appear that
write performance is so slow it can't write my data as fast as it is
generated :-(



What drive configuration are you using (SCSI / ATA / SATA), what chipset is
providing the disk interface and what cpu are you running with?
3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the 
ata-piix driver, with drive cache set to write-back. It's not obvious to 
me why that matters, but if it helps you see the problem I'm glad to 
provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on 
plain stripes, so I'm assuming that either the RAID-5 code is not 
working well or I haven't set it up optimally.
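
One knob worth checking on the RAID-5 side, assuming the sysfs attribute is exposed on this kernel (md device name as appropriate):

  cat /sys/block/md0/md/stripe_cache_size        # default is fairly small (256)
  echo 4096 > /sys/block/md0/md/stripe_cache_size

A bigger stripe cache often helps sequential writes on raid5, at the cost of some memory.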


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Odd (slow) RAID performance

2006-11-30 Thread Bill Davidsen

Roger Lucas wrote:

What drive configuration are you using (SCSI / ATA / SATA), what chipset is
providing the disk interface and what cpu are you running with?
  

3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the
ata-piix driver, with drive cache set to write-back. It's not obvious to
me why that matters, but if it helps you see the problem I'm glad to
provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on
plain stripes, so I'm assuming that either the RAID-5 code is not
working well or I haven't set it up optimally.



If it had been ATA, and you had two drives as master+slave on the same
cable, then they would be fast individually but slow as a pair.

RAID-5 is higher overhead than RAID-0/RAID-1 so if your CPU was slow then
you would see some degradation from that too.

We have similar hardware here so I'll run some tests here and see what I
get...


Much appreciated. Since my last note I tried adding --bitmap=internal to 
the array. Boy, is that a write performance killer. I will have the chart 
updated in a minute, but write speed dropped to ~15MB/s with the bitmap. Since 
Fedora can't seem to shut the last array down cleanly, I get a rebuild 
on every boot :-( So the array for the LVM has the bitmap on, as I hate to 
rebuild 1.5TB regularly. I'll have to make some compromises on that!
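
One possible compromise, assuming mdadm 2.x behaves as the man page reads: drop the bitmap and re-add it with a much coarser chunk, which costs far less on writes:

  mdadm --grow /dev/md2 --bitmap=none
  mdadm --grow /dev/md2 --bitmap=internal --bitmap-chunk=65536   # KB, so 64MB per bit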


Thanks for looking!

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

