Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-06 Thread Nix
On 6 Dec 2007, Jan Engelhardt verbalised:
 On Dec 5 2007 19:29, Nix wrote:

 On Dec 1 2007 06:19, Justin Piszcz wrote:

 RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
 you use 1.x superblocks with LILO you can't boot)

 Says who? (Don't use LILO ;-)

Well, your kernels must be on a 0.90-superblocked RAID-0 or RAID-1
device. It can't handle booting off 1.x superblocks nor RAID-[56]
(not that I could really hope for the latter).

 If the superblock is at the end (which is the case for 0.90 and 1.0),
 then the offsets for a specific block on /dev/mdX match the ones for /dev/sda,
 so it should be easy to use lilo on 1.0 too, no?

Sure, but you may have to hack /sbin/lilo to convince it to create the
superblock there at all. It's likely to recognise that this is an md
device without a v0.90 superblock and refuse to continue. (But I haven't
tested it.)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
   describes USENET


Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-05 Thread Nix
On 1 Dec 2007, Jan Engelhardt uttered the following:


 On Dec 1 2007 06:19, Justin Piszcz wrote:

 RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
 you use 1.x superblocks with LILO you can't boot)

 Says who? (Don't use LILO ;-)

Well, your kernels must be on a 0.90-superblocked RAID-0 or RAID-1
device. It can't handle booting off 1.x superblocks nor RAID-[56]
(not that I could really hope for the latter).

But that's just /boot, not everything else.


 Not using ANY initramfs/initrd images, everything is compiled into 1 
 kernel image (makes things MUCH simpler and the expected device layout 
 etc is always the same, unlike initrd/etc).

 My expected device layout is also always the same, _with_ initrd. Why? 
 Simply because mdadm.conf is copied to the initrd, and mdadm will 
 use your defined order.

Of course the same is true of initramfs, which can give you the 1 kernel
image back, too. (It's also nicer in that you can autoassemble
e.g. LVM-on-RAID, or even LVM-on-RAID-over-nbd if you so desire.)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
   describes USENET


Re: md device naming question

2007-09-24 Thread Nix
On 19 Sep 2007, maximilian attems said:

 hello,

 working on initramfs i'd be curious to know what the /sys/block
 entry of a /dev/md/NN device is. have a user request to support
 it and no handy box using it.

 i presume it may also be /sys/block/mdNN ?

That's it, e.g. /sys/block/md0. Notable subdirectories include holders/
(block devices sitting on top of the array: more than one if e.g. LVM is
in use), slaves/ (block devices making up the array) and md/
(RAID-specific state).
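
For instance, on a box with LVM sitting on a two-disk mirror you might
see something like this (an illustrative sketch, not copied from a real
machine):

loki:~# ls /sys/block/md0/slaves
sda1  sdb1
loki:~# ls /sys/block/md0/holders
dm-0
loki:~# cat /sys/block/md0/md/level
raid1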

-- 
`Some people don't think performance issues are real bugs, and I think 
such people shouldn't be allowed to program.' --- Linus Torvalds


Re: Software based SATA RAID-5 expandable arrays?

2007-07-11 Thread Nix
On 11 Jul 2007, Michael stated:
 I am running Suse, and the check program is not available

`check' isn't a program. The line suggested has a typo: it should
be something like this:

30 2 * * Mon   echo check > /sys/block/md0/md/sync_action

The only program that line needs is `echo' and I'm sure you've got
that. (You also need to have sysfs mounted at /sys, but virtually
everyone has their systems set up like that nowadays.)

(obviously you can check more than one array: just stick in other lines
that echo `check' into some other mdN at some other time of day.)
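
For instance (a sketch only; pick whatever times and array names suit
you):

30 2 * * Mon   echo check > /sys/block/md0/md/sync_action
30 4 * * Mon   echo check > /sys/block/md1/md/sync_action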

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously


Re: limits on raid

2007-06-21 Thread Nix
On 21 Jun 2007, Neil Brown stated:
 I have that - apparently naive - idea that drives use strong checksum,
 and will never return bad data, only good data or an error.  If this
 isn't right, then it would really help to understand what the cause of
 other failures are before working out how to handle them

Look at the section `Disks and errors' in Val Henson's excellent report
on last year's filesystems workshop: http://lwn.net/Articles/190223/.
Most of the error modes given there lead to valid checksums and wrong
data...

(while you're there, read the first part too :) )

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously


Re: Software based SATA RAID-5 expandable arrays?

2007-06-19 Thread Nix
On 19 Jun 2007, Michael outgrape:
[regarding `welcome to my killfile']
 Grow up man, and I thanks for the threat.  I will take that into
 account if anything bad happens to my computer system.

Read http://en.wikipedia.org/wiki/Killfile and learn. All he's saying
is `I am automatically ignoring you'.


Re: below 10MB/s write on raid5

2007-06-13 Thread Nix
On 12 Jun 2007, Jon Nelson told this:
 On Mon, 11 Jun 2007, Nix wrote:

 On 11 Jun 2007, Justin Piszcz told this:
 loki:~# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null
 502400+0 records in
 502400+0 records out
 502400000 bytes (502 MB) copied, 16.2995 s, 30.8 MB/s
 
 loki:~# time dd if=/dev/raid/usr bs=1000 count=502400 of=/dev/null
 502400+0 records in
 502400+0 records out
 502400000 bytes (502 MB) copied, 18.6172 s, 27.0 MB/s

 And what is it like with 'iflag=direct' which I really feel you have to 
 use, otherwise you get caching.

I have little enough memory on this box that caching is really not
significant :)

With iflag=direct I get, um,

loki:/var/log# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null iflag=direct
dd: reading `/dev/md1': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0324791 s, 0.0 kB/s

real    0m0.085s
user    0m0.000s
sys     0m0.000s

so not exactly ideal.

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously


Re: below 10MB/s write on raid5

2007-06-11 Thread Nix
On 11 Jun 2007, Justin Piszcz told this:
 You can do a read test.

 10gb read test:

 dd if=/dev/md0 bs=1M count=10240 of=/dev/null

 What is the result?

 I've read that LVM can incur a 30-50% slowdown.

FWIW I see a much smaller penalty than that.

loki:~# lvs -o +devices
  LV   VG    Attr   LSize   Origin Snap%  Move Log Copy%  Devices
[...]
  usr  raid  -wi-ao   6.00G                                /dev/md1(50)

loki:~# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null
502400+0 records in
502400+0 records out
502400000 bytes (502 MB) copied, 16.2995 s, 30.8 MB/s

real    0m16.360s
user    0m0.310s
sys     0m11.780s

loki:~# time dd if=/dev/raid/usr bs=1000 count=502400 of=/dev/null
502400+0 records in
502400+0 records out
502400000 bytes (502 MB) copied, 18.6172 s, 27.0 MB/s

real    0m18.790s
user    0m0.380s
sys     0m14.750s


So there's a penalty, sure, accounted for mostly in sys time, but it's
only about 10%: small enough that I at least can ignore it in exchange
for the administrative convenience of LVM.

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously


Re: RAID SB 1.x autodetection

2007-05-30 Thread Nix
On 29 May 2007, Jan Engelhardt uttered the following:

 from your post at 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I 
 read that autodetecting arrays with a 1.x superblock is currently 
 impossible. Does it at least work to force the kernel to always assume a 
 1.x sb? There are some 'broken' distros out there that still don't use 
 mdadm in initramfs, and recreating the initramfs each time is a bit 
 cumbersome...

The kernel build system should be able to do that for you, shouldn't it?

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead
 of the competition, scoring a relatively respectable zero.' --- Peter Corlett


Re: RAID SB 1.x autodetection

2007-05-30 Thread Nix
On 30 May 2007, Bill Davidsen stated:

 Nix wrote:
 On 29 May 2007, Jan Engelhardt uttered the following:


 from your post at 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I read 
 that autodetecting arrays with a
 1.x superblock is currently impossible. Does it at least work to force the 
 kernel to always assume a 1.x sb? There are some
 'broken' distros out there that still don't use mdadm in initramfs, and 
 recreating the initramfs each time is a bit cumbersome...

 The kernel build system should be able to do that for you, shouldn't it?

 That would be an improvement, yes.

Allow me to rephrase: the kernel build system *can* do that for you ;)
that is, it can build a gzipped cpio archive from components located
anywhere on the filesystem or arbitrary source located under usr/.
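
For example, something like this in the kernel .config (a sketch; the
paths are made up):

CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE="/path/to/initramfs.list"

where initramfs.list uses the usr/gen_init_cpio syntax, along the lines
of:

dir   /dev 0755 0 0
nod   /dev/console 0600 0 0 c 5 1
dir   /bin 0755 0 0
file  /bin/busybox /path/to/busybox 0755 0 0
slink /bin/sh busybox 777 0 0
file  /init /path/to/init.sh 0755 0 0
file  /etc/mdadm.conf /etc/mdadm.conf 0644 0 0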

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead
 of the competition, scoring a relatively respectable zero.' --- Peter Corlett


Re: Recovery of software RAID5 using FC6 rescue?

2007-05-09 Thread Nix
On 8 May 2007, Michael Tokarev told this:
 BTW, for such recovery purposes, I use initrd (initramfs really, but
 does not matter) with a normal (but tiny) set of commands inside,
 thanks to busybox.  So everything can be done without any help from
 external recovery CD.  Very handy at times, especially since all
 the network drivers are here on the initramfs too, so I can even
 start a netcat server while in initramfs, and perform recovery from
 remote system... ;)

What you should probably do is drop into the shell that's being used to
run init if mount fails (or, more generally, if after mount runs it
hasn't ended up mounting anything: there's no need to rely on mount's
success/failure status). e.g. from my initramfs's init script (obviously
this is not runnable as is due to all the variables, but it should get
the idea across):

if [ -n "$root" ]; then
    /bin/mount -o $OPTS -t $TYPE $ROOT /new-root
fi

if /bin/mountpoint /new-root > /dev/null; then :; else
    echo "No root filesystem given to the kernel or found on the root RAID array."
    echo "Append the correct 'root=', 'root-type=', and/or 'root-options='"
    echo "boot options."
    echo
    echo "Dropping to a minimal shell.  Reboot with Ctrl-Alt-Delete."

    exec /bin/sh
fi


Re: Recovery of software RAID5 using FC6 rescue?

2007-05-09 Thread Nix
On 9 May 2007, Michael Tokarev spake thusly:
 Nix wrote:
 On 8 May 2007, Michael Tokarev told this:
 BTW, for such recovery purposes, I use initrd (initramfs really, but
 does not matter) with a normal (but tiny) set of commands inside,
 thanks to busybox.  So everything can be done without any help from
 external recovery CD.  Very handy at times, especially since all
 the network drivers are here on the initramfs too, so I can even
 start a netcat server while in initramfs, and perform recovery from
 remote system... ;)
 
 What you should probably do is drop into the shell that's being used to
 run init if mount fails (or, more generally, if after mount runs it

 That's exactly what my initscript does ;)

I thought so. I was really talking to Mark, I suppose.

 chk() {
   while ! $@; do
 warn "the following command failed:"
 warn "$*"
 p="** Continue(Ignore)/Shell/Retry (C/s/r)? "

Wow. Feature-rich :)) I may reuse this rather nifty stuff.

 hasn't ended up mounting anything: there's no need to rely on mount's
 success/failure status). [...]

 Well, so far exitcode has been reliable.

I guess I was being paranoid because I'm using busybox and at various
times the exitcodes of its internal commands have been... unimplemented
or unreliable.

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names


Re: raid6 array , part id 'fd' not assembling at boot .

2007-03-31 Thread Nix
On 19 Mar 2007, James W. Laferriere outgrabe:
   What I don't see is the reasoning behind the use of initrd .  It's a
   kernel ran to put the dev tree in order ,  start up devices ,... Just to
   start the kernel again ?

That's not what initrds do. No second kernel is started, and
constructing /dev is not one of the jobs of initrds in any case. (There
*is* something that runs a second kernel if the first one dies ---
google for `kexec crash dump' --- but it's entirely different in design
and intent from initrds, and isn't an early-boot thing but a kernel-
crash-reporting thing.)

There are three different ways to enter userspace for the first time.

 - You can boot with an initramfs. This is the recommended way and may
   eventually deprecate all the others. initramfses consist of gzipped
   cpio archives, either constructed by hand or built automatically
   by the kernel build system during the build process; and either
   linked into the kernel image or pointed at by the bootloader as if
   they were initrds (or both: the two images are automatically merged).
   These are extracted into the `rootfs', which is the (nonswappable)
   ramfs filesystem which is the root of the mount tree. A minimal rootfs
   (with nothing on it) is linked into the kernel if nothing else is.

   The executable /init in the initramfs, if such exists, is run to
   switch to userspace. You switch from the rootfs to the real root
   filesystem once you have mounted it, by erasing everything on the
   rootfs and `exec chroot'ing and/or `mount --move'ing it into place.
   (busybox contains a `switch_root' built-in command to do this: see
   the sketch after this list.)

   (I prefer directly linking an initramfs into the kernel, because the
   kernel image is still stand-alone then and you don't have to engage
   in messes involving tracking which initramfs archive is used by which
   kernel if you run multiple kernels.)

 - You can boot with an initrd, which is a compressed *filesystem image*
   loaded from an external file (which the kernel is pointed at by the
   bootloader). The kernel runs /linuxrc to switch to userspace, and
   userspace should use the `pivot_root' command to flip over to the real
   root filesystem. (There is an older way of switching roots involving
   echoing device numbers into a file under /proc. Ignore it, it's
   disgusting.)

   In both these cases it is the initramfs / initrd's responsibility to
   parse things like the root= and init= kernel command-line parameters
   (and any new ones that you choose to define).

   (This is a far older method than initramfs, which explains the
   apparent duplication of effort. initramfs arose largely out of
   dissatisfaction with the limitations of initrds.)

 - You can boot with neither. In this case the kernel mounts / for you,
   either from a local block device, from auto-assembled md arrays with
   v0.90 superblocks, or remotely off NFS. Because it doesn't fsck the
   root filesystem before mounting it, this is slightly risky compared
   to the other options (where your initramfs/initrd image can fsck
   before mounting as usual). (initramfs archives are safest of all here
   because the filesystem is newly constructed by the kernel at boot
   time, so it is *impossible* for it to be damaged.)

   This option is the one where the RAID auto-assembly kicks in, and the
   only one so inflexible that such is needed. H. Peter Anvin has an
   ongoing project to move everything this option does into a default
   initramfs, and remove this crud from the kernel entirely.

   When that happens, there'll be little excuse for assembling RAID
   arrays using the auto-assembler :)
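
As promised above, a rough sketch of the first option (untested as
written: the staging directory, paths and array name are illustrative,
and it assumes busybox and a static mdadm have already been copied into
the staging tree):

# build the archive from a staging directory
cd /path/to/staging
find . | cpio -o -H newc | gzip -9 > /boot/initramfs.img

# /path/to/staging/init, mode 0755:
#!/bin/sh
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mdev -s                                  # populate /dev from sysfs
mdadm --assemble --scan --auto=md --run  # assemble per /etc/mdadm.conf
mount -o ro /dev/md0 /new-root
exec switch_root /new-root /sbin/init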

   In otherwords I beleive that initrd's are essentially pointless .  But
   that's just my opinion .

It's wrong, sorry. Try mounting / on a RAID array atop LVM partially
scattered across the network via the network block device, for instance
(I was running like this for some time after some unfortunate disk
failures left me with too little storage on one critical machine to
store all the stuff it needed to run).

Hell, try mounting / on LVM at all. You need userspace to get LVM up
and running, so you *need* an initrd or initramfs to do that.

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names


Re: mdadm file system type check

2007-03-17 Thread Nix
On 17 Mar 2007, Chris Lindley told this:

 What I think the OP is getting at is that MDADM will create an array
 with partitions whose type is not set to FD (Linux Raid Auto), but are
 perhaps 83.

 The issue with that is that upon a reboot mdadm will not be able to
 start the array.

I think you mean that the Linux kernel's auto-assembly code won't be
able to start the array. mdadm doesn't care.

   If you use MDADM to manually reassemble the array then
 it will work fine. But until you reset the partition type to be FD, you
 will have to run this step every time you reboot the machine.

That's what initramfs/initrd is good at :)

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names


Re: PATA/SATA Disk Reliability paper

2007-02-22 Thread Nix
On 20 Feb 2007, Al Boldi outgrape:
 Eyal Lebedinsky wrote:
 Disks are sealed, and a dessicant is present in each to keep humidity
 down. If you ever open a disk drive (e.g. for the magnets, or the mirror
 quality platters, or for fun) then you can see the dessicant sachet.

 Actually, they aren't sealed 100%.  

I'd certainly hope not, unless you like the sound of imploding drives
when you carry one up a mountain.

 On wd's at least, there is a hole with a warning printed on its side:

   DO NOT COVER HOLE BELOW
   V   V  V  V

   o

I suspect that's for air-pressure equalization.

 In contrast, older models from the last century, don't have that hole.

It was my understanding that disks have had some way of equalizing
pressure with their surroundings for many years; but I haven't verified
this so you may well be right that this is a recent thing. (Anyone know
for sure?)

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names


Re: PATA/SATA Disk Reliability paper

2007-02-22 Thread Nix
On 22 Feb 2007, [EMAIL PROTECTED] uttered the following:

 On 20 Feb 2007, Al Boldi outgrape:
 Eyal Lebedinsky wrote:
 Disks are sealed, and a dessicant is present in each to keep humidity
 down. If you ever open a disk drive (e.g. for the magnets, or the mirror
 quality platters, or for fun) then you can see the dessicant sachet.

 Actually, they aren't sealed 100%.  

 I'd certainly hope not, unless you like the sound of imploding drives
 when you carry one up a mountain.

Or even exploding drives. (Oops.)

-- 
`In the future, company names will be a 32-character hex string.'
  --- Bruce Schneier on the shortage of company names


Re: Ooops on read-only raid5 while unmounting as xfs

2007-01-24 Thread Nix
On 23 Jan 2007, Neil Brown said:

 On Tuesday January 23, [EMAIL PROTECTED] wrote:
 
 My question is then : what prevents the upper layer to open the array
 read-write, submit a write and make the md code BUG_ON() ?

 The theory is that when you tell an md array to become read-only, it
 tells the block layer that it is read-only, and then if some process
 tries to open it read/write, it gets an array.

Um. Do you mean it gets an *error*? ;}

-- 
`The serial comma, however, is correct and proper, and abandoning it will
surely lead to chaos, anarchy, rioting in the streets, the Terrorists
taking over, and possibly the complete collapse of Human Civilization.'


Re: bad performance on RAID 5

2007-01-21 Thread Nix
On 18 Jan 2007, Bill Davidsen spake thusly:
 Steve Cousins wrote:
 time dd if=/dev/zero of=/mount-point/test.dat bs=1024k count=1024
 That doesn't give valid (repeatable) results due to caching issues. Go
 back to the thread I started on RAID-5 write, and see my results. More
 important, the way I got rid of the cache effects (beside an unloaded
 systems) was:
 sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=/mnt/point/file; sync"
 I empty the cache, then time the dd including the sync at the
 end. Results are far more repeatable.

Recent versions of dd have `oflag=direct' as well, to open the output
with O_DIRECT. (I'm not sure what the state of O_DIRECT on regular files
is though.)
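
e.g. something like (a sketch; adjust the size and path to taste):

dd if=/dev/zero of=/mnt/point/test.dat bs=1M count=2048 oflag=direct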

-- 
`The serial comma, however, is correct and proper, and abandoning it will
surely lead to chaos, anarchy, rioting in the streets, the Terrorists
taking over, and possibly the complete collapse of Human Civilization.'


Re: FailSpare event?

2007-01-15 Thread Nix
On 15 Jan 2007, Bill Davidsen told this:
 Nix wrote:
  Number   Major   Minor   RaidDevice State
     0       8        6        0      active sync   /dev/sda6
     1       8       22        1      active sync   /dev/sdb6
     3      22        5        2      active sync   /dev/hdc5

  Number   Major   Minor   RaidDevice State
     0       8       23        0      active sync   /dev/sdb7
     1       8        7        1      active sync   /dev/sda7
     3       3        5        2      active sync   /dev/hda5

 0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
 from `RaidDevice'? Why have both?)


 Did you ever move the data to these drives from another? I think this
 is what you see when you migrate by adding a drive as a spare, then
 mark an existing drive as failed, so the data is rebuilt on the new
 drive. Was there ever a device 2?

Nope. These arrays were created in one lump and never had a spare.

Plenty of pvmoves have happened on them, but that's *inside* the
arrays, of course...

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-15 Thread Nix
On 14 Jan 2007, Neil Brown told this:
 A quick look suggests that the following patch might make a
 difference, but there is more to it than that.  I think there are
 subtle differences due to the use of version-1 superblocks.  That
 might be just another one-line change, but I want to make sure first.

Well, that certainly made that warning go away. I don't have any
actually-failed disks, so I can't tell if it would *ever* warn anymore ;)

... actually, it just picked up some monthly array check activity:

Jan 15 20:03:17 loki daemon warning: mdadm: Rebuild20 event detected on md device /dev/md2

So it looks like it works perfectly well now.

(Looking at the code, yeah, without that change it'll never remember
state changes at all!)

One bit of residue from the state before this patch remains on line 352,
where you initialize disc.state and then never use it for anything...

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-14 Thread Nix
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
 mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
 shortly: I can't afford to not run mdadm --monitor... odd, that
 code hasn't changed during 2.6 development.

Whoo! Compile Monitor.c without optimization and the problem goes away.

Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing
this?), maybe mdadm is tripping undefined behaviour somewhere...

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-13 Thread Nix
On 12 Jan 2007, Ernst Herzberg told this:
 Then every about 60 sec 4 times

 event=SpareActive
 mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which have any
spare device --- nor have any active devices transitioned to spare
(which is what that event is actually supposed to mean).

mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
shortly: I can't afford to not run mdadm --monitor... odd, that
code hasn't changed during 2.6 development.

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-13 Thread Nix
On 13 Jan 2007, [EMAIL PROTECTED] spake thusly:

 On 12 Jan 2007, Ernst Herzberg told this:
 Then every about 60 sec 4 times

 event=SpareActive
 mddev=/dev/md3

 I see exactly this on both my RAID-5 arrays, neither of which have any
 spare device --- nor have any active devices transitioned to spare
 (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to
active (which seems more likely). Perhaps the comment at line 82 of
Monitor.c is wrong, or I just don't understand what a `reverse
transition' is supposed to be.

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-13 Thread Nix
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:

 On 12 Jan 2007, Ernst Herzberg told this:
 Then every about 60 sec 4 times

 event=SpareActive
 mddev=/dev/md3

 I see exactly this on both my RAID-5 arrays, neither of which have any
 spare device --- nor have any active devices transitioned to spare
 (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
  19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
  76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says

Number   Major   Minor   RaidDevice State
   0       8        6        0      active sync   /dev/sda6
   1       8       22        1      active sync   /dev/sdb6
   3      22        5        2      active sync   /dev/hdc5

Number   Major   Minor   RaidDevice State
   0       8       23        0      active sync   /dev/sdb7
   1       8        7        1      active sync   /dev/sda7
   3       3        5        2      active sync   /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
from `RaidDevice'? Why have both?)

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: Linux RAID version question

2006-11-27 Thread Nix
On 27 Nov 2006, Dragan Marinkovic stated:
 On 11/26/06, Nix [EMAIL PROTECTED] wrote:
 Well, I assemble my arrays with the command

 /sbin/mdadm --assemble --scan --auto=md
[...]
 No metadata versions needed anywhere.
[...]
 But you do have to specify the version (other than 0.90) when you want
 to build the array for the first time, correct?

Unless you've specified it in mdadm.conf's CREATE stanza, sure.
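
i.e. something along these lines (assuming your mdadm honours a default
metadata version in the CREATE line):

CREATE metadata=1.2 auto=md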

-- 
`The main high-level difference between Emacs and (say) UNIX, Windows,
 or BeOS... is that Emacs boots quicker.' --- PdS


Re: Linux RAID version question

2006-11-26 Thread Nix
On 25 Nov 2006, Dragan Marinkovic stated:
 Hm, I was playing with RAID 5 with one spare (3 + 1) and metadata
 version 1.2 . If I let it build to some 10% and cleanly reboot it does
 not start where it left off -- basically it starts from scratch. I was
 under the impression that RAID with metadata version 1.x exhibits
 different behavior with kernel 2.6.18 . I'm staring the array as:

/sbin/mdadm --assemble --scan -e 1.2 --no-degraded
 --config=/etc/mdadm.conf

 On another topic, I looked through the code trying to find if metadata
 version can be stored as a default in conf file. Unfortunately, there
 is no such option. It would be nice to have it so you don't have to
 specify it when assembling the array.

Well, I assemble my arrays with the command

/sbin/mdadm --assemble --scan --auto=md

and mdadm.conf looks like

DEVICE partitions
ARRAY /dev/md0 UUID=3a51b74f:8a759fe7:8520304c:3adbceb1
ARRAY /dev/md1 UUID=a5a6cad4:2c7fdc07:88a409b9:192ed3bf
ARRAY /dev/md2 UUID=fe44916d:a1098576:8007fb81:2ee33b5a

MAILADDR [EMAIL PROTECTED]


No metadata versions needed anywhere.
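
(If you don't fancy typing the UUIDs in by hand, something like

/sbin/mdadm --examine --scan >> /etc/mdadm.conf

will emit ARRAY lines for every superblock it can find, which you can
then trim to taste.)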

-- 
`The main high-level difference between Emacs and (say) UNIX, Windows,
 or BeOS... is that Emacs boots quicker.' --- PdS


Re: invalid (zero) superblock magic upon creation of a new RAID-1 array

2006-11-06 Thread Nix
On 6 Nov 2006, Thomas Andrews uttered the following:
 Thanks Neil, I fixed my problem by creating the raid set using the -e
 option:

 mdadm -C /dev/md0 -e 0.90 --level=raid1 --raid-devices=2  /dev/sda1 
 /dev/sdb1

 Your suggestion to use mdadm to assemble the array is not an option
 for me because it is the root partition that is raided, but thanks for
 putting me in the right direction.

You can still use mdadm to assemble root filesystems: you just need
an initramfs or initrd to do the work before / is mounted. (As a
bonus you can fsck it before it's mounted, as well.)

Most distros have tools that can do this for you, or you can do it by
hand (see e.g. http://linux-raid.osdl.org/index.php/RAID_Boot).
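
The fsck-before-mount bit is only a couple of extra lines in the
initramfs/initrd's init script, something like (a sketch; the array
name is illustrative):

/sbin/fsck -p /dev/md0
/bin/mount -o ro /dev/md0 /new-root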

-- 
`When we are born we have plenty of Hydrogen but as we age our
 Hydrogen pool becomes depleted.'


Re: why partition arrays?

2006-10-22 Thread Nix
On 21 Oct 2006, Bodo Thiesen yowled:
 was hdb and what was hdd? And hde? Hmmm ...), so we decided the following 
 structure:
 
 hda - vg called raida - creating LVs called raida1..raida4
 hdb - vg called raidb - creating LVs called raidb1..raidb4

I'm interested: why two VGs? Why not have one VG covering all RAID arrays,
and then another one for any unRAIDed space (if any)?

-- 
`When we are born we have plenty of Hydrogen but as we age our
 Hydrogen pool becomes depleted.'


Re: Starting point of the actual RAID data area

2006-10-11 Thread Nix
On 8 Oct 2006, Daniel Pittman said:
 Jyri Hovila [EMAIL PROTECTED] writes:
 I would appreciate it a lot if somebody could give me a hand here. All
 I need to understand right now is how I can find out the first sector
 of the actual RAID data. I'm starting with a simple configuration,
 where there are three identical drives, all of them used fully for one
 RAID 5 set. And no LVM at this point.
 
 It would start on the first sector of the first disk.

That depends on the superblock version. Versions 0.9x and 1.0 will do as
you say, but versions 1.1 and 1.2 will have the start of the disk
holding either the RAID superblock (for 1.1) or 4Kb of nothing (for
1.2).
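
If in doubt about a particular component, mdadm will tell you; something
like

mdadm --examine /dev/sda1 | egrep 'Version|Offset'

should show the superblock version and, for 1.x superblocks, the Super
Offset and Data Offset fields (in sectors) saying where the superblock
lives and where the RAID data proper begins.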

-- 
`In typical emacs fashion, it is both absurdly ornate and
 still not really what one wanted.' --- jdev


Re: Recipe for Mirrored OS Drives

2006-10-02 Thread Nix
On 2 Oct 2006, David Greaves spake:
 I suggest you link from http://linux-raid.osdl.org/index.php/RAID_Boot

The pages don't really have the same purpose. RAID_Boot is `how to boot
your RAID system using initramfs'; this is `how to set up a RAID system
in the first place', i.e., setup.

I'll give it a bit of a tweak-and-rename in a bit.

-- 
`In typical emacs fashion, it is both absurdly ornate and
 still not really what one wanted.' --- jdev


Re: Care and feeding of RAID?

2006-09-09 Thread Nix
On 6 Sep 2006, Mario Holbe spake:
 You don't necessarily need one. However, since Neil considers in-kernel
 RAID-autodetection a bad thing and since mdadm typically relies on
 mdadm.conf for RAID-assembly

You can specify the UUID on the command-line too (although I don't).

The advantage of the config file from my POV is that it lets me activate
*all* my RAID arrays with one command, and the command doesn't change, no
matter how complex the array configuration. (I'll admit that the sheer
number of options to mdadm has always overwhelmed me to some degree,
despite the excellent documentation, so I prefer approaches that keep
a working command-line unchanged, especially for something as critical
as boot-time assembly.)

-- 
`In typical emacs fashion, it is both absurdly ornate and
 still not really what one wanted.' --- jdev


Re: Care and feeding of RAID?

2006-09-09 Thread Nix
On 5 Sep 2006, Paul Waldo uttered the following:
 What about bitmaps?  Nobody has mentioned them.  It is my
 understanding that you just turn them on with mdadm /dev/mdX -b
 internal.  Any caveats for this?

Notably, how many additional writes does it incur? I have some RAID
arrays using drives which are quiet *until* you access them, and which
then make a bloody racket. The superblock updates are bad enough, but
bitmap updates, well, I don't really like seeing one write turned into
twelve-odd disk hits that much (just a back-of-the-envelope guess for a
three-disk RAID-5 array).
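
(If you do turn one on, a larger bitmap chunk means fewer bits and hence
fewer extra writes; something like this is the sort of thing I mean,
assuming your mdadm lets you set the chunk size when adding the bitmap:

mdadm --grow /dev/mdX --bitmap=internal --bitmap-chunk=65536

i.e. one dirty bit per 64MiB of array.)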

-- 
`In typical emacs fashion, it is both absurdly ornate and
 still not really what one wanted.' --- jdev


Re: remark and RFC

2006-08-16 Thread Nix
On 16 Aug 2006, Molle Bestefich murmured woefully:
 Peter T. Breuer wrote:
  The comm channel and hey, I'm OK message you propose doesn't seem
  that different from just hot-adding the disks from a shell script
  using 'mdadm'.

 [snip speculations on possible blocking calls]
 
 You could always try and see.
 Should be easy to simulate a network outage.

Blocking calls are not the problem. Deadlocks are.

The problem is that forking a userspace process necessarily involves
kernel memory allocations (for the task struct, userspace memory map,
possibly text pages if the necessary pieces of mdadm are not in the page
cache), and if your swap is on the remote RAID array, you can't
necessarily carry out those allocations.

Note that the same deadlock situation is currently triggered by
sending/receiving network packets, which is why swapping over NBD is a
bad idea at present: however, this is being fixed at this moment because
until it's fixed you can't reliably have a machine with all storage on
iSCSI, for instance. However, the deadlock is only fixable for kernel
allocations, because the amount of storage that will be needed is bounded in
several ways: you can't fix it for userspace allocations.  So you can
never rely on userspace working in this situation.

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly 
 see the need to define levels of inconceivability.' --- Rik Steenwinkel


Re: raid5/lvm setup questions

2006-08-07 Thread Nix
On 5 Aug 2006, David Greaves prattled cheerily:
 As an example of the cons: I've just set up lvm2 over my raid5 and whilst
 testing snapshots, the first thing that happened was a kernel BUG and an 
 oops...

I've been backing up using writable snapshots on LVM2 over RAID-5 for
some time. No BUGs.

I think the blame here is likely to be layable at the snapshots' door,
anyway: they're still a little wobbly and the implementation is pretty
complex: bugs surface on a regular basis.

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly 
 see the need to define levels of inconceivability.' --- Rik Steenwinkel


Re: md reports: unknown partition table - fixed.

2006-07-22 Thread Nix
On 20 Jul 2006, Neil Brown uttered the following:
 On Tuesday July 18, [EMAIL PROTECTED] wrote:
 
 I think there's a bug here somewhere. I wonder/suspect that the
 superblock should contain the fact that it's a partitioned/able md device?
 
 I've thought about that and am not in favour.
 I would rather just assume everything is partitionable - put
   CREATE auto=part

As long as `partitionable' doesn't imply `partitioned': I'd quite like
LVM-on-raw-md to keep working...

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly 
 see the need to define levels of inconceivability.' --- Rik Steenwinkel


Re: only 4 spares and no access to my data

2006-07-18 Thread Nix
On 18 Jul 2006, Neil Brown moaned:
 The superblock locations for sda and sda1 can only be 'one and the
 same' if sda1 is at an offset in sda which is a multiple of 64K, and
 if sda1 ends near the end of sda.  This certainly can happen, but it
 is by no means certain.
 
 For this reason, version-1 superblocks record the offset of the
 superblock in the device so that if a superblock is written to sda1
 and then read from sda, it will look wrong (wrong offset) and so will
 be ignored (no valid superblock here).

One case where this can happen is Sun slices (and I think BSD disklabels
too), where /dev/sda and /dev/sda1 start at the *same place*.

(This causes amusing problems with LVM vgscan unless the raw devices
are excluded, too.)

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly 
 see the need to define levels of inconceivability.' --- Rik Steenwinkel


Re: Still can't get md arrays that were started from an initrd to shutdown

2006-07-17 Thread Nix
On 17 Jul 2006, Christian Pernegger suggested tentatively:
 I'm still having problems with some md arrays not shutting down
 cleanly on halt / reboot.
 
 The problem seems to affect only arrays that are started via an
 initrd, even if they do not have the root filesystem on them.
 That's all arrays if they're either managed by EVMS or the
 ramdisk-creator is initramfs-tools. For yaird-generated initrds only
 the array with root on it is affected.

Hm. FWIW mine (started via initramfs) all shut down happily, so
something more than this is involved.

My initramfs has a /dev populated via busybox's `mdev -s', yielding
a /dev containing all devices the kernel reports, under their kernel
names. mdadm is run like this:

/sbin/mdadm --assemble --scan --auto=md --run

and, well, it works, with udev populating the appropriate devices for me
once we switch to the real root filesystem.

 If only there were an exitrd image to complement the initrds ...

This shouldn't be necessary.

One possible difference here is that switch_root (from recent busyboxes)
deletes everything on the initramfs before chrooting to the new root.
Are you sure that there's nothing left running from your initrd that
could be holding open, say, some block device that comprises part of
your RAID array? (I'm just guessing here, but this seems like the
largest difference between your setup and mine.)

-- 
`We're sysadmins. We deal with the inconceivable so often I can clearly 
 see the need to define levels of inconceivability.' --- Rik Steenwinkel


Re: [PATCH*2] mdadm works with uClibc from SVN

2006-06-27 Thread Nix
On 26 Jun 2006, Neil Brown said:
 On Tuesday June 20, [EMAIL PROTECTED] wrote:
 For some time, mdadm's been dumping core on me in my uClibc-built
 initramfs. As you might imagine this is somewhat frustrating, not least
 since my root filesystem's in LVM on RAID. Half an hour ago I got around
 to debugging this.
 
 Imagine my surprise when I found that it was effectively guaranteed to
 crash: map_dev() in util.c is stubbed out for uClibc builds, and
 returns -1 at all times. That means that code in, among other places,
 config.c:load_partitions() is guaranteed to segfault, which is a bit
 tough if you're using the (sane) default `DEVICE partitions' in your
 mdadm.conf.
 
 I'm confused.  map_dev doesn't return -1, it returns NULL.

Yes, that was a typo on my part.

 And the code in load_partitions() handles a NULL return.

Er, what?

The code in load_partitions() in mdadm 2.5.1 and below says

,----[ config.c:load_partitions() ]
| name = map_dev(major, minor, 1);
| 
| d = malloc(sizeof(*d));
| d->devname = strdup(name);
`----

So if map_dev() returns NULL, we do a strdup(NULL) -> crash inside
uClibc.

 So I cannot see why it would be core dumping.
 If you have more details, I'd be really interested.

Backtrace (against uClibc compiled with debugging symbols, tested
against a temporary RAID array on a loopback device):

(gdb) run
Starting program: /usr/packages/mdadm/i686-loki/mdadm.uclibc --assemble 
--uuid=572aeae1:532641bf:49c9aec5:6036454d /dev/md127

Program received signal SIGSEGV, Segmentation fault.
0x080622f8 in strlen (s=0x0) at libc/string/i386/strlen.c:40
40  libc/string/i386/strlen.c: No such file or directory.
in libc/string/i386/strlen.c
(gdb) bt
#0  0x080622f8 in strlen (s=0x0) at libc/string/i386/strlen.c:40
#1  0x08062d1d in strdup (s1=0x0) at libc/string/strdup.c:30
#2  0x0804b0df in load_partitions () at config.c:247
#3  0x0804b77c in conf_get_devs (conffile=0x0) at config.c:715
#4  0x0804d685 in Assemble (st=0x0, mddev=0xbf9c753e "/dev/md127", mdfd=6, 
ident=0xbf9c630c, conffile=0x0, devlist=0x0, backup_file=0x0, readonly=0, 
runstop=0, update=0x0, homehost=0x0,
verbose=0, force=0) at Assemble.c:184
#5  0x08049d62 in main (argc=4, argv=0xbf9c6514) at mdadm.c:958
(gdb) frame 2
#2  0x0804b0df in load_partitions () at config.c:247
247 d-devname = strdup(name);
(gdb) info locals
name = 0x0
mp = 0xbf9c5a34 "     0    9873360 hda\n"
f = (FILE *) 0x807b018
buf = "   3     0    9873360 hda\n", '\0' <repeats 546 times>, "1\000\000\000\000\000((ÿÿÿ\177\021", '\0' <repeats 15 times>, "\021\000\000\000\231\231\231\031Z]\234¿\000\000(\005ÿÿÿ\177¼\\\234¿ÓU\006\bZ]\234¿\000\000\000\000\n", '\0' <repeats 15 times>, "d^\234¿ÏÈ\004\bX]\234¿\000\000\000\000\n\000\000\000\000\000Linux", '\0' <repeats 60 times>, "loki\000)", '\0' <repeats 43 times>, "\006", '\0' <repeats 15 times>, "2.6.17-s\000\000\000\000à]\234¿ßÉ\005\b|]\234¿ð]\234¿", '\0' <repeats 12 times>...
rv = (mddev_dev_t) 0x0


I hope the bug's more obvious now :)

 That said: utils doesn't handle the case for 'ftw not available' as
 well as it could.  I will fix that.

The __UCLIBC_HAS_FTW__ macro in features.h is the right way to do that
for uClibc, as Luca noted.

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox


Re: Multiple raids on one machine?

2006-06-27 Thread Nix
On 25 Jun 2006, Chris Allen uttered the following:
 Back to my 12 terabyte fileserver, I have decided to split the storage
 into four partitions each of 3TB. This way I can choose between XFS
 and EXT3 later on.
 
 So now, my options are between the following:
 
 1. Single 12TB /dev/md0, partitioned into four 3TB partitions. But how do
 I do this? fdisk won't handle it. Can GNU Parted handle partitions this big?
 
 2. Partition the raw disks into four partitions and make /dev/md0,md1,md2,md3.
 But am I heading for problems here? Is there going to be a big performance hit
 with four raid5 arrays on the same machine? Am I likely to have dataloss 
 problems
 if my machine crashes?

There is a third alternative which can be useful if you have a mess of
drives of widely-differing capacities: make several RAID arrays so as
to tessellate space across all the drives, and then pile an LVM on top
of them all to fuse them back into one again.

The result should give you the reliability of RAID-5 and the resizeability of
LVM :)

e.g. the config on my home server, which for reasons of disks-bought-at-
different-times has disks varying in size from 10Gb through 40Gb to
72Gb. Discounting the tiny RAID-1 array used for booting off (LILO won't
boot from RAID-5), it looks like this:

Two RAID arrays, positioned so as to fill up as much space as possible
on the various physical disks:

 Raid Level : raid5
 Array Size : 76807296 (73.25 GiB 78.65 GB)
Device Size : 76807296 (36.62 GiB 39.33 GB)
   Raid Devices : 3
[...]
    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       22        1      active sync   /dev/sdb6
       3      22        5        2      active sync   /dev/hdc5

 Array Size : 19631104 (18.72 GiB 20.10 GB)
Device Size : 19631104 (9.36 GiB 10.05 GB)
   Raid Devices : 3

 Raid Level : raid5
    Number   Major   Minor   RaidDevice State
       0       8       23        0      active sync   /dev/sdb7
       1       8        7        1      active sync   /dev/sda7
       3       3        5        2      active sync   /dev/hda5

(Note that the arrays share some disks, the largest ones: each lays
claim to almost the whole of one of the smaller disks.)

Then atop that we have two LVM volume groups, one filling up any
remaining non-RAIDed space and used for non-critical stuff which can be
regenerated on demand (if a disk dies the whole VG will vanish; if we
wanted to avoid that we could make that space into a RAID-1 array, but I
have a lot of easily-regeneratable data and so didn't bother with that),
and one filling *both* RAID arrays:

  VG      #PV #LV #SN Attr   VSize  VFree  Devices
  disks     3   7   0 wz--n- 43.95G 21.80G /dev/sda8(0)
  disks     3   7   0 wz--n- 43.95G 21.80G /dev/sdb8(0)
  disks     3   7   0 wz--n- 43.95G 21.80G /dev/hdc6(0)
  raid      2   9   0 wz--n- 91.96G 49.77G /dev/md1(0)
  raid      2   9   0 wz--n- 91.96G 49.77G /dev/md2(0)

The result can survive any single disk failure, just like a single
RAID-5 array: the worst case is that one of the shared /dev/sd* disks
dies and both arrays go degraded at once, but nothing else bad would
happen to the RAIDed storage.

Try doing *that* with hardware RAID. :)))
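
(For the record, setting that lot up is only a handful of commands; a
rough, untested sketch using the device names above:

mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda6 /dev/sdb6 /dev/hdc5
mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sda7 /dev/sdb7 /dev/hda5
pvcreate /dev/md1 /dev/md2
vgcreate raid /dev/md1 /dev/md2
lvcreate -L 6G -n usr raid

and so on for the other LVs.)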

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox


Re: [PATCH*2] mdadm works with uClibc from SVN

2006-06-27 Thread Nix
On Tue, 27 Jun 2006, Neil Brown prattled cheerily:
 On Tuesday June 27, [EMAIL PROTECTED] wrote:
 ,[ config.c:load_partitions() ]
 | name = map_dev(major, minor, 1);
 | 
 | d = malloc(sizeof(*d));
 | d-devname = strdup(name);
 `
 
 
 Ahh.. uhmmm... Oh yes.  I've fixed that since, but completely forgot
 about it and the change log didn't make it obvious.

Yahoo. That's what I like about free software: the pre-emptive bugfixes!
It seems that as soon as I find a bug, someone else will have already
fixed it :)

 So: that's fixed for 2.5.2.  Thanks for following it up.

No, thank *you* for fixing my boot process :)

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox


Re: Multiple raids on one machine?

2006-06-27 Thread Nix
On Tue, 27 Jun 2006, Chris Allen wondered:
 Nix wrote:
 There is a third alternative which can be useful if you have a mess of
 drives of widely-differing capacities: make several RAID arrays so as to 
 tesselate
 space across all the drives, and then pile an LVM on the top of all of them 
 to
 fuse them back into one again.
 
 But won't I be stuck with the same problem? ie I'll have a single 12TB
 lvm, and won't be able to use EXT3 on it?

Not without ext3 patches (until the very-large-ext3 patches now pending
on l-k go in), sure. But because it's LVMed you could cut it into a couple
of ext3 filesystems easily. (I find it hard to imagine a single *directory*
whose children contain 12Tb of files in a form that you can't cut into
pieces with suitable use of bind mounts, but still, perhaps such exists.)

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox


Re: [PATCH*2] mdadm works with uClibc from SVN

2006-06-23 Thread Nix
On Fri, 23 Jun 2006, Neil Brown mused:
 On Friday June 23, [EMAIL PROTECTED] wrote:
 On 20 Jun 2006, [EMAIL PROTECTED] prattled cheerily:
  For some time, mdadm's been dumping core on me in my uClibc-built
  initramfs. As you might imagine this is somewhat frustrating, not least
  since my root filesystem's in LVM on RAID. Half an hour ago I got around
  to debugging this.
 
 Ping?
 
 No, but I do know someone by that name.

Sorry, I'm pinging all the patches I've sent out in the last few weeks
and this one was on my hit list :)

 Yeh, it's on my todo list. 

Agreed.

 I suspect I would rather make mdadm not dump core of ftw isn't
 available.

It was... surprising. I'm still not sure how I ever made it work without
this fix; the first uClibc-based initramfs I ever rolled had mdadm 2.3.1
in it and a `DEVICE partitions', and there was no core dump. A mystery.

 Is there some #define in an include file which will allow me to tell
 if the current uclibc supports ftw or not?

I misspoke: ftw was split into multiple files in late 2005, but it was
originally added in September 2003, in time for version 0.9.21.

Obviously the #defines in ftw.h don't exist before that date, but
that's a bit late to check, really.

features.h provides the macros __UCLIBC_MAJOR__, __UCLIBC_MINOR__, and
__UCLIBC_SUBLEVEL__: versions above 0.9.20 appear to support ftw()
(at least, they have the function, in 32-bit form at least, which
is certainly enough for this application!)

 (and I'm more like to reply to mail if it the To or Cc line
  mentions me specifically - it's less likely to get lost that way).

OK.

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-23 Thread Nix
On 23 Jun 2006, Francois Barre uttered the following:
 The problem is that there is no cost effective backup available.
 
 One-liner questions :
 - How does Google make backups ?

Replication across huge numbers of cheap machines on a massively
distributed filesystem.

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-23 Thread Nix
On 23 Jun 2006, PFC suggested tentatively:
   - ext3 is slow if you have many files in one directory, but has
   more mature tools (resize, recovery etc)

This is much less true if you turn on the dir_index feature.
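
e.g., roughly (check tune2fs(8) and e2fsck(8) on your system first, and
run the fsck on an unmounted filesystem):

tune2fs -O dir_index /dev/md0
e2fsck -fD /dev/md0    # rebuild and optimize existing directory indexes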

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-23 Thread Nix
On 23 Jun 2006, Christian Pedaschus said:
 and my main point for using ext3 is still: it's a very mature fs;
 nobody will tell you such horrible stories about data-lossage with ext3
 as with other filesystems.

Actually I can, but it required bad RAM *and* a broken disk controller
*and* an electrical storm *and* heavy disk loads (only read loads,
but I didn't have noatime active so read implied write).

In my personal experience it has since weathered machines whose `only'
problem was RAM so bad that md5sums of 512Kb files wouldn't come out the
same way twice, with no filesystem problems at all (some file data got
corrupted, unsurprisingly, but the metadata was fine).

Definitely an FS to be relied upon.

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH*2] mdadm works with uClibc from SVN

2006-06-20 Thread Nix
For some time, mdadm's been dumping core on me in my uClibc-built
initramfs. As you might imagine this is somewhat frustrating, not least
since my root filesystem's in LVM on RAID. Half an hour ago I got around
to debugging this.

Imagine my surprise when I found that it was effectively guaranteed to
crash: map_dev() in util.c is stubbed out for uClibc builds, and
returns NULL at all times. That means that code in, among other places,
config.c:load_partitions() is guaranteed to segfault, which is a bit
tough if you're using the (sane) default `DEVICE partitions' in your
mdadm.conf.

As far as I can tell this is entirely because ftw() support is not
implemented in uClibc. But as of November 2005, it *is* implemented in
uClibc 0.9.29-to-be in SVN, which rumour has it is soon to be officially
released. (The ftw() implementation is currently a copy of that from
glibc, but knowing the uClibc folks it will shrink dramatically as time
goes by. :) )

There are two options here. Either the trivial but rather unhelpful
approach of, well, *telling* people that what they're building
isn't going to work:

diff -durN 2.5.1-orig/Makefile 2.5.1-patched/Makefile
--- 2.5.1-orig/Makefile 2006-06-20 22:49:56.0 +0100
+++ 2.5.1-patched/Makefile  2006-06-20 22:50:34.0 +0100
@@ -97,6 +97,7 @@
 mdadm.tcc : $(SRCS) mdadm.h
$(TCC) -o mdadm.tcc $(SRCS)
 
+# This doesn't work
 mdadm.uclibc : $(SRCS) mdadm.h
$(UCLIBC_GCC) -DUCLIBC -DHAVE_STDINT_H -o mdadm.uclibc $(SRCS) $(STATICSRC)
 
@@ -115,6 +116,7 @@
rm -f $(OBJS)
$(CC) $(LDFLAGS) $(ASSEMBLE_FLAGS) -static -DHAVE_STDINT_H -o mdassemble.static $(ASSEMBLE_SRCS) $(STATICSRC)
 
+# This doesn't work
 mdassemble.uclibc : $(ASSEMBLE_SRCS) mdadm.h
rm -f $(OJS)
$(UCLIBC_GCC) $(ASSEMBLE_FLAGS) -DUCLIBC -DHAVE_STDINT_H -static -o mdassemble.uclibc $(ASSEMBLE_SRCS) $(STATICSRC)
diff -durN 2.5.1-orig/mdadm.h 2.5.1-patched/mdadm.h
--- 2.5.1-orig/mdadm.h  2006-06-02 06:35:22.0 +0100
+++ 2.5.1-patched/mdadm.h   2006-06-20 22:50:55.0 +0100
@@ -344,6 +344,7 @@
 #endif
 
 #ifdef UCLIBC
+#error This is known not to work.
   struct FTW {};
 # define FTW_PHYS 1
 #else

Or the even-more-trivial but arguably far more useful approach of making
it work when possible (failing to compile with old uClibc because of the
absence of ftw.h, and working perfectly well with new uClibc):

diff -durN 2.5.1-orig/Makefile 2.5.1-patched/Makefile
--- 2.5.1-orig/Makefile 2006-06-20 22:49:56.0 +0100
+++ 2.5.1-patched/Makefile  2006-06-20 22:52:34.0 +0100
@@ -98,7 +98,7 @@
$(TCC) -o mdadm.tcc $(SRCS)
 
 mdadm.uclibc : $(SRCS) mdadm.h
-   $(UCLIBC_GCC) -DUCLIBC -DHAVE_STDINT_H -o mdadm.uclibc $(SRCS) $(STATICSRC)
+   $(UCLIBC_GCC) -DHAVE_STDINT_H -o mdadm.uclibc $(SRCS) $(STATICSRC)
 
 mdadm.klibc : $(SRCS) mdadm.h
rm -f $(OBJS) 
@@ -117,7 +117,7 @@
 
 mdassemble.uclibc : $(ASSEMBLE_SRCS) mdadm.h
rm -f $(OJS)
-   $(UCLIBC_GCC) $(ASSEMBLE_FLAGS) -DUCLIBC -DHAVE_STDINT_H -static -o mdassemble.uclibc $(ASSEMBLE_SRCS) $(STATICSRC)
+   $(UCLIBC_GCC) $(ASSEMBLE_FLAGS) -DHAVE_STDINT_H -static -o mdassemble.uclibc $(ASSEMBLE_SRCS) $(STATICSRC)
 
 # This doesn't work
 mdassemble.klibc : $(ASSEMBLE_SRCS) mdadm.h
diff -durN 2.5.1-orig/mdadm.h 2.5.1-patched/mdadm.h
--- 2.5.1-orig/mdadm.h  2006-06-02 06:35:22.0 +0100
+++ 2.5.1-patched/mdadm.h   2006-06-20 22:53:02.0 +0100
@@ -343,14 +343,9 @@
 struct stat64;
 #endif
 
-#ifdef UCLIBC
-  struct FTW {};
+#include <ftw.h>
+#ifdef __dietlibc__
 # define FTW_PHYS 1
-#else
-# include <ftw.h>
-# ifdef __dietlibc__
-#  define FTW_PHYS 1
-# endif
 #endif
 
extern int add_dev(const char *name, const struct stat *stb, int flag, struct FTW *s);
diff -durN 2.5.1-orig/util.c 2.5.1-patched/util.c
--- 2.5.1-orig/util.c   2006-06-16 01:25:44.0 +0100
+++ 2.5.1-patched/util.c2006-06-20 22:54:13.0 +0100
@@ -354,21 +354,6 @@
 } *devlist = NULL;
 int devlist_ready = 0;
 
-#ifdef UCLIBC
-int add_dev(const char *name, const struct stat *stb, int flag, struct FTW *s)
-{
-   return 0;
-}
-char *map_dev(int major, int minor, int create)
-{
-#if 0
-   fprintf(stderr, "Warning - fail to map %d,%d to a device name\n",
-   major, minor);
-#endif
-   return NULL;
-}
-#else
-
 #ifdef __dietlibc__
 int add_dev_1(const char *name, const struct stat *stb, int flag)
 {
@@ -467,8 +452,6 @@
return nonstd ? nonstd : std;
 }
 
-#endif
-
 unsigned long calc_csum(void *super, int bytes)
 {
unsigned long long newcsum = 0;


With this latter patch, mdadm works flawlessly with my uClibc,
svnversion r15342 from 2006-06-08, and probably works just as well with
all SVN releases after r13017, 2005-12-30.
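
(For reference, I'm building the stock target and just pointing
UCLIBC_GCC at the cross wrapper; the compiler name is mine, substitute
your own:)

make mdadm.uclibc UCLIBC_GCC=i386-uclibc-gcc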


(One final request: please turn on world-readability in your generated
tarballs ;) right now util.c and some others are readable only by
user. Thanks.)

-- 
`NB: Anyone suggesting that we should say Tibibytes instead of
 Terabytes there will be hunted down and brutally slain.
 That is all.' --- Matthew Wilcox

Re: RAID tuning?

2006-06-14 Thread Nix
On 13 Jun 2006, Gordon Henderson said:
 On Tue, 13 Jun 2006, Adam Talbot wrote:
 Can any one give me more info on this error?  Pulled from
 /var/log/messages.
 raid6: read error corrected!!
 
 Not seen that one!!!

The message is pretty easy to figure out and the code (in
drivers/md/raid6main.c) is clear enough. The block device driver has
reported a read error. In the old days (pre-2.6.15) the drive would have
been kicked from the array for that, and the array would have dropped to
degraded state; but nowadays the system tries to rewrite the data that
should have been there (recomputed from the corresponding blocks on the
other disks in the array), and only fails the drive if that doesn't work.
Generally hard disks activate sector sparing and stop reporting read
errors for bad blocks only when the block is *written* to (it has to do
that, annoying though the read errors are; since it can't read the data
off the bad block, it can't tell what data should go onto the spare
sector that replaces it until you write it).

So it's disk damage, but unless it happens over and over again you
probably don't need to be too concerned any more.
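
(If you want to keep an eye on the drive, smartctl will show whether it
is quietly remapping sectors -- /dev/hdc here is a stand-in for
whichever disk complained:)

# attribute 5 = sectors already remapped, 197 = sectors pending remap
smartctl -A /dev/hdc | grep -Ei 'reallocated|pending'

If either number keeps climbing, *then* it's time to worry.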

-- 
`Voting for any American political party is fundamentally
 incomprehensible.' --- Vadik
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mdadm 2.5

2006-06-05 Thread Nix
On 29 May 2006, Neil Brown suggested tentatively:
 On Sunday May 28, [EMAIL PROTECTED] wrote:
 - mdadm-2.4-strict-aliasing.patch
 fix for another strict-aliasing problem, you can typecast a reference to a
 void pointer to anything, you cannot typecast a reference to a
 struct.
 
 Why can't I typecast a reference to a struct??? It seems very
 unfair...

ISO C forbids it. Pretty much the only time you're allowed to cast a pointer
to A to a pointer to B is if one or the other of them is a pointer to void
or char.

The compiler can (and does) assume that all other pointers-to-T are in
different `aliasing sets', i.e. that a modification through a
pointer-to-T1 cannot affect variables of any other type (so they can
be cached in registers, moved around without reference to each other,
and so on).

It's *very* advantageous, and with the interprocedural optimizations in
GCC 4.1+ it is getting more so all the time (the compiler can now move code
between *functions* if it needs to, or propagate constants consistently
used as function arguments down into the functions themselves; pointer
aliasing torpedoes both of these optimizations, and many more.)

(One particularly evil thing is the glibc printf handler extension. As
a result of that, we must assume that any call to printf() or scanf()
can potentially modify any global variable and any local to which a
pointer may have escaped. This is decidedly suboptimal given that
nobody sane ever uses those extensions!)


I think I'll dare to upgrade mdadm soon. I tried upgrading from 2.3.1 to
2.4 (with uClibc from SVN dated 2006-03-28) and it promptly segfaulted
at array assembly time. If it happens with 2.5 I'll get a backtrace
(hard though that is to get when you don't have any accessible storage
to put the core dump on... I upgraded uClibc at the same time so it may
well be an uClibc bug.)

-- 
`Voting for any American political party is fundamentally
 incomprehensible.' --- Vadik
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problems with device-mapper on top of RAID-5 and RAID-6

2006-06-05 Thread Nix
On 2 Jun 2006, Uwe Meyer-Gruhl uttered the following:
 Neil's suggestion indicates that there may be a race condition
 stacking md and dm over each other, but I have not yet tested that
 patch. I once had problems stacking cryptoloop over RAID-6, so it
 might really be a stacking problem. We don't know yet if LVM over RAID
 is affected as well.

I've been running LVM on RAID (spanning two RAID-5 arrays) for some time
now and have had no trouble (that I know of: a check just completed OK,
so the RAID layer at least is consistent). The arrays have both IDE and
SCSI (sym53c875) components.

Of course this is a subtle and intermittent problem so this is hardly
a green light :/
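
(By `a check' I mean nothing fancier than poking sysfs, roughly as
below -- md0 is a placeholder, and you need a 2.6.16+ kernel for the
check/repair sync_action values:)

echo check > /sys/block/md0/md/sync_action   # read and compare everything
cat /proc/mdstat                             # watch it grind along
cat /sys/block/md0/md/mismatch_cnt           # 0 if parity/copies all agree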

-- 
`Voting for any American political party is fundamentally
 incomprehensible.' --- Vadik
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with raid=noautodetect - solved

2006-05-25 Thread Nix
On 24 May 2006, Florian Dazinger uttered the following:
 Neil Brown wrote:
 Presumably you have a 'DEVICE' line in mdadm.conf too?  What is it.
 My first guess is that it isn't listing /dev/sdd? somehow.
 Otherwise, can you add a '-v' to the mdadm command that assembles the
 array, and capture the output.  That might be helpful.
 NeilBrown

 stupid me! I had a DEVICE section, but somehow forgot about my /dev/sdd drive.

`DEVICE partitions' is generally preferable for that reason, unless
you have entries in /proc/partitions which you explicitly want to
exclude from scanning for RAID superblocks.
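
(i.e. let the superblock scan find the devices and only pin down the
arrays; the usual recipe -- check the result by eye before trusting
it -- is something like:)

echo 'DEVICE partitions' > /etc/mdadm.conf
mdadm --examine --scan >> /etc/mdadm.conf   # appends one ARRAY line per array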

-- 
`On a scale of 1-10, X's brokenness rating is 1.1, but that's only
 because bringing Windows into the picture rescaled brokenness by
 a factor of 10.' --- Peter da Silva
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Does software RAID take advantage of SMP, or 64 bit CPU(s)?

2006-05-25 Thread Nix
On 23 May 2006, Neil Brown noted:
 On Monday May 22, [EMAIL PROTECTED] wrote:
 A few simple questions about the 2.6.16+ kernel and software RAID.
 Does software RAID in the 2.6.16 kernel take advantage of SMP?
 
 Not exactly.  RAID5/6 tends to use just one cpu for parity
 calculations, but that frees up other cpus for doing other important
 work.

To expand on this, that depends on how many RAID arrays you've got,
since there's one parity-computation daemon per array.

If you have several arrays and are writing to them at the same time,
or several arrays and some are degraded, then several md*_raid*
daemons might be working at once.

But that's not very likely, I'd guess. (I have multiple RAID-5 arrays,
but that's only because I'm trying to get useful RAIDing on multiple
disks of drastically different size.)
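
(You can see the per-array threads directly; the output below is just
the general shape from memory, not a paste from a real box:)

ps -eo pid,comm | grep '_raid'
#  412 md1_raid5
#  413 md2_raid5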

-- 
`On a scale of 1-10, X's brokenness rating is 1.1, but that's only
 because bringing Windows into the picture rescaled brokenness by
 a factor of 10.' --- Peter da Silva
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: xfs or ext3?

2006-05-10 Thread Nix
On 10 May 2006, Dexter Filmore wrote:
 Do I have to provide stride parameter like for ext2?

Yes, definitely.
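
(For the archives: stride is chunk size divided by filesystem block
size, so e.g. 64k chunks and 4k blocks give 16; the device and numbers
below are examples only:)

mke2fs -j -b 4096 -R stride=16 /dev/md0
# (newer e2fsprogs spell it -E stride=16)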

-- 
`On a scale of 1-10, X's brokenness rating is 1.1, but that's only
 because bringing Windows into the picture rescaled brokenness by
 a factor of 10.' --- Peter da Silva
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disks becoming slow but not explicitly failing anyone?

2006-04-24 Thread Nix
On 23 Apr 2006, Mark Hahn stipulated:
 I've seen a lot of cheap disks say (generally deep in the data sheet
 that's only available online after much searching and that nobody ever
 reads) that they are only reliable if used for a maximum of twelve hours
 a day, or 90 hours a week, or something of that nature. Even server
 
 I haven't, and I read lots of specs.  they _will_ sometimes say that 
 non-enterprise drives are intended or designed for a 8x5 desktop-like
 usage pattern.

That's the phrasing, yes: foolish me assumed that meant `if you leave it
on for much longer than that, things will go wrong'.

 to the normal way of thinking about reliability, this would 
 simply mean a factor of 4.2x lower reliability - say from 1M to 250K hours
 MTBF.  that's still many times lower rate of failure than power supplies or 
 fans.

Ah, right, it's not a drastic change.

 It still stuns me that anyone would ever voluntarily buy drives that
 can't be left switched on (which is perhaps why the manufacturers hide
 
 I've definitely never seen any spec that stated that the drive had to be 
 switched off.  the issue is really just what is the designed duty-cycle?

I see. So it's just `we didn't try to push the MTBF up as far as we would
on other sorts of disks'.

 I run a number of servers which are used as compute clusters.  load is
 definitely 24x7, since my users always keep the queues full.  but the servers
 are not maxed out 24x7, and do work quite nicely with desktop drives
 for years at a time.  it's certainly also significant that these are in a 
 decent machineroom environment.

Yeah; i.e., cooled. I don't have a cleanroom in my house so the RAID
array I run there is necessarily uncooled, and the alleged aircon in the
room housing work's array is permanently on the verge of total collapse
(I think it lowers the temperature, but not by much).

 it's unfortunate that disk vendors aren't more forthcoming with their drive
 stats.  for instance, it's obvious that wear in MTBF terms would depend 
 nonlinearly on the duty cycle.  it's important for a customer to know where 
 that curve bends, and to try to stay in the low-wear zone.  similarly, disk

Agreed! I tend to assume that non-laptop disks hate being turned on and
hate temperature changes, so just keep them running 24x7. This seems to be OK,
with the only disks this has ever killed being Hitachi server-class disks in
a very expensive Sun server which was itself meant for 24x7 operation; the
cheaper disks in my home systems were quite happy. (Go figure...)

 specs often just give a max operating temperature (often 60C!), which is 
 almost disingenuous, since temperature has a superlinear effect on 
 reliability.

I'll say. I'm somewhat twitchy about the uncooled 37C disks in one of my
machines: but one of the other disks ran at well above 60C for *years*
without incident: it was an old one with no onboard temperature sensing,
and it was perhaps five years after startup that I opened that machine
for the first time in years and noticed that the disk housing nearly
burned me when I touched it. The guy who installed it said that yes, it
had always run that hot, and was that important? *gah*

I got a cooler for that disk in short order.

 a system designer needs to evaluate the expected duty cycle when choosing
 disks, as well as many other factors which are probably more important.
 for instance, an earlier thread concerned a vast amount of read traffic 
 to disks resulting from atime updates.

Oddly, I see a steady pulse of write traffic, ~100Kb/s, to one dm device
(translating into read+write on the underlying disks) even when the
system is quiescent, all daemons killed, and all fsen mounted with
noatime. One of these days I must fish out blktrace and see what's
causing it (but that machine is hard to quiesce like that: it's in heavy
use).
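
(When I do get around to it, it'll be something along these lines --
untested invocations, and dm-3 is a stand-in for the busy device:)

echo 1 > /proc/sys/vm/block_dump   # quick hack: log block writes to the kernel log
sleep 60; dmesg | tail -50
echo 0 > /proc/sys/vm/block_dump

blktrace -d /dev/dm-3 -o - | blkparse -i -   # the full story, needs CONFIG_BLK_DEV_IO_TRACE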

 simply using more disks also decreases the load per disk, though this is 
 clearly only a win if it's the difference in staying out of the disks 
 duty-cycle danger zone (since more disks divide system MTBF).

Well, yes, but if you have enough more you can make some of them spares
and push up the MTBF again (and the cooling requirements, and the power
consumption: I wish there was a way to spin down spares until they were
needed, but non-laptop controllers don't often seem to provide a way to
spin anything down at all that I know of).
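
(For ATA disks hdparm can at least do it by hand, though I've never
left a spare spun down for long -- /dev/hdc is a placeholder, and SCSI
disks want sdparm or sg_start instead:)

hdparm -y /dev/hdc      # spin down now
hdparm -S 242 /dev/hdc  # ...and again automatically after an hour idle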

-- 
`On a scale of 1-10, X's brokenness rating is 1.1, but that's only
 because bringing Windows into the picture rescaled brokenness by
 a factor of 10.' --- Peter da Silva
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: disks becoming slow but not explicitly failing anyone?

2006-04-23 Thread Nix
On 23 Apr 2006, Mark Hahn said:
   some people claim that if you put a normal (desktop)
 drive into a 24x7 server (with real round-the-clock load), you should 
 expect failures quite promptly.  I'm inclined to believe that with 
 MTBF's upwards of 1M hour, vendors would not claim a 3-5yr warranty
 unless the actual failure rate was low, even if only running 8/24.

I've seen a lot of cheap disks say (generally deep in the data sheet
that's only available online after much searching and that nobody ever
reads) that they are only reliable if used for a maximum of twelve hours
a day, or 90 hours a week, or something of that nature. Even server
disks generally seem to say something like that, but the figure given is
more like `168 hours a week', i.e., constant use.

It still stuns me that anyone would ever voluntarily buy drives that
can't be left switched on (which is perhaps why the manufacturers hide
the info in such an obscure place), and I don't know what might go wrong
if you use the disk `too much': overheating?

But still it seems that there are crappy disks out there with very
silly limits on the time they can safely be used for.

(But this *is* the RAID list: we know that disks suck, right?)

-- 
`On a scale of 1-10, X's brokenness rating is 1.1, but that's only
 because bringing Windows into the picture rescaled brokenness by
 a factor of 10.' --- Peter da Silva
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: naming of md devices

2006-03-24 Thread Nix
On 23 Mar 2006, Dan Christensen moaned:
 To answer myself, the boot parameter raid=noautodetect is supposed
 to turn off autodetection.  However, it doesn't seem to have an
 effect with Debian's 2.6.16 kernel.  It does disable autodetection
 for my self-compiled kernel, but since that kernel has no initrd or
 initramfs, it gets stuck at that point.  [If I understand correctly,
 you can't use mdadm for building the array without an initrd/ramfs.]

That's true if your root filesystem is on RAID.

 I also tried putting root=LABEL=/ on my boot command line.  Debian's
 kernel seemed to understand this but gave:
 
 Begin: Waiting for root filesystem...
 Done.
 Done.
 Begin: Mounting root filesystem
 ...kernel autodetection of raid seemed to happen here...
 ALERT /dev/disk/by_label// does not exist

Ah, welcome to the udev problems. Look at the Debian kernel-maint list
at lists.debian.org and marvel at the trouble they're having because
they're using udev on their initramfs. I'm glad I used mdev instead :)
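
(Populating /dev with mdev is all of three lines in the init script,
roughly:)

/bin/mount -t tmpfs tmpfs /dev
/sbin/mdev -s                                # create nodes from /sys
echo /sbin/mdev > /proc/sys/kernel/hotplug   # optionally handle later hotplug too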

 Will the Debian kernel/initramfs fall
 back to using mdadm to build the arrays?

`Fall back to'? If autodetection is turned off, it's not a fallback,
it's the common case.

 the above is on unstable... i don't use stable (and stable definitely does 
 the wrong thing -- 
 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338200).
 
 That bug is against initrd-tools, which is a different package I
 believe.

Yes, it is unmaintained.

 BUT, my self-compiled kernel is now failing to bring up the arrays!  I
 didn't change anything on the arrays or on this kernel's boot line,
 and I have not turned off kernel auto-detection, so I have no idea why
 there is a problem.  Unfortunately, I don't have a serial console, and
 the kernel panics so I can't scroll back to see the relevant part of
 the screen.  My self-compiled kernel has everything needed for
 my root filesystem compiled in, so I avoided needing an initramfs.

Without boot messages it's very hard to say what's going on. If you have
another machine, you could try booting with the messages going over a
serial console...
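
(With LILO that's just an append line and a null-modem cable to another
box running minicom; ttyS0 and the speed are whatever your hardware
actually has:)

append="console=ttyS0,115200n8 console=tty0"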

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: naming of md devices

2006-03-24 Thread Nix
On 23 Mar 2006, Daniel Pittman uttered the following:
 The initramfs tool, which is mostly shared with Ubuntu, is less stupid.
 It uses mdadm and a loop to scan through the devices found on the
 machine and find what RAID levels are required, then builds the RAID
 arrays with mdrun.

That's much nicer.

 Unfortunately, it still doesn't transfer /etc/mdadm.conf to the
 initramfs, resulting in arrays changing position when constructed, to my
 annoyance.   So, stupid, but not as stupid as the oldest tools.

That surely can't be hard to fix, can it?
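
(Morally it's one cp in whatever script builds the image; the $DESTDIR
name below is made up -- I don't know what the Debian tool actually
calls its staging directory:)

mkdir -p "$DESTDIR/etc/mdadm"
cp /etc/mdadm/mdadm.conf "$DESTDIR/etc/mdadm/mdadm.conf"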

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: A random initramfs script

2006-03-17 Thread Nix
On Fri, 17 Mar 2006, Andre Noll murmured woefully:
 On 00:41, Nix wrote:
 
  So I downloaded iproute2-2.4.7-now-ss020116-try.tar.gz, but there
  seems to be a problem with errno.h:
 
 Holy meatballs that's ancient.
 
 It is the most recent version on the ftp server mentioned in the HOWTO.

OK, so I guess the howto is a bit out of date :/

 [uClibc]
 
 Alternatively, just suck down GCC from, say, 
 svn://gcc.gnu.org/svn/gcc/tags/gcc_3_4_5_release,
 or ftp.gnu.org, or somewhere, and point buildroot at that.
 
 Yep, there's a 'dl' directory which contains all downloads. One can
 download the tarballs from anywhere else to that directory. Seems to
 work now.

Ah, I just work from svn checkouts :) but they *are* twice the size of
a normal tarball, so I guess I understand using the tarballs instead :)

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: A random initramfs script

2006-03-15 Thread Nix
On Thu, 16 Mar 2006, Neil Brown wrote:
 On Wednesday March 15, [EMAIL PROTECTED] wrote:
 On 08:29, Nix wrote:
  Yeah, that would work. Neil's very *emphatic* about hardwiring the UUIDs of
  your arrays, though I'll admit that given the existence of --examine 
  --scan,
  I don't really see why. :)
 
 He likes to compare the situation with /etc/fstab. Nobody complains
 about having to edit /etc/fstab, so why keep people complaining about
 having to edit /etc/mdadm.conf?
 
 Indeed!  And if you plug in some devices off another machine for
 disaster recovery, you don't want another disaster because you
 assembled the wrong arrays.

Well, I can't have that go wrong because I assemble all of them :)

One thing I would like to know, though: I screwed up construction
of one of my arrays and forgot to give it a name, so every array
with a V1 superblock has a name *except one*.

Is there a way to change the name after array creation? (Another
overloading of --grow, perhaps?)

(I'm still quite new to md so rather queasy about `just trying'
things like this with active arrays containing critical data.
I suppose I should build a test array on a sparse loopback mount
or something...)
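
(Something like the below, probably: sparse files plus loop devices, so
it costs nearly nothing. The names, sizes and loop numbers are
arbitrary, and --name needs a v1 superblock, which is the point of the
exercise:)

dd if=/dev/zero of=/tmp/d0 bs=1M count=1 seek=127   # ~128Mb sparse file
dd if=/dev/zero of=/tmp/d1 bs=1M count=1 seek=127
losetup /dev/loop0 /tmp/d0
losetup /dev/loop1 /tmp/d1
# (mknod /dev/md9 b 9 9 first if the node doesn't exist)
mdadm --create /dev/md9 --level=1 --raid-devices=2 \
      --metadata=1.0 --name=scratch /dev/loop0 /dev/loop1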

 I would like an md superblock to be able to contain some indication of
 the 'name' of the machine which is meant to host the array, so that
 once a machine knows its own name, it can automatically find and mount
 its own arrays,

... and of course you could request the mounting of some other machine's
arrays if you cart disks around between machines :)

Seems like a good idea.

-- 
`Come now, you should know that whenever you plan the duration of your
 unplanned downtime, you should add in padding for random management
 freakouts.'
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


A random initramfs script

2006-03-14 Thread Nix
In the interests of pushing people away from in-kernel autodetection, I
thought I'd provide the initramfs script I just knocked up to boot my
RAID+LVM system. It's had a whole four days of testing so it must
work. :)

It's being used to boot a system that boots from RAID-1 and has almost
everything else on a pair of RAID-5 arrays (not quite everything, as the
disks are wildly different sizes, so I was left with a 20Gb slice at the
end of the largest disk that I'm swapping onto and using for data that I
don't care about).

It has a number of improvements over the initramfs embedded in the
script that comes with mdadm:

 - It handles LVM2 as well as md (obviously if you boot off RAID
   you still have to boot off RAID1, but /boot can be a RAID1
   filesystem of its own now, with / in LVM, on RAID, or both
   at once)
 - It fscks / before mounting it
 - If anything goes wrong, it drops you into an emergency shell
   in the rootfs, from where you have all the power of ash
   (with hardly any builtin commands), plus lvm and mdadm to
   diagnose your problem :) you can't do *that* with in-
   kernel array autodetection!
 - it supports arguments `rescue', to drop into /bin/ash
   instead of init after mounting the real root filesystem,
   and `emergency', to drop into a shell on the initramfs
   before doing *anything*.
 - It supports root= and init= arguments, although for
   arcane reasons to do with LILO suckage you need to pass
   the root argument as `root=LABEL=/dev/some/device',
   or LILO will helpfully transform it into a device number,
   which is rarely useful if the device name is, say,
   /dev/emergency-volume-group/root ;) right now, if you
   don't pass root=, it tries to mount /dev/raid/root after
   initializing all the RAID arrays and LVM VGs it can.
 - it doesn't waste memory. initramfs isn't like initrd:
   if you just chroot into the new root filesystem, the
   data in the initramfs *stays around*, in *nonswappable*
   kernel memory. And it's not gzipped by that point, either!

The downsides:

 - it needs a very new busybox, from Subversion after the start of
   this year: I'm using svn://busybox.net/trunk/busybox revision 14406,
   and a 2.6.12+ kernel with sysfs and hotplug support; this is
   because it populates /dev with the `mdev' mini-udev tool inside
   busybox, and switches root filesystems with the `switch_root'
   tool, which chroots only after erasing the entire contents
   of the initramfs (taking *great* care not to recurse off that
   filesystem!)
 - if you link against uClibc (recommended), you need a CVS
   uClibc too (i.e., one newer than 0.9.27).
 - it doesn't try to e.g. set up the network, so it can't do really
   whizzy things like mount a root filesystem situated on a network
   block device on some other host: if you want to do something like
   that you've probably already written a script to do it long ago
 - the init script's got a few too many things hardwired still,
   like the type of the root filesystem. I expect it's short
   enough to easily hack up if you need to :)
 - you need an /etc/mdadm.conf and an /etc/lvm/lvm.conf, both taken
   by default from the system you built the kernel on: personally
   I'd recommend a really simple one with no device= lines, like

DEVICE partitions
ARRAY /dev/md0 UUID=some:long:uuid:here
ARRAY /dev/md1 UUID=another:long:uuid:here
ARRAY /dev/md2 UUID=yetanother:long:uuid:here
...

One oddity, also: after booting with this, I see some strange results
from --examine --scan with mdadm-2.3.1:

loki:/root# mdadm --examine --scan
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=3a51b74f:8a759fe7:8520304c:3adbceb1
ARRAY /dev/?? level=raid5 metadata=1 num-devices=3 UUID=a5a6cad42c:7fdc0788:a409b919:2ed3bf name=large
ARRAY /dev/?? level=raid5 metadata=1 num-devices=3 UUID=fe44916da1:09857680:07fb812e:e33b5a
loki:/root# ls -l /dev/md*
brw-rw---- 1 root disk 9, 0 Mar 14 20:10 /dev/md0
brw-rw---- 1 root disk 9, 1 Mar 14 20:10 /dev/md1
brw-rw---- 1 root disk 9, 2 Mar 14 20:10 /dev/md2

This is decidedly peculiar because the kernel said it was using md1 and
md2 on the initramfs, and the device numbers are surely right:

raid5: device sda6 operational as raid disk 0
raid5: device hdc5 operational as raid disk 2
raid5: device sdb6 operational as raid disk 1
raid5: allocated 3155kB for md1
raid5: raid level 5 set md1 active with 3 out of 3 devices, algorithm 2
[...]
raid5: device sdb7 operational as raid disk 0
raid5: device hda5 operational as raid disk 2
raid5: device sda7 operational as raid disk 1
raid5: allocated 3155kB for md2
raid5: raid level 5 set md2 active with 3 out of 3 devices, algorithm 2


Anyway, without further ado, here's usr/init:

#!/bin/sh
#
# init --- locate and mount root filesystem
#  By Nix [EMAIL PROTECTED].
#
#  Placed in the public domain.
#

export PATH=/sbin:/bin

/bin/mount -t proc proc /proc
/bin/mount -t sysfs sysfs /sys
CMDLINE=`cat /proc/cmdline`

# Populate /dev from /sys
/bin/mount -t tmpfs