Re: Got raid10 assembled wrong - how to fix?

2008-02-13 Thread Michael Tokarev
George Spelvin wrote:
 I just discovered (the hard way, sigh, but not too much data loss) that a
 4-drive RAID 10 array had the mirroring set up incorrectly.
 
 Given 4 drives A, B, C and D, I had intended to mirror A-C and B-D,
 so that I could split the mirror and run on either (A,B) or (C,D).
 
 However, it turns out that the mirror pairs are A-B and C-D.  So
 pulling both A and B off-line results in a non-functional array.
 
 So basically what I need to do is to decommission B and C, and rebuild
 the array with them swapped: A, C, B, D.
 
 Can someone tell me if the following incantation is correct?
 
 mdadm /dev/mdX -f /dev/B -r /dev/B
 mdadm /dev/mdX -f /dev/C -r /dev/C
 mdadm --zero-superblock /dev/B
 mdadm --zero-superblock /dev/C
 mdadm /dev/mdX -a /dev/C
 mdadm /dev/mdX -a /dev/B

That should work.

But I think you'd better just physically swap the drives instead -
this way, no rebuild of the array will be necessary, and your data
will stay safe the whole time.
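If you do go the rebuild route, the two swaps can be done one at a time, waiting for recovery to finish in between, so the array keeps redundancy throughout. A rough sketch (device names are placeholders, and `wait_for_recovery` is a hypothetical helper of mine, not an mdadm feature; it takes an alternate mdstat path so it can be tried without root):

```shell
# Hypothetical helper: block until /proc/mdstat no longer shows a
# resync/recovery in progress.
wait_for_recovery() {
  while grep -Eq '(resync|recovery) *=' "${1:-/proc/mdstat}"; do
    sleep 30
  done
}

# Intended usage (commented out; mdX, B and C are placeholders):
#   mdadm /dev/mdX -f /dev/C -r /dev/C
#   mdadm --zero-superblock /dev/C
#   mdadm /dev/mdX -a /dev/C
#   wait_for_recovery
#   mdadm /dev/mdX -f /dev/B -r /dev/B
#   mdadm --zero-superblock /dev/B
#   mdadm /dev/mdX -a /dev/B
```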

/mjt
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: transferring RAID-1 drives via sneakernet

2008-02-13 Thread Michael Tokarev
Jeff Breidenbach wrote:
 It's not a RAID issue, but make sure you don't have any duplicate volume
 names.  According to Murphy's Law, if there are two / volumes, the wrong
 one will be chosen upon your next reboot.
 
 Thanks for the tip. Since I'm not using volumes or LVM at all, I should be
 safe from this particular problem.

If you don't use names, you use numbers - like md0, md10, etc.
Those numbers effectively ARE names, so they must be different too.

There's more to this topic, much more.

There are different ways to start (assemble) arrays.  I know at
least 4: kernel autodetection; mdadm with the arrays listed in
mdadm.conf; mdadm with an empty mdadm.conf, using the 'homehost'
parameter (assemble all arrays belonging to this host); and the
mdrun utility.  Also, some arrays may be assembled during the
initrd/initramfs stage, and some after...

The best is either mdadm with something in mdadm.conf, or mdadm with
homehost.  Note that with neither of these methods will your foreign
array(s) be assembled; you will have to do it manually - which is much
better than screwing things up by trying to mix-n-match pieces of the
two systems.  You'll just have to figure out the device nodes of your
foreign disks and issue an appropriate command, like this:

  mdadm --assemble /dev/md10 /dev/sdc1 /dev/sdd1 ...

using a not-yet-taken mdN number and the right device nodes for your
disks/partitions.
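Picking a free mdN number can be automated by scanning /proc/mdstat. A small sketch (the helper name is my own; it takes an alternate mdstat path so it can be tried without root):

```shell
# Hypothetical helper: print the first mdN number not listed in mdstat.
next_free_md() {
  used=$(awk '/^md[0-9]+ :/ { sub(/^md/, "", $1); print $1 }' "${1:-/proc/mdstat}")
  n=0
  while echo "$used" | grep -qx "$n"; do n=$((n + 1)); done
  echo "$n"
}

# Intended usage (commented out; sdc1/sdd1 are the foreign members):
#   mdadm --assemble /dev/md$(next_free_md) /dev/sdc1 /dev/sdd1
```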

If you want to keep the disks here, you can add the array info to
mdadm.conf, or refresh the superblock so it records the new homehost.

But if you're using kernel autodetection or mdrun... well, I for
one can't help here -- your arrays will be numbered/renumbered
by chance...

/mjt


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Michael Tokarev
Janek Kozicki wrote:
 Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)
 
 2. How can I delete that damn array so it doesn't hang my server up in a 
 loop?
 
 dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

This works provided the superblocks are at the beginning of the
component devices.  Which is not the case by default (0.90
superblocks sit at the end of the components), nor with 1.0 superblocks.

  mdadm --zero-superblock /dev/sdb1

is the way to go here.
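The reason the dd-over-the-start approach misses a 0.90 superblock: that format lives in the last 64-128 KiB of the component device, at an offset rounded down to a 64 KiB boundary. A toy calculation of that offset (my own sketch of the layout rule; sizes in KiB):

```shell
# Where a 0.90 superblock sits, given the component size in KiB:
# round the size down to a 64 KiB boundary, then step back 64 KiB.
sb_offset_kib() {
  size_kib=$1
  echo $(( size_kib / 64 * 64 - 64 ))
}

sb_offset_kib 1048576   # a 1 GiB partition: superblock at KiB offset 1048512
```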

 I'm not using mdadm.conf at all. Everything is stored in the
 superblock of the device. So if you don't erase it - info about raid
 array will be still automatically found.

That's wrong, as you need at least something to identify the array
components.  The UUID is the most reliable and commonly used.  You
assemble the arrays as

  mdadm --assemble /dev/md1 --uuid=123456789

or something like that anyway.  If not, your arrays may not start
properly in case you've shuffled disks (e.g. replaced a bad one), or
your disks were renumbered after a kernel or other hardware change,
and so on.  The most convenient place to store that info is mdadm.conf.
Here, it looks just like:

DEVICE partitions
ARRAY /dev/md1 UUID=4ee58096:e5bc04ac:b02137be:3792981a
ARRAY /dev/md2 UUID=b4dec03f:24ec8947:1742227c:761aa4cb

By default mdadm records additional information which helps to
diagnose possible problems, namely:

ARRAY /dev/md5 level=raid5 num-devices=4 UUID=6dc4e503:85540e55:d935dea5:d63df51b

This extra info isn't necessary for mdadm to work (but the UUID is),
yet it comes in handy sometimes.
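If you generate the fuller lines but prefer the minimal UUID-only form, a filter like this can strip the extra fields (the function name is mine; the input is ARRAY/DEVICE lines in the style mdadm prints):

```shell
# Hypothetical filter: keep only "ARRAY <device> UUID=..." from
# mdadm --examine --scan style output, dropping level/num-devices/name.
minimal_conf() {
  awk '/^ARRAY/ {
         printf "%s %s", $1, $2
         for (i = 3; i <= NF; i++) if ($i ~ /^UUID=/) printf " %s", $i
         print ""
       }
       /^DEVICE/ { print }' "$@"
}
```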

/mjt


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 Michael Tokarev wrote:
 Janek Kozicki wrote:
 Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)

 2. How can I delete that damn array so it doesn't hang my server up
 in a loop?
 dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

 This works provided the superblocks are at the beginning of the
 component devices.  Which is not the case by default (0.90
 superblocks, at the end of components), or with 1.0 superblocks.

   mdadm --zero-superblock /dev/sdb1
 
 Would that work even if he doesn't update his mdadm.conf inside the
 /boot image? Or would mdadm attempt to build the array according to the
 instructions in mdadm.conf? I expect that it might depend on whether the
 instructions are given in terms of UUID or in terms of devices.

After zeroing superblocks, mdadm will NOT assemble the array,
regardless if using UUIDs or devices or whatever.  In order
to assemble the array, all component devices MUST have valid
superblocks and the superblocks must match each other.

mdadm --assemble in initramfs will simply fail to do its work.

/mjt


Re: Auto generation of mdadm.conf

2008-02-05 Thread Michael Tokarev
Janek Kozicki wrote:
 Michael Tokarev said: (by the date of Tue, 05 Feb 2008 16:52:18 +0300)
 
 Janek Kozicki wrote:
 I'm not using mdadm.conf at all. 
 That's wrong, as you need at least something to identify the array
 components. 
 
 I was afraid of that ;-) So, is that a correct way to automatically
 generate a correct mdadm.conf ? I did it after some digging in man pages:
 
   echo 'DEVICE partitions' > mdadm.conf
   mdadm --examine --scan --config=mdadm.conf >> ./mdadm.conf
 
 Now, when I do 'cat mdadm.conf' i get:
 
  DEVICE partitions
 ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=3 UUID=75b0f87879:539d6cee:f22092f4:7a6e6f name='backup':0
 ARRAY /dev/md/2 level=raid1 metadata=1 num-devices=3 UUID=4fd340a6c4:db01d6f7:1e03da2d:bdd574 name=backup:2
 ARRAY /dev/md/1 level=raid5 metadata=1 num-devices=3 UUID=22f22c3599:613d5231:d407a655:bdeb84 name=backup:1

Hmm.  I wonder why the name for md/0 is in quotes, while others are not.

 Looks quite reasonable. Should I append it to /etc/mdadm/mdadm.conf ?

Probably... see below.

 This file currently contains: (commented lines are left out)
 
   DEVICE partitions
   CREATE owner=root group=disk mode=0660 auto=yes
   HOMEHOST system
   MAILADDR root
 
 This is the default content of /etc/mdadm/mdadm.conf on fresh debian
 etch install.

But now I wonder HOW your arrays get assembled in the first place.

Let me guess... mdrun?  Or maybe in-kernel auto-detection?

The thing is that mdadm will NOT assemble your arrays given this
config.

If you have your disk/controller and md drivers built into the
kernel, AND the partitions marked as linux raid autodetect, the
kernel may assemble them right at boot.  But I don't remember
whether the kernel will even consider v.1 superblocks for its auto-
assembly.  In any case, don't rely on the kernel to do this
work: the in-kernel assembly code is very simplistic and only works
up to the moment when anything changes/breaks.  It's almost
the same code as was in the old raidtools...

Another possibility is the mdrun utility (a shell script) shipped
with Debian's mdadm package.  It's deprecated now, but still
provided for compatibility.  mdrun is even worse: it will
try to assemble ALL arrays found, giving them random names
and numbers, not handling failures correctly, and failing
badly in cases where, e.g., a foreign disk is found which
happens to contain a valid raid superblock somewhere...

Well.  There's a third possibility: mdadm can assemble
all arrays automatically (even if not listed explicitly in
mdadm.conf) using homehost (only available with v.1 superblocks).
I haven't tried this option yet, so I don't remember how it
works.  From the mdadm(8) manpage:

   Auto Assembly
   When --assemble is used with --scan and no devices  are  listed,  mdadm
   will  first  attempt  to  assemble  all the arrays listed in the config
   file.

   If a homehost has been specified (either in the config file or  on  the
   command line), mdadm will look further for possible arrays and will try
   to assemble anything that it finds which is tagged as belonging to  the
   given  homehost.   This is the only situation where mdadm will assemble
   arrays without being given specific device name or identity information
   for the array.

   If  mdadm  finds a consistent set of devices that look like they should
   comprise an array, and if the superblock is tagged as belonging to  the
   given  home host, it will automatically choose a device name and try to
   assemble the array.  If the array uses version-0.90 metadata, then  the
   minor  number as recorded in the superblock is used to create a name in
   /dev/md/ so for example /dev/md/3.  If the array uses version-1
   metadata, then the name from the superblock is used to similarly
   create a name in /dev/md (the name will have any 'host' prefix
   stripped first).

So... probably this is how your arrays are being assembled, since you
do have HOMEHOST in your mdadm.conf...  Looks like it should work, after
all... ;)  And in this case there's no need to specify additional array
information in the config file.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Michael Tokarev
Linda Walsh wrote:
 
 Michael Tokarev wrote:
 Unfortunately a UPS does not *really* help here.  Because unless
 it has a control program which properly shuts the system down on the loss
 of input power, and the battery really has the capacity to power the
 system while it's shutting down (anyone tested this? 
 
 Yes.  I must say, I am not connected to or paid by APC.
 
 With a new UPS?
 and after a year of use, when the battery is not new?), -- unless
 the UPS actually has the capacity to shut the system down, it will cut
 the power at an unexpected time, while the disk(s) still have dirty
 caches...
 
 If you have a SmartUPS by APC, there is a freeware daemon that monitors
[...]

Good stuff.  I knew at least SOME UPSes are good... ;)
Too bad I rarely see such gear in use by regular
home users...
[]
 Note also that with linux software raid barriers are NOT supported.
 --
 Are you sure about this?  When my system boots (I used to have
 3 new IDE drives, and one older one), XFS checked each drive for barriers
 and turned off barriers for any disk that didn't support them.  ... or
 are you referring specifically to linux-raid setups?

I'm referring specifically to linux-raid setups (software raid).
md devices don't support barriers, for a very simple
reason: once more than one disk drive is involved, the md layer
can't guarantee ordering ACROSS drives.  The problem is
that in case of power loss during writes, when an array needs
recovery/resync (at least of the parts which were being written,
if bitmaps are in use), the md layer will choose an arbitrary drive
as the master and copy its data to the other drive (speaking
of the simplest case, a 2-drive raid1 array).  But the thing
is that one drive may have the last two barriers written (I mean
the data that was associated with the barriers), and the
other neither of the two - in two different places.  And
hence we may see quite some inconsistency here.

This is regardless of whether the underlying component devices
support barriers or not.

 Would it be possible on boot to have xfs probe the Raid array,
 physically, to see if barriers are really supported (or not), and disable
 them if they are not (and optionally disabling write caching, but that's
 a major performance hit in my experience).

Xfs already probes the devices as you describe, exactly the
same way as you've seen with your IDE disks, and disables
barriers.

The question and confusion was about what happens when the
barriers are disabled (provided, again, that we don't rely
on UPS and other external things).  As far as I understand,
when barriers are working properly, xfs should be safe wrt
power losses (still a bit unsure about this).  Now, when
barriers are turned off (for whatever reason), is it still
as safe?  I don't know.  Does it use regular cache flushes
in place of barriers in that case (which ARE supported by
the md layer)?

Generally, it has been said numerous times that XFS is not
powercut-friendly, and that it should only be used where everything
is stable, including power.  Hence I'm afraid to deploy it
where I know the power is not stable (we have about 70 such
sites here, with servers at each, where they don't always
replace UPS batteries in time - ext3 has never crashed there so
far, while ext2 did).

Thanks.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Moshe Yudkowsky wrote:
[]
 But that's *exactly* what I have -- well, 5GB -- and which failed. I've
 modified /etc/fstab to use data=journal (even on root, which I
 thought wasn't supposed to work without a grub option!) and I can
 power-cycle the system and bring it up reliably afterwards.

Note also that data=journal effectively doubles the write time.
It's a bit faster for small writes (because all writes first go
into the journal, i.e. into the same place, so no seeking
is needed), but with larger writes the journal fills up, and the
data in it has to be written out to its proper place to free
space for new data.  At that point, if you keep writing, you'll
see more than 2x speed degradation, because of a) double writes
and b) more seeking.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Moshe Yudkowsky wrote:
[]
 If I'm reading the man pages, Wikis, READMEs and mailing lists correctly
 --  not necessarily the case -- the ext3 file system uses the equivalent
 of data=journal as a default.

ext3 defaults to data=ordered, not data=journal.  ext2 has no
journal at all.
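For reference, the journalling mode is selected per mount; a hypothetical fstab line forcing full data journalling on an ext3 filesystem would look like this (device and mount point are placeholders):

```
/dev/md2  /var  ext3  defaults,data=journal  0  2
```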

 The question then becomes what data scheme to use with reiserfs on the

I'd say don't use reiserfs in the first place ;)

 Another way to phrase this: unless you're running data-center grade
 hardware and have absolute confidence in your UPS, you should use
 data=journal for reiserfs and perhaps avoid XFS entirely.

By the way, even if you do have a good UPS, there should be some
control program for it, to shut your system down properly when the
UPS loses AC power.  So far, I've seen no such programs...

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Eric Sandeen wrote:
 Moshe Yudkowsky wrote:
 So if I understand you correctly, you're stating that currently the most 
 reliable fs in its default configuration, in terms of protection against 
 power-loss scenarios, is XFS?
 
 I wouldn't go that far without some real-world poweroff testing, because
 various fs's are probably more or less tolerant of a write-cache
 evaporation.  I suppose it'd depend on the size of the write cache as well.

I know of no filesystem which is, as you say, tolerant of write-cache
evaporation.  If a drive says the data is written but in fact it's
not, it's a Bad Drive (tm) and it should be thrown away immediately.
Fortunately, almost all modern disk drives don't lie this way.  The
only thing the filesystem needs is to tell the drive to flush
its cache at the appropriate time, and to actually wait for the flush
to complete.  Barriers (mentioned in this thread) are just another,
somewhat more efficient, way to do so, but a normal cache
flush will do as well.  IFF write caching is enabled in the
first place, that is - note that with some workloads, write caching in
the drive actually makes write speed worse, not better - namely,
in the case of massive writes.

Speaking of XFS (and of ext3 with write barriers enabled) -
I'm confused here as well, and the answers to my questions didn't
help either.  As far as I understand, XFS only uses barriers,
not regular cache flushes; hence without write barrier support
(which linux software raid lacks, as explained
elsewhere) it's unsafe -- probably the same applies to ext3
with barrier support enabled.  But I'm not sure I got it all
correctly.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-02-04 Thread Michael Tokarev
John Stoffel wrote:
[]
 C'mon, how many of you are programmed to believe that 1.2 is better
 than 1.0?  But when they're not different, just different
 placements, then it's confusing.

Speaking of more is better thing...

There were quite a few bugs fixed in recent months wrt version-1
superblocks - both in the kernel and in mdadm.  The 0.90 format has been
stable for a very long time, and unless you're hitting its limits
(namely, max 26 drives in an array, no homehost field), there's
nothing which makes v1 superblocks better than 0.90 ones.

In my view, better = stable first, faster/easier/whatever second.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Eric Sandeen wrote:
[]
 http://oss.sgi.com/projects/xfs/faq.html#nulls
 
 and note that recent fixes have been made in this area (also noted in
 the faq)
 
 Also - the above all assumes that when a drive says it's written/flushed
 data, that it truly has.  Modern write-caching drives can wreak havoc
 with any journaling filesystem, so that's one good reason for a UPS.  If

Unfortunately a UPS does not *really* help here.  Because unless
it has a control program which properly shuts the system down on the loss
of input power, and the battery really has the capacity to power the
system while it's shutting down (has anyone tested this?  With a new UPS?
And after a year of use, when the battery is not new?), -- unless
the UPS actually has the capacity to shut the system down, it will cut
the power at an unexpected time, while the disk(s) still have dirty
caches...

 the drive claims to have metadata safe on disk but actually does not,
 and you lose power, the data claimed safe will evaporate, there's not
 much the fs can do.  IO write barriers address this by forcing the drive
 to flush order-critical data before continuing; xfs has them on by
 default, although they are tested at mount time and if you have
 something in between xfs and the disks which does not support barriers
 (i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that with linux software raid barriers are NOT supported.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 I've been reading the draft and checking it against my experience.
 Because of local power fluctuations, I've just accidentally checked my
 system:  My system does *not* survive a power hit. This has happened
 twice already today.
 
 I've got /boot and a few other pieces in a 4-disk RAID 1 (three running,
 one spare). This partition is on /dev/sd[abcd]1.
 
 I've used grub to install grub on all three running disks:
 
 grub --no-floppy <<EOF
 root (hd0,1)
 setup (hd0)
 root (hd1,1)
 setup (hd1)
 root (hd2,1)
 setup (hd2)
 EOF
 
 (To those reading this thread to find out how to recover: According to
 grub's map option, /dev/sda1 maps to hd0,1.)

I usually install all the drives identically in this regard -
each set up to be treated as the first bios disk (disk 0x80).  As already
pointed out in this thread, not all BIOSes are able to boot
off a second or third disk, so if your first disk (sda) fails,
your only option is to put your sdb into the place of sda and boot
from it - and for that, grub needs to think it's the first boot
drive too.

By the way, lilo works more easily and more reliably here.
You just install a standard MBR (lilo has one too) which just
boots the active partition, install lilo onto the
raid array, and tell it NOT to do anything fancy with raid
at all (raid-extra-boot none).  But for this to work, you
have to have identical partitions with identical offsets -
at least for the boot partitions.
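A sketch of the corresponding lilo.conf fragment (device names and kernel path are placeholders; raid-extra-boot is the option named above):

```
boot=/dev/md0          # install lilo onto the raid1 array itself
raid-extra-boot=none   # don't do anything fancy with the component disks
image=/boot/vmlinuz
    label=linux
    root=/dev/md0
    read-only
```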

 After the power hit, I get:
 
 Error 16
 Inconsistent filesystem mounted

But did it actually mount it?

 I then tried to boot up on hda1,1, hdd2,1 -- none of them worked.

Which is in fact expected after the above.  You have 3 identical
copies (thanks to raid) of your boot filesystem, all 3 equally
broken.  Whenever it boots, it assembles the same /boot raid array -
regardless of whether you boot off hda, hdb or hdc.

 The culprit, in my opinion, is the reiserfs file system. During the
 power hit, the reiserfs file system of /boot was left in an inconsistent
 state; this meant I had up to three bad copies of /boot.

I've never seen any problem with ext[23] wrt unexpected power loss so
far, running several hundreds of different systems, some since 1998, some
since 2000.  Sure, there were several inconsistencies, and sometimes (maybe
once or twice) some minor data loss (only a few newly created files were
lost), but the most serious case was finding a few items in lost+found
after an fsck - and that was ext2; I've never seen that with ext3.

What's more, I tried hard to force a power failure at an unexpected time,
by doing massive write operations and cutting power in the middle - I was
never able to trigger any problem this way, at all.

In any case, even if ext[23] is somewhat damaged, it can still be
mounted - access to some files may return I/O errors (in the parts
where it's really damaged), but the rest will work.

On the other hand, I had several immediate issues with reiserfs.  That
was a long time ago, when the filesystem was first included in the
mainline kernel, so it doesn't reflect the current situation.  Yet even
at that stage, reiserfs had been declared stable by its authors.  Issues
were trivially triggerable by cutting the power at an unexpected
time, and fsck didn't help, several times.

So I tend to avoid reiserfs - due to my own experience, and due to
numerous problems elsewhere.

 Recommendations:
 
 1. I'm going to try adding a data=journal option to the reiserfs file
 systems, including the /boot. If this does not work, then /boot must be
 ext3 in order to survive a power hit.

By the way, if your /boot is a separate filesystem (i.e., there's nothing
more on it), I see absolutely zero reason for it to crash.  /boot
is modified VERY rarely (only when installing a kernel), and only while
it's being modified is there a chance for it to be damaged somehow.  During
the rest of the time, it's constant, and any power cut should not hurt
it at all.  If reiserfs shows such behaviour even for a non-modified
filesystem...

 2. We discussed what should be on the RAID1 bootable portion of the
 filesystem. True, it's nice to have the ability to boot from just the
 RAID1 portion. But if that RAID1 portion can't survive a power hit,
 there's little sense. It might make a lot more sense to put /boot on its
 own tiny partition.

Hehe.

/boot doesn't matter, really.  A separate /boot has been used for 3 purposes:

1) to work around bios 1024th-cylinder issues (long gone with LBA)
2) to be able to put the rest of the system onto a filesystem/raid/lvm/etc.
 unsupported by the bootloader.  E.g., lilo didn't support
 reiserfs (and still doesn't with tail packing enabled), so if you
 want to use reiserfs for your root fs, put /boot into a separate
 ext2fs.  The same is true for raid - you can put the rest of the
 system onto a raid5 array (unsupported by grub/lilo), and in order
 to boot, create a small raid1 (or any other supported level) /boot.
3) to keep it as non-volatile as possible - an area of the
 disk which never changes (except in a few very rare cases).  For
 

Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 Michael Tokarev wrote:
 
 Speaking of repairs.  As I already mentioned, I always use small
 (256M..1G) raid1 array for my root partition, including /boot,
 /bin, /etc, /sbin, /lib and so on (/usr, /home, /var are on
 their own filesystems).  And I had the following scenarios
 happened already:
 
 But that's *exactly* what I have -- well, 5GB -- and which failed. I've
 modified /etc/fstab to use data=journal (even on root, which I
 thought wasn't supposed to work without a grub option!) and I can
 power-cycle the system and bring it up reliably afterwards.
 
 So I'm a little suspicious of this theory that /etc and others can be on
 the same partition as /boot in a non-ext3 file system.

If even your separate /boot failed (and it should NEVER fail), what to
say about the rest?

I mean, if you save your /boot, what help will it be to you if
your root fs is damaged?

That's why I said /boot is mostly irrelevant.

Well.  You can have some recovery stuff in your initrd/initramfs - that's
for sure (and for that to work, you can make your /boot more reliable by
giving it a separate filesystem).  But if you go this route, it's better
to boot off some recovery CD instead of attempting recovery with the very
limited toolset available in your initramfs.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Michael Tokarev
Moshe Yudkowsky wrote:
[]
 Mr. Tokarev wrote:
 
 By the way, on all our systems I use small (256Mb for small-software systems,
 sometimes 512M, but 1G should be sufficient) partition for a root filesystem
 (/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all...
 ... doing [it]
 this way, you always have all the tools necessary to repair a damaged system
 even in case your raid didn't start, or you forgot where your root disk is
 etc etc.
 
 An excellent idea. I was going to put just /boot on the RAID 1, but
 there's no reason why I can't add a bit more room and put them all
 there. (Because I was having so much fun on the install, I'm using 4GB
 that I was going to use for swap space to mount a base install, and I'm
 working from there to build the RAID. Same idea.)
 
 Hmmm... I wonder if this more expansive /bin, /sbin, and /lib causes
 hits on the RAID1 drive which ultimately degrade overall performance?
 /lib is hit only at boot time to load the kernel, I'll guess, but /bin
 includes such common tools as bash and grep.

You don't care about the speed of your root filesystem.  Note there are
two speeds - write and read.

You only write to root (including /bin and /lib and so on) during
software (re)installs and during some configuration work (writing
/etc/passwd and the like).  The first is very infrequent, and both
need only a few writes - so write speed isn't important.

Read speed is also not that important, because the most commonly used
stuff from there will be cached anyway (like libc.so, bash and
grep), and again, for reading such tiny stuff it doesn't matter
whether the raid is fast or slow.

What you do care about is the speed of the devices where your large,
commonly accessed/modified files reside - such as video files,
especially when you want streaming video.  And even here,
unless you have special speed requirements, you will not
notice any difference between slow and fast raid levels.

For typical filesystem usage, raid5 works well for both reads
and (cached, delayed) writes.  It's workloads like databases
where raid5 performs badly.

What you also care about is your data integrity.  It's not much fun
to reinstall a system or lose your data in case
something goes wrong, and it's best to have recovery tools as
easily available as possible.  Plus, the amount of space you need.

 Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
 needed on the root fs.
 
 Another interesting idea. I'm not familiar with using tmpfs (no need,
 until now); but I wonder how you create the devices you need when you're
 doing a rescue.

When you start udev, your /dev will be on tmpfs.

/mjt


Re: WRONG INFO (was Re: In this partition scheme, grub does not find md information?)

2008-01-30 Thread Michael Tokarev
Peter Rabbitson wrote:
 Moshe Yudkowsky wrote:
 over the other. For example, I've now learned that if I want to set up
 a RAID1 /boot, it must actually be 1.2 or grub won't be able to read
 it. (I would therefore argue that if the new version ever becomes
 default, then the default sub-version ought to be 1.2.)
 
 In the discussion yesterday I myself made a serious typo, that should
 not spread. The only superblock version that will work with current GRUB
 is 1.0 _not_ 1.2.

Ghrrm.  1.0, or 0.9.  0.9 is still the default with mdadm.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-30 Thread Michael Tokarev
Keld Jørn Simonsen wrote:
[]
 Ugh.  2-drive raid10 is effectively just a raid1.  I.e, mirroring
 without any striping. (Or, backwards, striping without mirroring).
 
 uhm, well, I did not understand: (Or, backwards, striping without
 mirroring).  I don't think a 2 drive vanilla raid10 will do striping.
 Please explain.

I was referring to raid0+1 here - a mirror of stripes.  Which makes
no sense on its own, but when we create such a thing on only 2 drives,
it becomes just raid0...  Backwards as in raid1+0 vs raid0+1.

This is just to show that various raid levels, in corner cases,
tend to transform into one another.

 Pretty much like with raid5 of 2 disks - it's the same as raid1.
 
 I think in raid5 of 2 disks, half of the chunks are parity chunks which
 are evenly distributed over the two disks, and the parity chunk is the
 XOR of the data chunk. But maybe I am wrong. Also the behaviour of such
 a raid5 is different from a raid1 as the parity chunk is not used as
 data.

With N-disk raid5, parity in a row is calculated by XORing together
data from all the rest of the disks (N-1), ie, P = D1 ^ ... ^D(N-1).

In case of 2-disk raid5 (it's also a corner case), the above formula
becomes just P = D1.  So, parity block in each row contains exactly
the same data as the data block, effectively turning the whole thing into
a raid1 of two disks.  Sure, in raid5 the parity blocks are called just that -
parity, but in reality that parity is THE SAME as data (again, in
case of only 2-disk raid5).

 I am not sure what vanilla linux raid10 (near=2, far=1)
 has of properties. I think it can run with only 1 disk, but I think it
 number of copies should be <= number of disks, so no.
 
 I have a clear understanding that in a vanilla linux raid10 (near=2, far=1)
 you can run with one failing disk, that is with only one working disk.
 Am I wrong?

In fact, with raid10 (of all sorts), it's not only the number of drives
that can fail that matters, but also WHICH drives fail.  In classic
raid10:

DiskA   DiskB  DiskC  DiskD
  0   0  1  1
  2   2  3  3
  4   4  5  5
  

(where numbers are the data blocks), you can have only 2 working
disks (ie, 2 failed), but only from different pairs.  You can't
have A and B failed and C and D working for example - you'll lose
half the data and thus the filesystem.  You can have A and C failed
however, or A and D, or BC, or BD.

You see - in the above example, all numbers (data blocks) should be
present at least once (after you pull a drive or two or more).  If
some number doesn't appear at all, your raid array is dead.

Now write out the layout you want to use like the above, and try
removing some drives, and see if you still have all numbers.

For example, with 3-disk linux raid10:

  A  B  C
  0  0  1
  1  2  2
  3  3  4
  4  5  5
  

We can't pull 2 drives anymore here.  Eg, pulling AB removes
0 and 3. Pulling BC removes 2 and 5.  AC = 1 and 4.

With 5-drive linux raid10:

   A  B  C  D  E
   0  0  1  1  2
   2  3  3  4  4
   5  5  6  6  7
   7  8  8  9  9
  10 10 11 11 12
   ...

AB can't be removed - losing 0 and 5.  AC CAN be removed, and
so can AD.  But not AE - losing 2 and 7.  And so on.

6-disk raid10 with 3 copies of each (near=3 with linux):

   A B C D E F
   0 0 0 1 1 1
   2 2 2 3 3 3

It can run as long as at least one disk from each triple
(ABC and DEF) is present.  Ie, you can lose up to 4 drives,
as long as that condition holds.  But if you lose a whole
triple - ABC or DEF - it can't work anymore.
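
The pull-a-drive checks above can be mechanized.  Here's a small sketch
(my own illustration, not from the thread), assuming the near-layout
placement rule that copy k of data block b lands on disk
(copies*b + k) mod N - which matches all the tables above:

```python
def survives(n_disks, failed, copies=2, n_blocks=60):
    # near-layout sketch: copy k of data block b sits on
    # disk (copies * b + k) % n_disks, for k = 0 .. copies-1.
    # The array survives if every block keeps at least one copy.
    return all(
        any((copies * b + k) % n_disks not in failed for k in range(copies))
        for b in range(n_blocks)
    )

# 4 disks A..D = 0..3: losing A,B kills the array; losing A,C does not
assert not survives(4, {0, 1})
assert survives(4, {0, 2})
# 5 disks: AC (and AD) are survivable, AE loses blocks 2 and 7
assert survives(5, {0, 2}) and survives(5, {0, 3})
assert not survives(5, {0, 4})
# 6 disks, 3 copies: four failures are fine while one disk of each
# triple (ABC, DEF) remains; losing a whole triple is fatal
assert survives(6, {0, 1, 3, 4}, copies=3)
assert not survives(6, {0, 1, 2}, copies=3)
```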

The same goes for raid5 and raid6, but they're symmetric --
any single (raid5) or double (raid6) disk failure is Ok.
The principle is this:

  raid5: P = D1^D2^D3^...^D(N-1)
so, you either have all Di (nothing to reconstruct), or
you have all but one Di AND P - in this case, missing Dm
can be recalculated as
  Dm = P^D1^...^D(m-1)^D(m+1)^...^D(N-1)
(ie, a XOR of all the remaining blocks including parity).
(exactly the same applies to raid4, because each row in
raid4 is identical to that of raid5, the difference is
that parity disk is different in each row in raid5, while
in raid4 it stays the same).
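
As a quick illustration of the recovery rule (plain Python, not md code;
the block contents are made up):

```python
import functools

def xor_blocks(blocks):
    # P = D1 ^ D2 ^ ... ^ D(N-1), byte by byte
    return bytes(functools.reduce(lambda x, y: x ^ y, col)
                 for col in zip(*blocks))

# one parity row of a 4-disk raid5: three data blocks plus their parity
d = [b"hello world 1234", b"quick brown fox.", b"sixteen bytes!!!"]
p = xor_blocks(d)

# lose D2: XOR of the parity with the surviving data blocks recovers it
assert xor_blocks([p, d[0], d[2]]) == d[1]

# the 2-disk corner case: P = D1, i.e. parity is a plain copy of the data
assert xor_blocks([b"some data block!"]) == b"some data block!"
```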

I won't write the formula for raid6 as it's somewhat more
complicated, but the effect is the same - any data block
can be reconstructed from any N-2 drives.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 Peter Rabbitson wrote:
 
 It is exactly what the names implies - a new kind of RAID :) The setup
 you describe is not RAID10 it is RAID1+0. As far as how linux RAID10
 works - here is an excellent article:
 http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
 
 Thanks. Let's just say that the md(4) man page was finally penetrating
 my brain, but the Wikipedia article helped a great deal. I had thought
 md's RAID10 was more standard.

It is exactly standard - when you create it with default settings
and with an even number of drives (2, 4, 6, 8, ...), it will be exactly
standard raid10 (or raid1+0, whatever) as described in various
places on the net.

But if you use an odd number of drives, or if you pass some fancy --layout
option, it will be laid out differently.  Still not suitable for lilo or
grub, at least in their current versions.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Keld Jørn Simonsen wrote:
 On Tue, Jan 29, 2008 at 09:57:48AM -0600, Moshe Yudkowsky wrote:
 In my 4 drive system, I'm clearly not getting 1+0's ability to use grub 
 out of the RAID10.  I expect it's because I used 1.2 superblocks (why 
 not use the latest, I said, foolishly...) and therefore the RAID10 -- 
 with even number of drives -- can't be read by grub. If you'd patch that 
 information into the man pages that'd be very useful indeed.
 
 If you have 4 drives, I think the right thing is to use a raid1 with 4
 drives, for your /boot partition. Then you can survive three disks
 crashing!

By the way, on all our systems I use small (256Mb for small-software systems,
sometimes 512M, but 1G should be sufficient) partition for a root filesystem
(/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all (usually
identical) drives - be it 4 or 6 or more of them.  The root filesystem does not
change often, or at least its write speed isn't that important.  But done
this way, you always have all the tools necessary to repair a damaged system
even in case your raid didn't start, or you forgot where your root disk is
etc etc.

But in this setup, /usr, /home, /var and so on should be separate partitions.
Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
necessary for the root fs.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Peter Rabbitson wrote:
[]
 However if you want to be so anal about names and specifications: md
 raid 10 is not a _full_ 1+0 implementation. Consider the textbook
 scenario with 4 drives:
 
 (A mirroring B) striped with (C mirroring D)
 
 When only drives A and C are present, md raid 10 with near offset will
 not start, whereas standard RAID 1+0 is expected to keep clunking away.

Ugh.  Yes, offset is a linux extension.

But md raid 10 with default, n2 (without offset), configuration will behave
exactly like in classic docs.

Again.  Linux md raid10 module implements standard raid10 as known in
all widely used docs.  And IN ADDITION, it can do OTHER FORMS, which
differ from the classic variant.  Pretty much like a hardware raid card
from a brand vendor, which probably implements its own variations of
standard raid levels.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Keld Jørn Simonsen wrote:
 On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote:
 Linux raid10 MODULE (which implements that standard raid10
 LEVEL in full) adds some quite.. unusual extensions to that
 standard raid10 LEVEL.  The resulting layout is also called
 raid10 in linux (ie, not giving new names), but it's not that
 raid10 (which is again the same as raid1+0) as commonly known
 in various literature and on the internet.  Yet raid10 module
 fully implements STANDARD raid10 LEVEL.
 
 My understanding is that you can have a linux raid10 of only 2
 drives, while the standard RAID 1+0 requires 4 drives, so this is a huge
 difference.

Ugh.  2-drive raid10 is effectively just a raid1.  I.e, mirroring
without any striping. (Or, backwards, striping without mirroring).

So to say, raid1 is just one particular configuration of raid10 -
with only one mirror.

Pretty much like with raid5 of 2 disks - it's the same as raid1.

 I am not sure what vanilla linux raid10 (near=2, far=1)
 has of properties. I think it can run with only 1 disk, but I think it

number of copies should be <= number of disks, so no.

 does not have striping capabilities. It would be nice to have more 
 info on this, eg in the man page. 

It's all in there really.  See md(4).  Maybe it's not that
verbose, but it's not a user's guide (as in: a large book),
after all.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 Michael Tokarev wrote:
 
 There are more-or-less standard raid LEVELS, including
 raid10 (which is the same as raid1+0, or a stripe on top
 of mirrors - note it does not mean 4 drives, you can
 use 6 - stripe over 3 mirrors each of 2 components; or
 the reverse - stripe over 2 mirrors of 3 components each
 etc).
 
 Here's a baseline question: if I create a RAID10 array using default
 settings, what do I get? I thought I was getting RAID1+0; am I really?

..default settings AND an even (4, 6, 8, 10, ...) number of drives.  It
will be standard raid10 or raid1+0, which is the same: as many stripes
of mirrored (2-copy) data as fit the number of disks.  With an odd
number of disks it obviously will be something else, not a standard
raid10.

 My superblocks, by the way, are marked version 01; my metadata in
 mdadm.conf asked for 1.2. I wonder what I really got. The real question

Ugh.  Another source of confusion.  In --metadata=1.2, the 1 stands
for the format, and the 2 stands for the placement.  So it's really
format version 1.  From mdadm(8):

  1, 1.0, 1.1, 1.2
 Use  the  new  version-1 format superblock.  This has few
 restrictions.   The  different  sub-versions  store   the
 superblock  at  different locations on the device, either
 at the end (for 1.0), at the start (for 1.1) or  4K  from
 the start (for 1.2).
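
The placement difference can be written down as a tiny helper (my
sketch, derived only from the man-page text above; the 1.0 end-of-device
rule is simplified - the real format places the superblock 8K-12K from
the end, 4K-aligned):

```python
def sb_offset(version, dev_size):
    # byte offset of a version-1 superblock, per the mdadm(8) text above
    if version == "1.1":
        return 0                     # at the very start of the device
    if version == "1.2":
        return 4096                  # 4K from the start
    if version == "1.0":
        # simplified: nearest 4K boundary at least 8K from the end
        return (dev_size - 8 * 1024) // 4096 * 4096
    raise ValueError("unknown v1 sub-version: " + version)

assert sb_offset("1.1", 10**9) == 0
assert sb_offset("1.2", 10**9) == 4096
```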


 in my mind now is why grub can't find the info, and either it's because
 of 1.2 superblocks or because of sub-partitioning of components.

As has been said numerous times in this thread, grub can't be used with
anything but raid1 to start with (the same is true for lilo).  Raid10
(or raid1+0, which is the same) - be it standard or linux extension format -
is NOT raid1.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Peter Rabbitson wrote:
 Michael Tokarev wrote:
   Raid10 IS RAID1+0 ;)
 It's just that linux raid10 driver can utilize more.. interesting ways
 to lay out the data.
 
 This is misleading, and adds to the confusion existing even before linux
 raid10. When you say raid10 in the hardware raid world, what do you
 mean? Stripes of mirrors? Mirrors of stripes? Some proprietary extension?

Mirrors of stripes make no sense.

 What Neil did was generalize the concept of N drives - M copies, and
 called it 10 because it could exactly mimic the layout of conventional
 1+0 [*]. However thinking about md level 10 in the terms of RAID 1+0 is
 wrong. Two examples (there are many more):
 
 * mdadm -C -l 10 -n 3 -o f2 /dev/md10 /dev/sda1 /dev/sdb1 /dev/sdc1
    ^

Those are interesting ways

 Odd number of drives, no parity calculation overhead, yet the setup can
 still suffer a loss of a single drive
 
 * mdadm -C -l 10 -n 2 -o f2 /dev/md10 /dev/sda1 /dev/sdb1
^

And this one too.

There are more-or-less standard raid LEVELS, including
raid10 (which is the same as raid1+0, or a stripe on top
of mirrors - note it does not mean 4 drives, you can
use 6 - stripe over 3 mirrors each of 2 components; or
the reverse - stripe over 2 mirrors of 3 components each
etc).

Vendors often add their own extensions, sometimes calling
them by the original level's name, and sometimes giving them new
names, especially in marketing speak.

Linux raid10 MODULE (which implements that standard raid10
LEVEL in full) adds some quite.. unusual extensions to that
standard raid10 LEVEL.  The resulting layout is also called
raid10 in linux (ie, not giving new names), but it's not that
raid10 (which is again the same as raid1+0) as commonly known
in various literature and on the internet.  Yet raid10 module
fully implements STANDARD raid10 LEVEL.

/mjt



Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Peter Rabbitson wrote:
 Moshe Yudkowsky wrote:

 One of the puzzling things about this is that I conceive of RAID10 as
 two RAID1 pairs, with RAID0 on top to join them into a large drive.
 However, when I use --level=10  to create my md drive, I cannot find
 out which two pairs are the RAID1's: the --detail doesn't give that
 information. Re-reading the md(4) man page, I think I'm badly mistaken
 about RAID10.

 Furthermore, since grub cannot find the /boot on the md drive, I
 deduce that RAID10 isn't what the 'net descriptions say it is.

In fact, everything matches.  For lilo to work, it basically needs
the whole filesystem on a single physical drive.  That's exactly the case
with raid1 (and only raid1).  With raid10, half of the filesystem is on one
mirror, and the other half is on another mirror.  Like this:

 filesystem    blocks on raid0
 blocks        DiskA   DiskB

   0             0
   1                     1
   2             2
   3                     3
   4             4
   5                     5
   ..

 (this is      (this is the actual
  what LILO     layout)
  expects)

(The difference between raid10 and raid0 is that
in raid10, each of DiskA and DiskB is in fact composed
of two identical devices.)

If your kernel is located in filesystem blocks
number 2 and 3, for example, lilo has to read from
BOTH halves, but it is not smart enough to
figure that out - it can only read everything
from a single drive.

 It is exactly what the names implies - a new kind of RAID :) The setup
 you describe is not RAID10 it is RAID1+0.

Raid10 IS RAID1+0 ;)
It's just that linux raid10 driver can utilize more.. interesting ways
to lay out the data.

/mjt


Re: Fwd: Error on /dev/sda, but takes down RAID-1

2008-01-23 Thread Michael Tokarev
Martin Seebach wrote:
 Hi, 
 
 I'm not sure this is completely linux-raid related, but I can't figure out 
 where to start: 
 
 A few days ago, my server died. I was able to log in and salvage this content 
 of dmesg: 
 http://pastebin.com/m4af616df 
 
 I talked to my hosting-people and they said it was an io-error on /dev/sda, 
 and replaced that drive. 
 After this, I was able to boot into a PXE-image and re-build the two RAID-1 
 devices with no problems - indicating that sdb was fine. 
 
 I expected RAID-1 to be able to stomach exactly this kind of error - one 
 drive dying. What did I do wrong? 

From that pastebin page:

First, sdb has failed for whatever reason:

ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: SCSI error: return code = 0x0004
end_request: I/O error, dev sdb, sector 80324865
raid1: Disk failure on sdb1, disabling device.
Operation continuing on 1 devices
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
 disk 1, wo:1, o:0, dev:sdb1
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1

At this time, it started to (re)sync other(?) arrays for
some reason:

md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 40162432 blocks.
md: md0: sync done.
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:sda1
md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 100060736 blocks.

Note again, errors on sdb:

sd 1:0:0:0: SCSI error: return code = 0x0004
end_request: I/O error, dev sdb, sector 112455000
sd 1:0:0:0: SCSI error: return code = 0x0004
end_request: I/O error, dev sdb, sector 112455256
sd 1:0:0:0: SCSI error: return code = 0x0004
end_request: I/O error, dev sdb, sector 112455512
...

raid1: Disk failure on sdb3, disabling device.
Operation continuing on 1 devices

so another md array detected the sdb failure.  So we're
left with sda only.  And voila, sda fails too, some time
later:

ata1: EH complete
sd 0:0:0:0: SCSI error: return code = 0x0004
end_request: I/O error, dev sda, sector 80324865
sd 0:0:0:0: SCSI error: return code = 0x0004
end_request: I/O error, dev sda, sector 115481
...

At this point, the arrays are hosed - all disks
of each array have failed, and there's no data any
more to read/write from/to.

Since later sda has been replaced, and sdb recovered
from the errors (it contains still-valid superblocks
but with somewhat stale information), everything
went ok.

But the original problem is that you had BOTH disks
failed, not only one.  What caused THIS problem is
another question.  Maybe some overheating or power
unit problem or somesuch -- I don't know...  But the
md code worked the best it could here.

/mjt


Re: Last ditch plea on remote double raid5 disk failure

2007-12-31 Thread Michael Tokarev
Neil Brown wrote:
 On Monday December 31, [EMAIL PROTECTED] wrote:
 I'm hoping that if I can get raid5 to continue despite the errors, I
 can bring back up enough of the server to continue, a bit like the
 remount-ro option in ext2/ext3.

 If not, oh well...
 
 Sorry, but it is oh well.

Speaking of all this bad block handling and dropping a device in case
of errors.  Sure, the situation here improved A LOT when rewriting
a block in case of a read error was introduced.  That was a very
big step in the right direction.  But it's still not sufficient,
I think.

What can be done currently is to extend the bitmap to keep more
information.  Namely, if a block on one drive fails, and we failed
to rewrite it as well (or there was no way to rewrite it because
the array was already running in degraded mode), don't drop the drive
yet, but fail the original request, AND mark THIS PARTICULAR BLOCK
of THIS PARTICULAR DRIVE as bad in the bitmap.

In other words, the bitmap can be extended to cover individual drives
instead of the whole raid device.

What's more - if there's no bitmap for the array, I mean no persistent
bitmap, such a thing can still be done anyway, by keeping the bitmap
in memory only, up until the raid array is shut down (at which point,
mark each drive that had errors as wholly bad).  This way, it's possible
to recover a lot more data without risking losing the whole array at any
time.

Even more - up until some real write is performed over a bad
block, there's no need to record its badness - we can return the same
error again, as the drive is expected to return it on the next read
attempt.  It's only a write - a real write - which makes this particular
block become bad, as we weren't able to write the new data to it...

Hm.  Even in case of a write failure, we can still keep the whole drive
without marking anything as bad, again in the hope that the next read of
those blocks will error out again.  This is an.. interesting question
really - whether one can rely on a drive to not return bad (read: random)
data in case it errored out on a write operation.  I definitely know a case
where it's not true: we've got a batch of seagate drives which seem to have
a firmware bug, which errors out on write with a Defect list
manipulation error sense code, but a read of this very sector still returns
something, especially after a fresh boot (after a power-off).

In any case, keeping this info in a bitmap should be sufficient to
stop kicking whole drives out of an array, which currently is
the weakest point in linux software raid (IMHO).  As has been pointed
out numerous times before, due to Murphy's laws or other factors such
as the phase of the Moon (and partly this behaviour can be explained by
the fact that after a drive failure, the other drives receive more I/O
requests, esp. when reconstruction starts, and hence have a much greater
chance of erroring out on sectors which were not read in a long
time), drives tend to fail several at once, and often it's trivial to
read the missing information from a drive which has just been kicked
out of the array at the place where another drive developed a bad sector.

And another thought around all this.  Linux sw raid definitely needs
a way to proactively replace a (probably failing) drive, without removing
it from the array first.  Something like,
  mdadm --add /dev/md0 /dev/sdNEW --inplace /dev/sdFAILING
so that sdNEW will be a mirror of sdFAILING, and once the recovery
procedure finishes (which may use data from other drives in case of
I/O error reading sdFAILING - unlike described scenario of making a
superblock-less mirror of sdNEW and sdFAILING),
  mdadm --remove /dev/md0 /dev/sdFAILING,
which does not involve any further reconstructions anymore.

/mjt


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-29 Thread Michael Tokarev
Justin Piszcz wrote:
[]
 Good to know/have it confirmed by someone else, the alignment does not
 matter with Linux/SW RAID.

Alignment matters when one partitions a Linux/SW raid array.
If the inside partitions are not aligned on a stripe
boundary - esp. in the worst case when filesystem blocks
cross the stripe boundary (wonder if that's ever possible...
and I think it is: if a partition starts at some odd 512-byte
boundary and the filesystem block size is 4Kb) - there's
just no chance for the inside filesystem to do full-stripe
writes, ever, so (modulo stripe cache size) all writes will
go the read-modify-write or similar way.
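
A quick way to see the arithmetic (my sketch; the 64Kb chunk over 3 data
disks is just an example, not a figure from the article):

```python
SECTOR = 512  # bytes

def stripe_aligned(part_start_sector, chunk_kb, n_data_disks):
    # a partition is stripe-aligned when its byte offset is a
    # multiple of the full stripe (chunk size * number of data disks)
    stripe = chunk_kb * 1024 * n_data_disks
    return (part_start_sector * SECTOR) % stripe == 0

# the classic msdos default of starting at sector 63 is never aligned
assert not stripe_aligned(63, 64, 3)
# sector 384 (192 KiB) is exactly one stripe of 64Kb chunks on 3 data disks
assert stripe_aligned(384, 64, 3)
```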

And that's what the original article is about, by the way.
It just happens that a hardware raid array is more often split
into partitions (using native tools) than a linux software raid
array is.

And that's what has been pointed out in this thread, as well... ;)

/mjt


Re: raid10: unfair disk load?

2007-12-23 Thread Michael Tokarev
maobo wrote:
 Hi,all
 Yes, Raid10 read balance is the shortest position time first, and
 it considers the sequential access condition. But its performance is
 really poor in my tests compared to raid0.

Single-stream write performance of raid0, raid1 and raid10 should be
of similar level (with raid5 and raid6 things are different) -- in all
3 cases, it should be near the write speed of a single drive.  The
only possible problematic case is when you've got some unlucky hardware
which does not permit writing to two drives in parallel - in which
case raid1 and raid10 write speed will be less than raid0 or a
single drive.  But even the ol' good IDE drives/controllers, even with two
disks on the same channel, permit parallel writes.  Modern SATA
and SCSI/SAS should be no problem - hopefully, modulo (theoretically)
some very cheap lame controllers.

 I think this is the process flow where raid10 loses. But RAID0 is so
 simple and performs very well!
 From this it seems that striping is better than mirroring! RAID10 is
 stripe+mirror. But for writes it performs much worse than RAID0.
 Isn't it?

No it's not.  When the hardware (and drivers) is sane anyway.

Also, speed is a very subjective thing, so to say - it very much
depends on the workload.

/mjt


Re: raid10: unfair disk load?

2007-12-21 Thread Michael Tokarev
Michael Tokarev wrote:
 I just noticed that with Linux software RAID10, disk
 usage isn't equal at all, that is, most reads are
 done from the first part of mirror(s) only.
 
 Attached (disk-hour.png) is a little graph demonstrating
 this (please don't blame me for poor choice of colors and
 the like - this stuff is in works right now, it's a first
 rrd graph I produced :).  There's a 14-drive RAID10 array
 and 2 more drives.  In the graph it's clearly visible that
 there are 3 kinds of load for drives, because graphs for
 individual drives are stacked on each other forming 3 sets.
 One set (the 2 remaining drives) isn't interesting, but the
 2 main ones (with many individual lines) are interesting.

Ok, looks like vger.kernel.org dislikes png attachments.
I won't reproduce the graphs as ascii-art, and it's really not
necessary -- see below.

 The 7 drives with higher utilization receive almost all
 reads - the second half of the array only gets reads
 sometimes.  And all 14 drives - obviously - receive
 all writes.
 
 So the picture (modulo that sometimes above which is
 too small to take into account) is like - writes are
 going to all drives, while reads are done from the
 first half of each pair only.
 
 Also attached are two graphs for individual drives,
 one is from first half of the array (diskrq-sdb-hour.png),
 which receives almost all reads (other disks looks
 pretty much the same), and from the second half
 (diskrq-sdl-hour.png), which receives very few
 reads.  The graphs shows number of disk transactions
 per second, separately for reads and writes.

Here's a typical line from iostat -x:

Dev:  rrqm/s wrqm/s   r/s  w/s  rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb     0,32   0,03 22,16 5,84 2054,79 163,74    79,21     0,20  7,29  4,33 12,12
sdk     0,38   0,03  6,28 5,84  716,61 163,74    72,66     0,15 12,29  5,55  6,72

where sdb and sdk are the two halves of the same raid1 part
of a raid10 array - i.e., the contents of the two are
identical.  As shown, write requests are the same for
the two, but read requests mostly go to sdb (the
first half), and very few to sdk (the second half).

 Should raid10 balance reads too, maybe in a way similar
 to what raid1 does?
 
 The kernel is 2.6.23 but very similar behavior is
 shown by earlier kernels as well.  Raid10 stripe
 size is 256Mb, but again it doesn't really matter
 other sizes behave the same here.

The amount of data is quite large and it is laid out
and accessed pretty much randomly (it's a database
server), so in theory, even with an optimization
like the one raid1 does (route the request to the drive with the
nearest head position), the read request distribution should
be basically the same.

Thanks!

/mjt



Re: raid10: unfair disk load?

2007-12-21 Thread Michael Tokarev
Janek Kozicki wrote:
 Michael Tokarev said: (by the date of Fri, 21 Dec 2007 14:53:38 +0300)
 
 I just noticed that with Linux software RAID10, disk
 usage isn't equal at all, that is, most reads are
 done from the first part of mirror(s) only.
 
 what's your kernel version? I recall that recently there have been
 some works regarding load balancing.

It was in my original email:

 The kernel is 2.6.23 but very similar behavior is
 shown by earlier kernels as well.  Raid10 stripe
 size is 256Mb, but again it doesn't really matter
 other sizes behave the same here.

Strange that I missed the new raid10 development you
mentioned (I follow linux-raid quite closely).
Lemme see...  no, nothing relevant in 2.6.24-rc5
(compared with 2.6.23), at least git doesn't show
anything interesting.  Which change(s) are you
referring to?

Thanks.

/mjt


Re: [ERROR] scsi.c: In function 'scsi_get_serial_number_page'

2007-12-19 Thread Michael Tokarev
Thierry Iceta wrote:
 Hi
 
 I would like to use raidtools-1.00.3 on the Rhel5 distribution
 but I got this error

Use mdadm instead.  Raidtools is dangerous/unsafe, and has
not been maintained for a long time already.

/mjt


external bitmaps.. and more

2007-12-06 Thread Michael Tokarev
I came across a situation where external MD bitmaps
aren't usable on any standard linux distribution
unless special (non-trivial) actions are taken.

First is a small buglet in mdadm, or two.

It's not possible to specify --bitmap= on the assemble
command line - the option seems to be ignored.  But
it's honored when specified in the config file.

Also, mdadm should probably warn or even refuse to
do things (unless --force is given) when an array being
assembled is using an external bitmap, but the bitmap file
isn't specified.

Now for something more.. interesting.

The thing is that when an external bitmap is being used
for an array, and that bitmap resides on another filesystem,
all common distributions fail to start/mount and to
shutdown/umount arrays/filesystems properly, because
all starts/stops are done in one script, and all mounts/umounts
in another, but for bitmaps to work the two should be intermixed
with each other.

Here's why.

Suppose I've got an array mdX which uses a bitmap /stuff/bitmap,
where /stuff is another, separate filesystem.  In this case,
during startup, /stuff should be mounted before bringing up
mdX, and during shutdown, mdX should be stopped before
trying to umount /stuff.  Otherwise, during startup mdX will
not find /stuff/bitmap, and during shutdown the /stuff filesystem
is busy since mdX is holding a reference to it.

Doing things the simple way doesn't work: if I specify
mounting mdX as /data in /etc/fstab, -- since mdX hasn't been
assembled by mdadm (due to the missing bitmap), the system will
not start, asking for the emergency root password...

Oh well.

So the only solution for this so far is to convert the md array
assemble/stop operations into... MOUNTS/UMOUNTS!  And to specify
all the necessary information in /etc/fstab - for both arrays and
filesystems, with proper ordering in the order column.
Ghrm.
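
For illustration, an /etc/fstab along those lines might look like
this (the mount.md type, the bitmap= option and the device names
are all hypothetical - this is a sketch of the idea, not an
existing interface):

```
# 1. the filesystem holding the bitmap is mounted first:
/dev/sdc1   /stuff   ext3   defaults               0  2
# 2. the array is "mounted" (assembled) via a mount.md wrapper
#    around mdadm, which can now find the bitmap file:
/dev/md0    none     md     bitmap=/stuff/bitmap   0  0
# 3. finally the filesystem on the array itself:
/dev/md0    /data    ext3   defaults               0  2
```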

Technically speaking it's not difficult - mount.md and fsck.md
wrappers for mdadm are trivial to write (I even tried that
myself - a quick-n-dirty 5-minute hack works).  But it's...
ugly.

But I don't see any other reasonable solutions.  The alternatives
are additional scripts to start/stop/mount/umount filesystems
residing on, or related to, advanced arrays (ones with external
bitmaps in this case) - but looking at how much code is in
the current startup scripts around mounting/fscking, and bearing
in mind that mount/umount do not support an alternative
/etc/fstab, this is umm.. even more ugly...

Comments anyone?

Thanks.

/mjt

P.S.  Why external bitmaps in the first place?  Well, that's
a good question, and here's a (hopefully good too) answer.
When there are enough disk drives available to dedicate
some of them to bitmap(s); when there's a large array (or
several) with dynamic content (many writes); and when the
content is important enough to care about data safety wrt
possible power losses, kernel OOPSes and whatnot - placing
the bitmap on another disk helps a lot with resyncs (it's not
about resync speed, it's about resync's general UNRELIABILITY,
which is another topic - hopefully long-term linux
raid gurus will understand me here), but does not slow
down writes hugely due to constant disk seeks when updating
bitmaps.  Those seeks tend to have a huge impact on random
write performance.


Re: assemble vs create an array.......

2007-12-06 Thread Michael Tokarev
[Cc'd to xfs list as it contains something related]

Dragos wrote:
 Thank you.
 I want to make sure I understand.

[Some background for the XFS list.  The talk is about a broken linux software
raid (the reason for the breakage isn't relevant anymore).  The OP seems to
have lost the order of drives in his array, and is now trying to create a new
array on top, trying different combinations of drives.  The filesystem there
WAS XFS.  One point is that linux refuses to mount it, saying
structure needs cleaning.  This is all mostly md-related, but there
are several XFS-related questions and concerns too.]

 
 1- Does it matter which permutation of drives I use for xfs_repair (as
 long as it tells me that the Structure needs cleaning)? When it comes to
 linux I consider myself at intermediate level, but I am a beginner when
 it comes to raid and filesystem issues.

The permutation DOES MATTER - for all the devices.
Linux, when mounting an fs, only looks at the superblock of the filesystem,
which is usually located at the beginning of the device.

So in each case where linux actually recognized the filesystem (instead
of seeing complete garbage), the same device was the first one - i.e.,
this is how you found your first device.  The rest may still be out of order.

Raid5 data is laid out like this (with 3 drives for simplicity; it's
similar with more drives):

   DiskA   DiskB   DiskC
Blk0   Data0   Data1   P0
Blk1   P1  Data2   Data3
Blk2   Data4   P2  Data5
Blk3   Data6   Data7   P3
... and so on ...

where your actual data blocks are Data0, Data1, ... DataN,
and PX are parity blocks.

As long as DiskA remains in this position, the beginning of
the array is the Data0 block - hence linux sees the beginning
of the filesystem and recognizes it.  But DiskB and DiskC
may still be switched, in which case the rest of the data will
be complete garbage; only the data blocks on DiskA will be in
place.

So you still need to find the order of the other drives
(you have found your first drive, DiskA, already).

Note also that if the Data1 block is all zeros (a situation
which is unlikely for a non-empty filesystem), P0 (the first
parity block) will be exactly the same as Data0, because
XORing anything with zeros gives the same anything again
(XOR is the operation used to calculate parity blocks in
RAID5).  So there's still a remote chance you've got TWO
'first' disks...
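
Since parity is just XOR, this degenerate case is easy to
demonstrate - a toy sketch with made-up block contents (nothing
here reads real disks):

```python
# Toy RAID5 stripe: the parity block is the byte-wise XOR of the data blocks.
data0 = bytes([0x41, 0x42, 0x43, 0x44])  # first data block (on DiskA)
data1 = bytes(4)                         # second data block (on DiskB): all zeros

# P0 = Data0 XOR Data1
p0 = bytes(a ^ b for a, b in zip(data0, data1))

# XOR with zeros is a no-op, so P0 is byte-for-byte identical to Data0:
# DiskC's first block would look exactly like DiskA's.
print(p0 == data0)  # True
```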

What to do is to give xfs_repair a try for each permutation,
but again without letting it actually fix anything.
Just run it in read-only mode and see which combination
of drives gives fewer errors, or no fatal errors (there
may be several similar combinations, with the same order
of drives but with a different drive missing).

It's sad that xfs refuses to mount when the structure needs
cleaning - the best thing here would be to actually mount it
and see what it looks like, instead of trying repair
tools.  Is there some option to force-mount it anyway
(in readonly mode, knowing it may OOPS the kernel etc)?

I'm not very familiar with xfs yet - it seems to be
much faster than ext3 for our workload (mostly databases),
and I'm experimenting with it slowly.  But this very
thread prompted me to think.  If I can't force-mount it
(or browse it by other means) just to examine things,
as I can almost always do with a (somewhat?) broken ext[23],
maybe I'm trying it before it's mature enough? ;)  Note
the smile, but note there's a bit of truth in every joke... :)

 2- After I do it, assuming that it worked, how do I reintegrate the
 'missing' drive while keeping my data?

Just add it back: mdadm --add /dev/mdX /dev/sdYZ.
But don't do that until you actually see your data.

/mjt


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-02 Thread Michael Tokarev
 Justin Piszcz said: (by the date of Sun, 2 Dec 2007 04:11:59 -0500 (EST))
 
 The badblocks did not do anything; however, when I built a software raid 5 
 and the performed a dd:

 /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M

 I saw this somewhere along the way:

 [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 
 SAct=0x7000 FIS=004040a1:0800
 [42333.240054] ata5: soft resetting port

There's some (probably timing-related) bug with spurious completions
during NCQ.  A lot of people are seeing this same effect with different
drives and controllers.  Tejun is working on it.  It's difficult to
reproduce.

Search for spurious completion - there are many hits...

/mjt


Re: assemble vs create an array.......

2007-11-30 Thread Michael Tokarev
Bryce wrote:
[]
 mdadm -C -l5 -n5 -c128  /dev/md0 /dev/sdf1 /dev/sde1 /dev/sdg1 /dev/sdc1 
 /dev/sdd1
...
 IF you don't have the configuration printout, then you're left with
 exhaustive brute force searching of the combinations

You're missing a very important point -- the --assume-clean option.
For experiments like this (trying to figure out the order of disks),
you'd better ensure the data on the disks isn't being changed while
you try different combinations.  But on each build, md always
destroys one drive by re-calculating parity.  You have to stop
it from doing so - so as not to trash your data.

Another option is to always use one missing drive, i.e.,

 mdadm -C -l5 -n5 -c128  /dev/md0 /dev/sdf1 missing /dev/sdg1 /dev/sdc1 
/dev/sdd1

so that the array will be degraded and there's no way to resync
anything - this also prevents md from trashing the data.
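
To get a feel for the size of that search space, the candidate
create commands can be enumerated mechanically - a sketch with
hypothetical device names (the command strings are only printed,
never executed):

```python
from itertools import permutations

# Hypothetical device names - substitute your real array members.
drives = ["/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1"]
slots = 4  # the array originally had 4 member slots

candidates = []
for missing_slot in range(slots):                  # which slot to leave "missing"
    for order in permutations(drives, slots - 1):  # ordering of the real drives
        members = list(order)
        members.insert(missing_slot, "missing")
        candidates.append("mdadm -C -l5 -n4 -c128 /dev/md0 " + " ".join(members))

# 4 positions for "missing" times P(4,3)=24 orderings of the rest:
print(len(candidates))  # 96
```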

/mjt


Re: man mdadm - suggested correction.

2007-11-05 Thread Michael Tokarev
Janek Kozicki wrote:
[]
 Can you please add do the manual under 'SEE ALSO' a reference
 to /usr/share/doc/mdadm ?

/usr/share/doc/mdadm is Debian-specific (well.. not necessarily
only Debian - distros derived from it, and perhaps others, may use
the same naming scheme too).  Other distributions may place the
files into a different directory, or not ship them at all, or ship
them in an alternative package.

In any case, on Debian, say, a user always knows that other misc.
docs are in /usr/share/doc/$package - even if no links to them are
provided in the manpage.  Users familiar with other distributions
know where and how to find the other docs there.

/mjt


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Michael Tokarev
Justin Piszcz wrote:
 # ps auxww | grep D
 USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
 root   273  0.0  0.0  0 0 ?DOct21  14:40 [pdflush]
 root   274  0.0  0.0  0 0 ?DOct21  13:00 [pdflush]
 
 After several days/weeks, this is the second time this has happened,
 while doing regular file I/O (decompressing a file), everything on the
 device went into D-state.

The next time you come across something like that, do a SysRq-T dump and
post that.  It shows a stack trace of all processes - and in particular,
where exactly each task is stuck.

/mjt


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Michael Tokarev
Justin Piszcz wrote:
 On Sun, 4 Nov 2007, Michael Tokarev wrote:
[]
 The next time you come across something like that, do a SysRq-T dump and
 post that.  It shows a stack trace of all processes - and in particular,
 where exactly each task is stuck.

 Yes I got it before I rebooted, ran that and then dmesg > file.
 
 Here it is:
 
 [1172609.665902]  80747dc0 80747dc0 80747dc0 
 80744d80
 [1172609.668768]  80747dc0 81015c3aa918 810091c899b4 
 810091c899a8

That's only a partial list.  All the kernel threads - which are the most
important in this context - aren't shown.  You ran out of dmesg buffer,
and the most interesting entries were at the beginning.  If your /var/log
partition is working, the stuff should be in /var/log/kern.log or
equivalent.  If it's not working, there is still a way to capture the
info: stop syslogd, cat /proc/kmsg to some tmpfs file and scp it elsewhere.

/mjt


Re: Time to deprecate old RAID formats?

2007-10-22 Thread Michael Tokarev
John Stoffel wrote:

 Michael == Michael Tokarev [EMAIL PROTECTED] writes:

 If you are going to mirror an existing filesystem, then by definition
 you have a second disk or partition available for the purpose.  So you
 would merely setup the new RAID1, in degraded mode, using the new
 partition as the base.  Then you copy the data over to the new RAID1
 device, change your boot setup, and reboot.
 
 Michael And you have to copy the data twice as a result, instead of
 Michael copying it only once to the second disk.
 
 So?  Why is this such a big deal?  As I see it, there are two seperate
 ways to setup a RAID1 setup, on an OS.
[..]
that was just a tiny nitpick, so to say, about a particular way to
convert an existing system to raid1 - not something that's done every
day anyway.  Still, doubling the time to copy your terabyte-sized
drive is something to consider.

[]
 Michael automatically activate it, thus making it busy.  What I'm
 Michael talking about here is that any automatic activation of
 Michael anything should be done with extreme care, using smart logic
 Michael in the startup scripts if at all.
 
 Ah... but you can also de-active LVM partitions as well if you like.  

Yes, especially for a newbie user who has just installed linux on his
PC, only to find that he can't use his disk.. ;)  That was a real
situation - I helped someone who had never heard of LVM and had done
little of anything with filesystems/disks before.

 Michael The Doug's example - in my opinion anyway - shows wrong tools
 Michael or bad logic in the startup sequence, not a general flaw in
 Michael superblock location.
 
 I don't agree completely.  I think the superblock location is a key
 issue, because if you have a superblock location which moves depending
 the filesystem or LVM you use to look at the partition (or full disk)
 then you need to be even more careful about how to poke at things.

Superblock location does not depend on the filesystem.  Raid exports
only the inside space, excluding the superblocks, to the next level
(filesystem or whatever else).

 This is really true when you use the full disk for the mirror, because
 then you don't have the partition table to base some initial
 guestimates on.  Since there is an explicit Linux RAID partition type,
 as well as an explicit linux filesystem (filesystem is then decoded
 from the first Nk of the partition), you have a modicum of safety.

Speaking of whole disks - first, don't do that (for reasons that
belong in another topic), and second, using the whole disk or partitions
makes no real difference whatsoever to the topic being discussed.

There's just no need for the guesswork, except at first install time
(to automatically recognize existing devices, and to use them after
confirmation), and maybe for rescue systems - which again is a different
topic.

In any case, for a tool that does guesswork (like libvolume-id, to
create the /dev/ symlinks), it's as easy to look at the end of the device
as at the beginning or any other fixed place - since the tool has
to know the superblock format, it knows the superblock location as well.

Maybe manual guesswork, based on a hexdump of the first several
kilobytes of data, is a bit more difficult in the case where the
superblock is located at the end.  But if one has to analyze a hexdump,
he doesn't care about raid anymore.

 If ext3 has the superblock in the first 4k of the disk, but you've
 setup the disk to use RAID1 with the LVM superblock at the end of the
 disk, you now need to be careful about how the disk is detected and
 then mounted.

See above.  For tools, it's trivial to distinguish a component of a
raid volume from the volume itself, by looking for a superblock at
whatever location.  That includes tools like mkfs, which - as mdadm
does - may warn about previous filesystem/volume information on the
device in question.

 Michael Speaking of cases where it was really helpful to have an
 Michael ability to mount individual raid components directly without
 Michael the raid level - most of them was due to one or another
 Michael operator errors, usually together with bugs and/or omissions
 Michael in software.  I don't remember exact scenarious anymore (last
 Michael time it was more than 2 years ago).  Most of the time it was
 Michael one or another sort of system recovery.
 
 In this case, you're only talking about RAID1 mirrors, no other RAID
 configuration fits this scenario.  And while this might look to be

Definitely.  However, linear arrays can - to some extent - be used
partially, though with much less usefulness.

However, raid1 is a much more common setup than anything else - IMHO anyway.
It's the cheapest and most reliable thing for an average user -
it's cheaper to get 2 large drives than, say, 3 somewhat smaller drives.
Yes, raid1 wastes 1/2 of the space, compared with, say, raid5 on top of 3
drives (only 1/3 wasted), but 3 smallish drives still cost more than
2 larger drives.

 helpful, I would strongly argue that it's not, because it's a special

Re: Software RAID when it works and when it doesn't

2007-10-20 Thread Michael Tokarev
Justin Piszcz wrote:
[]
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

Justin, forgive me please, but can you learn to trim the original
messages when replying - at least cut off the clearly irrelevant parts?
You're always quoting the whole message, even including the part
after a line consisting of a single minus sign - a part that most
MUAs will remove when replying...

 I have a question with re-mapping sectors, can software raid be as
 efficient or good at remapping bad sectors as an external raid
 controller for, e.g., raid 10 or raid5?

Hard disks DO remap bad sectors on their own.  In most cases
that's sufficient - there's nothing for raid to do (be it hardware
raid or software), except to perform a write to the bad place, just
to trigger the in-disk remapping procedure.  Even the cheapest drives
nowadays have some remapping capability.

There was an idea some years ago about having an additional layer
between a block device and whatever is above it (a filesystem or
something else) that would just do bad block remapping.  Maybe it was
even implemented in LVM or in the IBM-proposed EVMS (the version that
included in-kernel stuff too, not only the userspace management), but
I don't remember the details anymore.  In any case - again, if memory
serves me right - there was little interest in it, for exactly this
reason: drives are now more intelligent, and there's hardly a notion
of a bad block anymore - at least a persistent one that is visible to
the upper layers.

/mjt


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Michael Tokarev
Doug Ledford wrote:
[]
 1.0, 1.1, and 1.2 are the same format, just in different positions on
 the disk.  Of the three, the 1.1 format is the safest to use since it
 won't allow you to accidentally have some sort of metadata between the
 beginning of the disk and the raid superblock (such as an lvm2
 superblock), and hence whenever the raid array isn't up, you won't be
 able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worse
 case situations, I've seen lvm2 find a superblock on one RAID1 array
 member when the RAID1 array was down, the system came up, you used the
 system, the two copies of the raid array were made drastically
 inconsistent, then at the next reboot, the situation that prevented the
 RAID1 from starting was resolved, and it never know it failed to start
 last time, and the two inconsistent members we put back into a clean
 array).  So, deprecating any of these is not really helpful.  And you
 need to keep the old 0.90 format around for back compatibility with
 thousands of existing raid arrays.

Well, I strongly and completely disagree.  You described a real-world
situation, and that's unfortunate, BUT: for raid1 at least, there ARE
cases, pretty valid ones, when one NEEDS to mount the filesystem without
bringing up raid.  Raid1 allows that.

/mjt


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Michael Tokarev
John Stoffel wrote:
 Michael == Michael Tokarev [EMAIL PROTECTED] writes:
[]
 Michael Well, I strongly, completely disagree.  You described a
 Michael real-world situation, and that's unfortunate, BUT: for at
 Michael least raid1, there ARE cases, pretty valid ones, when one
 Michael NEEDS to mount the filesystem without bringing up raid.
 Michael Raid1 allows that.
 
 Please describe one such case please.  There have certainly been hacks
 of various RAID systems on other OSes such as Solaris where the VxVM
 and/or Solstice DiskSuite allowed you to encapsulate an existing
 partition into a RAID array.  
 
 But in my experience (and I'm a professional sysadm... :-) it's not
 really all that useful, and can lead to problems like those described
 by Doug.  

I've been doing sysadmin work for about 15 or 20 years.

 If you are going to mirror an existing filesystem, then by definition
 you have a second disk or partition available for the purpose.  So you
 would merely setup the new RAID1, in degraded mode, using the new
 partition as the base.  Then you copy the data over to the new RAID1
 device, change your boot setup, and reboot.
[...]

And you have to copy the data twice as a result, instead of copying
it only once to the second disk.

 As Doug says, and I agree strongly, you DO NOT want to have the
 possibility of confusion and data loss, especially on bootup.  And

There are different points of view, different settings, etc.
For example, I once dealt with a linux user who was unable to
use his disk partition, because his system (it was RedHat, if I
remember correctly) recognized some LVM volume on the disk (it
had previously been used with Windows) and tried to activate it
automatically, thus making it busy.  What I'm getting at here
is that any automatic activation of anything should be done with
extreme care, using smart logic in the startup scripts - if at
all.

Doug's example - in my opinion anyway - shows wrong tools
or bad logic in the startup sequence, not a general flaw in
superblock location.

Another example is ext[234]fs - it does not touch the first 512
bytes of the device, so if there was an msdos filesystem there
before, it will be recognized as such by many tools, and an
attempt to mount it automatically will lead, at best, to scary
output and nothing mounted, or, in the worst scenario, to fsck
doing fatal things to it.  Sure, the first 512 bytes should
just be cleared.. but that's another topic.

Speaking of cases where it was really helpful to have the ability
to mount individual raid components directly, without the raid
level - most of them were due to one or another operator error,
usually together with bugs and/or omissions in software.  I don't
remember the exact scenarios anymore (the last time was more than
2 years ago).  Most of the time it was one or another sort of
system recovery.

On almost all machines I maintain, there's a raid1 for the root
filesystem built of all the drives (be it 2, 4 or even 6 of
them) - the key point is to be able to boot off any of them
in case some cable/drive/controller rearrangement has to be
done.  The root filesystem is quite small (256 or 512 MB here),
and it's not very dynamic either - so it's not a big deal to
waste the space for it.

Problems occur - obviously - when something goes wrong.
And most of the issues we had happened at a remote site,
where no experienced operator/sysadmin was handy.

For example, when one drive was almost dead and mdadm tried
to bring the array up, the machine just hung for an unknown
amount of time.  An inexperienced operator was there.  Instead
of trying to teach him how to pass a parameter to the initramfs
to stop it from trying to assemble the root array, and then how
to assemble it manually, I told him to pass root=/dev/sda1 to
the kernel.  Root mounts read-only, so it should be a safe thing
to do - I only needed the root fs and a minimal set of services
(which are even in the initramfs), just to boot up to SOME
state where I could log in remotely and fix things later.
(No, I didn't want to remove the drive yet - I wanted to
examine it first, and that turned out to be a good idea, because
the hang was happening only at the beginning of the drive, and
while we tried to install the replacement and fill it up with
data, an unreadable sector was found on another drive, so this
old but not-yet-removed drive was really handy.)

Another situation - after some weird crash I had to examine
the filesystems found on both components.  I wanted to look
at the filesystems and compare them, WITHOUT messing with the
raid superblocks (later on I wrote a tiny program to
save/restore 0.90 superblocks), and without attempting any
reconstruction.  In fact, this very case - examining the
contents - is something I've done many times for one reason
or another.  There's just no need to involve the raid layer
here at all, but it doesn't disturb things either (in some
cases anyway).

Yet another - many times we had to copy an old system to
a new one - new machine boots with 3 drives

Re: Time to deprecate old RAID formats?

2007-10-20 Thread Michael Tokarev
Justin Piszcz wrote:
 
 On Fri, 19 Oct 2007, Doug Ledford wrote:
 
 On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
[]
 Got it, so for RAID1 it would make sense if LILO supported it (the
 later versions of the md superblock)

 Lilo doesn't know anything about the superblock format, however, lilo
 expects the raid1 device to start at the beginning of the physical
 partition.  In otherwords, format 1.0 would work with lilo.
 Did not work when I tried 1.x with LILO, switched back to 00.90.03 and
 it worked fine.

There are different 1.x formats - and the difference is exactly
this: the location of the superblock.  In 1.0 the superblock is
located at the end, just as with 0.90, and lilo works just fine
with it.  It somehow gets confused when the superblock is at the
beginning (v1.1 or 1.2) - though I don't really see how, because
lilo uses bmap() to get a list of physical blocks for the files
it wants to access, and those should be absolute numbers
regardless of the superblock location.
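
For reference, a sketch of where the 1.x variants put the superblock,
per my reading of the mdadm superblock handling - treat the 1.0
arithmetic as an approximation of that code (it works in 512-byte
sectors and rounds down to an 8-sector boundary):

```python
def sb_offset_bytes(version, device_size_bytes):
    """Byte offset of the md 1.x superblock from the start of the device.

    Sketch of my reading of mdadm's super1 handling: sizes are handled
    in 512-byte sectors; for 1.0 the offset is rounded down to an
    8-sector (4 KiB) boundary near the end of the device.
    """
    sectors = device_size_bytes // 512
    if version == "1.1":
        return 0                              # at the very start of the device
    if version == "1.2":
        return 4 * 1024                       # 4 KiB from the start
    if version == "1.0":
        return ((sectors - 16) & ~7) * 512    # ~8 KiB before the end
    raise ValueError("unknown metadata version: %s" % version)

size = 250 * 2**30  # a hypothetical 250 GiB device
print(size - sb_offset_bytes("1.0", size))  # 8192 - i.e. 8 KiB from the end
```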

/mjt


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-09 Thread Michael Tokarev
Neil Brown wrote:
 On Tuesday October 9, [EMAIL PROTECTED] wrote:
[]
 o During this reshape time, errors may be fatal to the whole array -
   while mdadm do have a sense of critical section, but the
   whole procedure isn't as much tested as the rest of raid code,
   I for one will not rely on it, at least for now.  For example,
   a power failure at an unexpected moment, or some plain-stupid
   error in reshape code so that the whole array goes boom etc...
 
 While it is true that the resize code is less tested than other code,
 it is designed to handle a single failure at any time (so a power
 failure is OK as long as the array is not running degraded), and I
 have said that if anyone does suffer problems while performing a
 reshape, I will do my absolute best to get the array functioning and
 the data safe again.

Well... Neil, it's your code, so you trust it - that's ok; I also
(try to) trust my code until someone finds a bug in it.. ;)
And I'm a sysadmin (among other things), whose professional
trait must be a bit of paranoia..  You get the idea ;)

 o A filesystem on the array has to be resized separately after
   re{siz,shap}ing the array.  And filesystems are different at
   this point, too - there are various limitations.  For example,
   it's problematic to grow ext[23]fs by large amounts, because
   when it gets initially created, mke2fs calculates sizes of
   certain internal data structures based on the device size,
   and those structures can't be grown significantly, only
   recreating the filesystem will do the trick.
 
 This isn't entirely true.
 For online resizing (while the filesystem is mounted) there are some
 limitations as you suggest.  For offline resizing (while filesystem is
 not mounted) there are no such limitations.

There still are limitations - at least for ext[23].  Even offline
resizers can't resize from just any size to any other; the extfs
developers recommend recreating the filesystem anyway if the size
changes significantly.  I'm too lazy to find a reference now; it
has been mentioned here on linux-raid at least once this year.
It's sorta like fat (yea, that ms-dog filesystem) - when you resize
it from, say, 501MB to 999MB, everything is ok, but if you want to
go from 501MB to 1GB+1, you have to recreate almost all the data
structures, because the sizes of all the internal fields change -
and there it's much safer to just re-create it from scratch than
to try to modify it in place.  Sure, it's much better with extfs,
but the point is still the same.

/mjt


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Michael Tokarev
Janek Kozicki wrote:
 Hello,
 
 Recently I started to use mdadm and I'm very impressed by its
 capabilities. 
 
 I have raid0 (250+250 GB) on my workstation. And I want to have
 raid5 (4*500 = 1500 GB) on my backup machine.

Hmm.  Are you sure you need that much space on the backup, to
start with?  Maybe a better backup strategy will help avoid
hardware costs - such as using rsync for backups, as discussed
on this mailing list about a month back (rsync is able to keep
many ready-to-use copies of your filesystems while only storing
files that actually changed since the last backup, thus
requiring much less space than many full backups).
 The backup machine currently doesn't have raid, just a single 500 GB
 drive. I plan to buy more HDDs to have a bigger space for my
 backups but since I cannot afford all HDDs at once I face a problem
 of expanding an array. I'm able to add one 500 GB drive every few
 months until I have all 4 drives.
 
 But I cannot make a backup of a backup... so reformatting/copying all
 data each time when I add new disc to the array is not possible for me.
 
 Is it possible anyhow to create a very degraded raid array - a one
 that consists of 4 drives, but has only TWO ?
 
 This would involve some very tricky *hole* management on the block
 device... A one that places holes in stripes on the block device,
 until more discs are added to fill the holes. When the holes are
 filled, the block device grows bigger, and with lvm I just increase
 the filesystem size. This is perhaps coupled with some unstripping
 that moves/reorganizes blocks around to fill/defragment the holes.

It's definitely not possible with raid5.  The only option is to create
a raid5 array consisting of fewer drives than it will contain at the
end, and reshape it when you get more drives, as others have noted in
this thread.  But do note the following points:

o degraded raid5 isn't really Raid - i.e., it's no better than
  a raid0 array: any disk fails = the whole array fails.
  So instead of creating a degraded raid5 array initially, create
  a smaller but non-degraded one, and reshape it when
  necessary.

o reshaping takes time, and for this volume, a reshape will take
  many hours, maybe days, to complete.

o During this reshape time, errors may be fatal to the whole array -
  while mdadm does have a sense of a critical section, the
  whole procedure isn't as well tested as the rest of the raid code,
  and I for one will not rely on it, at least for now.  For example,
  a power failure at an unexpected moment, or some plain-stupid
  error in the reshape code so that the whole array goes boom, etc...

o A filesystem on the array has to be resized separately after
  re{siz,shap}ing the array.  And filesystems differ at
  this point, too - there are various limitations.  For example,
  it's problematic to grow ext[23]fs by large amounts, because
  when it is initially created, mke2fs calculates the sizes of
  certain internal data structures based on the device size,
  and those structures can't be grown significantly - only
  recreating the filesystem will do the trick.

 is it just a pipe dream?

I'd say it is... ;)

/mjt


Re: Journalling filesystem corruption fixed in between?

2007-10-03 Thread Michael Tokarev
Rustedt, Florian wrote:
 Hello list,
 
 some folks reported severe filesystem-crashes with ext3 and reiserfs on
 mdraid level 1 and 5.

I guess much stronger evidence and details are needed.
Without any additional information I for one can only make
a (not-so-pleasant) guess about those "some folks", nothing
more.  We've been running several dozen systems on raid1s and
raid5s since the 2.4 kernel (and some since 2.2 if memory serves,
with an additional patch for raid functionality) -- nothing
except the usual, mostly hardware, problems in all that time.  And
many other people use linux raid and especially the ext3 file-
system in production on large boxes under good load -- such
a corruption, were it not specific to a particular system (due
to, for example, bad ram or a faulty controller or whatever),
should cause a lot of messages here @linux-raid and elsewhere.

/mjt


Re: problem killing raid 5

2007-10-01 Thread Michael Tokarev
Daniel Santos wrote:
 I retried rebuilding the array once again from scratch, and this time
 checked the syslog messages. The reconstructions process is getting
 stuck at a disk block that it can't read. I double checked the block
 number by repeating the array creation, and did a bad block scan. No bad
 blocks were found. How could the md driver be stuck if the block is fine ?
 
 Supposing that the disk has bad blocks, can I have a raid device on
 disks that have badblocks ? Each one of the disks is 400 GB.
 
 Probably not a good idea because if a drive has bad blocks it probably
 will have more in the future. But anyway, can I ?
 The bad blocks would have to be known to the md driver.

Well, almost all modern drives can remap bad blocks (at least I know of no
drive that can't).  Most of the time this happens on write - because if such
a bad block is found during a read operation and the drive really can't
read the content of that block, it can't remap it either without losing
data.  From my experience (about 20 years, many hundreds of drives, mostly
(old) SCSI but (old) IDE too), it's pretty normal for a drive to develop
several bad blocks, especially during its first year of usage.  Sometimes,
however, the number of bad blocks grows quite rapidly, and such a drive
definitely should be replaced - at least Seagate drives are covered
by warranty in this case.

SCSI drives have 2 so-called defect lists, stored somewhere inside the
drive - the factory-preset list (bad blocks found during internal testing
when producing the drive), and the grown list (bad blocks found by the
drive during normal usage).  The factory-preset list can contain from 0 to
about 1000 entries or even more (depending on the size too); the grown list
can be as large as 500 blocks or more - whether that's fatal or not depends
on whether new bad blocks continue to be found.  We have
several drives which developed that many bad blocks in their first few
months of usage, the list stopped growing, and they're still working
just fine after 5 years.  Both defect lists can be shown by the scsitools
programs.

I don't know how one can see the defect lists on an IDE or SATA drive.

Note that the md layer (raid1, 4, 5, 6, 10 - but obviously not raid0 and
linear) is now able to repair bad blocks automatically, by forcing a
write to the same place on the drive where a read error occurred -
this usually forces the drive to reallocate that block automatically
and continue.
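This repair pass can also be triggered explicitly through sysfs.  A sketch,
assuming an array named md0 (the array name is an assumption):

```shell
# Read every block of the array; unreadable sectors are rewritten from the
# redundant copy, which is the write-back mechanism described above.
echo check > /sys/block/md0/md/sync_action

# Inconsistencies counted during the pass:
cat /sys/block/md0/md/mismatch_cnt

# "repair" additionally rewrites stripes whose copies disagree.
echo repair > /sys/block/md0/md/sync_action
```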

But in any case, md should not stall - be it during reconstruction
or not.  For this, I can't comment - to me it smells like a bug
somewhere (md layer? error handling in driver? something else?)
which should be found and fixed.  And for this, some more details
are needed I guess -- kernel version is a start.

/mjt


Re: problem killing raid 5

2007-10-01 Thread Michael Tokarev
Patrik Jonsson wrote:
 Michael Tokarev wrote:
[]
 But in any case, md should not stall - be it during reconstruction
 or not.  For this, I can't comment - to me it smells like a bug
 somewhere (md layer? error handling in driver? something else?)
 which should be found and fixed.  And for this, some more details
 are needed I guess -- kernel version is a start.
 
 Really? It's my understanding that if md finds an unreadable block
 during raid5 reconstruction, it has no option but to fail since the
 information can't be reconstructed. When this happened to me, I had to

Yes indeed, it should fail, but not get stuck as Daniel reported.
I.e., it should either complete the work or fail, but not sleep
somewhere in between.

[]
 This is why it's important to run a weekly check so md can repair blocks
 *before* a drive fails.

*nod*.

/mjt


Re: Backups w/ rsync

2007-09-29 Thread Michael Tokarev
Dean S. Messing wrote:
 Michael Tokarev writes:
[]
 : the procedure is something like this:
 : 
 :   cd /backups
 :   rm -rf tmp/
 :   cp -al $yesterday tmp/
 :   rsync -r --delete -t ... /filesystem tmp
 :   mv tmp $today
 : 
 : That is, link the previous backup to temp (which takes no space
 : except directories), rsync current files to there (rsync will
 : break links for changed files), and rename temp to $today.
 
 Very nice.  The breaking of the hardlink is the key.  I wondered about
 this when Michal was using rsync yesterday.  I just tested the idea. It
 does indeed work.

Well, others in this thread have already presented other, simpler ways,
namely using the --link-dest rsync option.  I was just too lazy to read
the man page, but I already knew other tools could do the job ;)
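For reference, the --link-dest variant collapses the cp -al step into rsync
itself.  A sketch using the same illustrative paths as the earlier snippet
(not a tested production script):

```shell
cd /backups
# Unchanged files become hard links into yesterday's tree; changed files
# are copied fresh, so each dated directory is a complete snapshot.
rsync -a --delete --link-dest=/backups/$yesterday /filesystem/ tmp/
mv tmp $today
```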

 One question: why do you not use -a instead of -r -t?  It would
 seem that one would want to preserve permissions, and group and user
 ownerships.  Also, is there a reason to _not_ preserve sym-links
 in the backup.  Your script appears to copy the referent.

Note the above -- SOMETHING like this.  I was typing from memory;
it's not an actual script, just to show the idea.  Sure, the real
script does more than that, including error checking.

/mjt


Re: Help: very slow software RAID 5.

2007-09-20 Thread Michael Tokarev
Dean S. Messing wrote:
[]
 []  That's what
 attracted me to RAID 0 --- which seems to have no downside EXCEPT
 safety :-).
 
 So I'm not sure I'll ever figure out the right tuning.  I'm at the
 point of abandoning RAID entirely and just putting the three disks
 together as a big LV and being done with it.  (I don't have quite the
 moxy to define a RAID 0 array underneath it. :-)

Putting three disks together as a big LV - that's exactly what the
linear md module does.  It's almost as unsafe as raid0, but with
linear read/write speed equal to that of a single drive...
Note also that the more drives you add to a raid0-like config,
the more chances of failure you'll have - because raid0 fails
when ANY drive fails.  Ditto - to a certain extent - for the linear
md module and for one big LV, which is basically the same thing.

By the way, before abandoning the R in RAID, I'd check whether
the resulting speed with raid5 (after at least read-ahead tuning)
is acceptable, and use that if yes.  If not, maybe raid10 over
the same 3 drives will give better results.

/mjt


Re: SWAP file on a RAID-10 array possible?

2007-08-15 Thread Michael Tokarev
Tomas France wrote:
 Thanks for the answer, David!
 
 I kind of think RAID-10 is a very good choice for a swap file. For now I
 will need to setup the swap file on a simple RAID-1 array anyway, I just
 need to be prepared when it's time to add more disks and transform the
 whole thing into RAID-10... which will be big fun anyway, for sure ;)

By the way, you don't really need raid10 for swap.  The built-in linux
swap code can utilize multiple swap areas just fine - mkswap + swapon
on multiple devices/files.  This is essentially raid0.  For raid10, the
one extra thing needed is the mirroring, which is provided by raid1.  So
when you have two drives, use a single partition on each to form a raid1
array for swap space.  If you have 4 drives, create 2 raid1 arrays and
specify them both as swap space, giving them the same priority
(pri=xxx in the swap line in fstab).  With 6 drives, have 3 raid1 arrays
and so on...  This way, the whole thing is much simpler and more
manageable.
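In fstab that could look like this (device names are illustrative); with
equal pri= values the kernel stripes swap pages across the raid1 pairs,
giving the raid10 effect:

```
/dev/md1  none  swap  sw,pri=10  0 0
/dev/md2  none  swap  sw,pri=10  0 0
```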

/mjt


Re: A raid in a raid.

2007-07-21 Thread Michael Tokarev
mullaly wrote:
[]
 All works well until a system reboot. md2 appears to be brought up before
 md0 and md1 which causes the raid to start without two of its drives.
 
 Is there anyway to fix this?

How about listing the arrays in proper order in mdadm.conf ?
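E.g., a sketch of such an mdadm.conf (UUID values omitted; a real file
should use the values from mdadm --detail): listing the component arrays
before the array stacked on top of them makes them assemble first.

```
DEVICE partitions /dev/md0 /dev/md1
ARRAY /dev/md0 UUID=...
ARRAY /dev/md1 UUID=...
ARRAY /dev/md2 UUID=...
```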

/mjt



Re: 3ware 9650 tips

2007-07-13 Thread Michael Tokarev
Joshua Baker-LePain wrote:
[]
 Yep, hardware RAID -- I need the hot swappability (which, AFAIK, is
 still an issue with md).

Just out of curiosity - what do you mean by swappability?

For many years we've been using linux software raid, and we had no
problems with swappability of the component drives (in case of drive
failures and whatnot).  With non-hotswappable drives (old scsi
and ide ones), rebooting is needed for the system to recognize the
drives.  With modern sas/sata drives, I can replace a faulty drive
without anyone noticing...  Maybe you're referring to something
else?

Thanks.

/mjt




RFC: dealing with bad blocks: another view

2007-06-13 Thread Michael Tokarev
Now the MD subsystem does a very good job of trying to
recover a bad block on a disk, by re-writing its
content (to force the drive to reallocate the block in
question) and verifying that it's written ok.

But I wonder if it's worth the effort to go further
than that.

Now, md can use bitmaps.  And a bitmap can be used
not only to mark clean vs dirty chunks, but also,
for example, good vs bad chunks.

That is to say: suppose we discover a read error on one
component of an array, and we tried to re-write it but the
rewrite (or reread) failed.  The current implementation
will kick the bad drive from the array.  But it is
possible here to not kick it, but instead to turn on the
corresponding bit(s) in the bitmap, saying the data at this
location on this drive is wrong - don't try to read from it.  And
continue using this drive as before (modulo the bits/parts
just turned on).

The rationale is: each time we kick a whole drive from
an array, for whatever reason, we greatly reduce the chances
of the whole array being in working condition.

For some reason, drives from the same batch tend to develop
bad blocks close in time to each other - I mean, we see a bad block
on one drive, and pretty soon we see another bad block on
another drive (at least in our experience).  So by kicking
one drive, we increase the failure probability even more.

We had a large batch of seagate 36g scsi drives, which all
had some issue with firmware -- each time a drive detects
a bad sector and we try to mitigate it (by rewriting it),
the drive reports a defect list manipulation error (I don't
remember the exact sense code), and only on the second attempt does it
rewrite the sector in question successfully.  Seagate
refused to acknowledge this problem, no matter how we
argued -- they said it was mishandling (i.e., that we improperly
handled the drives).

That's to show just one example of the numerous cases where such
kicking of the whole drive is not a good idea.

Even more: if we see a *read* error, there's no need to mark
this chunk as bad in the bitmap -- only if we see a *write*
error while writing some *new* data.  I.e., that bad bit in
the bitmap may mean "data at this place is out of sync", like
an extended dirty.  When interpreted like that, there's no
need to allocate a new bit; the existing dirty bit can be
used.  On resync, we try to write again, and just keep that
dirty bit if the write failed.

Obviously, we should not try to read from those dirty
places.  And if there are no components left to read from,
just return a read error - for this single read - but continue
running the array (maybe in read-only mode, whatever).

It seems pretty simple to implement with the existing
code.  The only requirement is to have a bitmap - obviously,
without the bitmap the whole idea does not work.

This fits perfectly with the "policy does not belong in the kernel"
model as well.  Never, ever try to do something large (like
kicking off the whole disk) in the kernel; let userspace decide what
to do...  Mdadm event handlers (scripts called when something
goes wrong) can kick the disk off just fine.

Comments, anyone?

/mjt


Re: Recovery of software RAID5 using FC6 rescue?

2007-05-09 Thread Michael Tokarev
Nix wrote:
 On 8 May 2007, Michael Tokarev told this:
 BTW, for such recovery purposes, I use initrd (initramfs really, but
 does not matter) with a normal (but tiny) set of commands inside,
 thanks to busybox.  So everything can be done without any help from
 external recovery CD.  Very handy at times, especially since all
 the network drivers are here on the initramfs too, so I can even
 start a netcat server while in initramfs, and perform recovery from
 remote system... ;)
 
 What you should probably do is drop into the shell that's being used to
 run init if mount fails (or, more generally, if after mount runs it

That's exactly what my initscript does ;)

chk() {
  while ! "$@"; do
    warn "the following command failed:"
    warn "$*"
    p="** Continue(Ignore)/Shell/Retry (C/s/r)? "
    while : ; do
      if ! read -t 10 -p "$p" x 2>&1; then
        echo "(timeout, continuing)"
        return 1
      fi
      case "$x" in
        [Ss!]*) /bin/sh 2>&1 ;;
        [Rr]*) break;;
        [CcIi]*|"") return 1;;
        *) echo "(unrecognized response)";;
      esac
    done
  done
}

chk mount -n -t proc proc /proc
chk mount -n -t sysfs sysfs /sys
...
info "mounting $rootfstype fs on $root (options: $rootflags)"
chk mount -n -t "$rootfstype" -o "$rootflags" "$root" /root
if [ $? != 0 ] && ! grep -q "^[^ ]\+ /root " /proc/mounts; then
  warn "root filesystem ($rootfstype on $root) is NOT mounted!"
fi
...
...

 hasn't ended up mounting anything: there's no need to rely on mount's
 success/failure status). [...]

Well, so far the exit code has been reliable.

/mjt


Re: No such device on --remove

2007-05-09 Thread Michael Tokarev
Bernd Schubert wrote:
 Benjamin Schieder wrote:
 
 
 [EMAIL PROTECTED]:~# mdadm /dev/md/2 -r /dev/hdh5
 mdadm: hot remove failed for /dev/hdh5: No such device

 md1 and md2 are supposed to be raid5 arrays.
 
 You are probably using udev, don't you? Somehow there's presently
 no /dev/hdh5, but to remove /dev/hdh5 out of the raid, mdadm needs this
 device. There's a workaround, you need to create the device in /dev using
 mknod and then you can remove it with mdadm.

In case the /dev/hdh5 device node is missing, mdadm will complain
"No such file or directory" (ENOENT), instead of "No such device"
(ENODEV).

In this case, as I explained in my previous email, the arrays aren't
running, and the error refers to manipulations (md ioctls) with existing
/dev/md/2.

It has nothing to do with udev.

/mjt


Re: removed disk md-device

2007-05-09 Thread Michael Tokarev
Bernd Schubert wrote:
 Hi,
 
 we are presently running into a hotplug/linux-raid problem.
 
 Lets assume a hard disk entirely fails or a stupid human being pulls it out 
 of 
 the system. Several partitions of the very same hardisk are also part of 
 linux-software raid. Also, /dev is managed by udev.
 
 Problem-1) When the disk fails, udev will remove it from /dev. Unfortunately 
 this will make it impossible to remove the disk or its partitions 
 from /dev/mdX device, since mdadm tries to read the device fail and will 
 abort if this file is not there.

What do you mean by "fails" here?

All the device information is still here, look at /sys/block/mdX/md/rdY/block .
Even if, say, sda (which was a part of md0) disappeared, there will still be
/sys/block/sda directory, because md subsystem keeps it open.  Yes the device
node may be removed by udev (oh how i dislike udev!), but all the info is still
here.  Also, all the info is in the array information available using ioctl.

mdadm can work it out from here, but it's a bit ugly.

 Problem-2) Even though the kernel detected the device to not exist anymore, 
 it 
 didn't inform its md-layer about this event. The md-layer will first detect 
 non-existent disk, if a read or write attempt to one of its raid-partitions 
 fails. Unfortunately, if you are unluckily, it might never detect that, e.g. 
 for raid1 devices.

This is backwards.

"If you're unlucky" should be the opposite -- "you're lucky".  Well ok, it
really depends on other things.  Because if the md layer does not detect a
failed disk, it means that disk hasn't been needed so far (because any
attempt to do I/O on it would fail, and the disk would be kicked off the
array).  And since there was no need for that disk, no changes have been
made to the array (because on any change, all disks would be written to).
Which, in turn, means either of:

 a) the disk will reappear (there are several failure modes; sometimes just
   a bus rescan or powercycle will do the trick), no one will even notice,
   and everything will be ok.

 b) the disk is dead.  And I think this is where you say "unlucky" - because
  for quite some (unknown amount of) time, the array will be running in
  degraded mode, instead of enabling/resyncing a hot spare etc.

Again: it depends on the failure scenario.  What to do here is questionable,
because a) contradicts b).  So far, I haven't seen many disks dying (well,
maybe 2 or 3 times), but I've seen disks disappearing randomly for no
apparent reason, where a bus reset or powercycle brings them back just fine.
So for me, this is the "lucky" behaviour.. ;)

Also, with all the modern hotpluggable drives (usb, sata, hotpluggable scsi,
and esp. networked storage, where the network may add its own failure modes),
it's much easier to make a device disappear - by touching cables, for
example - this is case a).

 I think there should be several solutions to these problems.
 
 1) Before udev removes a device file, it should run a pre-remove script, 
 which 
 should check if the device is listed in /proc/mdstat and if it is listed 
 there, it should run mdadm to remove this device from the.
 Does udev presently support to run pre-remove scripts?
 
 2.) As soon as the kernel detects a failed device, it should also inform the 
 md layer.

See above: it depends.

 3.) Does mdadm really need the device?

No it doesn't.  In order to fail or remove a component device from
an array, only the major:minor number is needed.  Device nodes aren't needed
even to assemble an array, but only when doing it the dumb way - during
normal assembly, mdadm examines the devices and tries to add some
intelligence to the process, and for that, device nodes really are
necessary.  But not for hot-removals.

/mjt


Re: Swapping out for larger disks

2007-05-08 Thread Michael Tokarev
Brad Campbell wrote:
[]
 It occurs though that the superblocks would be in the wrong place for
 the new drives and I'm wondering if the kernel or mdadm might not find
 them.

I once had a similar issue, and wrote a tiny program (a hack, sort of)
to read or write an md superblock from/to a component device.  The only
thing it really does is calculate the superblock location - exactly
as it is done in mdadm.  Here it is:

  http://www.corpit.ru/mjt/mdsuper.c

Usage is like:

 mdsuper read /dev/old-device | mdsuper write /dev/new-device

(or using an intermediate file).

So you're doing something like this:

 shutdown array
 for i in all-devices-in-array; do
   dd if=old-device[i] of=new-device[i] iflag=direct oflag=direct
   mdsuper read old-device[i] | mdsuper write new-device[i]
 done
 assemble-array-on-new-devices
 mdadm -G --size=max /dev/mdx

or something like that.

Note that the program does not work for anything but 0.90
superblocks (I haven't used 1.0 superblocks yet - 0.90 works
for me just fine).  However, it should be trivial to extend
it to handle v1 superblocks too.

Note also that it's trivial to do something like this in shell,
with blockdev --getsz to get the device size, some shell-
style $((math)), and dd magic.

And a 3rd note: using direct I/O as above speeds up the copying *a lot*,
while keeping system load at zero.  Without it, even with one pair of
disks the system is doing nothing but the copying...
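For the 0.90 format, that shell math boils down to this sketch: the 0.90
superblock sits in the last 64KiB-aligned 64KiB block of the device.  The
sector count below is an arbitrary example, not a value from the post:

```shell
sectors=781417472                       # device size in 512-byte sectors (blockdev --getsz)
size_kb=$(( sectors / 2 ))              # size in KiB
offset_kb=$(( (size_kb & ~63) - 64 ))   # round down to 64KiB, step back one block
echo $offset_kb
# then e.g.: dd if=$dev bs=1024 skip=$offset_kb count=64 of=superblock.bin
```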

/mjt


Re: No such device on --remove

2007-05-08 Thread Michael Tokarev
Benjamin Schieder wrote:
 Hi list.
 
 md2 : inactive hdh5[4](S) hdg5[1] hde5[3] hdf5[2]
   11983872 blocks

 [EMAIL PROTECTED]:~# mdadm -R /dev/md/2
 mdadm: failed to run array /dev/md/2: Input/output error
 [EMAIL PROTECTED]:~# mdadm /dev/md/
 0  1  2  3  4  5
 [EMAIL PROTECTED]:~# mdadm /dev/md/2 -r /dev/hdh5
 mdadm: hot remove failed for /dev/hdh5: No such device
 
 md1 and md2 are supposed to be raid5 arrays.

The arrays are inactive.  In this condition, an array can be
shut down, or brought up by adding another disk with a proper
superblock.  So running it isn't possible because the kernel thinks
the array is inconsistent, and removing isn't possible because
the array isn't running.

It's inactive because when mdadm tried to assemble it, it didn't
find enough devices with recent-enough event counter.  In other
words, the raid superblocks on the individual drives are inconsistent
(some are older than others).

If the problem is due to a power failure, fixing the situation is
usually just a matter of adding the -f (force) option to the mdadm
assemble line, letting mdadm increment the almost-recent drive's event
counter before bringing the array up.
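With the devices from the /proc/mdstat snippet above, that would look
something like this (a sketch; stop the inactive array first):

```shell
mdadm --stop /dev/md/2
mdadm --assemble --force /dev/md/2 /dev/hde5 /dev/hdf5 /dev/hdg5 /dev/hdh5
```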

/mjt


Re: Recovery of software RAID5 using FC6 rescue?

2007-05-08 Thread Michael Tokarev
Mark A. O'Neil wrote:
 Hello,
 
 I hope this is the appropriate forum for this request if not please
 direct me to the correct one.
 
 I have a system running FC6, 2.6.20-1.2925, software RAID5 and a power
 outage seems to have borked the file structure on the RAID.
 
 Boot shows the following disks:
 sda #first disk in raid5: 250GB
 sdb #the boot disk: 80GB
 sdc #second disk in raid5: 250GB
 sdd #third disk in raid5: 250GB
 sde #fourth disk in raid5: 250GB
 
 When I boot the system kernel panics with the following info displayed:
 ...
 ata1.00: cd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
 exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
 ata1.00: (BMDMA stat 0x25)
 ata1.00: cd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
 EXT3-fs error (device sda3) ext_get_inode_loc: unable to read inode block
 -inode=8, block=1027
 EXT3-fs: invalid journal inode
 mount: error mounting /dev/root on /sysroot as ext3: invalid argument
 setuproot: moving /dev failed: no such file or directory
 setuproot: error mounting /proc:  no such file or directory
 setuproot: error mounting /sys:  no such file or directory
 switchroot: mount failed: no such file of directory
 Kernel panic - not synching: attempted to kil init!

Wug.

 At which point the system locks as expected.
 
 Another perhaps not related tidbit is when viewing sda1 using  (I think
 I did not write down the command) mdadm --misc --examine device I see
 (inpart) data describing the device in the array:
 
 sda1 raid 4, total 4, active 4, working 4
 and then a listing of disks sdc1, sdd1, sde1 all of which show
 
 viewing the remaining disks in the list shows:
 sdX1 raid 4, total 3, active 3, working 3

You sure it's raid4, not raid5?  Because if it really is raid4, but you
had a raid5 array before, you're screwed, and the only way to recover is
to re-create the array (without losing data), re-writing the superblocks
(see below).

BTW, --misc can be omitted - you only need

  mdadm -E /dev/sda1

 and then a listing of the disks with the first disk being shonw as removed.
 It seems that the other disks do not have a reference to sda1? That in
 itself is perplexing to me but I vaguely recall seeing that before - it
 has been awhile since I set the system up.

Check the UUID values on all drives (also from mdadm -E output) - they
should be the same.  And compare the Events field there too.  Maybe you had
a 4-disk array before, but later re-created it with 3 disks?  Another
possible cause is disk failures resulting in bad superblock reads, but
that's highly unlikely.

 Anyway, I think the ext3-fs error is less an issue with the software
 raid and more an issue that fsck could fix. My problem is how to
 non-destructively mount the raid from the rescue disk so that I can run
 fsck on the raid. I do not think mounting and running fsck on the
 individual disks is the correct solution.
 
 Some straight forward instructions (or a pointer to some) on doing this
 from the rescue prompt would be most useful. I have been searching the
 last couple evenings and have yet to find something I completely
 understand. I have little experience with software raid and mdadm and
 while this is an excellent opportunity to learn a bit (and I am) I would
 like to successfully recover my data in a more timely fashion rather
 than mess it up beyond recovery as the result of a dolt interpretation
 of a man page. The applications and data itself is replaceable - just
 time consuming as in days rather than what I hope, with proper
 instruction, will amount to an evening or two worth of work to mount the
 RAID and run fsck.

Not sure about pointers.  But here are some points.

Figure out which arrays/disks you really had.  The raid level and number
of drives are really important.

Now two mantras:

  mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1

This will try to bring the array up.  It will either come up ok, or will
fail due to event count mismatches (more than 1 difference).

In case you have more than 1 mismatch, you can try adding the --force
option, to tell mdadm to ignore mismatches and try the best it can.  The
array won't resync; it will be started from the best (n-1) drives.

If there's a drive error, you can omit the bad drive from the command
and assemble a degraded array, but before doing so, see which drives
are more fresh (by examining the Event counts in mdadm -E output).  If
one of the remaining drives has a (much) lower event count than the rest,
while the bad one is (more-or-less) good, there's a good chance you have a
bad (unrecoverable) filesystem.  This happens if the lower-events drive
was kicked off the array (for whatever reason) long before your
last disaster happened, and hence it contains very old data, leaving you
very few chances to recover without the bad drive.

And another mantra, which can be helpful if assemble doesn't work for
some reason:

  mdadm --create /dev/md0 --level=5 --num-devices=4 --layout=x 

Re: s2disk and raid

2007-04-04 Thread Michael Tokarev
Neil Brown wrote:
 On Tuesday April 3, [EMAIL PROTECTED] wrote:
[]
 After the power cycle the kernel boots, devices are discovered, among
 which the ones holding raid. Then we try to find the device that holds
 swap in case of resume and / in case of a normal boot.

 Now comes a crucial point. The script that finds the raid array, finds
 the array in an unclean state and starts syncing.
[]
 So you can start arrays 'readonly', and resume off a raid1 without any
 risk of the the resync starting when it shouldn't.

But I wonder why this raid is necessary in the first place.
For raid1, assuming the superblock is at the end, the only
thing needed for resume is one component of the mirror.  I.e.,
if your raid array is (was) composed of hda1 and hdb1, either
of the two will do as the source of the resume image.  The trick is to
find which, in case the array was degraded -- and mdadm does the
job here, but assembling it isn't really necessary.  Maybe mdadm
could be told to examine the component devices and write a short
line to stdout *instead* of doing a real assembly (like mdadm -A --dummy),
to show the most recent component, and the offset if the superblock
is at the beginning... having that, it would be possible to resume
from that component directly...

By the way, my home-grown initramfs stuff accepts several devices
on the resume= command line, and tries each in turn.  If the main disks
have more-or-less stable names, this may be an alternative way.
That is, just give the component devices in the resume= line...

Yes, this way it may do some weird things in case the original
swap array was degraded (with the first component, which contained a
valid resume image, removed from the array)...  But it's not really
a big issue, since - usually anyway - if one uses resume=, the
machine in question isn't some remote box 100 miles away; it's
right here, and it's ok to bypass the resume for recovery purposes.

Just some random thoughts.

/mjt


Re: Swap initialised as an md?

2007-03-23 Thread Michael Tokarev
Bill Davidsen wrote:
[]
 If you use RAID0 on an array it will be faster (usually) than just
 partitions, but any process with swapped pages will crash if you lose
 either drive. With RAID1 operation will be more reliable but no faster.
 If you use RAID10 the array will be faster and more reliable, but most
 recovery CDs don't know about RAID10 swap. Any reliable swap will also
 have the array size smaller than the sum of the partitions (you knew that).

You seem to have forgotten to mention 2 more things:

 o swap isn't usually needed for recovery CDs

 o the kernel vm subsystem can already do the equivalent of raid0 for swap
   internally, by means of allocating several block devices for swap space
   with the same priority.

If reliability (of swapped processes) is important, one can create several
RAID1 arrays and raid0 them using regular vm techniques.  The result will
be RAID10 for swap.

/mjt


Re: Raid 10 Problems?

2007-03-08 Thread Michael Tokarev
Jan Engelhardt wrote:
[]
 The other thing is, the bitmap is supposed to be written out at intervals,
 not at every write, so the extra head movement for bitmap updates should
 be really low, and not making the tar -xjf process slower by half a minute.
 Is there a way to tweak the write-bitmap-to-disk interval? Perhaps 
 something in /sys or ye olde /proc. Maybe linux-raid@ knows 8)

Hmm.  Bitmap is supposed to be written before actual data write, to mark
the to-be-written areas of the array as being written, so that those
areas can be detected and recovered in case of power failure during
actual write.

So in case of writing to a clean array, head movement always takes place -
first go to the bitmap area, then to the actual data area.

The "written out at intervals" part is about clearing the bitmap bits after
some idle time.

In other words, dirtying bitmap bits occurs right before actual write,
and clearing bits occurs at intervals.

Sure, if you write to (or near) the same place again and again, without
giving the md subsystem a chance to actually clean the bitmap, there will
be no additional head movement.  And that is often the case for tar -xjf,
since the filesystem will place the files being extracted close to
each other, thus hitting the same bit in the bitmap, so md will skip
the repeated bitmap updates.
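There is no direct sysfs knob for the clearing interval in this vintage of md, but two mdadm options influence the behaviour: the bitmap chunk size (how much area each bit covers, and thus how often clustered writes touch a new bit) and, if memory serves, a delay between bitmap flushes. A hedged sketch (array name hypothetical; the bitmap has to be removed before re-adding it with different parameters):

```shell
mdadm --grow /dev/md0 --bitmap=none
# larger chunks => fewer distinct bits dirtied by clustered writes;
# --delay is seconds between bitmap updates (an assumption that it is
# accepted in grow mode -- check your mdadm version)
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=16384 --delay=10
```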

/mjt


Re: nonzero mismatch_cnt with no earlier error

2007-02-24 Thread Michael Tokarev
Jason Rainforest wrote:
 I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5,
 multiple controllers on Linux 2.6.19.2, SMP x86-64 on Athlon64 X2 4200
 +).
 
 I then ordered a resync. The mismatch_cnt returned to 0 at the start of

As pointed out later it was repair, not resync.

 the resync, but around the same time that it went up to 8 with the
 check, it went up to 8 in the resync. After the resync, it still is 8. I
 haven't ordered a check since the resync completed.

As far as I understand, repair will do the same as check does, but ALSO
will try to fix the problems found.  So the number in mismatch_cnt after
a repair will indicate the number of mismatches found _and fixed_.
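For reference, the check/repair cycle driven via sysfs (array name hypothetical):

```shell
echo check  > /sys/block/md0/md/sync_action
cat /sys/block/md0/md/mismatch_cnt    # mismatches found so far
echo repair > /sys/block/md0/md/sync_action
# after repair completes, mismatch_cnt shows how many were found and fixed;
# run another check afterwards to confirm it reads 0
```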

/mjt


Re: Move superblock on partition resize?

2007-02-07 Thread Michael Tokarev
Rob Bray wrote:
 I am trying to grow a raid5 volume in-place. I would like to expand the
 partition boundaries, then grow raid5 into the newly-expanded partitions.
 I was wondering if there is a way to move the superblock from the end of
 the old partition to the end of the new partition. I've tried dd
 if=/dev/sdX1 of=/dev/sdX1 bs=512 count=256
 skip=(sizeOfOldPartitionInBlocks - 256) seek=(sizeOfNewPartitionInBlocks -
 256) unsuccessfully. Also, copying the last 128KB (256 blocks) of the old
 partition before the table modification to a file, and placing that data
 at the tail of the new partition also yields no beans. I can drop one
 drive at a time from the group, change the partition table, then hot-add
 it, but a resync times 7 drives is a lot of juggling. Any ideas?

The superblock location is somewhat tricky to calculate correctly.

I've used a tiny program (attached) for exactly this purpose.

/mjt
/* mdsuper: read or write a linux software raid superblock (version 0.90)
 * from or to a given device.
 *
 * GPL.
 * Written by Michael Tokarev ([EMAIL PROTECTED])
 */

#define _GNU_SOURCE
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/raid/md_p.h>
#include <linux/fs.h>

int main(int argc, char **argv) {
  unsigned long long dsize;
  unsigned long long offset;
  int mdfd;
  int n;
  mdp_super_t super;
  const char *dev;

  if (argc != 3) {
    fprintf(stderr, "mdsuper: usage: mdsuper {read|write} mddev\n");
    return 1;
  }

  if (strcmp(argv[1], "read") == 0)
    n = O_RDONLY;
  else if (strcmp(argv[1], "write") == 0)
    n = O_WRONLY;
  else {
    fprintf(stderr, "mdsuper: read or write arg required, not \"%s\"\n",
            argv[1]);
    return 1;
  }

  dev = argv[2];
  mdfd = open(dev, n, 0);
  if (mdfd < 0) {
    perror(dev);
    return 1;
  }

  if (ioctl(mdfd, BLKGETSIZE64, &dsize) < 0) {
    perror(dev);
    return 1;
  }

  if (dsize < MD_RESERVED_SECTORS*2) {
    fprintf(stderr, "mdsuper: %s is too small\n", dev);
    return 1;
  }

  offset = MD_NEW_SIZE_SECTORS(dsize>>9);

  fprintf(stderr, "size=%Lu (%Lu sect), offset=%Lu (%Lu sect)\n",
          dsize, dsize>>9, offset * 512, offset);
  offset *= 512;

  if (n == O_RDONLY) {
    if (pread64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
      perror(dev);
      return 1;
    }
    if (super.md_magic != MD_SB_MAGIC) {
      fprintf(stderr, "%s: bad magic (0x%08x, should be 0x%08x)\n",
              dev, super.md_magic, MD_SB_MAGIC);
      return 1;
    }
    if (write(1, &super, sizeof(super)) != sizeof(super)) {
      perror("write");
      return 1;
    }
  }
  else {
    if (read(0, &super, sizeof(super)) != sizeof(super)) {
      perror("read");
      return 1;
    }
    if (pwrite64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
      perror(dev);
      return 1;
    }
  }

  return 0;
}


Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba - RAID5)

2007-01-23 Thread Michael Tokarev
Justin Piszcz wrote:
[]
 Is this a bug that can or will be fixed or should I disable pre-emption on 
 critical and/or server machines?

Disabling pre-emption on critical and/or server machines seems to be a good
idea in the first place.  IMHO anyway.. ;)

/mjt


Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba - RAID5)

2007-01-23 Thread Michael Tokarev
Justin Piszcz wrote:
 
 On Tue, 23 Jan 2007, Michael Tokarev wrote:
 
 Disabling pre-emption on critical and/or server machines seems to be a good
 idea in the first place.  IMHO anyway.. ;)

 So bottom line is make sure not to use preemption on servers or else you 
 will get weird spinlock/deadlocks on RAID devices--GOOD To know!

This is not a reason.  The reason is that preemption usually works worse
on servers, esp. highly loaded servers - the more often you interrupt
(kernel) work, the more needless context switches you'll have, and the
slower the whole thing works.

Another point is that with preemption enabled, we have more chances to
hit one or another bug somewhere.  Those bugs should be found and fixed
for sure, but important servers/data usually aren't the place for bughunting.

/mjt


Re: raid5 software vs hardware: parity calculations?

2007-01-15 Thread Michael Tokarev
dean gaudet wrote:
[]
 if this is for a database or fs requiring lots of small writes then 
 raid5/6 are generally a mistake... raid10 is the only way to get 
 performance.  (hw raid5/6 with nvram support can help a bit in this area, 
 but you just can't beat raid10 if you need lots of writes/s.)

A small nitpick.

At least some databases never do small-sized I/O, at least not against
the datafiles.  That is, for example, Oracle uses a fixed I/O block
size, specified at database (or tablespace) creation time -- by default
it's 4Kb or 8Kb, but it may be 16Kb or 32Kb as well.  Now, if you make your
raid array's stripe size match the block size of the database, *and* ensure
the files are aligned on disk properly, it will just work without needless
reads to calculate parity blocks during writes.

But the problem with that is it's near impossible to do.

First, even if the db writes in 32Kb blocks, it means the stripe size should
be 32Kb, which is only suitable for raid5 with 3 disks and a chunk size of
16Kb, or with 5 disks and a chunk size of 8Kb (this last variant is quite
bad, because an 8Kb chunk size is too small).  In other words, only a very
limited set of configurations will be more-or-less good.

And second, most filesystems used for databases don't care about correct
file placement.  For example, ext[23]fs, with a maximum blocksize of 4Kb, will
align files by 4Kb, not by stripe size - which means that a whole 32Kb block
may be laid out with the first 4Kb at the end of one stripe and the remaining
28Kb on the next stripe, so for both parts a full read-write cycle will be
needed again to update the parity blocks - the very thing we tried to avoid
by choosing the sizes in the previous step.  Only xfs so far (from the list
of filesystems I've checked) pays attention to stripe size and tries to
ensure files are aligned to it.  (Yes, I know about mke2fs's stride=xxx
parameter, but it only affects metadata, not data).
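To make the 3-disk/16Kb-chunk variant concrete, here is a hedged sketch (device names hypothetical) of creating such an array and telling xfs about its geometry:

```shell
# 3-disk raid5 with a 16Kb chunk => 32Kb of data per stripe
mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=16 \
      /dev/sda1 /dev/sdb1 /dev/sdc1
# su = chunk size, sw = number of data disks (3 disks - 1 parity)
mkfs.xfs -d su=16k,sw=2 /dev/md0
```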

That's why all the above is a small nitpick - i.e., in theory, it IS possible
to use raid5 for database workload in certain cases, but due to all the gory
details, it's nearly impossible to do right.

/mjt


Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)

2007-01-12 Thread Michael Tokarev
Justin Piszcz wrote:
 Using 4 raptor 150s:
 
 Without the tweaks, I get 111MB/s write and 87MB/s read.
 With the tweaks, 195MB/s write and 211MB/s read.
 
 Using kernel 2.6.19.1.
 
 Without the tweaks and with the tweaks:
 
 # Stripe tests:
 echo 8192 > /sys/block/md3/md/stripe_cache_size
 
 # DD TESTS [WRITE]
 
 DEFAULT: (512K)
 $ dd if=/dev/zero of=10gb.no.optimizations.out bs=1M count=10240
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 96.6988 seconds, 111 MB/s
[]
 8192K READ AHEAD
 $ dd if=10gb.16384k.stripe.out of=/dev/null bs=1M
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 64.9454 seconds, 165 MB/s

What exactly are you measuring?  Linear read/write, like copying one
device to another (or to /dev/null), in large chunks?

I don't think it's an interesting test.  Hint: how many times a day
do you plan to perform such a copy?

(By the way, for a copy of one block device to another, try using
O_DIRECT, with two dd processes doing the copy - one reading, and
another writing - this way, you'll get the best results without a huge
effect on other things running on the system.  Like this:

 dd if=/dev/onedev bs=1M iflag=direct |
 dd of=/dev/twodev bs=1M oflag=direct
)

/mjt


Re: RAID1 root and swap and initrd

2006-12-21 Thread Michael Tokarev
[A late follow-up]

Bill Davidsen wrote:
 Michael Tokarev wrote:
 Andre Majorel wrote:
 []
  
 Thanks Jurriaan and Gordon. I think I may still be f*cked,
 however. The Lilo doc says you can't use raid-extra-boot=mbr-only
 if boot= does not point to a raid device. Which it doesn't because
 in my setup, boot=/dev/sda.

 Using boot=/dev/md5 would solve the raid-extra-boot issue but the
 components of /dev/md5 are not primary partitions (/dev/sda5,
 /dev/sdb5) so I don't think that would work.

 So just move it to sda1 (or sda2, sda3) from sda5, ensure you've
 two identical drives (or at least your boot partitions are layed
 up identically), and use boot=/dev/md1 (or md2, md3).  Do NOT
 use raid-extra-boot (set it to none), but set up standard mbr
 code into boot sector of both drives (in debian, it's 'mbr' package;
 lilo can be used for that too - once for each drive), and mark your
 boot partition on both drives as active.

 This is the most clean setup to boot off raid.  You'll have two
 drives, both will be bootable, and both will be updated when
 you'll run lilo.

 Another bonus - if you'll ever install a foreign OS on this system,
 which tends to update boot code, all your stuff will still be intact -
 the only thing you'll need to do to restore linux boot is to reset
 'active' flags for your partitions (and no, winnt disk manager does
 not allow you to do so - no ability to set non-dos (non-windows)
 partition active).

 I *could* run lilo once for each disk after tweaking boot= in
 lilo.conf, or just supply a different -M option but I'm not sure.
 The Lilo doc is not terribly enlightening. Not for me, anyway. :-)

 No, don't do that. Even if you can automate it.  It's error-prone
 to say the best, and it will bite you at an unexpected moment.
 
 The desirable solution is to use the DOS MBR (boot active partition) and
 put the boot stuff in the RAID device. However, you can just write the
 MBR to the hda and then to hdb. Note that you don't play with the
 partition names, the 2nd MBR will only be used if the 1st drive fails,
 and therefore at the BIOS level the 2nd drive will now be hda (or C:) if
 LILO still uses the BIOS to load the next sector.

Just a small note.  The DOS MBR can't boot from a non-primary partition.
In this case, the boot code is at sda5, which is on an extended partition.

But.  Lilo now has the ability to write its own MBR, which, in turn, IS
able to boot off a logical partition:

 lilo -M /dev/sda ext

/mjt


Re: RAID1 root and swap and initrd

2006-12-16 Thread Michael Tokarev
Andre Majorel wrote:
[]
 Thanks Jurriaan and Gordon. I think I may still be f*cked,
 however. The Lilo doc says you can't use raid-extra-boot=mbr-only
 if boot= does not point to a raid device. Which it doesn't because
 in my setup, boot=/dev/sda.
 
 Using boot=/dev/md5 would solve the raid-extra-boot issue but the
 components of /dev/md5 are not primary partitions (/dev/sda5,
 /dev/sdb5) so I don't think that would work.

So just move it to sda1 (or sda2, sda3) from sda5, ensure you've
two identical drives (or at least that your boot partitions are laid
out identically), and use boot=/dev/md1 (or md2, md3).  Do NOT
use raid-extra-boot (set it to none), but set up standard mbr
code in the boot sector of both drives (in debian, it's the 'mbr' package;
lilo can be used for that too - once for each drive), and mark your
boot partition on both drives as active.
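The setup described above might look like this (a sketch; device and partition numbers are hypothetical, and the lilo.conf fragment is shown as comments):

```shell
# in lilo.conf:
#   boot=/dev/md1
#   raid-extra-boot=none
lilo                            # install the boot loader on the raid device
lilo -M /dev/sda                # standard MBR code on each drive
lilo -M /dev/sdb
parted /dev/sda set 1 boot on   # mark the boot partition active on both
parted /dev/sdb set 1 boot on
```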

This is the most clean setup to boot off raid.  You'll have two
drives, both will be bootable, and both will be updated when
you'll run lilo.

Another bonus - if you'll ever install a foreign OS on this system,
which tends to update boot code, all your stuff will still be intact -
the only thing you'll need to do to restore linux boot is to reset
'active' flags for your partitions (and no, winnt disk manager does
not allow you to do so - no ability to set non-dos (non-windows)
partition active).

 I *could* run lilo once for each disk after tweaking boot= in
 lilo.conf, or just supply a different -M option but I'm not sure.
 The Lilo doc is not terribly enlightening. Not for me, anyway. :-)

No, don't do that.  Even if you can automate it.  It's error-prone
to say the least, and it will bite you at an unexpected moment.

 A nice little tip I learned from someone else on the list is to
 have your md devices named after the partition numbers. So:

That'd be me... ;)

/mjt


Re: RAID1 root and swap and initrd

2006-12-16 Thread Michael Tokarev
Andre Majorel wrote:
[]
 So just move it to sda1 (or sda2, sda3) from sda5
 
 Problem is, the disks are entirely used by an extended partition.
 There's nowhere to move sd?5 to.

You're using raid, so you've at least two disk drives.

Remove one component (the second disk) from all your raid devices,
repartition that disk, and re-add the components back - this will copy
the data over to the second disk.  Then repeat the same procedure with
the first disk.

Or something like that -- probably only a single partition (on both
disks) needs to be recreated this way.

 I think it's possible to turn the first logical partition into a
 primary partition by modifying the partition table on the MBR but
 I'm not sure I'm up for that.

It's possible to move sda5 to sda1, but not easily - because at the
start of an extended partition there's a (relatively large) space
reserved for the logical partitions, you can't just relabel your
partitions; you have to actually move the data.

/mjt


Re: why not make everything partitionable?

2006-11-15 Thread Michael Tokarev
martin f krafft wrote:
 Hi folks,
 
 you cannot create partitions within partitions, but you can well use
 whole disks for a filesystem without any partitions.

It's usually better to have a partition table in place, at least on x86.
Just to stop possible confusion - be it from the kernel, or from an
inability to identify disks properly (think [c]fdisk displaying labels),
or from anything else.  But ok.

 Along the same lines, I wonder why md/mdadm distinguish between
 partitionable and non-partitionable in the first place. Why isn't
 everything partitionable?

It's both historic (before, there were no partitionable md arrays)
and due to the fact that the number of partitions is limited by
a single major number (i.e., 256 (sub)partitions max).

Maybe there are other reasons - I don't have a definite answer.

/mjt


Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.

2006-11-09 Thread Michael Tokarev
Neil Brown wrote:
[/dev/mdx...]
 (much like how /dev/ptmx is used to create /dev/pts/N entries.)
[]
 I have the following patch sitting in my patch queue (since about
 March).
 It does what you suggest via /sys/module/md-mod/parameters/MAGIC_FILE
 which is the only md-specific part of the /sys namespace that I could
 find.
 
 However I'm not at all convinced that it is a good idea.  I would much
 rather have mdadm control device naming than leave it up to udev.

This is again the same device naming question as pops up every time
someone mentions udev.  And as usual, I'm suggesting the following, which
should - hopefully - make everyone happy:

  create kernel names *always*, be it /dev/mdN or /dev/sdF or whatever,
  so that things like /proc/partitions, /proc/mdstat etc will be useful.
  For this, the ideal solution - IMHO - is to have mini-devfs-like filesystem
  mounted as /dev, so that it is possible to have bare names without any
  help from any external programs like udev, but I don't want to start another
  flamewar here, esp. since it's off-topic to *this* discussion.
  Note /dev/mdN is as good as /dev/md/N - because only a few active
  devices appear in /dev, there's no risk of having too many files in
  /dev, hence no need to put them into subdirs like /dev/md/, /dev/sd/ etc.

  if so desired, create *symlinks* at /dev with appropriate user-controlled
  names to those official kernel device nodes.  Be it like /dev/disk/by-label/
  or /dev/cdrom0 or whatever.
  The links can be created by mdadm, OR by udev - in this case, it's really
  irrelevant.  Udev rules does a good job of creating /dev/disk/ hierarchy
  already, and that seems to be sufficient - i see no reason to make other
  device nodes (symlinks) by mdadm.

By the way, unlike /dev/sdE and /dev/hdF entries, /dev/mdN nodes are pretty
stable.  Even if scsi disks get reordered, mdadm finds the component devices
by UUID (if DEVICE partitions is given in the config file), and you have
/dev/md1 pointing to the same logical partition (the same filesystem or data)
regardless of how you shuffle your disks (IF mdadm was able to find all the
components and assemble the array, anyway).  So sometimes I use md/mdadm on
systems WITHOUT any raided drives, but where I suspect disk devices may change
for whatever reason - I just create raid0 arrays composed of a single
partition and let mdadm find them in /dev/sd* and assemble stable-numbered
/dev/mdN devices - without any help from udev or anything else (I for one
dislike udev for several reasons).

 An in any case, we have the semantic that opening an md device-file
 creates the device, and we cannot get rid of that semantic without a
 lot of warning and a lot of pain.  And adding a new semantic isn't
 really going to help.

I don't think so.  With the new semantics in place, we've two options
(provided the current semantics stays, and I don't see a strong reason why
it should be removed except for the bloat):

 a) with new mdadm utilizing the new semantics, there's nothing to change in
    udev -- it will all Just Work, by mdadm opening /dev/md-control-node
    (however it's called) and assembling devices using that, and during
    assembly, udev will receive proper events about new disks appearing and
    will handle them as usual.

 b) without new mdadm, it will work as before (now).  And in this case,
    let's not send any udev events, as mdadm already created the nodes etc.

So if a user wants neat and nice md/udev integration, the way to go is case a.
If it's not required, either case will do.

Sure, eventually, long term, support for case b can be removed.  Or not -
depending on how the things will be implemented, because when done properly,
both cases will call the same routine(s), but case b will just skip sending
uevents, so the ioctl handlers become two- or one-liners (two in case a and
one in case b), which isn't bloat really ;)

/mjt


Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.

2006-11-09 Thread Michael Tokarev
Michael Tokarev wrote:
 Neil Brown wrote:
 [/dev/mdx...]
[]
 An in any case, we have the semantic that opening an md device-file
 creates the device, and we cannot get rid of that semantic without a
 lot of warning and a lot of pain.  And adding a new semantic isn't
 really going to help.
 
 I don't think so.  With new semantic in place, we've two options (provided
 current semantics stays, and I don't see a strong reason why it should be
 removed except of the bloat):
 
  a) with new mdadm utilizing the new semantics, there's nothing to change
 in udev -- it will all Just Work, by mdadm opening /dev/md-control-node
 (however it's called) and assembling devices using that, and during
 assembly, udev will receive proper events about new disks appearing and
 will handle them as usual.
 
  b) without new mdadm, it will work as before (now).  And in this case,
 let's not send any udev events, as mdadm already created the nodes etc.

Forgot to add.  This is an important point: do NOT change the current
behaviour wrt uevents, i.e., don't add uevents for the current semantics at
all.  Only send uevents (and in this case they will be normal add and remove
events) when assembling arrays the new way, using the (stable!)
/dev/mdcontrol misc device, after the RUN_ARRAY and STOP_ARRAY actions have
been performed.

/mjt

 So if a user wants neat and nice md/udev integration, the way to go is case 
 a.
 If it's not required, either case will do.



mdadm: bitmaps not supported by this kernel?

2006-10-25 Thread Michael Tokarev
Another 32/64-bit issue, it seems.
Running a 2.6.18.1 x86-64 kernel and mdadm 2.5.3 (32-bit).

# mdadm -G /dev/md1 --bitmap=internal
mdadm: bitmaps not supported by this kernel.

# mdadm -G /dev/md1 --bitmap=none
mdadm: bitmaps not supported by this kernel.

etc.

Recompiling mdadm in 64bit mode eliminates the problem.

So far, only bitmap manipulation is broken this way.
I dunno if other things are broken too - at least
--assemble, --create, --stop, and --detail work.

Thanks.

/mjt


Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.

2006-10-12 Thread Michael Tokarev
Neil Brown wrote:
[]
 Fix count of degraded drives in raid10.
 
 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
 --- .prev/drivers/md/raid10.c 2006-10-09 14:18:00.0 +1000
 +++ ./drivers/md/raid10.c 2006-10-05 20:10:07.0 +1000
 @@ -2079,7 +2079,7 @@ static int run(mddev_t *mddev)
 		disk = conf->mirrors + i;
 
 		if (!disk->rdev ||
 -		    !test_bit(In_sync, &rdev->flags)) {
 +		    !test_bit(In_sync, &disk->rdev->flags)) {
 			disk->head_position = 0;
 			mddev->degraded++;
 		}

Neil, this makes me nervous.  Seriously.

How many bugs like this have been fixed so far? 10? 50?  I stopped counting
a long time ago.  And it's the same thing in every case - misuse of rdev vs
disk->rdev.  The same pattern.

I wonder if it can be avoided in the first place somehow - maybe don't
declare and use a local variable `rdev' (not by name, but by the semantics
of it), and always use disk->rdev or mddev->whatever in every place,
explicitly, and let the compiler optimize the deref if possible?

And btw, this is another 2.6.18.1 candidate (if it's not too late already).

Thanks.

/mjt


Re: avoiding the initial resync on --create

2006-10-11 Thread Michael Tokarev
Doug Ledford wrote:
 On Mon, 2006-10-09 at 15:10 -0400, Rob Bray wrote:
[]
 Probably the best thing to do would be on create of the array, setup a
 large all 0 block of mem and repeatedly write that to all blocks in the
 array devices except parity blocks and use a large all 1 block for that.
 Then you could just write the entire array at blinding speed.  You could
 call that the quick-init option or something.  You wouldn't be able to
 use the array until it was done, but it would be quick.  If you wanted
 to be *really* fast, at least for SCSI drives you could write one large
 chunk of 0's and one large chunk of 1's at the first parity block, then
 use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
 and likewise for the parity chunk, and avoid transferring the data over
 the SCSI bus more than once.

Some notes.

First, a raid array is sometimes created in order to repair a broken array.
I.e., you had an array, you lost it for whatever reason, and you re-create
it, avoiding the initial resync (--assume-clean option), in the hope your
data is still there.  For that, you don't want to zero-fill your drives,
for sure! :)

And second, at least SCSI drives have a FORMAT UNIT command, which takes a
range argument (from-sector and to-sector) and, if memory serves me
right, also a filler argument (the data, a 512-byte block, to
write to all the sectors in the range).  (Well, it was long ago when
I looked at that stuff, so it might be some other command, but it's
there anyway.)  I'm not sure it's used/available in the block device layer
(most probably it isn't).  But this is the fastest way to fill (parts
of) your drives with whatever repeated pattern of bytes you want,
including this initial zero-filling.

But either way, you don't really need to do that in kernel space --
a userspace solution will work too.  Ok ok, if the kernel does it after
array creation, the array is available immediately for other use,
which is a plus.
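The userspace variant is a one-liner per member device; a sketch (device name hypothetical, and it destroys everything on the device):

```shell
# zero-fill one member directly, bypassing the page cache
dd if=/dev/zero of=/dev/sdX1 bs=1M oflag=direct
```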

And yes, I'm not sure implementing it is worth the effort.  Unless
you're re-creating your multi-terabyte array several times a day ;)

/mjt


Re: Simulating Drive Failure on Mirrored OS drive

2006-10-02 Thread Michael Tokarev
andy liebman wrote:
 

 Read up on the md-faulty device.
 
 Got any links to this? As I said, we know how to set the device as
 faulty, but I'm not convinced this is a good simulation of a drive that
 fails (times out, becomes unresponsive, etc.)

Note that 'set device as faulty' is NOT the same as the `md-faulty' device.
Read the mdadm(8) manpage, and see options `-l' (level) and `-p' (parity)
for create mode.

/mjt


Re: [BUG/PATCH] md bitmap broken on big endian machines

2006-09-29 Thread Michael Tokarev
Paul Clements wrote:
 Michael Tokarev wrote:
 Neil Brown wrote:
 
 ffs is closer, but takes an 'int' and we have a 'unsigned long'.
 So use ffz(~X) to convert a chunksize into a chunkshift.

 So we don't use ffs(int) for an unsigned value because of int vs
 unsigned int, but we use ffz() with negated UNSIGNED.  Looks even
 more broken to me, even if it happens to work correctly... ;)
 
 No, it doesn't matter about the signedness, these are just bit
 operations. The problem is the size (int vs. long), even though in
 practice it's very unlikely you'd ever have a bitmap chunk size that
 exceeded 32 bits. But it's better to be correct and not have to worry
 about it.

I understand the point, in the first place (I didn't mention long
vs int above, however).  The thing is: when reading the code, it looks
just plain wrong.  Esp. since the function prototypes aren't here - for
ffs(), ffz() etc they're hidden somewhere in include/asm/* (as
they're architecture-dependent), and it's not at all obvious which is
signed and which is unsigned, which is long or int, etc.

At the very least, return -ENOCOMMENT :)

/mjt


Re: [BUG/PATCH] md bitmap broken on big endian machines

2006-09-28 Thread Michael Tokarev
Neil Brown wrote:
[]
 Use ffz instead of find_first_set to convert multiplier to shift.
 
 From: Paul Clements [EMAIL PROTECTED]
 
 find_first_set doesn't find the least-significant bit on bigendian
 machines, so it is really wrong to use it.
 
 ffs is closer, but takes an 'int' and we have a 'unsigned long'.
 So use ffz(~X) to convert a chunksize into a chunkshift.

So we don't use ffs(int) for an unsigned value because of int vs
unsigned int, but we use ffz() with negated UNSIGNED.  Looks even
more broken to me, even if it happens to work correctly... ;)

/mjt


proactive-raid-disk-replacement

2006-09-08 Thread Michael Tokarev
Recently Dean Gaudet, in thread titled 'Feature
Request/Suggestion - Drive Linking', mentioned his
document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

I've read it, and have some umm.. concerns.  Here's why:


 mdadm -Gb internal --bitmap-chunk=1024 /dev/md4
 mdadm /dev/md4 -r /dev/sdh1
 mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
 mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
 mdadm /dev/md4 --re-add /dev/md5
 mdadm /dev/md5 -a /dev/sdh1

 ... wait a few hours for md5 resync...

And here's the problem.  While the new disk, sdh1, is resynced from
the old, probably failing disk sde1, chances are high that there will
be an unreadable block on sde1.  And that means the whole thing
will not work -- md5 initially contained one working drive (sde1)
and one spare (sdh1) being converted (resynced) into a working
disk.  But after a read error on sde1, md5 will contain one failed
drive and one spare -- for raid1 a fatal combination.

Yet at the same time it would be perfectly easy to reconstruct the
failing block from the other component devices of md4.

That is to say: this way of replacing a disk in a software raid array
isn't much better than just removing the old drive and adding the new one.
And if the drive you're replacing is already failing (according to SMART,
for example), this method is all the more likely to fail.

/mjt


Re: RAID5 fill up?

2006-09-08 Thread Michael Tokarev
Lars Schimmer wrote:
 Hi!
 
 I've got a software RAID5 with 6 250GB HDs.
 Now I changed one disk after another to a 400GB HD and resynced the
 raid5 after each change.
 Now the RAID5 has got 6 400GB HDs and still uses only 6*250GB space.
 How can I grow the md0 device to use 6*400GB?

mdadm --grow is your friend.

/mjt



Re: proactive-raid-disk-replacement

2006-09-08 Thread Michael Tokarev
dean gaudet wrote:
 On Fri, 8 Sep 2006, Michael Tokarev wrote:
 
 Recently Dean Gaudet, in thread titled 'Feature
 Request/Suggestion - Drive Linking', mentioned his
 document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

 I've read it, and have some umm.. concerns.  Here's why:

 
 mdadm -Gb internal --bitmap-chunk=1024 /dev/md4

By the way, don't specify bitmap-chunk for an internal bitmap.
It's only needed for a file-based (external) bitmap.  With an internal
bitmap there is a fixed amount of space for it in the superblock, so the
bitmap chunk size is derived from the array size and the number of bits
that fit in that space.

 mdadm /dev/md4 -r /dev/sdh1
 mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1
 mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing
 mdadm /dev/md4 --re-add /dev/md5
 mdadm /dev/md5 -a /dev/sdh1

 ... wait a few hours for md5 resync...
 And here's the problem.  While the new disk, sdh1, is resynced from
 the old, probably failing disk sde1, chances are high that there will
 be an unreadable block on sde1.  And that means the whole thing
 will not work -- md5 initially contained one working drive (sde1)
 and one spare (sdh1) being converted (resynced) into a working
 disk.  But after a read error on sde1, md5 will contain one failed
 drive and one spare -- for raid1 a fatal combination.

 While at the same time, it's perfectly easy to reconstruct this
 failing block from other component devices of md4.
 
 this statement is an argument for native support for this type of activity 
 in md itself.

Yes, definitely.

 That to say: this way of replacing disk in a software raid array
 isn't much better than just removing old drive and adding new one.
 
 hmm... i'm not sure i agree.  in your proposal you're guaranteed to have 
 no redundancy while you wait for the new disk to sync in the raid5.

It's not a proposal per se, it's just another possible way (used by majority
of users I think, because it's way simpler ;)

 in my proposal the probability that you'll retain redundancy through the 
 entire process is non-zero.  we can debate how non-zero it is, but 
 non-zero is greater than zero.

Yes, in my variant there will be no redundancy, guaranteed.  And yes,
there is some probability of completing your whole process without a glitch.

 i'll admit it depends a heck of a lot on how long you wait to replace your 
 disks, but i prefer to replace mine well before they get to the point 
 where just reading the entire disk is guaranteed to result in problems.
 
 And if the drive you're replacing is failing (according to SMART
 for example), this method is more likely to fail.
 
 my practice is to run regular SMART long self tests, which tend to find 
 Current_Pending_Sectors (which are generally read errors waiting to 
 happen) and then launch a repair sync action... that generally drops the 
 Current_Pending_Sector back to zero.  either through a realloc or just 
 simply rewriting the block.  if it's a realloc then i consider if there's 
 enough of them to warrant replacing the disk...
 
 so for me the chances of a read error while doing the raid1 thing aren't 
 as high as they could be...

So the whole thing goes this way:
  0) do a SMART selftest ;)
  1) do repair for the whole array
  2) copy data from failing to new drive
(using temporary superblock-less array)
  2a) if step 2 still fails, probably due to new bad sectors,
      go the old way, removing the failing drive and adding
      a new one.

That's 2x or 3x (or 4x counting the selftest, but that should be
done regardless) more work than just going the old way from the
beginning, but with some chance of having it complete flawlessly
in 2 steps, without losing redundancy.

Too complicated and too long for most people I'd say ;)

I can think of yet another way, which is only partly possible with
the current md code.  In 3 variants.

1)  Offline the array, stop it.
    Make a copy of the drive using dd with conv=noerror (or however
    it's spelled), noting the bad blocks.
    Mark those bad blocks in the bitmap as dirty.
    Assemble the array with the new drive, letting it resync the blocks
    we were unable to copy to the new drive previously.

This variant does not lose redundancy at all, but requires the array to
be off-line during the whole copy procedure.  What's missing (and has
been discussed on linux-raid@ recently too) is the ability to mark those
bad blocks in the bitmap.

2)  The same, but without offlining the array.  Hot-remove a drive, copy
    it to the new drive, flip the necessary bitmap bits, and re-add the
    new drive, letting the raid code resync the missing blocks plus
    whatever changed while the array was still active during the copy.

This variant still loses redundancy, but not much of it, provided the bitmap
code works correctly.

3)  The same as your way, with the difference that we tell md to *skip* and
  ignore possible errors during resync (which is also not possible currently).

 but yeah you've convinced me this solution isn't good

Re: Feature Request/Suggestion - Drive Linking

2006-09-03 Thread Michael Tokarev
Tuomas Leikola wrote:
[]
 Here's an alternate description. On first 'unrecoverable' error, the
 disk is marked as FAILING, which means that a spare is immediately
 taken into use to replace the failing one. The disk is not kicked, and
 readable blocks can still be used to rebuild other blocks (from other
 FAILING disks).
 
 The rebuild can be more like a ddrescue type operation, which is
 probably a lot faster in the case of raid6, and the disk can be
 automatically kicked after the sync is done. If there is no read
 access to the FAILING disk, the rebuild will be faster just because
 seeks are avoided in a busy system.

It's not that simple.  The issue is with writes.  If there's a failing
disk, the md code will need to keep track of its up-to-date, or good,
sectors vs the obsolete ones.  Ie, when a write fails, the data in that block
is either unreadable (but can become readable on the next try, say, after
a temperature change or whatnot), or readable but containing old data, or
readable but containing some random garbage.  So at least those block(s)
of the disk should not be copied to the spare during resync, and should
not be read at all, to avoid returning wrong data to userspace.  In short,
if the array isn't stopped (or changed to read-only), we have to watch for
writes and remember which ones failed.  Which is a non-trivial
change.  Yes, bitmaps somewhat help here.

/mjt



spurious dots in dmesg when reconstructing arrays

2006-08-17 Thread Michael Tokarev
A long time ago I noticed pretty bad formatting of
dmesg text in md array reconstruction output, but
never bothered to ask.  So here it goes.

Example dmesg (RAID conf printout sections omitted):

md: bindsdb1
RAID1 conf printout:
..6md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 248896 blocks.
md: bindsdb2
RAID5 conf printout:
..6md: delaying resync of md2 until md1 has finished resync (they 
share one or more physical units)
md: bindsdb3
RAID1 conf printout:
..6md: delaying resync of md31 until md1 has finished resync (they 
share one or more physical units)
6md: delaying resync of md2 until md31 has finished resync (they 
share one or more physical units)
md: bindsdb5
RAID5 conf printout:
6md: delaying resync of md5 until md31 has finished resync (they 
share one or more physical units)
..6md: delaying resync of md31 until md1 has finished resync (they 
share one or more physical units)
..6md: delaying resync of md2 until md5 has finished resync (they share 
one or more physical units)
md: bindsdb6
RAID1 conf printout:
..6md: delaying resync of md61 until md5 has finished resync (they share 
one or more physical units)
..6md: delaying resync of md31 until md1 has finished resync (they 
share one or more physical units)
..6md: delaying resync of md2 until md5 has finished resync (they share 
one or more physical units)
6md: delaying resync of md5 until md31 has finished resync (they 
share one or more physical units)
md: bindsdb7
RAID5 conf printout:
..6md: delaying resync of md7 until md5 has finished resync (they share 
one or more physical units)
..6md: delaying resync of md31 until md1 has finished resync (they 
share one or more physical units)
6md: delaying resync of md5 until md31 has finished resync (they 
share one or more physical units)
..6md: delaying resync of md2 until md5 has finished resync (they share 
one or more physical units)
..6md: delaying resync of md61 until md7 has finished resync (they share one 
or more physical units)
md: md1: sync done.
..6md: delaying resync of md61 until md7 has finished resync (they share one 
or more physical units)
..6md: delaying resync of md2 until md5 has finished resync (they share 
one or more physical units)
6md: delaying resync of md5 until md31 has finished resync (they 
share one or more physical units)
...6md: syncing RAID array md31
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 995904 blocks.
..6md: delaying resync of md7 until md5 has finished resync (they share 
one or more physical units)
RAID1 conf printout:
md: md31: sync done.
RAID1 conf printout:
=== here, the actual conf printout is below:
...6md: delaying resync of md7 until md5 has finished resync (they share one 
or more physical units)
...6md: syncing RAID array md5
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 666560 blocks.
= here:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sdd3
..6md: delaying resync of md61 until md7 has finished resync (they share one 
or more physical units)
..6md: delaying resync of md2 until md5 has finished resync (they share 
one or more physical units)
md: md5: sync done.
6md: delaying resync of md7 until md2 has finished resync (they share 
one or more physical units)
..6md: delaying resync of md61 until md7 has finished resync (they share one 
or more physical units)
..6md: syncing RAID array md2
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 1252992 blocks.
RAID5 conf printout:
md: md2: sync done.
...6md: delaying resync of md61 until md7 has finished resync (they share one 
or more physical units)
..6md: syncing RAID array md7
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 136 blocks.
RAID5 conf printout:
md: md7: sync done.
...6md: syncing RAID array md61
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 20 KB/sec) 
for reconstruction.
md: using 128k window, over a total of 30394880 blocks.
RAID5 conf printout:
md: md61: sync 

Re: [bug?] raid1 integrity checking is broken on 2.6.18-rc4

2006-08-12 Thread Michael Tokarev
Justin Piszcz wrote:
 Is there a doc for all of the options you can echo into sync_action?
 I'm assuming mdadm does these as well, and echo is just another way
 to work with the array?

How about the obvious, Documentation/md.txt ?

And no, mdadm does not perform or trigger data integrity checking.

/mjt


Re: modifying degraded raid 1 then re-adding other members is bad

2006-08-08 Thread Michael Tokarev
Neil Brown wrote:
 On Tuesday August 8, [EMAIL PROTECTED] wrote:
 Assume I have a fully-functional raid 1 between two disks, one
 hot-pluggable and the other fixed.

 If I unplug the hot-pluggable disk and reboot, the array will come up
 degraded, as intended.

 If I then modify a lot of the data in the raid device (say it's my
 root fs and I'm running daily Fedora development updates :-), which
 modifies only the fixed disk, and then plug the hot-pluggable disk in
 and re-add its members, it appears that it comes up without resyncing
 and, well, major filesystem corruption ensues.

 Is this a known issue, or should I try to gather more info about it?
 
 Looks a lot like
http://bugzilla.kernel.org/show_bug.cgi?id=6965
 
 Attached are two patches.  One against -mm and one against -linus.
 
 They are below.
 
 Please confirm if the appropriate one help.
 
 NeilBrown
 
 (-mm)
 
 Avoid backward event updates in md superblock when degraded.
 
 If we
   - shut down a clean array,
   - restart with one (or more) drive(s) missing
   - make some changes
   - pause, so that they array gets marked 'clean',
 the event count on the superblock of included drives
 will be the same as that of the removed drives.
 So adding the removed drive back in will cause it
 to be included with no resync.
 
 To avoid this, we only update the eventcount backwards when the array
 is not degraded.  In this case there can (should) be no non-connected
 drives that we can get confused with, and this is the particular case
 where updating-backwards is valuable.

Why are we updating it BACKWARD in the first place?

Also, why is the event counter checked at all when we add something to
the array -- shouldn't it just resync regardless?

Thanks.

/mjt


Re: Converting Ext3 to Ext3 under RAID 1

2006-08-03 Thread Michael Tokarev
Paul Clements wrote:
Is 16 blocks a large enough area?
 
 Maybe. The superblock will be between 64KB and 128KB from the end of the
 partition. This depends on the size of the partition:
 
 SB_LOC = PART_SIZE - 64K - (PART_SIZE & (64K-1))
 
 So, by 16 blocks, I assume you mean 16 filesystem blocks (which are
 generally 4KB for ext3). So as long as your partition ends exactly on a
 64KB boundary, you should be OK.
 
 Personally, I would err on the safe side and just shorten the filesystem
 by 128KB. It's not like you're going to miss the extra 64KB.

Or, better yet, shrink it by 1MB or even 10MB, whatever, convert
to raid, and - this is the point - resize it to the max size of the raid
device (ie, don't give a size argument to resize2fs).  This way, you
will both be safe and use 100% of the available size.

/mjt


Re: let md auto-detect 128+ raid members, fix potential race condition

2006-08-01 Thread Michael Tokarev
Alexandre Oliva wrote:
[]
 If mdadm can indeed scan all partitions to bring up all raid devices
 in them, like nash's raidautorun does, great.  I'll give that a try,

Never, ever, try to do that (again).  Mdadm (or vgscan, or whatever)
should NOT assemble ALL arrays found, but only those it has
been told to assemble.  Here it is again: you bring another disk into
a system (a disk which comes from another machine), and mdadm finds
FOREIGN arrays and brings them up as /dev/md0, where YOUR root
filesystem should be.  That's what the 'homehost' option is for, for
example.

If the initrd should be reconfigured after some changes (be it raid
arrays, LVM volumes, hostname, whatever) -- I for one am fine
with that.  Hopefully no one will argue that if you forgot to
install an MBR on your replacement drive, it was entirely your
own fault that your system became unbootable, after all ;)

/mjt


Re: [PATCH 006 of 9] md: Remove the working_disks and failed_disks from raid5 state data.

2006-07-31 Thread Michael Tokarev
NeilBrown wrote:
 They are not needed.
 conf->failed_disks is the same as mddev->degraded

By the way, `failed_disks' is more understandable than `degraded'
in this context.  Degraded usually refers to the state of the
array in question, when failed_disks > 0.

That is to say: I'd rename degraded back to failed_disks, here and
in the rest of the raid drivers... ;)

/mjt


Re: Grub vs Lilo

2006-07-26 Thread Michael Tokarev
Jason Lunz wrote:
 [EMAIL PROTECTED] said:
 Wondering if anyone can comment on an easy way to get grub to update
 all components in a raid1 array.  I have a raid1 /boot with a raid10
 /root and have previously used lilo with the raid-extra-boot option to
 install to boot sectors of all component devices.  With grub it
 appears that you can only update non default devices via the command
 line.  I like the ability to be able to type lilo and have all updated
 in one hit.  Is there a way to do this with grub?  
 
 assuming your /boot is made of hda1 and hdc1:
 
 grub-install /dev/hda1
 grub-install /dev/hdc1

Don't do that.
If your hda dies and you then try to boot off hdc instead (which
will become hda in this case), grub will try to read hdc, which is gone,
and will fail.

Most of the time (unless the bootloader is really smart and understands
mirroring in full - lilo and grub do not) you want to have THE SAME
boot code on both (or more, in the case of 3- or 4-disk mirrors) of your
disks, including the BIOS disk codes.  After the above two commands, grub
will write code to boot from disk 0x80 to hda, and from disk 0x81 (or 0x82)
to hdc.  So when your hdc becomes hda, it will not boot.

In order to solve this all, you have to write a diskmap file and run
grub-install twice.  Both times, the diskmap should list 0x80 for the
device to which you're installing grub.

I don't remember the syntax of the diskmap file (or even if it's really
called 'diskmap'), but assuming hda and hdc notation, I mean the following:

  echo /dev/hda 0x80 > /boot/grub/diskmap
  grub-install /dev/hda1

  echo /dev/hdc 0x80 > /boot/grub/diskmap  # overwrite it!
  grub-install /dev/hdc1

The thing with all this "my RAID device works, it is really simple!" talk is:
for too many people it indeed works, so they think it's the good and correct
way.  But it only works up to the actual failure, which, in most setups, isn't
tested.  Once something fails, umm...  Jason, try removing your hda (pretend it
has failed) and booting off hdc to see what I mean ;)  (Well yes, a rescue disk
will help in that case... hopefully.  But that's not RAID, which, when installed
properly, will really make a disk failure transparent.)

/mjt


Re: Grub vs Lilo

2006-07-26 Thread Michael Tokarev
Bernd Rieke wrote:
 Michael Tokarev wrote on 26.07.2006 20:00:
 .
 .
The thing with all this "my RAID device works, it is really simple!" talk is:
for too many people it indeed works, so they think it's the good and correct way.
But it only works up to the actual failure, which, in most setups, isn't tested.
Once something fails, umm... Jason, try removing your hda (pretend it has failed)
and booting off hdc to see what I mean ;) (Well yes, a rescue disk will help in
that case... hopefully. But that's not RAID, which, when installed properly,
will really make a disk failure transparent.)
/mjt
 
 Yes Michael, you're right.  We use a simple RAID1 config with swap and / on
 three SCSI disks (2 working, one hot-spare) on SuSE 9.3 systems.  We had to
 use lilo to handle booting off any of the two (three) disks.  But we had
 problems over problems until lilo 22.7 came up.  With this version of lilo
 we can pull off any disk in any scenario.  The box boots in any case.

Well, a lot of systems here work on root-on-raid1 with lilo-2.2.4 (the
Debian package), and grub.  By "works" I mean they really work, ie, any disk
failure doesn't prevent the system from working and (re)booting flawlessly
(provided the disk is really dead, as opposed to present but failing to
read (some) data - in which case the only way out is either to remove
it physically or to choose another boot device in the BIOS.  But that's
an entirely different story, about the (non-existent) "really smart boot
loader" I mentioned in my previous email).

The trick is to set the system up properly.  The simple/obvious way
(installing grub to hda1 and hdc1) doesn't work when you remove hda, but
the complex way does.

Moreover, I'd not let LILO do more guesswork for me (like the raid-extra-boot
stuff, or whatever comes with 22.7 - to be honest, I didn't look at it
at all, as the Debian package of 2.2.4 (or 22.4?) works for me just fine).
Just write the damn thing into the start of mdN (and let the raid code
replicate it to all drives, regardless of how many of them there are),
after realizing it's really partition number X (with offset Y) on a
real disk, and use BIOS code 0x80 for all disk access.  That's all.
The rest - like ensuring all the (boot) partitions are at the same
place on every disk, that the disk geometry is the same, etc - is my duty,
and that duty is done by me accurately - because I want the disks to
be interchangeable.

 We were wondering why, when we asked the groups while in trouble with lilo
 before 22.7, we got no response.  Ok, the RAID driver and the kernel
 worked fine while resyncing the spare in case of a disk failure (thanks to
 Neil Brown for that).  But if a box had to be rebooted with a failed disk,
 the situation became worse.  And you have to reboot, because hotplug still
 doesn't work.  But nobody seems to care about it, or nobody apart from us
 has these problems  ...

Just curious - when/where you asked?
[]
 So we came to the conclusion that everybody is working on RAID but nobody
 cares about the things around, just as you mentioned, thanks for that.

I tend to disagree.  My statement above refers to the simple advice sometimes
given here and elsewhere: "do this and that, it worked for me".  By users
who didn't do their homework, who never tested the stuff, who, sometimes,
just have no idea HOW to test (hopefully not an insulting statement
- I don't blame them for their lack of knowledge, it's something
which isn't really cheap, after all).  The majority of users are of this sort,
and they follow each other's advice, again, without testing.  HOWTOs get
written by such users as well (as someone mentioned to me in a private
email in response to my reply).

I mean, the existing software works.  It really works.  The only thing left
is to set it up correctly.

And please, PLEASE, don't treat all this as blaming "bad users".  It's not.
I learned this stuff the hard way too.  After having unbootable remote
machines after a disk failure, when everything had seemed to be ok.  After
screwing up systems using the famous linux raid autodetect stuff everyone
loves: I replaced a failed disk with another one which -- bad me --
had been part of another raid array on another system, and the box chose
to assemble THAT raid array instead of this box's own, overwriting the
good disk with data from the new disk that had been in a testing machine.  And
so on.  That's all to say: it's easy to make a mistake and treat the
resulting setup as a good one, until shit starts happening.  But shit
happens very rarely, compared to average system usage, so you may never
know at all that your setup is wrong, and of course you will tell others how
to do things... :)

/mjt


Re: [PATCH] enable auto=yes by default when using udev

2006-07-04 Thread Michael Tokarev
Neil Brown wrote:
 On Monday July 3, [EMAIL PROTECTED] wrote:
 Hello,
 the following patch aims at solving an issue that is confusing a lot of
 users.
 when using udev, device files are created only when devices are
 registered with the kernel, and md devices are registered only when
 started.
 mdadm needs the device file _before_ starting the array.
 so when using udev you must add --auto=yes to the mdadm commandline or
 to the ARRAY line in mdadm.conf

 following patch makes auto=yes the default when using udev
 
 The principle I'm reasonably happy with, though you can now make this
 the default with a line like
 
   CREATE auto=yes
 in mdadm.conf.
 
 However
 
 +
 +/* if we are using udev and auto is not set, mdadm will almost
 + * certainly fail, so we force it here.
 + */
 +if (autof == 0 && access("/dev/.udevdb", F_OK) == 0)
 +autof=2;
 +
 
 I'm worried that this test is not very robust.
 On my Debian/unstable system running used, there is no
  /dev/.udevdb
 though there is a
  /dev/.udev/db
 
 I guess I could test for both, but then udev might change
 again.  I'd really like a more robust check.

Why test for udev at all?  If the device does not exist, regardless
of whether udev is running or not, it might be a good idea to try to create
it.  Because IT IS NEEDED, period.  Whether the operation fails or not, and
whether we bail out if it fails or not -- that's another question, and I think
that w/o an explicit auto=yes we may ignore a create error and try to continue,
while with auto=yes we should fail on a create error.

Note that /dev might be managed by some other tool as well, like mdev
from busybox, or just a tiny shell /sbin/hotplug script.

Note also that the whole root filesystem might be on tmpfs (like in
initramfs), so /dev will not be a mountpoint.

Also, I think mdadm should stop creating strange temporary nodes somewhere
as it does now.  If /dev/whatever exists, use it.  If not, create it (unless,
perhaps, auto=no is specified) directly with a proper mknod(/dev/mdX), but
don't try to use temporary names in /dev or elsewhere.

In the case of an nfs-mounted read-only root filesystem, if someone ever
needs to assemble raid arrays there... well, they can either prepare a proper
/dev on the nfs server, or use a tmpfs-based /dev, or just specify /tmp/mdXX
instead of /dev/mdXX -- whatever suits their needs better.

/mjt


Re: New FAQ entry? (was IBM xSeries stop responding during RAID1 reconstruction)

2006-06-21 Thread Michael Tokarev
Niccolo Rigacci wrote:
[]
 From the command line you can see which schedulers are supported 
 and change it on the fly (remember to do it for each RAID disk):
 
   # cat /sys/block/hda/queue/scheduler
   noop [anticipatory] deadline cfq
  # echo cfq > /sys/block/hda/queue/scheduler
 
 Otherwise you can recompile your kernel and set CFQ as the 
 default I/O scheduler (CONFIG_DEFAULT_CFQ=y in Block layer, IO 
 Schedulers, Default I/O scheduler).

There's a much easier/simpler way to set the default scheduler.  As
someone suggested, RTFM Documentation/kernel-parameters.txt.
Passing elevator=cfq (or whatever) does the trick much more simply
than a kernel recompile.
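For reference, the kernel parameter goes on the boot loader's kernel line; in a grub-legacy config it might look like this (the kernel image path and root device are made-up placeholders):

```
# /boot/grub/menu.lst -- illustrative entry; paths are placeholders
title   Linux
kernel  /boot/vmlinuz root=/dev/md0 elevator=cfq
```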

/mjt


Re: problems with raid=noautodetect

2006-05-29 Thread Michael Tokarev
Neil Brown wrote:
 On Friday May 26, [EMAIL PROTECTED] wrote:
[]
 If we assume there is a list of devices provided by a (possibly
 default) 'DEVICE' line, then 
 
 DEVICEFILTER   !pattern1 !pattern2 pattern3 pattern4
 
 could mean that any device in that list which matches pattern 1 or 2
 is immediately discarded, any remaining device that matches pattern 3
 or 4 is included, and the remainder are discarded.
 
 The rule could be that the default is to include any devices that
 don't match a !pattern, unless there is a pattern without a '!', in
 which case the default is to reject non-accepted devices.
 Is that straight forward enough, or do I need an
   order allow,deny
 like apache has?

I'd suggest the following.

Any device not matched by a pattern is included in or excluded from the
list of devices to consider based on the last component of the DEVICE
line.  Ie, if the line ends with !dev, all the rest of the devices are
included; if it ends with dev (w/o !), all the rest are excluded.  If
memory serves me right, that's how squid ACLs work.

There's no need to introduce a new keyword.  Given this rule, a line

 DEVICE a b c

will do exactly as it does now.  Line

 DEVICE a b c !d

is somewhat redundant - it's the same as "DEVICE !d".
Ie, if the list ends with !stuff, append `partitions' (or *) to it.

Of course mixing ! and non-! patterns is useful, for example to say
"use all sda* but not sda1":

 DEVICE !sda1 sda*

(and nothing else).

And the default is to have `DEVICE partitions'.

The only possible issue I see here is that with udev it's possible to
use, say, /dev/disk/by-id/*-like paths (I don't remember the exact directory
layout) -- symlinked to /dev/sd* according to the disk serial number or
something like that -- and for such patterns to work, mdadm needs to use
glob() internally.

/mjt


Re: linear writes to raid5

2006-04-20 Thread Michael Tokarev

Neil Brown wrote:

On Tuesday April 18, [EMAIL PROTECTED] wrote:

[]

I mean, merging bios into larger requests makes a lot of sense between
the filesystem and md levels, but it makes a lot less sense to do that
between md and the physical (fsvo physical, anyway) disks.


This seems completely backwards to me, so we must be thinking of
different things.

Creating large requests above the md level doesn't make a lot of sense
to me because there is a reasonable chance that md will just need to
break the requests up again to submit to different devices lower down.

Building large requests for the physical disk makes lots of sense
because you get much better throughput on e.g. a SCSI bus by having
few large requests rather than many small requests.  But this building
should be done close to the device so that as much information as
possible is available about particular device characteristics.

What is the rationale for your position?


My rationale was that if the md layer receives *write* requests no smaller
than a full stripe size, it is able to omit reading the data to update, and
can just calculate new parity from the new data.  Hence, combining a
dozen small write requests coming from a filesystem into a single
request of full stripe size should dramatically increase performance.

E.g., when I use dd in O_DIRECT mode (oflag=direct) and experiment with
different block sizes, write performance increases a lot when bs reaches
the full stripe size.  Of course it decreases again when bs is increased a
bit further (as md starts reading again, to construct parity blocks).

For read requests, it makes much less difference where to combine them.
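To make the arithmetic behind the dd observation concrete, here is a small
worked example; the chunk size and disk count are assumptions picked for
illustration, not taken from the thread:

```python
# A RAID5 write avoids the read-modify-write cycle only when it covers
# whole stripes, so parity can be computed from the new data alone.
chunk_kib = 64
data_disks = 3                       # e.g. 4-disk RAID5: 3 data + 1 parity
stripe_kib = chunk_kib * data_disks  # 192 KiB full stripe

def needs_rmw(offset_kib, length_kib):
    """True if md must read old data/parity before writing."""
    return offset_kib % stripe_kib != 0 or length_kib % stripe_kib != 0

assert not needs_rmw(0, 192)  # dd bs = full stripe: parity from new data only
assert needs_rmw(0, 64)       # partial stripe: read-modify-write
assert needs_rmw(0, 256)      # bigger than a stripe, but not a multiple:
                              # md starts reading again, as observed above
```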

/mjt


Re: linear writes to raid5

2006-04-18 Thread Michael Tokarev

Neil Brown wrote:
[]

raid5 shouldn't need to merge small requests into large requests.
That is what the 'elevator' or io_scheduler algorithms are for.  They
already merge multiple bios into larger 'requests'.  If they aren't
doing that, then something needs to be fixed.

It is certainly possible that raid5 is doing something wrong that
makes merging harder - maybe sending bios in the wrong order, or
sending them with unfortunate timing.  And if that is the case it
certainly makes sense to fix it.  
But I really don't see that raid5 should be merging requests together

- that is for a lower-level to do.


Hmm.  So where does the elevator level sit - before the raid level (between
e.g. a filesystem and md), or after it (between md and the physical devices)?

I mean, merging bios into larger requests makes a lot of sense between
a filesystem and md levels, but it makes a lot less sense to do that
between md and physical (fsvo physical anyway) disks.

Thanks.

/mjt



Re: accessing mirrired lvm on shared storage

2006-04-18 Thread Michael Tokarev

Neil Brown wrote:
[]

Very cool... that would be extremely nice to have.  Any estimate on
when you might get to this?



I'm working on it, but there are lots of distractions


Neil, is there anything you're NOT working on? ;)

Sorry just can't resist... ;)

/mjt



Terrible slow write speed to MegaRAID SCSI array

2006-03-18 Thread Michael Tokarev
We've installed an LSI Logic MegaRaid SCSI 320-1 card
on our server (used only temporarily to move data to
larger disks, but that's not the point), and measured
linear write performance, just to know how much time
it will take to copy our (somewhat large) data to the
new array.  And to my surprise, this card, with current
firmware, a current (2.6.15) kernel and modern disks, is
terribly slow on writing - write speed varies between
1.5 megabytes/sec and 10 megabytes/sec, depending on the
logical drive settings in the megaraid adapter.
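For reference, a rough sketch of the kind of linear-write measurement
described above.  The file path and sizes here are assumptions; on the
real machine the test file would live on the MegaRAID volume, or dd with
oflag=direct would be pointed at the raw device to bypass the page cache:

```python
# Time a sequential write and report throughput, counting the final
# flush too (like dd conv=fsync would).
import os, time

def write_speed_mb_s(path, total_mb=64, block_kb=1024):
    buf = b"\0" * (block_kb * 1024)
    t0 = time.time()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(total_mb * 1024 // block_kb):
            os.write(fd, buf)
        os.fsync(fd)  # make sure the data actually hit the disk
    finally:
        os.close(fd)
    return total_mb / (time.time() - t0)

print("%.1f MB/s" % write_speed_mb_s("/tmp/mrtest.bin", total_mb=16))
os.remove("/tmp/mrtest.bin")
```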

I tried different raid configurations (raid10 - 3 spans
of two-disk raid1s; raid1 out of 2 drives; raid 0 on one
drive) - makes no difference whatsoever (but on raid1,
I only can get 8 megabytes/sec write speed - 10 mb/sec
is on raid10).

The disks are 140Gb FUJITSU MAT3147NC ones, pretty modern.
One disk, when plugged into non-megaraid card and accessed
as single disk, delivers about 60 megabytes/sec write speed,
and about 80 mb/sec read (reading speed on megaraid array
is quite good - about 240 mb/sec for 6-disk raid10).

I've upgraded firmware on the megaraid card to the latest one
available on lsi logic website - nothing changed wrt write
speed (but read speed decreased from 240 to about 190 mb/sec -
still acceptable for me).

I'm using megaraid_mbox driver.

The only improvement I was able to get is when I enable
write caching on the card -- in this case, write speed
is very good up to the first 64Mb (the amount of memory on the
card), and decreases again back to 1.5..10 mb/sec after
64Mb.

The question is: where's the problem? linux? driver? hardware?
Has anyone else experienced this problem?

Thanks.

/mjt


Re: [PATCH 000 of 5] md: Introduction

2006-01-17 Thread Michael Tokarev
NeilBrown wrote:
 Greetings.
 
 In line with the principle of release early, following are 5 patches
 against md in 2.6.latest which implement reshaping of a raid5 array.
 By this I mean adding 1 or more drives to the array and then re-laying
 out all of the data.

Neil, is this online resizing/reshaping really needed?  I understand
all those words mean a lot to marketing people - zero downtime,
online resizing etc, but it is much safer and easier to do that stuff
'offline', on an inactive array, like raidreconf does - safer, easier,
faster, and one has more possibilities for more complex changes.  It
isn't like you want to add/remove drives to/from your arrays every day...
A lot of good HW RAID cards are unable to perform such reshaping too.

/mjt

