Re: Linux RAID migration

2007-08-07 Thread Tuomas Leikola
On 8/7/07, saeed bishara [EMAIL PROTECTED] wrote:
 Hi,
  I'm looking for a method for doing RAID migration while keeping the
 data available.
 the migrations I'm interested with are:
 1. Single drive -> RAID1/RAID5
 2. RAID1 -> RAID5.


1. is a bit complicated, as a raid device created on a disk is slightly
smaller than the original device. You might need to copy the data
manually.

2. should be as simple as (offline) re-creating the raid1 as raid5.
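A minimal sketch of 2. (device names are my assumptions, not from the
thread; this assumes the usual superblock-at-the-end format and a verified
backup):

mdadm --stop /dev/md0
mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/sda1 /dev/sdb1
fsck -n /dev/md0    # read-only sanity check before trusting the result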

 - can I really assume that RAID 5 on 2 hdds (degraded mode) will
 function as raid 5?

You should test by using loopback devices and files. But why degraded?
raid5 of two disks should look like raid1.

 - how to build raid while keeping the contents of an existing drive available?

Normally the array is available when building/rebuilding.
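For 1., a minimal sketch of doing the copy while the data stays available
(device names, filesystem and mount points are assumptions):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
mkfs.ext3 /dev/md0                # the md device is slightly smaller
mount /dev/md0 /mnt/new
cp -a /mnt/old/. /mnt/new/        # copy the existing data manually
mdadm /dev/md0 -a /dev/sda1       # finally add the original disk; it resyncs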

- tuomas


Re: raid 1 recovery steps

2006-09-26 Thread Tuomas Leikola

On 9/24/06, chapman [EMAIL PROTECTED] wrote:

Can I assume the disk is ok, just needs to be
re-added to the array?


Not necessarily. You should look for logs indicating _why_ it is
marked bad. If you don't know how long it's been broken, you need some
monitoring system, like mdadm or logcheck.

This is most commonly caused by bad sectors. If you can re-add and
resync goes through without complaining, you're probably ok.


I'm assuming I need to first remove sda1 from the raid then re-add it,
correct?  If so, what are the specific steps?


mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md0 -a /dev/sda1
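You can watch the resync progress afterwards with:

cat /proc/mdstat
mdadm --detail /dev/md0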


Can this be done safely on a
live server without pulling the system down?


sure.


How will this affect rebooting
once completed - if at all?


It should not have any impact, at least if your boot setup is ok.


Any gotchas I should look out for?


Things get tricky if the other disk has stealthily become bad.

- tuomas


Re: RAID5 Problem - $1000 reward for help

2006-09-17 Thread Tuomas Leikola

On 9/15/06, Reza Naima [EMAIL PROTECTED] wrote:

I've picked up two 500G disks, and am in the process of dd'ing the
contents of the raid partitions over.  The 2nd failed disk came up just
fine, and has been copying the data over without fault.  I expect it to
finish, but thought I would send this email now.  I will include some
data that I captured before the system failed.


Note that if there are bad blocks, dd might not be reliable; ddrescue
will do a better job of filling the gaps with zeroes (though silent data
corruption may result).

The approach for your problem is to recreate the raid in degraded mode
(with the replaced drive marked as missing), something like:

mdadm --create /dev/md0 --chunk=256 --layout=left-symmetric \
  --raid-devices=4 /dev/hda3 missing /dev/hdf1 /dev/hdg1

after which, see if the filesystem mounts and is ok; if so, add the
remaining drive back to the array.
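That last step would be something like (the device name of the replaced
drive is an assumption):

mdadm /dev/md0 -a /dev/hdX1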

It's recommended to use a script to scrub the raid device regularly,
to detect sleeping bad blocks early.


Re: RAID5 Problem - $1000 reward for help

2006-09-17 Thread Tuomas Leikola

On 9/17/06, Ask Bjørn Hansen [EMAIL PROTECTED] wrote:

 It's recommended to use a script to scrub the raid device regularly,
 to detect sleeping bad blocks early.

What's the best way to do that?  dd the full md device to /dev/null?



echo check > /sys/block/md?/md/sync_action

Distros may have cron scripts to do this right.

And you need a fairly recent kernel.
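If your distro doesn't ship one, a minimal /etc/crontab sketch (the array
name and the schedule are assumptions):

# scrub md0 every Sunday at 04:00
0 4 * * 0  root  echo check > /sys/block/md0/md/sync_action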


Re: scrub was Re: RAID5 Problem - $1000 reward for help

2006-09-17 Thread Tuomas Leikola

On 9/17/06, Dexter Filmore [EMAIL PROTECTED] wrote:

   It's recommended to use a script to scrub the raid device regularly,
   to detect sleeping bad blocks early.
 
  What's the best way to do that?  dd the full md device to /dev/null?

 echo check > /sys/block/md?/md/sync_action

 Distros may have cron scripts to do this right.

 And you need a fairly recent kernel.

Does this test stress the discs a lot, like a resync?
How long does it take?
Can I use it on a mounted array?



yup.
long. think resync.
yup.

In practice it reads everything, verifies the redundancy (parity or
mirror copies), and reports bad blocks or inconsistencies.

echo repair > sync_action causes md to fix redundancy blocks if they're
out of sync (but at that point you already have another problem, like
flaky hardware).
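On recent kernels you can also see whether a check found anything (md0 is
an assumption):

cat /sys/block/md0/md/mismatch_cnt    # non-zero after a check means out-of-sync blocks
echo repair > /sys/block/md0/md/sync_action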


Re: access *exisiting* array from knoppix

2006-09-14 Thread Tuomas Leikola

 mdadm --assemble /dev/md0 /dev/hda1 /dev/hdb1 # i think, man mdadm

Not what I meant: there already exists an array on a file server that was
created from the server os, I want to boot that server from knoppix instead
and access the array.



That's exactly what --assemble does: it looks at the disks, finds the raid
components, assembles an array out of them (meaning it tells the kernel
where to find the pieces) and starts it.

no? did you try? read the manual?
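From the Knoppix shell, something like this should do it (a sketch; device
names are assumptions, and --scan saves you from guessing them):

mdadm --examine --scan                          # list arrays found in the superblocks
mdadm --assemble /dev/md0 /dev/hda1 /dev/hdb1   # or simply: mdadm --assemble --scan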


Re: access *exisiting* array from knoppix

2006-09-14 Thread Tuomas Leikola

On 9/14/06, Dexter Filmore [EMAIL PROTECTED] wrote:

How about you read the rest of the thread, wisecracker?


sorry. <mailreader-excuse/>


Re: proactive-raid-disk-replacement

2006-09-10 Thread Tuomas Leikola

On 9/10/06, Bodo Thiesen [EMAIL PROTECTED] wrote:

So, we need a way to feed back the redundancy from the raid5 to the raid1.

<snip long explanation>

Sounds awfully complicated to me. Perhaps this is how it internally
works, but my 2 cents go to the option to gracefully remove a device
(migrating to a spare without losing redundancy) in the kernel (or
mdadm).

I'm thinking

mdadm /dev/raid-device -a /dev/new-disk
mdadm /dev/raid-device --graceful-remove /dev/failing-disk

This would also, hopefully, provide a path to take instead of kicking
(multiple) disks when bad blocks occur.


Re: Please help me save my data

2006-09-09 Thread Tuomas Leikola

On 9/8/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

So, what I want to do is:

 * Mark the synced spare drive as working and in position 1
 * Assemble the array without the unsynced spare and check if this
   provides consistent data
 * If it didnt, I want to mark the synced spare as working and in
   position 3, and try the same thing again
 * When I have it working, I just want to add the unsynced spare and
   let it sync normally
 * Then I will create a write-intent bitmap to avoid the dangerously
   long sync times, and also buy a new USB controller hoping that it
   will solve my problems


You can recreate the raid array with 1 missing disk, like this:

mdadm -C /dev/md1 /dev/sdn1 /dev/sdX1 /dev/sdn1 /dev/sdn1 missing

The ordering is relevant (raid-disks 0,1,2,3,4 and so on). Beware: you
have to get the chunk size and layout right, so it's best to back up the
mdadm --examine and --detail output beforehand.
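A sketch of saving that output first (the device glob and file names are
mine):

mdadm --examine /dev/sd[a-e]1 > examine-backup.txt
mdadm --detail /dev/md1 > detail-backup.txt     # if the array still assembles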

This create operation causes no sync (so no data gets overwritten), as
there is still one drive missing, but the raid superblocks are rewritten.

(On a side note, I'm uncertain whether a bitmap helps in the case of a
single-device remove-add cycle; I thought it was only for crashes, at
least for now.)


Re: Check/repair on composite RAID

2006-09-09 Thread Tuomas Leikola

On 9/9/06, Richard Scobie [EMAIL PROTECTED] wrote:

If I have a RAID 10, comprising a RAID 0, /dev/md3 made up of RAID1,
/dev/md1 and RAID1, /dev/md2 and I do an:

echo repair > /sys/block/md3/md/sync_action

will this run simultaneous repairs on the underlying RAID 1's, or
should separate repairs be done to md1 and md2?


check/repair is pointless on raid0, as there is no redundancy.

You should run separate checks (repairs) on the underlying devices.
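In this example that would be (a sketch, using the device names from the
question):

echo repair > /sys/block/md1/md/sync_action
echo repair > /sys/block/md2/md/sync_action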


Re: Feature Request/Suggestion - Drive Linking

2006-09-03 Thread Tuomas Leikola

This way I could get the replacement in and do the resync without
actually having to degrade the array first.


<snip>


2) This sort of brings up a subject I'm getting increasingly paranoid
about. It seems to me that if disk 1 develops an unrecoverable error at
block 500 and disk 4 develops one at 55,000 I'm going to get a double
disk failure as soon as one of the bad blocks is read


Here's an alternate description. On first 'unrecoverable' error, the
disk is marked as FAILING, which means that a spare is immediately
taken into use to replace the failing one. The disk is not kicked, and
readable blocks can still be used to rebuild other blocks (from other
FAILING disks).

The rebuild can be more like a ddrescue type operation, which is
probably a lot faster in the case of raid6, and the disk can be
automatically kicked after the sync is done. If there is no read
access to the FAILING disk, the rebuild will be faster just because
seeks are avoided in a busy system.

Personally I feel this is a good idea, count my vote in.

- Tuomas



Re: RAID6 fallen apart

2006-09-03 Thread Tuomas Leikola

Possibly safer to recreate with two missing if you aren't sure of the
order.  That way you can look in the array to see if it looks right,
or if you have to try a different order.


I'd say it's safer to recreate with all disks, in order to get the
resync. Otherwise you risk the all-too-famous silent data corruption on
stripes with writes in flight at the time of failure.

Tuomas



Re: RAID6 fallen apart

2006-09-03 Thread Tuomas Leikola

On 9/3/06, Tuomas Leikola [EMAIL PROTECTED] wrote:

 Possibly safer to recreate with two missing if you aren't sure of the
 order.  That way you can look in the array to see if it looks right,
 or if you have to try a different order.

I'd say it's safer to recreate with all disks, in order to get the
resync. Otherwise you risk the all-too-famous silent data corruption on
stripes with writes in flight at the time of failure.


Meant to say: after you know the correct order. Sorry.



Tuomas





Re: Resize on dirty array?

2006-08-12 Thread Tuomas Leikola

On 8/9/06, James Peverill [EMAIL PROTECTED] wrote:


I'll try the force assemble but it sounds like I'm screwed.  It sounds
like what happened was that two of my drives developed bad sectors in
different places that weren't found until I accessed certain areas (in
the case of the first failure) and did the drive rebuild (for the second
failure).  In the future, is there a way to help prevent this?


This is a common scenario, and I feel it could be helped if md could be
told not to drop the disk on the first failure, but rather keep it running
in FAILING status (as opposed to FAILED) until all data from it has
been evacuated (to a hot spare). This way, if another disk became failing
during the rebuild, due to another area of the disk, those blocks could be
rebuilt using the other failing disk. (Also, this would allow the
rebuild to be mostly a ddrescue-style copy operation rather than a
parity computation.)

Do you guys feel this is feasible? Neil?


Re: trying to brute-force my RAID 5...

2006-07-23 Thread Tuomas Leikola

On 7/19/06, Sevrin Robstad [EMAIL PROTECTED] wrote:

I tried file -s /dev/md0 also, and with one of the disks as the first disk
I got ext3 filesystem data (needs journal recovery) (errors).


Congratulations, you have found your first disk. Does fsck still
complain about the magic number?
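A read-only check is the safe way to look (a sketch, assuming ext3 as the
file output suggests):

fsck -n /dev/md0    # -n: report problems but don't modify anything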


Re: Two-disk RAID5?

2006-04-26 Thread Tuomas Leikola
 No.  When one of the 2 drives in your RAID5 dies, and all you have for
 some blocks is parity info, how will the missing data be reconstructed?

 You could [I suspect] create a 2 disk RAID5 in degraded mode (3rd member
 missing), but it'll obviously lack redundancy until you add a 3rd disk,
 which won't add anything to your RAID5 storage capacity.

IMO if you have a 2-disk raid5, the parity for each block is the same
as the data. There is a performance drop, as I suspect md isn't smart
enough to read data from both disks, but that's all.

When one disk fails, the (lone) parity block is quite enough to
reconstruct the data. With XOR parity, you can always assume any number of
additional disks full of zeroes; it doesn't really change the algorithm.
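This is easy to try out with loop devices (a sketch; file sizes, paths and
device names are assumptions):

dd if=/dev/zero of=/tmp/d0 bs=1M count=64
dd if=/dev/zero of=/tmp/d1 bs=1M count=64
losetup /dev/loop0 /tmp/d0
losetup /dev/loop1 /tmp/d1
mdadm --create /dev/md9 --level=5 --raid-devices=2 /dev/loop0 /dev/loop1
# write some data to /dev/md9, then compare the data areas of the two loop devices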

(maybe mdadm could/can change a raid-1 into raid5 by just changing the
superblocks, for the purpose of expanding into more disks..)

- tuomas


Re: Can't mount /dev/md0 after stopping a synchronization

2006-04-09 Thread Tuomas Leikola
On 4/8/06, Mike Garey [EMAIL PROTECTED] wrote:
 I have one last question though.. When I update /boot/grub/menu.lst
 while booted from /dev/md0 with both disks available, does this file
 get written to the MBR on both disks, or do I have to do this
 manually?


Grub's configuration lives on both mirrors, as it's in the filesystem,
not in the MBR. At boot time, grub essentially mounts the filesystem and
reads the configuration from there. (Grub doesn't understand the mirror,
but it doesn't need to.)
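What you do have to do manually, once per disk, is install the boot loader
itself into each disk's MBR, e.g. (device names are assumptions):

grub-install /dev/hda
grub-install /dev/hdc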


Re: Real Time Mirroring of a NAS

2006-04-07 Thread Tuomas Leikola
  I'm looking for a way to create a real-time mirror of a NAS. In other words,
  say I have a 5.5 TB NAS (3ware 16-drive array, RAID-5, 500 GB drives). I 
  want
  to mirror it in real time to a completely separate 5.5 TB NAS. RSYNCing in
  the background is not an option. The two NAS boxes need to hold identical
  data at all times. It is NOT necessary that the data be accessed from both
  NAS boxes simultaneously. One is simply a backup of the other.
 
 I don't have any experience with it, but I've often seen DRBD mentioned
 for just this sort of situation.  I'd look into that.


Linux NBD (and md on top) is a simpler solution for the same thing.
I'd look into that also :)
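A minimal sketch of the NBD+md idea for a fresh setup (host name, port and
device names are all assumptions; nbd syntax varies by version):

# on the backup NAS: export its storage
nbd-server 2000 /dev/sdb1
# on the primary NAS: import it and mirror the local storage onto it
nbd-client backup-nas 2000 /dev/nbd0
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/nbd0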


Re: Can't mount /dev/md0 after stopping a synchronization

2006-04-05 Thread Tuomas Leikola
On 4/5/06, Mike Garey [EMAIL PROTECTED] wrote:
 I tried booting from /dev/hdc1 (as /dev/md0 in grub) using a 2.6.15
 kernel with md and raid1 support built in and this is what I now get:

 md: autodetecting raid arrays
 md: autorun ...
 md: considering hdc1 ...
 md: adding hdc1 ...
 md: created md0
 md: bind:hdc1
 raid1: RAID set md0 active with 1 out of 2 mirrors
 md: ...autrun done.

 Warning: unable to open an initial console
 Input: AT translated set 2 keyboard as /class/input/input0

 and then at this point, the system just hangs and nothing happens.  So
 I seem to be getting closer.. If I try booting from a kernel without
 raid1 and md support, but using an initrd with raid1/md modules, then
 I get the ALERT! /dev/md0 does not exist.  Dropping to a shell!
 message.  I can't understand why there would be any difference between
 using a kernel with raid1/md support, or using an initrd image with
 raid1/md support, but apparently there is.  If anyone else has any
 suggestions, please keep them coming.

Sounds like your initrd could use a command like

mdadm --assemble /dev/md0 /dev/hda1 /dev/hdc1

at some point before mounting the real rootfs. There are many cleaner
examples in the list archive, but that should do the trick. It seems
like your initrd-kernel doesn't autostart the raid for some reason
(config option?).

Note, you should never do any read/write access to the component disks
after creating the raid. I guess you know this already, but some
wording seemed suspect.

Can you be more specific about what the problem is with mounting md0? The
log snippet doesn't show any errors about that.


Re: Conflicting Size Numbers -- Bug in mdadm, md, df?

2006-03-28 Thread Tuomas Leikola
On 3/27/06, andy liebman [EMAIL PROTECTED] wrote:
 Case 1: When we stripe together TWO RAW 3ware RAID-5 devices (i.e.,
 /dev/sdc + /dev/sdd = /dev/md2), df -h tells us that the device is 11
 TB in size. df -k tells us that the device is 10741827072 blocks in
 size and cat /proc/partitions tells us the md device is 10741958144
 blocks in size (a little larger)

This is what you lose in creating a filesystem. df reports the space
available for files; /proc/partitions reports the size of the underlying
block device.

 Case 2: When we create a SINGLE partition on each 3ware device using
 parted, the partitions /dev/sdb1 and /dev/sdc1 are each reported to be
 34 blocks smaller than the RAW 3ware devices mentioned above in Case 1.

Partition table+overhead

 Yet, when we stripe together /dev/sdb1 + /dev/sdc1, we get a Linux md
 device that is IDENTICAL in size to the Linux md device mentioned
 above -- 10741958144 blocks. We don't understand why the resulting
 Linux md device isn't 68 blocks smaller than when we use the raw 3ware
 device. In the SINGLE partition case, df -h also tells us that the
 device is 11 TB in size.

I'd suspect the reason is the RAID's 256k chunk size: the resulting
block device is rounded down to a chunk multiple, and 68 blocks isn't that
much. Didn't do the math, though.

 Case 3: However, when we use mdadm to stripe together the first
 partition on each device and also to stripe together the second
 partition on each device (/dev/sdb1 + /dev/sdc1 = /dev/md1 AND /dev/sdb2
 + /dev/sdc2 = /dev/md2), df -h reports that the total size of the two
 Linux RAID-0 arrays is 0.8 TB LESS than when we stripe together the RAW
 3ware devices or when we only have ONE partition.

That seems curious; however, I'd trust the block count in this case.
0.8 TB is a lot.

 And df -k reports
 that the total block size of the two mdX arrays is 10741694464 blocks,
 which is 114532 blocks smaller than size reported for the md device
 when we have NO partitions and 132072 blocks smaller than when we have a
 SINGLE partition.

In addition to chunk-size rounding, you also lose some space to the md
superblock and other metadata (bitmap, etc., if you have those). 100k
blocks is around what I'd expect. Didn't do the math here either, though.

 We are wondering what these discrepencies mean and whether they could
 lead to filesystem corruption issues?

Hope I've shed some light on this.

mdadm 1.x isn't the latest. 2.3.x is.

Tuomas


Re: Avoiding resync of RAID1 during creation

2006-02-20 Thread Tuomas Leikola
On 2/20/06, Bryan Wann [EMAIL PROTECTED] wrote:
  mdadm --assume-clean
 
 What version of mdadm was that from?  From mdadm(8) in
 mdadm-1.11.0-4.fc4 on my systems:

cut
 I tried with --assume-clean, it still wanted to sync.

The man page I quoted was from 2.3.1 (6 Feb) - relatively new.

I tested this on 2 boxes: 1.9.0 starts the resync and 2.3.1 doesn't. I
used kernel 2.6.14 - although I don't expect that to make much of a
difference.
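For reference, a sketch of the invocation in question (device names are
mine):

mdadm --create /dev/md0 --level=1 --raid-devices=2 --assume-clean /dev/sda1 /dev/sdb1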

-tuomas


Re: paralellism of device use in md

2006-01-22 Thread Tuomas Leikola
On 1/19/06, Neil Brown [EMAIL PROTECTED] wrote:

 The read balancing in raid1 is clunky at best.  I've often thought
 there must be a better way.  I've never thought what the better way
 might be (though I haven't tried very hard).

 If anyone would like to experiment with the read-balancing code,
 suggest and test changes, it would be most welcome.


An interesting and desperately complex topic, intertwined with IO
schedulers in general. I'll follow with my 2 cents.

The way I see this, there are two fundamentally different approaches:
optimize for throughput or latency.

When optimizing for latency, the balancer would always choose a device
that can serve a request in the shortest time. This is close to what
the current code does, although it doesn't seem to account for the devices'
pending request queue lengths. (I'd estimate that for a traditional ATA
disk, around 2-3 short-seek requests are worth 1 long seek, because of
spindle latency). I'd assume a fair in-order service for the latency
mode.

When optimizing for throughput, the balancer would choose a device
that will have its total queue completion time increased the least.
This indicates reordering of requests etc.

For a queue depth of 1, the throughput balancer would pick the closest
available device as long as the devices are idle, and when they are
all busy, leave the requests into array-wide queue until one of the
devices becomes available, and then dequeue the request the device can
serve fastest (or one that's had its deadline exceeded).

Both approaches become difficult when taking into account device
queues. The throughput balancer, as described, could just estimate how
close the new request is to all others already in the device, and pick
one that is near the other work. The latency scheduler is probably
pretty much useless in this scenario, as its definition will change if
requests can push each other around. I'd expect it to be useful in the
common desktop configuration with no device queues though.

One thing I'd like to see is more powerful estimates of request cost
for a device. It's possible, if not practical, to profile devices for
things like spindle latency and sector locations. If this cost
estimation data is correct enough, per-device queues become less
important as performance factors. As it is now, one can only hope that
requests that are near LBA-wise are near timewise, which is not true
for most devices. Yes, I know it's mostly wishful thinking.
Measurements would be tricky and would provide complex maps for
estimating costs, and (I think) would be virtually impossible to do
correctly for anything with device queues.

I'd expect that no drives in the market expose this kind of latency
estimation data to the controller or OS. I'd also expect that high end
storage system vendors use the very same information in their hardware
raid implementations to provide better queuing and load balancing.

Both the described balancer algorithms can be implemented somewhat
easily, and (I'd expect) will work relatively well with common desktop
drives. They could be optional (like the IO schedulers currently are),
and different cost estimation algorithms could also be optional (and
tunable if autotuning is out of question). Unfortunately my kernel
hacking skills are too weak for most of this - it needs someone else
who's interested enough.