[BUGREPORT] The kernel thread for md RAID10 could cause an md RAID10 array deadlock

2008-02-13 Thread K.Tanaka
This message describes another md RAID10 issue, found by testing the 2.6.24
md RAID10 code with the new SCSI fault injection framework.

Abstract:
When a SCSI command timeout occurs during RAID10 recovery, the kernel
threads for md RAID10 can deadlock the array: the nr_pending flag set during
normal I/O conflicts with the barrier flag set by the recovery thread, and
raid10d() and sync_request() end up waiting on each other.

Details:
(Normal I/O steps are labelled A-n; recovery I/O steps are labelled B-n.)

   B-1. The recovery kernel thread starts by calling md_do_sync().
   A-1. A process issues a read request; make_request() for raid10 is
        called by the block layer.
   B-2. md_do_sync() calls the sync_request operation for md raid10.
   A-2. In make_request(), wait_barrier() increments the nr_pending flag.
   A-3. A read command is issued to the disk, but it takes a long time
        because there is no response from the disk.
   B-3. sync_request() of raid10 calls raise_barrier(), increments the
        barrier flag, and waits for the nr_pending flag set in (A-2) to
        be cleared.
   A-4. raid10_end_read_request() is called in interrupt context. It
        detects the read error and wakes up the raid10d kernel thread.
   A-5. raid10d() calls freeze_array() and waits for the barrier flag
        incremented in (B-3) to be cleared.

(** stalls here because the waiting conditions in A-5 and B-3 are never met **)

The steps below never run because of the stall:

   A-6. raid10d() calls fix_read_error() to handle the read error.
   B-4. The barrier flag will be cleared after the pending barrier
        request completes.
   A-7. The nr_pending flag will be cleared after the pending read
        request completes.

The deadlock mechanism:
When normal I/O occurs during recovery, the nr_pending flag incremented in (A-2)
blocks subsequent recovery I/O until the normal I/O completes. The recovery thread
then increments the barrier flag and waits for the nr_pending flag to be
decremented (B-3).

Normally, the nr_pending flag is decremented after the normal I/O completes
successfully, and the barrier flag is decremented after the barrier request
(such as recovery I/O) completes successfully.

If a normal read I/O results in a SCSI command timeout, however, the read request
is handled by the error handler in the raid10d kernel thread. raid10d then calls
freeze_array(), but because the barrier flag was set in (B-3), freeze_array()
waits for the barrier request to complete. Meanwhile, the recovery thread keeps
waiting for the nr_pending flag to be decremented (B-3). In this way, the error
handler and the recovery thread deadlock each other.

This problem can be reproduced with the new SCSI fault injection framework by
simulating a SCSI device that gives no response. I think the framework is a
little complicated to use, so I will upload some sample wrapper shell scripts
to make it easier.

--
Kenichi TANAKA | Open Source Software Platform Development Division
               | Computers Software Operations Unit, NEC Corporation
               | [EMAIL PROTECTED]


Re: Got raid10 assembled wrong - how to fix?

2008-02-13 Thread Michael Tokarev
George Spelvin wrote:
> I just discovered (the hard way, sigh, but not too much data loss) that a
> 4-drive RAID 10 array had the mirroring set up incorrectly.
>
> Given 4 drives A, B, C and D, I had intended to mirror A-C and B-D,
> so that I could split the mirror and run on either (A,B) or (C,D).
>
> However, it turns out that the mirror pairs are A-B and C-D.  So
> pulling both A and B off-line results in a non-functional array.
>
> So basically what I need to do is to decommission B and C, and rebuild
> the array with them swapped: A, C, B, D.
>
> Can someone tell me if the following incantation is correct?
>
> mdadm /dev/mdX -f /dev/B -r /dev/B
> mdadm /dev/mdX -f /dev/C -r /dev/C
> mdadm --zero-superblock /dev/B
> mdadm --zero-superblock /dev/C
> mdadm /dev/mdX -a /dev/C
> mdadm /dev/mdX -a /dev/B

That should work.

But I think you'd better just physically swap the drives instead: that way
no rebuilding of the array will be necessary, and your data stays safe the
whole time.
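
For reference, a minimal sketch of that procedure with verification steps,
assuming hypothetical device names (/dev/md0 built from /dev/sda1, /dev/sdb1,
/dev/sdc1 and /dev/sdd1, with sdb1 and sdc1 playing the roles of B and C).
Check which slot each re-added disk actually lands in with mdadm --detail
rather than assuming it:

  # Inspect the current layout and note which device occupies which slot.
  cat /proc/mdstat
  mdadm --detail /dev/md0

  # Fail and remove one drive from each (unintended) mirror pair,
  # then wipe their md superblocks.
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  mdadm --zero-superblock /dev/sdb1
  mdadm --zero-superblock /dev/sdc1

  # Re-add them in the swapped order and wait for the rebuild to finish.
  mdadm /dev/md0 --add /dev/sdc1
  mdadm /dev/md0 --add /dev/sdb1
  watch cat /proc/mdstat

  # Confirm the new slot assignment before relying on the new mirror pairing.
  mdadm --detail /dev/md0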

/mjt


Re: transferring RAID-1 drives via sneakernet

2008-02-13 Thread Michael Tokarev
Jeff Breidenbach wrote:
>> It's not a RAID issue, but make sure you don't have any duplicate volume
>> names.  According to Murphy's Law, if there are two / volumes, the wrong
>> one will be chosen upon your next reboot.
>
> Thanks for the tip. Since I'm not using volumes or LVM at all, I should be
> safe from this particular problem.

If you don't use names, you use numbers - like md0, md10, etc.  And since
the numbers now ARE the names, they should be different too.

There's more to this topic, much more.

There are different ways to start (assemble) the arrays.  I know at
least 4: kernel autodetection, mdadm with the devices listed in mdadm.conf,
mdadm with an empty mdadm.conf using the 'homehost' parameter (assemble
all our arrays), and the mdrun utility.  Also, some arrays may be assembled
during the initrd/initramfs stage, and some after...

The best is either mdadm with something in mdadm.conf, or mdadm with
homehost.  Note that with neither of these ways will your foreign array(s)
be assembled automatically; you will have to do it manually - which is much
better than screwing things up trying to mix-n-match pieces of the two
systems.  You'll just have to figure out the device names of your foreign
disks and issue an appropriate command, like this:

  mdadm --assemble /dev/md10 /dev/sdc1 /dev/sdd1 ...

using an mdN number that is not yet taken and the right device nodes for
your disks/partitions.

If you want to keep the disks in this machine, you can add the array info
to mdadm.conf, or refresh the superblock so that it carries the new homehost.
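
A minimal sketch of both options, assuming the foreign members are /dev/sdc1
and /dev/sdd1 (hypothetical names) and a reasonably recent mdadm:

  # Option 1: pin the array in mdadm.conf by UUID.
  # --examine --brief prints an ARRAY line for the member; review it,
  # then append it to the config file.
  mdadm --examine --brief /dev/sdc1
  mdadm --examine --brief /dev/sdc1 >> /etc/mdadm.conf

  # Option 2: rewrite the homehost stored in the superblock while assembling,
  # so the "assemble all our arrays" policy adopts the array from now on.
  mdadm --assemble /dev/md10 --update=homehost --homehost=$(hostname) \
        /dev/sdc1 /dev/sdd1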

But if you're using kernel autodetection or mdrun... well, I for one can't
help here - your arrays will be numbered/renumbered by chance...

/mjt


Re: transferring RAID-1 drives via sneakernet

2008-02-13 Thread David Greaves
Jeff Breidenbach wrote:
>> It's not a RAID issue, but make sure you don't have any duplicate volume
>> names.  According to Murphy's Law, if there are two / volumes, the wrong
>> one will be chosen upon your next reboot.
>
> Thanks for the tip. Since I'm not using volumes or LVM at all, I should be
> safe from this particular problem.

Here "volumes" is being used as a generic term.

You would be safest if, for the disks/partitions you are transferring, you
changed the partition type to 0x83 (Linux) instead of 0xfd, to keep the
kernel from autodetecting them.

Otherwise there is a risk that /dev/md0 and /dev/md1 will be transposed.

Having done that, you can manually assemble the array and then configure
mdadm.conf to associate the UUID with the correct md device.
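
A rough sketch of that sequence, assuming the transferred mirror is
/dev/sdc1 + /dev/sdd1 (hypothetical names; the sfdisk option spelling
depends on your util-linux version):

  # Change the partition type from 0xfd (Linux raid autodetect) to 0x83 (Linux)
  # so the kernel won't auto-assemble the pair at boot.
  sfdisk --change-id /dev/sdc 1 83   # newer util-linux: sfdisk --part-type /dev/sdc 1 83
  sfdisk --change-id /dev/sdd 1 83

  # Assemble by hand under an md number that is not already in use...
  mdadm --assemble /dev/md2 /dev/sdc1 /dev/sdd1

  # ...and record the UUID-to-name mapping in mdadm.conf so it keeps that name.
  mdadm --detail --brief /dev/md2 >> /etc/mdadm.conf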

David