Re: RAID1 root and swap and initrd

2006-12-19 Thread Bill Davidsen

Michael Tokarev wrote:

Andre Majorel wrote:
[]
  

Thanks Jurriaan and Gordon. I think I may still be f*cked,
however. The Lilo doc says you can't use raid-extra-boot=mbr-only
if boot= does not point to a raid device. Which it doesn't because
in my setup, boot=/dev/sda.

Using boot=/dev/md5 would solve the raid-extra-boot issue but the
components of /dev/md5 are not primary partitions (/dev/sda5,
/dev/sdb5) so I don't think that would work.



So just move it to sda1 (or sda2, sda3) from sda5, ensure you have
two identical drives (or at least that your boot partitions are laid
out identically), and use boot=/dev/md1 (or md2, md3).  Do NOT
use raid-extra-boot (set it to none), but install standard MBR
code in the boot sector of both drives (in Debian, that's the 'mbr' package;
lilo can be used for that too - once for each drive), and mark your
boot partition on both drives as active.

This is the cleanest setup for booting off RAID.  You'll have two
drives, both will be bootable, and both will be updated when
you run lilo.

Another bonus: if you ever install a foreign OS on this system
(they tend to overwrite boot code), all your stuff will still be intact -
the only thing you'll need to do to restore the Linux boot is to reset
the 'active' flag on your partitions (and no, the WinNT disk manager does
not allow you to do so - it has no ability to mark a non-DOS (non-Windows)
partition active).
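
For illustration, a minimal lilo.conf along the lines Michael describes - a
sketch only: boot= and raid-extra-boot= are the settings from this thread,
while the root device, kernel image and initrd paths are assumptions to be
adjusted for the actual system:

    # /etc/lilo.conf -- boot record goes into the components of /dev/md1
    boot=/dev/md1
    raid-extra-boot=none     # lilo itself never touches the MBRs
    root=/dev/md0            # assumed root array
    image=/vmlinuz
        label=linux
        initrd=/initrd.img
        read-only

Run lilo once after editing and it updates the boot sector on both halves
of md1.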

  

I *could* run lilo once for each disk after tweaking boot= in
lilo.conf, or just supply a different -M option but I'm not sure.
The Lilo doc is not terribly enlightening. Not for me, anyway. :-)



No, don't do that, even if you can automate it.  It's error-prone,
to say the least, and it will bite you at an unexpected moment.


The desirable solution is to use the DOS MBR (boot the active partition) and 
put the boot stuff in the RAID device. However, you can just write the 
MBR to hda and then to hdb. Note that you don't need to play with the 
partition names: the 2nd MBR will only be used if the 1st drive fails, 
and at that point, at the BIOS level, the 2nd drive will appear as hda (or C:) 
as long as LILO still uses the BIOS to load the next sector.
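
Roughly, that setup looks like this (a sketch, assuming two disks sda/sdb
with the boot partition as partition 1 on each; install-mbr is the command
from the Debian 'mbr' package mentioned earlier, and lilo -M is the option
Andre asked about - check the exact flags against your versions):

    # Put generic "boot the active partition" code in both MBRs
    install-mbr /dev/sda
    install-mbr /dev/sdb
    # ...or do the same with lilo, once per drive:
    #   lilo -M /dev/sda mbr
    #   lilo -M /dev/sdb mbr

    # Mark partition 1 bootable/active on both drives
    # (fdisk's 'a' command does the same interactively)
    sfdisk --activate /dev/sda 1
    sfdisk --activate /dev/sdb 1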


I tried that the hard way once, and didn't even know there was a failure 
until I got mail from the md monitor I was using, saying that md0 was 
degraded. From a sample size of one actual test, it works.


Now the bad news: most BIOS setups will fail over if the 1st drive dies 
outright. If the first drive returns a bad sector error, the BIOS will not 
fail over; it will give you some flavor of "I tried my best" message and 
not boot. Sample size of three machines on that problem.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]

2006-12-19 Thread Luben Tuikov
--- [EMAIL PROTECTED] wrote:
 From: Andrew Morton [EMAIL PROTECTED]
 Date: Sun, Dec 17, 2006 at 03:05:39AM -0800
  On Sun, 17 Dec 2006 12:00:12 +0100
  Rafael J. Wysocki [EMAIL PROTECTED] wrote:
  
   Okay, I have identified the patch that causes the problem to appear, 
   which is
   
   fix-sense-key-medium-error-processing-and-retry.patch
   
   With this patch reverted -rc1-mm1 is happily running on my test box.
  
  That was rather unexpected.   Thanks.
 
 I can confirm that 2.6.20-rc1-mm1 with this patch reverted mounts my
 raid6 partition without problems. This is x86_64 with SMP.
 

The reason was that my dev tree was tainted by this bug:

    if (good_bytes &&
-       scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
+       scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
            return;

in scsi_io_completion().  I had !!result there, which is wrong, and when
I diffed against master it produced a bad patch.

As James mentioned, one of the chunks is good and can go in.

Luben



Re: sata badness in 2.6.20-rc1? [Was: Re: md patches in -mm]

2006-12-19 Thread Andrew Morton
On Tue, 19 Dec 2006 15:26:00 -0800 (PST)
Luben Tuikov [EMAIL PROTECTED] wrote:

 The reason was that my dev tree was tainted by this bug:
 
     if (good_bytes &&
 -       scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
 +       scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
             return;
 
 in scsi_io_completion().  I had there !!result which is wrong, and when
 I diffed against master, it produced a bad patch.

Oh.  I thought that got sorted out.  It's a shame this wasn't made clear to
me..

 As James mentioned one of the chunks is good and can go in.

Please send a new patch, not referential to any previous patch or email,
including full changelogging.



Re: [solved] supermicro failure

2006-12-19 Thread Bill Davidsen

Louis-David Mitterrand wrote:

On Thu, Nov 09, 2006 at 03:27:31PM +0100, Louis-David Mitterrand wrote:
  
I forgot to add that, to help us solve this, we are ready to hire a paid 
consultant; please contact me by mail or phone at +33.1.46.47.21.30



Update: we eventually succeeded in reassembling the partition, with two 
missing disks.


Your update would be far more interesting if you found out why it 
ejected three drives at once... The obvious common failures, controller 
and power supply, would not prevent reassembly in a functional environment.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Odd (slow) RAID performance

2006-12-19 Thread Bill Davidsen

Mark Hahn wrote:

which is right at the edge of what I need. I want to read the doc on
stripe_cache_size before going huge; if that's in K, 10MB is a LOT of
cache when 256 works perfectly in RAID-0.


but they are basically unrelated.  in r5/6, the stripe cache is absolutely
critical in caching parity chunks.  in r0, it never functions this way,
though it may help some workloads a bit (IOs which aren't naturally aligned
to the underlying disk layout).


Any additional input appreciated, I would expect the speed to be
(Ndisk - 1)*SingleDiskSpeed without a huge buffer, so the fact that it isn't


as others have reported, you can actually approach that with naturally
aligned and sized writes.

I don't know what would be natural; I have three drives, 256K chunk size, 
and was originally testing with 1MB writes. I have a hard time seeing a 
case where there would be a need to read-alter-rewrite: each chunk 
should be writable as data1, data2, and parity, without readback. I was 
writing directly to the array, so the data should start on a chunk 
boundary. Until I went very large on stripe_cache_size, performance was 
almost exactly 100% of the write speed of a single drive. There is no 
obvious way to explain that other than writing one drive at a time. And 
shrinking the write size by factors of two resulted in decreasing 
performance, down to about 13% of the speed of a single drive. Such 
performance just isn't useful, and going to RAID-10 eliminated the 
problem, indicating that the RAID-5 implementation is the cause.
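
For anyone wanting to reproduce that kind of test, the knobs involved look
roughly like this (a sketch; /dev/md0 and the numbers are assumptions, and
the dd writes over the raw array just as in the test described above, so it
is destructive):

    # stripe_cache_size counts stripe heads (one page per member disk),
    # so 4096 on a 3-disk array is about 4096 * 3 * 4K = 48MB of cache
    cat /sys/block/md0/md/stripe_cache_size
    echo 4096 > /sys/block/md0/md/stripe_cache_size

    # 2GB sequential write straight to the array in 1MB chunks,
    # O_DIRECT so the page cache doesn't hide the array's behaviour
    dd if=/dev/zero of=/dev/md0 bs=1M count=2048 oflag=direct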


I'm doing the tests writing 2GB of data to the raw array, in 1MB 
writes. The array is RAID-5 with 256 chunk size. I wouldn't really 
expect any reads,


but how many disks?  if your 1M writes are to 4 data disks, you stand 
a chance of streaming (assuming your writes are naturally aligned, or 
else you'll be somewhat dependent on the stripe cache.)
in other words, your whole-stripe size is ndisks*chunksize, and for 
256K chunks and, say, 14 disks, that's pretty monstrous...

Three drives, so they could be totally isolated from other i/o.
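
To put numbers on that whole-stripe arithmetic for this case: with 3 drives
and 256K chunks the data portion of a stripe is (3 - 1) * 256K = 512K, so
writes sized and aligned to 512K can be committed blind (two data chunks
plus computed parity) with no read-modify-write. A hedged sketch, again
assuming /dev/md0 is a scratch array that may be overwritten:

    # full-stripe-sized writes: (3 - 1) * 256K = 512K of data per stripe
    dd if=/dev/zero of=/dev/md0 bs=512k count=4096 oflag=direct

    # compare with a sub-stripe write size, which forces read-modify-write
    dd if=/dev/zero of=/dev/md0 bs=64k count=32768 oflag=direct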


I think that's a factor often overlooked - large chunk sizes, especially
with r5/6 AND lots of disks, mean you probably won't ever do blind
updates, and thus need the r/m/w cycle.  in that case, if the stripe cache
is not big/smart enough, you'll be limited by reads.

I didn't have lots of disks, and when the data and parity are all being 
updated in full chunk increments, there's no reason for a read, since 
the data won't be needed. I agree that it's probably being read, but 
needlessly.


I'd like to experiment with this, to see how much benefit you really 
get from using larger chunk sizes.  I'm guessing that past 32K
or so, normal *ata systems don't speedup much.  fabrics with higher 
latency or command/arbitration overhead would want larger chunks.


tried was 2K blocks, so I can try other sizes. I have a hard time 
picturing why smaller sizes would be better, but that's what testing 
is for.


larger writes (from user-space) generally help, probably up to MB's.
smaller chunks help by making it more likely to do blind parity updates;
a larger stripe cache can help that too.

I tried 256B to 1MB sizes; 1MB was best, or more correctly least 
unacceptable.
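
A sketch of the chunk-size experiment suggested above, assuming three
scratch partitions (sdb1/sdc1/sdd1 are made-up names) that can be destroyed;
--assume-clean skips the initial resync so the timing isn't polluted by it:

    for chunk in 16 32 64 128 256; do
        mdadm --stop /dev/md9 2>/dev/null
        mdadm --create /dev/md9 --level=5 --raid-devices=3 --chunk=$chunk \
              --assume-clean --run /dev/sdb1 /dev/sdc1 /dev/sdd1
        echo "chunk=${chunk}K"
        # dd reports throughput on stderr; keep just the summary line
        dd if=/dev/zero of=/dev/md9 bs=1M count=2048 oflag=direct 2>&1 | tail -1
    done
    mdadm --stop /dev/md9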


I think I recall an earlier thread regarding how the stripe cache is used
somewhat naively - that all IO goes through it.  the most important 
blocks would be parity and ends of a write that partially update an 
underlying chunk.  (conversely, don't bother caching anything which 
can be blindly written to disk.) 

I fear that last parenthetical isn't being observed.

If it weren't for RAID-1 and RAID-10 being fast I wouldn't complain 
about RAID-5.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [solved] supermicro failure

2006-12-19 Thread Louis-David Mitterrand
On Tue, Dec 19, 2006 at 10:47:29PM -0500, Bill Davidsen wrote:
 Louis-David Mitterrand wrote:
 On Thu, Nov 09, 2006 at 03:27:31PM +0100, Louis-David Mitterrand wrote:
   
  I forgot to add that, to help us solve this, we are ready to hire a paid 
  consultant; please contact me by mail or phone at +33.1.46.47.21.30
 
 
  Update: we eventually succeeded in reassembling the partition, with two 
  missing disks.
 
 Your update would be far more interesting if you found out why it 
 ejected three drives at once... The obvious common failures, controller 
 and power supply, would not prevent reassembly in a functional environment.

Actually the motherboard and/or its on-board SCSI controller turned out 
to be defective. Reassembly succeeded once the disks were transferred to 
another box.

Has anyone seen such a hardware failure on a brand-new SuperMicro machine?

Thanks,