Re: stride / stripe alignment on LVM ?

2007-11-03 Thread Doug Ledford
On Fri, 2007-11-02 at 23:16 +0100, Janek Kozicki wrote:
 Bill Davidsen said: (by the date of Fri, 02 Nov 2007 09:01:05 -0400)
 
  So I would expect this to make a very large performance difference, so 
  even if it worked it would do so slowly.
 
 I was trying to find out the stripe layout for a few hours, using
 hexedit and dd. And I'm baffled:
 
 md1 : active raid5 hda3[0] sda3[1]
   969907968 blocks super 1.1 level 5, 128k chunk, algorithm 2 [3/2] [UU_]
   ^^^

You have the raid superblock in the front of hda3 and sda3, as well as a
bitmap I assume, which means that any data you write to md1 will
actually be written to sda3/hda3 *after* the superblock and bitmap.  If
you run mdadm -D /dev/md1 it will tell you the data offset (in sectors
IIRC).  When you hexedit hda3, you need to jump forward the same number
of sectors to get at the beginning of the actual md1 data.
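
Something along these lines (an untested sketch, reusing the device names
from your mail; the 264 is only an example value, use whatever -E actually
reports for your member devices) will dump the first few sectors of real
md1 data:

mdadm -E /dev/hda3 | grep -i 'data offset'
dd if=/dev/hda3 bs=512 skip=264 count=16 2>/dev/null | hexdump -C | head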

Of course, being raid5 with one disk missing, the data may or may not be
present in its non-parity format depending on exactly which block you
are looking at.

However, you don't really need to do anything to figure out the stripe
size on your array, you have it already.  All the talk about figuring
out stripe layouts is for external raid devices that hide the raid
layout from you.  When you are talking about your own raid device that
you created with mdadm, you specified the stripe layout when you created
the array.  In your case, the chunk size is 128K, and since you have a 3
disk raid5 array and one chunk in each stripe of a raid5 array is
parity, the amount of data stored per stripe is chunk size * (number of
disks - 1), so 256K in your case.  But you don't even have to align the
lvm to the stripe, just to a chunk, so you really only need to align the
lvm superblock so that the data starts at a 128K offset into the raid array.
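
If you want to sanity check where LVM ends up putting data, here's a rough
sketch (the --metadatasize value is just the usual trick people use to pad
the metadata area, and the exact rounding varies by LVM2 version, so verify
with the pvs check that pe_start comes out as a multiple of 128k):

pvcreate --metadatasize 250k /dev/md1
pvs -o +pe_start --units k /dev/md1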

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: stride / stripe alignment on LVM ?

2007-11-03 Thread Doug Ledford
On Sat, 2007-11-03 at 21:21 +0100, Janek Kozicki wrote:

  If you run mdadm -D /dev/md1 it will tell you the data offset
  (in sectors IIRC).
 
 Uh, I don't see it:

Sorry, it's part of mdadm -E instead:

[EMAIL PROTECTED] ~]# mdadm -E /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.1
    Feature Map : 0x1
     Array UUID : c746e4f5:b015ffac:7216dbbd:48d973a7
           Name : firewall:home:2
  Creation Time : Mon May 28 20:47:07 2007
     Raid Level : raid1
   Raid Devices : 2

  Used Dev Size : 625137018 (298.09 GiB 320.07 GB)
     Array Size : 625137018 (298.09 GiB 320.07 GB)
    Data Offset : 264 sectors
   Super Offset : 0 sectors
          State : clean
    Device UUID : 7efd05d5:dd921536:1d1a1750:6ba49303

Internal Bitmap : 8 sectors from superblock
    Update Time : Sat Nov  3 21:01:24 2007
       Checksum : 27b3958f - correct
         Events : 2


     Array Slot : 0 (0, 1)
    Array State : Uu
[EMAIL PROTECTED] ~]# 

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-11-02 Thread Doug Ledford
On Fri, 2007-11-02 at 03:41 -0500, Alberto Alonso wrote:
 On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
  Not in the older kernel versions you were running, no.
 
 These old versions (specially the RHEL) are supposed to be
 the official versions supported by Redhat and the hardware 
 vendors, as they were very specific as to what versions of 
 Linux were supported.

The key word here being "supported".  That means if you run across a
problem, we fix it.  It doesn't mean there will never be any problems.

  Of all people, I would think you would
 appreciate that. Sorry if I sound frustrated and upset, but 
 it is clearly a result of what "supported" and "tested" really
 means in this case.

I'm sorry, but given the "specially the RHEL" case you cited, it is
clear I can't help you.  No one can.  You were running first gen
software on first gen hardware.  You show me *any* software company
who's first gen software never has to be updated to fix bugs, and I'll
show you a software company that went out of business they day after
they released their software.

Our RHEL3 update kernels contained *significant* updates to the SATA
stack after our GA release, replete with hardware driver updates and bug
fixes.  I don't know *when* that RHEL3 system failed, but I would
venture a guess that it wasn't prior to RHEL3 Update 1.  So, I'm
guessing you didn't take advantage of those bug fixes.  And I would
hardly call once a quarter "continuously updating" your kernel.  In any
case, given your insistence on running first gen software on first gen
hardware and not taking advantage of the support we *did* provide to
protect you against that failure, I say again that I can't help you.

  I don't want to go into a discussion of
 commercial distros, which are "supported", as this is neither the
 time nor the place, but I don't want to open the door to the
 excuse of "it's an old kernel", it wasn't when it got installed.

I *really* can't help you.

 Outside of the rejected suggestion, I just want to figure out 
 when software raid works and when it doesn't. With SATA, my 
 experience is that it doesn't. So far I've only received one 
 response stating success (they were using the 3ware and Areca 
 product lines).

No, your experience, as you listed it, is that
SATA/usb-storage/Serverworks PATA failed you.  The software raid never
failed to perform as designed.

However, one of the things you are doing here is drawing sweeping
generalizations that are totally invalid.  You are saying your
experience is that SATA doesn't work, but you aren't qualifying it with
the key factor: SATA doesn't work in what kernel version?  It is
pointless to try and establish whether or not something like SATA works
in a global, all kernel inclusive fashion because the answer to the
question varies depending on the kernel version.  And the same is true
of pretty much every driver you can name.  This is why commercial
companies don't just certify hardware, but the software version that
actually works as opposed to all versions.  In truth, you have *no idea*
if SATA works today, because you haven't tried.  As David pointed out,
there was a significant overhaul of the SATA error recovery that took
place *after* the kernel versions that failed you which totally
invalidates your experiences and requires retesting of the later
software to see if it performs differently.

 Anyway, this thread just posed the question, and as Neil pointed
 out, it isn't feasible/worthwhile to implement timeouts within the md
 code. I think most of the points/discussions raised beyond that
 original question really belong to the thread "Software RAID when
 it works and when it doesn't".
 
 I do appreciate all comments and suggestions and I hope to keep
 them coming. I would hope however to hear more about success
 stories with specific hardware details. It would be helpful
 to have a list of tested configurations that are known to work.

I've had *lots* of success with software RAID as I've been running it
for years.  I've had old PATA drives fail, SCSI drives fail, FC drives
fail, and I've had SATA drives that got kicked from the array due to
read errors but not out and out drive failures.  But I keep at least
reasonably up to date with my kernels.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: switching root fs '/' to boot from RAID1 with grub

2007-11-02 Thread Doug Ledford
On Thu, 2007-11-01 at 11:57 -0700, H. Peter Anvin wrote:
 Doug Ledford wrote:
  
  Correct, and that's what you want.  The alternative is that if the BIOS
  can see the first disk but it's broken and can't be used, and if you
  have the boot sector on the second disk set to read from BIOS disk 0x81
  because you ASSuMEd the first disk would be broken but still present in
  the BIOS tables, then your machine won't boot unless that first dead but
  present disk is present.  If you remove the disk entirely, thereby
  bumping disk 0x81 to 0x80, then you are screwed.  If you have any drive
  failure that prevents the first disk from being recognized (blown fuse,
  blown electronics, etc), you are screwed until you get a new disk to
  replace it.
  
 
 What you want is for it to use the drive number that BIOS passes into it 
 (register DL), not a hard-coded number.  That was my (only) point -- 
 you're obviously right that hard-coding a number to 0x81 would be worse 
 than useless.

Oh, and I forgot to mention that in grub2, the DL register is ignored
for RAID1 devices.  Well, maybe not ignored, but once grub2 has
determined that the intended boot partition is a raid partition, the
raid code takes over and the raid code doesn't care about the DL
register.  Instead, it scans for all the other members of the raid array
and utilizes whichever drives it needs to in order to complete the boot
process.  And since it reads a sector (or a small group of sectors)
at a time, it doesn't need any member of a raid1 array to be perfect, it
will attempt a round robin read on all the sectors and only fail if all
drives return an error for a given read.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-11-02 Thread Doug Ledford
On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote:
 Doug Ledford wrote:
 
  I would argue that ext[234] should be clearing those 512 bytes.  Why
  aren't they cleared  
  
  Actually, I didn't think msdos used the first 512 bytes for the same
  reason ext3 doesn't: space for a boot sector.
  
 
 The creators of MS-DOS put the superblock in the bootsector, so that the 
 BIOS loads them both.  It made sense in some diseased Microsoft 
 programmer's mind.
 
 Either way, for RAID-1 booting, the boot sector really should be part of 
 the protected area (and go through the MD stack.)

It depends on what you are calling the protected area.  If by that you
mean outside the filesystem itself, and in a non-replicated area like
where the superblock and internal bitmaps go, then yes, that would be
ideal.  If you mean in the file system proper, then that depends on the
boot loader.

   The bootloader should 
 deal with the offset problem by storing partition/filesystem-relative 
 pointers, not absolute ones.

Grub2 is on the way to this, but it isn't there yet.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-11-02 Thread Doug Ledford
On Fri, 2007-11-02 at 13:21 -0500, Alberto Alonso wrote:
 On Fri, 2007-11-02 at 11:45 -0400, Doug Ledford wrote:
 
  The key word here being supported.  That means if you run across a
  problem, we fix it.  It doesn't mean there will never be any problems.
 
 On hardware specs I normally read "supported" as "tested within that
 OS version to work within specs". I may be expecting too much.

It was tested; it simply had a bug that you hit.  Assuming that
your particular failure situation is the only possible outcome for all
the other people that used it would be an invalid assumption.  There are
lots of code paths in an error handler routine, and lots of different
hardware failure scenarios, and they each have their own independent
outcome should they ever be experienced.

  I'm sorry, but given the "specially the RHEL" case you cited, it is
  clear I can't help you.  No one can.  You were running first gen
  software on first gen hardware.  You show me *any* software company
  whose first gen software never has to be updated to fix bugs, and I'll
  show you a software company that went out of business the day after
  they released their software.
 
 I only pointed to RHEL as an example since that was a particular
 distro that I use and that exhibited the problem. I probably could have
 replaced it with Suse, Ubuntu, etc. I may have called the early
 versions back in 94 first gen but not today's versions. I know I 
 didn't expect the SLS distro to work reliably back then. 

Then you didn't pay attention to what I said before: RHEL3 was the first
ever RHEL product that had support for SATA hardware.  The SATA drivers
in RHEL3 *were* first gen.

 Can you provide specific chipsets that you used (specially for SATA)? 

All of the Adaptec SCSI chipsets through the 7899, Intel PATA, QLogic
FC, and nVidia and winbond based SATA.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: switching root fs '/' to boot from RAID1 with grub

2007-11-01 Thread Doug Ledford
On Thu, 2007-11-01 at 10:31 -0700, H. Peter Anvin wrote:
 Doug Ledford wrote:
  
  device /dev/sda (hd0)
  root (hd0,0)
  install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
  /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
  device /dev/hdc (hd0)
  root (hd0,0)
  install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
  /boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
  
  That will install grub on the master boot record of hdc and sda, and in
  both cases grub will look to whatever drive it is running on for the
  files to boot instead of going to a specific drive.
  
 
 No, it won't... it'll look for the first drive in the system (BIOS drive 
 80h).

Yes, and except for some fantastic BIOS I've never heard of, the drive
that the BIOS reads the boot sector from is always the 0x80 drive.  This
is either because the drive truly is the first drive, or because the
BIOS is remapping a later drive to 0x80 for boot purposes.  In either
case, what I said is still true: the boot sector will look to read the
data files from the drive the boot sector itself was read from.

   This means that if the BIOS can see the bad drive, but it doesn't 
 work, you're still screwed.

Correct, and that's what you want.  The alternative is that if the BIOS
can see the first disk but it's broken and can't be used, and if you
have the boot sector on the second disk set to read from BIOS disk 0x81
because you ASSuMEd the first disk would be broken but still present in
the BIOS tables, then your machine won't boot unless that first dead but
present disk is present.  If you remove the disk entirely, thereby
bumping disk 0x81 to 0x80, then you are screwed.  If you have any drive
failure that prevents the first disk from being recognized (blown fuse,
blown electronics, etc), you are screwed until you get a new disk to
replace it.

Follow these simple rules when setting up boot sectors and you'll be OK:

1)  If you are using RAID1, then a boot sector should *never* try and
read data from anything other than the disk the boot sector is on.  To
do otherwise defeats the whole purpose of RAID1 which is that you only
need 1 disk to survive in order for the array to survive.
2)  If the BIOS runs any given MBR in the RAID array, then that MBR will
be on the disk the BIOS has mapped to 0x80.
3)  While there are failure scenarios that would leave a disk unusable
but still visible to the OS, there are no magic BIOS switches to fake a
totally dead device.  So, since you can remove a defunct but present
disk in order to allow disk B to become disk A, but you can't magic a
new disk A out of thin air should it fail to the point of not being
recognized, set all your raid boot sectors to think they are the first
disk in the system and you will always be able to start your machine.

So, what I said is true, the MBR will search on the disk it is being run
from for the files it needs: 0x80.
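
As a concrete sketch (grub legacy shell, run from the grub command; device
names are only examples), you get that by mapping each member to (hd0) in
turn and installing to it, so every MBR believes it lives on the first disk:

device (hd0) /dev/sda
root (hd0,0)
setup (hd0)
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)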

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-11-01 Thread Doug Ledford
On Thu, 2007-11-01 at 00:08 -0500, Alberto Alonso wrote:
 On Tue, 2007-10-30 at 13:39 -0400, Doug Ledford wrote:
  
  Really, you've only been bitten by three so far.  Serverworks PATA
  (which I tend to agree with the other person, I would probably chalk
 
 3 types of bugs is too many, it basically affected all my customers
 with  multi-terabyte arrays. Heck, we can also oversimplify things and 
 say that it is really just one type and define everything as kernel type
 problems (or as some other kernel used to say... general protection
 error).
 
 I am sorry for not having hundreds of RAID servers from which to draw
 statistical analysis. As I have clearly stated in the past I am trying
 to come up with a list of known combinations that work. I think my
 data points are worth something to some people, specially those 
 considering SATA drives and software RAID for their file servers. If
 you don't consider them important for you that's fine, but please don't
 belittle them just because they don't match your needs.

I wasn't belittling them.  I was trying to isolate the likely culprit in
the situations.  You seem to want the md stack to time things out.  As
has already been commented by several people, myself included, that's a
band-aid and not a fix in the right place.  The linux kernel community
in general takes a pretty hard line when it comes to fixing a bug the
wrong way.

  this up to Serverworks, not PATA), USB storage, and SATA (the SATA stack
  is arranged similar to the SCSI stack with a core library that all the
  drivers use, and then hardware dependent driver modules...I suspect that
  since you got bit on three different hardware versions that you were in
  fact hitting a core library bug, but that's just a suspicion and I could
  well be wrong).  What you haven't tried is any of the SCSI/SAS/FC stuff,
  and generally that's what I've always used and had good things to say
  about.  I've only used SATA for my home systems or workstations, not any
  production servers.
 
 The USB array was never meant to be a full production system, just to 
 buy some time until the budget was allocated to buy a real array. Having
 said that, the raid code is written to withstand the USB disks getting
 disconnected as long as the driver reports it properly. Since it doesn't,
 I consider it another case that shows when not to use software RAID
 thinking that it will work.
 
 As for SCSI, I think it is a well proven and reliable technology; I've
 dealt with it extensively and have always had great results. I now deal
 with it mostly on non-Linux based systems. But I don't think it is
 affordable to most SMBs that need multi-terabyte arrays.
 
  
   I'll repeat my plea one more time. Is there a published list
   of tested combinations that respond well to hardware failures
   and fully signals the md code so that nothing hangs?
  
  I don't know of one, but like I said, I've not used a lot of the SATA
  stuff for production.  I would make this one suggestion though, SATA is
  still an evolving driver stack to a certain extent, and as such, keeping
  with more current kernels than you have been using is likely to be a big
  factor in whether or not these sorts of things happen.
 
 OK, so based on this it seems that you would not recommend the use
 of SATA for production systems due to its immaturity, correct?

Not in the older kernel versions you were running, no.

  Keep in
 mind that production systems are not able to be brought down just to
 keep up with kernel changes. We have some tru64 production servers with
 1500 to 2500 days uptime, that's not uncommon in industry.

And I guarantee not a single one of those systems even knows what SATA
is.  They all use tried and true SCSI/FC technology.

In any case, if Neil is so inclined to do so, he can add timeout code
into the md stack, it's not my decision to make.

However, I would say that the current RAID subsystem relies on the
underlying disk subsystem to report errors when they occur instead of
hanging infinitely, which implies that the raid subsystem relies upon a
bug free low level driver.  It is intended to deal with hardware
failure, in as much as possible, and a driver bug isn't a hardware
failure.  You are asking the RAID subsystem to be extended to deal with
software errors as well.

Even though you may have thought it should handle this type of failure
when you put those systems together, it in fact was not designed to do
so.  For that reason, choice of hardware and status of drivers for
specific versions of hardware is important, and therefore it is also
important to keep up to date with driver updates.

It's highly likely that had you been keeping up to date with kernels,
several of those failures might not have happened.  One of the benefits
of having many people running a software setup is that when one person
hits a bug and you fix it, and then distribute that fix to everyone
else, you save everyone else from also hitting that bug.  You have

Re: switching root fs '/' to boot from RAID1 with grub

2007-11-01 Thread Doug Ledford
On Thu, 2007-11-01 at 20:04 +0100, Janek Kozicki wrote:
 Doug Ledford said: (by the date of Thu, 01 Nov 2007 14:30:58 -0400)
 
  So, what I said is true, the MBR will search on the disk it is being run
  from for the files it needs: 0x80.
 
 my motherboard allows me to pick a boot device if I press F11 during
 boot. Do you mean that no matter which HDD I choose it will
 have the number 0x80?

All the motherboard BIOS drive mapping things I've seen will do exactly
that.  In order to boot from say drive sda when hda is present, they map
BIOS device 0x80 to sda instead of hda.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: switching root fs '/' to boot from RAID1 with grub

2007-10-31 Thread Doug Ledford
On Wed, 2007-10-31 at 16:01 +0100, Janek Kozicki wrote:
 And now I have a full RAID1 array. Now just two questions:
 
 1. when I `shutdown -r now` I see a worrying message at the end:
 
 Stopping array md2  done (stopped)
 Stopping array md1  done (stopped)
 Stopping array md0  failed (busy)
 Will now reboot
 md: Stopping all md devices
 md: md0 still in use
   
 reboots
 
Is that ok ?

Yes.  That's typical for the root device because root is never unmounted
prior to shutdown.  The messages are probably more worrying than they
need to be.  The system should have successfully switched the array to
read only mode at the first attempt to stop the array.  Neil, any chance
on getting the messages for the / device to be less worrisome?

 2. Will grub update all drives automatically, for instance when I
upgrade the kernel by 'aptitude upgrade'? Or do I need to repeat your
grub instructions each time a new kernel is installed?

Now that grub's installed, you won't have to do anything manual again.
The only time you might have to repeat that grub install procedure is if
you lose a drive and need to add a new one back in, then the new one
will need it.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-30 Thread Doug Ledford
On Tue, 2007-10-30 at 07:55 +0100, Luca Berra wrote:

 Well it might be a matter of personal preference, but i would prefer
 an initrd doing just the minimum necessary to mount the root filesystem
 (and/or activating resume from a swap device), and leaving all the rest
 to initscripts, than an initrd that tries to do everything.

The initrd does exactly that.  The rescan for superblocks does not
happen in initrd or mkinitrd, it must be done manually.  The code in
mkinitrd uses the mdadm.conf file as it stands, but in the initrd image
it doesn't start all the arrays, just the needed arrays to get booted
into your / partition.
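
What the generated init script ends up running is roughly equivalent to
this (a sketch, not the literal initrd script; md0 as the root array is
just an example):

mdadm --assemble /dev/md0 --config=/etc/mdadm.conf
mount /dev/md0 /sysroot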

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-10-30 Thread Doug Ledford
On Tue, 2007-10-30 at 00:19 -0500, Alberto Alonso wrote:
 On Sat, 2007-10-27 at 12:33 +0200, Samuel Tardieu wrote:
  I agree with Doug: nothing prevents you from using md above very slow
  drivers (such as remote disks or even a filesystem implemented over a
  tape device to make it extreme). Only the low-level drivers know when
  it is appropriate to timeout or fail.
  
Sam
 
 The problem is when some of these drivers are just not smart
 enough to keep themselves out of trouble. Unfortunately I've
 been bitten by apparently too many of them.

Really, you've only been bitten by three so far.  Serverworks PATA
(which I tend to agree with the other person, I would probably chalk
this up to Serverworks, not PATA), USB storage, and SATA (the SATA stack
is arranged similar to the SCSI stack with a core library that all the
drivers use, and then hardware dependent driver modules...I suspect that
since you got bit on three different hardware versions that you were in
fact hitting a core library bug, but that's just a suspicion and I could
well be wrong).  What you haven't tried is any of the SCSI/SAS/FC stuff,
and generally that's what I've always used and had good things to say
about.  I've only used SATA for my home systems or workstations, not any
production servers.

 I'll repeat my plea one more time. Is there a published list
 of tested combinations that respond well to hardware failures
 and fully signals the md code so that nothing hangs?

I don't know of one, but like I said, I've not used a lot of the SATA
stuff for production.  I would make this one suggestion though, SATA is
still an evolving driver stack to a certain extent, and as such, keeping
with more current kernels than you have been using is likely to be a big
factor in whether or not these sorts of things happen.

 If not, I would like to see what people that have experienced
 hardware failures and survived them are using so that such
 a list can be compiled.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-10-30 Thread Doug Ledford
On Tue, 2007-10-30 at 00:08 -0500, Alberto Alonso wrote:
 On Mon, 2007-10-29 at 13:22 -0400, Doug Ledford wrote:
 
  OK, these you don't get to count.  If you run raid over USB...well...you
  get what you get.  IDE never really was a proper server interface, and
  SATA is much better, but USB was never anything other than a means to
  connect simple devices without having to put a card in your PC, it was
  never intended to be a raid transport.
 
 I still count them ;-) I guess I just would have hoped for software raid
 to really not care about the lower layers.

The job of software raid is to help protect your data.  In order to do
that, the raid needs to be run over something that *at least* provides a
minimum level of reliability itself.  The entire USB spec is written
under the assumption that a USB device can disappear at any time and the
stack must accept that (and it can, just trip on a cable some time and
watch your raid device get all pissy).  So, yes, software raid can run
over any block device, but putting it over an unreliable connection
medium is like telling a gladiator that he has to face the lion with no
sword, no shield, and his hands tied behind his back.  He might survive,
but you have so seriously handicapped him that it's all but over.

  
   * Supermicro MB with ICH5/ICH5R controller and 2 RAID5 arrays of 3 
 disks each. (only one drive on one array went bad)
   
   * VIA VT6420 built into the MB with RAID1 across 2 SATA drives.
   
   * And the most complex is this week's server with 4 PCI/PCI-X cards.
 But the one that hanged the server was a 4 disk RAID5 array on a
 RocketRAID1540 card.
  
  And 3 SATA failures, right?  I'm assuming the Supermicro is SATA or else
  it has more PATA ports than I've ever seen.
  
  Was the RocketRAID card in hardware or software raid mode?  It sounds
  like it could be a combination of both, something like hardware on the
  card, and software across the different cards or something like that.
  
  What kernels were these under?
 
 
 Yes, these 3 were all SATA. The kernels (in the same order as above) 
 are:
 
 * 2.4.21-4.ELsmp #1 (Basically RHEL v3)

*Really* old kernel.  RHEL3 is in maintenance mode already, and that was
the GA kernel.  It was also the first RHEL release with SATA support.
So, first gen driver on first gen kernel.

 * 2.6.18-4-686 #1 SMP on a Fedora Core release 2
 * 2.6.17.13 (compiled from vanilla sources)
 
 The RocketRAID was configured for all drives as legacy/normal and
 software RAID5 across all drives. I wasn't using hardware raid on
 the last described system when it crashed.

So, the system that died *just this week* was running 2.6.17.13?  Like I
said in my last email, the SATA stack has been evolving over the last
few years, and that's quite a few revisions behind.  My basic advice is
this: if you are going to use the latest and greatest hardware options,
then you should either make sure you are using an up to date distro
kernel of some sort or you need to watch the kernel update announcements
for fixes related to that hardware and update your kernels/drivers as
appropriate.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: switching root fs '/' to boot from RAID1 with grub

2007-10-30 Thread Doug Ledford
On Tue, 2007-10-30 at 21:07 +0100, Janek Kozicki wrote:
 Hello,
 
 I have an old HDD and two new HDDs:
 
 - hda1 - my current root filesystem '/'
 - sda1 - part of raid1 /dev/md0 [U_U]
 - hdc1 - part of raid1 /dev/md0 [U_U]
 
 I want all hda1, sda1, hdc1 to be a raid1. I remounted hda1 readonly 
 then I did 'dd if=/dev/hda1 of=/dev/md0'. I carefully checked that
 the partition sizes match exactly. So now md0 contains the same thing
 as hda1. 
 
 But hda1 is still outside of the array. I want to add it to the array.
 But before I do this I think that I should boot from /dev/md0 ?
 Otherwise I might hose this system. I tried `grub-install /dev/sda1`
 (assuming that grub would see no problem with reading raid1
 partition, and boot from it, until mdadm detects an array). I tried
 `grub-install /dev/sda` as well as on /dev/hdc and /dev/hdc1.
 I turned off the 'active' flag for partition hda1 and turned it on for hdc1
 and sda1. But grub still boots from hda1.

Well, without going into a lot of detail, you would have to boot
from /dev/hda1 and specify a root=/dev/md0 option to the kernel to
actually boot to the new / filesystem before grub-install will do what
you are expecting.  The fact that hda1 is mounted as / and that hda1
contains /boot with all your kernels and initrd images means that when
you run grub-install it looks up the current location of the /boot
files, sees they are on /dev/hda1, and regardless of where you put the
boot sector (sda1, hdc1), those sectors point to the files grub found
in /boot which are on hda1.
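
For example, the stanza you actually boot from needs a kernel line along
these lines (file names are placeholders for whatever kernel and initrd
you have installed):

kernel /boot/vmlinuz-2.6.x ro root=/dev/md0
initrd /boot/initrd.img-2.6.x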

 I did all this with version 1.1

Which won't work, and you'll see that as soon as you have md0 mounted
as / and try to run grub-install again.

 mdadm --create --verbose /dev/md0 --chunk=64 --level=raid1 \
   --metadata=1.1  --bitmap=internal --raid-devices=3 /dev/sda1 \
   missing /dev/hdc1
 
 I'm NOT using LVM here.
 
 Can someone tell me how should I switch grub to boot from /dev/md0 ?
 
 After the boot I will add hda1 to the array, and all three partitions
 should become a raid1.

Grub doesn't work with version 1.1 superblocks at the moment.  It could
be made to work quick and dirty in a short period of time; making it
work properly would take longer.

So, here's what I would do in your case.  Scrap the current /dev/md0
setup.  Make a new /dev/md0 using a version 1.0 superblock with all the
other options the same as before.  BTW, your partition sizes don't need
to match exactly.  If the new device is larger than your /dev/hda1, then
no big deal, just do the dd like you did before and when you are done
you can resize the fs to fit the device.  If the new device is slightly
smaller than /dev/hda1, then just run resize2fs to shrink the fs on /dev/hda1
to the size of /dev/md0 *before* you do the dd from
hda1 to md0.  Once you have the data copied to /dev/md0, you'll need to
reboot the system and this time specify /dev/md0 as your root device
(you may need to remake your initrd before you reboot, I don't know if
your initrd starts /dev/md0, but it needs to).  Once you are running
with md0 as your root partition, you need to run grub to install the
boot sectors on the md0 devices.  You can't use grub-install though, it
gets it wrong.  Run the grub command, then enter these commands:

device /dev/sda (hd0)
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst
device /dev/hdc (hd0)
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst

That will install grub on the master boot record of hdc and sda, and in
both cases grub will look to whatever drive it is running on for the
files to boot instead of going to a specific drive.

Next you need to modify the /etc/grub.conf and change all the root=
lines to be root=/dev/md0, and you need to modify /etc/fstab the same
way.  Then you probably need to remake all the initrd images so that
they contain the update.
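
On a Red Hat style setup that's roughly (the version string is a
placeholder, repeat for each installed kernel):

mkinitrd -f /boot/initrd-2.6.x.img 2.6.x

On Debian/Ubuntu the rough equivalent is update-initramfs -u -k <version>.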

Once you've done that, shut the system down, remove /dev/hda from the
machine entirely, move /dev/hdc to /dev/hda, then reboot.  The system
should boot up to your raid array just fine.  If it doesn't work, you
can always put your old hda back in and boot up from it.  If it does
work, shut the machine down one more time, put the old hda in as hdc,
boot back up (which should boot from hda to the md0 root, it should not
touch hdc), add hdc to the raid array, let it resync, and then the final
step is to run the grub install on hdc to make it match the other two
disks.  After that, you have a fully functional and booting raid1 array.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Sun, 2007-10-28 at 20:21 -0400, Bill Davidsen wrote:
 Doug Ledford wrote:
  On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote:

  On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote:
  
  The partition table is the single, (mostly) universally recognized
  arbiter of what possible data might be on the disk.  Having a partition
  table may not make mdadm recognize the md superblock any better, but it
  keeps all that other stuff from even trying to access data that it
  doesn't have a need to access and prevents random luck from turning your
  day bad.

  on a pc maybe, but that is 20 years old design.
  
 
  So?  Unix is 35+ year old design, I suppose you want to switch to Vista
  then?
 

  partition table design is limited because it is still based on C/H/S,
  which do not exist anymore.
  Put a partition table on a big storage, say a DMX, and enjoy a 20%
  performance decrease.
  
 
  Because you didn't stripe align the partition, your bad.

 Align to /what/ stripe? Hardware (CHS is fiction), software (of the RAID 
 you're about to create), or ??? I don't notice my FC6 or FC7 install 
 programs using any special partition location to start, I have only run 
 (tried to run) FC8-test3 for the live CD, so I can't say what it might 
 do. CentOS4 didn't do anything obvious, either, so unless I really 
 misunderstand your position at redhat, that would be your bad.  ;-)
 
 If you mean start a partition on a pseudo-CHS boundary, fdisk seems to 
 use what it thinks are cylinders for that.
 
 Please clarify what alignment provides a performance benefit.

Luca was specifically talking about the big multi-terabyte to petabyte
hardware arrays on the market.  DMX, DDN, and others.  When they export
a volume to the OS, there is an underlying stripe layout to that volume.
If you don't use any partition table at all, you are automatically
aligned with their stripes.  However, if you do, then you have to align
your partition on a chunk boundary or else performance drops pretty
dramatically because more writes than not end up crossing chunk
boundaries unnecessarily.  It's only relevant when you are talking about
a raid device that shows the OS a single logical disk made from lots of
other disks.
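
A quick illustration (sfdisk syntax; the device name and chunk size are
just examples): if the exported volume uses a 64k chunk, starting the one
partition at sector 128 instead of the traditional sector 63 keeps it
chunk aligned:

echo '128,,83' | sfdisk -uS /dev/sdX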


-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:22 -0400, Bill Davidsen wrote:

  consider a storage with 64 spt, an io size of 4k and partition starting
  at sector 63.
  first io request will require two ios from the storage (1 for sector 63,
  and one for sectors 64 to 70)
  the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be
  on the same track
  the 8th will again require to be split, and so on.
  this causes the storage to do 1 unnecessary io every 8. YMMV.
 No one makes drives with fixed spt any more. Your assumptions are a 
 decade out of date.

You're missing the point, it's not about drive tracks, it's about array
tracks, aka chunks.  A 64k write, that should write to one and only one
chunk, ends up spanning two.  That increases the amount of writing the
array has to do and the number of disks it busies for a typical single
I/O operation.
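
To put numbers on it: with the partition starting at sector 63, filesystem
block 0 sits 63*512 = 31.5k into the volume, so a 64k write at the start of
the filesystem covers bytes 31.5k through 95.5k of the volume and straddles
the chunk boundary at 64k.  Two chunks get touched for what should have
been a single-chunk write.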

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:

 Remaking the initrd installs the new mdadm.conf file, which would have
 then contained the whole disk devices and its UUID.  Therein would
 have been the problem.
 yes, i read the patch, i don't like that code, as i don't like most of
 what has been put in mkinitrd from 5.0 onward.
 Imho the correct thing here would not have been copying the existing
 mdadm.conf but generating a safe one from output of mdadm -D (note -D,
 not -E)

I'm not sure I'd want that.  Besides, what makes you say -D is safer
than -E?

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 09:18 +0100, Luca Berra wrote:
 On Sun, Oct 28, 2007 at 10:59:01PM -0700, Daniel L. Miller wrote:
 Doug Ledford wrote:
 Anyway, I happen to *like* the idea of using full disk devices, but the
 reality is that the md subsystem doesn't have exclusive ownership of the
 disks at all times, and without that it really needs to stake a claim on
 the space instead of leaving things to chance IMO.

 I've been re-reading this post numerous times - trying to ignore the 
 burgeoning flame war :) - and this last sentence finally clicked with me.
 
 I am sorry Daniel, when i read Doug and Bill, stating that your issue
 was not having a partition table, i immediately took the bait and forgot
 about your original issue.

I never said *his* issue was lack of partition table, I just said I
don't recommend that because it's flaky.  The last statement I made
about his issue was to ask about whether the problem was happening
during initrd time or sysinit time to try and identify if it was failing
before or after / was mounted to try and determine where the issue might
lie.  Then we got off on the tangent about partitions, and at the same
time Neil started asking about udev, at which point it came out that
he's running ubuntu, and as much as I would like to help, the fact of
the matter is that I've never touched ubuntu and wouldn't have the
faintest clue, so I let Neil handle it.  At which point he found that
the udev scripts in ubuntu are being stupid, and from the looks of it
are the cause of the problem.  So, I've considered the initial issue
root caused for a bit now.


 like udev/hal that believes it knows better than you about what you have
 on your disks.
 but _NEITHER OF THESE IS YOUR PROBLEM_ imho

Actually, it looks like udev *is* the problem, but not because of
partition tables.

 I am also sorry to say that i fail to identify what the source of your
 problem is, we should try harder instead of flaming between us.

We can do both, or at least I can :-P

 Is it possible to reproduce it on the live system
 e.g. unmount, stop array, start it again and mount.
 I bet it will work flawlessly in this case.
 then i would disable starting this array at boot, and start it manually
 when the system is up (stracing mdadm, so we can see what it does)
 
 I am also wondering about this:
 md: md0: raid array is not clean -- starting background reconstruction
 does your system shut down properly?
 do you see the message about stopping md at the very end of the
 reboot/halt process?

The root cause is that as udev adds his sata devices one at a time, on
each add of the sata device it invokes mdadm to see if there is an array
to start, and it doesn't use incremental mode on mdadm.  As a result, as
soon as there are 3 out of the 4 disks present, mdadm starts the array
in degraded mode.  It's probably a race between the mdadm started on the
third disk and mdadm started on the fourth disk that results in the
message about being unable to set the array info.  The one losing the
race gets the error as the other one has already manipulated the array
(for example, the 4th disk mdadm could be trying to add the first disk
to the array, but it's already there, so it gets this error and bails).
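
For reference, the incremental mode I'm talking about is mdadm -I; a sane
hotplug rule would run something like this once per device as it shows up
(the device name is only an example), and mdadm can then hold off starting
the array until all the expected members are present:

mdadm --incremental /dev/sdc1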

So, as much as you might dislike mkinitrd since 5.0 Luca, it doesn't
have this particular problem ;-)  In the initrd we produce, it loads all
the SCSI/SATA/etc drivers first, then calls mkblkdevs which forces all
of the devices to appear in /dev, and only then does it start the
mdadm/lvm configuration.  Daniel, I make no promises what so ever that
this will even work at all as it may fail to load modules or all other
sorts of weirdness, but if you want to test the theory, you can download
the latest mkinitrd from fedoraproject.org, then use it to create an
initrd image under some other name than your default image name, then
manually edit your boot to have an extra stanza that uses the mkinitrd
generated initrd image instead of the ubuntu image, and then just see if
it brings the md device up cleanly instead of in degraded mode.  That
should be a fairly quick and easy way to test if Neil's analysis of the
udev script was right.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Sun, 2007-10-28 at 22:59 -0700, Daniel L. Miller wrote:
 Doug Ledford wrote:
  Anyway, I happen to *like* the idea of using full disk devices, but the
  reality is that the md subsystem doesn't have exclusive ownership of the
  disks at all times, and without that it really needs to stake a claim on
  the space instead of leaving things to chance IMO.

 I've been re-reading this post numerous times - trying to ignore the 
 burgeoning flame war :) - and this last sentence finally clicked with me.
 
 As I'm a novice Linux user - and not involved in development at all - 
 bear with me if I'm stating something obvious.  And if I'm wrong - 
 please be gentle!
 
 1.  md devices are not native to the kernel - they are 
 created/assembled/activated/whatever by a userspace program.

My real point was that md doesn't own the disks, meaning that during
startup, and at other points in time, software other than the md stack
can attempt to use the disk directly.  That software may be the linux
file system code, linux lvm code, or in some cases entirely different OS
software.  Given that these situations can arise, using a partition
table to mark the space as in use by linux is what I meant by staking a
claim.  It doesn't keep the linux kernel from using it because it thinks
it owns it, but it does stop other software from attempting to use it.

 2.  Because md devices are non-native devices, and are composed of 
 native devices, the kernel may try to use those components directly 
 without going through md.

In the case of superblocks at the end, yes.  The kernel may see the
underlying file system or lvm disk label even if the md device is not
started.

 3.  Creating a partition table somehow (I'm still not clear how/why) 
 reduces the chance the kernel will access the drive directly without md.

The partition table is more to tell other software that linux owns the
space and to avoid mistakes where someone runs fdisk on a disk
accidentally and wipes out your array because they added a partition
table on what they thought was a new disk (more likely when you have
large arrays of disks attached via fiber channel or such than in a
single system).  Putting the superblock at the beginning of the md
device is the main thing that guarantees the kernel will never try to
use what's inside the md device without the md device running.
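
Putting the superblock at the beginning just means choosing a 1.1 (or 1.2)
format at create time, e.g. something like this (purely illustrative, pick
your own devices and level):

mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1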

 These concepts suddenly have me terrified over my data integrity.  Is 
 the md system so delicate that BOOT sequence can corrupt it?

If you have your superblocks at the end of the devices, then there are
certain failure modes that can cause data inconsistencies.  Generally
speaking they won't harm the array itself, it's just that the different
disks in a raid1 array might contain different data.  If you don't use
partitions, then the majority of failure scenarios involve things like
accidental use of fdisk on the unpartitioned device, access of the
device by other OSes, that sort of thing.

   How is it 
 more reliable AFTER the completed boot sequence?

Once the array is up and running, the constituent disks are marked as
busy in the operating system, which prevents other portions of the linux
kernel and other software in general from getting at the md owned disks.

 Nothing in the documentation (that I read - granted I don't always read 
 everything) stated that partitioning prior to md creation was necessary 
 - in fact references were provided on how to use complete disks.  Is 
 there an official position on "To Partition, or Not To Partition"?
 Particularly for my application - dedicated Linux server, RAID-10 
 configuration, identical drives.
 
 And if partitioning is the answer - what do I need to do with my live 
 dataset?  Drop one drive, partition, then add the partition as a new 
 drive to the set - and repeat for each drive after the rebuild finishes?

You *probably*, and I emphasize probably, don't need to do anything.  I
emphasize it because I don't know enough about your situation to say so
with 100% certainty.  If I'm wrong, it's not my fault.

Now, that said, here's the gist of the situation.  There are specific
failure cases that can corrupt data in an md raid1 array mainly related
to superblocks at the end of devices.  There are specific failure cases
where an unpartitioned device can be accidentally partitioned or where a
partitioned md array in combination with superblocks at the end and
using a whole disk device can be misrecognized as a partitioned normal
drive.  There are, on the other hand, cases where it's perfectly safe to
use unpartitioned devices, or superblocks at the end of devices.  My
recommendation when someone asks what to do is to use partitions, and to
use superblocks at the beginning of the devices (except for /boot since
that isn't supported at the moment).  The reason I give that advice is
that I assume if a person knows enough to know when it's safe to use
unpartitioned devices, like Luca, then they wouldn't be asking me for
advice.  So since

Re: Implementing low level timeouts within MD

2007-10-29 Thread Doug Ledford
On Sun, 2007-10-28 at 01:27 -0500, Alberto Alonso wrote:
 On Sat, 2007-10-27 at 19:55 -0400, Doug Ledford wrote:
  On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote:
   Regardless of the fact that it is not MD's fault, it does make
   software raid an invalid choice when combined with those drivers. A
   single disk failure within a RAID5 array bringing a file server down
   is not a valid option under most situations.
  
  Without knowing the exact controller you have and driver you use, I
  certainly can't tell the situation.  However, I will note that there are
  times when no matter how well the driver is written, the wrong type of
  drive failure *will* take down the entire machine.  For example, on an
  SPI SCSI bus, a single drive failure that involves a blown terminator
  will cause the electrical signaling on the bus to go dead no matter what
  the driver does to try and work around it.
 
 Sorry I thought I copied the list with the info that I sent to Richard.
 Here is the main hardware combinations.
 
 --- Excerpt Start 
 Certainly. The times when I had good results (i.e. failed drives
 with properly degraded arrays) have been with old PATA based IDE
 controllers built into the motherboard and the Highpoint PATA
 cards). The failures (ie. single disk failure bringing the whole
 server down) have been with the following:
 
 * External disks on USB enclosures, both RAID1 and RAID5 (two different
   systems) Don't know the actual controller for these. I assume it is
   related to usb-storage, but can probably research the actual chipset,
   if it is needed.

OK, these you don't get to count.  If you run raid over USB...well...you
get what you get.  IDE never really was a proper server interface, and
SATA is much better, but USB was never anything other than a means to
connect simple devices without having to put a card in your PC, it was
never intended to be a raid transport.

 * Internal serverworks PATA controller on a netengine server. The
   server is off waiting to get picked up, so I can't get the important
   details.

1 PATA failure.

 * Supermicro MB with ICH5/ICH5R controller and 2 RAID5 arrays of 3 
   disks each. (only one drive on one array went bad)
 
 * VIA VT6420 built into the MB with RAID1 across 2 SATA drives.
 
 * And the most complex is this week's server with 4 PCI/PCI-X cards.
   But the one that hanged the server was a 4 disk RAID5 array on a
   RocketRAID1540 card.

And 3 SATA failures, right?  I'm assuming the Supermicro is SATA or else
it has more PATA ports than I've ever seen.

Was the RocketRAID card in hardware or software raid mode?  It sounds
like it could be a combination of both, something like hardware on the
card, and software across the different cards or something like that.

What kernels were these under?

 --- Excerpt End 
 
  
   I wasn't even asking as to whether or not it should, I was asking if
   it could.
  
  It could, but without careful control of timeouts for differing types of
  devices, you could end up making the software raid less reliable instead
  of more reliable overall.
 
 Even if the default timeout was really long (i.e. 1 minute) and then
 configurable on a per-device (or per-class) basis via /proc, it would really help.

It's a band-aid.  It's working around other bugs in the kernel instead
of fixing the real problem.

  Generally speaking, most modern drivers will work well.  It's easier to
  maintain a list of known bad drivers than known good drivers.
 
 That's what has been so frustrating. The old PATA IDE hardware always
 worked and the new stuff is what has crashed.

In all fairness, the SATA core is still relatively young.  IDE was
around for eons, whereas Jeff started the SATA code just a few years
back.  In that time I know he's had to deal with both software bugs and
hardware bugs that would lock a SATA port up solid with no return.  What
it sounds like to me is you found some of those.

  Be careful which hardware raid you choose, as in the past several brands
  have been known to have the exact same problem you are having with
  software raid, so you may not end up buying yourself anything.  (I'm not
  naming names because it's been long enough since I paid attention to
  hardware raid driver issues that the issues I knew of could have been
  solved by now and I don't want to improperly accuse a currently well
  working driver of being broken)
 
 I have settled for 3ware. All my tests showed that it performed quite
 well and kicked drives out when needed. Of course, I haven't had a
 bad drive on a 3ware production server yet, so I may end up
 pulling the little bit of hair I have left.
 
 I am now rushing the RocketRAID 2220 into production without testing
 due to it being the only thing I could get my hands on. I'll report
 any experiences as they happen.
 
 Thanks for all the info,
 
 Alberto
 
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Re: Time to deprecate old RAID formats?

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 22:44 +0100, Luca Berra wrote:
 On Mon, Oct 29, 2007 at 11:30:53AM -0400, Doug Ledford wrote:
 On Mon, 2007-10-29 at 09:41 +0100, Luca Berra wrote:
 
  Remaking the initrd installs the new mdadm.conf file, which would have
  then contained the whole disk devices and its UUID.  Therein would
  have been the problem.
  yes, i read the patch, i don't like that code, as i don't like most of
  what has been put in mkinitrd from 5.0 onward.
 in case you wonder i am referring to things like
 
 emit dm create $1 $UUID $(/sbin/dmsetup table $1)

I make no judgments on the dm setup stuff, I know too little about the
dm stack to be qualified.

  Imho the correct thing here would not have been copying the existing
  mdadm.conf but generating a safe one from output of mdadm -D (note -D,
  not -E)
 
 I'm not sure I'd want that.  Besides, what makes you say -D is safer
 than -E?
 
 mdadm -D  /dev/mdX works on an active md device, so i strongly doubt the 
 information
 gathered from there would be stale
 while mdadm -Es will scan disk devices for md superblock, thus
 possibly even finding stale superblocks or leftovers.
 I would strongly recommend against blindly doing mdadm -Es >
 /etc/mdadm.conf and not supervising the result.

Well, I agree that blindly doing mdadm -Esb > mdadm.conf would be bad,
but that's not what mkinitrd is doing, it's using the mdadm.conf that's
in place so you can update the mdadm.conf whenever you find it
appropriate.

And I agree -D has less chance of finding a stale superblock, but it's
also true that it has no chance of finding non-stale superblocks on
devices that aren't even started.  So, as a method of getting all the
right information in the event of system failure and rescuecd boot, it
leaves something to be desired ;-)  In other words, I'd rather use a
mode that finds everything and lets me remove the stale than a mode that
might miss something.  But, that's a matter of personal choice.
Considering that we only ever update mdadm.conf automatically during
installs, after that the user makes manual mdadm.conf changes
themselves, they are free to use whichever they prefer.
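
For reference, the two approaches look roughly like this (the target file
is just the usual location, adjust to taste):

  # scan only the currently running arrays -- unlikely to pick up stale metadata
  mdadm --detail --scan >> /etc/mdadm.conf

  # scan component devices for superblocks -- finds arrays that aren't started,
  # but can also find stale or leftover superblocks, so review before keeping
  mdadm --examine --scan >> /etc/mdadm.conf

Either way the result wants a manual sanity check before it goes into an
initrd.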

The one thing I *do* like about mdadm -E above -D is it includes the
superblock format in its output.  The one thing I don't like, is it
almost universally gets the name wrong.  What I really want is a brief
query format that both gives me the right name (-D) and the superblock
format (-E).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Doug Ledford
On Mon, 2007-10-29 at 22:29 +0100, Luca Berra wrote:
 At which point he found that
 the udev scripts in ubuntu are being stupid, and from the looks of it
 are the cause of the problem.  So, I've considered the initial issue
 root caused for a bit now.
 It seems i made an idiot of myself by missing half of the thread, and i
 even knew ubuntu was braindead in their use of udev at startup, since a
 similar discussion came up on the lvm or the dm-devel mailing list (that
 time iirc it was about lvm over multipath)

Nah.  Even if we had concluded that udev was to blame here, I'm not
entirely certain that we hadn't left Daniel with the impression that we
suspected it versus blamed it, so reiterating it doesn't hurt.  And I'm
sure no one has given him a fix for the problem (although Neil did
request a change that will give debug output, but not solve the
problem), so not dropping it entirely would seem appropriate as well.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-28 Thread Doug Ledford
On Sun, 2007-10-28 at 15:13 +0100, Luca Berra wrote:
 On Sat, Oct 27, 2007 at 08:26:00PM -0400, Doug Ledford wrote:
 On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
  On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
  
   In fact, no you can't.  I know, because I've created a device that had
   both but wasn't a raid device.  And its matching partner still existed
   too.  What you are talking about would have misrecognized this
   situation, guaranteed.
  
  Maybe we need a 2.0 superblock that contains the physical size of every
  component, not just the logical size that is used for RAID. That way if
  the size read from the superblock does not match the size of the device,
  you know that this device should be ignored.
 
 In my case that wouldn't have helped.  What actually happened was I
 create a two disk raid1 device using whole devices and a version 1.0
 superblock.  I know a version 1.1 wouldn't work because it would be
 where the boot sector needed to be, and wasn't sure if a 1.2 would work
 either.  Then I tried to make the whole disk raid device a partitioned
 device.  This obviously put a partition table right where the BIOS and
 the kernel would look for it whether the raid was up or not.  I also
 the only reason i can think for the above setup not working is udev
 mucking with your device too early.

It was a combination of boot loader issues and an inability to get this
device partitioned up the way I needed.  I went with a totally different
setup in the end because I essentially started out with a two drive
raid1 for the OS and another 2 drive raid1 for data, but I wanted to
span them and I was attempting to do so with a mixture of md raid and
lvm physical volume striping.  Didn't work.

 tried doing an lvm setup to split the raid up into chunks and that
 didn't work either.  So, then I redid the partition table and created
 individual raid devices from the partitions.  But, I didn't think to
 zero the old whole disk superblock.  When I made the individual raid
 devices, I used all 1.1 superblocks.  So, when it was all said and done,
 I had a bunch of partitions that looked like a valid set of partitions
 for the whole disk raid device and a whole disk raid superblock, but I
 also had superblocks in each partition with their own bitmaps and so on.
 OK
 
 It was only because I wasn't using mdadm in the initrd and specifying
 uuids that it found the right devices to start and ignored the whole
 disk devices.  But, when I later made some more devices and went to
 update the mdadm.conf file using mdadm -Eb, it found the devices and
 added it to the mdadm.conf.  If I hadn't checked it before remaking my
 initrd, it would have hosed the system.  And it would have passed all
 the above is not clear to me, afair redhat initrd still uses
 raidautorun,

RHEL does, but this is on a personal machine I installed Fedora on, and
latest Fedora has a mkinitrd that installs mdadm and mdadm.conf and
starts the needed devices using the UUID.  My first sentence above
should have read that I *was* using mdadm.

  which iirc does not work with recent superblocks,
 so you used uuids on kernel command line?
 or you use something else for initrd?
 why would remaking the initrd break it?

Remaking the initrd installs the new mdadm.conf file, which would have
then contained the whole disk devices and its UUID.  Therein would
have been the problem.

 the tests you can throw at it.  Quite simply, there is no way to tell
 the difference between those two situations with 100% certainty.  Mdadm
 tries to be smart and start the newest devices, but Luca's original
 suggestion of skip the partition scanning in the kernel and figure it
 out from user space would not have shown mdadm the new devices and would
 have gotten it wrong every time.
 yes, in this particular case it would have, congratulations, you found a new
 creative way of shooting yourself in the feet.

Creative, not so much.  I just backed out of what I started and tried
something else.  Lots of people do that.

 maybe mdadm should do checks when creating a device to prevent this kind
 of mistakes.
 i.e.
 if creating an array on a partition, check the whole device for a
 superblock and refuse in case it finds one
 
 if creating an array on a whole device that has a partition table,
 either require --force, or check for superblocks in every possible
 partition.

What happens if you add the partition table *after* you make the whole
disk device and there are stale superblocks in the partitions?  This
still isn't infallible.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Raid-10 mount at startup always has problem

2007-10-28 Thread Doug Ledford
On Sun, 2007-10-28 at 14:37 +0100, Luca Berra wrote:
 On Sat, Oct 27, 2007 at 04:47:30PM -0400, Doug Ledford wrote:

 Most of the time it does.  But those times where it can fail, the
 failure is due to not taking the precautions necessary to prevent it:
 aka labeling disk usage via some sort of partition table/disklabel/etc.
 I strongly disagree.
 the failure is badly designed software.

Then you need to blame Ingo who made putting the superblock at the end
of the device the standard.  If the superblock were always at the
beginning, then this whole argument would be moot.  Things would be
reliable the way you want.

 Using whole disk devices isn't a means of organizing space.  It's a way
 to get a rather miniscule amount of space back by *not* organizing the
 space.
 if i am using, say lvm to organize disk space, a partition table is
 unnecessary to the organization, and it is natural not using them.

If you are using straight lvm then you don't have this problem anyway.
Lvm doesn't allow the underlying physical device to *look* like a valid,
partitioned, single device.  Md does when the superblock is at the end.

 This whole argument seems to boil down to you wanting to perfectly
 optimize your system for your use case which includes controlling the
 environment enough that you know it's safe to not partition your disks,
 where as I argue that although this works in controlled environments, it
 is known to have failure modes in other environments, and I would be
 totally remiss if I recommended to my customers that they should take
 the risk that you can ignore because of your controlled environment
 since I know a lot of my customers *don't* have a controlled environment
 such as you do.
 
 The whole argument to me boils down to the fact that not having a partition
 table on a device is possible, and software that do not consider this
 eventuality is flawed,

It's simply not possible to differentiate with 100% certainty between an md
whole disk partitioned device with a superblock at the end and a regular
device.  Period.  You can try to be clever, but you can also get tripped
up.  The flaw is not with the software, it's with a design that allowed
this to happen.

  and recommending to work around flawed software is
 just digging your head in the sand.

If a design is broken but in place, I have no choice but to work around
it.  Anything else is just stupid.

 But i believe i did not convince you one ounce more than you convinced
 me, so i'll quit this thread which is getting too far.
 
 Regards,
 L.
 
 -- 
 Luca Berra -- [EMAIL PROTECTED]
 Communication Media  Services S.r.l.
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 10:00 +0200, Luca Berra wrote:
 On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
 On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
  On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
  just apply some rules, so if you find a partition table _AND_ an md
  superblock at the end, read both and you can tell if it is an md on a
  partition or a partitioned md raid1 device.
 
 In fact, no you can't.  I know, because I've created a device that had
  both but wasn't a raid device.  And its matching partner still existed
 too.  What you are talking about would have misrecognized this
 situation, guaranteed.
 then just ignore the device and log a warning, instead of doing a random
 choice.
 L.

It also happened to be my OS drive pair.  Ignoring it would have
rendered the machine unusable.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Raid-10 mount at startup always has problem

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 09:50 +0200, Luca Berra wrote:
 On Fri, Oct 26, 2007 at 03:26:33PM -0400, Doug Ledford wrote:
 On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote:
  On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote:
  The partition table is the single, (mostly) universally recognized
  arbiter of what possible data might be on the disk.  Having a partition
  table may not make mdadm recognize the md superblock any better, but it
  keeps all that other stuff from even trying to access data that it
  doesn't have a need to access and prevents random luck from turning your
  day bad.
  on a pc maybe, but that is 20 years old design.
 
 So?  Unix is 35+ year old design, I suppose you want to switch to Vista
 then?
 unix is a 35+ year old design that evolved in time, some ideas were
 kept, some ditched.

BSD disk labels are still in use, SunOS disk labels are still in use,
partition tables are somewhat on the way out, but only because they are
being replaced by the new EFI disk partitioning method.  The only place
where partitionless devices are common is in dedicated raid boxes where
the raid controller is the only thing that will *ever* see that disk.
Sometimes they do it on big SAN/NAS stuff because they don't want to
align the partition table to the underlying device's stripe layout, but
even then they do so in a tightly controlled environment where they know
exactly which machines will be allowed to even try and access the
device.

  partition table design is limited because it is still based on C/H/S,
  which do not exist anymore.
  Put a partition table on a big storage, say a DMX, and enjoy a 20%
  performance decrease.
 
 Because you didn't stripe align the partition, your bad.
 :)
 by default fdisk misaligns partition tables
 and aligning them is more complex than just doing without.

So.  You really need to take the time to understand the alignment of
the device, because then and only then can you pass options to mke2fs to
align the fs metadata with the stripes as well, thereby buying you even
more performance than just leaving off the partition table (assuming
that's what you use; I don't know if other mkfs programs have the same
options for aligning metadata with stripes).  And if you take the time
to understand the underlying stripe layout for the mkfs stuff, then you
can use the same information to align the partition table.
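
As a sketch, with made-up numbers (plug in your own chunk size, block size
and disk count):

  # 3 disk raid5, 128k chunk, 4k fs blocks:
  #   stride = chunk / block = 128k / 4k = 32 fs blocks per chunk
  mke2fs -b 4096 -E stride=32 /dev/md0

Newer e2fsprogs also accept a stripe width hint in -E (the exact option
name varies by version), which for a 3 disk raid5 would be
stride * 2 = 64 blocks, covering one full data stripe.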

  Oh, and let's not go into what can happen if you're talking about a dual
  boot machine and what Windows might do to the disk if it doesn't think
  the disk space is already spoken for by a linux partition.
  Why the hell should the existence of windows limit the possibility of
  linux working properly.
 
 Linux works properly with a partition table, so this is a specious
 statement.
 It should also work properly without one.

Most of the time it does.  But those times where it can fail, the
failure is due to not taking the precautions necessary to prevent it:
aka labeling disk usage via some sort of partition table/disklabel/etc.

  If i have a pc that dualboots windows i will take care of using the
  common denominator of a partition table, if it is my big server i will
  probably not. since it won't boot anything else than Linux.
 
 Doesn't really gain you anything, but your choice.  Besides, the
 question wasn't why shouldn't Luca Berra use whole disk devices, it
 was why I don't recommend using whole disk devices, and my
 recommendation wasn't based in the least bit upon a single person's use
 scenario.
 If i am the only person in the world that believes partition tables
 should not be required then i'll shut up.
 
  On the opposite, i once inserted an mmc memory card, which had been
  initialized on my mobile phone, into the mmc slot of my laptop, and was
  faced with a load of error about mmcblk0 having an invalid partition
  table.
 
 So?  The messages are just informative, feel free to ignore them.
 but did not anaconda propose to wipe unpartitioned disks?

Did you stick your mmc card in there during the install of the OS?
That's the only time anaconda ever runs, and therefore the only time it
ever checks your devices.  It makes sense that during the initial
install, when the OS is only configured to see locally connected
devices, or possibly iSCSI devices that you have specifically told it to
probe, that it would then ask you the question about those devices.
Other network attached or shared devices are generally added after the
initial install.

 The phone dictates the format, only a moron would say otherwise.  But,
 then again, the phone doesn't care about interoperability and many other
 issues on memory cards that it thinks it owns, so only a moron would
 argue that because a phone doesn't use a partition table that nothing
 else in the computer realm needs to either.
 i don't count myself as a moron, what i am trying to say is that
 partition tables are one way of organizing disk space, not the only one.

Using whole disk devices isn't a means of organizing space.  It's a way
to get a rather miniscule amount of space back by *not* organizing the
space.

Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Fri, 2007-10-26 at 14:41 -0400, Doug Ledford wrote:
 Actually, after doing some research, here's what I've found:

 * When using grub2, there is supposedly already support for raid/lvm
 devices.  However, I do not know if this includes version 1.0, 1.1, or
 1.2 superblocks.  I intend to find that out today.

It does not include support for any version 1 superblocks.  It's noted
in the code that it should, but doesn't yet.  However, the interesting
bit is that they rearchitected grub so that any reads from a device
during boot are filtered through the stack that provides the device.
So, when you tell grub2 to set root=md0, then all reads from md0 are
filtered through the raid module, and the raid module then calls the
reads from the IO module, which then does the actual int 13 call.  This
allows the raid module to read superblocks, detect the raid level and
layout, and actually attempt to work on raid0/1/5 devices (at the
moment).  It also means that all the calls from the ext2 module when it
attempts to read from the md device are filtered through the md module
and therefore it would be simple for it to implement an offset into the
real device to get past the version 1.1/1.2 superblocks.

In terms of resilience, the raid module actually tries to utilize the
raid itself during any failure.  On raid1 devices, if it gets a read
failure on any block it attempts to read, then it goes to the next
device in the raid1 array and attempts to read from it.  So, in the
event that your normal boot disk suffers a sector failure in your actual
kernel image, but the raid disk is otherwise fine, grub2 should be able
to boot from the kernel image on the next raid device.  Similarly, on
raid5 it will attempt to recover from a block read failure by using the
parity to generate the missing data unless the array is already in
degraded mode at which point it will bail on any read failure.

The lvm module attempts to properly map extents to physical volumes and
allows you to have your bootable files in lvm logical volume.  In that
case you set root=logical-volume-name-as-it-appears-in-/dev/mapper and
the lvm module then figures out what physical volumes contain that
logical volume and where the extents are mapped and goes from there.

I should note that both the lvm code and raid code are simplistic at the
moment.  For example, the raid5 mapping only supports the default raid5
layout.  If you use any other layout, game over.  Getting it to work
with version 1.1 or 1.2 superblocks probably wouldn't be that hard, but
getting it to the point where it handles all the relevant setups
properly would require a reasonable amount of coding.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Implementing low level timeouts within MD

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 16:46 -0500, Alberto Alonso wrote:
 On Fri, 2007-10-26 at 15:00 -0400, Doug Ledford wrote:
  
  This isn't an md problem, this is a low level disk driver problem.  Yell
  at the author of the disk driver in question.  If that driver doesn't
  time things out and return errors up the stack in a reasonable time,
  then it's broken.  Md should not, and realistically can not, take the
  place of a properly written low level driver.
  
 
 I am not arguing whether or not MD is at fault, I know it isn't. 
 
 Regardless of the fact that it is not MD's fault, it does make
 software raid an invalid choice when combined with those drivers. A
 single disk failure within a RAID5 array bringing a file server down
 is not a valid option under most situations.

Without knowing the exact controller you have and driver you use, I
certainly can't tell the situation.  However, I will note that there are
times when no matter how well the driver is written, the wrong type of
drive failure *will* take down the entire machine.  For example, on an
SPI SCSI bus, a single drive failure that involves a blown terminator
will cause the electrical signaling on the bus to go dead no matter what
the driver does to try and work around it.

 I wasn't even asking as to whether or not it should, I was asking if
 it could.

It could, but without careful control of timeouts for differing types of
devices, you could end up making the software raid less reliable instead
of more reliable overall.
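
(For what it's worth, the scsi/libata stack already exposes a per-device
command timer, which is the layer where this sort of tuning belongs --
something along these lines, device name being just an example:

  # current command timeout for sda, in seconds
  cat /sys/block/sda/device/timeout

  # shorten it if the default is too patient for your setup
  echo 10 > /sys/block/sda/device/timeout

Of course that only helps when the low level driver actually honours the
timer, which is the whole point of the argument above.)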

  Should is a relative term, could is not. If the MD code
 can not cope with poorly written drivers then a list of valid drivers
 and cards would be nice to have (that's why I posted my ... when it
 works and when it doesn't, I was trying to come up with such a list).

Generally speaking, most modern drivers will work well.  It's easier to
maintain a list of known bad drivers than known good drivers.

 I only got 1 answer with brand specific information to figure out when
 it works and when it doesn't work. My recent experience is that too
 many drivers seem to have the problem so software raid is no longer
 an option for any new systems that I build, and as time and money
 permits I'll be switching to hardware/firmware raid all my legacy
 servers.

Be careful which hardware raid you choose, as in the past several brands
have been known to have the exact same problem you are having with
software raid, so you may not end up buying yourself anything.  (I'm not
naming names because it's been long enough since I paid attention to
hardware raid driver issues that the issues I knew of could have been
solved by now and I don't want to improperly accuse a currently well
working driver of being broken)

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
 device, they automatically only use the second method of boot loader
  installation.  This gives them the freedom to be able to modify the
  second stage boot loader on a boot disk by boot disk basis.  The
  downside to this is that they need lots of room after the MBR and before
  the first partition in order to put their core.img file in place.  I
  *think*, and I'll know for sure later today, that the core.img file is
  generated during grub install from the list of optional modules you
  specify during setup.  Eg., the pc module gives partition table support,
  the lvm module lvm support, etc.  You list the modules you need, and
  grub then builds a core.img out of all those modules.  The normal amount
  of space between the MBR and the first partition is (sectors_per_track -
  1).  For standard disk geometries (63 sectors per track), that leaves 62
  sectors, or about 31k of space.  This might not be enough for your
  particular needs if
  you have a complex boot environment.  In that case, you would need to
  bump at least the starting track of your first partition to make room
  for your boot loader.  Unfortunately, how is a person to know how much
  room their setup needs until after they've installed and it's too late
  to bump the partition table start?  They can't.  So, that's another
  thing I think I will check out today, what the maximum size of grub2
  might be with all modules included, and what a common size might be.
 

 Based on your description, it sounds as if grub2 may not have given 
 adequate thought to what users other than the authors might need (that 
 may be a premature conclusion). I have multiple installs on several of 
 my machines, and I assume that the grub2 for 32 and 64 bit will be 
 different. Thanks for the research.

No, not really.  The grub command on the two is different, but they
actually build the boot sector out of 16 bit non-protected mode code,
just like DOS.  So either one would build the same boot sector given the
same config.  And you can always use the same trick I've used in the
past of creating a large /boot partition (say 250MB) and using that same
partition as /boot in all of your installs.  Then they share a single
grub config (while the grub binaries are in the individual / partitions)
and from the single grub instance you can boot to any of the installs,
as well as a kernel update in any install updates that global grub
config.  The other option is to use separate /boot partitions and chain
load the grub instances, but I find that clunky in comparison.  Of
course, in my case I also made /lib/modules its own partition and also
shared it between all the installs so that I could manually edit the
various kernel boot params to specify different root partitions and in
so doing I could boot a RHEL5 kernel using a RHEL4 install and vice
versa.  But if you do that, you have to manually
patch /etc/rc.d/rc.sysinit to mount the /lib/modules partition before
ever trying to do anything with modules (and you have to mount it rw so
they can do a depmod if needed), then remount it ro for the fsck, then
it gets remounted rw again after the fs check.  It was a pain in the ass
to maintain because every update to initscripts would wipe out the patch
and if you forgot to repatch the file, the system wouldn't boot and
you'd have to boot into another install, mount the / partition of the
broken install, patch the file, then it would work again in that
install.
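
(The shared /boot trick itself is nothing exotic -- it's just the same
partition listed in every install's /etc/fstab, for example, with a
hypothetical device name:

  /dev/md0    /boot    ext3    defaults    1 2

and one install's grub config serving as the master menu.)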


-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-27 Thread Doug Ledford
On Sat, 2007-10-27 at 00:30 +0200, Gabor Gombas wrote:
 On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:
 
  In fact, no you can't.  I know, because I've created a device that had
  both but wasn't a raid device.  And its matching partner still existed
  too.  What you are talking about would have misrecognized this
  situation, guaranteed.
 
 Maybe we need a 2.0 superblock that contains the physical size of every
 component, not just the logical size that is used for RAID. That way if
 the size read from the superblock does not match the size of the device,
 you know that this device should be ignored.

In my case that wouldn't have helped.  What actually happened was I
create a two disk raid1 device using whole devices and a version 1.0
superblock.  I know a version 1.1 wouldn't work because it would be
where the boot sector needed to be, and wasn't sure if a 1.2 would work
either.  Then I tried to make the whole disk raid device a partitioned
device.  This obviously put a partition table right where the BIOS and
the kernel would look for it whether the raid was up or not.  I also
tried doing an lvm setup to split the raid up into chunks and that
didn't work either.  So, then I redid the partition table and created
individual raid devices from the partitions.  But, I didn't think to
zero the old whole disk superblock.  When I made the individual raid
devices, I used all 1.1 superblocks.  So, when it was all said and done,
I had a bunch of partitions that looked like a valid set of partitions
for the whole disk raid device and a whole disk raid superblock, but I
also had superblocks in each partition with their own bitmaps and so on.
It was only because I wasn't using mdadm in the initrd and specifying
uuids that it found the right devices to start and ignored the whole
disk devices.  But, when I later made some more devices and went to
update the mdadm.conf file using mdadm -Eb, it found the devices and
added it to the mdadm.conf.  If I hadn't checked it before remaking my
initrd, it would have hosed the system.  And it would have passed all
the tests you can throw at it.  Quite simply, there is no way to tell
the difference between those two situations with 100% certainty.  Mdadm
tries to be smart and start the newest devices, but Luca's original
suggestion of skip the partition scanning in the kernel and figure it
out from user space would not have shown mdadm the new devices and would
have gotten it wrong every time.
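
The moral, if there is one, is to scrub the old metadata before reusing the
disks; something like this (device names hypothetical) would have avoided
the whole mess:

  # stop the old whole-disk array, then wipe its superblocks
  mdadm --stop /dev/md0
  mdadm --zero-superblock /dev/sda /dev/sdb

  # only then repartition and build the new arrays on the partitions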

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Doug Ledford
On Fri, 2007-10-26 at 10:18 -0400, Bill Davidsen wrote:
 Neil Brown wrote:
  On Thursday October 25, [EMAIL PROTECTED] wrote:

  I didn't get a reply to my suggestion of separating the data and 
  location...
  
 
  No. Sorry.
 

  ie not talking about superblock versions 0.9, 1.0, 1.1, 1.2 etc but a data
  format (0.9 vs 1.0) and a location (end,start,offset4k)?
 
  This would certainly make things a lot clearer to new (and old!) users:
 
  mdadm --create /dev/md0 --metadata 1.0 --meta-location offset4k
  or
  mdadm --create /dev/md0 --metadata 1.0 --meta-location start
  or
  mdadm --create /dev/md0 --metadata 1.0 --meta-location end
  
 
  I'm happy to support synonyms.  How about
 
 --metadata 1-end
 --metadata 1-start
 
  ??

 Offset? Do you like 1-offset4k or maybe 1-start4k or even 
 1-start+4k for that? The last is most intuitive but I don't know how 
 you feel about the + in there.

Actually, after doing some research, here's what I've found:

* When using lilo to boot from a raid device, it automatically installs
itself to the mbr, not to the partition.  This can not be changed.  Only
0.90 and 1.0 superblock types are supported because lilo doesn't
understand the offset to the beginning of the fs otherwise.

* When using grub to boot from a raid device, only 0.90 and 1.0
superblocks are supported[1] (because grub is ignorant of the raid and
it requires the fs to start at the start of the partition).  You can use
either MBR or partition based installs of grub.  However, partition
based installs require that all bootable partitions be in exactly the
same logical block address across all devices.  This limitation can be
an extremely hazardous limitation in the event a drive dies and you have
to replace it with a new drive as newer drives may not share the older
drive's geometry and will require starting your boot partition in an odd
location to make the logical block addresses match.

* When using grub2, there is supposedly already support for raid/lvm
devices.  However, I do not know if this includes version 1.0, 1.1, or
1.2 superblocks.  I intend to find that out today.  If you tell grub2 to
install to an md device, it searches out all constituent devices and
installs to the MBR on each device[2].  This can't be changed (at least
right now, probably not ever though).

So, given the above situations, really, superblock format 1.2 is likely
to never be needed.  None of the shipping boot loaders work with 1.2
regardless, and the boot loader under development won't install to the
partition in the event of an md device and therefore doesn't need that
4k buffer that 1.2 provides.
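
In other words, if you want an array that today's boot loaders can actually
boot from, create it with a 0.90 or 1.0 superblock, e.g. (devices are just
an example):

  mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1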

[1] Grub won't work with either 1.1 or 1.2 superblocks at the moment.  A
person could probably hack it to work, but since grub development has
stopped in preference to the still under development grub2, they won't
take the patches upstream unless they are bug fixes, not new features.

[2] There are two ways to install to a master boot record.  The first is
to use the first 512 bytes *only* and hardcode the location of the
remainder of the boot loader into those 512 bytes.  The second way is to
use the free space between the MBR and the start of the first partition
to embed the remainder of the boot loader.  When you point grub2 at an
md device, they automatically only use the second method of boot loader
installation.  This gives them the freedom to be able to modify the
second stage boot loader on a boot disk by boot disk basis.  The
downside to this is that they need lots of room after the MBR and before
the first partition in order to put their core.img file in place.  I
*think*, and I'll know for sure later today, that the core.img file is
generated during grub install from the list of optional modules you
specify during setup.  Eg., the pc module gives partition table support,
the lvm module lvm support, etc.  You list the modules you need, and
grub then builds a core.img out of all those modules.  The normal amount
of space between the MBR and the first partition is (sectors_per_track -
1).  For standard disk geometries (63 sectors per track), that leaves 62
sectors, or about 31k of space.  This might not be enough for your
particular needs if
you have a complex boot environment.  In that case, you would need to
bump at least the starting track of your first partition to make room
for your boot loader.  Unfortunately, how is a person to know how much
room their setup needs until after they've installed and it's too late
to bump the partition table start?  They can't.  So, that's another
thing I think I will check out today, what the maximum size of grub2
might be with all modules included, and what a common size might be.
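
(You can at least see how much embedding room you currently have by printing
the partition table in sectors and looking at where partition 1 starts,
e.g.:

  fdisk -lu /dev/sda

If the first partition starts at sector 63, the embedding area is sectors
1-62, i.e. roughly 31k.)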

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-26 Thread Doug Ledford
On Fri, 2007-10-26 at 11:54 +0200, Luca Berra wrote:
 On Sat, Oct 20, 2007 at 09:11:57AM -0400, Doug Ledford wrote:
 just apply some rules, so if you find a partition table _AND_ an md
 superblock at the end, read both and you can tell if it is an md on a
 partition or a partitioned md raid1 device.

In fact, no you can't.  I know, because I've created a device that had
both but wasn't a raid device.  And its matching partner still existed
too.  What you are talking about would have misrecognized this
situation, guaranteed.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Raid-10 mount at startup always has problem

2007-10-26 Thread Doug Ledford
On Fri, 2007-10-26 at 11:15 +0200, Luca Berra wrote:
 On Thu, Oct 25, 2007 at 02:40:06AM -0400, Doug Ledford wrote:
 The partition table is the single, (mostly) universally recognized
 arbiter of what possible data might be on the disk.  Having a partition
 table may not make mdadm recognize the md superblock any better, but it
 keeps all that other stuff from even trying to access data that it
 doesn't have a need to access and prevents random luck from turning your
 day bad.
 on a pc maybe, but that is 20 years old design.

So?  Unix is 35+ year old design, I suppose you want to switch to Vista
then?

 partition table design is limited because it is still based on C/H/S,
 which do not exist anymore.
 Put a partition table on a big storage, say a DMX, and enjoy a 20%
 performance decrease.

Because you didn't stripe align the partition, your bad.

 Oh, and let's not go into what can happen if you're talking about a dual
 boot machine and what Windows might do to the disk if it doesn't think
 the disk space is already spoken for by a linux partition.
 Why the hell should the existence of windows limit the possibility of
 linux working properly.

Linux works properly with a partition table, so this is a specious
statement.

 If i have a pc that dualboots windows i will take care of using the
 common denominator of a partition table, if it is my big server i will
 probably not. since it won't boot anything else than Linux.

Doesn't really gain you anything, but your choice.  Besides, the
question wasn't why shouldn't Luca Berra use whole disk devices, it
was why I don't recommend using whole disk devices, and my
recommendation wasn't based in the least bit upon a single person's use
scenario.

 And, in particular with mdadm, I once created a full disk md raid array
 on a couple disks, then couldn't get things arranged like I wanted, so I
 just partitioned the disks and then created new arrays in the partitions
 (without first manually zeroing the superblock for the whole disk
 array).  Since I used a version 1.0 superblock on the whole disk array,
 and then used version 1.1 superblocks in the partitions, the net result
 was that when I ran mdadm -Eb, mdadm would find both the 1.1 and 1.0
 superblocks in the last partition on the disk.  Confused both myself and
 mdadm for a while.
 yes, this is fun
 On the opposite, i once inserted an mmc memory card, which had been
 initialized on my mobile phone, into the mmc slot of my laptop, and was
 faced with a load of error about mmcblk0 having an invalid partition
 table.

So?  The messages are just informative, feel free to ignore them.

  Obviously it had none, it was a plain fat filesystem.
 Is the solution partitioning it? I don't think the phone would
 agree.

The phone dictates the format, only a moron would say otherwise.  But,
then again, the phone doesn't care about interoperability and many other
issues on memory cards that it thinks it owns, so only a moron would
argue that because a phone doesn't use a partition table that nothing
else in the computer realm needs to either.

 Anyway, I happen to *like* the idea of using full disk devices, but the
 reality is that the md subsystem doesn't have exclusive ownership of the
 disks at all times, and without that it really needs to stake a claim on
 the space instead of leaving things to chance IMO.
 Start removing the partition detection code from the blasted kernel and
 move it to userspace, which is already in place, but it is not the
 default.

Which just moves where the work is done, not what work needs to be done.
It's a change for no benefit and a waste of time.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Implementing low level timeouts within MD

2007-10-26 Thread Doug Ledford
On Fri, 2007-10-26 at 12:12 -0500, Alberto Alonso wrote:
 I've been asking on my other posts but haven't seen
 a direct reply to this question:
 
 Can MD implement timeouts so that it detects problems when
 drivers don't come back?
 
 For me this year shall be known as the year the array
 stood still (bad scifi reference :-)
 
 After 4 different array failures all due to a single drive
 failure I think it would really be helpful if the md code
 timed out the driver.

This isn't an md problem, this is a low level disk driver problem.  Yell
at the author of the disk driver in question.  If that driver doesn't
time things out and return errors up the stack in a reasonable time,
then it's broken.  Md should not, and realistically can not, take the
place of a properly written low level driver.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Raid-10 mount at startup always has problem

2007-10-25 Thread Doug Ledford
On Wed, 2007-10-24 at 22:43 -0700, Daniel L. Miller wrote:
 Bill Davidsen wrote:
  Daniel L. Miller wrote:
  Current mdadm.conf:
  DEVICE partitions
  ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4 
  UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a auto=part
 
  still have the problem where on boot one drive is not part of the 
  array.  Is there a log file I can check to find out WHY a drive is 
  not being added?  It's been a while since the reboot, but I did find 
  some entries in dmesg - I'm appending both the md lines and the 
  physical disk related lines.  The bottom shows one disk not being 
   added (this time it was sda) - and the disk that gets skipped on each 
  boot seems to be random - there's no consistent failure:
 
  I suspect the base problem is that you are using whole disks instead 
  of partitions, and the problem with the partition table below is 
  probably an indication that you have something on that drive which 
  looks like a partition table but isn't. That prevents the drive from 
  being recognized as a whole drive. You're lucky, if the data looked 
  enough like a partition table to be valid the o/s probably would have 
  tried to do something with it.
  [...]
  This may be the rare case where you really do need to specify the 
  actual devices to get reliable operation.
 OK - I'm officially confused now (I was just unofficially before).  WHY 
 is it a problem using whole drives as RAID components?  I would have 
 thought that building a RAID storage unit with identically sized drives 
 - and using each drive's full capacity - is exactly the way you're 
 supposed to!

As much as anything else this can be summed up as you are thinking of
how you are using the drives and not how unexpected software on your
system might try and use your drives.  Without a partition table, none
of the software on your system can know what to do with the drives
except mdadm when it finds an md superblock.  That doesn't stop other
software from *trying* to find out how to use your drives though.  That
includes the kernel trying to look for a valid partition table, mount
possibly scanning the drive for a file system label, lvm scanning for an
lvm superblock, mtools looking for a dos filesystem, etc.  Under normal
conditions, the random data on your drive will never look valid to these
other pieces of software.  But, once in a great while, it will look
valid.  And that's when all hell breaks loose.  Or worse, you run a
partition program such as fdisk on the device and it initializes the
partition table (something that the Fedora/RHEL installers do to all
disks without partition tables...well, the installer tells you there's
no partition table and asks if you want to initialize it, but if someone
is in a hurry and hits yes when they meant no, bye bye data).

The partition table is the single, (mostly) universally recognized
arbiter of what possible data might be on the disk.  Having a partition
table may not make mdadm recognize the md superblock any better, but it
keeps all that other stuff from even trying to access data that it
doesn't have a need to access and prevents random luck from turning your
day bad.

Oh, and let's not go into what can happen if you're talking about a dual
boot machine and what Windows might do to the disk if it doesn't think
the disk space is already spoken for by a linux partition.

And, in particular with mdadm, I once created a full disk md raid array
on a couple disks, then couldn't get things arranged like I wanted, so I
just partitioned the disks and then created new arrays in the partitions
(without first manually zeroing the superblock for the whole disk
array).  Since I used a version 1.0 superblock on the whole disk array,
and then used version 1.1 superblocks in the partitions, the net result
was that when I ran mdadm -Eb, mdadm would find both the 1.1 and 1.0
superblocks in the last partition on the disk.  Confused both myself and
mdadm for a while.

Anyway, I happen to *like* the idea of using full disk devices, but the
reality is that the md subsystem doesn't have exclusive ownership of the
disks at all times, and without that it really needs to stake a claim on
the space instead of leaving things to chance IMO.
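
Staking that claim costs you almost nothing.  Roughly, with hypothetical
device names:

  # one full-size partition per disk, type fd (Linux raid autodetect);
  # in fdisk: n, p, 1, accept the defaults, then t -> fd, then w
  fdisk /dev/sdb    # repeat for sdc, sdd, sde

  # then build the array out of the partitions instead of the whole disks
  mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1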

   I should mention that the boot/system drive is IDE, and 
 NOT part of the RAID.  So I'm not worried about losing the system - but 
 I AM concerned about the data.  I'm using four drives in a RAID-10 
 configuration - I thought this would provide a good blend of safety and 
 performance for a small fileserver.
 
 Because it's RAID-10 - I would ASSuME that I can drop one drive (after 
 all, I keep booting one drive short), partition if necessary, and add it 
 back in.  But how would splitting these disks into partitions improve 
 either stability or performance?

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband

Re: Raid-10 mount at startup always has problem

2007-10-25 Thread Doug Ledford
On Thu, 2007-10-25 at 16:12 +1000, Neil Brown wrote:

  md: md0 stopped.
  md: md0 stopped.
  md: bind<sdc>
  md: bind<sdd>
  md: bind<sdb>
  md: md0: raid array is not clean -- starting background reconstruction
  raid10: raid set md0 active with 3 out of 4 devices
  md: couldn't update array info. -22
   ^^^
 
 This is the most surprising line, and hence the one most likely to
 convey helpful information.
 
 This message is generated when a process calls SET_ARRAY_INFO on an
 array that is already running, and the changes implied by the new
 array_info are not supportable.
 
 The only way I can see this happening is if two copies of mdadm are
 running at exactly the same time and are both trying to assemble
 the same array.  The first calls SET_ARRAY_INFO and assembles the
 (partial) array.  The second calls SET_ARRAY_INFO and gets this error.
 Not all devices are included because when one mdadm went to
 look at a device, the other had it locked and so the first just
 ignored it.

If mdadm copy A gets three of the devices, I wouldn't think mdadm copy B
would have been able to get enough devices to decide to even try and
assemble the array (assuming that once copy A locked the devices during
open, that it then held the devices until time to assemble the array).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-25 Thread Doug Ledford
On Thu, 2007-10-25 at 09:55 +1000, Neil Brown wrote:

 As for where the metadata should be placed, it is interesting to
 observe that the SNIA's DDFv1.2 puts it at the end of the device.
 And as DDF is an industry standard sponsored by multiple companies it
 must be ..
 Sorry.  I had intended to say correct, but when it came to it, my
 fingers refused to type that word in that context.
 
 DDF is in a somewhat different situation though.  It assumes that the
 components are whole devices, and that the controller has exclusive
 access - there is no way another controller could interpret the
 devices differently before the DDF controller has a chance.

Putting a superblock at the end of a device works around OS
compatibility issues and other things related to transitioning the
device from part of an array to not, etc.  But, it works if and only if
you have the guarantee you mention.  Long, long ago I tinkered with the
idea of md multipath devices using an end of device superblock on the
whole device to allow reliable multipath detection and autostart,
failover of all partitions on a device when a command to any partition
failed, ability to use standard partition tables, etc. while being 100%
transparent to the rest of the OS.  The second you considered FC
connected devices and multi-OS access, that fell apart in a big way.
Very analogous.

So, I wouldn't necessarily call it wrong, but it's fragile.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-25 Thread Doug Ledford
On Wed, 2007-10-24 at 16:22 -0400, Bill Davidsen wrote:
 Doug Ledford wrote:
  On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:
 

  I don't agree completely.  I think the superblock location is a key
  issue, because if you have a superblock location which moves depending
  the filesystem or LVM you use to look at the partition (or full disk)
  then you need to be even more careful about how to poke at things.
  
 
  This is the heart of the matter.  When you consider that each file
  system and each volume management stack has a superblock, and they some
  store their superblocks at the end of devices and some at the beginning,
  and they can be stacked, then it becomes next to impossible to make sure
  a stacked setup is never recognized incorrectly under any circumstance.
  It might be possible if you use static device names, but our users
  *long* ago complained very loudly when adding a new disk or removing a
  bad disk caused their setup to fail to boot.  So, along came mount by
  label and auto scans for superblocks.  Once you do that, you *really*
  need all the superblocks at the same end of a device so when you stack
  things, it always works properly.
 Let me be devil's advocate, I noted in another post that location might 
 be raid level dependent. For raid-1 putting the superblock at the end 
 allows the BIOS to treat a single partition as a bootable unit.

This is true for both the 1.0 and 1.2 superblock formats.  The BIOS
couldn't care less if there is an offset to the filesystem because it
doesn't try to read from the filesystem.  It just jumps to the first 512
byte sector and that's it.  Grub/Lilo are the ones that have to know
about the offset, and they would be made aware of the offset at install
time.

So, we are back to the exact same thing I was talking about.  With the
superblock at the beginning of the device, you don't hinder bootability
with or without the raid working, the raid would be bootable regardless
as long as you made it bootable, it only hinders accessing the
filesystem via a running linux installation without bringing up the
raid.

  For all 
 other arrangements the end location puts the superblock where it is 
 slightly more likely to be overwritten, and where it must be moved if 
 the partition grows or whatever.
 
 There really may be no right answer.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Raid-10 mount at startup always has problem

2007-10-24 Thread Doug Ledford
On Wed, 2007-10-24 at 07:22 -0700, Daniel L. Miller wrote:
 Daniel L. Miller wrote:
  Richard Scobie wrote:
  Daniel L. Miller wrote:
 
  And you didn't ask, but my mdadm.conf:
  DEVICE partitions
  ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4 
  UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a
 
  Try adding
 
  auto=part
 
  at the end of you mdadm.conf ARRAY line.
  Thanks - will see what happens on my next reboot.
 
 Current mdadm.conf:
 DEVICE partitions
 ARRAY /dev/.static/dev/md0 level=raid10 num-devices=4 
 UUID=9d94b17b:f5fac31a:577c252b:0d4c4b2a auto=part
 
 still have the problem where on boot one drive is not part of the 
 array.  Is there a log file I can check to find out WHY a drive is not 
 being added?

It usually means either the device is busy at the time the raid startup
happened, or the device wasn't created by udev yet at the time the
startup happened.  Is it failing to start the array properly in the
initrd or is this happening after you've switched to the rootfs and are
running the startup scripts?
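
In the meantime you can see what state the array is in and re-add the member
that got left out by hand, e.g. (substitute whichever disk was skipped):

  cat /proc/mdstat
  mdadm --detail /dev/md0
  mdadm /dev/md0 --add /dev/sda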


 md: md0 stopped.
 md: md0 stopped.
 md: bind<sdc>
 md: bind<sdd>
 md: bind<sdb>

Whole disk raid devices == bad.  Lots of stuff can go wrong with that
setup.

 md: md0: raid array is not clean -- starting background reconstruction
 raid10: raid set md0 active with 3 out of 4 devices
 md: couldn't update array info. -22
 md: resync of RAID array md0
 md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
 md: using maximum available idle IO bandwidth (but not more than 20 
 KB/sec) for resync.
 md: using 128k window, over a total of 312581632 blocks.
 Filesystem md0: Disabling barriers, not supported by the underlying device
 XFS mounting filesystem md0
 Starting XFS recovery on filesystem: md0 (logdev: internal)
 Ending XFS recovery on filesystem: md0 (logdev: internal)
 
 
 
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Tue, 2007-10-23 at 19:03 -0400, Bill Davidsen wrote:
 John Stoffel wrote:
  Why do we have three different positions for storing the superblock?  

 Why do you suggest changing anything until you get the answer to this 
 question? If you don't understand why there are three locations, perhaps 
 that would be a good initial investigation.
 
 Clearly the short answer is that they reflect three stages of Neil's 
 thinking on the topic, and I would bet that he had a good reason for 
 moving the superblock when he did it.

I believe, and Neil can correct me if I'm wrong, that 1.0 (at the end of
the device) is to satisfy people that want to get at their raid1 data
without bringing up the device or using a loop mount with an offset.
Version 1.1, at the beginning of the device, is to prevent accidental
access to a device when the raid array doesn't come up.  And version 1.2
(4k from the beginning of the device) would be suitable for those times
when you want to embed a boot sector at the very beginning of the device
(which really only needs 512 bytes, but a 4k offset is as easy to deal
with as anything else).  From the standpoint of wanting to make sure an
array is suitable for embedding a boot sector, the 1.2 superblock may be
the best default.
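
If you're ever unsure which flavour a given member carries, mdadm will tell
you, e.g.:

  mdadm --examine /dev/sda1 | grep -E 'Version|Offset'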

 Since you have to support all of them or break existing arrays, and they 
 all use the same format so there's no saving of code size to mention, 
 why even bring this up?
 
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-23 Thread Doug Ledford
On Tue, 2007-10-23 at 21:21 +0200, Michal Soltys wrote:
 Doug Ledford wrote:
  
  Well, first I was thinking of files in the few hundreds of megabytes
  each to gigabytes each, and when they are streamed, they are streamed at
  a rate much lower than the full speed of the array, but still at a fast
  rate.  How parallel the reads are then would tend to be a function of
  chunk size versus streaming rate. 
 
 Ahh, I see now. Thanks for explanation.
 
 I wonder though, if setting large readahead would help, if you used larger 
 chunk size. Assuming other options are not possible - i.e. streaming from 
 larger buffer, while reading to it in a full stripe width at least.

Probably not.  All my trial and error in the past with raid5 arrays and
various situations that would cause pathological worst case behavior
showed that once reads themselves reach 16k in size, and are sequential
in nature, then the disk firmware's read ahead kicks in and your
performance stays about the same regardless of increasing your OS read
ahead.  In a nutshell, once you've convinced the disk firmware that you
are going to be reading some data sequentially, it does the rest.  With
a large stripe size (say 256k+), you'll trigger this firmware read ahead
fairly early on in reading any given stripe, so you really don't buy
much by reading the next stripe before you need it, and in fact can end
up wasting a lot of RAM trying to do so, hurting overall performance.
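
(If you do want to play with it, the OS read ahead is easy enough to poke at
per device:

  blockdev --getra /dev/md0        # current read ahead, in 512-byte sectors
  blockdev --setra 4096 /dev/md0   # e.g. 2MB

but per the above, don't expect miracles once the drive firmware's own read
ahead has kicked in.)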

  
  I'm not familiar with the benchmark you are referring to.
  
 
 I was thinking about 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg08461.html
 
 with small discussion that happend after that.
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Sat, 2007-10-20 at 22:24 +0400, Michael Tokarev wrote:
 John Stoffel wrote:
  Michael == Michael Tokarev [EMAIL PROTECTED] writes:

  As Doug says, and I agree strongly, you DO NOT want to have the
  possibility of confusion and data loss, especially on bootup.  And
 
 There are different point of views, and different settings etc.

Indeed, there are different points of view.  And with that in mind, I'll
just point out that my point of view is that of an engineer who is
responsible for all the legitimate md bugs in our products once tech
support has weeded out the "you tried to do what?" cases.  From that
point of view, I deal with *every* user's preferred use case, not any
single use case.

 For example, I once dealt with a linux user who was unable to
 use his disk partition, because his system (it was RedHat if I
 remember correctly) recognized some LVM volume on his disk (it
 was previously used with Windows) and tried to automatically
 activate it, thus making it busy.

Yep, that can still happen today under certain circumstances.

   What I'm talking about here
 is that any automatic activation of anything should be done with
 extreme care, using smart logic in the startup scripts if at
 all.

We do.  Unfortunately, there is no logic smart enough to recognize all
the possible user use cases that we've seen given the way things are
created now.

 The Doug's example - in my opinion anyway - shows wrong tools
 or bad logic in the startup sequence, not a general flaw in
 superblock location.

Well, one of the problems is that you can both use an md device as an
LVM physical volume and use an LVM logical volume as an md constituent
device.  Users have done both.

 For example, when one drive was almost dead, and mdadm tried
 to bring the array up, machine just hanged for unknown amount
 of time.  An inexperienced operator was there.  Instead of
 trying to teach him how to pass parameter to the initramfs
 to stop trying to assemble root array and next assembling
 it manually, I told him to pass root=/dev/sda1 to the
 kernel.  Root mounts read-only, so it should be a safe thing
 to do - I only needed root fs and minimal set of services
 (which are even in initramfs) just for it to boot up to SOME
 state where I can log in remotely and fix things later.

Umm, no.  Generally speaking (I can't speak for other distros), both
Fedora and RHEL remount root rw even when coming up in single user mode.
The only time the fs is left in ro mode is when it drops to a shell
during rc.sysinit as a result of a failed fs check.  And if you are
using an ext3 filesystem and things didn't go down clean, then you also
get a journal replay.  So, then what happens when you think you've fixed
things, and you reboot, and then due to random chance, the ext3 fs check
gets the journal off the drive that wasn't mounted and replays things
again?  Will this overwrite your fixes possibly?  Yep.  Could do all
sorts of bad things.  In fact, unless you do a full binary compare of
your constituent devices, you could have silent data corruption and just
never know about it.  You may get off lucky and never *see* the
corruption, but it could well be there.  The only safe way to
reintegrate your raid after doing what you suggest is to kick the
unmounted drive out of the array before rebooting by using mdadm to zero
its superblock, boot up with a degraded raid1 array, and readd the
kicked device back in.
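A rough sketch of that sequence, with made-up names (/dev/sda1 being the
copy you booted from and fixed, /dev/sdb1 the member that stayed
unmounted):

  mdadm --zero-superblock /dev/sdb1   # kick the stale, never-mounted copy
  # reboot; the array assembles degraded from /dev/sda1 alone
  mdadm /dev/md0 --add /dev/sdb1      # resync rewrites it from the good copy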

So, while you list several more examples of times when it was convenient
to do as you suggest, these times can be handled in other ways (although
it may mean keeping a rescue CD handy at each location just for
situations like this) that are far safer IMO.

Now, putting all this back into the point of view I have to take, which
is what's the best default action to take for my customers, I'm sure you
can understand how a default setup and recommendation of use that leaves
silent data corruption is simply a non-starter for me.  If someone wants
to do this manually, then go right ahead.  But as for what we do by
default when the user asks us to create a raid array, we really need to
be on superblock 1.1 or 1.2 (although we aren't yet, we've waited for
the version 1 superblock issues to be ironed out and will do so in a future
release).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-23 Thread Doug Ledford
On Mon, 2007-10-22 at 16:39 -0400, John Stoffel wrote:

 I don't agree completely.  I think the superblock location is a key
 issue, because if you have a superblock location which moves depending
 the filesystem or LVM you use to look at the partition (or full disk)
 then you need to be even more careful about how to poke at things.

This is the heart of the matter.  When you consider that each file
system and each volume management stack has a superblock, and some store
their superblocks at the end of devices and some at the beginning,
and they can be stacked, then it becomes next to impossible to make sure
a stacked setup is never recognized incorrectly under any circumstance.
It might be possible if you use static device names, but our users
*long* ago complained very loudly when adding a new disk or removing a
bad disk caused their setup to fail to boot.  So, along came mount by
label and auto scans for superblocks.  Once you do that, you *really*
need all the superblocks at the same end of a device so when you stack
things, it always works properly.

 Michael Another example is ext[234]fs - it does not touch first 512
 Michael bytes of the device, so if there was an msdos filesystem
 Michael there before, it will be recognized as such by many tools,
 Michael and an attempt to mount it automatically will lead to at
 Michael least scary output and nothing mounted, or in fsck doing
 Michael fatal things to it in worst scenario.  Sure thing the first
 Michael 512 bytes should be just cleared.. but that's another topic.
 
 I would argue that ext[234] should be clearing those 512 bytes.  Why
 aren't they cleared  

Actually, I didn't think msdos used the first 512 bytes for the same
reason ext3 doesn't: space for a boot sector.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 09:53 +0200, Iustin Pop wrote:

 Honestly, I don't see how a properly configured system would start
 looking at the physical device by mistake. I suppose it's possible, but
 I didn't have this issue.

Mount by label support scans all devices in /proc/partitions looking for
the filesystem superblock that has the label you are trying to mount.
LVM (unless told not to) scans all devices in /proc/partitions looking
for valid LVM superblocks.  In fact, you can't build a linux system that
is resilient to device name changes without doing that.

 It's not only about the activation of the array. I'm mostly talking
 about RAID1, but the fact that migrating between RAID1 and plain disk is
 just a few hundred K at the end increases the flexibility very much.

Flexibility, no.  Convenience, yes.  You can do all the things with
superblock at the front that you can with it at the end, it just takes a
little more effort.

 Also, sometime you want to recover as much as possible from a not intact
 copy of the data...

And you can with superblock at the front.  You can create a new single
disk raid1 over the existing superblock or you can munge the partition
table to have it point at the start of your data.  There are options,
they just require manual intervention.  But if you are trying to rescue
data off of a seriously broken device, you are already doing manual
intervention anyway.
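As a sketch of that manual recovery (every parameter here, the level,
member count, metadata version and device names, has to match what the
original array used, so treat this as an outline rather than a recipe):

  mdadm --create /dev/md0 --metadata=1.1 --level=1 --raid-devices=2 \
        --assume-clean /dev/sda3 missing
  fsck -n /dev/md0                  # read-only sanity check first
  mount -o ro /dev/md0 /mnt/rescue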

 Of course, different people have different priorities, but as I said, I
 like that this conversion is possible, and I never had the case of a
 tool saying hmm, /dev/mdsomething is not there, let's look at
 /dev/sdc instead.

mount, pvscan.

 thanks,
 iustin
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: chunk size (was Re: Time to deprecate old RAID formats?)

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 00:43 +0200, Michal Soltys wrote:
 Doug Ledford wrote:
  course, this comes at the expense of peak throughput on the device.
  Let's say you were building a mondo movie server, where you were
  streaming out digital movie files.  In that case, you very well may care
  more about throughput than seek performance since I suspect you wouldn't
  have many small, random reads.  Then I would use a small chunk size,
  sacrifice the seek performance, and get the throughput bonus of parallel
  reads from the same stripe on multiple disks.  On the other hand, if I
  
 
 Out of curiosity though - why wouldn't large chunk work well here ? If you 
 stream video (I assume large files, so like a good few MBs at least), the 
 reads are parallel either way.

Well, first I was thinking of files in the few hundreds of megabytes
each to gigabytes each, and when they are streamed, they are streamed at
a rate much lower than the full speed of the array, but still at a fast
rate.  How parallel the reads are then would tend to be a function of
chunk size versus streaming rate.  I guess I should clarify what I'm
talking about anyway.  To me, a large chunk size is 1 to 2MB or so, a
small chunk size is in the 64k to 256k range.  If you have a 10 disk
raid5 array with a 2mb chunk size, and you aren't just copying files
around, then it's hard to ever get that to do full speed parallel reads
because you simply won't access the data fast enough.
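So for a streaming box like that, array creation might look something
like the following (a sketch with made-up devices, not a tuning
recommendation):

  mdadm --create /dev/md0 --level=5 --raid-devices=10 --chunk=64 \
        /dev/sd[b-k]1    # --chunk is in KB, so this is a 64k chunk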

 Yes, the amount of data read from each of the disks will be in less perfect 
 proportion than in the small chunk size scenario, but it's pretty negligible.
 Benchmarks I've seen (like Justin's one) seem not to care much about chunk
 size in sequential read/write scenarios (and often favor larger chunks).
 Some of my own tests I did few months ago confirmed that as well.

I'm not familiar with the benchmark you are referring to.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 17:07 +0200, Iustin Pop wrote:
 On Sat, Oct 20, 2007 at 10:52:39AM -0400, John Stoffel wrote:
  Michael Well, I strongly, completely disagree.  You described a
  Michael real-world situation, and that's unfortunate, BUT: for at
  Michael least raid1, there ARE cases, pretty valid ones, when one
  Michael NEEDS to mount the filesystem without bringing up raid.
  Michael Raid1 allows that.
  
  Please describe one such case please.
 
 Boot from a raid1 array, such that everything - including the partition
 table itself - is mirrored.

That's a *really* bad idea.  If you want to subpartition a raid array,
you really should either run lvm on top of raid or use a partitionable
raid array embedded in a raid partition.  If you don't, there are a
whole slew of failure cases that would result in the same sort of
accidental access and data corruption that I talked about.  For
instance, if you ever ran fdisk on the disk itself instead of the raid
array, fdisk would happily create a partition that runs off the end of
the raid device and into the superblock area.  The raid subsystem
autodetect only works on partitions labeled as type 0xfd, so it would
never search for a raid superblock at the end of the actual device.
That means that if you boot from a rescue CD whose mdadm.conf doesn't
list the whole disk device as a search device, the array is guaranteed
not to start, and the rescue environment may then access and modify the
underlying constituent devices directly.  All around, it's just a
*really* bad idea.
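To sketch the two safer layouts mentioned above (device names are
examples only):

  # lvm on top of raid1; carve up logical volumes instead of partitions
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  pvcreate /dev/md0
  vgcreate vg0 /dev/md0
  lvcreate -L 10G -n root vg0

  # or a partitionable array living inside a raid partition
  mdadm --create /dev/md_d0 --auto=mdp --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1
  fdisk /dev/md_d0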

I've heard several descriptions of things you *could* do with the
superblock at the end, but as of yet, not one of them is a good idea if
you really care about your data.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-20 Thread Doug Ledford
On Sat, 2007-10-20 at 22:38 +0400, Michael Tokarev wrote:
 Justin Piszcz wrote:
  
  On Fri, 19 Oct 2007, Doug Ledford wrote:
  
  On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote:
 []
  Got it, so for RAID1 it would make sense if LILO supported it (the
  later versions of the md superblock)
 
  Lilo doesn't know anything about the superblock format, however, lilo
  expects the raid1 device to start at the beginning of the physical
  partition.  In otherwords, format 1.0 would work with lilo.
  Did not work when I tried 1.x with LILO, switched back to 00.90.03 and
  it worked fine.
 
 There are different 1.x - and the difference is exactly this -- location
 of the superblock.  In 1.0, superblock is located at the end, just like
 with 0.90, and lilo works just fine with it.  It gets confused somehow
 (however I don't see how really, because it uses bmap() to get a list
 of physical blocks for the files it wants to access - those should be
 in absolute numbers, regardless of the superblock location) when the
 superblock is at the beginning (v 1.1 or 1.2).
 
 /mjt

It's been a *long* time since I looked at the lilo raid1 support (I
wrote the original patch that Red Hat used, I have no idea if that's
what the lilo maintainer integrated though).  However, IIRC, it uses
bmap on the file, which implies it's via the filesystem mounted on the
raid device.  And the numbers are not absolute I don't think except with
respect to the file system.  So, I think the situation could be made to
work if you just taught lilo that on version 1.1 or version 1.2
superblock raids that it should add the data offset of the raid to the
bmap numbers (which I think are already added to the partition offset
numbers).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Doug Ledford
On Fri, 2007-10-19 at 23:23 +0200, Iustin Pop wrote:
 On Fri, Oct 19, 2007 at 02:39:47PM -0400, John Stoffel wrote:
  And if putting the superblock at the end is problematic, why is it the
  default?  Shouldn't version 1.1 be the default?  
 
 In my opinion, having the superblock *only* at the end (e.g. the 0.90
 format) is the best option.
 
 It allows one to mount the disk separately (in case of RAID 1), if the
 MD superblock is corrupt or you just want to get easily at the raw data.

Bad reasoning.  It's the reason that the default is at the end of the
device, but that was a bad decision made by Ingo long, long ago in a
galaxy far, far away.

The simple fact of the matter is there are only two type of raid devices
for the purpose of this issue: those that fragment data (raid0/4/5/6/10)
and those that don't (raid1, linear).

For the purposes of this issue, there are only two states we care about:
the raid array works or doesn't work.

If the raid array works, then you *only* want the system to access the
data via the raid array.  If the raid array doesn't work, then for the
fragmented case you *never* want the system to see any of the data from
the raid array (such as an ext3 superblock) or a subsequent fsck could
see a valid superblock and actually start a filesystem scan on the raw
device, and end up hosing the filesystem beyond all repair after it hits
the first chunk size break (although in practice this is usually a
situation where fsck declares the filesystem so corrupt that it refuses
to touch it, that's leaving an awful lot to chance, you really don't
want fsck to *ever* see that superblock).

If the raid array is raid1, then the raid array should *never* fail to
start unless all disks are missing (in which case there is no raw device
to access anyway).  The very few failure types that will cause the raid
array to not start automatically *and* still have an intact copy of the
data usually happen when the raid array is perfectly healthy, in which
case automatically finding a constituent device when the raid array
failed to start is exactly the *wrong* thing to do (for instance, you
enable SELinux on a machine and it hasn't been relabeled and the raid
array fails to start because /dev/mdblah can't be created because of
an SELinux denial...all the raid1 members are still there, but if you
touch a single one of them, then you run the risk of creating silent
data corruption).

It really boils down to this: for any reason that a raid array might
fail to start, you *never* want to touch the underlying data until
someone has taken manual measures to figure out why it didn't start and
corrected the problem.  Putting the superblock in front of the data does
not prevent manual measures (such as recreating superblocks) from
getting at the data.  But, putting superblocks at the end leaves the
door open for accidental access via constituent devices when you
*really* don't want that to happen.

So, no, the default should *not* be at the end of the device.

 As to the people who complained exactly because of this feature, LVM has
 two mechanisms to protect from accessing PVs on the raw disks (the
 ignore raid components option and the filter - I always set filters when
 using LVM ontop of MD).
 
 regards,
 iustin
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Doug Ledford
On Fri, 2007-10-19 at 12:38 -0400, John Stoffel wrote:


 1, 1.0, 1.1, 1.2
 
   Use the new version-1 format superblock.  This has few restrictions.
   The different sub-versions store the superblock at different locations
   on the device, either at the end (for 1.0), at the start (for 1.1) or
   4K from the start (for 1.2).
 
 
 It looks to me that the 1.1, combined with the 1.0 should be what we
 use, with the 1.2 format nuked.  Maybe call it 1.3?  *grin*

You're somewhat misreading the man page.  You *can't* combine 1.0 with
1.1.  All of the above options: 1, 1.0, 1.1, 1.2; specifically mean to
use a version 1 superblock.  1.0 means use a version 1 superblock at the
end of the disk.  1.1 means version 1 superblock at beginning of disk.
1.2 means version 1 at a 4k offset from the beginning of the disk.  There
really is no actual version 1.1, or 1.2, the .0, .1, and .2 part of the
version *only* means where to put the version 1 superblock on the disk.
If you just say version 1, then it goes to the default location for
version 1 superblocks, and last I checked that was the end of disk (aka,
1.0).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Time to deprecate old RAID formats?

2007-10-19 Thread Doug Ledford
On Fri, 2007-10-19 at 11:46 -0400, John Stoffel wrote:
  Justin == Justin Piszcz [EMAIL PROTECTED] writes:
 
 Justin On Fri, 19 Oct 2007, John Stoffel wrote:
 
  
  So,
  
  Is it time to start thinking about deprecating the old 0.9, 1.0 and
  1.1 formats to just standardize on the 1.2 format?  What are the
  issues surrounding this?

1.0, 1.1, and 1.2 are the same format, just in different positions on
the disk.  Of the three, the 1.1 format is the safest to use since it
won't allow you to accidentally have some sort of metadata between the
beginning of the disk and the raid superblock (such as an lvm2
superblock), and hence whenever the raid array isn't up, you won't be
able to accidentally mount the lvm2 volumes, filesystem, etc.  (In worst
case situations, I've seen lvm2 find a superblock on one RAID1 array
member when the RAID1 array was down, the system came up, you used the
system, and the two copies of the raid array were made drastically
inconsistent; then at the next reboot, the situation that prevented the
RAID1 from starting was resolved, the array never knew it had failed to
start last time, and the two inconsistent members were put back into a
clean array.)  So, deprecating any of these is not really helpful.  And you
need to keep the old 0.90 format around for back compatibility with
thousands of existing raid arrays.

  It's certainly easy enough to change mdadm to default to the 1.2
  format and to require a --force switch to  allow use of the older
  formats.
  
  I keep seeing that we support these old formats, and it's never been
  clear to me why we have four different ones available?  Why can't we
  start defining the canonical format for Linux RAID metadata?
  
  Thanks,
  John
  [EMAIL PROTECTED]
  
 
 Justin I hope 00.90.03 is not deprecated, LILO cannot boot off of
 Justin anything else!
 
 Are you sure?  I find that GRUB is much easier to use and setup than
 LILO these days.  But hey, just dropping down to support 00.09.03 and
 1.2 formats would be fine too.  Let's just lessen the confusion if at
 all possible.
 
 John
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread Doug Ledford

Neil Brown wrote:


Yes, but it should not be needed, and I'd like to understand why it
is.
One of the last things do_md_run does is
   mddev->changed = 1;

When you next open /dev/md_d0, md_open is called which calls
check_disk_change().
This will call into md_fops->md_media_changed which will return the
value of mddev->changed, which will be '1'.
So check_disk_change will then call md_fops->revalidate_disk which
will set mddev->changed to 0, and will then set bd_invalidated to 1
(as bd_disk->minors > 1 (being 64)).

md_open will then return into do_open (in fs/block_dev.c) and because
bd_invalidated is true, it will call rescan_partitions and the
partitions will appear.


Yuck.  The md stack should populate the partition information on device 
creation *without* needing someone to open the resulting device.  That 
you can tweak mdadm to open the device after creation is fine, but 
unless no other program is allowed to use the ioctls to start devices, 
and unless this is a documented part of the API, waiting until second 
open to populate the device info is just flat wrong.  It breaks all 
sorts of expectations people have regarding things like mount by label, etc.



Hmmm... there is room for a race there.  If some other process opens
/dev/md_d0 before mdadm gets to close it, it will call
rescan_partitions before first calling  bd_set_size to update the size
of the bdev.  So when we try to read the partition table, it will
appear to be reading past the EOF, and will not actually read
anything..

I guess udev must be opening the block device at exactly the wrong
time. 


I can simulate this by holding /dev/md_d0 open while assembling the
array.  If I do that, the partitions don't get created.
Yuck.

Maybe I could call bd_set_size in md_open before calling
check_disk_change..

Yep, this patch seems to fix it.  Could you confirm?

Thanks,

NeilBrown

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2007-04-17 11:42:15.0 +1000
+++ ./drivers/md/md.c   2007-04-24 21:29:51.0 +1000
@@ -4485,6 +4485,8 @@ static int md_open(struct inode *inode,
 	mddev_get(mddev);
 	mddev_unlock(mddev);
 
+	if (mddev->changed)
+		bd_set_size(inode->i_bdev, mddev->array_size << 1);
 	check_disk_change(inode->i_bdev);
  out:
 	return err;




--
Doug Ledford [EMAIL PROTECTED]
http://people.redhat.com/dledford

Infiniband specific RPMs can be found at
http://people.redhat.com/dledford/Infiniband


Re: [RFC: 2.6 patch] simplify drivers/md/md.c:update_size()

2006-12-14 Thread Doug Ledford
On Fri, 2006-12-15 at 01:19 +0100, Adrian Bunk wrote:
 While looking at commit 8ddeeae51f2f197b4fafcba117ee8191b49d843e,
 I got the impression that this commit couldn't fix anything, since the 
 size variable can't be changed before fit gets used.
 
 Is there any big thinko, or is the patch below that slightly simplifies 
 update_size() semantically equivalent to the current code?

No, this patch is broken.  Where it fails is specifically the case where
you want to autofit the largest possible size, you have different size
devices, and the first device is not the smallest.  When you hit the
first device, you will set size, then as you repeat the ITERATE_RDEV
loop, when you hit the smaller device, size will be non-0 and you'll
then trigger the later if and return -ENOSPC.  In the case of autofit,
you have to preserve the fit variable instead of looking at size so you
know whether or not to modify the size when you hit a smaller device
later in the list.

 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 
 ---
 
  drivers/md/md.c |3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)
 
 --- linux-2.6.19-mm1/drivers/md/md.c.old	2006-12-15 00:57:05.0 +0100
 +++ linux-2.6.19-mm1/drivers/md/md.c	2006-12-15 00:57:42.0 +0100
 @@ -4039,57 +4039,56 @@
 	 * Generate a 128 bit UUID
 	 */
 	get_random_bytes(mddev->uuid, 16);
 
 	mddev->new_level = mddev->level;
 	mddev->new_chunk = mddev->chunk_size;
 	mddev->new_layout = mddev->layout;
 	mddev->delta_disks = 0;
 
 	mddev->dead = 0;
 	return 0;
  }
 
  static int update_size(mddev_t *mddev, unsigned long size)
  {
 	mdk_rdev_t * rdev;
 	int rv;
 	struct list_head *tmp;
 -	int fit = (size == 0);
 
 	if (mddev->pers->resize == NULL)
 		return -EINVAL;
 	/* The "size" is the amount of each device that is used.
 	 * This can only make sense for arrays with redundancy.
 	 * linear and raid0 always use whatever space is available
 	 * We can only consider changing the size if no resync
 	 * or reconstruction is happening, and if the new size
 	 * is acceptable. It must fit before the sb_offset or,
 	 * if that is < data_offset, it must fit before the
 	 * size of each device.
 	 * If size is zero, we find the largest size that fits.
 	 */
 	if (mddev->sync_thread)
 		return -EBUSY;
 	ITERATE_RDEV(mddev,rdev,tmp) {
 		sector_t avail;
 		avail = rdev->size * 2;
 
 -		if (fit && (size == 0 || size > avail/2))
 +		if (size == 0)
 			size = avail/2;
 		if (avail < ((sector_t)size << 1))
 			return -ENOSPC;
 	}
 	rv = mddev->pers->resize(mddev, (sector_t)size *2);
 	if (!rv) {
 		struct block_device *bdev;
 
 		bdev = bdget_disk(mddev->gendisk, 0);
 		if (bdev) {
 			mutex_lock(&bdev->bd_inode->i_mutex);
 			i_size_write(bdev->bd_inode, (loff_t)mddev->array_size << 10);
 			mutex_unlock(&bdev->bd_inode->i_mutex);
 			bdput(bdev);
 		}
 	}
 	return rv;
  }
-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Recovering from default FC6 install

2006-11-12 Thread Doug Ledford
On Sun, 2006-11-12 at 01:00 -0500, Bill Davidsen wrote:
 I tried something new on a test system, using the install partitioning 
 tools to partition the disk. I had three drives and went with RAID-1 for 
 boot, and RAID-5+LVM for the rest. After the install was complete I 
 noted that it was solid busy on the drives, and found that the base RAID 
 appears to have been created (a) with no superblock and (b) with no 
 bitmap. That last is an issue, as a test system it WILL be getting hung 
 and rebooted, and recovering the 1.5TB took hours.
 
 Is there an easy way to recover this? The LVM dropped on it has a lot of 
 partitions, and there is a lot of data in them asfter several hours of 
 feeding with GigE, so I can't readily back up and recreate by hand.
 
 Suggestions?

First, the Fedora installer *always* creates persistent arrays, so I'm
not sure what is making you say it didn't, but they should be
persistent.

So, assuming that they are persistent, just recreate the arrays in place
as version 1.0 superblocks with internal bitmap.  I did that exact thing
on my FC6 machine I was testing with (raid1, not raid5, but no biggie
there) and it worked fine.  The detailed list of instructions:

Reboot with a rescue CD and skip the finding of the installation.  When
you are at a prompt, use mdadm to examine the raid superblocks so you get
all the pertinent data, such as the chunk size on the raid5 and the
ordering of the constituent drives, right.  Then recreate the arrays with
version 1.0 superblocks and internal write intent bitmaps.  Next, mount
the partitions manually and bind mount things like /dev and /proc into
wherever you mounted the root filesystem.  Edit the mdadm.conf on the
root filesystem and remove the ARRAY lines (the GUIDs will be wrong now),
then use mdadm -Db or mdadm -Eb to get new ARRAY lines and append them to
the mdadm.conf (possibly altering the device names for the arrays; if you
use -E, remember to correct the printout of the GUID in the ARRAY line,
it's 10:8:8:6 instead of 8:8:8:8).  Patch mkinitrd with something like
the attached patch, and patch /etc/rc.d/rc.sysinit with something like
the other attached patch (or leave that patch out and manually add the
correct auto= parameter to your ARRAY lines in the mdadm.conf).  Finally,
chroot into the root filesystem, remake your initrd image, fdisk the
drives to switch the linux partition types from raid autodetect to plain
linux, reboot, and you are done.
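Condensed into commands, the recreate step might look roughly like the
following; the level, chunk size, device names and device order all have
to come from your own mdadm -E output, the values here are only
placeholders:

  mdadm -E /dev/sda2      # note level, chunk size and device order first
  mdadm --create /dev/md1 --metadata=1.0 --level=5 --raid-devices=3 \
        --chunk=256 --bitmap=internal /dev/sda2 /dev/sdb2 /dev/sdc2
  mdadm -Eb /dev/sda2     # new ARRAY line to append to mdadm.conf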

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband
--- /sbin/mkinitrd	2006-09-28 12:51:28.0 -0400
+++ mkinitrd	2006-11-12 10:28:31.0 -0500
@@ -1096,6 +1096,13 @@
     mknod $MNTIMAGE/dev/efirtc c 10 136
 fi
 
+if [ -n "$raiddevices" ]; then
+    inst /sbin/mdadm.static $MNTIMAGE/bin/mdadm
+    if [ -f /etc/mdadm.conf ]; then
+        cp $verbose /etc/mdadm.conf $MNTIMAGE/etc/mdadm.conf
+    fi
+fi
+
 # FIXME -- this can really go poorly with clvm or duplicate vg names.
 # nash should do lvm probing for us and write its own configs.
 if [ -n "$vg_list" ]; then
@@ -1234,8 +1241,7 @@
 
 if [ -n "$raiddevices" ]; then
     for dev in $raiddevices; do
-        cp -a /dev/${dev} $MNTIMAGE/dev
-        emit raidautorun /dev/${dev}
+        emit mdadm -As --auto=yes /dev/${dev}
     done
 fi
 
--- /etc/rc.d/rc.sysinit	2006-10-04 18:14:53.0 -0400
+++ rc.sysinit	2006-11-12 10:29:03.0 -0500
@@ -403,7 +403,7 @@
 update_boot_stage RCraid
 [ -x /sbin/nash ] && echo "raidautorun /dev/md0" | nash --quiet
 if [ -f /etc/mdadm.conf ]; then
-	/sbin/mdadm -A -s
+	/sbin/mdadm -A -s --auto=yes
 fi
 
 # Device mapper & related initialization


signature.asc
Description: This is a digitally signed message part


Re: mdadm-2.5.4 issues and 2.6.18.1 kernel md issues

2006-11-09 Thread Doug Ledford
 it
was partitioned or not.  So, for example, if it's not a partitioned
array, you would have to teach grub the following: say you have your boot
data on (hd0,0); if (hd0,0) is part of a raid array with certain
superblock types (it would probably have to read /proc/mdstat to know),
then the data does not start at the start of the partition, but instead
at something like partition size in blocks - whole md device size in
blocks = offset into partition to the start of the md data, and
consequently to the ext filesystem that /boot is comprised of.  If it is
partitioned, then you
could teach it the notion of (hd0,0,0), aka chained partition tables,
where you use the same offset calculation above to get to the chained
partition table, then read that partition table to get the offset to the
filesystem.  I don't think it would be too difficult for grub, but it
would have to be added.  This does, however, point out that the md
stack's decision to use a geometry on its devices that is totally
different than the real constituent device geometry means that grub
would have to perform conversions on that chained partition table to get
from md offset to real device offset.  That may not matter much in the
end, but it will have to be done.
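To put made-up numbers on that arithmetic: if (hd0,0) is 1,000,000 blocks
long and the md device inside it reports a size of 999,744 blocks, then
the md data (and the filesystem or chained partition table) starts
roughly 1,000,000 - 999,744 = 256 blocks into the partition.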

The difference in geometry also precludes doing a whole device md array
with the superblock at the end and the partition table where the normal
device partition table would be.  Although that sort of setup is risky
in terms of failure to assemble like I pointed out above, it does have
its appeal for certain situations like multipath or the ability to have
a partitioned raid1 device with /boot in the array without needing to
modify grub, especially on machines that don't have built in SATA raid
that dm-raid could make use of.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


mdadm-2.5.4 issues and 2.6.18.1 kernel md issues

2006-11-02 Thread Doug Ledford
If I use mdadm 2.5.4 to create a version 1 superblock raid1 device, it
starts a resync.  If I then reboot the computer part way through, when
it boots back up, the resync gets cancelled and the array is considered
clean.  This is against a 2.6.18.1 kernel.

If I create a version 1 superblock raid1 array, mdadm -D <constituent
device> says that the device is not part of a raid array (and likewise
the kernel autorun facility fails to find the device).

If I create a version 1 superblock raid1 array, mdadm -E <constituent
device> sees the superblock.  If I then run mdadm -E --brief on that
same device, it prints out the 1 line ARRAY line, but it misprints the
UUID such that it is a 10 digit hex number: 8 digit hex number: 8 digit hex
number: 6 digit hex number.  It also prints the mdadm device in the
ARRAY line as /dev/md/# whereas mdadm -D --brief prints the device
as /dev/md#.  Consistency would be nice.

Does the superblock still not store any information about whether or not
the array is a single device or partitionable?  Would be nice if the
superblock gave some clue as to that fact so that it could be used to
set the auto= param on an mdadm -E --brief line to the right mode.

Mdadm assumes that the --name option passed in to create an array means
something in particular to the md array name and modifies subsequent
mdadm -D --brief and mdadm -E --brief outputs to include the name option
minus the hostname.  Aka, if I set the name to firewall:/boot, mdadm -E
--brief will then print out the ARRAY line device as /dev/md//boot.  I
don't think this is documented anywhere.  This also raises the question
of how partitionable md devices will be handled in regards to their name
component.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: future hardware

2006-10-29 Thread Doug Ledford
On Fri, 2006-10-27 at 17:18 -0500, Daniel Korstad wrote:

  leaving me one 5.25 left for the fan.  In addition to the fan in the
  item above, I have the exhaust fan on the Power Supply, another 12mm
  exhaust fan and a 12mm intake that blows across the other HDs.
 Sorry, I was in too much of a hurry; those are 120cm exhaust and 120cm intake

Hehehe, I'll burn in hell for pointing this out, but as 10mm == 1cm, a
120*mm* fan or 12*cm* fan would be correct.  I'm pretty sure your fans
are neither 12mm nor 120cm (or if you do have a 120cm
fan...damn...that's a lot of cooling)...

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


RE: why partition arrays?

2006-10-19 Thread Doug Ledford
On Thu, 2006-10-19 at 12:25 +0100, Ken Walker wrote:
 So is LVM better for partitions on a large raid5, or any raid, than separate
 partitions on that array.

In some ways yes, although it introduces a certain amount of uncertainty
in tuning of block devices.

 I'm still in my learning curve :)
 
 for example, if one has Linux running on a two disk mirror array, raid1, and
 the first disk is partitioned, say 5 partitions, with those partitions
 mirrored on the second disk, and each identical partition is then run as a
 mirror raid1.
 
 What you're saying is that, if a single partition fails, to remove the drive
 you have to fail all the array partitions on the drive your taking out, then
 rebuild the partitions and then add to the dirty raid the new partitions one
 at a time.

Yep.

 Will LVM remove all this, so if you have a mirror as a single raid
 partition, and use LVM to create the partitions on that mirror, if a disk
 goes down, can it be removed, replaced, and then just added to the single
 raid, with LVM having had no idea what was going on in the background and
 just plod along merrily.

Yep.  In addition, with LVM, if you added two new disks, also in a raid1
array, then you could add that to your current volume group as another
physical volume, and the LVM code would happily extend your volume to
span both RAID1 arrays and increase the size.  Since the md code can now
grow things, this isn't as impressive as it used to be, but it's
probably a little easier to handle the lvm stuff than the md growth
stuff if for no other reason than they have graphical LVM tools that you
can do this with.
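For example, growing a volume group onto a second raid1 array might look
like this (names and sizes are illustrative only):

  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  pvcreate /dev/md1
  vgextend vg0 /dev/md1
  lvextend -L +100G /dev/vg0/home
  resize2fs /dev/vg0/home    # then grow the filesystem on the lv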

 Is LVM stable, or can it cause more problems than separate raids on a array.

Current incarnations are very stable.  I mentioned earlier that it can
introduce some tuning issues.  If you are dealing with a raid device
directly, then it's relatively straightforward to set the stripe size,
chunk size, etc. according to the number of raid disks and then set the
elevator and possibly things like read ahead values to optimize the raid
array's performance for different needs.  When you introduce LVM on top
of raid, there is the possibility that there will be interactions
between the two that have a detrimental impact on performance (this may
not always be the case, and it may not be unfixable, I'm just saying
it's an additional layer you have to deal with).

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Propose of enhancement of raid1 driver

2006-10-19 Thread Doug Ledford
On Thu, 2006-10-19 at 13:28 +1000, Neil Brown wrote:
 On Tuesday October 17, [EMAIL PROTECTED] wrote:
  I would like to propose an enhancement of raid 1 driver in linux kernel.
  The enhancement would be speedup of data reading on mirrored partitions.
  The idea is easy.
  If we have mirrored partition over 2 disks, and these disk are in sync, 
  there is
  possibility of simultaneous reading of the data from both disks on the same 
  way
  as in raid 0. So it would be chunk1 read from master, chunk2 read from 
  slave at
  the same time. 
  As result it would give significant speedup of read operation (comparable 
  with
  speed of raid 0 disks).
 
 This is not as easy as it sounds.
 Skipping over blocks within a track is no faster than reading blocks
 in the track, so you would need to make sure that your chunk size is
 larger than one track - probably it would need to be several tracks.
 
 Raid1 already does some read-balancing, though it is possible (even
 likely) that it doesn't balance very effectively.  Working out how
 best to do the balancing in general in a non-trivial task, but would
 be worth spending time on.
 
 The raid10 module in linux supports a layout described as 'far=2'.
 In this layout, with two drives, the first half of the drives is used
 for a raid0, and the second half is used for a mirrored raid0 with the
 data on the other disk.
 In this layout reads should certainly go at raid0 speeds, though
 there is cost in the speed of writes.
 
 Maybe you would like to experiment.  Write a program that reads from
 two drives in parallel, reading all the 'odd' chunks from one drive
 and the 'even' chunks from the other, and find out how fast it is.
 Maybe you could get it to try lots of different chunk sizes and see
 which is the fastest.

Too artificial.  The results of this sort of test would not translate
well to real world usage.

 That might be quite helpful in understanding how to get read-balancing
 working well.

Doing *good* read balancing is hard, especially given things like FC
attached storage, iSCSI/iSER, etc.  If I wanted to do this right, I'd
start by teaching the md code to look more deeply into block devices,
possibly even with a self-tuning series of reads at startup to test
things like close-seek sequential operation times versus maximum seek
throughput.  That would clue you in as to whether the device you are
talking to might have more than one physical spindle, which in turn
affects the cost you associate with seek-requiring operations relative
to bandwidth-heavy operations.  I might even go so far as to look into
the SCSI transport classes for clues about data throughput at bus
bandwidth versus command startup/teardown costs on the bus, so you have
an accurate idea of whether lots of outstanding small commands are
likely to cause your device to suffer bus starvation issues from
overhead.  Then I'd use that data to help me numerically quantify the
load on a device, updated when a command is added to the block layer
queue (the queued load), again when the command is actually removed
from the block queue and sent to the device (the active load), and
again when the command is received back.  Then, I'd basically look at
what an incoming command *would* do to each constituent disk's load
values to see whether it should go to one disk or the other.  But,
that's just off the top of my head and I may be on crack...I didn't
check what my wife handed me this morning.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: why partition arrays?

2006-10-18 Thread Doug Ledford
On Wed, 2006-10-18 at 15:43 +0200, martin f krafft wrote:
 also sprach Doug Ledford [EMAIL PROTECTED] [2006.10.18.1526 +0200]:
  There are a couple reasons I can think.
 
 Thanks for your elaborate response. If you don't mind, I shall link
 to it from the FAQ.

Sure.

 I have one other question: do partitionable and traditional arrays
 actually differ in format? Put differently: can I assemble
 a traditional array as a partitionable one simply by specifying:
 
   mdadm --create ... /dev/md0 ...
   mdadm --stop /dev/md0
   mdadm --assemble --auto=part ... /dev/md0 ...
 
 ? Or do the superblocks actually differ?

Neil would be more authoritative about what would differ in the
superblocks, but yes, it is possible to do as you listed above.  In
fact, if you create a partitioned array, and your mkinitrd doesn't
restart it as a partitioned array, you'll wonder how to mount your
filesystems since the system will happily start that originally
partitioned array as non partitioned.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: avoiding the initial resync on --create

2006-10-10 Thread Doug Ledford
On Tue, 2006-10-10 at 11:55 +0200, Gabor Gombas wrote:
 On Mon, Oct 09, 2006 at 12:32:00PM -0400, Doug Ledford wrote:
 
  You don't really need to.  After a clean install, the operating system
  has no business reading any block it didn't write to during the install
  unless you are just reading disk blocks for the fun of it.
 
 What happens if you have a crash, and fsck for some reason tries to read
 into that uninitialized area? This may happen even years after the
 install if the array was never resynced and the filesystem was never
 100% full... What happens, if fsck tries to read the same area twice but
 gets different data, because the second time the read went to a
 different disk?
 
 And yes, fsck is exactly an application that reads blocks just for the
 fun of it when it tries to find all the pieces of the filesystem, esp.
 for filesystems that (unlike e.g. ext3) do not keep metadata at fixed
 locations.

Not at all true.  Every filesystem, no matter where it stores its
metadata blocks, still writes to every single metadata block it
allocates to initialize that metadata block.  The same is true for
directory blocks...they are created with a . and .. entry and nothing
else.  What exactly do you think mke2fs is doing when it's writing out
the inode groups, block groups, bitmaps, etc.?  Every metadata block
needed by fsck is written either during mkfs or during use as the
filesystem data is grown.

So, like my original email said, fsck has no business reading any block
that hasn't been written to either by the install or since the install
when the filesystem was filled up more.  It certainly does *not* read
blocks just for the fun of it, nor does it rely on anything the
filesystem didn't specifically write.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: avoiding the initial resync on --create

2006-10-10 Thread Doug Ledford
On Tue, 2006-10-10 at 22:37 +0200, Gabor Gombas wrote:
 You don't get my point. I'm not talking about normal operation, but
 about the case when the filesystem becomes corrupt, and fsck has to glue
 together the pieces. Consider reiserfs:

See my other on list mail about the fallacy of the idea that consistency
of garbage data blocks is any better than inconsistency.  As I mentioned
in it, even if it's a deleted file, a lost metadata block, etc., it will
always be consistent if it's a valid block to consider during rebuild
because *at some point in time* since the filesystem was created, it
will have been written.  Reiserfsck is just as susceptible to random
garbage on a single disk not part of any raid array as it is to
inconsistent blocks in a raid1 as it is to a fully synced raid1 array
with garbage that looks like a reiserfs.  That's a shortcoming of that
filesystem and there is no one to blame but Hans Reiser for that.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: avoiding the initial resync on --create

2006-10-09 Thread Doug Ledford
On Mon, 2006-10-09 at 15:10 -0400, Rob Bray wrote:
  On Mon, 2006-10-09 at 15:49 +0200, Erik Mouw wrote:
 
  There is no way to figure out what exactly is correct data and what is
  not. It might work right after creation and during the initial install,
  but after the next reboot there is no way to figure out what blocks to
  believe.
 
  You don't really need to.  After a clean install, the operating system
  has no business reading any block it didn't write to during the install
  unless you are just reading disk blocks for the fun of it.  And any
  program that depends on data that hasn't first been written to disk is
  just wrong and stupid anyway.
 
 I suppose a partial-stripe write would read back junk data on the other
 disks, xor with your write, and update the parity block.

The original email was about raid1 and the fact that reads from
different disks could return different data.  For that scenario, my
comments are accurate.  For the parity based raids, you never have two
disks with the same block, so you would only ever get different results
if you had a disk fail and the parity was never initialized.  For that
situation, you would need to init the parity on any stripe that has been
even partially written to.  Totally unwritten stripes could have any
parity you want since the data is undefined anyway, so who cares if it
changes when a disk fails and you are reconstructing from parity.

 If you benchmark the disk, you're going to be reading blocks you didn't
 necessarily write, which could kick out consistency errors.

The only benchmarks I know of that give a rats ass about the data
integrity are ones that write a pattern first and then read it back.  In
that case, parity would have been init'ed during the write.

 A whole-array consistency check would puke on the out-of-whack parity data.

Or a whole array consistency check on an array that hasn't had a whole
array parity init makes no sense.  You could create the array without
touching the parity, update parity on all stripes that are written,
leave a flag in the superblock indicating the array has never been
init'ed, and in the event of failure you can use the parity safe in the
knowledge that all stripes that have been written to have valid parity
and all other stripes we don't care about.  The main problem here is
that if we *did* need a consistency check, we couldn't tell errors from
uninit'ed stripes.  You could also make it so that the first time you
run a consistency check with the uninit'ed flag in the superblock set,
you calculate all parity and then clear the flag in the superblock and
on all subsequent runs you would then know when you have an error as
opposed to an uninit'ed block.
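For reference, on kernels new enough to have the md sysfs interface, the
sort of whole array check/repair being discussed here is driven like
this (md0 is just an example):

  echo check  > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt    # non-zero means inconsistencies were found
  echo repair > /sys/block/md0/md/sync_action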

Probably the best thing to do would be on create of the array, setup a
large all 0 block of mem and repeatedly write that to all blocks in the
array devices except parity blocks and use a large all 1 block for that.
Then you could just write the entire array at blinding speed.  You could
call that the quick-init option or something.  You wouldn't be able to
use the array until it was done, but it would be quick.  If you wanted
to be *really* fast, at least for SCSI drives you could write one large
chunk of 0's and one large chunk of 1's at the first parity block, then
use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
and likewise for the parity chunk, and avoid transferring the data over
the SCSI bus more than once.

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: avoiding the initial resync on --create

2006-10-09 Thread Doug Ledford
On Tue, 2006-10-10 at 07:33 +1000, Neil Brown wrote:
 On Monday October 9, [EMAIL PROTECTED] wrote:
  
  The original email was about raid1 and the fact that reads from
  different disks could return different data.
 
 To be fair, the original mail didn't mention raid1 at all.  It did
 mention raid5 and raid6 as a possible contrast so you could reasonably
 get the impression that it was talking about raid1.  But that wasn't
 stated.

OK, well I got that impression from the contrast ;-)

 Otherwise I agree.  There is no real need to perform the sync of a
 raid1 at creation.
 However it seems to be a good idea to regularly 'check' an array to
 make sure that all blocks on all disks get read to find sleeping bad
 blocks early.  If you didn't sync first, then every check will find
 lots of errors.  Of course you could 'repair' instead of 'check'.  Or
 do that once.  Or something.
 
 For raid6 it is also safe to not sync first, though with the same
 caveat as raid1.  Raid6 always updates parity by reading all blocks in
 the stripe that aren't known and calculating P and Q.  So the first
 write to a stripe will make P and Q correct for that stripe.
 This is current behaviour.  I don't think I can guarantee it will
 never changed.
 
 For raid5 it is NOT safe to skip the initial sync.  It is possible for
 all updates to be read-modify-write updates which assume the parity
 is correct.  If it is wrong, it stays wrong.  Then when you lose a
 drive, the parity blocks are wrong so the data you recover using them
 is wrong.

superblock->init_flag == FALSE: then make all writes parity generating,
not parity updating, writes (less efficient, so you would want to resync
the array and clear this up soon, but possible).

 In summary, it is safe to use --assume-clean on a raid1 or raid1o,
 though I would recommend a repair before too long.  For other raid
 levels it is best avoided.
 
  
  Probably the best thing to do would be on create of the array, setup a
  large all 0 block of mem and repeatedly write that to all blocks in the
  array devices except parity blocks and use a large all 1 block for that.
 
 No, you would want 0 for the parity block too.  0 + 0 = 0.

Sorry, I was thinking odd parity.

  Then you could just write the entire array at blinding speed.  You could
  call that the quick-init option or something.  You wouldn't be able to
  use the array until it was done, but it would be quick. 
 
 I doubt you would notice it being faster than the current
 resync/recovery that happens on creation.  We go at device-speed -
 either the buss device or the storage device depending on which is
 slower.

There's memory overhead though.  That can impact other operations the
cpu might do while in the process of recovering.

 
   If you wanted
  to be *really* fast, at least for SCSI drives you could write one large
  chunk of 0's and one large chunk of 1's at the first parity block, then
  use the SCSI COPY command to copy the 0 chunk everywhere it needs to go,
  and likewise for the parity chunk, and avoid transferring the data over
  the SCSI bus more than once.
 
 Yes, that might be measurably faster.  It is the sort of thing you might
 do in a hardware RAID controller but I doubt it would ever get done
 in md (there is a price for being very general).

Bleh...sometimes I really dislike always making things cater to the
lowest common denominator...you're never as good as you could be and you
are always as bad as the worst case...

-- 
Doug Ledford [EMAIL PROTECTED]
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: Strange intermittant errors + RAID doesn't fail the disk.

2006-07-07 Thread Doug Ledford
On Fri, 2006-07-07 at 00:29 +0200, Christian Pernegger wrote:

  I don't know exactly how the driver was responding to the bad cable,
  but it clearly wasn't returning an error, so md didn't fail it.
 
 There were a lot of errors in dmesg -- seems like they did not get
 passed up to md? I find it surprising that the md layer doesn't have
 its own timeouts, but then I know nothing about such things :)
 
 Thanks for clearing this up for me,
 
 C.
 
 [...]
 ata2: port reset, p_is 800 is 2 pis 0 cmd 44017 tf d0 ss 123 se 0
 ata2: status=0x50 { DriveReady SeekComplete }
 sdc: Current: sense key: No Sense
Additional sense: No additional sense information
 ata2: handling error/timeout
 ata2: port reset, p_is 0 is 0 pis 0 cmd 44017 tf 150 ss 123 se 0
 ata2: status=0x50 { DriveReady SeekComplete }
 ata2: error=0x01 { AddrMarkNotFound }
 sdc: Current: sense key: No Sense
Additional sense: No additional sense information
 [repeat]

This looks like a bad sd/sata lld interaction problem.  Specifically,
the sata driver wasn't filling in a suitable sense code block to
simulate auto-sense on the command, and the scsi disk driver was either
trying to get sense or retrying the same command.  Anyway, not an md
issue, a sata/scsi issue in terms of why it wasn't getting out of the
reset loop eventually.  I would send your bad cable to Jeff Garzik for
further analysis of the problem ;-)
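
For reference, a "suitable sense code block" is just a fixed-format SCSI
sense buffer with a real sense key in it.  A minimal sketch of what the
low-level driver should be handing up so that sd sees a MEDIUM ERROR
instead of "No Sense" (this is not the actual libata code, and the helper
name is made up):

#include <stdint.h>
#include <string.h>

static void fill_media_error_sense(uint8_t sense[18])
{
        memset(sense, 0, 18);
        sense[0]  = 0x70;       /* current error, fixed format  */
        sense[2]  = 0x03;       /* sense key: MEDIUM ERROR      */
        sense[7]  = 0x0a;       /* additional sense length      */
        sense[12] = 0x11;       /* ASC: unrecovered read error  */
        sense[13] = 0x00;       /* ASCQ                         */
}

Once a real error like that reaches the scsi disk driver it stops
retrying, the failure propagates up to md, and md can kick the member
out of the array instead of looping.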

-- 
Doug Ledford [EMAIL PROTECTED]
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AW: RAID1 and data safety?

2005-04-07 Thread Doug Ledford
 writing the end of journal entry), then you basically
wait for all your journal transactions to complete before sending the
end of journal transaction.  You don't have to wait for *all* writes to
the drive to complete, just the journal writes.  This is why performance
isn't killed by journaling.  The filesystem proper writes for previous
journal transactions can be taking place while you are doing this
waiting.
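
In outline the ordering looks like this (generic C pseudocode, not the
ext3/jbd sources; submit_write(), wait_for() and the struct fields are
made-up stand-ins):

#include <stddef.h>

#define MAX_JOURNAL_BLOCKS 64

struct buffer;                          /* opaque block buffer */

/* hypothetical helpers standing in for the real block-layer calls */
extern void submit_write(struct buffer *b);
extern void wait_for(struct buffer *b);

struct txn {
        struct buffer *journal_block[MAX_JOURNAL_BLOCKS];
        int            nblocks;
        struct buffer *commit_record;   /* the end-of-journal entry */
};

static void commit_transaction(struct txn *t)
{
        /* queue the journal copies of this transaction's blocks */
        for (int i = 0; i < t->nblocks; i++)
                submit_write(t->journal_block[i]);

        /* wait ONLY for those journal writes, not for every
         * outstanding write on the device */
        for (int i = 0; i < t->nblocks; i++)
                wait_for(t->journal_block[i]);

        /* only now is it safe to issue the end-of-journal record */
        submit_write(t->commit_record);
        wait_for(t->commit_record);

        /* the in-place filesystem writes for this transaction, and
         * writeback for earlier transactions, proceed in the background */
}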

 ---
 
 You mentioned data journaling, and it sounded like it works reliably. 
 Which of the existing journaling filesystems did you have in mind?

I use ext3 personally.  But that's as much because it's the default
filesystem and I know Stephen Tweedie will fix it if it's broken ;-)

 ---
 
 Afaik a read only reads from *one* HD (in raid1). So how to be sure 
 that *both* HDs are still perfectly o.k.? Am I fine to do a 
 cat /dev/hda2 > /dev/null ; cat /dev/hdb2 > /dev/null
 even *while* the md is active and getting used r/w?

It's ok to do this.  However, reads happen from both hard drives in a
raid1 array in a sort of round robin fashion.  You don't really know
which reads are going to go where, but each drive will get read from.
Doing what you suggest will get you a full read check on each drive and
do so safely.  Of course, if it's supported on your system, you could
also just enable the SMART daemon and have it tell the drives to do
continuous background media checks to detect sectors that are either
already bad or getting ready to go bad (corrected error conditions).
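
If you want something a little more informative than cat, a tiny
read-checker along these lines does the same job but reports where the
errors are (sketch only, user-space C; point it at each member device
in turn, e.g. /dev/hda2 then /dev/hdb2):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char buf[1 << 20];              /* 1 MB reads */
        off_t pos = 0;
        int errors = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s /dev/hdXN\n", argv[0]);
                return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n > 0) {
                        pos += n;
                        continue;
                }
                if (n == 0)
                        break;          /* end of device */
                fprintf(stderr, "%s: read error near byte %lld: %s\n",
                        argv[1], (long long)pos, strerror(errno));
                errors++;
                /* skip past the bad region and keep scanning */
                if (lseek(fd, pos + sizeof(buf), SEEK_SET) < 0)
                        break;
                pos += sizeof(buf);
        }
        printf("%s: %lld bytes scanned, %d error(s)\n",
               argv[1], (long long)pos, errors);
        close(fd);
        return errors ? 2 : 0;
}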

-- 
Doug Ledford [EMAIL PROTECTED]
http://www.xsintricity.com/dledford
http://www.livejournal.com/users/deerslayer
AIM: DeerObliterator


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 - failed disks - i'm confusing

2005-04-04 Thread Doug Ledford
On Fri, 2005-04-01 at 03:22 -0800, Alvin Oga wrote:
 On Fri, 1 Apr 2005, Gordon Henderson wrote:
  On Fri, 1 Apr 2005, Alvin Oga wrote:
  
 - ambient temp should be 65F or less
 and disk operating temp ( hddtemp ) should be 35 or less
  
  Are we confusing F and C here?
 
 65F was for normal server room environment 
   ( some folks use 72F for office )
 
 and i changed units to 35C for hd operating temp vs 25C 
   - most of my ide disks run at under 30C
   - p4-2.xG cpu temps under 40C
 
  hddtemp typically reports temperatures in C. 35F is bloody cold!
 
 nah ... i like my disks cold to the touch ... ( 2 fans per disks )

Just for the record, second guessing mechanical engineers with
thermodynamics background training and an eye towards differing material
expansion rates and the like can be risky.  This is like saying "Nah, I
like the engine in my car to run cold, so I use no thermostat and two
fans on the radiator."  It might sound like a good idea to you, but
proper cylinder to piston wall clearance is obtained at a specific
temperature (cylinder sleeves are typically some sort of iron or steel
compound and expand in diameter slower than the aluminum pistons when
heated to operating temperature, so the pistons are made smaller in
diameter at room temperature so that when both the sleeve and the piston
are at operating temperature the clearance will be correct).  Running an
engine at a lower temperature increases that clearance and can result in
premature piston failure.

As far as hard drive internals are concerned, I'm not positive whether
or not they are subject to the same sort of thermal considerations, but
just looking at the outside of a hard drive shows a very common case of
an aluminum cast frame and some sort of iron/steel based top plate.
These are going to expand at different rates with temperature and for
all I know if you run the drive overly cool, you may be placing undue
stress on the seal between these two parts of the drive (consider the
case of both the aluminum frame and the top plate having a channel for a
rubber o-ring, and until the drive reaches operating temp. the channels
may not line up perfectly, resulting in stress on the o-ring).

Anyway, it might or might not hurt the drives to run them well below
their designed operating temperature, I don't have schematics and
materials lists in front of me to tell for sure.  But second guessing
mechanical engineers that likely have compensated for thermal issues at
a given, specific common operating temperature is usually risky.  Most
people think "heat kills" and therefore like to keep things as cool as
possible.  For mechanical devices anyway, it's not so much that heat
kills, as it is operating outside of the designed temperature range,
either above or below, that reduces overall life expectancy.  Keep your
drives from overheating, but don't try to freeze them would be my
advice.


-- 
Doug Ledford [EMAIL PROTECTED]
http://people.redhat.com/dledford


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1-diseaster on reboot: old version overwrites new version

2005-04-04 Thread Doug Ledford
On Sat, 2005-04-02 at 09:35 -0800, Tim Moore wrote:
 
 peter pilsl wrote:
  The only explantion to me is, that I had the wrong entry in my 
  lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
  So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
  which was started, but never had any data written to it. Is this a 
  possible explanation?
 
 No.  The lilo.conf entry just tells the kernel where root is located.

Yes, as Neil posted, this exactly explains the issue.  If /dev/hda6 is
part of a raid1 array, and you write to it instead of /dev/md2, then
those writes are never sent to /dev/hdc6 and the two devices get out of
sync.  Plus, standard initrd setups and the like are written to
accommodate users passing in arbitrary root= options on the kernel
command line to override the default root partition, and in those
situations the root partition must be taken from the command line and
not from fstab in order for this to work.  So, whether it's lilo or grub
or whatever, the root= line on your kernel command line is *the*
authority when it comes to what will be mounted as the root partition
you actually use.

 Can you publish your /etc/fstab and fdisk -l output?

Keep in mind the root partition is already mounted in ro mode by the
time fstab is available and the rc.sysinit script merely remounts it rw.
Again, the command line is the authority.

-- 
Doug Ledford [EMAIL PROTECTED]
http://people.redhat.com/dledford


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 and data safety?

2005-04-04 Thread Doug Ledford
 A is busy completing some reads at the moment, and drive B
isn't and completes the end of journal write quickly.  The patch Peter
posted (or at least talked about, can't remember which) would then
return a completion event to the ext3 journal code.  The ext3 code would
then assume the journal was all complete and start issuing the writes
related to that journal transaction en-masse.  These writes will then go
to drives A and B.  Since drive A was busy with some reads, it gets
these writes prior to completing the end of transaction write it already
had in its queue.  Being a nice, smart SCSI disk with tagged queuing
enabled, it then proceeds to complete the whole queue of writes in
whatever order is most efficient for it.  It completes two of the writes
that were issued by the ext3 filesystem after the ext3 filesystem
thought the journal entry was complete, and then the machine has a power
supply failure and nothing else gets written.  As it turns out, drive A
is the first drive in the rdev array, so on reboot, it's selected as the
master for resync.  Now, that means that all the data, journal and
everything else, is going to be copied from drive A to drive B.  And
guess what.  We never completed that end of journal write on drive A, so
when the ext3 filesystem is mounted, that journal transaction is going
to be considered incomplete and *not* get replayed.  But we've also
written a couple of the updates from that transaction to disk A already.
Well, there you go, data corruption.  So, Peter, if you are still toying
with that patch, it's a *BAD* idea.
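
The md raid1 write path avoids this by signalling completion to the layer
above only when the *last* mirror write finishes.  Roughly, it amounts to
something like this (hypothetical names, not the actual driver code):

#include <stdatomic.h>

struct r1_write {
        atomic_int remaining;                   /* one count per mirror */
        void (*upper_done)(struct r1_write *w); /* completion callback  */
};

static void mirror_write_done(struct r1_write *w)
{
        /* report completion only when the last mirror has acknowledged */
        if (atomic_fetch_sub(&w->remaining, 1) == 1)
                w->upper_done(w);

        /* the shortcut being discussed would instead call upper_done()
         * on the first acknowledgement, leaving the other mirror's copy
         * of the commit block still in flight, which is exactly the
         * window described above */
}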

That's what using a journaling filesystem on top of an md device gets
you in terms of what problems the journaling solves for the md device.
In turn, a weakness of any journaling filesystem is that it is
inherently vulnerable to hard disk failures.  A drive failure takes out
the filesystem and your machine becomes unusable.  Obviously, this very
problem is what md solves for filesystems.  Whether talking about the
journal or the rest of the filesystem, if you let a hard drive error
percolate up to the filesystem, then you've failed in the goal of
software raid.  I remember talk once about how putting the journal on
the raid device was bad because it would cause the media in that area of
the drive to wear out faster.  The proper response to that is: "So.  I
don't care."  If that section of media wears out faster, fine by me,
because I'm smart and put both my journal and my filesystem on a
software raid device that allows me to replace the worn out device with
a fresh one without ever losing any data or suffering a crash.  The
goal of the md layer is not to prevent drive wear out, the goal is to
make us tolerant of drive failures so we don't care when they happen, we
simply replace the bad drive and go on.  Since drive failures happen on
a fairly regular basis without md, if the price of not suffering
problems as a result of those failures is that we slightly increase the
failure rate due to excessive writing in the journal area, then fine by
me.

In addition, if you use raid5 arrays like I do, then putting the journal
on the raid array is a huge win because of the outrageously high
sequential throughput of a raid5 array.  Journals are preallocated at
filesystem creation time and occupy a more or less sequential area on
the disks.  Journals are also more or less a ring buffer.  You can tune
the journal size to a reasonable multiple of a full stripe size on the
raid5 array (say something like 1 to 10 MB per disk, so in a 5 disk
raid5 array, I'd use between a 4 and 40MB journal, depending on whether
I thought I would be doing a lot of large writes of sufficient enough
size to utilize a large journal), turn on journaling of not just meta-
data but all data, and then benefit from the fact that the journal
writes take place as more or less sequential writes as seen by things
like tiobench benchmark runs, and because the typical filesystem writes
are usually much more random in nature, the journaling overhead can be
reduced to no more than, say, 25% performance loss while getting the
benefit of both meta-data and regular data journaled.  It's certainly
*far* faster than sticking the journal on some other device unless it's
another very fast raid array.
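
To put rough numbers on that (illustrative values only; the 128k chunk
and 5 disks are just an example, not a recommendation):

#include <stdio.h>

int main(void)
{
        int ndisks     = 5;                      /* raid5 members             */
        int chunk_kb   = 128;                    /* md chunk size             */
        int data_disks = ndisks - 1;             /* one chunk per stripe is parity */
        int stripe_kb  = chunk_kb * data_disks;  /* 512 KB of data per stripe */

        for (int mb_per_disk = 1; mb_per_disk <= 10; mb_per_disk *= 10) {
                int journal_kb = mb_per_disk * 1024 * data_disks;
                printf("%2d MB per data disk -> %2d MB journal = %d full stripes\n",
                       mb_per_disk, journal_kb / 1024, journal_kb / stripe_kb);
        }
        return 0;
}

That prints 4 MB = 8 full stripes and 40 MB = 80 full stripes for this
example, which is the point of sizing the journal as a multiple of the
stripe size: a busy journal then tends to be written in full stripes
and avoids read-modify-write on the parity.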

Anyway, I think the situation can be summed up as this:

See Peter try to admin lots of machines.
See Peter imagine problems that don't exist.
See Peter disable features that would make his life easier as Peter
takes steps to circumvent his imaginary problems.
See Peter stay at work over the New Year's holiday fixing problems that
were likely a result of his own efforts to avoid problems.
Don't be a Peter, listen to Neil.


-- 
Doug Ledford [EMAIL PROTECTED]


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 - failed disks - i'm confusing

2005-04-04 Thread Doug Ledford
On Mon, 2005-04-04 at 15:51 -0700, Alvin Oga wrote:
 
 On Mon, 4 Apr 2005, Doug Ledford wrote:
 
  Anyway, it might or might not hurt the drives to run them well below
  their designed operating temperature, I don't have schematics and
  materials lists in front of me to tell for sure.
 
 ez enough to do ... its called specs on the various manufacturers 
 websites ... similarly for the operating temp of the ICs on the
 disk controllers ..
 
 you're welcome to run your disks hot ...

I didn't say to run them hot, just design temp.  Overheating is bad,
just like you mentioned.

 i prefer to run it cool to the finger-touch test, with the server
 room at 65F
 
 and its a known fact for 40+ years ... heat kills electromechanical
 items,  car engines is a different animal for different reasons

Yes it does, and my point wasn't to say that it doesn't, just to say
that for the mechanical portion of electromechanical devices, excessive
cooling can be bad as well.

-- 
Doug Ledford [EMAIL PROTECTED]
http://people.redhat.com/dledford


-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html