Re: stride / stripe alignment on LVM ?

2007-11-02 Thread Janek Kozicki
Bill Davidsen said: (by the date of Fri, 02 Nov 2007 09:01:05 -0400)

> So I would expect this to make a very large performance difference; even
> if it worked, it would work slowly.

I was trying to find out the stripe layout for a few hours, using
hexedit and dd. And I'm baffled:

md1 : active raid5 hda3[0] sda3[1]
  969907968 blocks super 1.1 level 5, 128k chunk, algorithm 2 [3/2] [UU_]
  bitmap: 8/8 pages [32KB], 32768KB chunk

I fill md1 with random data:

# dd bs=128k count=64 if=/dev/urandom of=/dev/md1

# hexedit /dev/md1

I copy/paste (and remove formatting) the first 32 bytes of /dev/md1,
now I search for those 32 bytes in /dev/hda3 and in /dev/sda3:

# hexedit /dev/hda3
# hexedit /dev/sda3

And no luck! I'd expect the first bytes of /dev/md1 to be at the
beginning of the first drive (hda3).

I pick the next 20 bytes from /dev/md1 and I can find them on /dev/hda3
starting just after address 0x1. The bytes before and after those
20 bytes are similar to those on /dev/md1. So now I hexedit /dev/md1
and write by hand 32 bytes of 0xAA. Then I look at address 0x1
on /dev/hda3 - and there is no 0xAA at all.
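
One thing that might explain part of the offset, as a side note: with a
version 1.1 superblock the data area does not start at sector 0 of the
member device. A sketch of how to check where it does start (the grep is
just illustrative; mdadm should print a "Data Offset" field for v1
metadata when it is non-zero):

# mdadm -E /dev/hda3 | grep -i offset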

Well.. it's not critical for me, so you can just ignore my mumbling;
I was just wondering what obvious thing I missed. There seems to be more
XORing (or something else) involved than I expected.

Maybe the disc did not flush writes, and what I see on /dev/md1 is
not yet present on /dev/hda3 (how's that possible?)
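
One way to rule the caching question out, as a rough sketch (oflag=direct
needs a reasonably recent dd, and drop_caches a 2.6.16 or later kernel),
would be to force the writes out and drop the page cache before searching
the member devices again:

# dd bs=128k count=64 if=/dev/urandom of=/dev/md1 oflag=direct
# sync
# blockdev --flushbufs /dev/md1
# echo 3 > /proc/sys/vm/drop_caches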

Nevertheless, I think I will drop LVM and just put ext3 directly on
/dev/md1, to avoid this stripe misalignment. I wanted LVM here only
because I might want to use LVM snapshots, but I can live without
that. I can already grow /dev/md1 without LVM by using mdadm --grow.

best regards
-- 
Janek Kozicki |


Re: Implementing low level timeouts within MD

2007-11-02 Thread Alberto Alonso
On Fri, 2007-11-02 at 15:15 -0400, Doug Ledford wrote:
> It was tested; it obviously just had a bug that you hit.  Assuming that
> your particular failure situation is the only possible outcome for all
> the other people who used it would be an invalid assumption.  There are
> lots of code paths in an error handler routine, and lots of different
> hardware failure scenarios, and they each have their own independent
> outcome should they ever be experienced.

This is the kind of statement that made me say you were belittling my 
experiences. 

And to think that, since I've hit it on three different machines with
different hardware and different kernel versions, it won't affect
others is something else. I thought I was helping, but don't worry, I
learned my lesson; it won't happen again. I asked people for their
experiences; clearly not everybody is as lucky as I am.

> Then you didn't pay attention to what I said before: RHEL3 was the first
> ever RHEL product that had support for SATA hardware.  The SATA drivers
> in RHEL3 *were* first gen.

Oh, I paid attention alright. It is my fault for assuming that things
not marked as experimental are not experimental.

Alberto



Re: Implementing low level timeouts within MD

2007-11-02 Thread Doug Ledford
On Fri, 2007-11-02 at 13:21 -0500, Alberto Alonso wrote:
> On Fri, 2007-11-02 at 11:45 -0400, Doug Ledford wrote:
> 
> > The key word here being "supported".  That means if you run across a
> > problem, we fix it.  It doesn't mean there will never be any problems.
> 
> On hardware specs I normally read "supported" as "tested within that
> OS version to work within specs". I may be expecting too much.

It was tested; it obviously just had a bug that you hit.  Assuming that
your particular failure situation is the only possible outcome for all
the other people who used it would be an invalid assumption.  There are
lots of code paths in an error handler routine, and lots of different
hardware failure scenarios, and they each have their own independent
outcome should they ever be experienced.

> > I'm sorry, but given the "specially the RHEL" case you cited, it is
> > clear I can't help you.  No one can.  You were running first gen
> > software on first gen hardware.  You show me *any* software company
> whose first gen software never has to be updated to fix bugs, and I'll
> show you a software company that went out of business the day after
> > they released their software.
> 
> I only pointed to RHEL as an example since that was a particular
> distro that I use and exhibited the problem. I probably could have
> replaced it with Suse, Ubuntu, etc. I may have called the early
> versions back in 94 first gen but not today's versions. I know I 
> didn't expect the SLS distro to work reliably back then. 

Then you didn't pay attention to what I said before: RHEL3 was the first
ever RHEL product that had support for SATA hardware.  The SATA drivers
in RHEL3 *were* first gen.

> Can you provide specific chipsets that you used (specially for SATA)? 

All of the Adaptec SCSI chipsets through the 7899, Intel PATA, QLogic
FC, and nVidia and winbond based SATA.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-11-02 Thread Alberto Alonso
On Fri, 2007-11-02 at 11:45 -0400, Doug Ledford wrote:

> The key word here being "supported".  That means if you run across a
> problem, we fix it.  It doesn't mean there will never be any problems.

On hardware specs I normally read "supported" as "tested within that
OS version to work within specs". I may be expecting too much.

> I'm sorry, but given the "specially the RHEL" case you cited, it is
> clear I can't help you.  No one can.  You were running first gen
> software on first gen hardware.  You show me *any* software company
> whose first gen software never has to be updated to fix bugs, and I'll
> show you a software company that went out of business the day after
> they released their software.

I only pointed to RHEL as an example since that was a particular
distro that I use and exhibited the problem. I probably could have
replaced it with Suse, Ubuntu, etc. I may have called the early
versions back in 94 first gen but not today's versions. I know I 
didn't expect the SLS distro to work reliably back then. 

Thanks for reminding me of what I should and shouldn't consider 
first gen. I guess I should always wait for a couple of updates
prior to considering a distro stable, I'll keep that in mind in 
the future.

> I *really* can't help you.

And I never expected you to. None of my posts asked for support
to get my specific hardware and kernels working. I did ask for
help identifying combinations that work and those that don't.

The thread on low level timeouts within MD was meant as a
forward-thinking question to see if it could solve some of these
problems. It has been settled that it can't, so that's that. I am
really not trying to push the issue with MD timeouts.

> No, your experience, as you listed it, is that
> SATA/usb-storage/Serverworks PATA failed you.  The software raid never
> failed to perform as designed.

And I never said that software raid did anything outside what it
was designed to do. I did state that when the goal is to keep the
server from hanging (a reasonable goal if you ask me), the combination
of SATA/usb-storage/Serverworks PATA with software raid is not
a working solution (nor is it without software raid, for that
matter).

> However, one of the things you are doing here is drawing sweeping
> generalizations that are totally invalid.  You are saying your
> experience is that SATA doesn't work, but you aren't qualifying it with
> the key factor: SATA doesn't work in what kernel version?  It is
> pointless to try and establish whether or not something like SATA works
> in a global, all kernel inclusive fashion because the answer to the
> question varies depending on the kernel version.  And the same is true
> of pretty much every driver you can name.  This is why commercial

At the time of purchase the hardware vendor (Supermicro, for those
interested) listed RHEL v3, which is what got installed.

> companies don't just certify hardware, but the software version that
> actually works as opposed to all versions.  In truth, you have *no idea*
> if SATA works today, because you haven't tried.  As David pointed out,
> there was a significant overhaul of the SATA error recovery that took
> place *after* the kernel versions that failed you which totally
> invalidates your experiences and requires retesting of the later
> software to see if it performs differently.

I completely agree that retesting is needed based on the improvements
described. I don't think it invalidates my experiences, though; it does
date them, but that's fine. And yes, I see your point on always listing
specific kernel versions; I will do better with the details in the
future.

> I've had *lots* of success with software RAID as I've been running it
> for years.  I've had old PATA drives fail, SCSI drives fail, FC drives
> fail, and I've had SATA drives that got kicked from the array due to
> read errors but not out and out drive failures.  But I keep at least
> reasonably up to date with my kernels.
> 
Can you provide specific chipsets that you used (specially for SATA)? 

Thanks,

Alberto




Re: Implementing low level timeouts within MD

2007-11-02 Thread Alberto Alonso
On Fri, 2007-11-02 at 11:09 +, David Greaves wrote:

> David
> PS I can't really contribute to your list - I'm only using cheap desktop 
> hardware.
> -

If you had failures and it properly handled them, then you can 
contribute to the good combinations, so far that's the list
that is kind of empty :-(

Thanks,

Alberto



Re: switching root fs '/' to boot from RAID1 with grub

2007-11-02 Thread berk walker

H. Peter Anvin wrote:

Doug Ledford wrote:


device (hd0) /dev/sda
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst

device (hd0) /dev/hdc
root (hd0,0)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) 
/boot/grub/e2fs_stage1_5 p /boot/grub/stage2 /boot/grub/menu.lst


That will install grub on the master boot record of hdc and sda, and in
both cases grub will look to whatever drive it is running on for the
files to boot instead of going to a specific drive.
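
A rough sketch of feeding two such stanzas to the grub shell
non-interactively (grub legacy's --batch mode and its setup shortcut are
used here in place of the explicit install lines above, so treat this as
illustrative rather than as Doug's exact procedure):

# grub --batch <<EOF
device (hd0) /dev/sda
root (hd0,0)
setup (hd0)
device (hd0) /dev/hdc
root (hd0,0)
setup (hd0)
quit
EOF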



No, it won't... it'll look for the first drive in the system (BIOS 
drive 80h).  This means that if the BIOS can see the bad drive, but it 
doesn't work, you're still screwed.


-hpa

Depends how "bad" the drive is.  Just to align the thread on this: if
the boot sector is bad, the BIOS on newer boxes will skip to the next
one.  But if it is "good" and you boot into garbage (could be
Windows..), does it crash?


b



Re: switching root fs '/' to boot from RAID1 with grub

2007-11-02 Thread Doug Ledford
On Thu, 2007-11-01 at 11:57 -0700, H. Peter Anvin wrote:
> Doug Ledford wrote:
> > 
> > Correct, and that's what you want.  The alternative is that if the BIOS
> > can see the first disk but it's broken and can't be used, and if you
> > have the boot sector on the second disk set to read from BIOS disk 0x81
> > because you ASSuMEd the first disk would be broken but still present in
> > the BIOS tables, then your machine won't boot unless that first dead but
> > preset disk is present.  If you remove the disk entirely, thereby
> > bumping disk 0x81 to 0x80, then you are screwed.  If you have any drive
> > failure that prevents the first disk from being recognized (blown fuse,
> > blown electronics, etc), you are screwed until you get a new disk to
> > replace it.
> > 
> 
> What you want is for it to use the drive number that BIOS passes into it 
> (register DL), not a hard-coded number.  That was my (only) point -- 
> you're obviously right that hard-coding a number to 0x81 would be worse 
> than useless.

Oh, and I forgot to mention that in grub2, the DL register is ignored
for RAID1 devices.  Well, maybe not ignored, but once grub2 has
determined that the intended boot partition is a raid partition, the
raid code takes over and the raid code doesn't care about the DL
register.  Instead, it scans for all the other members of the raid array
and utilizes whichever drives it needs in order to complete the boot
process.  And since it reads a sector (or a small group of sectors)
at a time, it doesn't need any member of a raid1 array to be perfect; it
will attempt a round-robin read on all the sectors and only fail if all
drives return an error for a given read.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Time to deprecate old RAID formats?

2007-11-02 Thread Doug Ledford
On Thu, 2007-11-01 at 14:02 -0700, H. Peter Anvin wrote:
> Doug Ledford wrote:
> >>
> >> I would argue that ext[234] should be clearing those 512 bytes.  Why
> >> aren't they cleared  
> > 
> > Actually, I didn't think msdos used the first 512 bytes for the same
> > reason ext3 doesn't: space for a boot sector.
> > 
> 
> The creators of MS-DOS put the superblock in the bootsector, so that the 
> BIOS loads them both.  It made sense in some diseased Microsoft 
> programmer's mind.
> 
> Either way, for RAID-1 booting, the boot sector really should be part of 
> the protected area (and go through the MD stack.)

It depends on what you are calling the protected area.  If by that you
mean outside the filesystem itself, and in a non-replicated area like
where the superblock and internal bitmaps go, then yes, that would be
ideal.  If you mean in the file system proper, then that depends on the
boot loader.

>   The bootloader should 
> deal with the offset problem by storing partition/filesystem-relative 
> pointers, not absolute ones.

Grub2 is on the way to this, but it isn't there yet.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Implementing low level timeouts within MD

2007-11-02 Thread Doug Ledford
On Fri, 2007-11-02 at 03:41 -0500, Alberto Alonso wrote:
> On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
> > Not in the older kernel versions you were running, no.
> 
> These "old versions" (specially the RHEL) are supposed to be
> the official versions supported by Redhat and the hardware 
> vendors, as they were very specific as to what versions of 
> Linux were supported.

The key word here being "supported".  That means if you run across a
problem, we fix it.  It doesn't mean there will never be any problems.

>  Of all people, I would think you would
> appreciate that. Sorry if I sound frustrated and upset, but 
> it is clearly a result of what "supported and tested" really 
> means in this case.

I'm sorry, but given the "specially the RHEL" case you cited, it is
clear I can't help you.  No one can.  You were running first gen
software on first gen hardware.  You show me *any* software company
whose first gen software never has to be updated to fix bugs, and I'll
show you a software company that went out of business the day after
they released their software.

Our RHEL3 update kernels contained *significant* updates to the SATA
stack after our GA release, replete with hardware driver updates and bug
fixes.  I don't know *when* that RHEL3 system failed, but I would
venture a guess that it wasn't prior to RHEL3 Update 1.  So, I'm
guessing you didn't take advantage of those bug fixes.  And I would
hardly call once a quarter "continuously updating" your kernel.  In any
case, given your insistence on running first gen software on first gen
hardware and not taking advantage of the support we *did* provide to
protect you against that failure, I say again that I can't help you.

>  I don't want to go into a discussion of
> commercial distros, which are "supported", as this is neither the
> time nor the place, but I don't want to open the door to the
> excuse of "it's an old kernel"; it wasn't old when it got installed.

I *really* can't help you.

> Outside of the rejected suggestion, I just want to figure out 
> when software raid works and when it doesn't. With SATA, my 
> experience is that it doesn't. So far I've only received one 
> response stating success (they were using the 3ware and Areca 
> product lines).

No, your experience, as you listed it, is that
SATA/usb-storage/Serverworks PATA failed you.  The software raid never
failed to perform as designed.

However, one of the things you are doing here is drawing sweeping
generalizations that are totally invalid.  You are saying your
experience is that SATA doesn't work, but you aren't qualifying it with
the key factor: SATA doesn't work in what kernel version?  It is
pointless to try and establish whether or not something like SATA works
in a global, all kernel inclusive fashion because the answer to the
question varies depending on the kernel version.  And the same is true
of pretty much every driver you can name.  This is why commercial
companies don't just certify hardware, but the software version that
actually works as opposed to all versions.  In truth, you have *no idea*
if SATA works today, because you haven't tried.  As David pointed out,
there was a significant overhaul of the SATA error recovery that took
place *after* the kernel versions that failed you which totally
invalidates your experiences and requires retesting of the later
software to see if it performs differently.

> Anyway, this thread just posed the question, and as Neil pointed
> out, it isn't feasible/worthwhile to implement timeouts within the md
> code. I think most of the points/discussions raised beyond that
> original question really belong to the thread "Software RAID when 
> it works and when it doesn't" 
> 
> I do appreciate all comments and suggestions and I hope to keep
> them coming. I would hope however to hear more about success
> stories with specific hardware details. It would be helpful
> to have a list of tested configurations that are known to work.

I've had *lots* of success with software RAID as I've been running it
for years.  I've had old PATA drives fail, SCSI drives fail, FC drives
fail, and I've had SATA drives that got kicked from the array due to
read errors but not out and out drive failures.  But I keep at least
reasonably up to date with my kernels.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband




Re: Superblocks

2007-11-02 Thread Greg Cormier
Any reason 0.9 is the default? Should I be worried about using 1.0
superblocks? And can I "upgrade" my array from 0.9 to 1.0 superblocks?

Thanks,
Greg

On 11/1/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > Which is the default type of superblock? 0.90 or 1.0?
>
> The default default is 0.90.
> However a local default can be set in mdadm.conf with e.g.
>    CREATE metadata=1.0
>
> NeilBrown
>
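
For reference, a sketch of both forms (the device names below are
illustrative, not from this thread): either set the default in
/etc/mdadm.conf as Neil describes,

   CREATE metadata=1.0

or ask for it explicitly when creating an array:

   # mdadm --create /dev/md2 --metadata=1.0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2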


Re: does mdadm try to use the fastest HDD?

2007-11-02 Thread Bill Davidsen

Janek Kozicki wrote:

Hello,

My three HDDs have the following speeds:

  hda - speed 70 MB/sec
  hdc - speed 27 MB/sec
  sda - speed 60 MB/sec

They form a raid1 array (/dev/md0) and a raid5 array (/dev/md1). I wanted to
ask whether mdadm tries to pick the fastest HDD during operation.

Maybe I can "tell" it which HDD is preferred?
  


If you are doing raid-1 between hdc and some faster drive, you could try
using write-mostly and see how that works for you. For raid-5, it's
faster to read the data off the slow drive than to reconstruct it with
multiple reads from multiple other faster drives.
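
A sketch of how that flag is applied (partition names here are made up to
match Janek's drives; --write-mostly marks the devices listed after it):

  # mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hda1 --write-mostly /dev/hdc1

or, for an existing mirror, the slow member can be failed, removed and
re-added with the flag set (this normally triggers a resync of that member):

  # mdadm /dev/md0 --fail /dev/hdc1 --remove /dev/hdc1
  # mdadm /dev/md0 --add --write-mostly /dev/hdc1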

This came to my mind when I saw this:

  # mdadm --query --detail /dev/md1 | grep Prefer
 
  Preferred Minor : 1


And also in the manual:

  -W, --write-mostly [...] "can be useful if mirroring over a slow link."


many thanks for all your help!
  

I have two thoughts on this:
1 - if performance is critical, replace the slow drive
2 - for most things you do, I would expect seek to be more important 
than transfer rate


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: stride / stripe alignment on LVM ?

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Thursday November 1, [EMAIL PROTECTED] wrote:
  

Hello,

I have raid5 /dev/md1, --chunk=128 --metadata=1.1. On it I have
created LVM volume called 'raid5', and finally a logical volume
'backup'.

Then I formatted it with command:

   mkfs.ext3 -b 4096 -E stride=32 -E resize=550292480 /dev/raid5/backup

And because LVM is putting its own metadata on /dev/md1, the ext3
partition is shifted by some (unknown for me) amount of bytes from
the beginning of /dev/md1.

I was wondering, how big is the shift, and would it hurt the
performance/safety if the `ext3 stride=32` didn't align perfectly
with the physical stripes on HDD?



It is probably better to ask this question on an ext3 list as people
there might know exactly what 'stride' does.

I *think* it causes the inode tables to be offset in different
block-groups so that they are not all on the same drive.  If that is
the case, then an offset caused by LVM isn't going to make any
difference at all.
  


Actually, I think that all of the performance evil Doug was mentioning
will apply to LVM as well. So if things are poorly aligned they will be
poorly handled: a stripe-sized write will not go in a single stripe, but
will overlap chunks and cause all the data from those chunks to be read
back for a new raid-5 calculation.


So I would expect this to make a very large performance difference; even
if it worked, it would work slowly.
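
To put numbers on that for the array in this thread (a sketch; the
stripe-width extended option needs a recent enough e2fsprogs):

  chunk 128 KiB, 3-disk raid5  -> 2 data disks
  stride      = 128 KiB / 4 KiB block          = 32 blocks
  full stripe = 2 * 128 KiB = 256 KiB          = 64 blocks

  # mkfs.ext3 -b 4096 -E stride=32,stripe-width=64 /dev/raid5/backup

so whatever offset LVM puts in front of the filesystem needs to be a
multiple of 256 KiB (or at least of the 128 KiB chunk) for stripe-sized
writes to stay stripe-aligned.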


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Very small internal bitmap after recreate

2007-11-02 Thread Ralf Müller


On 02.11.2007 at 12:43, Neil Brown wrote:


For now, you will have to live with a smallish bitmap, which probably
isn't a real problem.


Ok then.


 Array Slot : 3 (0, 1, failed, 2, 3, 4)
Array State : uuUuu 1 failed

This time I'm getting nervous - Array State failed doesn't sound  
good!


This is nothing to worry about - just a bad message from mdadm.

The superblock has recorded that there was once a device in position 2
which is now failed (See the list in "Array Slot").
This summarises as "1 failed" in "Array State".

But the array is definitely working OK now.


Good to know.

Thanks a lot
Ralf


Re: Implementing low level timeouts within MD

2007-11-02 Thread Bill Davidsen

Alberto Alonso wrote:

On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
  

Not in the older kernel versions you were running, no.



These "old versions" (specially the RHEL) are supposed to be
the official versions supported by Redhat and the hardware 
vendors, as they were very specific as to what versions of 
Linux were supported.


So the vendors of the failing drives claimed that these kernels were 
supported? That's great, most vendors don't even consider Linux 
supported. What response did you get when you reported the problem to 
Redhat on your RHEL support contract? Did they agree that this hardware, 
and its use for software raid, was supported and intended?



 Of all people, I would think you would
appreciate that. Sorry if I sound frustrated and upset, but 
it is clearly a result of what "supported and tested" really 
means in this case. I don't want to go into a discussion of

commercial distros, which are "supported", as this is neither the
time nor the place, but I don't want to open the door to the
excuse of "it's an old kernel"; it wasn't old when it got installed.
  
The problem is in the time travel module. It didn't properly cope with 
future hardware, and since you have very long uptimes, I'm reasonably 
sure you haven't updated the kernel to get fixes installed.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Superblocks

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Tuesday October 30, [EMAIL PROTECTED] wrote:
  

Which is the default type of superblock? 0.90 or 1.0?



The default default is 0.90.
However a local default can be set in mdadm.conf with e.g.
   CREATE metadata=1.0

  


If you change to the 1.start, 1.end, 1.4k names for clarity, they need to be
accepted here as well.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: Time to deprecate old RAID formats?

2007-11-02 Thread Bill Davidsen

Neil Brown wrote:

On Friday October 26, [EMAIL PROTECTED] wrote:
  
Perhaps you could have called them 1.start, 1.end, and 1.4k in the 
beginning? Isn't hindsight wonderful?





Those names seem good to me.  I wonder if it is safe to generate them
in "-Eb" output

  
If you agree that they are better, using them in the obvious places
would be better now than later. Are you going to put them in the
metadata options as well? Let me know; reviewing the documentation is
on my list for next week, and I could include some text.

Maybe the key confusion here is between "version" numbers and
"revision" numbers.
When you have multiple versions, there is no implicit assumption that
one is better than another. "Here is my version of what happened, now
let's hear yours".
When you have multiple revisions, you do assume ongoing improvement.

v1.0, v1.1 and v1.2 are different versions of the v1 superblock, which
itself is a revision of the v0...
  


Like kernel releases, people assume that the first number means *big* 
changes, the second incremental change.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



does mdadm try to use the fastest HDD?

2007-11-02 Thread Janek Kozicki
Hello,

My three HDDs have the following speeds:

  hda - speed 70 MB/sec
  hdc - speed 27 MB/sec
  sda - speed 60 MB/sec

They form a raid1 array (/dev/md0) and a raid5 array (/dev/md1). I wanted to
ask whether mdadm tries to pick the fastest HDD during operation.

Maybe I can "tell" it which HDD is preferred?

This came to my mind when I saw this:

  # mdadm --query --detail /dev/md1 | grep Prefer
 
  Preferred Minor : 1

And also in the manual:

  -W, --write-mostly [...] "can be useful if mirroring over a slow link."


many thanks for all your help!
-- 
Janek Kozicki |


Re: Very small internal bitmap after recreate

2007-11-02 Thread Neil Brown
On Friday November 2, [EMAIL PROTECTED] wrote:
> 
> Am 02.11.2007 um 10:22 schrieb Neil Brown:
> 
> > On Friday November 2, [EMAIL PROTECTED] wrote:
> >> I have a 5 disk version 1.0 superblock RAID5 which had an internal
> >> bitmap that has been reported to have a size of 299 pages in /proc/
> >> mdstat. For whatever reason I removed this bitmap (mdadm --grow --
> >> bitmap=none) and recreated it afterwards (mdadm --grow --
> >> bitmap=internal). Now it has a reported size of 10 pages.
> >>
> >> Do I have a problem?
> >
> > Not a big problem, but possibly a small problem.
> > Can you send
> >mdadm -E /dev/sdg1
> > as well?
> 
> Sure:
> 
> # mdadm -E /dev/sdg1
> /dev/sdg1:
>Magic : a92b4efc
>  Version : 01
>  Feature Map : 0x1
>   Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
> Name : 1
>Creation Time : Wed Oct 31 14:30:55 2007
>   Raid Level : raid5
> Raid Devices : 5
> 
>Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
>   Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
>Used Size : 625136896 (298.09 GiB 320.07 GB)
> Super Offset : 625137264 sectors

So there are 256 sectors before the superblock where a bitmap could go,
or about 6 sectors afterwards.

>State : clean
>  Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d
> 
> Internal Bitmap : 2 sectors from superblock

And the '6 sectors afterwards' was chosen.
6 sectors has room for 5*512*8 = 20480 bits,
and from your previous email:
>   Bitmap : 19078 bits (chunks), 0 dirty (0.0%)
you have 19078 bits, which is about right (as the bitmap chunk size
must be a power of 2).
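
(As a quick sanity check, a sketch using the two figures above: dividing
the Sync Size by the 16384 KiB bitmap chunk and rounding up to whole
chunks reproduces the reported bit count.)

  # echo $(( (312568448 + 16384 - 1) / 16384 ))
  19078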

So the problem is that "mdadm -G" is putting the bitmap after the
superblock rather than considering the space before
(checks code)

Ahh, I remember now.  There is currently no interface to tell the
kernel where to put the bitmap when creating one on an active array,
so it always puts it in the 'safe' place.  Another enhancement waiting
for time.

For now, you will have to live with a smallish bitmap, which probably
isn't a real problem.  With 19078 bits, you will still get a
several-thousand-fold increase in resync speed after a crash
(i.e. hours become seconds), and to some extent fewer bits are better,
as you have to update them less.

I haven't made any measurements to see what size bitmap is
ideal... maybe someone should :-)

>  Update Time : Fri Nov  2 07:46:38 2007
> Checksum : 4ee307b3 - correct
>   Events : 408088
> 
>   Layout : left-symmetric
>   Chunk Size : 128K
> 
>  Array Slot : 3 (0, 1, failed, 2, 3, 4)
> Array State : uuUuu 1 failed
> 
> This time I'm getting nervous - Array State failed doesn't sound good!

This is nothing to worry about - just a bad message from mdadm.

The superblock has recorded that there was once a device in position 2
which is now failed (See the list in "Array Slot").
This summarises as "1 failed" in "Array State".

But the array is definitely working OK now.

NeilBrown


Re: stride / stripe alignment on LVM ?

2007-11-02 Thread Michal Soltys

Janek Kozicki wrote:


And because LVM is putting its own metadata on /dev/md1, the ext3
partition is shifted by some (unknown for me) amount of bytes from
the beginning of /dev/md1.



It seems to be a multiple of 64 KiB. You can specify it during pvcreate with
the --metadatasize option. It will be rounded up to a multiple of 64 KiB, and
another 64 KiB will be added on top of that. Extents will follow directly
after that. The 4 sectors mentioned in pvcreate's man page are covered by
that option as well.


So, for example, if you have a 1 MiB chunk, then pvcreate ... --metadatasize 960K ...
should give you chunk-aligned logical volumes, assuming you have the
extent size set appropriately as well. If you use the default chunk size, you
shouldn't need any extra options.


Verify that it really works out this way after creating the PV/VG/first LV.
I found it experimentally, so YMMV.
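
A sketch of how that rule might be applied to the 128 KiB chunk array from
earlier in the thread (two data disks, so a 256 KiB full stripe); the 192k
figure just follows from the round-up-plus-64 KiB behaviour described
above, and the sizes are illustrative, so verify the result the same way:

  # pvcreate --metadatasize 192k /dev/md1
  # vgcreate -s 4M raid5 /dev/md1
  # lvcreate -n backup -L 500G raid5
  # pvs -o pv_name,pe_start --units k /dev/md1   (check where the first extent starts)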



Re: Implementing low level timeouts within MD

2007-11-02 Thread David Greaves
Alberto Alonso wrote:
> On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
>> Not in the older kernel versions you were running, no.
> 
> These "old versions" (specially the RHEL) are supposed to be
> the official versions supported by Redhat and the hardware 
> vendors, as they were very specific as to what versions of 
> Linux were supported. Of all people, I would think you would
> appreciate that. Sorry if I sound frustrated and upset, but 
> it is clearly a result of what "supported and tested" really 
> means in this case. I don't want to go into a discussion of
> commercial distros, which are "supported", as this is neither the
> time nor the place, but I don't want to open the door to the
> excuse of "it's an old kernel"; it wasn't old when it got installed.

It may be worth noting that the context of this email is the upstream linux-raid
 list. In my time watching the list it is mainly focused on 'current' code and
development (but hugely supportive of older environments).
In general, discussions in this context will have a certain mindset - and it's
not going to be the same as that which you'd find in an enterprise product
support list.

> Outside of the rejected suggestion, I just want to figure out 
> when software raid works and when it doesn't. With SATA, my 
> experience is that it doesn't.

SATA, or more precisely, error handling in SATA has recently been significantly
overhauled by Tejun Heo (IIRC). We're talking post 2.6.18 though (again IIRC) -
so as far as SATA EH goes, older kernels bear no relation to the new ones.

And the initial SATA EH code was, of course, beta :)

David
PS I can't really contribute to your list - I'm only using cheap desktop 
hardware.


Re: Very small internal bitmap after recreate

2007-11-02 Thread Ralf Müller


On 02.11.2007 at 11:22, Ralf Müller wrote:



# mdadm -E /dev/sdg1
/dev/sdg1:
  Magic : a92b4efc
Version : 01
Feature Map : 0x1
 Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
   Name : 1
  Creation Time : Wed Oct 31 14:30:55 2007
 Raid Level : raid5
   Raid Devices : 5

  Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
 Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
  Used Size : 625136896 (298.09 GiB 320.07 GB)
   Super Offset : 625137264 sectors
  State : clean
Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d

Internal Bitmap : 2 sectors from superblock
Update Time : Fri Nov  2 07:46:38 2007
   Checksum : 4ee307b3 - correct
 Events : 408088

 Layout : left-symmetric
 Chunk Size : 128K

Array Slot : 3 (0, 1, failed, 2, 3, 4)
   Array State : uuUuu 1 failed

This time I'm getting nervous - Array State failed doesn't sound good!


Just to make it clear - the array is still reported active in
/proc/mdstat and behaves well - no failed devices:

md1 : active raid5 sdd1[0] sdh1[5] sdf1[4] sdg1[3] sde1[1]
  1250273792 blocks super 1.0 level 5, 128k chunk, algorithm 2  
[5/5] [U]

  bitmap: 0/10 pages [0KB], 16384KB chunk

Regards
Ralf


Re: Very small internal bitmap after recreate

2007-11-02 Thread Ralf Müller


On 02.11.2007 at 10:22, Neil Brown wrote:


On Friday November 2, [EMAIL PROTECTED] wrote:

I have a 5 disk version 1.0 superblock RAID5 which had an internal
bitmap that has been reported to have a size of 299 pages in /proc/
mdstat. For whatever reason I removed this bitmap (mdadm --grow --
bitmap=none) and recreated it afterwards (mdadm --grow --
bitmap=internal). Now it has a reported size of 10 pages.

Do I have a problem?


Not a big problem, but possibly a small problem.
Can you send
   mdadm -E /dev/sdg1
as well?


Sure:

# mdadm -E /dev/sdg1
/dev/sdg1:
  Magic : a92b4efc
Version : 01
Feature Map : 0x1
 Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
   Name : 1
  Creation Time : Wed Oct 31 14:30:55 2007
 Raid Level : raid5
   Raid Devices : 5

  Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
 Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
  Used Size : 625136896 (298.09 GiB 320.07 GB)
   Super Offset : 625137264 sectors
  State : clean
Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d

Internal Bitmap : 2 sectors from superblock
Update Time : Fri Nov  2 07:46:38 2007
   Checksum : 4ee307b3 - correct
 Events : 408088

 Layout : left-symmetric
 Chunk Size : 128K

Array Slot : 3 (0, 1, failed, 2, 3, 4)
   Array State : uuUuu 1 failed

This time I'm getting nervous - Array State failed doesn't sound good!

Regards
Ralf


Very small internal bitmap after recreate

2007-11-02 Thread Ralf Müller
I have a 5 disk version 1.0 superblock RAID5 which had an internal  
bitmap that has been reported to have a size of 299 pages in /proc/ 
mdstat. For whatever reason I removed this bitmap (mdadm --grow -- 
bitmap=none) and recreated it afterwards (mdadm --grow -- 
bitmap=internal). Now it has a reported size of 10 pages.


Do I have a problem?

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid5 sdd1[0] sdh1[5] sdf1[4] sdg1[3] sde1[1]
  1250273792 blocks super 1.0 level 5, 128k chunk, algorithm 2  
[5/5] [U]

  bitmap: 0/10 pages [0KB], 16384KB chunk

# mdadm -X /dev/sdg1
Filename : /dev/sdg1
   Magic : 6d746962
 Version : 4
UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
  Events : 408088
  Events Cleared : 408088
   State : OK
   Chunksize : 16 MB
  Daemon : 5s flush period
  Write Mode : Normal
   Sync Size : 312568448 (298.09 GiB 320.07 GB)
  Bitmap : 19078 bits (chunks), 0 dirty (0.0%)

# mdadm --version
mdadm - v2.6.2 - 21st May 2007

# uname -a
Linux DatenGrab 2.6.22.9-0.4-default #1 SMP 2007/10/05 21:32:04 UTC  
i686 i686 i386 GNU/Linux



Regards Ralf



Re: Software RAID when it works and when it doesn't

2007-11-02 Thread Alberto Alonso
On Sat, 2007-10-27 at 11:26 -0400, Bill Davidsen wrote:
> Alberto Alonso wrote:
> > On Fri, 2007-10-26 at 18:12 +0200, Goswin von Brederlow wrote:
> >
> >   
> >> Depending on the hardware you can still access a different disk while
> >> another one is reseting. But since there is no timeout in md it won't
> >> try to use any other disk while one is stuck.
> >>
> >> That is exactly what I miss.
> >>
> >> MfG
> >> Goswin
> >> -
> >> 
> >
> > That is exactly what I've been talking about. Can md implement
> > timeouts and not just leave it to the drivers?
> >
> > I can't believe it but last night another array hit the dust when
> > 1 of the 12 drives went bad. This year is just a nightmare for
> > me. It brought all the network down until I was able to mark it
> > failed and reboot to remove it from the array.
> >   
> 
> I'm not sure what kind of drives and drivers you use, but I certainly 
> have drives go bad and they get marked as failed. Both on old PATA 
> drives and newer SATA. All the SCSI I currently use is on IBM hardware 
> RAID (ServeRAID), so I can only assume that failure would be noted.
> 
-- 
Alberto AlonsoGlobal Gate Systems LLC.
(512) 351-7233http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups



Re: Implementing low level timeouts within MD

2007-11-02 Thread Alberto Alonso
On Thu, 2007-11-01 at 15:16 -0400, Doug Ledford wrote:
> I wasn't belittling them.  I was trying to isolate the likely culprit in
> the situations.  You seem to want the md stack to time things out.  As
> has already been commented by several people, myself included, that's a
> band-aid and not a fix in the right place.  The linux kernel community
> in general is pretty hard lined when it comes to fixing the bug in the
> wrong way.

It did sound as if I was complaining about nothing and that I shouldn't
bother the linux-raid people and instead just continuously update the
kernel and stop raising issues. If I misunderstood you I'm sorry, but
somehow I still think that belittling my problems was implied in your
responses.

> Not in the older kernel versions you were running, no.

These "old versions" (specially the RHEL) are supposed to be
the official versions supported by Redhat and the hardware 
vendors, as they were very specific as to what versions of 
Linux were supported. Of all people, I would think you would
appreciate that. Sorry if I sound frustrated and upset, but 
it is clearly a result of what "supported and tested" really 
means in this case. I don't want to go into a discussion of
commercial distros, which are "supported", as this is neither the
time nor the place, but I don't want to open the door to the
excuse of "it's an old kernel"; it wasn't old when it got installed.

> And I guarantee not a single one of those systems even knows what SATA
> is.  They all use tried and true SCSI/FC technology.

Sure, the Tru64 units I talked about don't use SATA (although
some did use PATA); I'll concede that point.

> In any case, if Neil is so inclined to do so, he can add timeout code
> into the md stack, it's not my decision to make.

The timeout was nothing more than a suggestion based on what
I consider a reasonable expectation of usability. Neil said no,
and I respect that. If I didn't, I could always write my own, as
per the open source model :-) But I am not inclined to do so.

Outside of the rejected suggestion, I just want to figure out 
when software raid works and when it doesn't. With SATA, my 
experience is that it doesn't. So far I've only received one 
response stating success (they were using the 3ware and Areca 
product lines).

Anyway, this thread just posed the question, and as Neil pointed
out, it isn't feasible/worthwhile to implement timeouts within the md
code. I think most of the points/discussions raised beyond that
original question really belong to the thread "Software RAID when 
it works and when it doesn't" 

I do appreciate all comments and suggestions and I hope to keep
them coming. I would hope however to hear more about success
stories with specific hardware details. It would be helpful
to have a list of tested configurations that are known to work.

Alberto



Re: Bug in processing dependencies by async_tx_submit() ?

2007-11-02 Thread Yuri Tikhonov

 Hi Dan,

On Friday 02 November 2007 03:36, Dan Williams wrote:
> >   This happens because of the specific implementation of
> >  dma_wait_for_async_tx().
> 
> So I take it you are not implementing interrupt based callbacks in your 
driver?

 Why not? I have interrupt-based callbacks in my driver. An INTERRUPT 
descriptor, implemented for both (COPY and XOR) channels, does the callback 
upon its completion.

 Here is an example where your implementation of dma_wait_for_async_tx() will 
not work as expected. Let's say we have OP1 <--depends on-- OP2 <--depends on-- 
OP3, where

 OP1: cookie = -EBUSY, channel = DMA0; <- not submitted
 OP2: cookie = -EBUSY, channel = DMA0; <- not submitted
 OP3: cookie = 101, channel = DMA1; <- submitted, but not linked to h/w

 where cookie == 101 is some valid, positive cookie; and this fact means that 
OP3 *was submitted* to the DMA1 channel but *perhaps was not linked* to the 
h/w chain, for example, because the threshold for DMA1 has not been reached yet.
 
 With your implementation of dma_wait_for_async_tx() we do dma_sync_wait(OP2).
I propose to do dma_sync_wait(OP3) instead, because in your case we may never
wait for OP2 to complete: dma_sync_wait() flushes the DMA0 chain to h/w, but
OP3 on DMA1 remains unlinked to h/w and blocks the whole chain of
dependencies.

> >   The "iter" we finally wait for there corresponds to the last allocated
> >  but not-yet-submitted descriptor. But if the "iter" we are waiting for
> >  depends on another descriptor which has cookie > 0, but is not yet
> >  submitted to the h/w channel because the threshold has not been reached
> >  by that moment, then we may wait in dma_wait_for_async_tx()
> >  indefinitely. I think that it makes more sense to get the first descriptor
> >  which was submitted to the channel but probably is not put into the h/w
> >  chain, i.e. with cookie > 0, and do dma_sync_wait() on this descriptor.
> >
> >   When I modified dma_wait_for_async_tx() in this way, the kernel
> >  lockup disappeared. But nevertheless the mkfs process hangs up after
> >  some time. So, it looks like something is still missing in the support
> >  of the dependency-chaining feature...
> >
> 
> I am preparing a new patch that replaces ASYNC_TX_DEP_ACK with
> ASYNC_TX_CHAIN_ACK.  The plan is to make the entire chain of
> dependencies available up until the last transaction is submitted.
> This allows the entire dependency chain to be walked at
> async_tx_submit time so that we can properly handle these multiple
> dependency cases.  I'll send it out when it passes my internal
> tests...

 Fine. I guess this replacement assumes some modifications to the RAID-5 
driver as well. Right?

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com