Re: [PATCH -mm 0/4] raid5: stripe_queue (+20% to +90% write performance)

2007-10-08 Thread Neil Brown
On Saturday October 6, [EMAIL PROTECTED] wrote:
> Neil,
> 
> Here is the latest spin of the 'stripe_queue' implementation.  Thanks to
> raid6+bitmap testing done by Mr. James W. Laferriere there have been
> several cleanups and fixes since the last release.  Also, the changes
> are now spread over 4 patches to isolate one conceptual change per
> patch.  The most significant cleanup is removing the stripe_head back
> pointer from stripe_queue.  This effectively makes the queuing layer
> independent from the caching layer.

Thanks Dan, and sorry that it has taken me so long to take a serious
look at this.
The results seem impressive.  I'll try to do some testing myself, but
firstly: some questions.


1/ Can you explain why this improves the performance more than simply
  doubling the size of the stripe cache?

  The core of what it is doing seems to be to give priority to writing
  full stripes.  We already do that by delaying incomplete stripes.
  Maybe we just need to tune that mechanism a bit?  Maybe release
  fewer partial stripes at a time?

  It seems that the whole point of the stripe_queue structure is to
  allow requests to gather before they are processed so the more
  "deserving" can be processed first, but I cannot see why you need a
  data structure separate from the list_head.

  You could argue that simply doubling the size of the stripe cache
  would be a waste of memory as we only want to use half of it to
  handle active requests - the other half is for requests being built
  up.
  In that case, I don't see a problem with having a pool of pages
  which is smaller than would be needed for the full stripe cache, and
  allocating them to stripe_heads as they become free.
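
   For reference, the existing cache can already be resized at runtime
   through sysfs, so "simply doubling" it is a one-line experiment (a
   quick sketch; md0 is only an example device, and the value is a
   count of stripe entries, not bytes):

       cat /sys/block/md0/md/stripe_cache_size      # current number of entries
       echo 2048 > /sys/block/md0/md/stripe_cache_size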

2/ I thought I understood from your descriptions that
   raid456_cache_arbiter would normally be waiting for a free stripe,
   that during this time full stripes could get promoted to io_hi, and
   so when raid456_cache_arbiter finally got a free stripe, it would
   attach it to the most deserving stripe_queue.  However it doesn't
   quite do that.  It chooses the deserving stripe_queue *before*
   waiting for a free stripe_head.  This seems slightly less than
   optimal?

3/ Why create a new workqueue for raid456_cache_arbiter rather than
   use raid5d?  It should be possible to do a non-blocking wait for a
   free stripe_head, in which case the "find a stripe head and attach
   the most deserving stripe_queue" would fit well into raid5d.

4/ Why do you use an rbtree rather than a hash table to index the
  'stripe_queue' objects?  I seem to recall a discussion about this
  where it was useful to find adjacent requests or something like
  that, but I cannot see that in the current code.
  But maybe rbtrees are a better fit, in which case, should we use
  them for stripe_heads as well???

5/ Then again... It seems to me that a stripe_head will always have a
   stripe_queue pointing to it.  In that case we don't need to index
   the stripe_heads at all any more.  Would that be correct?

6/ What is the point of the do/while loop in
   wait_for_cache_attached_queue?  It seems totally superfluous.

That'll do for now.

NeilBrown


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Neil Brown
On Tuesday October 9, [EMAIL PROTECTED] wrote:
> 
> Problems at step 4: 'man mdadm' doesn't say whether it's possible to
> grow an array to a degraded array (non-existent disc). Is it possible?

Why not experiment with loop devices on files and find out?

But yes: you can grow to a degraded array, provided you specify a
--backup-file.
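
For example, a throwaway experiment along these lines would show the
behaviour without risking real discs (a rough sketch, untested here;
the loop files, /dev/md9 and the backup-file path are only examples):

    # two small backing files on loop devices, so nothing real is at risk
    dd if=/dev/zero of=/tmp/d0 bs=1M count=100
    dd if=/dev/zero of=/tmp/d1 bs=1M count=100
    losetup /dev/loop0 /tmp/d0
    losetup /dev/loop1 /tmp/d1

    # a complete 2-device raid5...
    mdadm --create /dev/md9 --level=5 --raid-devices=2 /dev/loop0 /dev/loop1

    # ...grown to a degraded 3-device raid5 without adding a disc
    mdadm --grow /dev/md9 --raid-devices=3 --backup-file=/tmp/md9.backup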

However I don't recommend it.  I would never recommend having a
degraded array by design.  It should only ever happen due to a
failure, and should last only until you can get a replacement
rebuilt. 

Remember that a degraded raid5 has a greater risk of data loss than a
single drive.

> 
> 
> PS: the fact that a degraded array will be unsafe for the data is an
> intended motivating factor for buying the next drive ;)

:-)

NeilBrown


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Neil Brown
On Tuesday October 9, [EMAIL PROTECTED] wrote:
> 
> o degraded raid5 isn't really RAID - i.e., it's not any better than
>   a raid0 array; that is, any disk fails => the whole array fails.
>   So instead of creating a degraded raid5 array initially, create a
>   smaller one instead, but not degraded, and reshape it when
>   necessary.

Fully agree.

> 
> o reshaping takes time, and for this volume, reshape will take
>   many hours, maybe days, to complete.
> 
> o During this reshape time, errors may be fatal to the whole array -
>   while mdadm does have a notion of a "critical section", the
>   whole procedure isn't as well tested as the rest of the raid code,
>   and I for one will not rely on it, at least for now.  For example,
>   a power failure at an "unexpected" moment, or some plain-stupid
>   error in the reshape code could make the whole array go "boom", etc...

While it is true that the resize code is less tested than other code,
it is designed to handle a single failure at any time (so a power
failure is OK as long as the array is not running degraded), and I
have said that if anyone does suffer problems while performing a
reshape, I will do my absolute best to get the array functioning and
the data safe again.

> 
> o A filesystem on the array has to be resized separately after
>   re{siz,shap}ing the array.  And filesystems are different at
>   this point, too - there are various limitations.  For example,
>   it's problematic to grow ext[23]fs by large amounts, because
>   when it gets initially created, mke2fs calculates sizes of
>   certain internal data structures based on the device size,
>   and those structures can't be grown significantly; only
>   recreating the filesystem will do the trick.

This isn't entirely true.
For online resizing (while the filesystem is mounted) there are some
limitations, as you suggest.  For offline resizing (while the filesystem
is not mounted) there are no such limitations.
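
For example, an offline grow of an ext3 filesystem after a reshape is
just (a sketch; /dev/md0 and the mount point are only examples):

    umount /mnt/backup
    e2fsck -f /dev/md0      # resize2fs insists on a freshly checked filesystem
    resize2fs /dev/md0      # with no size argument it grows to fill the device
    mount /dev/md0 /mnt/backup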


NeilBrown


RE: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Guy Watkins


} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of Janek Kozicki
} Sent: Monday, October 08, 2007 6:47 PM
} To: linux-raid@vger.kernel.org
} Subject: Re: very degraded RAID5, or increasing capacity by adding discs
} 
} Janek Kozicki said: (by the date of Tue, 9 Oct 2007 00:25:50 +0200)
} 
} > Richard Scobie said: (by the date of Tue, 09 Oct 2007 08:26:35
} +1300)
} >
} > > No, but you can make a degraded 3 drive array, containing 2 drives and
} > > then add the next drive to complete it.
} > >
} > > The array can then be grown (man mdadm, GROW section), to add the
} fourth.
} >
} > Oh, good. Thanks, I must've been blind to have missed this.
} > This completely solves my problem.
} 
} Uh, actually not :)
} 
} My 1st 500 GB drive is full now. When I buy a 2nd one I want to
} create a 3-disc degraded array using just 2 discs, one of which
} contains unbackupable data.
} 
} steps:
} 1. create degraded two-disc RAID5 on 1 new disc
} 2. copy data from old disc to new one
} 3. rebuild the array with old and new discs (now I have 500 GB on 2 discs)
3. Add the old disk to the new array.  Once done, the RAID5 is redundant.

} 4. GROW this array to a degraded 3 discs RAID5 (so I have 1000 GB on 2
} discs)
4. Buy a 3rd disk.
5. Add the new 3rd disk to the array and grow to a 3-disk RAID5 array.  Once
done, the array is redundant.

Repeat 4 and 5 each time you buy a new disk.

I don't think you can grow to a degraded array.  I think you must add a new
disk first.  But I am not sure.
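
Roughly, the mdadm commands for that sequence (add a disc first, then
grow) would be something like the following; device names and mount
points are only examples:

    # 1. degraded 2-disc raid5 on the single new disc
    mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/sdb1 missing

    # 2. make a filesystem and copy the data across
    mkfs.ext3 /dev/md0
    mount /dev/md0 /mnt/new && cp -a /mnt/old/. /mnt/new/

    # 3. add the old disc; the array rebuilds and becomes redundant
    mdadm --add /dev/md0 /dev/sda1

    # 4./5. when the next disc arrives, add it and reshape
    mdadm --add /dev/md0 /dev/sdc1
    mdadm --grow /dev/md0 --raid-devices=3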

} ...
} 5. when I buy a 3rd drive I either grow the array, or just rebuild and
} hold off on growing until I buy a 4th drive.
} 
} Problems at step 4: 'man mdadm' doesn't say whether it's possible to
} grow an array to a degraded array (non-existent disc). Is it possible?
} 
} 
} PS: the fact that a degraded array will be unsafe for the data is an
} intended motivating factor for buying the next drive ;)
} 
} --
} Janek Kozicki |



Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Michael Tokarev
Janek Kozicki wrote:
> Hello,
> 
> Recently I started to use mdadm and I'm very impressed by its
> capabilities. 
> 
> I have raid0 (250+250 GB) on my workstation. And I want to have
> raid5 (4*500 = 1500 GB) on my backup machine.

Hmm.  Are you sure you need that much space on the backup, to
start with?  Maybe a better backup strategy will help avoid
hardware costs?  Such as using rsync for backups, as discussed
on this mailing list about a month back (rsync is able to keep
many ready-to-use copies of your filesystems but only stores
files that actually changed since the last backup, thus
requiring much less space than many full backups).
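
One common way to get that with rsync is --link-dest: each run
produces a full-looking snapshot, while unchanged files are hard-linked
to the previous snapshot instead of copied (a rough sketch; the paths
are only examples):

    today=$(date +%Y-%m-%d)
    rsync -a --delete --link-dest=/backup/latest /home/ /backup/$today/
    ln -snf /backup/$today /backup/latest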

> The backup machine currently doesn't have raid, just a single 500 GB
> drive. I plan to buy more HDDs to have a bigger space for my
> backups but since I cannot afford all HDDs at once I face a problem
> of "expanding" an array. I'm able to add one 500 GB drive every few
> months until I have all 4 drives.
> 
> But I cannot make a backup of a backup... so reformatting/copying all
> data each time I add a new disc to the array is not possible for me.
> 
> Is it possible anyhow to create a "very degraded" raid array - one
> that consists of 4 drives, but has only TWO?
> 
> This would involve some very tricky *hole* management on the block
> device... one that places holes in stripes on the block device,
> until more discs are added to fill the holes. When the holes are
> filled, the block device grows bigger, and with lvm I just increase
> the filesystem size. This is perhaps coupled with some "un-striping"
> that moves/reorganizes blocks around to fill/defragment the holes.

It's definitely not possible with raid5.  The only option is to create
a raid5 array consisting of fewer drives than it will contain in the
end, and reshape it when you get more drives, as others noted in this
thread.  But do note the following points:

o degraded raid5 isn't really RAID - i.e., it's not any better than
  a raid0 array; that is, any disk fails => the whole array fails.
  So instead of creating a degraded raid5 array initially, create a
  smaller one instead, but not degraded, and reshape it when
  necessary.

o reshaping takes time, and for this volume, reshape will take
  many hours, maybe days, to complete.

o During this reshape time, errors may be fatal to the whole array -
  while mdadm does have a notion of a "critical section", the
  whole procedure isn't as well tested as the rest of the raid code,
  and I for one will not rely on it, at least for now.  For example,
  a power failure at an "unexpected" moment, or some plain-stupid
  error in the reshape code could make the whole array go "boom", etc...

o A filesystem on the array has to be resized separately after
  re{siz,shap}ing the array.  And filesystems are different at
  this point, too - there are various limitations.  For example,
  it's problematic to grow ext[23]fs by large amounts, because
  when it gets initially created, mke2fs calculates sizes of
  certain internal data structures based on the device size,
  and those structures can't be grown significantly; only
  recreating the filesystem will do the trick.

> is it just a pipe dream?

I'd say it is... ;)

/mjt


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Janek Kozicki
Janek Kozicki said: (by the date of Tue, 9 Oct 2007 00:25:50 +0200)

> Richard Scobie said: (by the date of Tue, 09 Oct 2007 08:26:35 +1300)
> 
> > No, but you can make a degraded 3 drive array, containing 2 drives and 
> > then add the next drive to complete it.
> > 
> > The array can then be grown (man mdadm, GROW section), to add the fourth.
> 
> Oh, good. Thanks, I must've been blind to have missed this.
> This completely solves my problem.

Uh, actually not :)

My 1st 500 GB drive is full now. When I buy a 2nd one I want to
create a 3-disc degraded array using just 2 discs, one of which
contains unbackupable data.

steps:
1. create degraded two-disc RAID5 on 1 new disc
2. copy data from old disc to new one
3. rebuild the array with old and new discs (now I have 500 GB on 2 discs)
4. GROW this array to a degraded 3 discs RAID5 (so I have 1000 GB on 2 discs)
...
5. when I buy a 3rd drive I either grow the array, or just rebuild and
hold off on growing until I buy a 4th drive.

Problems at step 4: 'man mdadm' doesn't say whether it's possible to
grow an array to a degraded array (non-existent disc). Is it possible?


PS: the fact that a degraded array will be unsafe for the data is an
intended motivating factor for buying the next drive ;)

-- 
Janek Kozicki |


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Janek Kozicki
Richard Scobie said: (by the date of Tue, 09 Oct 2007 08:26:35 +1300)

> No, but you can make a degraded 3 drive array, containing 2 drives and 
> then add the next drive to complete it.
> 
> The array can then be grown (man mdadm, GROW section), to add the fourth.

Oh, good. Thanks, I must've been blind to have missed this.
This completely solves my problem.

-- 
Janek Kozicki |


RE: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Guy Watkins
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of Richard Scobie
} Sent: Monday, October 08, 2007 3:27 PM
} To: linux-raid@vger.kernel.org
} Subject: Re: very degraded RAID5, or increasing capacity by adding discs
} 
} Janek Kozicki wrote:
} 
} > Is it possible anyhow to create a "very degraded" raid array - one
} > that consists of 4 drives, but has only TWO?
} 
} No, but you can make a degraded 3 drive array, containing 2 drives and
} then add the next drive to complete it.
} 
} The array can then be grown (man mdadm, GROW section), to add the fourth.
} 
} Regards,
} 
} Richard

I think someone once said you could create a 2-disk degraded RAID5 array
with just 1 disk.  Then add one later.  Then expand as needed.  Someone
should test this.



Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Richard Scobie

Janek Kozicki wrote:

> Is it possible anyhow to create a "very degraded" raid array - one
> that consists of 4 drives, but has only TWO?


No, but you can make a degraded 3 drive array, containing 2 drives and 
then add the next drive to complete it.


The array can then be grown (man mdadm, GROW section), to add the fourth.
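
In mdadm terms that is the "missing" placeholder (device names here are
only illustrative):

    # degraded 3-drive raid5 built from 2 real discs plus one missing slot
    mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 missing

    # complete it with the third disc...
    mdadm --add /dev/md0 /dev/sdd1

    # ...and later add the fourth and reshape onto it
    mdadm --add /dev/md0 /dev/sde1
    mdadm --grow /dev/md0 --raid-devices=4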

Regards,

Richard


Re: very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Justin Piszcz



On Mon, 8 Oct 2007, Janek Kozicki wrote:

> Hello,
>
> Recently I started to use mdadm and I'm very impressed by its
> capabilities.
>
> I have raid0 (250+250 GB) on my workstation. And I want to have
> raid5 (4*500 = 1500 GB) on my backup machine.
>
> The backup machine currently doesn't have raid, just a single 500 GB
> drive. I plan to buy more HDDs to have a bigger space for my
> backups but since I cannot afford all HDDs at once I face a problem
> of "expanding" an array. I'm able to add one 500 GB drive every few
> months until I have all 4 drives.
>
> But I cannot make a backup of a backup... so reformatting/copying all
> data each time I add a new disc to the array is not possible for me.
>
> Is it possible anyhow to create a "very degraded" raid array - one
> that consists of 4 drives, but has only TWO?
>
> This would involve some very tricky *hole* management on the block
> device... one that places holes in stripes on the block device,
> until more discs are added to fill the holes. When the holes are
> filled, the block device grows bigger, and with lvm I just increase
> the filesystem size. This is perhaps coupled with some "un-striping"
> that moves/reorganizes blocks around to fill/defragment the holes.
>
> is it just a pipe dream?
>
> best regards
>
> PS: yes it's simple to make a degraded array of 3 drives, but I
> cannot afford two discs at once...
>
> --
> Janek Kozicki |



With raid1 you can create a degraded array with 1 disk - I have done this.
I have always wondered if mdadm will let you make a degraded raid5 array
with 2 disks (you'd specify 3 and only give 2) - you can always expand
later.


Justin.


very degraded RAID5, or increasing capacity by adding discs

2007-10-08 Thread Janek Kozicki
Hello,

Recently I started to use mdadm and I'm very impressed by its
capabilities. 

I have raid0 (250+250 GB) on my workstation. And I want to have
raid5 (4*500 = 1500 GB) on my backup machine.

The backup machine currently doesn't have raid, just a single 500 GB
drive. I plan to buy more HDDs to have a bigger space for my
backups but since I cannot afford all HDDs at once I face a problem
of "expanding" an array. I'm able to add one 500 GB drive every few
months until I have all 4 drives.

But I cannot make a backup of a backup... so reformatting/copying all
data each time I add a new disc to the array is not possible for me.

Is it possible anyhow to create a "very degraded" raid array - one
that consists of 4 drives, but has only TWO?

This would involve some very tricky *hole* management on the block
device... one that places holes in stripes on the block device,
until more discs are added to fill the holes. When the holes are
filled, the block device grows bigger, and with lvm I just increase
the filesystem size. This is perhaps coupled with some "un-striping"
that moves/reorganizes blocks around to fill/defragment the holes.

is it just a pipe dream?

best regards


PS: yes it's simple to make a degraded array of 3 drives, but I
cannot afford two discs at once...

-- 
Janek Kozicki |


Re: RAID 5 performance issue.

2007-10-08 Thread Justin Piszcz



On Sun, 7 Oct 2007, Dean S. Messing wrote:

> Justin Piszcz wrote:
>
> > On Fri, 5 Oct 2007, Dean S. Messing wrote:
> >
> > > Brendan Conoboy wrote:
> > >
> > > > Is the onboard SATA controller real SATA or just an ATA-SATA
> > > > converter?  If the latter, you're going to have trouble getting faster
> > > > performance than any one disk can give you at a time.  The output of
> > > > 'lspci' should tell you if the onboard SATA controller is on its own
> > > > bus or sharing space with some other device.  Pasting the output here
> > > > would be useful.
> > >
> > > N00bee question:
> > >
> > > How does one tell if a machine's disk controller is an ATA-SATA
> > > converter?
> > >
> > > The output of `lspci|fgrep -i sata' is:
> > >
> > > 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
> > >
> > > suggests a real SATA. These references to ATA in "dmesg", however,
> > > make me wonder.
> > >
> > > ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > > ata1.00: ATA-7: WDC WD1600JS-75NCB3, 10.02E04, max UDMA/133
> > > ata1.00: 31250 sectors, multi 0: LBA48 NCQ (depth 31/32)
> > > ata1.00: configured for UDMA/133
> > > ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > > ata2.00: ATA-7: ST3160812AS, 3.ADJ, max UDMA/133
> > > ata2.00: 31250 sectors, multi 0: LBA48 NCQ (depth 31/32)
> > > ata2.00: configured for UDMA/133
> > > ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > > ata3.00: ATA-7: ST3500630NS, 3.AEK, max UDMA/133
> > > ata3.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)
> > > ata3.00: configured for UDMA/133
> > >
> > > Dean
> >
> > His drives are either really old and do not support NCQ or he is not using
> > AHCI in the BIOS.
>
> Sorry, Justin, if I wasn't clear.  I was asking the N00bee question
> about _my_own_ machine.  The output of lspci (on my machine) seems to
> indicate I have a "real" SATA controller on the motherboard, but the
> contents of "dmesg", with the references to ATA-7 and UDMA/133, made
> me wonder if I had just an ATA-SATA converter.  Hence my question: how
> does one tell definitively if one has a real SATA controller on the
> motherboard?



The output looks like a real (AHCI-capable) SATA controller and your 
drives are using NCQ/AHCI.


Output from one of my machines:
[   23.621462] ata1: SATA max UDMA/133 cmd 0xf8812100 ctl 0x bmdma 0x irq 219

[   24.078390] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   24.549806] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

As far as why it shows UDMA/133 in the kernel output I am sure there is a 
reason :)


I know in the older SATA drives there was a bridge chip that was used to
convert the drive from IDE<->SATA; maybe it is a holdover from those legacy
days, not sure.


With the newer NCQ/'native' SATA drives, the bridge chip should no longer 
exist.
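
If you want to double-check from userspace, a couple of quick looks
usually settle it (example commands; exact output varies by kernel and
board):

    # in AHCI mode the controller usually identifies itself as such
    lspci | grep -i sata

    # and the libata/ahci driver announces link speed and NCQ at boot
    dmesg | grep -iE 'ahci|ncq|sata link'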


Justin.