Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-16 Thread Al Boldi
Justin Piszcz wrote:
 For these benchmarks I timed how long it takes to extract a standard 4.4
 GiB DVD:

 Settings: Software RAID 5 with the following settings (until I change
 those too):

 Base setup:
 blockdev --setra 65536 /dev/md3
 echo 16384 > /sys/block/md3/md/stripe_cache_size
 echo Disabling NCQ on all disks...
 for i in $DISKS
 do
echo Disabling NCQ on $i
    echo 1 > /sys/block/$i/device/queue_depth
 done

 p34:~# grep : *chunk* |sort -n
 4-chunk.txt:0:45.31
 8-chunk.txt:0:44.32
 16-chunk.txt:0:41.02
 32-chunk.txt:0:40.50
 64-chunk.txt:0:40.88
 128-chunk.txt:0:40.21
 256-chunk.txt:0:40.14***
 512-chunk.txt:0:40.35
 1024-chunk.txt:0:41.11
 2048-chunk.txt:0:43.89
 4096-chunk.txt:0:47.34
 8192-chunk.txt:0:57.86
 16384-chunk.txt:1:09.39
 32768-chunk.txt:1:26.61

 It would appear a 256 KiB chunk-size is optimal.

Can you retest with different max_sectors_kb on both md and sd?

Also, can you retest using dd with different block-sizes?
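For reference, a minimal sketch of such a sweep (it reuses the $DISKS list and the /r1 test file from the setup above; the max_sectors_kb values and output file names are only examples):

for kb in 32 64 128 192 256 512
do
    for i in $DISKS
    do
        echo $kb > /sys/block/$i/queue/max_sectors_kb   # per-disk request size cap
    done
    echo 3 > /proc/sys/vm/drop_caches
    /usr/bin/time -f %E -o ~/$kb=max_sectors.txt \
        bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
done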


Thanks!

--
Al



Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again

2008-01-16 Thread Al Boldi
Justin Piszcz wrote:
 On Wed, 16 Jan 2008, Al Boldi wrote:
   Also, can you retest using dd with different block-sizes?

 I can do this, moment..


 I know about oflag=direct but I choose to use dd with sync and measure the
 total time it takes.
 /usr/bin/time -f %E -o ~/$i=chunk.txt bash -c 'dd if=/dev/zero
 of=/r1/bigfile bs=1M count=10240; sync'

 So I was asked on the mailing list to test dd with various chunk sizes,
 here is the length of time it took
 to write 10 GiB and sync per each chunk size:

 4=chunk.txt:0:25.46
 8=chunk.txt:0:25.63
 16=chunk.txt:0:25.26
 32=chunk.txt:0:25.08
 64=chunk.txt:0:25.55
 128=chunk.txt:0:25.26
 256=chunk.txt:0:24.72
 512=chunk.txt:0:24.71
 1024=chunk.txt:0:25.40
 2048=chunk.txt:0:25.71
 4096=chunk.txt:0:27.18
 8192=chunk.txt:0:29.00
 16384=chunk.txt:0:31.43
 32768=chunk.txt:0:50.11
 65536=chunk.txt:2:20.80

What do you get with bs=512,1k,2k,4k,8k,16k...
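As a minimal sketch (same target file as above; the count is recomputed so every run writes the same 10 GiB regardless of block size):

for bs in 512 1024 2048 4096 8192 16384 32768 65536 131072
do
    count=$((10737418240 / bs))
    echo 3 > /proc/sys/vm/drop_caches
    /usr/bin/time -f %E -o ~/bs-$bs.txt \
        bash -c "dd if=/dev/zero of=/r1/bigfile bs=$bs count=$count; sync"
done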


Thanks!

--
Al



[RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Al Boldi
Lars Ellenberg wrote:
 meanwhile, please, anyone interested,
 the drbd paper for LinuxConf Eu 2007 is finalized.
 http://www.drbd.org/fileadmin/drbd/publications/
 drbd8.linux-conf.eu.2007.pdf

 it does not give too much implementation detail (would be inappropriate
 for conference proceedings, imo; some paper commenting on the source
 code should follow).

 but it does give a good overview about what DRBD actually is,
 what exact problems it tries to solve,
 and what developments to expect in the near future.

 so you can make up your mind about
  "Do we need it?", and
  "Why DRBD? Why not NBD + MD-RAID?"

Ok, conceptually your driver sounds really interesting, but when I read the 
pdf I got completely turned off.  The problem is that the concepts are not 
clearly implemented, when in fact the concepts are really simple:

  Allow shared access to remote block storage with fault tolerance.

The first thing to tackle here would be write serialization.  Then start 
thinking about fault tolerance.

Now, shared remote block access should theoretically be handled, as DRBD does 
it, by a block layer driver, but realistically it may be more appropriate 
to let it be handled by the combining end user, like OCFS or GFS.

The idea here is to simplify lower layer implementations while removing any 
preconceived dependencies, and let upper layers reign free without incurring 
redundant overhead.

Look at ZFS; it illegally violates layering by combining md/dm/lvm with the 
fs, but it does this based on a realistic understanding of the problems 
involved, which enables it to improve performance, flexibility, and 
functionality specific to its use case.

This implies that there are two distinct forces at work here:

  1. Layer components
  2. Use-Case composers

Layer components should technically not implement any use case (other than 
providing a plumbing framework), as that would incur unnecessary 
dependencies, which could reduce its generality and thus reusability.

Use-Case composers can now leverage layer components from across the layering 
hierarchy, to yield a specific use case implementation.

DRBD is such a Use-Case composer, as is md / dm / lvm and any fs in general, 
whereas aoe / nbd / loop and the VFS / FUSE are examples of layer 
components.

It follows that Use-case composers, like DRBD, need common functionality that 
should be factored out into layer components, and then recompose to 
implement a specific use case.


Thanks!

--
Al



Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread Al Boldi
Evgeniy Polyakov wrote:
 Al Boldi ([EMAIL PROTECTED]) wrote:
  Look at ZFS; it illegally violates layering by combining md/dm/lvm with
  the fs, but it does this based on a realistic understanding of the
  problems involved, which enables it to improve performance, flexibility,
  and functionality specific to its use case.
 
  This implies that there are two distinct forces at work here:
 
1. Layer components
2. Use-Case composers
 
  Layer components should technically not implement any use case (other
  than providing a plumbing framework), as that would incur unnecessary
  dependencies, which could reduce its generality and thus reusability.
 
  Use-Case composers can now leverage layer components from across the
  layering hierarchy, to yield a specific use case implementation.
 
  DRBD is such a Use-Case composer, as is md / dm / lvm and any fs in
  general, whereas aoe / nbd / loop and the VFS / FUSE are examples of
  layer components.
 
  It follows that Use-case composers, like DRBD, need common functionality
  that should be factored out into layer components, and then recompose to
  implement a specific use case.

 Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
 on top of distributed storage (which is a surprise to me, that holy zfs
 supports that)?

Actually, I may not have been very clear: by Use-Case composer I meant an 
internal in-kernel Use-Case composer, as opposed to an external Userland 
Use-Case composer.

So, nbd+dm+raid1 would be an external Userland Use-Case composition, which 
obviously could have some drastic performance issues.
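For concreteness, a minimal sketch of what such an external composition looks like in practice (host name, port and device names are hypothetical, and the old-style nbd syntax is assumed):

# on the remote node: export a block device
remote$ nbd-server 2000 /dev/sdb1

# on the local node: import it and mirror it against a local disk
local$ modprobe nbd
local$ nbd-client remote-host 2000 /dev/nbd0
local$ mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/nbd0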

DRBD and ZFS are examples of internal in-kernel Use-Case composers, which 
obviously could show some drastic performance improvements.  

Although you could allow in-kernel Use-Case composers to be run on top of 
Userland Use-Case composers, that wouldn't be the preferred mode of 
operation.  Instead, you would for example recompose ZFS to incorporate an 
in-kernel distributed storage layer component, like nbd.

All this boils down to refactoring Use-Case composers to produce layer 
components with both in-kernel and userland interfaces.  Once we have that, 
it becomes a matter of plug-and-play to produce something awesome like ZFS.


Thanks!

--
Al



Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5

2007-07-30 Thread Al Boldi
Justin Piszcz wrote:
 CONFIG:

 Software RAID 5 (400GB x 6): Default mkfs parameters for all filesystems.
 Kernel was 2.6.21 or 2.6.22, did these awhile ago.
 Hardware was SATA with PCI-e only, nothing on the PCI bus.

 ZFS was userspace+fuse of course.

Wow! Userspace and still that efficient.

 Reiser was V3.
 EXT4 was created using the recommended options on its project page.

 RAW:

 ext2,7760M,56728,96.,180505,51,85484,17.,50946.7,80.,235541,21.,373.667,0,16:10:16/64,2354,27,0,0,8455.67,14.6667,2211.67,26.,0,0,9724,22.
 ext3,7760M,52702.7,94.,165005,60,82294.7,20.6667,52664,83.6667,258788,33.,335.8,0,16:10:16/64,858.333,10.6667,10250.3,28.6667,4084,15,897,12.6667,4024.33,12.,2754,11.
 ext4,7760M,53129.7,95,164515,59.,101678,31.6667,62194.3,98.6667,266716,22.,405.767,0,16:10:16/64,1963.67,23.6667,0,0,20859,73.6667,1731,21.,9022,23.6667,16410,65.6667
 jfs,7760M,54606,92,191997,52,112764,33.6667,63585.3,99,274921,22.,383.8,0,16:10:16/64,344,1,0,0,539.667,0,297.667,1,0,0,340,0
 reiserfs,7760M,51056.7,96,180607,67,106907,38.,61231.3,97.6667,275339,29.,441.167,0,16:10:16/64,2516,60.6667,19174.3,60.6667,8194.33,54.3333,2011,42.6667,6963.67,19.6667,9168.33,68.6667
 xfs,7760M,52985.7,93,158342,45,79682,14,60547.3,98,239101,20.,359.667,0,16:10:16/64,415,4,0,0,1774.67,10.6667,454,4.7,14526.3,40,1572,12.6667

 zfs,7760M,

Dissecting some of these numbers:

  speed %cpu  
 25601,43.,
 32198.7,4,
 13266.3, 2,
 44145.3,68.6667,
 129278,9,
 245.167,0,

 16:10:16/64,

  speed %cpu  
 218.333,2,
 2698.33,11.6667,
 7434.67,14.,
 244,2,
 2191.33,11.6667,
 5613.33,13.

Extrapolating these %cpu numbers makes ZFS the fastest.

Are you sure these numbers are correct?


Thanks!

--
Al



Re: [RFH] Partition table recovery

2007-07-22 Thread Al Boldi
Theodore Tso wrote:
 On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
  Sounds great, but it may be advisable to hook this into the partition
  modification routines instead of mkfs/fsck.  Which would mean that the
  partition manager could ask the kernel to instruct its fs subsystem to
  update the backup partition table for each known fs-type that supports
  such a feature.

 Well, let's think about this a bit.  What are the requirements?

 1) The partition manager should be able to explicitly request that a new
 backup of the partition tables be stashed in each filesystem that has
 room for such a backup.  That way, when the user affirmatively makes a
 partition table change, it can get backed up in all of the right
 places automatically.

 2) The fsck program should *only* stash a backup of the partition
 table if there currently isn't one in the filesystem.  It may be that
 the partition table has been corrupted, and so merely doing an fsck
 should not transfer a current copy of the partition table to the
 filesystem-specific backup area.  It could be that the partition table
 was only partially recovered, and we don't want to overwrite the
 previously existing backups except on an explicit request from the
 system administrator.

 3) The mkfs program should automatically create a backup of the
 current partition table layout.  That way we get a backup in the newly
 created filesystem as soon as it is created.

 4) The exact location of the backup may vary from filesystem to
 filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
 interfere with the boot sector at bytes 0-511, so that's the obvious
 location.  Other filesystems may have that location in use, and some
 other location might be a better place to store it.  Ideally it will
 be a well-known location, that isn't dependent on finding an inode
 table, or some such, but that may not be possible for all filesystems.

 OK, so how about this as a solution that meets the above requirements?

 /sbin/partbackup <device> [<fspart>]

   Will scan <device> (i.e., /dev/hda, /dev/sdb, etc.) and create
   a 512 byte partition backup, using the format I've previously
   described.  If <fspart> is specified on the command line, it
   will use the blkid library to determine the filesystem type of
   <fspart>, and then attempt to execute
   /sbin/partbackupfs.<fstype> to write the partition backup to
   <fspart>.  If <fspart> is '-', then it will write the 512 byte
   partition table to stdout.  If <fspart> is not specified on
   the command line, /sbin/partbackup will iterate over all
   partitions in <device>, use the blkid library to attempt to
   determine the correct filesystem type, and then execute
   /sbin/partbackupfs.<fstype> if such a backup program exists.

 /sbin/partbackupfs.<fstype> <fspart>

   ... is a filesystem specific program for filesystem type
   <fstype>.  It will assure that <fspart> (i.e., /dev/hda1,
   /dev/sdb3) is of an appropriate filesystem type, and then read
   512 bytes from stdin and write it out to <fspart> to an
   appropriate place for that filesystem.

 Partition managers will be encouraged to check to see if
 /sbin/partbackup exists, and if so, after the partition table is
 written, to call it with just one argument (i.e., /sbin/partbackup
 /dev/hdb).  They SHOULD provide an option for the user to suppress the
 backup from happening, but the backup should be the default behavior.

 An /sbin/mkfs.<fstype> program is encouraged to run /sbin/partbackup
 with two arguments (i.e., /sbin/partbackup /dev/hdb /dev/hdb3) when
 creating a filesystem.

 An /sbin/fsck.<fstype> program is encouraged to check to see if a
 partition backup exists (assuming the filesystem supports it), and if
 not, call /sbin/partbackup with two arguments.

 A filesystem utility package for a particular filesystem type is
 encouraged to make the above changes to its mkfs and fsck programs, as
 well as provide an /sbin/partbackupfs.fstype program.

Great!

 I would do this all in userspace, though.  Is there any reason to get
 the kernel involved?  I don't think so.

Yes, doing things in userspace, when possible, is much better.  But, a change 
in the partition table has to be relayed to the kernel, and when that change 
happens to be on a mounted disk, then the partition manager complains of not 
being able to update the kernel's view.  So how can this be addressed?
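For the record, the usual mechanism is asking the kernel to re-scan the whole table, which is exactly what fails on a disk with in-use partitions (device name hypothetical):

$ blockdev --rereadpt /dev/sda   # issues the BLKRRPART ioctl; returns EBUSY while partitions are in use
$ partprobe /dev/sda             # parted's utility for the same purpose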


Thanks!

--
Al



Re: [RFH] Partition table recovery

2007-07-21 Thread Al Boldi
Theodore Tso wrote:
 On Sat, Jul 21, 2007 at 07:54:14PM +0200, Rene Herman wrote:
  sfdisk -d already works most of the time. Not as a verbatim tool (I
  actually semi-frequently use a sfdisk -d /dev/hda | sfdisk invocation
  as a way to _rewrite_ the CHS fields to other values after changing
  machines around on a disk) but something you'd backup on the FS level
  should, in my opinion, need to be less fragile than would be possible
  with just 512 bytes available.

 *IF* you remember to store the sfdisk -d somewhere useful.  In my How
 To Recover From Hard Drive Catastrophes classes, I tell them to
 print out a copy of "sfdisk -l /dev/hda ; sfdisk -d /dev/hda" and tape
 it to the side of the computer.  I also tell them to do regular backups.
 Want to make a guess how many of them actually follow this good advice?
 Far fewer than I would like, I suspect...

 What I'm suggesting is the equivalent of sfdisk -d, except we'd be
 doing it automatically without requiring the user to take any kind of
 explicit action.  Is it perfect?  No, although the edge conditions are
 quite rare these days and generally involve users using legacy systems
 and/or doing Weird Shit such that They Really Should Know To Do Their
 Own Explicit Backups.  But for the novice users, it should work Just
 Fine.

Sounds great, but it may be advisable to hook this into the partition 
modification routines instead of mkfs/fsck.  Which would mean that the 
partition manager could ask the kernel to instruct its fs subsystem to 
update the backup partition table for each known fs-type that supports such 
a feature.


Thanks!

--
Al



Re: [RFH] Partition table recovery

2007-07-20 Thread Al Boldi
Jeffrey V. Merkey wrote:
 Al Boldi wrote:
 As always, a good friend of mine managed to scratch my partion table by
 cat'ing /dev/full into /dev/sda.  I was able to push him out of the way,
  but at least the first 100MB are gone.  I can probably live without the
  first partion, but there are many partitions after that, which I hope
  should easily be recoverable.
 
 I tried parted, but it's not working out for me.  Does anybody know of a
 simple partition recovery tool, that would just scan the disk for lost
 partions?

 One thing NetWare always did was to stamp a copy of the partition table
 at the time a partition was created as the second logical sector (offset
 1) from the start of a newly created partition. This allowed the disk to
 be scanned for the original (or last) partition table copy.

This is really a good idea, as this would save you the trouble of 
reconstructing the table due to older overlapping entries.

Can linux do something like that?


Thanks!

--
Al



Re: [RFH] Partition table recovery

2007-07-20 Thread Al Boldi
Dave Young wrote:
 On 7/20/07, Al Boldi [EMAIL PROTECTED] wrote:
  As always, a good friend of mine managed to scratch my partion table by
  cat'ing /dev/full into /dev/sda.  I was able to push him out of the way,
  but

 /dev/null ?

  at least the first 100MB are gone.  I can probably live without the
  first partion, but there are many partitions after that, which I hope
  should easily be recoverable.
 
  I tried parted, but it's not working out for me.  Does anybody know of a
  simple partition recovery tool, that would just scan the disk for lost
  partions?

 The best way is to back up your partition table before it gets destroyed.

Very true! 

# sfdisk -d 

is a real saviour.  But make sure you don't save it on the same disk you are 
trying to recover.
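A minimal sketch of that routine (device and destination are hypothetical; the point is that the dump lives on different media):

# dump the table somewhere off-disk
$ sfdisk -d /dev/sda > /mnt/usbstick/sda.sfdisk

# later, restore it
$ sfdisk /dev/sda < /mnt/usbstick/sda.sfdisk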


Thanks!

--
Al



Re: [RFH] Partition table recovery

2007-07-20 Thread Al Boldi
James Lamanna wrote:
 On 7/19/07, Al Boldi [EMAIL PROTECTED] wrote:
  As always, a good friend of mine managed to scratch my partion table by
  cat'ing /dev/full into /dev/sda.  I was able to push him out of the way,
  but at least the first 100MB are gone.  I can probably live without the
  first partion, but there are many partitions after that, which I hope
  should easily be recoverable.
 
  I tried parted, but it's not working out for me.  Does anybody know of a
  simple partition recovery tool, that would just scan the disk for lost
  partions?

 Tried gpart?
 http://www.stud.uni-hannover.de/user/76201/gpart/

This definitely looks like the ticket.  And also rescuept from util-linux.  
There is only one small problem; I have been regularly adding / deleting / 
resizing partitions, which kind of confuses the scanner.  But still, it's 
better than nothing.
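For reference, a plain gpart run is read-only and just prints its guesses; writing anything back is a separate, explicit step (device name hypothetical):

# guess lost partitions without touching the disk
$ gpart /dev/sda

# only after reviewing the guesses, let gpart write a reconstructed table
$ gpart -W /dev/sda /dev/sda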

Anton Altaparmakov wrote:
 parted and its derivatives are pile of crap...  They cause corruption
 to totally healthy systems at the best of times.  Don't go near them.

 Use TestDisk (http://www.cgsecurity.org/wiki/TestDisk) and be happy.
 (-:

This one really worked best, without getting confused about older partitions.


Thanks everybody!


BTW, what's a partion table?

--
Al



Re: [RFH] Partition table recovery

2007-07-20 Thread Al Boldi
Jan-Benedict Glaw wrote:
 On Fri, 2007-07-20 14:29:34 +0300, Al Boldi [EMAIL PROTECTED] wrote:
  But, I want something much more automated.  And the partition table
  backup per partition entry isn't really a bad idea.

 That's called `gpart'.

Oh, gpart is great, but if we had a backup copy of the partition table on 
every partition location on disk, then this backup copy could easily be 
reused to reconstruct the original partition table without further 
searching.  Just like the NetWare approach, and in some respect like the 
ext2/3 superblock backups.


Thanks!

--
Al



[RFH] Partion table recovery

2007-07-19 Thread Al Boldi
As always, a good friend of mine managed to scratch my partion table by 
cat'ing /dev/full into /dev/sda.  I was able to push him out of the way, but 
at least the first 100MB are gone.  I can probably live without the first 
partion, but there are many partitions after that, which I hope should 
easily be recoverable.

I tried parted, but it's not working out for me.  Does anybody know of a 
simple partition recovery tool, that would just scan the disk for lost 
partions?


Thanks!

--
Al



Re: Software RAID5 Horrible Write Speed On 3ware Controller!!

2007-07-18 Thread Al Boldi
Justin Piszcz wrote:
 UltraDense-AS-3ware-R5-9-disks,16G,50676,89,96019,34,46379,9,60267,99,501098,56,248.5,0,16:10:16/64,240,3,21959,84,1109,10,286,4,22923,91,544,6
 UltraDense-AS-3ware-R5-9-disks,16G,49983,88,96902,37,47951,10,59002,99,529121,60,210.3,0,16:10:16/64,250,3,25506,98,1163,10,268,3,18003,71,772,8
 UltraDense-AS-3ware-R5-9-disks,16G,49811,87,95759,35,48214,10,60153,99,538559,61,276.8,0,16:10:16/64,233,3,25514,97,1100,9,279,3,21398,84,839,9

Is there any easy way to decipher these numbers?
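One way: these are bonnie++'s machine-readable CSV rows, and the bon_csv2txt / bon_csv2html helpers shipped with bonnie++ turn them back into labelled tables (file name hypothetical):

# human-readable text table from the raw CSV
$ bon_csv2txt < results.csv

# or an HTML table
$ bon_csv2html < results.csv > results.html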


Thanks!

--
Al



[RFC] VFS: data=ordered (was: [Advocacy] Re: 3ware 9650 tips)

2007-07-16 Thread Al Boldi
Matthew Wilcox wrote:
 On Mon, Jul 16, 2007 at 08:40:00PM +0300, Al Boldi wrote:
  XFS surely rocks, but it's missing one critical component: data=ordered
  And that's one component that's just too critical to overlook for an
  enterprise environment that is built on data-integrity over performance.
 
  So that's the secret why people still use ext3, and XFS' reliance on
  external hardware to ensure integrity is really misplaced.
 
  Now, maybe when we get the data=ordered onto the VFS level, then maybe
  XFS may become viable for the enterprise, and ext3 may cease to be KING.

 Wow, thanks for bringing an advocacy thread onto linux-fsdevel.  Just what
 we wanted.  Do you have any insight into how to get the data=ordered
 onto the VFS level?  Because to me, that sounds like pure nonsense.

Well, conceptually it sounds like a piece of cake; technically, your guess is 
as good as mine.  IIRC, akpm once mentioned something like this.

But seriously, can you think of a technical reason why it shouldn't be 
possible to abstract data=ordered mode out into the VFS?


Thanks!

--
Al



Re: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]

2007-04-11 Thread Al Boldi
Dan Williams wrote:
 In write-through mode bi_end_io is called once writes to the data disk(s)
 and the parity disk have completed.

 In write-back mode bi_end_io is called immediately after data has been
 copied into the stripe cache, which also causes the stripe to be marked
 dirty.

This is not really meaningful, as this is exactly what the page-cache already 
does before being synced.

It may be more reasonable to sync the data-stripe as usual, and only delay 
the parity.  This way you shouldn't have to worry about unclean shutdowns.


Thanks!

--
Al



Re: raid1 does not seem faster

2007-04-03 Thread Al Boldi
Bill Davidsen wrote:
 Al Boldi wrote:
  The problem is that raid1 doesn't do striped reads, but rather uses
  read-balancing per proc.  Try your test with parallel reads; it should
  be faster.
:
:
 It would be nice if reads larger than some size were considered as
 candidates for multiple devices. By setting the readahead larger than
 that value speed increases would be noted for sequential access.

Actually, that's what I thought for a long time too, but as Neil once pointed 
out, for striped reads to be efficient, each chunk should be located 
sequentially, so as to avoid any seeks.  This is only possible by introducing 
some offset layout, as in raid10, which implies a loss of raid1's 
single-disk-image compatibility.

What could be feasible is some kind of an initial burst striped readahead, 
which could possibly improve small reads < (readahead * nr_of_disks).


Thanks!

--
Al



Re: raid1 does not seem faster

2007-04-01 Thread Al Boldi
Jan Engelhardt wrote:
 normally, I'd think that combining drives into a raid1 array would give
 me at least a little improvement in read speed. In my setup however,
 this does not seem to be the case.

 14:16 opteron:/var/log # hdparm -t /dev/sda
  Timing buffered disk reads:  170 MB in  3.01 seconds =  56.52 MB/sec
 14:17 opteron:/var/log # hdparm -t /dev/md3
  Timing buffered disk reads:  170 MB in  3.01 seconds =  56.45 MB/sec
 (and dd_rescue shows the same numbers)

The problem is that raid1 doesn't do striped reads, but rather uses 
read-balancing per proc.  Try your test with parallel reads; it should be 
faster.
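A minimal sketch of such a parallel-read test (device name taken from the hdparm output above; the offsets are arbitrary, they just keep the two readers on different parts of the array):

# two concurrent sequential readers on the same raid1 device
dd if=/dev/md3 of=/dev/null bs=1M count=2048 &
dd if=/dev/md3 of=/dev/null bs=1M count=2048 skip=2048 &
wait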

You could use raid10, but then you lose single-disk-image compatibility.


Thanks!

--
Al


Re: PATA/SATA Disk Reliability paper

2007-02-26 Thread Al Boldi
Mario 'BitKoenig' Holbe wrote:
 Al Boldi [EMAIL PROTECTED] wrote:
   Interesting link.  They seem to point out that SMART does not necessarily
   warn of pending failure.  This is probably worse than not having SMART
   at all, as it gives you the illusion of safety.

 If SMART gives you the illusion of safety, you didn't understand SMART.
 SMART hints *only* the potential presence or occurrence of failures in
 the future, it does not prove the absence of such - and nobody ever said
 it does. It would even be impossible to do that, though (which is easy
 to prove by just utilizing an external damaging tool like a hammer).
 Concluding from that that not having any failure detector at all is
 better than having at least an imperfect one is IMHO completely wrong.

Agreed.  But would you then call it SMART?  Sounds rather DUMB.


Thanks!

--
Al



Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Al Boldi
Mark Hahn wrote:
  In contrast, ever since these holes appeared, drive failures became the
  norm.

 wow, great conspiracy theory!

I think you misunderstand.  I just meant plain old-fashioned mis-engineering.

 maybe the hole is plugged at
 the factory with a substance which evaporates at 1/warranty-period ;)

Actually it's plugged with a thin paper-like filter, which does not seem to 
evaporate easily.

And it's got nothing to do with warranty, although if you get lucky and the 
failure happens within the warranty period, you can probably demand a 
replacement drive to make you feel better.

But remember, the google report mentions a great number of drives failing for 
no apparent reason, not even a smart warning, so failing within the warranty 
period is just pure luck.

 seriously, isn't it easy to imagine a bladder-like arrangement that
 permits equilibration without net flow?  disk spec-sheets do limit
 this - I checked the seagate 7200.10: 10k feet operating, 40k max.
 amusingly -200 feet is the min either way...

Well, it looks like filtered net flow on wd's.

What's it look like on seagate?

 Does anyone remember that you had to let your drives acclimate to
  your machine room for a day or so before you used them?
 
  The problem is, that's not enough; the room temperature/humidity has to
  be controlled too.  In a desktop environment, that's not really
  feasible.

 5-90% humidity, operating, 95% non-op, and 30%/hour.  seems pretty easy
 to me.  in fact, I frequently ask people to justify the assumption that
 a good machineroom needs tight control over humidity.  (assuming, like
 most machinerooms, you aren't frequently handling the innards.)

I agree, but reality has a different opinion, and it may take down that 
drive, specs or no specs.

A good way to deal with reality is to find the real reasons for failure.  
Once these reasons are known, engineering quality drives becomes, thank GOD, 
really rather easy.


Thanks!

--
Al



Re: PATA/SATA Disk Reliability paper

2007-02-25 Thread Al Boldi
Mark Hahn wrote:
   - disks are very complicated, so their failure rates are a
   combination of conditional failure rates of many components.
   to take a fully reductionist approach would require knowing
   how each of ~1k parts responds to age, wear, temp, handling, etc.
   and none of those can be assumed to be independent.  those are the
   real reasons, but most can't be measured directly outside a lab
   and the number of combinatorial interactions is huge.

It seems to me that the biggest problem is the 7.2k+ rpm platters 
themselves, especially with those heads flying closely on top of them.  So, 
we can probably forget the rest of the ~1k non-moving parts, as they have 
proven to be pretty reliable, most of the time.

   - factorial analysis of the data.  temperature is a good
   example, because both low and high temperature affect AFR,
   and in ways that interact with age and/or utilization.  this
   is a common issue in medical studies, which are strikingly
   similar in design (outcome is subject or disk dies...)  there
   is a well-established body of practice for factorial analysis.

Agreed.  We definitely need more sensors.

   - recognition that the relative results are actually quite good,
   even if the absolute results are not amazing.  for instance,
   assume we have 1k drives, and a 10% overall failure rate.  using
   all SMART but temp detects 64 of the 100 failures and misses 36.
   essentially, the failure rate is now .036.  I'm guessing that if
   utilization and temperature were included, the rate would be much
   lower.  feedback from active testing (especially scrubbing)
   and performance under the normal workload would also help.

Are you saying you are content with premature disk failure, as long as 
there is a SMART warning sign?

If so, then I don't think that is enough.

I think the sensors should trigger some kind of shutdown mechanism as a 
protective measure, when some threshold is reached.  Just like the 
protective measure you see for CPUs to prevent meltdown.

Thanks!

--
Al



Re: PATA/SATA Disk Reliability paper

2007-02-23 Thread Al Boldi
Stephen C Woods wrote:
   So drives do need to be ventilated, not so much from worry about exploding,
 but rather subtle distortion of the case as the atmospheric pressure
 changes.

I have a '94 Caviar without any apparent holes; and as a bonus, the drive 
still works.

In contrast, ever since these holes appeared, drive failures became the norm.

 Does anyone remember that you had to let your drives acclimate to your
 machine room for a day or so before you used them?

The problem is, that's not enough; the room temperature/humidity has to be 
controlled too.  In a desktop environment, that's not really feasible.


Thanks!

--
Al



Re: PATA/SATA Disk Reliability paper

2007-02-20 Thread Al Boldi
Eyal Lebedinsky wrote:
 Disks are sealed, and a desiccant is present in each to keep humidity
 down. If you ever open a disk drive (e.g. for the magnets, or the mirror
 quality platters, or for fun) then you can see the desiccant sachet.

Actually, they aren't sealed 100%.  

On wd's at least, there is a hole with a warning printed on its side:

  DO NOT COVER HOLE BELOW
  V   V  V  V

  o


In contrast, older models from the last century, don't have that hole.

 Al Boldi wrote:
 
  If there is one thing to watch out for, it is dew.
 
  I remember video machines sensing for dew, so do any drives sense for
  dew?


Thanks!

--
Al



Re: PATA/SATA Disk Reliability paper

2007-02-19 Thread Al Boldi
Richard Scobie wrote:
 Thought this paper may be of interest. A study done by Google on over
 100,000 drives they have/had in service.

 http://labs.google.com/papers/disk_failures.pdf

Interesting link.  They seem to point out that SMART does not necessarily warn 
of pending failure.  This is probably worse than not having SMART at all, as 
it gives you the illusion of safety.

If there is one thing to watch out for, it is dew.

I remember video machines sensing for dew, so do any drives sense for dew?


Thanks!

--
Al



Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)

2007-01-12 Thread Al Boldi
Justin Piszcz wrote:
 RAID 5 TWEAKED: 1:06.41 elapsed @ 60% CPU

 This should be 1:14 not 1:06(was with a similarly sized file but not the
 same) the 1:14 is the same file as used with the other benchmarks.  and to
 get that I used 256mb read-ahead and 16384 stripe size ++ 128
 max_sectors_kb (same size as my sw raid5 chunk size)

max_sectors_kb is probably your key. On my system I get twice the read 
performance by just reducing max_sectors_kb from default 512 to 192.

Can you do a fresh reboot to shell and then:
$ cat /sys/block/hda/queue/*
$ cat /proc/meminfo
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/hda of=/dev/null bs=1M count=10240
$ echo 192 > /sys/block/hda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/hda of=/dev/null bs=1M count=10240


Thanks!

--
Al



Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)

2007-01-12 Thread Al Boldi
Justin Piszcz wrote:
 Btw, max sectors did improve my performance a little bit but
 stripe_cache+read_ahead were the main optimizations that made everything
 go faster by about ~1.5x.   I have individual bonnie++ benchmarks of
 [only] the max_sector_kb tests as well, it improved the times from
 8min/bonnie run -> 7min 11 seconds or so, see below and then after that is
 what you requested.

 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=/dev/md3 of=/dev/null bs=1M count=10240
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 399.352 seconds, 26.9 MB/s
 # for i in sde sdg sdi sdk; do echo 192 > /sys/block/$i/queue/max_sectors_kb; echo Set /sys/block/$i/queue/max_sectors_kb to 192kb; done
 Set /sys/block/sde/queue/max_sectors_kb to 192kb
 Set /sys/block/sdg/queue/max_sectors_kb to 192kb
 Set /sys/block/sdi/queue/max_sectors_kb to 192kb
 Set /sys/block/sdk/queue/max_sectors_kb to 192kb
 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=/dev/md3 of=/dev/null bs=1M count=10240
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 398.069 seconds, 27.0 MB/s

 Awful performance with your numbers/drop_caches settings.. !

Can you repeat with /dev/sda only?

With fresh reboot to shell, then:
$ cat /sys/block/sda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/sda of=/dev/null bs=1M count=10240

$ echo 192 > /sys/block/sda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/sda of=/dev/null bs=1M count=10240

$ echo 128 > /sys/block/sda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/sda of=/dev/null bs=1M count=10240

 What were your tests designed to show?

A problem with the block-io.


Thanks!

--
Al



Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)

2007-01-12 Thread Al Boldi
Justin Piszcz wrote:
 On Sat, 13 Jan 2007, Al Boldi wrote:
  Justin Piszcz wrote:
   Btw, max sectors did improve my performance a little bit but
   stripe_cache+read_ahead were the main optimizations that made
   everything go faster by about ~1.5x.   I have individual bonnie++
   benchmarks of [only] the max_sector_kb tests as well, it improved the
    times from 8min/bonnie run -> 7min 11 seconds or so, see below and
   then after that is what you requested.
 
  Can you repeat with /dev/sda only?

 For sda-- (is a 74GB raptor only)-- but ok.

Do you get the same results for the 150GB-raptor on sd{e,g,i,k}?

 # uptime
  16:25:38 up 1 min,  3 users,  load average: 0.23, 0.14, 0.05
 # cat /sys/block/sda/queue/max_sectors_kb
 512
 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=/dev/sda of=/dev/null bs=1M count=10240
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 150.891 seconds, 71.2 MB/s
 # echo 192 > /sys/block/sda/queue/max_sectors_kb
 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=/dev/sda of=/dev/null bs=1M count=10240
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 150.192 seconds, 71.5 MB/s
 # echo 128 > /sys/block/sda/queue/max_sectors_kb
 # echo 3 > /proc/sys/vm/drop_caches
 # dd if=/dev/sda of=/dev/null bs=1M count=10240
 10240+0 records in
 10240+0 records out
 10737418240 bytes (11 GB) copied, 150.15 seconds, 71.5 MB/s


 Does this show anything useful?

Probably a latency issue.  md is highly latency sensitive.

What CPU type/speed do you have?  Bootlog/dmesg?


Thanks!

--
Al



Re: Propose of enhancement of raid1 driver

2006-10-30 Thread Al Boldi
Mario 'BitKoenig' Holbe wrote:
 Al Boldi [EMAIL PROTECTED] wrote:
  But what still isn't clear, why can't raid1 use something like the
  raid10 offset=2 mode?

 RAID1 has equal data on all mirrors, so sooner or later you have to seek
 somewhere - no matter how you layout the data on each mirror.

Don't underestimate the effects mere layout can have on multi-disk array 
performance, despite it being highly hw dependent.

The best approach would probably involve a user-configurable layout table, to 
tune it to the specific hw.


Thanks!

--
Al



Re: Propose of enhancement of raid1 driver

2006-10-30 Thread Al Boldi
Mario 'BitKoenig' Holbe wrote:
 Al Boldi [EMAIL PROTECTED] wrote:
  Don't underestimate the effects mere layout can have on multi-disk array
  performance, despite it being highly hw dependent.

 I can't see the difference between equal mirrors and somehow interleaved
 layout on RAID1. Since you have to seek anyways, there should be no
 difference between both approaches once you read big enough chunks. The
 problem with reading big chunks is: you probably read far too much when
 you don't really need the data you did read.

Think adaptive.

 And vice versa: when you
 don't read big chunks, it doesn't matter how your data is laid out.

Think tracks and heads, physical that is.


Thanks!

--
Al



Re: Propose of enhancement of raid1 driver

2006-10-28 Thread Al Boldi
Mario 'BitKoenig' Holbe wrote:
 Neil Brown [EMAIL PROTECTED] wrote:
  Skipping over blocks within a track is no faster than reading blocks
  in the track, so you would need to make sure that your chunk size is

 Not even no faster but probably even slower.

Surely slower, on conventional hds anyway.

But what still isn't clear, why can't raid1 use something like the raid10 
offset=2 mode?
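For comparison, md's raid10 personality will happily run on just two disks with the offset layout, which is about as close as md gets to a striped mirror (device names hypothetical; --layout=o2 needs an mdadm release that already knows the offset layout):

mdadm --create /dev/md0 --level=10 --layout=o2 --raid-devices=2 /dev/sda1 /dev/sdb1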


Thanks!

--
Al



Re: Large single raid and XFS or two small ones and EXT3?

2006-06-23 Thread Al Boldi
Chris Allen wrote:
 Francois Barre wrote:
  2006/6/23, PFC [EMAIL PROTECTED]:
  - XFS is faster and fragments less, but make sure you have a
  good UPS
 
  Why a good UPS ? XFS has a good strong journal, I never had an issue
  with it yet... And believe me, I did have some dirty things happening
  here...
 
  - ReiserFS 3.6 is mature and fast, too, you might consider it
  - ext3 is slow if you have many files in one directory, but
  has more
  mature tools (resize, recovery etc)
 
  XFS tools are kind of mature also. Online grow, dump, ...
 
  I'd go with XFS or Reiser.
 
  I'd go with XFS. But I may be kind of fanatic...

 Strange that, whatever the filesystem, you get equal numbers of people
 saying they have never lost a single byte and people who have had
 horrible corruption and would never touch it again. We stopped using XFS
 about a year ago because we were getting kernel stack space panics under
 heavy load over NFS. It looks like the time has come to give it another
 try.

If you are keen on data integrity then don't touch any fs w/o data=ordered.

ext3 is still king wrt data=ordered, albeit slow.

Now XFS is fast, but doesn't support data=ordered.  It seems that their 
solution to the problem is to pass the burden onto hw by using barriers.  
Maybe XFS can get away with this.  Maybe.
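As a concrete illustration of that trade-off (mount points are hypothetical, and the barrier option applies to kernels of that era): ext3's ordering is a mount option, while XFS leans on write barriers reaching the hardware:

# ext3: journal-enforced data ordering (also the default)
mount -t ext3 -o data=ordered /dev/md0 /mnt/data

# xfs: integrity depends on barriers being honoured by the disks/controller
mount -t xfs -o barrier /dev/md1 /mnt/scratch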

Thanks!

--
Al



Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10

2006-05-02 Thread Al Boldi
Neil Brown wrote:
 On Tuesday May 2, [EMAIL PROTECTED] wrote:
  NeilBrown wrote:
   The industry standard DDF format allows for a stripe/offset layout
   where data is duplicated on different stripes. e.g.
  
 A  B  C  D
 D  A  B  C
 E  F  G  H
 H  E  F  G
  
   (columns are drives, rows are stripes, LETTERS are chunks of data).
 
  Presumably, this is the case for --layout=f2 ?

 Almost.  mdadm doesn't support this layout yet.
 'f2' is a similar layout, but the offset stripes are a lot further
 down the drives.
 It will possibly be called 'o2' or 'offset2'.

  If so, would --layout=f4 result in a 4-mirror/striped array?

 o4 on a 4 drive array would be

A  B  C  D
D  A  B  C
C  D  A  B
B  C  D  A
E  F  G  H


Yes, so would this give us 4 physically duplicate mirrors?
If not, would it be possible to add a far-offset mode to yield such a layout?

  Also, would it be possible to have a staged write-back mechanism across
  multiple stripes?

 What exactly would that mean?

Write the first stripe, then write subsequent duplicate stripes based on idle 
with a max delay for each delayed stripe.

 And what would be the advantage?

Faster burst writes, probably.

Thanks!

--
Al



Re: Help needed - RAID5 recovery from Power-fail

2006-04-04 Thread Al Boldi
Neil Brown wrote:
 2 devices in a raid5??  Doesn't seem a lot of point it being raid5
 rather than raid1.

Wouldn't a 2-dev raid5 imply a striped block mirror (i.e. faster) rather than 
a raid1 duplicate block mirror (i.e. slower)?
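For what it's worth, mdadm will create both, so the difference is easy to measure directly (device names hypothetical):

# 2-device raid5: chunked layout; with two devices the parity block is just a
# copy of the data block, rotated across the disks
mdadm --create /dev/md0 --level=5 --raid-devices=2 /dev/sda1 /dev/sdb1

# classic raid1 mirror for comparison
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1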

Thanks!

--
Al



Re: [PATCH 2.6.15-git9a] aoe [1/1]: do not stop retransmit timer when device goes down

2006-01-27 Thread Al Boldi
Ed L. Cashin wrote:
 On Thu, Jan 26, 2006 at 01:04:37AM +0300, Al Boldi wrote:
  Ed L. Cashin wrote:
   This patch is a bugfix that follows and depends on the
   eight aoe driver patches sent January 19th.
 
  Will they also fix this?
  Or is this an md bug?

 No, this patch fixes a bug that would cause an AoE device to be
 totally unusable, so I think mdadm or mkraid would get an error that
 the device was not available before it tried to make a new md device.

  It only happens with aoe.

 It looks like in setting up the raid, sysfs_create_link probably has
 this going off:

 BUG_ON(!kobj || !kobj->dentry || !name);

  Also, why is aoe slower than nbd?

 It wasn't when I tried it.  The userland vblade is slow.  Maybe that's
 affecting your results?

Why is the userland vblade server slower than the userland nbd-server?

Thanks!

--
Al



Re: io performance...

2006-01-19 Thread Al Boldi
Jeff V. Merkey wrote:
 Jens Axboe wrote:
 On Mon, Jan 16 2006, Jeff V. Merkey wrote:
 Max Waterman wrote:
 I've noticed that I consistently get better (read) numbers from kernel 
 2.6.8 than from later kernels.
 
 To open the bottlenecks, the following works well.  Jens will shoot me
  -#define BLKDEV_MIN_RQ   4
  -#define BLKDEV_MAX_RQ   128   /* Default maximum */
  +#define BLKDEV_MIN_RQ   4096
  +#define BLKDEV_MAX_RQ   8192  /* Default maximum */
 
 Yeah I could shoot you. However I'm more interested in why this is
 necessary, eg I'd like to see some numbers from you comparing:
 
 - Doing
  # echo 8192 > /sys/block/<dev>/queue/nr_requests
   for each drive you are accessing.
 
 The BLKDEV_MIN_RQ increase is just silly and wastes a huge amount of
 memory for no good reason.

 Yep. I build it into the kernel to save the trouble of sending it to proc.
 Jens' recommendation will work just fine. It has the same effect of
 increasing the max requests outstanding.

Your suggestion doesn't do anything here on 2.6.15, but
echo 192 > /sys/block/<dev>/queue/max_sectors_kb
echo 192 > /sys/block/<dev>/queue/read_ahead_kb
works wonders!

I don't know why, but anything less than 64 or more than 256 makes the queue 
collapse miserably, causing some strange __copy_to_user calls?!?!?

Also, it seems that changing the kernel HZ has some drastic effects on the 
queues.  A simple lilo gets delayed 400% and 200% using 100HZ and 250HZ 
respectively.
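For convenience, a minimal sketch for applying those two settings to every drive in one go (the sd? glob is an assumption about the device naming):

# set request size and readahead for all sd* queues
for q in /sys/block/sd?/queue
do
    echo 192 > $q/max_sectors_kb
    echo 192 > $q/read_ahead_kb
done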

--
Al



Re: RAID0 performance question

2005-12-19 Thread Al Boldi
JaniD++ wrote:
 For me, the performance bottleneck is clearly in the RAID0 layer, which is
 used purely as a concatenator to join the 4x2TB into 1x8TB.

Did you try running RAID0 over nbd directly and found it to be faster?

IIRC, stacking raid modules does need a considerable amount of tuning, and 
even then it does not scale linearly.

Maybe NeilBrown can help?

--
Al



Re: Where is the performance bottleneck?

2005-09-02 Thread Al Boldi
Holger Kiehl wrote:
 top - 08:39:11 up  2:03,  2 users,  load average: 23.01, 21.48, 15.64
 Tasks: 102 total,   2 running, 100 sleeping,   0 stopped,   0 zombie
 Cpu(s):  0.0% us, 17.7% sy,  0.0% ni,  0.0% id, 78.9% wa,  0.2% hi,  3.1% si
 Mem:   8124184k total,  8093068k used,    31116k free,  7831348k buffers
 Swap: 15631160k total,    13352k used, 15617808k free,     5524k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  3423 root      18   0 55204  460  392 R 12.0  0.0   1:15.55 dd
  3421 root      18   0 55204  464  392 D 11.3  0.0   1:17.36 dd
  3418 root      18   0 55204  464  392 D 10.3  0.0   1:10.92 dd
  3416 root      18   0 55200  464  392 D 10.0  0.0   1:09.20 dd
  3420 root      18   0 55204  464  392 D 10.0  0.0   1:10.49 dd
  3422 root      18   0 55200  460  392 D  9.3  0.0   1:13.58 dd
  3417 root      18   0 55204  460  392 D  7.6  0.0   1:13.11 dd
   158 root      15   0     0    0    0 D  1.3  0.0   1:12.61 kswapd3
   159 root      15   0     0    0    0 D  1.3  0.0   1:08.75 kswapd2
   160 root      15   0     0    0    0 D  1.0  0.0   1:07.11 kswapd1
  3419 root      18   0 51096  552  476 D  1.0  0.0   1:17.15 dd
   161 root      15   0     0    0    0 D  0.7  0.0   0:54.46 kswapd0

 A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd
 working so hard? Is that correct?

Actually, kswapd is another problem (see the "Kswapd Flaw" thread), though it 
has little impact on your problem here.  Basically, kswapd tries very hard, 
maybe even too hard, to fulfill a request for memory: when the buffer/cache 
pages are full, kswapd tries to find some more unused memory, and when it 
finds none it starts recycling the buffer/cache pages.  Which is OK, but it 
only does this after searching for swappable memory, which wastes CPU cycles.

This can be tuned a little, but not much, by adjusting /proc/sys/vm/..., or by 
renicing kswapd to the lowest priority, which may cause other problems.
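A minimal sketch of the knobs being alluded to (the values are illustrative, not recommendations; the kswapd0-3 names match the top output above):

echo 10 > /proc/sys/vm/swappiness         # prefer reclaiming cache over swapping out procs
echo 2  > /proc/sys/vm/overcommit_memory  # strict accounting, i.e. overcommit "off"
renice 19 -p $(pidof kswapd0 kswapd1 kswapd2 kswapd3)   # lowest priority, with the caveat above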

Things get really bad when procs start asking for more memory than is 
available, causing kswapd to take the liberty of paging out running procs in 
the hope that these procs won't come back later.  So when they do come back 
something like a wild goose chase begins.  This is also known as OverCommit. 

This is closely related to the dreaded OOM-killer, which occurs when the 
system cannot satisfy a memory request for a returning proc, causing the VM 
to start killing in an unpredictable manner.

Turning OverCommit off should solve this problem but it doesn't.

This is why it is recommended to run the system always with swap enabled even 
if you have tons of memory, which really only pushes the problem out of the 
way until you hit the dead end and the wild goose chase begins again.

Sadly 2.6.13 did not fix this either.

Although this description only vaguely defines the problem from an end-user 
pov, the actual semantics may be quite different.

--
Al



Re: [PATCH md 006 of 6] Add write-behind support for md/raid1

2005-08-12 Thread Al Boldi
Paul Clements wrote:
 Al Boldi wrote:
  NeilBrown wrote:
 If a device is flagged 'WriteMostly' and the array has a bitmap,
 and the bitmap superblock indicates that write_behind is allowed,
 then write_behind is enabled for WriteMostly devices.
 
  Nice, but why is it dependent on WriteMostly?

 WriteMostly is just a flag that tells us which devices will get the
 write-behinds, and which will not. You'll be able to mix any
 combination of WriteMostly devices and normal devices in a raid1.


Yes, but doesn't WriteMostly imply ReadDelay?
If so, doesn't that mean that WriteBehind is dependent on ReadDelay?

--
Al



RE: Multiplexed RAID-1 mode

2005-08-01 Thread Al Boldi
Neil Brown wrote: {
On Sunday July 31, [EMAIL PROTECTED] wrote:
 
 Multiplexing read/write requests would certainly improve performance 
 ala RAID-0 (-offset overhead).
 During reads the same RAID-0 code (+mirroring offset) could be used.
 During writes though, this would imply delayed mirroring.

But what exactly do you mean by 'delayed mirroring'?
Are you suggesting that the write request completes after only writing to
one mirror?  If so, which one?  Wouldn't this substantially reduce the value
of mirroring?
}

Think of it as a _smart_ resync running on idle.
Should be an option though!

--
Al



Multiplexed RAID-1 mode

2005-07-31 Thread Al Boldi
Gordon Henderson wrote: {
On Sat, 30 Jul 2005, Jeff Breidenbach wrote:

 I just ran a Linux software RAID-1 benchmark with some 500GB SATA 
 drives in NCQ mode, along with a non-RAID control. Details are here 
 for those interested.

   http://www.jab.org/raid-bench/

The results you get are about what I get on various systems - essentially
with RAID-1 you get about the same speed as a single drive will get. 

ns1:/var/tmp# hdparm -tT /dev/md1 /dev/sda1 /dev/sdb1

/dev/md1:
 Timing cached reads:   4116 MB in  2.00 seconds = 2058.31 MB/sec
 Timing buffered disk reads:  174 MB in  3.00 seconds =  57.99 MB/sec

/dev/sda1:
 Timing cached reads:   4096 MB in  2.00 seconds = 2048.31 MB/sec
 Timing buffered disk reads:  176 MB in  3.03 seconds =  58.11 MB/sec

/dev/sdb1:
 Timing cached reads:   4116 MB in  2.00 seconds = 2057.28 MB/sec
 Timing buffered disk reads:  176 MB in  3.02 seconds =  58.27 MB/sec
}

Multiplexing read/write requests would certainly improve performance ala
RAID-0 (-offset overhead).
During reads the same RAID-0 code (+mirroring offset) could be used.
During writes though, this would imply delayed mirroring.

--
Al

