Re: LVM performance (was: Re: RAID5 to RAID6 reshape?)

2008-02-19 Thread Jon Nelson
On Feb 19, 2008 1:41 PM, Oliver Martin
[EMAIL PROTECTED] wrote:
 Janek Kozicki schrieb:
  hold on. This might be related to raid chunk positioning with respect
  to LVM chunk positioning. If they interfere there indeed may be some
  performance drop. Best to make sure that those chunks are aligned together.

 Interesting. I'm seeing a 20% performance drop too, with default RAID
 and LVM chunk sizes of 64K and 4M, respectively. Since 64K divides 4M
 evenly, I'd think there shouldn't be such a big performance penalty.
 It's not like I care that much, I only have 100 Mbps ethernet anyway.
 I'm just wondering...

 $ hdparm -t /dev/md0

 /dev/md0:
   Timing buffered disk reads:  148 MB in  3.01 seconds =  49.13 MB/sec

 $ hdparm -t /dev/dm-0

 /dev/dm-0:
   Timing buffered disk reads:  116 MB in  3.04 seconds =  38.20 MB/sec

I'm getting better performance on a LV than on the underlying MD:

# hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  408 MB in  3.01 seconds = 135.63 MB/sec
# hdparm -t /dev/raid/multimedia

/dev/raid/multimedia:
 Timing buffered disk reads:  434 MB in  3.01 seconds = 144.04 MB/sec
#

md0 is a 3-disk raid5 (64k chunk, alg. 2) with a bitmap, built from
7200rpm sata drives from several manufacturers.
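For anyone who wants to check whether their LVM data area actually lines
up with the underlying raid chunks, a rough sketch (device names as used
in this thread; the pvs fields assume a reasonably recent LVM2):

# report the raid chunk size and where LVM starts placing data on the PV
mdadm --detail /dev/md0 | grep -i 'chunk size'
pvs -o pv_name,pe_start --units k

# then time both layers the same way
hdparm -t /dev/md0
hdparm -t /dev/dm-0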



-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 on three discs - few questions.

2008-02-06 Thread Jon Nelson
On Feb 6, 2008 12:43 PM, Bill Davidsen [EMAIL PROTECTED] wrote:

 Can you create a raid10 with one drive missing and add it later? I
 know, I should try it when I get a machine free... but I'm being lazy today.

Yes you can. With 3 drives, however, performance will be awful (at
least with layout far, 2 copies).

IMO raid10,f2 is a great balance of speed and redundancy.
It's faster than raid5 for reading, about the same for writing; it's
even potentially faster than raid0 for reading, actually.
With 3 disks one should be able to get 3.0 times the speed of one
disk, or slightly more, and each stripe involves only *one* disk
instead of 2 as it does with raid5.
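For reference, a sketch of doing exactly that (device names are only
examples):

mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
    /dev/sda1 /dev/sdb1 missing
# later, once the third disk is available:
mdadm /dev/md0 --add /dev/sdc1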

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 on three discs - few questions.

2008-02-03 Thread Jon Nelson
On Feb 3, 2008 5:29 PM, Janek Kozicki [EMAIL PROTECTED] wrote:
 Neil Brown said: (by the date of Mon, 4 Feb 2008 10:11:27 +1100)

 wow, thanks for quick reply :)

   3. Another thing - would raid10,far=2 work when three drives are used?
  Would it increase the read performance?
 
  Yes.

 is far=2 the most I could do to squeeze every possible MB/sec
 performance in raid10 on three discs ?

In my opinion, yes. Its sequential read characteristics place it at,
or better than, raid0. Writing is slower, about the speed of a
single disk, give or take.  The other two raid10 layouts (near and
offset) are very close in performance to each other - nearly identical
for reading/writing.
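If you want to see the difference on your own disks, a quick sketch
(destructive to the listed partitions, which are only examples):

for layout in n2 f2 o2; do
    mdadm --create /dev/md9 --level=10 --layout=$layout --raid-devices=3 \
        --run /dev/sdb9 /dev/sdc9 /dev/sdd9
    # wait for the initial sync to finish (watch /proc/mdstat) before timing
    hdparm -t /dev/md9
    mdadm --stop /dev/md9
done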

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


stopped array, but /sys/block/mdN still exists.

2008-01-02 Thread Jon Nelson
This isn't a high priority issue or anything, but I'm curious:

I --stop(ped) an array but /sys/block/md2 remained largely populated.
Is that intentional?

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10: unfair disk load?

2007-12-23 Thread Jon Nelson
On 12/23/07, maobo [EMAIL PROTECTED] wrote:
 Hi,all

 Yes, I agree some of you. But in my test both using real life trace and
 Iometer test I found that for absolutely read requests, RAID0 is better than
 RAID10 (with same data disks: 3 disks in RAID0, 6 disks in RAID10). I don't
 know why this happen.

 I read the code of RAID10 and RAID0 carefully and experiment with printk to
 track the process flow. The only conclusion I report is the complexity of
 RAID10 to process the read request. While for RAID0 it is so simple that it
 does the read more effectively.

 How do you think about this of absolutely read requests?
 Thank you very much!

My own tests on identical hardware (same mobo, disks, partitions,
everything) and same software, with the only difference being how
mdadm is invoked (level and, where applicable, layout), show that
raid0 is about 15% faster on reads than the very fast raid10,f2
layout. raid10,f2 gets approximately 50% of the write speed of
raid0.

Does this make sense?
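A sketch of the kind of sequential test that produces numbers like these
(not necessarily the exact invocations used; sizes and device name are
examples):

dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct     # read test
dd if=/dev/zero of=/dev/md0 bs=1M count=4096 oflag=direct     # write test (destroys data!)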

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


raid10 performance question

2007-12-23 Thread Jon Nelson
I've found in some tests that raid10,f2 gives me the best I/O of any
raid5 or raid10 format. However, the performance of raid10,o2 and
raid10,n2 in degraded mode is nearly identical to their non-degraded
performance (for me, this hovers around 100MB/s).  raid10,f2's
degraded-mode *write* performance is likewise indistinguishable from
its non-degraded performance. It's the raid10,f2 *read*
performance in degraded mode that is strange - I get almost exactly
50% of the non-degraded read performance. Why is that?
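For anyone wanting to reproduce this, a sketch of one way to compare
degraded and non-degraded numbers (device names are examples, and the
--fail really does degrade the array):

hdparm -t /dev/md0                  # non-degraded read speed
mdadm /dev/md0 --fail /dev/sdc1
hdparm -t /dev/md0                  # degraded read speed
mdadm /dev/md0 --remove /dev/sdc1
mdadm /dev/md0 --add /dev/sdc1      # re-add triggers a resync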


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --stop goes off and never comes back?

2007-12-22 Thread Jon Nelson
On 12/22/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Wednesday December 19, [EMAIL PROTECTED] wrote:
  On 12/19/07, Jon Nelson [EMAIL PROTECTED] wrote:
   On 12/19/07, Neil Brown [EMAIL PROTECTED] wrote:
On Tuesday December 18, [EMAIL PROTECTED] wrote:

 I tried to stop the array:

 mdadm --stop /dev/md2

 and mdadm never came back. It's off in the kernel somewhere. :-(

 Looking at your stack traces, you have the mdadm -S holding
 an md lock and trying to get a sysfs lock as part of tearing down the
 array, and 'hald' is trying to read some attribute in
/sys/block/md
 and is holding the sysfs lock and trying to get the md lock.
 A classic AB-BA deadlock.

 
  NOTE: kernel is stock openSUSE 10.3 kernel, x86_64, 2.6.22.13-0.3-default.
 

 It is fixed in mainline with some substantial changes to sysfs.
 I don't imagine they are likely to get back ported to openSUSE, but
 you could try logging a bugzilla if you like.

Nah - I'm eagerly awaiting new kernels anyway as I have some network
cards that work much better (read: they work) with 2.6.24rc3+.

 The 'hald' process is interruptible and killing it would release the
 deadlock.

Cool.

 I suspect you have to be fairly unlucky to lose the race but it is
 obviously quite possible.

Sometimes we are all a little unlucky. In my case, it cost me a reboot
or, in others, nothing at all. Fortunately this was not a production
system with lots of users.

 I don't think there is anything I can do on the md side to avoid the
 bug.

In this situation I don't think such a change would be warranted anyway.
Thanks again for looking at this. I'm a big believer in the 'canary in
a coal mine' mentality - some problems may be indications of much more
serious issues - but in this case it would appear that the issue has
already been taken care of. Happy Holidays.

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10: unfair disk load?

2007-12-22 Thread Jon Nelson
On 12/22/07, Janek Kozicki [EMAIL PROTECTED] wrote:
 Michael Tokarev said: (by the date of Fri, 21 Dec 2007 23:56:09 +0300)

  Janek Kozicki wrote:
   what's your kernel version? I recall that recently there have been
   some works regarding load balancing.
 
  It was in my original email:
  The kernel is 2.6.23
 
  Strange I missed the new raid10 development you
  mentioned (I follow linux-raid quite closely).
  What change(s) you're referring to?

 oh sorry it was a patch for raid1, not raid10:

   http://www.spinics.net/lists/raid/msg17708.html

 I'm wondering if it could be adapted for raid10 ...

 Konstantin Sharlaimov said: (by the date of Sat, 03 Nov 2007
 20:08:42 +1000)

  This patch adds RAID1 read balancing to device mapper. A read operation
  that is close (in terms of sectors) to a previous read or write goes to
  the same mirror.

Looking at the raid10 source, it appears to do some read balancing
already.
For raid10,f2 on a 3-drive array I've found really impressive
performance numbers - as good as raid0. Write speeds are a bit lower,
but still rather better than raid5 on the same devices.

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Justin Piszcz [EMAIL PROTECTED] wrote:


 On Wed, 19 Dec 2007, Mattias Wadenstein wrote:
  From that setup it seems simple, scrap the partition table and use the
  disk device for raid. This is what we do for all data storage disks (hw 
  raid)
  and sw raid members.
 
  /Mattias Wadenstein
 

 Is there any downside to doing that?  I remember when I had to take my

There is one (just pointed out to me yesterday): having the partition,
and having it labeled as raid, makes identification quite a bit easier
for humans and for software, too.
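For example (a sketch; the device and partition number are made up):

sfdisk --change-id /dev/sdb 3 fd    # mark partition 3 as "Linux raid autodetect"
mdadm --examine --scan              # list partitions that carry md superblocks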

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Bill Davidsen [EMAIL PROTECTED] wrote:
 As other posts have detailed, putting the partition on a 64k aligned
 boundary can address the performance problems. However, a poor choice of
 chunk size, cache_buffer size, or just random i/o in small sizes can eat
 up a lot of the benefit.

 I don't think you need to give up your partitions to get the benefit of
 alignment.

How might that benefit be realized?
Assume I have 3 disks, /dev/sd{b,c,d} all partitioned identically with
4 partitions, and I want to use /dev/sd{b,c,d}3 for a new SW raid.

What sequence of steps can I take to ensure that my raid is aligned on
a 64K boundary?
What effect do the different superblock formats have, if any, in this situation?
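One approach, sketched here with made-up numbers (this rewrites the
partition table, so double-check before trusting data to it): lay the
partitions out in sector units so each raid partition starts on a
multiple of 128 sectors (128 * 512 B = 64 KiB), e.g. with sfdisk:

sfdisk -uS -l /dev/sdb                 # see where the current partitions start
echo '128,,fd' | sfdisk -uS /dev/sdb   # example: one raid partition starting at sector 128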


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-19 Thread Jon Nelson
On 12/19/07, Michal Soltys [EMAIL PROTECTED] wrote:
 Justin Piszcz wrote:
 
  Or is there a better way to do this, does parted handle this situation
  better?
 
  What is the best (and correct) way to calculate stripe-alignment on the
  RAID5 device itself?
 
 
  Does this also apply to Linux/SW RAID5?  Or are there any caveats that
  are not taken into account since it is based in SW vs. HW?
 
  ---

 In case of SW or HW raid, when you place a raid-aware filesystem directly on
 it, I don't see any potential problems.

 Also, if md's superblock version/placement actually mattered, it'd be pretty
 strange. The space available for actual use - be it partitions or a filesystem
 directly - should always be nicely aligned. I don't know that for sure though.

 If you use SW partitionable raid, or HW raid with partitions, then you would
 have to align it on a chunk boundary manually. Any self-respecting OS
 shouldn't complain that a partition doesn't start on a cylinder boundary these
 days. LVM can complicate life a bit too - if you want its volumes to be
 chunk-aligned.

That, for me, is the next question: how can one educate LVM about the
underlying block device so that logical volumes carved out of that
space align properly? Many of us have experienced 30% (or so)
performance losses for the convenience of LVM (and mighty convenient
it is).
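A sketch of the sort of thing I mean (the --metadatasize trick is what
I've seen suggested elsewhere; treat the numbers as examples and verify
with pvs):

pvs -o pv_name,pe_start --units k        # where does LVM start placing data?
pvcreate --metadatasize 250k /dev/md0    # reportedly rounds the data start up to 256k
vgcreate -s 4M raid /dev/md0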


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --stop goes off and never comes back?

2007-12-19 Thread Jon Nelson
On 12/19/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Tuesday December 18, [EMAIL PROTECTED] wrote:
  This just happened to me.
  Create raid with:
 
  mdadm --create /dev/md2 --level=raid10 --raid-devices=3
  --spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3
 
  cat /proc/mdstat
 
  md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
[==..]  resync = 14.6% (859968/5855424)
  finish=1.3min speed=61426K/sec
 
  Some log messages:
 
  Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
  starting background reconstruction
  Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
  of 3 devices
  Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
  Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
  KB/sec/disk.
  Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
  bandwidth (but not more than 200000 KB/sec) for resync.
  Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
  5855424 blocks.
  Dec 18 15:03:36 turnip kernel: md: md2: resync done.
  Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.
 
  I tried to stop the array:
 
  mdadm --stop /dev/md2
 
  and mdadm never came back. It's off in the kernel somewhere. :-(
 
  kill, of course, has no effect.
  The machine still runs fine, the rest of the raids (md0 and md1) work
  fine (same disks).
 
  The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'
 
  Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
  3943  1 (NOTLB)
  Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
   0092
  Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
  80617800 80617800
  Dec 18 15:09:13 turnip kernel:  8061d210 80617800
  80617800 
  Dec 18 15:09:13 turnip kernel: Call Trace:
  Dec 18 15:09:13 turnip kernel:  [803fac96]
  __mutex_lock_interruptible_slowpath+0x8b/0xca
  Dec 18 15:09:13 turnip kernel:  [802acccb] do_open+0x222/0x2a5
  Dec 18 15:09:13 turnip kernel:  [8038705d] md_seq_show+0x127/0x6c1
  Dec 18 15:09:13 turnip kernel:  [80275597] vma_merge+0x141/0x1ee
  Dec 18 15:09:13 turnip kernel:  [802a2aa0] seq_read+0x1bf/0x28b
  Dec 18 15:09:13 turnip kernel:  [8028a42d] vfs_read+0xcb/0x153
  Dec 18 15:09:13 turnip kernel:  [8028a7c1] sys_read+0x45/0x6e
  Dec 18 15:09:13 turnip kernel:  [80209c2e] system_call+0x7e/0x83
 
 
 
  What happened? Is there any debug info I can provide before I reboot?

 Don't know very odd.

 The rest of the 'sysrq' output would possibly help.

Does this help? It's the same syscall and args, I think, as above.

Dec 18 15:09:13 turnip kernel: hald  S 0001e52f4793e397 0
3040  1 (NOTLB)
Dec 18 15:09:13 turnip kernel:  81003aa51e38 0086
 802
68ee6
Dec 18 15:09:13 turnip kernel:  81002a97e5c0 81003aa51de8
80617800 806
17800
Dec 18 15:09:13 turnip kernel:  8061d210 80617800
80617800 810
0bb48
Dec 18 15:09:13 turnip kernel: Call Trace:
Dec 18 15:09:13 turnip kernel:  [80268ee6]
get_page_from_freelist+0x3c4/0x545
Dec 18 15:09:13 turnip kernel:  [803fac96]
__mutex_lock_interruptible_slowpath+0x8b/
0xca
Dec 18 15:09:13 turnip kernel:  [80387adf] md_attr_show+0x2f/0x64
Dec 18 15:09:13 turnip kernel:  [802cd142] sysfs_read_file+0xb3/0x111
Dec 18 15:09:13 turnip kernel:  [8028a42d] vfs_read+0xcb/0x153
Dec 18 15:09:13 turnip kernel:  [8028a7c1] sys_read+0x45/0x6e
Dec 18 15:09:13 turnip kernel:  [80209c2e] system_call+0x7e/0x83
Dec 18 15:09:13 turnip kernel:


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --stop goes off and never comes back?

2007-12-19 Thread Jon Nelson
On 12/19/07, Jon Nelson [EMAIL PROTECTED] wrote:
 On 12/19/07, Neil Brown [EMAIL PROTECTED] wrote:
  On Tuesday December 18, [EMAIL PROTECTED] wrote:
   This just happened to me.
   Create raid with:
  
   mdadm --create /dev/md2 --level=raid10 --raid-devices=3
   --spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3
  
   cat /proc/mdstat
  
   md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
 5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
 [==..]  resync = 14.6% (859968/5855424)
   finish=1.3min speed=61426K/sec
  
   Some log messages:
  
   Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
   starting background reconstruction
   Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
   of 3 devices
   Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
   Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
   KB/sec/disk.
   Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
   bandwidth (but not more than 200000 KB/sec) for resync.
   Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
   5855424 blocks.
   Dec 18 15:03:36 turnip kernel: md: md2: resync done.
   Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.
  
   I tried to stop the array:
  
   mdadm --stop /dev/md2
  
   and mdadm never came back. It's off in the kernel somewhere. :-(
  
   kill, of course, has no effect.
   The machine still runs fine, the rest of the raids (md0 and md1) work
   fine (same disks).
  
   The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'
  
   Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
   3943  1 (NOTLB)
   Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
    0092
   Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
   80617800 80617800
   Dec 18 15:09:13 turnip kernel:  8061d210 80617800
   80617800 
   Dec 18 15:09:13 turnip kernel: Call Trace:
   Dec 18 15:09:13 turnip kernel:  [803fac96]
   __mutex_lock_interruptible_slowpath+0x8b/0xca
   Dec 18 15:09:13 turnip kernel:  [802acccb] do_open+0x222/0x2a5
   Dec 18 15:09:13 turnip kernel:  [8038705d] 
   md_seq_show+0x127/0x6c1
   Dec 18 15:09:13 turnip kernel:  [80275597] vma_merge+0x141/0x1ee
   Dec 18 15:09:13 turnip kernel:  [802a2aa0] seq_read+0x1bf/0x28b
   Dec 18 15:09:13 turnip kernel:  [8028a42d] vfs_read+0xcb/0x153
   Dec 18 15:09:13 turnip kernel:  [8028a7c1] sys_read+0x45/0x6e
   Dec 18 15:09:13 turnip kernel:  [80209c2e] system_call+0x7e/0x83
  
  
  
   What happened? Is there any debug info I can provide before I reboot?
 
  Don't know very odd.
 
  The rest of the 'sysrq' output would possibly help.

 Does this help? It's the same syscall and args, I think, as above.

 Dec 18 15:09:13 turnip kernel: hald  S 0001e52f4793e397 0
 3040  1 (NOTLB)
 Dec 18 15:09:13 turnip kernel:  81003aa51e38 0086
  802
 68ee6
 Dec 18 15:09:13 turnip kernel:  81002a97e5c0 81003aa51de8
 80617800 806
 17800
 Dec 18 15:09:13 turnip kernel:  8061d210 80617800
 80617800 810
 0bb48
 Dec 18 15:09:13 turnip kernel: Call Trace:
 Dec 18 15:09:13 turnip kernel:  [80268ee6]
 get_page_from_freelist+0x3c4/0x545
 Dec 18 15:09:13 turnip kernel:  [803fac96]
 __mutex_lock_interruptible_slowpath+0x8b/
 0xca
 Dec 18 15:09:13 turnip kernel:  [80387adf] md_attr_show+0x2f/0x64
 Dec 18 15:09:13 turnip kernel:  [802cd142] 
 sysfs_read_file+0xb3/0x111
 Dec 18 15:09:13 turnip kernel:  [8028a42d] vfs_read+0xcb/0x153
 Dec 18 15:09:13 turnip kernel:  [8028a7c1] sys_read+0x45/0x6e
 Dec 18 15:09:13 turnip kernel:  [80209c2e] system_call+0x7e/0x83
 Dec 18 15:09:13 turnip kernel:

NOTE: kernel is stock openSUSE 10.3 kernel, x86_64, 2.6.22.13-0.3-default.


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mdadm --stop goes off and never comes back?

2007-12-18 Thread Jon Nelson
This just happened to me.
Create raid with:

mdadm --create /dev/md2 --level=raid10 --raid-devices=3
--spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3

cat /proc/mdstat

md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
  5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
  [==..]  resync = 14.6% (859968/5855424)
finish=1.3min speed=61426K/sec

Some log messages:

Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
starting background reconstruction
Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
of 3 devices
Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
KB/sec/disk.
Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for resync.
Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
5855424 blocks.
Dec 18 15:03:36 turnip kernel: md: md2: resync done.
Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.

I tried to stop the array:

mdadm --stop /dev/md2

and mdadm never came back. It's off in the kernel somewhere. :-(

kill, of course, has no effect.
The machine still runs fine, the rest of the raids (md0 and md1) work
fine (same disks).

The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'

Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
3943  1 (NOTLB)
Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
 0092
Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
80617800 80617800
Dec 18 15:09:13 turnip kernel:  8061d210 80617800
80617800 
Dec 18 15:09:13 turnip kernel: Call Trace:
Dec 18 15:09:13 turnip kernel:  [803fac96]
__mutex_lock_interruptible_slowpath+0x8b/0xca
Dec 18 15:09:13 turnip kernel:  [802acccb] do_open+0x222/0x2a5
Dec 18 15:09:13 turnip kernel:  [8038705d] md_seq_show+0x127/0x6c1
Dec 18 15:09:13 turnip kernel:  [80275597] vma_merge+0x141/0x1ee
Dec 18 15:09:13 turnip kernel:  [802a2aa0] seq_read+0x1bf/0x28b
Dec 18 15:09:13 turnip kernel:  [8028a42d] vfs_read+0xcb/0x153
Dec 18 15:09:13 turnip kernel:  [8028a7c1] sys_read+0x45/0x6e
Dec 18 15:09:13 turnip kernel:  [80209c2e] system_call+0x7e/0x83



What happened? Is there any debug info I can provide before I reboot?



-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid over 48 disks

2007-12-18 Thread Jon Nelson
On 12/18/07, Thiemo Nagel [EMAIL PROTECTED] wrote:
  Performance of the raw device is fair:
  # dd if=/dev/md2 of=/dev/zero bs=128k count=64k
  8589934592 bytes (8.6 GB) copied, 15.6071 seconds, 550 MB/s
 
  Somewhat less through ext3 (created with -E stride=64):
  # dd if=largetestfile of=/dev/zero bs=128k count=64k
  8589934592 bytes (8.6 GB) copied, 26.4103 seconds, 325 MB/s
 
  Quite slow?
 
  10 disks (raptors) raid 5 on regular sata controllers:
 
  # dd if=/dev/md3 of=/dev/zero bs=128k count=64k
  8589934592 bytes (8.6 GB) copied, 10.718 seconds, 801 MB/s
 
  # dd if=bigfile of=/dev/zero bs=128k count=64k
  3640379392 bytes (3.6 GB) copied, 6.58454 seconds, 553 MB/s

 Interesting.  Any ideas what could be the reason?  How much do you get
 from a single drive?  -- The Samsung HD501LJ that I'm using gives
 ~84MB/s when reading from the beginning of the disk.

 With RAID 5 I'm getting slightly better results (though I really wonder
 why, since naively I would expect identical read performance) but that
 does only account for a small part of the difference:

 16k read64k write
 chunk
 sizeRAID 5  RAID 6  RAID 5  RAID 6
 128k492 497 268 270
 256k615 530 288 270
 512k625 607 230 174
 1024k   650 620 170 75

It strikes me that these numbers are meaningless without knowing whether
that is actual data-to-disk or data-to-memcache-and-some-to-disk-too.
Later versions of 'dd' offer 'conv=fdatasync', which is really handy
(it calls fdatasync on the output file, syncing JUST the one file, right
before close). Otherwise, oflag=direct will (try to) bypass the
page/block cache.

I can get really impressive numbers, too (over 200MB/s on a single
disk capable of 70MB/s) when I (mis)use dd without fdatasync, et al.

The variation in reported performance can be really huge if you don't
understand that you aren't actually testing the DISK I/O but *some*
disk I/O and *some* memory caching.
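Roughly speaking (file name and size are just placeholders):

dd if=/dev/zero of=testfile bs=1M count=2048                  # mostly measures the page cache
dd if=/dev/zero of=testfile bs=1M count=2048 conv=fdatasync   # includes the final flush in the timing
dd if=/dev/zero of=testfile bs=1M count=2048 oflag=direct     # bypasses the page cache entirely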




-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-08 Thread Jon Nelson
This is what dstat shows me while copying lots of large files around
(ext3), one file at a time.
I've benchmarked the raid itself at around 65-70 MB/s maximum actual
write I/O, so this 3-4MB/s stuff is pretty bad.

I should note that ALL other I/O suffers horribly, even on other filesystems.
What might the cause be?

I should also note: going larger with stripe_cache_size (384 and 512),
performance stays the same; going smaller (128), performance
*increases* and stays steadier, at 10-13 MB/s.
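For reference, the knob in question (md0 assumed, values as above):

cat /sys/block/md0/md/stripe_cache_size
echo 128 > /sys/block/md0/md/stripe_cache_size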


total-cpu-usage --dsk/sda-- --dsk/sdb-- --dsk/sdc--
--dsk/sdd-- -dsk/total-
usr sys idl wai hiq siq| read  writ: read  writ: read  writ: read
writ: read  writ
  1   1  95   3   0   0|  12k 4261B: 106k  125k:  83k  110k:  83k
110k: 283k  348k
  0   5   0  91   1   2|   0 0 :2384k 4744k:2612k 4412k:2336k
4804k:7332k   14M
  0   4   0  91   1   3|   0 0 :2352k 4964k:2392k 4812k:2620k
4764k:7364k   14M
  0   4   0  92   1   3|   0 0 :1068k 3524k:1336k 3184k:1360k
2912k:3764k 9620k
  0   4   0  92   1   2|   0 0 :2304k 2612k:2128k 2484k:2332k
3028k:6764k 8124k
  0   4   0  92   1   2|   0 0 :1584k 3428k:1252k 3992k:1592k
3416k:4428k   11M
  0   3   0  93   0   2|   0 0 :1400k 2364k:1424k 2700k:1584k
2592k:4408k 7656k
  0   4   0  93   1   2|   0 0 :1764k 3084k:1820k 2972k:1796k
2396k:5380k 8452k
  0   4   0  92   2   3|   0 0 :1984k 3736k:1772k 4024k:1792k
4524k:5548k   12M
  0   4   0  93   1   2|   0 0 :1852k 3860k:1840k 3408k:1696k
3648k:5388k   11M
  0   4   0  93   0   2|   0 0 :1328k 2500k:1640k 2348k:1672k
2128k:4640k 6976k
  0   4   0  92   0   4|   0 0 :1624k 3944k:2080k 3432k:1760k
3704k:5464k   11M
  0   1   0  97   1   2|   0 0 :1480k 1340k: 976k 1564k:1268k
1488k:3724k 4392k
  0   4   0  92   1   2|   0 0 :1320k 2676k:1608k 2548k: 968k
2572k:3896k 7796k
  0   2   0  96   1   1|   0 0 :1856k 1808k:1752k 1988k:1752k
1600k:5360k 5396k
  0   4   0  92   2   1|   0 0 :1360k 2560k:1240k 2788k:1580k
2940k:4180k 8288k
  0   2   0  97   1   2|   0 0 :1928k 1456k:1628k 2080k:1488k
2308k:5044k 5844k
  1   3   0  94   2   2|   0 0 :1432k 2156k:1320k 1840k: 936k
1072k:3688k 5068k
  0   3   0  93   2   2|   0 0 :1760k 2164k:1440k 2384k:1276k
2972k:4476k 7520k
  0   3   0  95   1   2|   0 0 :1088k 1064k: 896k 1424k:1152k
992k:3136k 3480k
  0   0   0  96   0   2|   0 0 : 976k  888k: 632k 1120k:1016k
968k:2624k 2976k
  0   2   0  94   1   2|   0 0 :1120k 1864k: 964k 1776k:1060k
1856k:3144k 5496k

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread Jon Nelson
On 12/6/07, David Rees [EMAIL PROTECTED] wrote:
 On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote:
  On Wed, 5 Dec 2007, Jon Nelson wrote:
 
   I saw something really similar while moving some very large (300MB to
   4GB) files.
   I was really surprised to see actual disk I/O (as measured by dstat)
   be really horrible.
 
  Any work-arounds, or just don't perform heavy reads the same time as
  writes?

 What kernel are you using? (Did I miss it in your OP?)

 The per-device write throttling in 2.6.24 should help significantly,
 have you tried the latest -rc and compared to your current kernel?

I was using 2.6.22.12 I think (openSUSE kernel).
I can try using pretty much any kernel - I'm preparing to do an
unrelated test using 2.6.24rc4 this weekend. If I remember I'll try to
see what disk I/O looks like there.


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-05 Thread Jon Nelson
I saw something really similar while moving some very large (300MB to
4GB) files.
I was really surprised to see actual disk I/O (as measured by dstat)
be really horrible.

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Stack Trace. Bad?

2007-11-06 Thread Jon Nelson
I was testing some network throughput today and ran into this.
I'm going to bet it's a forcedeth driver problem but since it also
involves software raid I thought I'd include it.
Whom should I contact regarding the forcedeth problem?

The following is only an harmless informational message.
Unless you get a _continuous_flood_ of these messages it means
everything is working fine. Allocations from irqs cannot be
perfectly reliable and the kernel is designed to handle that.
md0_raid5: page allocation failure. order:2, mode:0x20

Call Trace:
 IRQ  [802684c2] __alloc_pages+0x324/0x33d
 [80283147] kmem_getpages+0x66/0x116
 [8028367a] fallback_alloc+0x104/0x174
 [80283330] kmem_cache_alloc_node+0x9c/0xa8
 [80396984] __alloc_skb+0x65/0x138
 [8821d82a] :forcedeth:nv_alloc_rx_optimized+0x4d/0x18f
 [88220fca] :forcedeth:nv_napi_poll+0x61f/0x71c
 [8039ce93] net_rx_action+0xb2/0x1c5
 [8023625e] __do_softirq+0x65/0xce
 [8020adbc] call_softirq+0x1c/0x28
 [8020bef5] do_softirq+0x2c/0x7d
 [8020c180] do_IRQ+0xb6/0xd6
 [8020a141] ret_from_intr+0x0/0xa
 EOI  [80265d8e] mempool_free_slab+0x0/0xe
 [803fac0b] _spin_unlock_irqrestore+0x8/0x9
 [803892d8] bitmap_daemon_work+0xee/0x2f3
 [80386571] md_check_recovery+0x22/0x4b9
 [88118e10] :raid456:raid5d+0x1b/0x3a2
 [8023978b] del_timer_sync+0xc/0x16
 [803f98db] schedule_timeout+0x92/0xad
 [80239612] process_timeout+0x0/0x5
 [803f98ce] schedule_timeout+0x85/0xad
 [80387e62] md_thread+0xf2/0x10e
 [80243353] autoremove_wake_function+0x0/0x2e
 [80387d70] md_thread+0x0/0x10e
 [8024322c] kthread+0x47/0x73
 [8020aa48] child_rip+0xa/0x12
 [802431e5] kthread+0x0/0x73
 [8020aa3e] child_rip+0x0/0x12

Mem-info:
Node 0 DMA per-cpu:
CPU0: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
CPU1: Hot: hi:0, btch:   1 usd:   0   Cold: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: Hot: hi:  186, btch:  31 usd: 115   Cold: hi:   62, btch:  15 usd:  31
CPU1: Hot: hi:  186, btch:  31 usd: 128   Cold: hi:   62, btch:  15 usd:  56
Active:111696 inactive:116497 dirty:31 writeback:0 unstable:0
 free:1850 slab:19676 mapped:3608 pagetables:1217 bounce:0
Node 0 DMA free:3988kB min:40kB low:48kB high:60kB active:232kB
inactive:5496kB present:10692kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 994 994
Node 0 DMA32 free:3412kB min:4012kB low:5012kB high:6016kB
active:446552kB inactive:460492kB present:1018020kB pages_scanned:0
all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 29*4kB 2*8kB 1*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB
1*1024kB 1*2048kB 0*4096kB = 3988kB
Node 0 DMA32: 419*4kB 147*8kB 19*16kB 0*32kB 1*64kB 0*128kB 1*256kB
0*512kB 0*1024kB 0*2048kB 0*4096kB = 3476kB
Swap cache: add 57, delete 57, find 0/0, race 0+0
Free swap  = 979608kB
Total swap = 979832kB
 Free swap:   979608kB
262128 pages of RAM
4938 reserved pages
108367 pages shared
0 pages swap cached


-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


raid5: degraded after reboot

2007-10-12 Thread Jon Nelson
I have a software raid5 using /dev/sd{a,b,c}4.
It's been up for months, through many reboots.

I had to do a reboot using sysrq

When the box came back up, the raid did not re-assemble.
I am not using bitmaps.

I believe it comes down to this:

<4>md: kicking non-fresh sda4 from array!

what does that mean?

I also have this:

raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
RAID5 conf printout:
 --- rd:3 wd:2 fd:1
 disk 1, o:1, dev:sdb4
 disk 2, o:1, dev:sdc4
mdadm: forcing event count in /dev/sdb4(1) from 327615 upto 327626

Why was /dev/sda4 kicked?

Contents of /etc/mdadm.conf:

DEVICE /dev/hd*[a-h][0-9] /dev/sd*[a-h][0-9]
ARRAY /dev/md0 level=raid5 num-devices=3
UUID=b4597c3f:ab953cb9:32634717:ca110bfc

Current /proc/mdstat:

md0 : active raid5 sda4[3] sdb4[1] sdc4[2]
  613409664 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
  [==..]  recovery = 13.1% (40423368/306704832)
finish=68.8min speed=64463K/sec

65-70MB/s is about what these drives can do, so the rebuild speed is just peachy.

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5: degraded after reboot

2007-10-12 Thread Jon Nelson
On 10/12/07, Andre Noll [EMAIL PROTECTED] wrote:
 On 10:38, Jon Nelson wrote:
  <4>md: kicking non-fresh sda4 from array!
 
  what does that mean?

 sda4 was not included because the array has been assembled previously
 using only sdb4 and sdc4. So the data on sda4 is out of date.

I don't understand - over months and months it has always been the three
devices, /dev/sd{a,b,c}4.
I've added and removed bitmaps and done other things but at the time it
rebooted the array had been up, clean (non-degraded), and comprised of the
three devices for 4-6 weeks.

  I also have this:
 
  raid5: raid level 5 set md0 active with 2 out of 3 devices, algorithm 2
  RAID5 conf printout:
   --- rd:3 wd:2 fd:1
   disk 1, o:1, dev:sdb4
   disk 2, o:1, dev:sdc4

 This looks normal. The array is up with two working disks.

Two of three, which, to me, is abnormal (i.e., the normal state is three
and it's got two).

  Why was /dev/sda4 kicked?

 Because it was non-fresh ;)

OK, but what does that MEAN?


  md0 : active raid5 sda4[3] sdb4[1] sdc4[2]
613409664 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
[==..]  recovery = 13.1% (40423368/306704832)
  finish=68.8min speed=64463K/sec

 Seems like your init scripts re-added sda4.

No, I did this by hand. I forgot to say that.
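By hand meaning, roughly (a sketch, not a transcript of what I typed):

mdadm /dev/md0 --add /dev/sda4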

--
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5: degraded after reboot

2007-10-12 Thread Jon Nelson
 You said you had to reboot your box using sysrq. There are chances you
 caused the reboot while all pending data was written to sdb4 and sdc4,
 but not to sda4. So sda4 appears to be non-fresh after the reboot and,
 since mdadm refuses to use non-fresh devices, it kicks sda4.

Can mdadm be told to use non-fresh devices?
What about sdb4: I can understand rewinding an event count (sorta),
but what does this mean:

mdadm: forcing event count in /dev/sdb4(1) from 327615 upto 327626

Since the array is degraded, there are 11 events missing from sdb4
(presumably sdc4 had them). Since sda4 is not part of the array, the
events can't be complete, can they?  Why jump *ahead* on events
instead of rewinding?
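For context, that 'forcing event count' message is the kind of thing
mdadm prints when an assembly is forced, i.e. something along these
lines (a sketch; --force tells mdadm to accept a not-quite-fresh member):

mdadm --assemble --force /dev/md0 /dev/sda4 /dev/sdb4 /dev/sdc4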


 Sure. I should have said: It's normal if one disk in a raid5 array is
 missing (or non-fresh).

I do not have a spare for this raid - I am aware of the risks and
mitigate them in other ways.

 To be precise, it means that the event counter for sda4 is less than
 the event counter on the other devices in the array. So mdadm must
 assume the data on sda4 is out of sync and hence the device can't be
 used. If you are not using bitmaps, there is no other way out than
 syncing the whole device, i.e. writing good data (computed from sdb4
 and sdc4) to sda4.

 Hope that helps.

Yes, that helps.

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Backups w/ rsync

2007-09-28 Thread Jon Nelson
Please note: I'm having trouble w/gmail's formatting... so please
forgive this if it looks horrible. :-|

On 9/28/07, Bill Davidsen [EMAIL PROTECTED] wrote:

 Dean S. Messing wrote:
  It has been some time since I read the rsync man page.  I see that
  there is (among the bazillion and one switches) a --link-dest=DIR
  switch which I suppose does what you describe.  I'll have to
  experiment with this and think things through.  Thanks, Michal.
 

 Be aware that rsync is useful for making a *copy* of your files, which
 isn't always the best backup. If the goal is to preserve data and be
 able to recover in time of disaster, it's probably not optimal, while if
 you need frequent access to old or deleted files it's fine.


You are absolutely right when you say it isn't always the best backup. There
IS no 'best' backup.

For example, full and incremental backup methods such as dump and
 restore are usually faster to take and restore than a copy, and allow
 easy incremental backups.


If copy meant full data copy and not hard link where possible, I'd
agree with you. However...

I use a nightly rsync (with --link-dest) to backup more than 40 GiB to a
drbd-backed drive. I'll explain why I use drbd in just a moment.

Technically, I have a 3 disk raid5 (Linux Software Raid) which is the
primary store for the data. Then I have a second drive (non-raid) that is
used as a drbd backing store, which I rsync *to* from filesystems built off
of the raid. I keep *30 days* of nightly backups on the drbd volume. The
average difference between nightly backups is about 45MB, or a bit less than
10%. The total disk usage is (on average) about 10% more than a single
backup. On an AMD x86-64 dual core (3600 de-clocked to run at 1GHz) the
entire process takes between 1 and 2 minutes, from start to finish.

Using hard links means I can snapshot ~175,000 files, about 40GiB, in under
2 minutes - something I'd have a hard time doing with dump+restore. I could
easily make incremental or differential copies, and maybe even in that time
frame, but I'm not sure I see much advantage in that. Furthermore, as you state,
dump+restore does *not* include the removal of files, which for some
scenarios is a huge deal.
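The nightly run is, in rough shape, a single rsync along these lines
(paths are examples, not my actual script):

rsync -a --delete --link-dest=/backup/daily.1/ /data/ /backup/daily.tmp/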

The long and short of it is this: using hard links (via rsync or cp or
whatever) to do snapshot backups can be really, really fast and have
significant advantages but there are, as with all things, some downsides.
Those downsides are fairly easily mitigated, however. In my case, I can lose
1 drive of the raid and I'm OK. If I lose 2, then the other drive (not part
of the raid) has the data I care about. If I lose the entire machine, the
*other* machine (the other end of the drbd, only woken up every other day or
so) has the data. Going back 30 days. And a bare-metal restore is as fast
as your I/O is.  I back my /really/ important stuff up on DLT.

Thanks again to drbd, when the secondary comes up it communicates with the
primary and is able to figure out only which blocks have changed and only
copies those. On a nightly basis that is usually a couple of hundred
megabytes, and at 12MiB/s that doesn't take terribly long to take care of.

-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Backups w/ rsync

2007-09-28 Thread Jon Nelson
On 9/28/07, Bill Davidsen [EMAIL PROTECTED] wrote:
 What I don't understand is how you use hard links... because a hard link
 needs to be in the same filesystem, and because a hard link is just
 another pointer to the inode and doesn't make a physical copy of the
 data to another device or to anywhere, really.

Yes, I know how hard links work. There is (one) physical copy of the
data when it goes from the filesystem on the raid to the filesystem on
the drbd. Subsequent copies of the same file, assuming the file has
not changed, are all hard links on the drbd-backed filesystem. Thus, I
have one *physical* copy of the data and a whole bunch of hard links.
Now, since I'm using drbd I actually have *two* physical copies (for a
total of three if you include the original) because the *other*
machine has a block-for-block copy of the drbd device (or it did, as
of a few days ago).

link-dest basically works like this:

Assuming we are going to copy (using that word loosely here) file
A from /source to /dest/backup.tmp/, and we've told rsync that
/dest/backup.1/A might exist:


If /dest/backup.1/A does not exist: make a physical copy from
/source/A to /dest/backup.tmp/A.
If it does exist, and the two files are considered identical, simply
hardlink /dest/backup.tmp/A to /dest/backup.1/A.
When all files are copied, move every /dest/backup.N (N is a number)
to /dest/backup.N+1
If /dest/backup.31 exists, delete it.
Move /dest/backup.tmp to /dest/backup.1 (which was just renamed /dest/backup.2)
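In shell, the whole cycle looks roughly like this (a sketch with example
paths; the oldest copy is deleted before the renames just to keep the mv
calls simple):

rsync -a --delete --link-dest=/dest/backup.1/ /source/ /dest/backup.tmp/
rm -rf /dest/backup.31
for n in $(seq 30 -1 1); do
    [ -d /dest/backup.$n ] && mv /dest/backup.$n /dest/backup.$((n+1))
done
mv /dest/backup.tmp /dest/backup.1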

I can do all of this, for 175K files (40G), in under 2 minutes on
modest hardware.
I end up with:
1+1 physical copies of the data (local drbd copy and remote drbd copy)

There is more but if I may suggest: if you want more details contact
me off-line, I'm pretty sure the linux-raid folks couldn't care less
about rsync and drbd.
-- 
Jon
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stripe_cache_size and performance

2007-06-26 Thread Jon Nelson
On Mon, 25 Jun 2007, Justin Piszcz wrote:

 Neil has a patch for the bad speed.

What does the patch do?

 In the mean time, do this (or better to set it to 30, for instance):
 
 # Set minimum and maximum raid rebuild speed to 60MB/s.
 echo Setting minimum and maximum resync speed to 60 MiB/s...
 echo 60000 > /sys/block/md0/md/sync_speed_min
 echo 60000 > /sys/block/md0/md/sync_speed_max

sync_speed_max defaults to 200000 on this box already.
I tried a binary search of values between the default (1000) 
and 60000 which resulted in some pretty weird behavior:

at values below 26000 the rate (also confirmed via dstat output) stayed 
low.  2-3MB/s.  At 26000 and up, the value jumped more or less instantly 
to 70-74MB/s. What makes 26000 special? If I set the value to 20000 why 
do I still get 2-3MB/s actual?

--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stripe_cache_size and performance

2007-06-26 Thread Jon Nelson
On Tue, 26 Jun 2007, Justin Piszcz wrote:

 
 
 On Tue, 26 Jun 2007, Jon Nelson wrote:
 
  On Mon, 25 Jun 2007, Justin Piszcz wrote:
 
   Neil has a patch for the bad speed.
 
  What does the patch do?
 
   In the mean time, do this (or better to set it to 30, for instance):
  
   # Set minimum and maximum raid rebuild speed to 60MB/s.
   echo Setting minimum and maximum resync speed to 60 MiB/s...
   echo 60000 > /sys/block/md0/md/sync_speed_min
   echo 60000 > /sys/block/md0/md/sync_speed_max
 
  sync_speed_max defaults to 200000 on this box already.
  I tried a binary search of values between the default (1000)
  and 60000 which resulted in some pretty weird behavior:
 
  at values below 26000 the rate (also confirmed via dstat output) stayed
  low.  2-3MB/s.  At 26000 and up, the value jumped more or less instantly
  to 70-74MB/s. What makes 26000 special? If I set the value to 20000 why
  do I still get 2-3MB/s actual?


 You want to use sync_speed_min.

I forgot to say I changed the values of sync_speed_min.


--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stripe_cache_size and performance

2007-06-26 Thread Jon Nelson
On Tue, 26 Jun 2007, Justin Piszcz wrote:

 
 
 On Tue, 26 Jun 2007, Jon Nelson wrote:
 
  On Tue, 26 Jun 2007, Justin Piszcz wrote:
 
  
  
   On Tue, 26 Jun 2007, Jon Nelson wrote:
  
On Mon, 25 Jun 2007, Justin Piszcz wrote:
   
 Neil has a patch for the bad speed.
   
What does the patch do?

I repeat: what does the patch do (or is this no longer applicable)?


sync_speed_max defaults to 200000 on this box already.

Altering sync_speed_min...

I tried a binary search of values between the default (1000)
and 60000 which resulted in some pretty weird behavior:
   
at values below 26000 the rate (also confirmed via dstat output) stayed
low.  2-3MB/s.  At 26000 and up, the value jumped more or less instantly
to 70-74MB/s. What makes 26000 special? If I set the value to 20000 why
do I still get 2-3MB/s actual?

 Sounds quite strange, what chunk size are you using for your RAID?

The default: 64

md0 : active raid5 sdc4[2] sda4[0] sdb4[1]
  613409664 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]


--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stripe_cache_size and performance

2007-06-25 Thread Jon Nelson
On Thu, 21 Jun 2007, Jon Nelson wrote:

 On Thu, 21 Jun 2007, Raz wrote:
 
  What is your raid configuration ?
  Please note that the stripe_cache_size is acting as a bottle neck in some
  cases.

Well, that's kind of the point of my email. I'll try to restate things, 
as my question appears to have gotten lost.

1. I have a 3x component raid5, ~314G per component. Each component 
happens to be the 4th partition of a 320G SATA drive. Each drive 
can sustain approx. 70MB/s reads/writes. Except for the first 
drive, none of the other partitions are used for anything else at this 
time. The system is nominally quiescent during these tests.

2. The kernel is 2.6.18.8-0.3-default on x86_64 (openSUSE 10.2).

3. My best sustained write performance comes with a stripe_cache_size of 
4096. Larger than that seems to reduce performance, although only very 
slightly.

4. At values below 4096, the absolute write performance is less than the 
best, but only marginally. 

5. HOWEVER, at any value *above* 512 the 'check' performance is REALLY 
BAD. By 'check' performance I mean the value displayed by /proc/mdstat 
after I issue:

echo check > /sys/block/md0/md/sync_action

When I say REALLY BAD I mean < 3MB/s. 

6. Here is a short incomplete table of stripe_cache_size to 'check' 
performance:

384 72-73MB/s
512 72-73MB/s
640 73-74MB/s
768 3-3.4MB/s

And the performance stays bad as I increase the stripe_cache_size.

7. And now, the question: the best absolute 'write' performance comes 
with a stripe_cache_size value of 4096 (for my setup). However, any 
value of stripe_cache_size above 384 really, really hurts 'check' (and 
rebuild, one can assume) performance.  Why?
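For completeness, the measurement itself is nothing more exotic than
(md0 assumed):

echo 4096 > /sys/block/md0/md/stripe_cache_size
echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat     # the speed= field is the number quoted in the table above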

--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stripe_cache_size and performance

2007-06-25 Thread Jon Nelson
On Mon, 25 Jun 2007, Dan Williams wrote:

  7. And now, the question: the best absolute 'write' performance comes
  with a stripe_cache_size value of 4096 (for my setup). However, any
  value of stripe_cache_size above 384 really, really hurts 'check' (and
  rebuild, one can assume) performance.  Why?
 
 Question:
 After performance goes bad does it go back up if you reduce the size
 back down to 384?

Yes, and almost instantly.

--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stripe_cache_size and performance

2007-06-21 Thread Jon Nelson
On Thu, 21 Jun 2007, Raz wrote:

 What is your raid configuration ?
 Please note that the stripe_cache_size is acting as a bottle neck in some
 cases.

Well, it's 3x SATA drives in raid5. 320G drives each, and I'm using a 
314G partition from each disk (the rest of the space is quiescent).

 On 6/21/07, Jon Nelson [EMAIL PROTECTED] wrote:
 
  I've been futzing with stripe_cache_size on a 3x component raid5,
  using 2.6.18.8-0.3-default on x86_64 (openSUSE 10.2).
 
  With the value set at 4096 I get pretty great write numbers.
  2048 and on down the write numbers slowly drop.
 
  However, at values above 512 the 'check' performance is terrible. By
  'check' performance I mean the value displayed by /proc/mdstat after
  I issue:
 
  echo check > /sys/block/md0/md/sync_action
 
  When I say terrible I mean < 3MB/s.
  When I use 384, the performance goes to ~70MB/s
  512.. 72-73MB/s
  640.. 73-74MB/s
 
  768.. 3300 K/s. Wow!
 
  Can somebody 'splain to me what is going on?
 
  --
  Jon Nelson [EMAIL PROTECTED]
  -
  To unsubscribe from this list: send the line unsubscribe linux-raid in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 
 -- 
 Raz
 
 
 

--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: below 10MB/s write on raid5

2007-06-11 Thread Jon Nelson
On Mon, 11 Jun 2007, Justin Piszcz wrote:

 
 
 On Mon, 11 Jun 2007, Dexter Filmore wrote:
 
  On Monday 11 June 2007 14:47:50 Justin Piszcz wrote:
   On Mon, 11 Jun 2007, Dexter Filmore wrote:
I recently upgraded my file server, yet I'm still unsatisfied with the
write speed.
Machine now is a Athlon64 3400+ (Socket 754) equipped with 1GB of RAM.
The four RAID disks are attached to the board's onbaord sATA controller
(Sil3114 attached via PCI)
Kernel is 2.6.21.1, custom on Slackware 11.0.
RAID is on four Samsung SpinPoint disks, has LVM, 3 volumes atop of each
XFS.
   
The machine does some other work, too, but still I would have suspected
to get into the 20-30MB/s area. Too much asked for?
   
Dex
  
   What do you get without LVM?
 
  Hard to tell: the PV hogs all of the disk space, can't really do non-LVM
  tests.
 
 You can do a read test.
 
 10gb read test:
 
 dd if=/dev/md0 bs=1M count=10240 of=/dev/null

eek! Make sure to use iflag=direct
with that, otherwise you'll get cached reads and that will throw
the numbers off considerably.
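i.e., something like (sketch):

dd if=/dev/md0 bs=1M count=10240 iflag=direct of=/dev/null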


--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: below 10MB/s write on raid5

2007-06-11 Thread Jon Nelson
On Mon, 11 Jun 2007, Nix wrote:

 On 11 Jun 2007, Justin Piszcz told this:
  You can do a read test.
 
  10gb read test:
 
  dd if=/dev/md0 bs=1M count=10240 of=/dev/null
 
  What is the result?
 
  I've read that LVM can incur a 30-50% slowdown.
 
 FWIW I see a much smaller penalty than that.
 
 loki:~# lvs -o +devices
   LV   VGAttr   LSize   Origin Snap%  Move Log Copy%  Devices
 [...]
   usr  raid  -wi-ao   6.00G   /dev/md1(50)
 
 loki:~# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null
 502400+0 records in
 502400+0 records out
 502400000 bytes (502 MB) copied, 16.2995 s, 30.8 MB/s
 
 loki:~# time dd if=/dev/raid/usr bs=1000 count=502400 of=/dev/null
 502400+0 records in
 502400+0 records out
 502400000 bytes (502 MB) copied, 18.6172 s, 27.0 MB/s

And what is it like with 'iflag=direct' which I really feel you have to 
use, otherwise you get caching.

--
Jon Nelson [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regarding odd RAID5 I/O patterns

2007-06-07 Thread Jon Nelson
On Thu, 7 Jun 2007, Neil Brown wrote:

 On Wednesday June 6, [EMAIL PROTECTED] wrote:
 
  2. now, if I use oflag=direct, the I/O patterns are very strange:
 0 (zero) reads from sda or sdb, and 2-3MB/s worth of reads from sdc.
 11-12 MB/s writes to sda, and 8-9MB/s writes to sdb and sdc.
 
 --dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
  read  writ: read  writ: read  writ: read  writ
011M:4096B 8448k:2824k 8448k:   0   132k
012M:   0  9024k:3008k 9024k:   0   152k
 
 Why is /dev/sdc getting so many reads? This only happens with
 multiples of 192K for blocksizes. For every other blocksize I tried,
 the reads are spread across all three disks.

 Where letters are 64K chunks, and digits are 64K parity chunks, and
 columns are individual drives, your data is laid out something like
 this:

 A   B   1
 C   2   D
 3   E   F

 Your first 192K write contains data for A, B, and C.
 To generate 1 no read is needed.
 To generate '2', it needs to read either C or D.  It chooses D.
 So you get a read from the third drive, and writes to all.

 Your next 192K write contains data for D, E, and F.
 The update '2' it finds that C is already in cache and doesn't need to
 read anything.  To generate '3', E and F are both available, so no
 read is needed.

 This pattern repeats.

Aha!

  3. Why can't I find a blocksize that doesn't require reading from any
 device? Theoretically, if the chunk size is 64KB, then writing 128KB
 *should* result in 3 writes and 0 reads, right?

 With oflag=direct 128KB should work.  What do you get?
 Without oflag=direct, you have less control.  The VM will flush data
 whenever it wants to and it doesn't know about raid5 alignment
 requirements.

I tried 128KB. Actually, I tried dozens of values and found strange
patterns. Would using 'sync' as well help? [ me tries.. nope. ]

Note: the bitmap in this case remains external (on /dev/hda).
Note: /dev/raid/test is a logical volume carved from a volume group
whose only physical volume is the raid.

Using:

dd if=/dev/zero of=/dev/raid/test bs=128K oflag=direct


!!
NOTE: after writing this, I decided to test against a raid device 
'in-the-raw' (without LVM). 128KB writes get the expected behavior (no 
reads). Unfortunately, this means LVM is doing something funky (maybe 
expected by others, though...) which means that the rest of this isn't 
specific to raid. Where do I go now to find out what's going on?
!!
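
One thing I intend to check is where LVM actually puts the start of the
data area on the raid - if the LV does not begin on a full-stripe (128KB)
boundary, every "aligned" 128KB write from dd becomes unaligned by the
time it reaches the raid. Something like this should show it (both
commands only report existing state):

pvs -o +pe_start
dmsetup table

If pe_start comes back as something like 192KB - a multiple of the 64KB
chunk but not of the 128KB stripe - that would explain why the reads only
show up when going through LVM.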

When I use 128KB I get reads across all three devices. The following is
from dstat, showing 12-13MB/s writes to each drive, and 3.2MB/s give or
take reads. The pattern remains consistent:

--dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
 read  writ: read  writ: read  writ: read  writ
2688k   11M:2700k   11M:2696k   11M:   0   136k
2688k   10M:2656k   10M:2688k   10M:   0   124k
2752k   11M:2752k   11M:2688k   11M:   0   128k

(/dev/hda is where the bitmap is stored, so the writes there make
perfect sense - however, why are there any reads on sda, sdb, or sdc?)

  4. When using the page cache (no oflag=direct), even with 192KB
 blocksizes, there are (except for noise) *no* reads from the devices,
 as expected.  Why does bypassing the page cache, plus the
 combination of 192KB blocks cause such strange behavior?

 Hmm... this isn't what I get... maybe I misunderstood exactly what you
 were asking in '2' above??

I should have made clearer that items 1 through 4 were run with the bitmap
on an external device, to avoid the cost of internal bitmap updates, if that
matters. Essentially, whenever I use dd *without* oflag=direct,
regardless of the blocksize, dstat shows 0 (zero) reads on the component
devices.

dd if=/dev/zero of=/dev/raid/test bs=WHATEVER
(no oflag=direct)

--dsk/sda-- --dsk/sdb-- --dsk/sdc-- --dsk/hda--
 read  writ: read  writ: read  writ: read  writ
    0    41M:   0    41M:   0    41M:   0   240k
    0    66M:   0    76M:   0    67M:   0   260k


  5. If I use an 'internal' bitmap, the write performance is *terrible*. I
 can't seem to squeeze more than 8-12MB/s out of it (no page cache) or
 60MB/s (page cache allowed). When not using the page cache, the reads
 are spread across all three disks to the tune of 2-4MB per second.
 The bitmap file is only 150KB or so in size, why does storing it
 internally cause such a huge performance problem?

 If the bitmap is internal, you have to keep seeking to the end of the
 devices to update the bitmap.  If the bitmap is external and on a
 different device, it seeks independently of the data writes.

That's what I thought, but I didn't know whether the internal bitmap was
stored alongside the data or somewhere else. Is there more than one bitmap?
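
For the record, my understanding is that there is logically one bitmap per
array, but with --bitmap=internal a copy is kept next to the superblock on
every member device, which is what forces the extra seeks. Switching back
and forth looks roughly like this (array and member names are abbreviated,
and the file path is just an example - an external bitmap file has to live
on a filesystem outside the array):

mdadm --grow /dev/md0 --bitmap=none                 # drop the current bitmap
mdadm --grow /dev/md0 --bitmap=/var/tmp/md0.bitmap  # external, another spindle
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=internal             # back to internal

mdadm --examine-bitmap /dev/sda                     # look at a member's copy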

--
Jon Nelson [EMAIL PROTECTED]

RE: RAID 6 grow problem

2007-06-06 Thread Jon Nelson
On Wed, 6 Jun 2007, Daniel Korstad wrote:

  You say you have a RAID with three drives (I assume RAID5) with a read 
 performance of 133MB/s.  There are lots of variables, file system 
 type, cache tuning, but that sounds very reasonable to me.

I just did the math quickly - assuming each drive can sustain (give or 
take) 70MB/s, and I'm using a 3-drive raid5, then (block device) reads 
only pull data from 2 of the 3 drives in each stripe (the third chunk is 
parity), resulting in a maximum of roughly 140MB/s. 
I've measured (consistently) 133MB/s using dd (again, with iflag=direct) 
and a blocksize of 768K, so I'm OK with that.
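
For reference, the measurement was along these lines (array name
abbreviated, and the count is arbitrary, just large enough to smooth out
noise):

dd if=/dev/md0 of=/dev/null bs=768k count=4096 iflag=direct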

A question posed by a friend is this: assuming 64K blocks, if I read a 
single stripe from a raid (128K, right?) then two drives will be used 
(say, drive A and drive B). If I want the next 128K then which drives 
are most likely to be used? Drive A will now have the parity block at 
its next head position, but drive B has the next data 
block. Nobody knows where drive C is. Does raid5 use an algorithm 
similar to raid1 in that it chooses the 2 drives whose heads are closest 
or does it utilize some other algorithm? 
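
Rather than guess, I suppose the thing to do is watch the per-disk counters
while running a large sequential direct read - dstat can be limited to the
members (disk names as in my setup):

dstat -d -D sda,sdb,sdc 1

and, in another terminal:

dd if=/dev/md0 of=/dev/null bs=128k count=16384 iflag=direct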

 Here is a site with some test for RAID5 and 8 drives in the set using 
 high end hardware raid.

 http://www.rhic.bnl.gov/hepix/talks/041019pm/schoen.pdf
 8 drives RAID 5 7200 rpm SATA drives = ~180MB/s

I would have expected a peak transfer rate (sustained, block, linear 
access) of MIN(8*70, 533), which is 533MB/s, assuming a 64bit, 66MHz bus and 
70MB/s drives.

I recently completed a series of some 800+ tests on a 4 disk raid5 
varying the I/O scheduler, readahead of the components, readahead of the 
raid, bitmap present or not, and filesystem, and arrived at some fairly 
interesting results. I hope to throw them together in a more usable form 
in the near future.
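
The harness was roughly of this shape (heavily trimmed - the device names
and value lists are only examples, not the full matrix, and the
bitmap/filesystem dimensions are left out):

for sched in cfq deadline anticipatory noop; do
    for d in sda sdb sdc sdd; do
        echo $sched > /sys/block/$d/queue/scheduler
    done
    for ra in 256 1024 4096; do                # component readahead (512b sectors)
        for d in sda sdb sdc sdd; do
            blockdev --setra $ra /dev/$d
        done
        for mdra in 1024 4096 16384; do        # readahead on the array itself
            blockdev --setra $mdra /dev/md0
            dd if=/dev/md0 of=/dev/null bs=768k count=2048 iflag=direct
        done
    done
done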

--
Jon Nelson [EMAIL PROTECTED]


very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson

I assembled a 3-component raid1 out of 3 4GB partitions.
After syncing, I ran the following script:

for bs in 32 64 128 192 256 384 512 768 1024 ; do \
 COUNT=$((2048 * 1024 / bs)); \
 echo -n "${bs}K bs - "; \
 dd if=/dev/md1 of=/dev/null bs=${bs}k count=$COUNT iflag=direct 2>&1 | \
 grep 'copied' ; \
done

I also ran 'dstat' (like iostat) in another terminal. What I noticed was 
very unexpected to me, so I re-ran it several times.  I confirmed my 
initial observation - every time a new dd process ran, *all* of the read 
I/O for that process came from a single disk. It does not (appear to) 
have to do with block size -  if I stop and re-run the script the next 
drive in line will take all of the I/O - it goes sda, sdc, sdb and back 
to sda and so on.

I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s 
as reported by dd. What I don't understand is why just one disk is being 
used here, instead of two or more. I tried different versions of 
metadata, and using a bitmap makes no difference. I created the array 
with (allowing for variations of bitmap and metadata version):

mdadm --create --level=1 --raid-devices=3 /dev/md1 /dev/sda3 /dev/sdb3 /dev/sdc3
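
with the variations being along the lines of:

mdadm --create /dev/md1 --level=1 --raid-devices=3 --metadata=1.0 \
      --bitmap=internal /dev/sda3 /dev/sdb3 /dev/sdc3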

I am running 2.6.18.8-0.3-default on x86_64, openSUSE 10.2.

Am I doing something wrong or is something weird going on?

--
Jon Nelson [EMAIL PROTECTED]


Re: very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson
On Thu, 31 May 2007, Richard Scobie wrote:

 Jon Nelson wrote:
 
  I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as
  reported by dd. What I don't understand is why just one disk is being used
  here, instead of two or more. I tried different versions of metadata, and
  using a bitmap makes no difference. I created the array with (allowing for
  variations of bitmap and metadata version):
 
 This is normal for md RAID1. What you should find is that for 
 concurrent reads, each read will be serviced by a different disk, 
 until no. of reads = no. of drives.

Alright. To clarify, let's assume some process (like a single-threaded 
webserver) is using a raid1 to store content (who knows why, let's just say 
it is), and also assume that the I/O load is 100% reads. Given that the 
server does not fork (or create a thread) for each request, does that 
mean that every single web request is essentially serviced from one 
disk, always? What mechanism determines which disk actually services the 
request?
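
If it really is purely a matter of concurrency, then starting two direct
readers at different offsets (with dstat running in another terminal, as
before) should light up two disks at once - a quick sketch:

dd if=/dev/md1 of=/dev/null bs=1M count=1024 iflag=direct &
dd if=/dev/md1 of=/dev/null bs=1M count=1024 skip=1024 iflag=direct &
wait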

--
Jon Nelson [EMAIL PROTECTED]