Re: 2.2.14 + raid-2.2.14-B1 on PPC failing on bootup

2000-01-11 Thread Mark Ferrell

It is possible that the problem is a result of the raid code not being PPC
friendly where byte boundaries are concerned.

Open up linux/include/linux/raid/md_p.h
At line 161 you should have something resembling the following:

__u32 sb_csum;          /*  6 checksum of the whole superblock         */
__u64 events;           /*  7 number of superblock updates (64-bit!)   */
__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 9];

Try swapping the __u32 sb_csum and __u64 events around so that it looks like

__u64 events;           /*  7 number of superblock updates (64-bit!)   */
__u32 sb_csum;          /*  6 checksum of the whole superblock         */
__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 9];

This should fix the byte boundary problem that seems to cause a few issues
on PPC systems.  This problem and solution were previously reported by
Corey Minyard, who noted that PPC is a bit more picky about byte
boundaries than the x86 architecture.
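The likely mechanism (my reading of the report, not verified against the
raid patch itself): most PPC ABIs align a __u64 to 8 bytes, so a __u64 that
follows an odd number of 32-bit words picks up 4 bytes of padding and every
later superblock field shifts relative to an x86 build; putting the __u64
first removes the interior padding.  A small standalone sketch you can
compile to see the effect (the struct names and reduced field set here are
invented for illustration, they are not the real md_p.h layout):

#include <stdio.h>
#include <stddef.h>

typedef unsigned int u32;
typedef unsigned long long u64;

struct sb_tail_orig {           /* csum before events, as in the stock header */
        u32 sb_csum;
        u64 events;
        u32 reserved[4];
};

struct sb_tail_swapped {        /* events first, as suggested above */
        u64 events;
        u32 sb_csum;
        u32 reserved[4];
};

int main(void)
{
        /* On an ABI that 8-byte aligns u64 (e.g. PPC), "events" in the
         * original ordering lands at offset 8 instead of 4, and the
         * reserved words shift with it; the swapped ordering has no
         * interior padding on either architecture. */
        printf("orig:    events at %lu, reserved at %lu\n",
               (unsigned long) offsetof(struct sb_tail_orig, events),
               (unsigned long) offsetof(struct sb_tail_orig, reserved));
        printf("swapped: sb_csum at %lu, reserved at %lu\n",
               (unsigned long) offsetof(struct sb_tail_swapped, sb_csum),
               (unsigned long) offsetof(struct sb_tail_swapped, reserved));
        return 0;
}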

"Kevin M. Myer" wrote:

 Hi,

 I am running kernel 2.2.14 + Ingo's latest RAID patches on an Apple
 Network Server.  I have (had) a RAID 5 array with 5 4Gb Seagate drives in
 it working nicely with 2.2.11 and I had to do something silly, like
 upgrade the kernel so I can use the big LCD display on the front to
 display cute messages.

 Now, I seem to have a major problem - I can make the array fine.  I can
 create a filesystem fine.  I can start and stop the array fine.  But I
 can't reboot.  Once I reboot, the kernel loads until it reaches the raid
 detection.  It detects the five drives and identifies them as a RAID5
 array and then, endlessly, the following streams across my screen:

 [dev 00:00][dev 00:00][dev 00:00][dev 00:00][dev 00:00][dev
 00:00][dev 00:00]

 ad infinitum and forever.

 I have no choice but to reboot with an old kernel, run mkraid on the whole
 array again, remake the file system and download the 5 Gigs of Linux and
 BSD software that I had mirrored.

 Can anyone tell me where to start looking for clues as to what's going
 on?  I'm using persistent superblocks and as far as I can tell, everything
 is getting updated when I shut down the machine and reboot
 it.  Unfortunately, the kernel never gets to the point where it can dump
 the stuff from dmesg into syslog, so I have no record of what it's actually
 stumbling over.

 Any ideas of what to try?  Need more information?

 Thanks,

 Kevin

 --
  ~Kevin M. Myer
 . .   Network/System Administrator
 /V\   ELANCO School District
// \\
   /(   )\
^`~'^



Re: tiotest

2000-01-11 Thread James Manning

[ Monday, January 10, 2000 ] Dietmar Goldbeck wrote:
 On Mon, Nov 29, 1999 at 04:20:45PM -0500, James Manning wrote:
  tiotest is a nice start to what I would like to see: a replacement
  for bonnie... While stripping out the character-based stuff from
  bonnie would bring it closer to what I'd like to see, threading
  would be a bit of a pain so starting with tiotest as a base might
  not be a bad idea if people are willing to help out...
 
 Is tiotest Open Source? 

Yes, under the GNU GPL

 Can you give me an URL please

http://www.icon.fi/~mak/tiotest/

I'm going to re-do the filesize stuff soon, but until then just try
and use the number of megabytes of RAM in your machine as the --size
parameter to tiobench.pl once you "make" to build the tiotest program.
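
For example, on a box with 256MB of RAM (the 256 is purely illustrative,
the flag is the one described above):

    make
    ./tiobench.pl --size 256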

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development



Re: optimising raid performance

2000-01-11 Thread James Manning

[ Monday, January 10, 2000 ] [EMAIL PROTECTED] wrote:
   I've currently got a hardware raid system that I'm maxing out so
 any ideas on how to speed it up would be gratefully received.

Just some quick questions for additional info...

 - What kinds of numbers are you getting for performance now?
 - I'd check bonnie with a filesize of twice your RAM and
  then get http://www.icon.fi/~mak/tiotest/tiotest-0.16.tar.gz
  and do a make and then ./tiobench.pl --threads 16
 - Did you get a chance to benchmark raid 1+0 against 0+1?
 - Of the 12 disks over 2 channels, which are in the raid0+1, which
  in the raid5, which spare? how are the drive packs configured?
 - Is the card using its write cache?  write-back or write-through?
 - Do you have the latest firmware on the card?
 - Which kernel are you using?
 - What block size is the filesystem?  Did you create with a -R param?
 - What is your percentage of I/O operations that are writes?

 Since there is a relatively high proportion of writes, a single raid5 set
 seems to be out.  The next best thing looks like a mirror, but which is going
 to be better performance-wise, 6 mirror pairs striped together or mirroring
 2 stripes of 6 discs?

IMO raid 1+0 for 2 stripes of 6 discs (better be around when a drive goes,
though, as that second failure will have about a 55% chance of taking
out the array :)
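
(Presumably the 55% comes from simple counting: with two 6-disc stripes
mirrored against each other, once one drive has died a second failure on
any of the 6 drives in the surviving stripe kills the array, and 6 of the
11 remaining drives is roughly 55%.)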

   Does the kernel get any scheduling benefit by seeing the discs and doing
 things in software?  As you can see the machine has a very low cpu load
 so I'd quite hapily trade some cpu for io throughput...

I'd really love to see you do a s/w raid 1 over 2 6-disk raid0's from
the card and check that performance-wise...  I believe putting the raid1
and raid0 logic on sep. processors could help, and worst case it'll
give a nice test case for any read-balancing patches floating around
(although you've noted that you are more write-intensive)

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development



Re: large ide raid system

2000-01-11 Thread Gregory Leblanc

Dan Hollis wrote:
 
 On Mon, 10 Jan 2000, Jan Edler wrote:
 
 Cable length is not so much a pain as the number of cables. Of course with
 scsi you want multiple channels anyway for performance, so the situation
 is very similar to ide. A cable mess.

There's a (relatively) nice way to get around this, if you make your own
IDE cables (or are brave enough to cut some up).  If you cut the cable
lengthwise (no, don't cut the wires) between wires (don't break the
insulation on the wires themselves, just the connecting plastic) you can
get your cables to be 1/4 the normal width (up until you get to the
connector).  This also makes a big difference for airflow, since those
big, flat ribbon cables are really bad for that.  
Greg



Re: Swapping Drives on RAID?

2000-01-11 Thread Andreas Trottmann

On Mon, Jan 10, 2000 at 11:16:27AM -0700, Scott Patten wrote:

 1 - I have a raid1 consisting of 2 drives.  For strange
 historical reasons one is SCSI and the other IDE.  Although
 the IDE is fairly fast the SCSI is much faster and since I
 now have another SCSI drive to add, I would like to replace
 the IDE with the SCSI.  Can I unplug the IDE drive, run in
 degraded mode, edit the raid.conf and somehow mkraid
 without losing data, or do I need to restore from tape?
 BTW, I'm using 2.2.13ac1.

I assume you configured your raid to "auto-start", i.e. you mkraid'ed it
with persistent_superblock set to 1, and set all the partition types to
0xfd. If not, please tell me so and we'll work out what you have to do.

In the case of an auto-starting raid it's quite easy, though a bit lengthy
to explain:

* halt your computer
* remove the IDE drive (keep it in a safe place, in case I screwed up :) )
* attach the second SCSI drive
* boot. Your /dev/md devices should come up fully useable, but in
  degraded mode
* partition the second SCSI drive exactly like the first one
* for each md device, "raidhotadd" the new disk to it.
  Assuming you have /dev/md5 that consisted of /dev/sda5 and /dev/hda5
  (which is now removed), and you partitioned /dev/sdb like /dev/sda, do

  raidhotadd /dev/md5 /dev/sdb5

  It's actually quite simple, but difficult to explain.
* check /proc/mdstat. It should show that the devices are being resynced
* if you are finished, and everything works, be sure to change
  /etc/raidtab to reflect your new settings.
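  For reference, the raidtab entry after the swap would then look roughly
  like this (device names taken from the example above, chunk-size being
  whatever you originally used):

  raiddev /dev/md5
      raid-level            1
      nr-raid-disks         2
      persistent-superblock 1
      chunk-size            4
      device                /dev/sda5
      raid-disk             0
      device                /dev/sdb5
      raid-disk             1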

 2 - Which is better, 2.2.13ac3 or a patched 2.2.14?  Will
 there be a 2.2.14ac series?  Is there a place besides this
 list with this kind of information?

I've been using a patched 2.2.14 for some days without any problems.

Only Alan Cox knows if there will be a 2.2.14ac. He usually writes his
intentions into his diary, http://www.linux.org.uk/diary/ 

-- 
Andreas Trottmann [EMAIL PROTECTED]



Re: large ide raid system

2000-01-11 Thread Benno Senoner

Jan Edler wrote:

 On Mon, Jan 10, 2000 at 12:49:29PM -0800, Dan Hollis wrote:
  On Mon, 10 Jan 2000, Jan Edler wrote:
- Performance is really horrible if you use IDE slaves.
  Even though you say you aren't performance-sensitive, I'd
  recommend against it if possible.
 
  My tests indicate UDMA performs favorably with ultrascsi, at about 1/6 the
  cost. Cost is often a big factor.

 I wasn't advising against IDE, only against the use of slaves.
 With UDMA-33 or -66, masters work quite well,
 if you can deal with the other constraints that I mentioned
 (cable length, PCI slots, etc).

Do you have any numbers handy?

Will the performance of a master/slave setup be at least HALF of the
master-only setup?

For some apps cost is really important, and software IDE RAID has a very low
price/Megabyte.
If the app doesn't need killer performance, then I think it is the best
solution.

Now if only we had soft-RAID + journaled FS + power-failure safety right now
...

cheers,
Benno.





Re: tiotest

2000-01-11 Thread Mika Kuoppala



On Mon, 10 Jan 2000, James Manning wrote:

 [ Monday, January 10, 2000 ] Dietmar Goldbeck wrote:
  On Mon, Nov 29, 1999 at 04:20:45PM -0500, James Manning wrote:
   tiotest is a nice start to what I would like to see: a replacement
   for bonnie... While stripping out the character-based stuff from
   bonnie would bring it closer to what I'd like to see, threading
   would be a bit of a pain so starting with tiotest as a base might
   not be a bad idea if people are willing to help out...
  
  Is tiotest Open Source? 
 
 Yes, under the GNU GPL
 
  Can you give me an URL please
 
 http://www.icon.fi/~mak/tiotest/

This works, but for the future:

If you use http://www.iki.fi/miku/tiotest, you will
always get redirected to the correct place.

-- Mika [EMAIL PROTECTED]



Re: large ide raid system

2000-01-11 Thread John Burton

Thomas Davis wrote:
 
 James Manning wrote:
 
  Well, it's kind of on-topic thanks to this post...
 
  Has anyone used the systems/racks/appliances/etc from raidzone.com?
  If you believe their site, it certainly looks like a good possibility.
 
 
 Yes.
 
 It's pricey.  Not much cheaper than a SCSI chassis.  You only save money
 on the drives.
 

Interesting... The 100GB Internal RAID-5 SmartCan I purchased from
RaidZone was approx. $5k. The quotes I got for a SCSI equivalent ranged
from $10k to $15K. Personally I consider half the cost significantly
cheaper. I also was quite impressed with a quote for a 1TB rackmount
system in the $50K range; again, SCSI equivalents were significantly
higher...

 Performance is ok.  Has a few other problems - you're stuck with the
 kernels they support; the raid code is NOT open sourced.

Performance is pretty good - these numbers are for a first generation
smartcan (spring '99)

              ---Sequential Output---    ---Sequential Input--    --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
raidzone  100  6923 89.7 25987 26.6 14230 28.9  7297 89.4 215121 77.7 16407.3 69.7
raidzone  200  6537 86.2 22175 21.5 14297 30.2  7667 92.5  56355 36.0   377.5  3.1

Softraid  100  6598 86.0 43411 36.5 12077 27.4  6180 77.9  54022 46.4   721.4  4.1
Softraid  200  8337 87.9 25373 24.0  9009 18.8  8952 87.1  34413 21.7   301.1  2.2

The two sets of numbers were measured on the same computer & hardware
setup (500MHz PIII w/ 128MB, 100GB SmartCan w/ 5 24GB IBM drives).
"raidzone" is using RaidZone's most recent pre-release version of their
Linux software (BIOS upgrades & all). "Softraid" was based on an early
alpha release of RaidZone's Linux support which basically allowed you to
access the individual drives; RAID was handled by the software RAID
support available under RedHat Linux 6.0 & 6.1. Both were set up as
RAID-5.

Using "top":
 - With "Softraid" bonnie and the md Raid-5 software were sharing the
cpu equally
 - With "raidzone" bonnie was consuming most (85%) of the cpu and no
other processes 
   and "system"  15%

Getting back to the discussion of Hardware vs. Software raid...
Can someone say *definitively* *where* the raid-5 code is being run on a
*current* Raidzone product? Originally, it was an "md" process running
on the system cpu. Currently I'm not so sure. The SmartCan *does* have
its own BIOS, so there is *some* intelligence there, but what exactly is
the division of responsibility here...

John

-- 
John Burton, Ph.D.
Senior Associate GATS, Inc.  
[EMAIL PROTECTED]  11864 Canon Blvd - Suite 101
[EMAIL PROTECTED] (personal)  Newport News, VA 23606
(757) 873-5920 (voice)   (757) 873-5924 (fax)



Re: optimising raid performance

2000-01-11 Thread chris . good

 - What kinds of numbers are you getting for performance now?

  Kinda hard to say, we're far more interested in random IO rather 
than sequential stuff.

  and do a make and then ./tiobench.pl --threads 16

/tiotest/ is a single root disk
/data1 is the 8 disc raid 1+0
/data2 is the 3 disc raid 5

  All discs are IBM DMVS18D, i.e. 18GB, 10k rpm, 2MB cache, SCA-2 discs.
http://www.storage.ibm.com/hardsoft/diskdrdl/prod/us18lzx36zx.htm

  Machine/Directory  Size(MB)  BlkSz  Threads   Read    Write    Seeks
  -----------------  --------  -----  -------  -------  -------  -------
 /tiotest/ 512   4096   1   18.092   6.053 2116.40 
  /tiotest/ 512   4096   2   16.363   5.792 829.876   
/tiotest/ 512   4096   4   17.164   5.882 1520.91   
/tiotest/ 512   4096   8   14.533   5.852 932.401   
/tiotest/ 512   4096  16   16.244   5.806 1731.60
/data1/tiot512   4096   1   29.257  14.406 2234.63
/data1/tiot512   4096   2   38.124  13.734 .11
/data1/tiot512   4096   4   31.373  12.864 5128.20
/data1/tiot512   4096   8   29.341  12.460 4705.88
/data1/tiot512   4096  16   34.806  12.121 .55
/data2/tiot512   4096   1   23.063  16.269 1851.85
/data2/tiot512   4096   2   21.576  16.754 1498.12
/data2/tiot512   4096   4   17.908  17.021 3125.00
/data2/tiot512   4096   8   15.773  17.107 3478.26
/data2/tiot512   4096  16   15.394  16.920 4166.66


 - Did you get a chance to benchmark raid 1+0 against 0+1?

 - Of the 12 disks over 2 channels, which are in the raid0+1, which
  in the raid5, which spare? how are the drive packs configured?
 
6 discs on each channel; discs 1-4 of each pack form the 8-disc raid 1+0
 set, discs 5 and 6 on ch1 and disc 5 on ch2 are in the raid5, and disc 6 on
ch2 is the spare.

 - Is the card using its write cache?  write-back or write-through?

  It's using write-back on both devices.

 - Do you have the latest firmware on the card?

  Pretty much, the firmware changelog implies the only real change
is to support PCI hotswap.

 - Which kernel are you using?

  Standard Redhat 6.1 kernel
Linux xxx.yyy.zzz 2.2.12-20smp #1 SMP Mon Sep 27 10:34:45 EDT 1999 i686 unknown


 - What block size is the filesystem?  Did you create with a -R param?

  4k blocksize, didn't use the -R as this is currently hardware raid 

 - What is your percentage of I/O operations that are writes?

  Approx 50%

IMO raid 1+0 for 2 stripes of 6 discs (better be around when a drive goes,
though, as that second failure will have about a 55% chance of taking
out the array :)

  Can't fault your logic there...  But don't you mean 0+1, i.e. 2 stripes
of 6 discs mirrored together, rather than 1+0 (6 mirrored pairs striped
together)?

I'd really love to see you do a s/w raid 1 over 2 6-disk raid0's from
the card and check that performance-wise...  I believe putting the raid1
and raid0 logic on sep. processors could help, and worst case it'll
give a nice test case for any read-balancing patches floating around
(although you've noted that you are more write-intensive)

   Which would you like me to try: all software, or part in software
and part in hardware (and if the latter, which part)?  The raid card
seems pretty good (233MHz StrongARM onboard) so I doubt that it is limiting
us.

thanks,

 Chris
--
Chris Good - Dialog Corp. The Westbrook Centre, Milton Rd, Cambridge UK
Phone: 01223 715000   Fax: 01223 715001   http://www.dialog.com 



Re: RedHat 6.1

2000-01-11 Thread Tim Jung

I am running RAID 5 on my news server with Red Hat 6.1 out of the box with
no problems. I also am running RAID 1 on one of my static content web
servers as well with no problems. On our news server I am using UW SCSI 2,
and on the web server I am running EIDE UDMA drives with no problems. We
have been running this way for almost a month now with no problems.

So I would say that the RAID stuff is very stable at this time. I have been
following the RAID driver development for over a year. Since as an ISP we
really want/need RAID support.

Tim Jung
System Admin
Internet Gateway Inc.
[EMAIL PROTECTED]


- Original Message -
From: "Jochen Scharrlach" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: "RAID Mailinglist" [EMAIL PROTECTED]
Sent: Tuesday, January 11, 2000 3:16 AM
Subject: Re: RedHat 6.1


 Tim Niemueller writes:
  I will get a new computer in some days and I want to build up an array.
  I will use a derivative of RedHat Linux 6.1 (Halloween 4). There is RAID
  support in the graphical installation tool, so I think the RAID patches
  are already attached to the kernel.

 Yes, just like the knfs-patches and some other stuff.

  Any hints what I must change, any settings to do? If I compile my own
  kernel with the supplied kernel source, will this kernel support RAID
  and can I use it without any changes and can I install the RAID tools
  from RPM?

 The partitioning tool is (IMHO) a bit confusing - you'll first have to
 define partitions of the type "Linux-RAID", which you then have to
 combine with the "make RAID device" button. Don't let yourself be confused
 by the fact that the partition numbers change every time you add a
 partition...

 The default kernel options are set to include all the RAID stuff, so this
 is no problem - my experience has been that it usually isn't necessary to
 rebuild the kernel on RH 5.x/6.x, unless you need special drivers or
 want to use a newer kernel revision.

 The raidtools are (of course) in the default install set.

 Bye,
 Jochen

 --

 # mgm ComputerSysteme und -Service GmbH
 # Sophienstr. 26 / 70178 Stuttgart / Germany / Voice: +49.711.96683-5


 The Internet treats censorship as a malfunction and routes around it.
--John Perry Barlow



Re: Swapping Drives on RAID?

2000-01-11 Thread D. Lance Robinson

Scott,

1.  Use raidhotremove to take out the IDE drive.  Example:
raidhotremove /dev/md0 /dev/hda5
2.  Use raidhotadd to add the SCSI drive.  Example: raidhotadd /dev/md0
/dev/sda5
3.  Correct your /etc/raidtab file with the changed device.

 Lance.

Scott Patten wrote:

 I'm sorry if this is covered somewhere.  I couldn't find it.

 1 - I have a raid1 consisting of 2 drives.  For strange
 historical reasons one is SCSI and the other IDE.  Although
 the IDE is fairly fast the SCSI is much faster and since I
 now have another SCSI drive to add, I would like to replace
 the IDE with the SCSI.  Can I unplug the IDE drive, run in
 degraded mode, edit the raid.conf and somehow mkraid
 without losing data, or do I need to restore from tape?
 BTW, I'm using 2.2.13ac1.




Proper settings for fstab

2000-01-11 Thread Gregory Leblanc

I've managed to create a RAID stripe set (RAID 0) out of a pair of
SCSI2-W (20MB/sec) drives, and it looks happy.  I'd like to mount some
part of my filesystem on this new device, but when I add it to fstab in
an out-of-the-way location with 1 2 following that entry in fstab, it
always has errors on boot.  They are usually something about attempting
to read thus-and-such block caused a short read.  fsck'ing that drive by
hand generally doesn't find any errors, although every third or fourth
time something will turn up (same error, respond ignore, then fix a
couple of minor errors).  Any ideas on how to track this down?  Thanks,
Greg
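
(For reference, the kind of fstab entry being described here, with the
device and mount point purely illustrative, is:

    /dev/md0   /mnt/stripe   ext2   defaults   1 2

where the trailing "1 2" are the dump flag and the fsck pass number.)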



Which filesystem(s) on RAID for speed

2000-01-11 Thread Gregory Leblanc

I'm back to running everything from my dog-slow UDMA drive again,
because I have bad feelings about my stripe set.  But once I get things
cleared up, which filesystem(s) should I put on a RAID-0 device for best
system performance?  The two drives in the stripe set are identical,
because this should be the best way to go about it (right?).  I'm
looking to learn more about software RAID on *nix systems, after some
bad times with software RAID on NT, so any good links are appreciated. 
Thanks,
Greg



Re: large ide raid system

2000-01-11 Thread D. Lance Robinson

SCSI works quite well with many devices connected to the same cable. The PCI bus
turns out to be the bottleneck with the faster SCSI modes, so it doesn't matter
how many channels you have. Performance wasn't the issue for the original
poster, but if it were, multiple channels would improve performance when the
slower (single-ended) devices are used.

 Lance

Dan Hollis wrote:

 Cable length is not so much a pain as the number of cables. Of course with
 scsi you want multiple channels anyway for performance, so the situation
 is very similar to ide. A cable mess.



Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

 Hi,

 On Fri, 07 Jan 2000 13:26:21 +0100, Benno Senoner [EMAIL PROTECTED]
 said:

  what happens when I run RAID5+ jornaled FS and the box is just writing
  data to the disk and then a power outage occurs ?

  Will this lead to a corrupted filesystem or will only the data which
  was just written, be lost ?

 It's more complex than that.  Right now, without any other changes, the
 main danger is that the raid code can sometimes lead to the filesystem's
 updates being sent to disk in the wrong order, so that on reboot, the
 journaling corrupts things unpredictably and silently.

 There is a second effect, which is that if the journaling code tries to
 prevent a buffer being written early by keeping its dirty bit clear,
 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data than
 the data being written if a disk dies and we have to start using parity
 for that stripe.

do you know if using soft RAID5 + regular ext2 causes the same sort of
damage, or if the corruption chances are lower when using a non-journaled FS?

is the potential corruption caused by the RAID layer or by the FS layer?
(does the FS code or the RAID code need to be fixed?)

if it's caused by the FS layer, how do XFS (not here yet ;-) ) or
ReiserFS behave in this case?

cheers,
Benno.




 Both are fixable, but for now, be careful...

 --Stephen





Ribbon Cabling (was Re: large ide raid system)

2000-01-11 Thread Andy Poling

On Tue, 11 Jan 2000, Gregory Leblanc wrote:
 If you cut the cable
 lengthwise (no, don't cut the wires) between wires (don't break the
 insulation on the wires themselves, just the connecting plastic) you can
 get your cables to be 1/4 the normal width (up until you get to the
 connector).

I don't know about IDE, but I'm pretty sure that's a big no-no for SCSI
cables.  The alternating conductors in the ribbon cable are sig, gnd, sig,
gnd, sig, etc.  And it's electrically important (for proper impedance and
noise and cross-talk rejection) that they stay that way.

I think the same is probably true for the schmancy UDMA66 cables too...

-Andy



Re: large ide raid system

2000-01-11 Thread Thomas Davis

John Burton wrote:
 
 Thomas Davis wrote:
 
  James Manning wrote:
  
   Well, it's kind of on-topic thanks to this post...
  
   Has anyone used the systems/racks/appliances/etc from raidzone.com?
   If you believe their site, it certainly looks like a good possibility.
  
 
  Yes.
 
  It's pricey.  Not much cheaper than a SCSI chassis.  You only save money
  on the drives.
 
 
 Interesting... The 100GB Internal RAID-5 SmartCan I purchased from
 RaidZone was approx. $5k. The quotes I got for a SCSI equivalent ranged
 from $10k to $15K. Personally I consider half the cost significantly
 cheaper. I also was quite impressed with a qoute for a 1TB rackmount
 system in the $50K range, again SCSI equivalents were significantly
 higher...
 

We paid $25k x 4, for:

2x450mhz cpu
256mb ram
15x37gb IBM 5400 drives (550 gb of drive space)
Intel system board, w/eepro
tulip card 
(channel bonded into cisco5500)

 
 Performance is pretty good - these numbers are for a first generation
 smartcan (spring '99)
 
                ---Sequential Output---    ---Sequential Input--    --Random--
                -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
  raidzone  100  6923 89.7 25987 26.6 14230 28.9  7297 89.4 215121 77.7 16407.3 69.7
  raidzone  200  6537 86.2 22175 21.5 14297 30.2  7667 92.5  56355 36.0   377.5  3.1

  Softraid  100  6598 86.0 43411 36.5 12077 27.4  6180 77.9  54022 46.4   721.4  4.1
  Softraid  200  8337 87.9 25373 24.0  9009 18.8  8952 87.1  34413 21.7   301.1  2.2

You made a mistake.  :-)  Your bonnie size is smaller than the amount of
memory in the machine you tested on - so you tested the memory, NOT the
drive system.
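
(As James suggested earlier in the thread, the bonnie file size should be
at least twice RAM; for the 128MB box above that would be something along
the lines of

    bonnie -s 256 -d /scratch

where -s is the size in MB and the scratch directory is just an example
path on whatever filesystem is actually being tested.)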

Our current large machine(s) (15x37gb IBM drives, 500gb file system, 4kb
blocks, v2.2.13 kernel, fixed knfsd, channel bonding, raidzone 1.2.0b3)
does:

              ---Sequential Output---    ---Sequential Input--    --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
pdsfdv10 1024 14076 85.1 18487 24.3 12089 35.8 20182 83.0 63064 69.8  344.4  7.1

I've also hit it with 8 machines, doing an NFS copy of about 60gb onto
it, and it sustained about a 20mb/sec write rate.

 
 Using "top":
  - With "Softraid" bonnie and the md Raid-5 software were sharing the
 cpu equally
  - With "raidzone" bonnie was consuming most (85%) of the cpu and no
 other processes
and "system"  15%
 

I've seen load averages in the 5's and 6's.  This is on a dual processor
machine w/256mb of ram.  My biggest complaint is the raid rebuild code
runs as the highest priority, so on a crash/reboot, it takes _forever_
for fsck to complete (because the rebuild thread is taking all of the
CPU and disk bandwidth).

The raidzone code also appears to be single-threaded - it doesn't take
advantage of multiple CPUs (although user-space code then benefits from
having the second CPU).

 Getting back to the discussion of Hardware vs. Software raid...
 Can someone say *definitively* *where* the raid-5 code is being run on a
 *current* Raidzone product? Originally, it was an "md" process running
 on the system cpu. Currently I'm not so sure. The SmartCan *does* have
 its own BIOS, so there is *some* intelligence there, but what exactly is
 the division of responsibility here...
 

None of the RAID code runs in the SmartCan or the controller.  It all
runs in the kernel.  The current code has several kernel threads, and a
user-space thread:

root 6  0.0  0.0 00 ?SW   Jan04   0:02
[rzft-syncd]
root 7  0.0  0.0 00 ?SW   Jan04   0:00
[rzft-rcvryd]
root 8  0.1  0.0 00 ?SW  Jan04  14:41
[rzft-dpcd]
root   620  0.0  0.0   5640 ?SW   Jan04   0:00 [rzmpd]
root   621  0.0  0.1  2080  296 ?SJan04   3:30 rzmpd
root  3372  0.0  0.0 00 ?ZJan10   0:00 [rzmpd
defunct]
root  3806  0.0  0.1  1240  492 pts/1S09:57   0:00 grep rz

-- 
+--
Thomas Davis| PDSF Project Leader
[EMAIL PROTECTED] | 
(510) 486-4524  | "Only a petabyte of data this year?"



Re: large ide raid system

2000-01-11 Thread Gregory Leblanc

Benno Senoner wrote:
 
 Jan Edler wrote:
 
  On Mon, Jan 10, 2000 at 12:49:29PM -0800, Dan Hollis wrote:
   On Mon, 10 Jan 2000, Jan Edler wrote:
 - Performance is really horrible if you use IDE slaves.
   Even though you say you aren't performance-sensitive, I'd
   recommend against it if possible.
  
   My tests indicate UDMA performs favorably with ultrascsi, at about 1/6 the
   cost. Cost is often a big factor.
 
  I wasn't advising against IDE, only against the use of slaves.
  With UDMA-33 or -66, masters work quite well,
  if you can deal with the other constraints that I mentioned
  (cable length, PCI slots, etc).
 
 Do you have any numbers handy ?
 
 will the performance of master/slave setup be at least HALF of the
 master-only setup.

Well, this depends on how it's used.  If you were saturating your I/O
bus, then things would be REALLY ugly.  Say you've got a controller
running in UDMA/33 mode, with two disks attached.  If you have drives
that are reasonably fast, say recent 5400 RPM UDMA drives, then this
will actually hinder performance compared to having just one drive.  If
you're doing 16MB/sec of I/O, then your performance will be slightly
less than half the performance of having just one drive on that channel
(consider overhead, IDE controller context switches, etc).  If you only
need the space, then this is an acceptable solution for low-throughput
applications.  I don't know jack schitt about ext2, the linux ide
drivers (patches or old ones), or about the RAID code, except that they
work.  

 
 For some apps cost is really important, and software IDE RAID has a very low
 price/Megabyte.
 If the app doesn't need killer performance , then I think it is the best
 solution.

It's a very good solution for a small number of disks, where you can
keep everything in a small case.  It may actually be superior to SCSI
for situations where you have 4 or fewer disks and can put just a single
disk on a controller.  

 
 now if we only had soft-RAID + journaled FS + power failure safeness  right now
 ...

As long as it gets there relatively soon, I'll be happy.
fsck'ing is the only thing that really bugs me...
Greg



Re: optimising raid performance

2000-01-11 Thread James Manning

[ Tuesday, January 11, 2000 ] [EMAIL PROTECTED] wrote:
 I'd really love to see you do a s/w raid 1 over 2 6-disk raid0's from
 the card and check that performance-wise...  I believe putting the raid1
 and raid0 logic on sep. processors could help, and worst case it'll
 give a nice test case for any read-balancing patches floating around
 (although you've noted that you are more write-intensive)
 
   Which would you like me to try all software or do part in software
 and part in hardware and if the latter which part?  The raid card
 seems pretty good (233MHz strongarm onboard) so I doubt that is limiting
 us.

dual PII-500 > single 233 :)

s/w raid 1 over 2 6-disk h/w raid0's is what I meant to ask for

I trust the strongarm to handle raid0, but that's about it :)

what stripe/chunk sizes are you using in the raid?  My exp. has been
smaller is better down to 4k, although I'm not sure why :)

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

(...)


 3) The soft-raid backround rebuild code reads and writes through the
buffer cache with no synchronisation at all with other fs activity.
After a crash, this background rebuild code will kill the
write-ordering attempts of any journalling filesystem.

This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

 Interaction 3) needs a bit more work from the raid core to fix, but it's
 still not that hard to do.

 So, can any of these problems affect other, non-journaled filesystems
 too?  Yes, 1) can: throughout the kernel there are places where buffers
 are modified before the dirty bits are set.  In such places we will
 always mark the buffers dirty soon, so the window in which an incorrect
 parity can be calculated is _very_ narrow (almost non-existant on
 non-SMP machines), and the window in which it will persist on disk is
 also very small.

 This is not a problem.  It is just another example of a race window
 which exists already with _all_ non-battery-backed RAID-5 systems (both
 software and hardware): even with perfect parity calculations, it is
 simply impossible to guarantee that an entire stripe update on RAID-5
 completes in a single, atomic operation.  If you write a single data
 block and its parity block to the RAID array, then on an unexpected
 reboot you will always have some risk that the parity will have been
 written, but not the data.  On a reboot, if you lose a disk then you can
 reconstruct it incorrectly due to the bogus parity.

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.



 --Stephen

thank you very much for these clear explanations,

Last doubt: :-)
Assume all the RAID code - FS interaction problems get fixed.
Since a Linux soft-RAID5 box has no battery backup,
does this mean that we will lose data
ONLY if there is a power failure AND a subsequent disk failure?
If we lose power, and after reboot all disks remain intact,
can the RAID layer reconstruct all information in a safe way?

The problem is that power outages are unpredictable even in the presence
of UPSes, therefore it is important to have some protection against
power losses.

regards,
Benno.






[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future
questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data
 than the data being written if a disk dies and we have to start using
 parity for that stripe.

 do you know if using soft RAID5 + regular ext2 causes the same sort of
 damage, or if the corruption chances are lower when using a non-journaled
 FS?

Sort of.  See below.

 is the potential corruption caused by the RAID layer or by the FS
 layer ?  ( does need the FS code or the RAID code to be fixed ?)

It is caused by neither: it is an interaction effect.

 if it's caused by the FS layer, how does behave XFS (not here yet ;-)
 ) or ReiserFS in this case ?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets
written and when: the writeback caching can trickle to disk at any time,
and other system components such as filesystems and the VM can force a
write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.

RAID-5 needs to interact directly with the buffer cache in order to be
able to improve performance.

There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.
   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a
   crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches
   the state of the disk and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.

3) The soft-raid backround rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journalling filesystem.  

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.
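
As a rough userspace-only illustration of the fix just sketched for
interactions 1) and 2) (nothing below is real md or buffer-cache code; the
names cached_block and usable_for_parity are invented for the example), the
policy amounts to: never trust a cached buffer that somebody still holds a
reference on, and fall back to reading the block from disk before computing
parity.

#include <stdio.h>

struct cached_block {
        int refcount;           /* stands in for b_count         */
        int dirty;              /* stands in for the b_dirty bit */
};

/* A buffer that a journaling fs still holds a reference on may not match
 * the on-disk contents, so it must not be used when computing parity. */
static int usable_for_parity(const struct cached_block *b)
{
        return b->refcount <= 1;
}

int main(void)
{
        struct cached_block pinned = { 2, 0 };  /* held by a journal    */
        struct cached_block idle   = { 1, 0 };  /* nobody else holds it */

        printf("pinned: %s\n", usable_for_parity(&pinned)
               ? "use cached copy" : "read the block back from disk");
        printf("idle:   %s\n", usable_for_parity(&idle)
               ? "use cached copy" : "read the block back from disk");
        return 0;
}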


So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 1) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existant on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem.  It is just another example of a race window
which exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.


--Stephen



Re: large ide raid system

2000-01-11 Thread Jan Edler

On Tue, Jan 11, 2000 at 04:25:27PM +0100, Benno Senoner wrote:
 Jan Edler wrote:
  I wasn't advising against IDE, only against the use of slaves.
  With UDMA-33 or -66, masters work quite well,
  if you can deal with the other constraints that I mentioned
  (cable length, PCI slots, etc).
 
 Do you have any numbers handy ?

Sorry, I can't seem to find any quantitative results on that right now.

 will the performance of master/slave setup be at least HALF of the
 master-only setup.

I did run some tests, and my recollection is that it was much worse.

 For some apps cost is really important, and software IDE RAID has a very low
 price/Megabyte.
 If the app doesn't need killer performance , then I think it is the best
 solution.

It all depends on your minimum acceptable performance level.
I know my master/slave test setup couldn't keep up with fast ethernet
(10 MByte/s).  I don't remember if it was 1 Mbyte/s or not.

I was also wondering about the reliability of using slaves.
Does anyone know about the likelihood of a single failed drive
bringing down the whole master/slave pair?  Since I have tended to
stay away from slaves, for performance reasons, I don't know
how they influence reliability.  Maybe it's ok.

Jan Edler
NEC Research Institute



Re: large ide raid system

2000-01-11 Thread James Manning

[ Tuesday, January 11, 2000 ] John Burton wrote:
 Performance is pretty good - these numbers are for a first generation
 smartcan (spring '99)

Could you re-run the raidzone and softraid with a size of 512MB or larger?

Could you run the tiobench.pl from http://www.iki.fi/miku/tiotest
(after "make" to build tiotest)

Those would be great results to see.

Thanks,

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development



Re: optimising raid performance

2000-01-11 Thread James Manning

[ Tuesday, January 11, 2000 ] [EMAIL PROTECTED] wrote:
 what stripe/chunk sizes are you using in the raid?  My exp. has been
 smaller is better down to 4k, although I'm not sure why :)
 
  We're currently using 8k, but with our load, if I can go smaller
 I will.
   Is there any merit in using -R on mke2fs if we're doing raid1? 

I've always interpreted -R stride= as meaning "how many ext2 blocks to
gather before sending to the lower-level block device".  This way the
block device can deal with things more efficiently.  Since the stride=
must default to 1 (I can't see how it could pick a different one) then
any time your device (h/w or s/w raid) is using larger block sizes -R
would seem to be a good choice (for 8K block sizes, stride=2)
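
For the numbers being discussed (4k ext2 blocks on an 8k raid chunk, device
name purely illustrative), that works out to something like:

    mke2fs -b 4096 -R stride=2 /dev/md0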

The raid1 shouldn't matter as much, so try without stride= and then with
stride=2 (if still using 8K block sizes)

I get the feeling that the parallelism vs. efficiency tradeoff in block
sizes still isn't fully understood, but lots of random writes should
almost certainly do best with the smallest block sizes available down
to a single page (4k)

As always, I'd like to solicit other views on this :)

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread mauelsha

"Stephen C. Tweedie" wrote:
 
 Hi,
 
 This is a FAQ: I've answered it several times, but in different places,

SNIP

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.
 

To try to avoid this kind of problem, some brands do have additional
logging in place (to disk, which is slow for sure, or to NVRAM), which
enables them to at least recognize the fault and avoid reconstructing
invalid data, or even enables them to recover the data by using
redundant copies of it in NVRAM plus logging information about what
could be written to the disks and what could not.

Heinz



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha
[EMAIL PROTECTED] said:

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.

 To try to avoid this kind of problem some brands do have additional
 logging (to disk which is slow for sure or to NVRAM) in place, which
 enables them to at least recognize the fault to avoid the
 reconstruction of invalid data or even enables them to recover the
 data by using redundant copies of it in NVRAM + logging information
 what could be written to the disks and what not.

Absolutely: the only way to avoid it is to make the data+parity updates
atomic, either in NVRAM or via transactions.  I'm not aware of any
software RAID solutions which do such logging at the moment: do you know
of any?

--Stephen