Re: RAID questions
On Mon, 7 Aug 2000, Adam McKenna wrote:

> 2) If I do, will it still be broken unless I apply the "2.2.16combo" patch?
> 3) If it will, then how do I resolve the problem with the md.c hunk failing with "2.2.16combo"?

If I remember correctly, 2.2.16combo was there to make it possible to use Ingo's older raid patches on 2.2.16 (before raid-2.2.16-A0 was released). I'm not 100% sure, though.

> This is a production system I am working on here. I can't afford to have it down for an hour or two to test a new kernel. I'd rather not be working with this mess to begin with, but unfortunately this box was purchased before I started this job, and whoever ordered it decided that software raid was "Good enough".

A test machine comes in handy. Not to actually test the new RAID code (we did/do that already ;) ), but just to practise handling SW raid.

> I am not subscribed to either list so CC's are desirable. However if you don't want to CC then you don't have to -- I'll just read the archives. That is, if someone fixes the "Mailing list archives" link on www.linux.org to point to someplace that exists and actually has archives.

IMHO, if you need (or want) to work with SW raid, it would be better to subscribe. There is not all that much traffic here and (usually) the stuff we get here is relevant (with the exception of too many questions about where to find the patches, but that should be fixed anyway). Besides, any real problems, bug reports and warnings appear here very soon.

D.
RE: owie, disk failure
On Mon, 7 Aug 2000, Corin Hartland-Swann wrote:

> I have to confess I've never heard of manufacturers offering diagnostic utilities for disks... Gregory, can you point me at any examples? Am I just being a complete dumbass here?

At least Western Digital does, on their FTP site at ftp://ftp.wdc.com/pub/drivers/hdutil. However, I don't know what those utilities do better than badblocks and friends, or how.

D.
Re: Raid developers question
On Thu, 27 Jul 2000, Art wrote:

> After pulling out one disk (system off line), it came back on line with the data intact... It started automatically the reconfiguration using the spare disk. The funny thing was, after reinserting the original disk it did not reconfigure it automatically. I had to raidhotadd the disk. Then it started reconfiguring it.

That's expected behaviour.

> I would like to be able to stop my raid array and switch off the power of this box (not the computer). If I switch the array off and on, the scsi disks do not spin up. So I have to reboot the machine (scsi card spins them up). This is a bit awkward.

Check out /usr/src/linux/drivers/scsi/scsi.c. You need to do some magic with 'remove-single-device' and then, after restarting the disks, 'add-single-device'. If you turn your disk off without first removing it from /proc/scsi/scsi, the controller seems to get confused...

> Can the raid-software also control the above mentioned leds?

AFAIK, no.

> What is translucent mode?

No idea, it's just not supposed to be used... :)

> Which drive is my hotspare if I issue `cat /proc/mdstat` ?
>
> md0 : active raid5 sdd2[3] sdc2[2] sdb2[1] sda2[0]
>       17333888 blocks level 5, 32k chunk, algorithm 2 [3/3] [UUU]

I suppose active RAID drives are numbered from 0, meaning that 0, 1 and 2 are active (in a 3+1 RAID5 array). So sdd2[3] is the spare drive.

> Is it possible to modify the raidstart/stop code so that it uses scsi commands to start/stop the disks (in a running machine)?

This is not trivial, as software raid is not limited to SCSI disks only, so that would involve quite a lot of sanity checking... However, it's pretty easy to write a couple of scripts doing just that.

D.
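For what it's worth, the numbering rule supposed in the reply above can be written down as a small Python sketch. It relies on the same unverified assumption the reply makes: members whose bracketed index is at or above the raid-disk count in the `[n/m]` field are spares. The function name `find_spares` and the regular expressions are made up for this illustration; they are not part of raidtools.

```python
import re

def find_spares(mdstat_entry):
    """Return member devices whose index is >= the number of raid disks.

    Assumption (taken from the reply above, not checked against the md
    source): active members are numbered 0..n-1, anything higher is a spare.
    """
    # The "[n/m]" field: n = raid disks in the array, m = disks working.
    counts = re.search(r"\[(\d+)/(\d+)\]", mdstat_entry)
    if counts is None:
        return []
    raid_disks = int(counts.group(1))
    # Members look like "sdd2[3]": device name followed by its index.
    members = re.findall(r"(\w+)\[(\d+)\]", mdstat_entry)
    return [name for name, idx in members if int(idx) >= raid_disks]

# The /proc/mdstat entry quoted in the message above.
entry = (
    "md0 : active raid5 sdd2[3] sdc2[2] sdb2[1] sda2[0]\n"
    "      17333888 blocks level 5, 32k chunk, algorithm 2 [3/3] [UUU]\n"
)
print(find_spares(entry))
```

If the assumption holds, this reports sdd2 for the array quoted above, which matches the answer given in the reply.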
RE: raid and 2.4 kernels
On Thu, 27 Jul 2000, Neil Brown wrote:

> If raid on 2.4 is faster than raid in 2.2, we say "great". If it is slower, we look at the no-raid numbers. If no-raid on 2.4 is slower than no-raid on 2.2, we say "oh dear, the disc subsystem is slower on 2.4", and point the finger appropriately. If no-raid on 2.4 is faster than no-raid on 2.2, then we say "Hmm, must be a problem with raid" and point the finger there. Does that make sense?

In a way, yes. But raid could depend on other parts of the kernel more heavily than no-raid disk access does, and thus could be more affected by errors/problems in those parts.

D.
Re: raid5 troubles
On Thu, 20 Jul 2000, Hermann 'mrq1' Gausterer wrote:

> but when i do mkraid, i get an error :-(((
>
> [root@mrqserv2 linux]# mkraid /dev/md0
> handling MD device /dev/md0
> analyzing super-block
> disk 0: /dev/sdb1, 4233096kB, raid superblock at 4233024kB
> disk 1: /dev/sdc1, 4233096kB, raid superblock at 4233024kB
> disk 2: /dev/sda6, failed
> mkraid: aborted, see the syslog and /proc/mdstat for potential clues.
> [root@mrqserv2 linux]#
>
> what is wrong here ?

Most probably your version of raidtools-0.90 doesn't recognize the failed-disk directive. I use the version from Ingo's page (marked dangerous) http://people.redhat.com/mingo/raid-patches/... and it works fine.

D.
Re: upgrading a raid kernel
On Tue, 11 Jul 2000, Dirk Bonenkamp - Bean IT wrote:

> I want to upgrade a machine running 2.2.10 kernel running software raid to 2.2.16. I only found raid patches ending with 2.2.11 (ftp.fi.kernel.org), will this work on 2.2.16?? And, is patching and installing the new kernel enough to get things working? (I guess so, raid devices are build working, so no need for new raidtools etc?).

Aren't you reading this mailing list? Patches for new kernels are available at http://people.redhat.com/mingo/raid-patches/. You should also grab the raidtools from there, as they support some useful new features (such as the failed-disk directive).

D.
Re: Problem with raidhotremove
On Tue, 27 Jun 2000, Neil Brown wrote:

> If you don't have raidsetfaulty (so RedHats don't have it), grab the latest raidtools from http://www.{country}.kernel.org/pub/linux/daemons/raid/alpha/raidtools-19990824-0.90.tar.gz

I think you can get more recent raidtools from http://people.redhat.com/mingo/raid-patches/ .

D.
Re: 2.2.16 RAID patch
On Wed, 14 Jun 2000, Matthew DeFoor wrote:

> now!
> sdb6's event counter: 0006
> sda6's event counter: 0006
> request_module[md-personality-3]: Root fs not mounted

Seems to me you have raid1 compiled as a module. That's OK if you really know initrd stuff, but personally I prefer to compile raid1 into the kernel. It saves me the trouble of creating an initrd...

D.
Re: Software Raid on linux 2.2.14/5 with version 0.90.0 of raidtools
On Thu, 8 Jun 2000, Maria Blackmore wrote:

> In a nutshell, it simply doesn't work, there isn't much more I can say than that, because that is just it.

In a nutshell, get the patches (http://www.redhat.com/~mingo/raid-patches/), compile the kernel and off you go.

> needless to say, neither the syslog or /proc/mdstat provide any hints whatsoever, in fact there is nothing logged at all during this.

Needless to say, this was on the list like 6000 times. I wish the HOWTO would mention the location of recent patches.

D.
Problems again
Today, I had another SCSI failure. I was able to get a bit more of the dmesg output, but can't figure out what is going wrong there. In /var/log/messages, the unusual stuff starts with this, repeated a couple of times:

Mar 28 12:00:45 mail kernel: (scsi0:0:2:0) Parity error during Message-In phase
Mar 28 12:00:45 mail kernel: (scsi0:0:2:0) Parity error during Data-In phase.

It goes on to a lot of messages similar to this (the pid, the id and everything to the right of 'lun 0' vary):

Mar 28 12:00:45 mail kernel: scsi : aborting command due to timeout : pid 14301024, scsi0, channel 0, id 0, lun 0 Write (10) 00 00 6b 0f 14 00 00 08 00

Then this (a lot of lines):

Mar 28 12:00:45 mail kernel: SCSI host 0 abort (pid 14301062) timed out - resetting
Mar 28 12:00:45 mail kernel: SCSI bus is being reset for host 0 channel 0.

Somewhere in between this shows up:

Mar 28 12:00:45 mail kernel: (scsi0:0:2:0) Performing Domain validation.

Then this:

Mar 28 12:00:45 mail kernel: SCSI host 0 reset (pid 14301061) timed out again -
Mar 28 12:00:45 mail kernel: probably an unrecoverable SCSI bus or device hang.

And finally this:

Mar 28 12:00:45 mail kernel: (scsi0:0:2:0) Successfully completed Domain validation.
Mar 28 12:00:45 mail kernel: (scsi0:0:2:0) Using asynchronous transfers.
Mar 28 12:00:45 mail kernel: (scsi0:0:1:0) Synchronous at 80.0 Mbyte/sec, offse 31.
Mar 28 12:00:45 mail kernel: (scsi0:0:0:0) Using asynchronous transfers.

followed by some more lines of the previous messages. These are the last entries I got in /var/log/messages before rebooting (hard). The machine was sort of alive (i.e. ping, httpd, php3...), but I was unable to log in (even locally). The one console I had open was able to do 'ls', 'free' and 'dmesg', but anything that touched the hard disk froze up. Even 'shutdown' and 'reboot' failed to execute.

The weird thing is that all of these messages occurred in a single second (12:00:45). Could someone with more SCSI experience diagnose what the cause might be?
Thanks, D.

PS: More info about the machine:

CPU:   Dual P-III 500 MHz
Board: Intel L440GX
Disks: 4x IBM DNES-309170Y (3 RAID5 + 1 spare)
LAN:   Integrated Intel EtherExpress Pro 10/100

cat /proc/interrupts
           CPU0       CPU1
  0:     253546     252241   IO-APIC-edge   timer
  1:         99        103   IO-APIC-edge   keyboard
  2:          0          0         XT-PIC   cascade
  4:        473        472   IO-APIC-edge   serial
  8:          0          0   IO-APIC-edge   rtc
 13:          1          0         XT-PIC   fpu
 19:     358370     359232   IO-APIC-level  aic7xxx, aic7xxx
 21:     225846     225239   IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet

cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
03c0-03df : vga+
03f8-03ff : serial(auto)
1080-109f : Intel Speedo3 Ethernet
1400-14be : aic7xxx
1800-18be : aic7xxx

uname -a
Linux my.host.name 2.2.13 #1 SMP Tue Mar 14 11:55:56 CET 2000 i686 unknown
Re: Problems again
On Tue, 28 Mar 2000, Mike Bilow wrote:

> That's a hardware problem. A SCSI parity error is reported by the hardware and simply passed up the chain. Unless there is something seriously wrong in the aic7xxx sequencer code, which I doubt, this looks like a typical cabling and termination issue.

Well, the chassis is an Intel pre-installed rack-mountable one with a hot-swappable SCSI backplane. All the cables were already connected to the disk racks. All I had to do was to install the disks in the racks and slide them in. I don't think I could have screwed something up there... :/

> Hard to say, but my guess is that your drive has elected to shut down. I don't know what devices are on the bus, but the negotiation of asynchronous transfers is not a good sign and it may indicate one of the lines is being held in a funny state. Are you trying to run slow and fast devices on the same SCSI bus?

No, the disks are the only SCSI devices there. No other disk/tape devices there (except a standard 3.5" floppy, but it really shouldn't matter).

> I think you have an electrical issue.

I feared that, but what should I do? It's all LVD and all pre-installed by Intel... except the disks, of course. Besides, it only happens every few weeks even though the machine is pretty active (in use).

> 19: 358370 359232 IO-APIC-level aic7xxx, aic7xxx
> * * *
> 1400-14be : aic7xxx
> 1800-18be : aic7xxx
>
> Are you really running two separate aic7xxx controllers? Do they have the same firmware revision?

I guess the motherboard has two chips integrated. I didn't really check then (now it's off-site), but the kernel detects two hosts (scsi0 scsi1). The board also features two 68-pin SCSI connectors (the one I use is marked LVD, the other is marked SE).

Thanks, D.
Re: RAID5 array not coming up after repaired disk
On Fri, 24 Mar 2000, Douglas Egan wrote:

> When this happened to me I had to "raidhotadd" to get it back in the list. What does your /proc/mdstat indicate? Try:
>
> raidhotadd /dev/md0 /dev/sde7

I *think* you should 'raidhotremove' the failed disk-partition first, then you can 'raidhotadd' it back.

D.
Re: Which patch? Kernel 2.2.14
On Mon, 13 Mar 2000, Clinton Bittel wrote:

> I tried patching
>
> ide_2_2_14_2124_patch.gz
> raid-2_2.14-B1.gz
>
> And still cannot find a mention of the Ultra 66 or 33 when I go to recompile the kernel. Is it already built in??

ide_2_2_14_2124 takes care of the ATA-66 drivers. They are not referred to generically as 'ATA-66', but are listed on a per-chipset basis. CONFIG_BLK_DEV_HPT366 is one, for example...

D.
Disk or SCSI bus problem?
Hi! I have a three-disk RAID5 with a 2.2.13-SMP kernel (with 2.2.11 raid patches) and recently I seem to be having some disk-related trouble. Once the machine was brought down by a huge number of SCSI errors (printed out to the console). That time I was unable to track down the problem, especially because the machine was working well after reboot (I even badblocked the suspected disk, the reconstruction went well...). Of course, the machine is now under 'heavy surveillance' and recently I got this in /var/log/messages:

(scsi0:0:0:0) Parity error during Message-In phase.
(scsi0:0:0:0) Parity error during Data-In phase.
(scsi0:0:0:0) Parity error during Message-In phase.
(scsi0:0:0:0) Parity error during Data-In phase.
(scsi0:0:2:0) Parity error during Message-In phase.
(scsi0:0:2:-1) Unexpected busfree, LASTPHASE = 0xa0, SEQADDR = 0x14f

Does anyone have a clue what this might mean?

Thanks, Danilo
__
| Danilo Godec       | Agenda d.o.o.       | ISP for business    |
| jr. Syst. Admin    | Gosposvetska 84     | WAN networks        |
| [EMAIL PROTECTED]  | si-2000 Maribor     | Internet/Intranet   |
| tel:+386.62.226364 | Slovenija           | Application servers |
| fax:+386.62.226364 | http://www.slon.net | Caldera OpenLinux   |
Failed disk
I have a three-disk RAID5 array and one disk seems to be slowly failing. The disks are on a hot-swappable backplane. I know that 'echo "scsi remove-single-device X X X X" > /proc/scsi/scsi' works for me and I can remove and replace the disk, but NOT as long as it is in use in the RAID5 array. I don't want to stop the array for two reasons:

1. the / file system is on it
2. it's a production machine, running web and mail services for a lot of users

Is it somehow possible to temporarily mark the disk as unused (failed?), apply this setting to the running array and thus 'free' the device for removal?

Thanks, D.
RE: Failed disk
On Mon, 13 Mar 2000 [EMAIL PROTECTED] wrote:

> I think what you are looking for is: raidhotremove /dev/md? /dev/sd??

I already tried that. Simply raidhotremove-ing doesn't work as long as the /dev/sd?? is in use (it complains about it). But you're close. I found out that Ingo's _dangerous_ raidtools (2116) include a 'raidsetfaulty' command, which marks the device as failed. It is then possible to raidhotremove it afterwards. The original Redhat 6.1 raidtools-0.90-5 don't include that command. I'm currently testing this on my local machine using multiple partitions of a single disk as a RAID5 array (for testing only) and it's looking good.

Thanks, D.
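The order of operations described above can be sketched in Python. This only builds the command strings and runs nothing. The helper name `disk_swap_commands` is hypothetical, the argument order of raidsetfaulty is assumed to mirror raidhotremove (md device first, then the member), and partitioning the replacement disk before raidhotadd is left out.

```python
def disk_swap_commands(md, part, host, channel, scsi_id, lun):
    """Build (but do not run) the shell commands for swapping one member.

    Hypothetical helper for illustration; review the output by hand
    before pasting anything into a root shell.
    """
    addr = f"{host} {channel} {scsi_id} {lun}"
    return [
        # Mark the member as failed so that md stops using it.
        f"raidsetfaulty {md} {part}",
        # Only now can the member be taken out of the running array.
        f"raidhotremove {md} {part}",
        # Tell the SCSI layer the disk is going away, then pull it.
        f'echo "scsi remove-single-device {addr}" > /proc/scsi/scsi',
        # After the new disk is in the bay, register it again.
        f'echo "scsi add-single-device {addr}" > /proc/scsi/scsi',
        # Put the partition back; reconstruction should start.
        f"raidhotadd {md} {part}",
    ]

# Example values only: array, partition and SCSI address are made up.
for cmd in disk_swap_commands("/dev/md0", "/dev/sdb1", 0, 0, 1, 0):
    print(cmd)
```

Printing the commands instead of executing them is deliberate: on a production box each step is worth checking against /proc/mdstat before moving on to the next.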
Re: RAID5 and 2.2.14
On Sun, 23 Jan 2000, David Cooley wrote: Here's what I get when patching against a fresh 2.2.13-1.3.0 kernel source Where'd you get your source? I downloaded mine from ftp.kernel.org and it's 2.2.14-1.3.0 What is this '-1.3.0'? I don't think this is plain kernel source... If I go to ftp://ftp.kernel.org/pub/linux/kernel/v2.2/ (that is where official kernel tarballs are) I see linux-2.2.14.tar.bz2 (and .gz and .sign files). D.
Re: raid with 2.2.13
On Sun, 16 Jan 2000, Standardaccount wrote:

> How can I get raid running? BTW: I've tried to apply the kernel-patch for the 2.2.11 kernel, but patch won't work. Is there any need for the kernel-patch and where can I get it for the 2.2.13 kernel?

You need to apply the 2.2.11 patch to the 2.2.13 kernel tree. A few (I think two) errors are reported, but they can be safely ignored. The patch is necessary because the plain 2.2.13 kernel uses the 'old style' raid code, while raidtools-0.90 make use of the 'new style' raid code.

D.

PS: You can use a 2.2.14 kernel with a 2.2.14 patch now (http://people.redhat.com/mingo/raid...).
Re: kernel patch?
On Thu, 13 Jan 2000, Edward Schernau wrote:

> I am running 2.2.13, whose config script has options for RAID. I have raidtools-0.90. Why/Do I need to patch? Pointers appreciated.

You have to patch because the plain 2.2.13 kernel has 'old style' raid, while raidtools-0.90 are designed for 'new style' raid (which adds autodetection and other nice stuff). For 2.2.13 you can use the patch for 2.2.11, while for 2.2.14 you have to get a new patch from http://www.redhat.com/~mingo/raid-2.2.14-B1

D.
Re: Performance?
On Thu, 9 Dec 1999, Randy Winch wrote:

> Retested with mem=256M:

I usually use mem=12M and boot into single mode, so that memory really has almost no influence.

> bonnie
>
>               -------Sequential Output-------- ---Sequential Input-- --Random--
>               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
>          2000  7230 98.3 37168 59.2 19576 55.8  8305 97.4 71834 58.3 339.7  6.6

Well, in my case the read performance dropped from 280MB/sec (hehe, would be nice though) to ~30MB/sec (which is pretty cool too). Write performance was also affected, but I don't remember the figures. I guess bonnie uses pretty standard routines for file operations, which get cached and buffered by the kernel. I suggest you REALLY limit the memory (like 12M, could try lower) and run bonnie with 'normal' file sizes - this way you do get pretty realistic results and you don't have to wait that long...

D.
Re: 5*36.5 GB SoftRAID problem
On Wed, 8 Dec 1999, Jakob Sandgren wrote:

> No, actually i did not. I read the FAQ and according to it, it should be ok to start with mke2fs _before_ the sync/(re)build was finished. Anyone else who could confirm that this should be a problem?

A few days ago a bug that could cause this was mentioned on this list. It was said it could corrupt swap writes while the background reconstruction was going on, but who knows...

D.
Re: Web page for kernel/raid updates Promise Ultra66 issues
On Sun, 5 Dec 1999, Zach Coombes wrote:

> - the Ultra66 isn't supported yet. I'm running a 2.3.XX beta (yes I take digitalis regularly) as recent 2.2 kernels didn't even register the Ultra66 controllers' existence. Are the Ultra66 patches to the kernel nearing a state where we'll see them in a 2.2 release soon? (Slap me if this reads as a "aren't you finished yet" poke at the developers - it's not meant to be)

Well, the ide patches located in ftp.kernel.org/pub/linux/kernel/people/hedrick apply to 2.2 kernels very nicely and include support for a variety of UDMA66 controllers and chipsets. I think it's better and safer to run a non-development kernel with patches made for it (especially on production machines).

> Also some of the drives are coming up in PIO mode. Is there any redress to adjust this before mounting the drives (i.e. request that it re-check for DMA capable drives)?

You could use 'hdparm -d1 device'. You can do that after mounting too.

D.
Re: Raid with new kernel
On Sun, 5 Dec 1999, ACEAlex wrote:

> the 2.2.13 kernel which is the latest stable. But when i try to start using it i get a different startup screens (see below). Do i have to patch the kernel before i use raidtools. Cause i get errors when trying to execute mkraid etc..

Yes. The RedHat kernel includes the latest RAID patches. You should patch your 2.2.13 kernel too. The patch will probably produce two rejects, but you can ignore them.

ftp.kernel.org/pub/linux/daemons/raid/alpha/raid0145-19990824-2.2.11.bz2

> Also i have another question. In some faqs they are talking about mdadd and mddel etc.. But i cant find those with the redhat package. I use mkraid and edit the /etc/raidtab file.

These are the 'old' raid utilities. Now you should look for raidstart, raidstop, etc.

D.
Re: problems with RAID fs
On Thu, 2 Dec 1999, Terry Ewing wrote:

> I also manually used cp to copy about 10 or 12 of the corrupted files from the original tree to the RAID filesystem. After this, the files that I copied did not differ from the originals. It seems that files become corrupted under a heavy load either by the RAID5 daemon or in hardware.

Bad RAM can often be the cause of weird problems like that (I had my share of that). Now I use memtest86 on every machine I build and it seems very reliable - last week I discovered 4 out of 12 brand new DIMMs to be faulty, and the machines didn't even complain under moderate load. A kernel compile using 'make -j' resulted in many 'signal 11' errors.

http://reality.sgi.com/cbrady_denver/memtest86/

D.
Re: ac?
On Tue, 30 Nov 1999, David Cunningham wrote:

> I've seen a lot of recommendations for obtaining the 2.2.13ac kernel. So far I've found nothing listed with the ac suffix. What is this ac?

These are Alan Cox's patches, located on kernel.org mirrors in the /pub/linux/kernel/people/alan/ directory. They combine some features and/or bug fixes usually found in separate patches.

D.

PS: I got excellent results with plain 2.2.13+raid0145-19990824-2.2.11 (only two rejects while patching, both due to already patched files).
Re: Could not change configuration.
On Thu, 25 Nov 1999, Dong Hu wrote:

> Now I want to change the configuration to raid0, so I edit the /etc/raidtab file, issue mkraid --force /dev/md0,

I suppose you did stop the raid device ('raidstop /dev/md0') first? Then I think you should use --really-force (this is not in the documentation, but it is printed on the screen when you do mkraid --force).

D.
Partitions on RAID ?
Hi! Can I use partitions on software raid device (/dev/md0, raid-5 in my case)? Using 2.2.13 with raid0145-19990824-2.2.11 patch and raidtools-0.90. Thanks, D.
Archive?
Hi! I'm new to this list and have some questions. However I'd like to first browse through the list archive if it's available somewhere. Is it? :) Thanks, D.
Monitoring?
Hi! Ok, found an archive, but haven't found the questions/answers I was hoping to find. I have a RAID1 setup with kernel 2.2.13 and appropriate patches for 2.2.11 (only two files didn't patch correctly, as they were already patched in 2.2.13) and raidtools-0.90. Everything works nice, even hot-swapping disks (with hot-pluggable SCSI backplane and some caution, of course) didn't cause a problem. However, are there any tools already available to monitor the md device and notify the administrator via mail, modem, pager etc.? Thanks, D.
Re: Monitoring?
On Fri, 12 Nov 1999, Jakob Østergaard wrote:

> It should be fairly simple to grep for underscores in /proc/mdstat using cron+{perl,grep,whatever} and send a mail if one is found. When a disk dies it is marked in /proc/mdstat like [UU_U].

Thanks, I think I will do that.

Now for another question: I have a hot-swappable SCSI backplane, so I simulated a dead disk by simply removing it (while there was no I/O activity). If I umount /dev/md0 and stop it (raidstop /dev/md0), I can use /proc/scsi/scsi and first remove the dead-disk entry and then add a new disk (echo "scsi [remove|add]-single-device 0 0 1 0" > /proc/scsi/scsi). Then I can raidhotadd the new disk to /dev/md0 and the world is nice. However, is there a way to do all this while raid1 is still active? So that users never have to notice something went wrong with the disks?

Thanks, D.

PS: I thought of adding the new disk with some other ID, but the backplane has fixed IDs so I cannot change them (disk0 = ID 0, disk1 = ID 1).
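The underscore check suggested in the quoted advice could look roughly like the Python sketch below. It is only an illustration: the function name `degraded_arrays` is made up, and the sample text is hypothetical (the md0 entry follows the format quoted elsewhere on this list, the degraded md1 entry is invented for the example).

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md devices whose status field contains '_'."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array entry starts with its name, e.g. "md0 : active ...".
        start = re.match(r"(md\d+)\s*:", line)
        if start:
            current = start.group(1)
        # Status field such as [UU_U]: only 'U' and '_' inside brackets.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

# Hypothetical sample; on a real machine read /proc/mdstat instead.
sample = """Personalities : [raid1] [raid5]
md0 : active raid5 sdd2[3] sdc2[2] sdb2[1] sda2[0]
      17333888 blocks level 5, 32k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]
"""
print(degraded_arrays(sample))
```

Run from cron with the real file contents and print only when the list is non-empty; cron normally mails any output of a job to its owner, which would cover the notification part.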