Re: dumping on RAIDframe
On 2023-09-20 12.41, Edgar Fuß wrote:
> Didn't RAIDframe recently (for certain values of "recently") gain the
> function to dump on a level 1 set? Should this work in -8?
> swapctl -z says "dump device is raid0b" (and raid0 is a level 1 RAID),
> but reboot 0x100 in DDB says
>     dumping to dev 18,1 (offset=1090767, size=8252262): dump device not ready
> What am I missing? The offset (as reported by disklabel) of raid0b
> within raid0 is 2097152 (1G), the partition size is 67108864 (32G), so
> maybe something's wrong with the offset and size values (whatever unit
> they are in) DDB reports.
Dumping to a RAID 1 set is supported in -8. But yes, none of those values seem to align with each other. 18,1 is 'raid0b' though, so that part seems correct. Could you please file a PR on this, with details as to the various disklabels? If this is broken, it should get addressed. Thanks. Later... Greg Oster
Re: Scheduling problem - need some help here
Hi Folks,
A reminder that the issue in this thread (and recorded in this PR: http://gnats.netbsd.org/55415 ) is still very much outstanding and will result in the vax port being nearly unusable in NetBSD 10+. There is a 'fix' mentioned in the PR, but it's not clear it's the right solution. Any/all thoughts on this are welcome. Thanks. Later... Greg Oster

On 2020-07-28 06:01, Anders Magnusson wrote:
> Hi,
> On 2020-07-28 13:28, Nick Hudson wrote:
>> On 28/06/2020 16:11, Anders Magnusson wrote:
>>> Hi, there is a problem (on vax) that I do not really understand.
>>> Greg Oster filed a PR on it (#55415).
>>> A while ago ad@ removed the "(ci)->ci_want_resched = 1;" from
>>> cpu_need_resched() in vax/include/cpu.h. And as I read the code (in
>>> kern_runq.c) it shouldn't be needed, ci_want_resched should be set
>>> already when the macro cpu_need_resched() is invoked.
>>> But; without setting cpu_need_resched=1 the vax performs really bad
>>> (as described in the PR).
> cpu_need_resched may have multiple values nowadays, setting it to 1
> will effectively clear out other flags, which is probably what makes
> it work.
>>> Anyone know what is going on here (and can explain it to me)?
>> I'm no expert here, but I think the expectation is that each platform
>> has its own method to signal "ast pending" and eventually call
>> userret (and preempt) when it's set - see setsoftast/aston.
> VAX has hardware ASTs, (AST is actually a VAX operation), which works
> so that if an AST is requested, then next time an REI to userspace is
> executed it will get an AST trap instead and then reschedule.
>> As I don't understand vax I don't know what
>>     197 #define cpu_signotify(l) mtpr(AST_OK,PR_ASTLVL)
>> is expected to do, but somehow it should result in userret() being
>> called.
> Yep, this is the way an AST is posted. Next time an REI is executed it
> will trap to the AST subroutine.
>> Other points are:
>> - vax cpu_need_resched doesn't seem to differentiate between locally
>>   running lwp and an lwp running on another cpu.
> Most likely. It has been 20 years since I wrote the MP code (and
> probably as long since anyone last tested it) and at that time LWPs
> didn't exist in NetBSD. I would be surprised if it still worked :-)
>> - I can't see how hardclock would result in userret being called, but
>>   like I said - I don't know vax.
> When it returns from hardclock (via REI) it directly traps to the AST
> handler instead if an AST is posted.
> http://src.illumos.org/source/xref/netbsd-src/sys/arch/vax/vax/intvec.S#311
>> I believe ci_want_resched is an MI variable for the scheduler which
>> is why its use in vax cpu_need_resched got removed.
> It shouldn't be needed, but obviously something breaks if it isn't
> added.
> What I think may have happened is that someone may have optimized
> something in the MI code that expects a different behaviour than the
> VAX hardware ASTs have. AFAIK VAX is (almost) the only port that has
> hardware ASTs.
> Thanks for at least looking at this.
> -- Ragge
Re: RAIDframe: what if a disc fails during copyback
On 10/30/20 1:54 PM, Edgar Fuß wrote:
>> it locks out all other non-copyback IO in order to finish the job!
> Oops!
>> Locking out all other IO is very poor... but if it's a small enough
>> RAID set you might be able to get away with the downtime for the
>> copyback...
> Certainly not.
>> You shouldn't need to reboot for this... the 'failing spared disk'
>> and 'reconstruct to previous second disk' should work fine without
>> reboot.
> I still don't get this. What I have is:
>     Components:
>         /dev/sd5a: spared
>         /dev/sd6a: optimal
>     Spares:
>         /dev/sd7a: used_spare
> So what am I supposed to do from here?
If you really want to get /dev/sd5a in use again, you can do:
    raidctl -f /dev/sd7a raidX
    raidctl -vR /dev/sd5a raidX
to do the fail of sd7a and rebuild of sd5a. But unless you have a strong need to use sd5a I would do nothing and leave things as-is. If you reboot at this point /dev/sd7a would show up as the first component and be marked as 'optimal'. Later... Greg Oster
Re: RAIDframe: what if a disc fails during copyback
On 10/30/20 4:25 AM, Edgar Fuß wrote:
> Thanks for the detailed answer.
>> it's still there, and it does work,
> That's reassuring to know.
>> but it's not at all performant or system-friendly.
> Just how bad is it?
It's been probably over a decade since I last tried it, but as I recall it locks out all other non-copyback IO in order to finish the job!
>> If you want the components labelled nicely, give the system a reboot
> Re-booting our file server is something I like to avoid.
You'll like copyback even less then -- I'd say once you're done reconstruct, just leave it, or reconstruct again to the 'repaired original' as I suggested...
>> and behaves very poorly.
> Depending on how poorly, I could probably live with it (the RAID in
> question is the small system one, not the large user data one).
Locking out all other IO is very poor... but if it's a small enough RAID set you might be able to get away with the downtime for the copyback...
>> In your case, what I'd do is just fail the spare, and initiate a
>> reconstruct to the original failed component. (You still have the
>> data on the spare if something goes bad with the original good
>> component.)
> Hm, I guess I would need to re-boot and intervene manually in that
> case. Just using the slow copyback looks preferable if it doesn't
> take more than a day.
You shouldn't need to reboot for this... the 'failing spared disk' and 'reconstruct to previous second disk' should work fine without reboot. (IIRC I've used a '3rd component' to make the primary/secondary components swap places.. just to test that, of course :) )
> Probably I need to test this on another machine before.
> I guess there's no way to initiate a reconstruction to a spare and
> failing the specified component only /after/ the reconstruction has
> completed, not before?
No, there's not, unfortunately. :( Later... Greg Oster
Re: Horrendous RAIDframe reconstruction performance
On 6/28/20 7:31 PM, John Klos wrote:
>>> Any thoughts about what's going on here? Is this because the drives
>>> are 512e drives?
>> Three weeks is a LONG time to reconstruct.
> So this turns out to be a failing drive. SMART doesn't show it's
> failing, but the one that's failing defaults to having the write
> cache off, and turning it on doesn't change the speed.
Yep, that will do it.
> I guess it's still usable, in a limited way - I can only write at 5
> or 6 MB/sec, but I can read at 200 MB/sec. Maybe I'll use it in an
> m68k Mac.
> Also, the autoconfigure works, but the forcing of root FS status
> didn't because I was testing it on a system that already had a
> RAIDframe with forced root. However, it still doesn't work on
> aarch64, but I'll recheck this after trying Jared's boot.cfg support.
> Thanks, Greg, Michael and Edgar. I learned something :)
> I am still curious about whether I was seeing both good read and
> write speeds because writes weren't going to both drives. I suppose I
> assumed that all writes would go to both drives even while
> reconstructing, but I suppose that only happens when the writes are
> inside of the area which has already been reconstructed, yes?
Correct. Writes will go to both on stripes where reconstruction has completed, but will only go to the 'good' device if reconstruction hasn't reached that stripe yet. Later... Greg Oster
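Greg's routing rule (mirror the write only where the rebuild has already passed) can be sketched as a tiny decision function. This is an illustrative sketch, not RAIDframe code: the names are invented, and a real rebuild tracks per-stripe state rather than the single linear "frontier" assumed here.

```c
/* Sketch: decide which components of a rebuilding RAID 1 set get a write.
 * 'recon_frontier' is an assumed high-water mark: all stripes below it
 * have already been reconstructed onto the replacement component. */

enum { WRITE_GOOD_ONLY = 1, WRITE_BOTH = 2 };

int
mirror_write_targets(long stripe, long recon_frontier)
{
	/* Already-reconstructed stripes must be kept in sync on both
	 * components; stripes the rebuild hasn't reached yet will be
	 * copied later anyway, so only the good component is written. */
	return stripe < recon_frontier ? WRITE_BOTH : WRITE_GOOD_ONLY;
}
```

This also explains John's observation: while most writes land below the frontier on only the good (fast) drive, the slow drive barely throttles foreground I/O.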
Re: Horrendous RAIDframe reconstruction performance
On 6/28/20 12:29 PM, Edgar Fuß wrote:
>> That's the reconstruction algorithm. It reads each stripe and if it
>> has a bad parity, the parity data gets rewritten.
> That's the way parity re-write works. I thought reconstruction worked
> differently. oster@?
Reconstruction does not do the "read", "compare", "write" operation like parity checking. It just does "read", "compute", "write". In the case of RAID 1, "compute" does nothing, and it's basically read from one component and write to the other. Later... Greg Oster
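The distinction can be made concrete with a toy RAID 1 pair of buffers. This is a sketch with invented helper names, not driver code: parity re-write is read/compare/conditionally-write, while reconstruction is read/(no-op compute)/write.

```c
#include <string.h>

/* Toy model: c0 and c1 are the same block on the two RAID 1 components. */

/* Parity check: read both, compare, and rewrite only on mismatch.
 * Returns 1 if the block had to be corrected, 0 if it was already good. */
int
parity_check_block(const unsigned char *c0, unsigned char *c1, size_t n)
{
	if (memcmp(c0, c1, n) == 0)
		return 0;		/* parity good; nothing written */
	memcpy(c1, c0, n);		/* mismatch: rewrite from master copy */
	return 1;
}

/* Reconstruction: no compare at all. For RAID 1 the "compute" step is
 * a no-op, so this is just copy-from-survivor-to-replacement. */
void
reconstruct_block(const unsigned char *good, unsigned char *spare, size_t n)
{
	memcpy(spare, good, n);
}
```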
Re: Asymmetric Disk I/O priorities in a RAIDframe RAID1?
On Mon, 28 May 2018 10:27:46 +0200 Hauke Fath <h...@spg.tu-darmstadt.de> wrote: > All, > > is there a way in RAIDframe to give a member of a RAID1 set read > priority, to e.g. favour reads from an SSD over rotating rust or > iscsi? > > The Linux mdadm(8) software raid config tool has the following option: > > > "[...] devices listed in a --build, --create, or --add command will > be flagged as 'write-mostly'. This is valid for RAID1 only and means > that the 'md' driver will avoid reading from these devices if at all > possible. This can be useful if mirroring over a slow link." > > > Can RAIDframe do anything similar? No... RAIDframe basically selects the component with the shortest queue. If queue lengths are equal, then it tries to pick the component where the data you want is 'nearest' to the last data item being fetched in the queue (see rf_dagutils.c:rf_SelectMirrorDiskIdle()). Later... Greg Oster
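The selection policy Greg describes (shortest queue first, then nearest data) can be sketched as follows. This is a simplified illustration with invented names; the real logic lives in rf_SelectMirrorDiskIdle() in rf_dagutils.c and considers more state than this.

```c
#include <stdlib.h>

/* Sketch of RAID 1 read-component selection: prefer the component with
 * the shorter outstanding-I/O queue; on a tie, prefer the component
 * whose most recently queued request is closest to the wanted sector. */
struct mirror_side {
	int	queue_len;	/* outstanding I/Os on this component */
	long	last_sector;	/* sector of the last request queued */
};

int	/* returns 0 to read side a, 1 to read side b */
select_mirror_side(const struct mirror_side *a, const struct mirror_side *b,
    long want_sector)
{
	if (a->queue_len != b->queue_len)
		return a->queue_len < b->queue_len ? 0 : 1;
	/* Tie on queue length: pick the shorter seek. */
	return labs(a->last_sector - want_sector) <=
	    labs(b->last_sector - want_sector) ? 0 : 1;
}
```

Note that with this policy a permanently slower component (SSD mirrored against iSCSI, say) only gets avoided indirectly, once its queue grows longer; there is no static "write-mostly" preference like mdadm's.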
Re: RAIDframe: passing component capabilities
On Fri, 31 Mar 2017 17:15:38 +0200 Edgar Fuß <e...@math.uni-bonn.de> wrote: > > given that RAIDframe (nor ccd, nor much else) has a general 'query > > the underlying layers to ask about this capability' function. > Is there a ``neither'' missing between ``that'' and ``RAIDframe''? Yes, sorry. > > (NetBSD 8 refusing to configure a RAID set because of this is not an > > option.) > Of course not. With my model, you would need to (re-)configure the > RAID set with ``all components have SCSI tagged queueing'' in order > for the RAID device to announce that capability. If one of the drives > is SATA, that configuration fails. If you later try to replace a SCSI > drive with a SATA one it fails like it fails when the replacement > drive has insufficient capacity. > It's just like with capacities: There's no need to announce the full > component capacity to the set (well, in fact, you don't use the full > drive capacity for the partition that constitutes the component), but > the component needs to have at least the announced capacity (in fact, > you need to be able to create a partition of sufficient size on the > drive). With capabilities, there would also be no need to announce > all the drive's capabilities, but a component (original or > replacement) needs to have at least the announced capabilities. That still requires RAIDframe then asking the components (or having them report to RAIDframe when they are attached) about whether or not they can do a certain thing, in order to decide whether or not the reconfiguration succeeds or fails. Later... Greg Oster
Re: RAIDframe: passing component capabilities (was: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL)
On Wed, 29 Mar 2017 12:02:23 +0200 Edgar Fuß <e...@math.uni-bonn.de> wrote: > EF> Some comments as I probably count as one of the larger WAPBL > EF> consumers (we have ~150 employee's Home and Mail on NFS on > EF> FFS2+WAPBL on RAIDframe on SAS): > JD> I've not changed the code in RF to pass the cache flags, so the > JD> patch doesn't actually enable FUA there. Mainly because disks > JD> come and go and I'm not aware of mechanism to make WAPBL aware of > JD> such changes. It > TLS> I ran into this issue with tls-maxphys and got so frustrated I > TLS> was actually considering simply panicing if a less-capable disk > TLS> were used to replace a more-capable one. > Oops. What did you do in the end? What does Mr. RAIDframe say? > > My (probably simplistic) idea would be to add a capabilities option > to the configuration file, and just as you can't add a disc with > insufficient capacity, you can't add one with insufficient > capabilities. Of course, greater capabilities are to be ignored just > as a larger capacity is. FUA/maxphys/anything 'disk'-specific is a bit of a pain to deal with, given that RAIDframe (nor ccd, nor much else) has a general 'query the underlying layers to ask about this capability' function. I see two major things here: 1) Whatever we do can't break existing setups. That is, if an underlying disk can't do FUA, then upper layers just need to Deal. (NetBSD 8 refusing to configure a RAID set because of this is not an option.) 2) Whatever query mechanism is used must be device agnostic at the higher levels. It needs to work for RAID, SAS, SCSI, SATA, HP-IB, etc, and leave it up to the lower levels to respond with the correct "Yes all devices I talk to (recursively) can do this" or "No, at least one of us can't do this" to the query. And then it's up to the drivers to actually pass the appropriate flags and do the Right Things. Later... Greg Oster
Re: RAIDframe: cooldown nnn out of range
On Thu, 1 Sep 2016 10:58:00 +0200 Edgar Fuß <e...@math.uni-bonn.de> wrote: > Upon reboot after a clean shutdown, I yesterday got > raid0: cooldown 1944635974 out of range > raidctl -s is happy. What does that message mean? Do I need to worry? Good question. It seems that the cooldown value is not in the 1-128 range... (the cooldown is used to regulate the rate at which paritymap updates get pushed onto the disk). I'm not sure why it is that high, or how it got that high :-( I'm also not sure if the kernel got that value from the paritymap on the disk or where... though I think if it read a '0' for the cooldown from the on-disk parity map that it would then use 'random garbage' from the newly initialized parity map if coming through rf_paritymap_attach() :( Fortunately, when the code detects these 'out of range' things, it grabs a set of correct values, and uses those going forward... Later... Greg Oster
Re: IIs factible to implement full writes of strips to raid using NVRAM memory in LFS?
On Sun, 21 Aug 2016 10:20:07 -0400 Thor Lancelot Simon <t...@panix.com> wrote:
> On Fri, Aug 19, 2016 at 10:01:43PM +0200, Jose Luis Rodriguez Garcia wrote:
> > On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon
> > >
> > > Perhaps, but I bet it'd be easier to implement a generic
> > > pseudodisk device that used NVRAM (fast SSD, etc -- just another
> > > disk device really) to buffer *all* writes to a given size and
> > > feed them out in that-size chunks. Or to add support for that to
> > > RAIDframe.
> > >
> > > For bonus points, do what the better "hardware" RAID cards do,
> > > and if the inbound writes are already chunk-size or larger,
> > > bypass them around the buffering (it implies extra copies and
> > > waits after all).
> > >
> > > That would help LFS and much more. And you can do it without
> > > having to touch the LFS code.
> >
> > Won't it be easier to add a layer that does these tasks in
> > the LFS code. It has the disadvantage that it would be used only
> > by
>
> I am guessing not. The LFS code is very large and complex -- much
> more so than it needs to be. It is many times the size of the
> original Sprite LFS code, which, frankly, worked better in almost all
> ways. It represents (to me) a failed experiment at code and
> datastructure sharing with FFS (it is also worse, and larger, because
> Sprite's buffer cache and driver APIs were simpler than ours and
> better suited to LFS' needs).
>
> It is so large and so complex that truly screamingly funny bugs like
> writing the blocks of a segment out in backwards order went
> undetected for long periods of time!
>
> It might be possible to build something like this inside RAIDframe or
> LVM but I think it would share little code with any other component
> of those subsystems. I would suggest building it as a standalone
> driver which takes a "data disk" and "cache disk" below and provides
> a "cached disk" above. I actually think relatively little code
> should be required, and avoiding interaction with other existing
> filesystems or pseudodisks should keep it quite a bit simpler and
> cleaner.
Building this as a layer that allows arbitrary devices as either the 'main store' or the 'cache' would work well, and allow for all sorts of flexibility. What I don't know is how you'd glue that in to be a device usable for /. The RAIDframe code in that regard is already a nightmare! Perhaps something along the lines of the dk(4) driver, where one could either use it as a stand-alone device, or hook into it to use the caching features.. (e.g. 'register' the cache when raid0 is configured, and then use/update the cache on reads/writes/etc to raid0) Obviously this needs to be fleshed out a significant amount... Later... Greg Oster
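Thor's "bonus points" idea, bypassing already-large writes around the NVRAM buffer, reduces to a simple predicate. A sketch, not anything that exists in the tree; the alignment requirement here is my added assumption (the message only mentions size), since an unaligned large write would otherwise straddle buffered chunks.

```c
#include <stdbool.h>

/* Sketch of a write-path decision for a hypothetical "cached disk"
 * pseudodevice: writes that are already at least one chunk long (and,
 * assumed here, chunk-aligned) skip the NVRAM buffer, avoiding an
 * extra copy and the wait for coalescing; everything else is staged
 * in the cache and written out in chunk-sized units later. */
bool
write_should_bypass_cache(long offset, long len, long chunk_size)
{
	return len >= chunk_size &&
	    offset % chunk_size == 0 &&
	    len % chunk_size == 0;
}
```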
Re: IIs factible to implement full writes of strips to raid using NVRAM memory in LFS?
On Sat, 20 Aug 2016 03:20:51 +0200 Jose Luis Rodriguez Garcia <joseyl...@gmail.com> wrote: > On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon <t...@panix.com> > wrote: > > On Thu, Aug 18, 2016 at 06:23:32PM +, Eduardo Horvath wrote: > > chunks. Or to add support for that to RAIDframe. > ... > > That would help LFS and much more. And you can do it without having > > to touch the LFS code. > > > I have been thinking about this, and I think that this is the best > option, although I like more integrate it with LFS as I said in my > previous mail, adding to RAIDframe it can be used/tested by more > people and it is possible that more developers/testers are involved. > Integrating it inside of LFS surely will be a one man project, that it > is very possible that it isn't finished. > Other bonus of integrating it with RAIDframe, it is can resolve the > problems of write hole of raid: > http://www.raid-recovery-guide.com/raid5-write-hole.aspx > I don't know if NetBSD resolves the problem of write hole (it has > penalty in performance to resolve it). RAIDframe maintains a 'Parity status:', which indicates whether or not all the parity is up-to-date. Jed Davis did the GSoC work to add the 'parity map' stuff which significantly reduces the amount of effort needed to ensure the parity is up-to-date after a crash. (Basically RAIDframe checks (and corrects) any parity blocks in any modified regions of the RAID set.) Later... Greg Oster
Re: WAPBL not locking enough?
On Tue, 3 May 2016 10:02:44 + co...@sdf.org wrote:
> Hi,
>
> I fear that WAPBL should be locking wl_mtx before running
> wapbl_transaction_len (in wapbl_flush)
>
> I suspect so because it performs a similar lock before
> looking at wl_bufcount in sys/kern/vfs_wapbl.c:1455-1460.
>
> If true, how about this diff?
>
> diff --git a/sys/kern/vfs_wapbl.c b/sys/kern/vfs_wapbl.c
> index 378b0ba..f765ce3 100644
> --- a/sys/kern/vfs_wapbl.c
> +++ b/sys/kern/vfs_wapbl.c
> @@ -1971,6 +1971,7 @@ wapbl_transaction_len(struct wapbl *wl)
>  	size_t len;
>  	int bph;
>
> +	mutex_enter(&wl->wl_mtx);
>  	/* Calculate number of blocks described in a blocklist header
>  	 */
>  	bph = (blocklen - offsetof(struct wapbl_wc_blocklist, wc_blocks)) /
>  	    sizeof(((struct wapbl_wc_blocklist *)0)->wc_blocks[0]);
> @@ -1981,6 +1982,7 @@ wapbl_transaction_len(struct wapbl *wl)
>  	len += howmany(wl->wl_bufcount, bph) * blocklen;
>  	len += howmany(wl->wl_dealloccnt, bph) * blocklen;
>  	len += wapbl_transaction_inodes_len(wl);
> +	mutex_exit(&wl->wl_mtx);
>
>  	return len;
>  }
It's been a while since I looked at this, but you might want to see if this: rw_enter(&wl->wl_rwlock, RW_WRITER); takes care of the locking that you are concerned about... There certainly does seem to be mutex_enter/mutex_exits missing on the DIAGNOSTIC code around line 981. In any event you'll want more than just my comments, and a lot more than 'a kernel with these changes boots' for testing :) Later... Greg Oster
Re: RAIDFrame changes
On Sun, 27 Dec 2015 11:49:36 + (UTC) mlel...@serpens.de (Michael van Elst) wrote:
> Hi,
>
> now that RAIDFrame is usable as a module I have prepared
> a patch to refactor the driver to use the common dksubr code.
>
> The patch can be found at
>
> http://ftp.netbsd.org/pub/NetBSD/misc/mlelstv/raidframe.diff
>
> and it includes a few other fixes too:
>
> - finding components now does proper kernel locking by using
>   bdev_strategy, previously it would trigger an assertion in the sd
>   driver (and maybe others).
>
> - finding components now runs in two passes to prefer wedges over
>   raw partitions when the wedge starts at offset 0.
>
> - defer RAIDFRAME_SHUTDOWN (raidctl -u operation) to raidclose.
>   - side effect is that 'raidctl -u' succeeds even for units that
>     haven't been configured. The previous behaviour was to keep
>     the embryonal unit and fail ioctl and close operations which
>     prevents unloading of the module.
>
> - moved raidput again to raid_detach() because raid_detach_unlocked()
>   is only called for initialized units.
>
> - use common dksubr code
>   - the fake device softc now stores a pointer back to the real softc
>   - no longer uses a private bufq
>   - private disklabel and disk_busy code is gone.
>
> - some extra messages
>
> I will commit this in the next days if there is no objection.
Thanks for working on this, but "next days" doesn't provide any real time during the "Holiday Season". :( Have these changes, especially those to raiddump(), been extensively tested? RF_PROTECTED_SECTORS used to be required in raiddump() so as not to eat the component label. raid_dumpblocks() now contains what raiddump() used to, but raiddump() has to be able to select which of the underlying components is still alive, and not to attempt to dump to anything other than a RAID 1 device. I'm not seeing the mechanisms by which those requirements are still met. Methinks these changes should have been explicitly reviewed by the RAIDframe maintainer *before* they were committed. :( Later... Greg Oster
Re: RAIDframe hot spares
On Mon, 21 Dec 2015 16:49:12 +0100 Edgar Fuß <e...@math.uni-bonn.de> wrote:
> I have another question on RAIDframe, this time on hot spares (which
> I never used before).
> I was at A and B "optimal" and C "spare", and B failed. I would have
> expected RAIDframe to immediately start a reconstruction on C, but
> that didn't happen; I had to raidctl -F B first.
>
> In case it matters, that setup was reached through the following
> events:
>     A and B "optimal"
>     B failed
>     A "optimal", B "failed"
>     raidctl -a C
>     A "optimal", B "failed", C "spare"
>     raidctl -F B
>     A "optimal", B "spared", C "used_spare"
>     raidctl -B
>     A and B "optimal", C "spare"
>
> Isn't the point of a *hot* spare to immediately reconstruct on it if
> necessary?
The original code didn't support immediate reconstruction, and no-one has bothered to add it since... At this point you can approximate the immediate reconstruction with a cron job that checks the status of the RAID set, and then executes the appropriate rebuild. Later... Greg Oster
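Such a cron job would mostly be shell glue around raidctl(8), but the core of it, spotting a component marked "failed" in the status output so the job can then run `raidctl -F <component> raidN` against the configured spare, can be sketched in C. The status format assumed below is illustrative; a real watchdog should be tested against the actual `raidctl -s` output on the machine in question.

```c
#include <string.h>

/* Sketch: scan raidctl -s style status text for a component marked
 * "failed" and copy its name (e.g. "/dev/sd5a") into 'out'.
 * Returns 1 if a failed component was found, 0 otherwise. */
int
find_failed_component(const char *status, char *out, size_t outlen)
{
	const char *p = strstr(status, ": failed");
	const char *start;
	size_t n;

	if (p == NULL || outlen == 0)
		return 0;
	/* Walk back from the colon to the start of the component name. */
	for (start = p; start > status &&
	    start[-1] != ' ' && start[-1] != '\n'; start--)
		continue;
	n = (size_t)(p - start);
	if (n >= outlen)
		n = outlen - 1;
	memcpy(out, start, n);
	out[n] = '\0';
	return 1;
}
```

The surrounding cron job would capture `raidctl -s raid0`, call this, and on a hit invoke `raidctl -F` with the found component (ideally with some guard against re-triggering while a reconstruction is already in progress).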
Re: raidctl -B syntax
On Mon, 21 Dec 2015 00:25:02 +1100 matthew green <m...@eterna.com.au> wrote:
>> I am confident that an I/O error during reconstruction will result
>> in the reconstruction failing.
> it does for RAID1. i've not used RAID5 for years.
It does for RAID 5 as well.
> i had a disk failure, followed by the other side giving read
> errors while reconstructing. my rebuild failed sort of
> appropriately (it would be nice if the re-rebuild would know
> where to restart from.)
>
> (i managed to recover the failed blocks from the 2nd disk
> from the 1st one. at least they managed to fail in different
> regions of the disk, obviating the need for more annoying
> methods of restore :-)
About a year ago I had to build a custom kernel to ignore IO errors on the source disk... A RAID 1 set had failed the second component, and it turned out the first component also had read errors. Fortunately, the blocks with errors were not associated with any filesystem bits (a full dump, for example, worked fine) and so I was able to recover everything without having to go to backups. (i.e. custom kernel recovered bits to new second component, and then the first component got replaced as well!) Later... Greg Oster
Re: raidctl -B syntax
On Sun, 20 Dec 2015 12:38:43 +0100 Edgar Fuß <e...@math.uni-bonn.de> wrote: > I'm unsure what the "dev" argument to raidctl -B is: the spare or the > original? The 'dev' is the RAID device > Suppose I have a level 1 RAID with components A and B; B failed. I > add C as a hot spare (raidctl -a C) and reconstruct on it (raidctl -F > B) now I have A "optimal", B "spared" and C "used_spare". > Then I find that B's failure must have been a glitch; do I raidctl -B > B or raidctl -B C? So one note about the '-B' option IIRC, it actually blocks IO to the RAID set while the copyback is happening. I think the last time I looked at this I was of the opinion that the copyback bits should just be deprecated, as most people never use it, and it actually causes serious performance issues (i.e. no IO can be done to the set while copyback is in progress!). But it's been a while... > I suppose that after the copyback, I'll have A and B "optimal" and C > "spare", right? I believe that is correct... it's been a long while since I used copyback... > What if, during the reconstruction, I get an I/O error on B. I hope > the reconstruction will simply stop and leave me with A "optimal", B > "spared" and C "used_spare", right? Errors encountered during reconstruction are supposed to be gracefully handled. That is, if you have A and B in the RAID set, and B fails, and you encounter an error when reconstructing from A to C, then A will be left as optimal, B will still be failed, and C will still be a spare. C will not get marked as 'used_spare' (and be eligible to auto-configure as the second component) until the reconstruction is 100% successful. Later... Greg Oster
Re: RAIDframe: stripe unit per reconstruction unit
On Fri, 18 Oct 2013 15:50:55 +0200 Edgar Fuß <e...@math.uni-bonn.de> wrote:
> The man page says:
>     The stripe units per parity unit and stripe units per
>     reconstruction unit are normally each set to 1. While certain
>     values above 1 are permitted, a discussion of valid values and
>     the consequences of using anything other than 1 are outside the
>     scope of this document.
> I noticed that reconstruction seems to read/write in units of one
> stripe unit, which in my case (being 4k) seems rather inefficient.
> Is it possible to accelerate reconstruction by using a figure larger
> than 1 here?
If you go here: http://www.pdl.cmu.edu/RAIDframe/ and pull down the RAIDframe Manual and go to page 73 of that manual you will find:
    When specifying SUsPerRU, set the number to 1 unless you are
    specifically implementing reconstruction under parity declustering;
    if so, you should read through the reconstruction code first.
The answer might be yes, but I don't know what the other implications are. I'd only recommend changing this value on a RAID set you don't care about, with data you don't care about. I'd also recommend reading through the reconstruction code, looking for SUsPerRU... Later... Greg Oster
Re: high load, no bottleneck
On Thu, 19 Sep 2013 10:29:55 -0400 chris...@zoulas.com (Christos Zoulas) wrote:
> On Sep 19, 8:13am, t...@panix.com (Thor Lancelot Simon) wrote:
> -- Subject: Re: high load, no bottleneck
>
> | On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote:
> | Emmanuel Dreyfus <m...@netbsd.org> wrote:
> |
> | Thank you for saving my day. But now what happens?
> | I note the SATA disks are in IDE emulation mode, and not AHCI. This
> | is something I need to try changing:
> |
> | Switched to AHCI. Here is below how hard disks are discovered (the
> | relevant raid is RAID1 on wd0 and wd1)
> |
> | In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads,
> | on both 6.0 and -current. I assume there must be something bad with
> | WAPBL/RAIDframe
> |
> | There is at least one thing: RAIDframe doesn't allow enough
> | simultaneously pending transactions, so everything *really* backs
> | up behind the cache flush.
> |
> | Fixing that would require allowing RAIDframe to eat more RAM. Last
> | time I proposed that, I got a rather negative response here.
>
> sysctl to the rescue.
The appropriate 'bit to twiddle' is likely raidPtr->openings. Increasing the value can be done while holding raidPtr->mutex. Decreasing the value can also be done while holding raidPtr->mutex, but will need some care if attempting to decrease it by more than the number of outstanding IOs. I'm happy to review any changes to this, but won't have time to code it myself, unfortunately :( Later... Greg Oster
Re: RAIDOUTSTANDING (was: high load, no bottleneck)
On Thu, 19 Sep 2013 20:14:33 +0200 Edgar Fuß <e...@math.uni-bonn.de> wrote:
>> options RAIDOUTSTANDING=40 #try and enhance raid performance.
> Is there any downside to this other than memory usage? How much does
> one unit cost?
This is from the comment in src/sys/dev/raidframe/rf_netbsdkintf.c:

/*
 * Allow RAIDOUTSTANDING number of simultaneous IO's to this RAID device.
 * Be aware that large numbers can allow the driver to consume a lot of
 * kernel memory, especially on writes, and in degraded mode reads.
 *
 * For example: with a stripe width of 64 blocks (32k) and 5 disks,
 * a single 64K write will typically require 64K for the old data,
 * 64K for the old parity, and 64K for the new parity, for a total
 * of 192K (if the parity buffer is not re-used immediately).
 * Even if it is used immediately, that's still 128K, which when multiplied
 * by say 10 requests, is 1280K, *on top* of the 640K of incoming data.
 *
 * Now in degraded mode, for example, a 64K read on the above setup may
 * require data reconstruction, which will require *all* of the 4 remaining
 * disks to participate -- 4 * 32K/disk == 128K again.
 */

The amount of memory used is actually more than this, but the buffers are the biggest consumer, and the easiest way to get a ball-park estimate. So if you have a RAID 5 set with 12 disks and 32K/disk for the stripe width, then for a *single* degraded write of 32K you'd need: 11*32K (for reads) + 32K (for the write) + 32K (for the parity write) which is 416K. If you want to allow 40 simultaneous requests, then you need ~16MB of kernel memory. Maybe not a big deal on a 32GB machine, but 40 probably isn't a good default for a machine with 128MB RAM. Also: remember that this is just one RAID set, and each additional RAID set on the same machine could use that much memory too. Also2: this really only matters the most in degraded operation. I can't think, off-hand, of any downsides other than memory usage.. Later... Greg Oster
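Greg's arithmetic (416K per degraded write on a 12-disk set with 32K stripe units, ~16MB for 40 of them) generalizes to a quick estimator. A back-of-envelope sketch only; as the message notes, real RAIDframe accounting includes more than these buffers, and the function name is invented.

```c
/* Sketch: worst-case buffer memory for RAID 5 degraded small writes.
 * One degraded write of one stripe unit needs reads of the other
 * (ndisks - 1) data/parity units, plus buffers for the data write and
 * the parity write. Sizes in bytes. */
long
degraded_write_buffers(int ndisks, long stripe_unit, int outstanding)
{
	long per_io = (long)(ndisks - 1) * stripe_unit + 2 * stripe_unit;

	return per_io * outstanding;
}
```

For the 12-disk, 32K example this gives 416K per request and ~16MB for RAIDOUTSTANDING=40, matching the figures above, which is why 40 is fine on a 32GB box but a bad default on a 128MB one.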
Re: high load, no bottleneck
On Thu, 19 Sep 2013 20:53:30 +0200 m...@netbsd.org (Emmanuel Dreyfus) wrote:
> Greg Oster <os...@cs.usask.ca> wrote:
>> It's probably easier to do by raidctl right now. I'm not opposed to
>> having RAIDframe grow a sysctl interface as well if folks think that
>> makes sense. The 'openings' value is currently set on a per-RAID
>> basis, so a sysctl would need to be able to handle individual RAID
>> sets as well as overall configuration parameters.
> IMO raidctl makes more sense here, as it is the place where one is
> looking for RAID stuff. While I am there: fsck takes an infinite time
> while RAIDframe is rebuilding parity. I need to renice the raidctl
> process that does it in order to complete fsck. Would raising the
> outstanding write value also help here?
Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... Later... Greg Oster
Re: high load, no bottleneck
On Thu, 19 Sep 2013 11:26:21 -0700 (PDT) Paul Goyette <p...@whooppee.com> wrote:
> On Thu, 19 Sep 2013, Brian Buhrow wrote:
>> The line I include in my config files is:
>>     options RAIDOUTSTANDING=40 #try and enhance raid performance.
> Is this likely to have any impact on a system with multiple raid-1
> mirrors?
Yes, it would, provided you have more than 6 concurrent IOs to each RAID set.. Later... Greg Oster
Re: high load, no bottleneck
On Fri, 20 Sep 2013 01:37:20 +0200 m...@netbsd.org (Emmanuel Dreyfus) wrote: Greg Oster os...@cs.usask.ca wrote: Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... You mean raidctl -M yes, right? Correct. Later... Greg Oster
Re: Unexpected RAIDframe behavior
On Tue, 3 Sep 2013 16:59:28 + (UTC) John Klos j...@ziaspace.com wrote: Parity Re-write is 79% complete. OK, so this is really more about how parity checking works than anything else (I guess.) For RAID 1, it reads both disks and compares them, and if one fails it will write the master data. (More generally, it reads all disks and if anything fails the parity check it writes corrected parity back to it.) Ah, so a reboot caused RAIDframe to switch from reconstruction to parity creation. Hmm.. it shouldn't be 'switching'. If reconstruction finished but you didn't have parity logging turned on, then you'd see the behaviour you describe. If reconstruction didn't finish, then you should see one of the components still marked as 'failed'. That explains what was going on. However, it makes me wonder if the state of the RAID is not properly being maintained through reboot. I didn't really need all that non-zero data. If the state of the RAID is not being maintained, then that's a bug, and needs to be fixed right away. To my knowledge, however, it does maintain things correctly. Feel free to file a PR with the specifics of any failures in this regard... Later... Greg Oster
Re: Where is the component queue depth actually used in the raidframe system?
On Thu, 14 Mar 2013 10:32:26 -0400 Thor Lancelot Simon t...@panix.com wrote: On Wed, Mar 13, 2013 at 09:36:07PM -0400, Thor Lancelot Simon wrote: On Wed, Mar 13, 2013 at 03:32:02PM -0700, Brian Buhrow wrote: hello. What I'm seeing is that the underlying disks under both a raid1 set and a raid5 set are not seeing any more than 8 active requests at once across the entire bus of disks. This leaves a lot of disk bandwidth unused, not to mention less than stellar disk performance. I see that RAIDOUTSTANDING is defined as 6 if not otherwise defined, and this suggests that this is the limiting factor, rather than the actual number of requests allowed to be sent to a component's queue. It should be the sum of the number of openings on the underlying components, divided by the number of data disks in the set. Well, roughly. Getting it just right is a little harder than that, but I think it's obvious how. Actually, I think the simplest correct answer is that it should be the minimum number of openings presented by any individual underlying component. I cannot see any good reason why it should be either more or less than that value. Consider the case when a read spans two stripes... Unfortunately, each of those reads will be done independently, requiring two IOs for a given disk, even though there is only one request. The reason '6' was picked back in the day was that it seemed to offer reasonable performance while not requiring a huge amount of memory to be reserved for the kernel. And part of the issue there was that RAIDframe had no way to stop new requests from coming in and consuming all kernel resources :( '6' is probably a reasonable hack for older machines, but if we can come up with something self-tuning I'm all for it... (Having this self-tuning is going to be even more critical when MAXPHYS gets sent to the bitbucket and the amount of memory needed for a given IO increases...) Later... Greg Oster
Re: Where is the component queue depth actually used in the raidframe system?
On Wed, 13 Mar 2013 10:47:18 -0700 Brian Buhrow buh...@nfbcal.org wrote: Hello greg. In looking into a performance issue I'm having with some raid systems relative to their underlying disks, I became interested in seeing how the component queue depth affects the business of underlying disks. To my surprise, it looks as if the queue depth as defined in the raidx.conf file is never used. Is that really true? The chain looks like: raidctl(8) sets raidPtr->maxQueueDepth = cfgPtr->maxOutstandingDiskReqs; Then, in rf_netbsdkintf.c, we have: d_cfg->maxqdepth = raidPtr->maxQueueDepth; But I don't see where maxqdepth is ever used again. Heh... Congrats! I think you just found more 'leftovers' from when the simulator bits were removed (before RAIDframe was imported to NetBSD). In the simulation code maxQueueDepth was also assigned to threadsPerDisk which was used to fire off multiple requests to the simulated disk. In the current code you are correct that maxqdepth is not used in any real meaningful way... Unfortunately, we can't just rip it out without worrying about backward kernel-raidctl compatibility :( When you set maxOutstandingDiskReqs you're really setting maxOutstanding, and how that influences performance would be interesting to find out :) Just be aware that the more requests you allow to be outstanding the more kernel memory you'll need to have available... Later... Greg Oster
Re: Raidframe and disk strategy
On Wed, 17 Oct 2012 11:26:12 -0700 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: On Oct 17, 12:03pm, Edgar Fuß wrote: } Subject: Re: Raidframe and disk strategy } Two more questions on the subject: } } sets the strategy of raidframe to the default strategy for the system, } rather than fcfs. } How does that play with the usual ``fifo 100'' in the ``START queue'' section? } } you can easily test various disk sorting strategies } Where can I find a description/discussion of the different strategies? } } Probably your patch is the cause for our occasional NFS hangs having decreased } from thirty seconds to a few seconds. -- End of excerpt from Edgar Fuß

In answering your question, I find I have a couple for Greg. Depending on his comments to my notes below, you may have a couple of knobs you can tweak for performance to the raidframe system to get even more efficiency out of the system. It looks like you can select a number of disk queuing strategies within raidframe itself, something I didn't realize. There seem to be 5 choices: fifo, cvscan, sstf, scan and cscan. Fifo is the one we've been using for years, but the others appear to be compiled into the system. Unless my understanding is completely wrong, cscan is the algorithm which most closely aligns with the priocscan buffer queue strategy and scan matches the traditional BSD disksort buffer queue strategy. To set a new disk queueing strategy for a given raid set, try the following: See more below about the next 6 steps...

1. If the raid set is configured by startup scripts at boot time, then edit your raid.conf file for that raid set and change the word fifo in the start queue section to one of the other choices listed above.
2. Unconfigure the raid and reconfigure it.
3. If your raid set is configured automatically by the kernel, construct a raid.conf file that matches the characteristics of your raid set, if you don't already have one, and change the word fifo in the start queue section to one of the choices listed above.
4. Turn off autoconfigure with raidctl -A no on the indicated raid set.
5. Unconfigure and reconfigure the raid set as you did in step 2 above.
6. Turn on autoconfigure again with raidctl -A yes or raidctl -A root depending on whether your raid set is a root filesystem or just a raid set on the system.

For greg: Have you played with any of these disk queueing strategies? Not in any meaningful way, and not in years. Do you know if they work or, more importantly, if they contain huge disk eating bugs? My understanding is that they all work, though, again, I've not done anything resembling exhaustive testing. Have you done any bench marking to compare their relative performance? No. The other unfortunate thing is that your 6 steps above won't be permanent -- the component labels don't have a record of what the queuing strategy currently is. We probably need another 'raidctl' option to set raidPtr->qType to whatever type is desired (if you'd like to write this, you basically need to make sure all IO is quiesced, do the switch, and then allow IO to happen again. There is existing code that does this start/stop, IIRC.) I'm happy to look over any code additions folks might propose :) Later... Greg Oster
Re: Raidframe and disk strategy
On Wed, 17 Oct 2012 12:50:44 -0700 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: hello. Thanks for the reply. I was just looking at the raidframe code to determine if the queueing strategy was permanent across reboots. Also, I apologize Edgar, cscan is not a valid queueing strategy name. The four choices are: fifo, cvscan, sstf and scan. As to the possibility of implementing persistent disk queuing strategies, I was thinking that the easiest way might be just to add an option to store the letter of the queuing strategy in the component label, and then having the autoconfigure code set the queueing strategy when it runs during the next reboot. You can already achieve a change to the queuing strategy on a raid set by unconfiguring it and reconfiguring it. (Yes, that means you have to reboot for root raid sets or raid sets that can't be unmounted at run time, but that seems safer than trying to do this change on the fly, especially since it isn't something one would do often under most circumstances.) Unconfiguring/reconfiguring is a pain... I think what we really need is an ioctl or similar to set the strategy (it can just pass in a config struct with only the strategy field set -- that mechanism is already there). The ioctl code can then do the quiesce/change/go song'n'dance and update component labels accordingly. New code in the component label code would read in the strategy (from a newly carved out field in the component labels) and away we go. There's lots of free space remaining in the component labels, and this change would be backwards compatible with older kernels. New kernels seeing a '0' in that spot would simply pick 'fifo' as before. One question is whether to use 0,1,2,3 to encode the strategy, or to use letters, or whatever. But that's just a minor implementation detail. I've got a raid set that I use for backups that I've altered to use the scan strategy. Thoughts? -Brian Later... Greg Oster
Re: Raidframe and disk strategy
On Thu, 9 Aug 2012 09:45:18 -0700 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: Hello. I'm not going to claim to be an expert at anything here, but grepping through the raidframe sources doesn't show me anything that says it sorts requests from the upper layers, nor that it sorts requests to the underlying disks except in the case of when it's doing reconstruction or paritymap maintenance. That said, and while I don't have any hard numbers yet, it looks like the patch I posted last night yields an instant 16% improvement on throughput on one of the backup servers I run. Could someone else try the patch and see if they see similar gains? Below is a shell snippet I use in /etc/rc.local to set the strategy for all the attached raid sets on a system, in case that's useful for folks. So there are a number of places sorting can occur here... 1) In the bufq_* strategy bits in RAIDframe. 2) In the rf_diskqueue.c code, where we have fifo, cvscan, shortest seek time first, and two elevator algorithms as possible candidates for use in disk queueing. 3) In the underlying component strategies (which may be another RAIDframe device, a CCD, vnd, sd, wd, or whatever). The sorting stuff I was referring to in my previous email (where I said it could be handled in the config file) was 2). I wasn't even thinking about 1). Now if you're seeing a 16% performance boost, it's likely worth adding the code you suggest -- it's not much code for a nice bump for folks who might have a similar IO mix/setup. Later... Greg Oster
Re: Raidframe and disk strategy
On Wed, 8 Aug 2012 15:07:24 -0700 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: hello. I've been looking at some disk performance issues lately and trying to figure out if there's anything I can do to make it better. (This is under NetBSD/i386 5.1_stable with sources from July 18 2012). During the course of my investigations, I discovered the raidframe driver does not implement the DIOCSSTRATEGY or DIOCGSTRATEGY ioctls. Even more interestingly, I notice it's set to use the fcfs strategy, and has been doing so since at least NetBSD-2.0. The ccd(4) driver does the same thing. Presumably, the underlying disks can use whatever strategy they use for handling queued data, but I'm wondering if there is a particular reason the fcfs strategy was chosen for the raidframe driver as opposed to letting the system administrator pick the strategy? My particular environment has a lot of unrelated reads and writes going on simultaneously, and it occurs to me that using a different disk strategy than fcfs might mitigate some of these issues. Were benchmarks done to pick the best strategy for raidframe and/or ccd or is there some other reason I'm missing that implementing a buffer queue strategy on top of these devices is a bad idea? -thanks -Brian The FIFO strategy was the one that seemed to be the best tested at the time, and since I didn't want to introduce any more variables, that's the one I went with as a default. Unfortunately, the RAID labels currently don't specify the queuing strategy, so the 'autoconfig' sets won't do anything other than FIFO at this time. Non-autoconfig sets certainly support any of the other queuing methods in the config file, and that could certainly be used for testing/benchmarking. If you'd like to write up support for alternate queuing methods being specified by the component labels let me know -- it'd be one less thing on my TODO list :) Later... Greg Oster
Re: why does raidframe retry I/O 5 times
On Mon, 25 Jun 2012 11:11:37 +0200 Manuel Bouyer bou...@antioche.eu.org wrote: Hello, why does raidframe retry failed I/O 5 times ? the wd(4) driver already retries 5 times, which means a bad block is retried 25 times. While drivers are retrying, the system is stalled ... As I recall, for RAIDframe 5 was just chosen as an arbitrary number. Do you know if SCSI or other disk drivers also retry 5 times? Later... Greg Oster
Re: RAIDframe performance vs. stripe size: Test Results
On Tue, 12 Jun 2012 17:02:21 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: Any comments on the results? Really no comments? Parity re-build: 328 128 6min ~15min 5min My questions: Why does parity re-build take longer with smaller stripes? Is it really done one stripe at a time? So a parity rebuild does so by reading all the data and the existing parity, computing the new parity, and then comparing the existing parity with the new parity. If they match, it's on to the next stripe. If they differ, the new parity is written out. No, this doesn't happen one stripe at a time -- it's much more parallel than that. What we don't know here is what the state of the array was when you started the rebuild. That is, was the parity 'mostly correct' beforehand? (i.e. saving having to do a lot of the writes). If it really was doing one write per stripe, then it can still be much slower with the smaller stripes -- there's way more overhead, and the amount of work getting done with each IO is much smaller. Why does enabling quotas slow down extraction so much? The test data should be ordered by uid in the tar, so quota should be easily cachable. Why does the negative impact of atime updates decrease at larger block/stripe sizes? No answers to my questions either? I don't know the answers to these questions. Later... Greg Oster
Re: RAIDframe parity rebuild (was: RAIDframe performance vs. stripe size: Test Results)
On Tue, 12 Jun 2012 18:34:52 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: So a parity rebuild does so by reading all the data and the existing parity, computing the new parity, and then comparing the existing parity with the new parity. If they match, it's on to the next stripe. If they differ, the new parity is written out. Oops. What's the point of not simply writing out the computed parity? Writes are typically slower, so not having to do them means the rebuild goes faster... What we don't know here is what the state of the array was when you started the rebuild. That is, was the parity 'mostly correct' beforehand? (i.e. saving having to do a lot of the writes). I don't know either. If the discs contained a clean RAID set with a different SPSU, was the parity correct? For a basic RAID 5 set yes, it would be. The exception might be at the very end of the RAID set where a smaller stripe size might allow you to use a bit more of each component. Later... Greg Oster
Re: RAIDframe parity rebuild (was: RAIDframe performance vs. stripe size: Test Results)
On Tue, 12 Jun 2012 13:20:27 -0400 Thor Lancelot Simon t...@panix.com wrote: On Tue, Jun 12, 2012 at 10:40:47AM -0600, Greg Oster wrote: On Tue, 12 Jun 2012 18:34:52 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: So a parity rebuild does so by reading all the data and the existing parity, computing the new parity, and then comparing the existing parity with the new parity. If they match, it's on to the next stripe. If they differ, the new parity is written out. Oops. What's the point of not simply writing out the computed parity? Writes are typically slower, so not having to do them means the rebuild goes faster... Are writes to the underlying disk really typically slower? It's easy to see why writes to the RAID set itself would be slower, but sequential disk write throughput is usually pretty darned close to -- if not better than -- read throughput these days, isn't it? It's been a while since I've checked, and the current generation of 2TB and 3TB disks may be significantly better than the 1TB disks... I also don't know what the disks are doing with their write-back/write-through caches these days either. If you don't know the set's blank, I guess you do have to read the existing data. Maybe that limits how much win can really be had here. That, and someone would have to change the code :) Later... Greg Oster
Re: Strange problem with raidframe under NetBSD-5.1
On Tue, 12 Jun 2012 14:44:55 -0700 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: Hello. I've just encountered a strange problem with raidframe under NetBSD-5.1 that I can't immediately explain. This machine has been running a raid set since 2007. The raid set was originally constructed under NetBSD-3. For the past year, it's been running 5.0_stable with sources from July 2009 or so without a problem. Last night, I installed NetBSD-5.1 with sources from May 23 2012 or so. Now, the raid0 set fails the first component with an i/o error with no corresponding disk errors underneath. Trying to reconstruct to the failed component also fails with an error of 22, invalid argument. Looking at the dmesg output compared with the output of raidctl -s reveals the problem. The size of the raid in the dmesg output is bogus, and, if the raid driver tries to write as many blocks as is reported by the configuration output, it will surely fail as it does. However, raidctl -g /dev/wd0a looks ok and the underlying disk label on /dev/wd0a looks ok as well. Where does the raid driver get the numbers it reports on bootup? Also, there is a second raid set on this machine, the second half of the same two drives, which was constructed at the same time. It works fine with the new code. Below is the output of the boot sequence before the upgrade, and then the boot sequence after the upgrade. Below that are the output of raidctl -s raid0 and raidctl -g /dev/wd0a raid0. It looks to me like something is not zero'd out in the component label that should be, but some change in the raid code is no longer ignoring the noise in the component label. Correct. Any ideas? There was some code added a while back to handle components whose sizes were larger than 32-bit. But 5.1_stable should have the code to handle those 'bogus' values in the component label and do the appropriate thing (see rf_fix_old_label_size in rf_netbsdkintf.c version 1.250.4.11, for example).
What is your code rev for src/sys/dev/raidframe/rf_netbsdkintf.c ? Later... Greg Oster
Re: Strange problem with raidframe under NetBSD-5.1
On Tue, 12 Jun 2012 16:11:03 -0700 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: hello Greg. I just updated to the latest 5.1 tree but I don't see the change you note in that update. I see the commit in the cvs logs, but it doesn't look like it made it into the NetBSD-5 branch. The latest version I see, even after combing through the source-changes archives on the www.netbsd.org site is ...2.44.8 which was a fix for a bug I reported with wedges and raidframe some time ago. I could be missing something, and I probably am, but it's not obvious to me. Could you look to see if you see it on the NetBSD-5 branch? I don't think it's in the 5.1 tree.. The 1.250.4.11 version I quoted is from the netbsd-5 branch... The rev you actually want is 1.250.4.10 as from here: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/raidframe/rf_netbsdkintf.c?only_with_tag=netbsd-5 But as mrg has pointed out, you need the partitionSizeHi fix too... Later... Greg Oster On Jun 12, 3:30pm, Brian Buhrow wrote: } Subject: Re: Strange problem with raidframe under NetBSD-5.1 } Hello. That appears to be the problem. I thought I updated my 5.1 } sources, but I've been doing so much patching, testing and patching with } respect to the ffs fixes, that I guess I didn't actually get the latest } sources. Doing that now. I think/hope that will fix me up. } } -thanks } -Brian -- End of excerpt from Brian Buhrow Later... Greg Oster
Re: RAIDframe performance vs. stripe size
On Fri, 11 May 2012 12:48:08 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: Edgar is describing the desideratum for a minimum-latency application. Yes, I'm looking for minimum latency. I've logged the current file server's disc business and the only time they really are busy is during the nightly backup. I suppose that mainly consists of random reads traversing the directory tree followed by random reads transferring the modified files (which, I suppose, are typically small). Of course I finally will have to experiment, but a) I'm used to trying to understand the theory first and b) the rate of experiments will probably be limited to one per day or less given the amount of real data I would have to re-transfer each time. Since my understanding of the NetBSD I/O system may be incorrect, incomplete or whatever, I better try to ask people first who know better: Given 6.0/amd64 and a RAID 5 across 4+1 SAS discs. Suppose I have a 16k fsbsize and a stripe size such that there is one full fs block per disc (i.e. the stripe is 4*16k large). Suppose everything is correctly aligned (apart from where I say it isn't, of course). Suppose there is no disc I/O apart from that I describe. Ok :) Scenario A: I have two processes each reading a single file system block. Those two blocks happen to end up on two different discs (there is a 3/4 probability for that being true). Will I end up with those two discs actually seeking in parallel? Yes. Absolutely. Scenario B: I have one process reading a 64k chunk that is 16k-aligned, but not 64k- aligned (so it's not just reading a full stripe). Will I end up in four discs seeking and reading in parallel? Yes. Aligned or not, there will be 4 discs busy. (think of it as reading the last part of one stripe, and the first part of another. Those are always aligned to be non-overlapping...) I.e. will this be degraded wrt. a stripe-aligned read? No.
Consider the following layout:

d0   d1   d2   d3   p0
d5   d6   d7   p1   d4
d10  d11  p2   d8   d9
d15  p3   d12  d13  d14
p4   d16  d17  d18  d19

where dn is 16K data block n of the RAID set and px is the parity block for stripe x. (This is what the layout will be for your configuration.) If you read 64K that is stripe aligned, then you would maybe be reading d0-d3 or d4-d7 or d8-d11 or d12-d15 or d16-d19. As you can see, all of those span all 4 discs. If you read something that isn't stripe aligned (e.g. d1-d4 or d7-d10 or d11-d14 or whatever) you also see that those IOs are evenly distributed among the discs. Scenario C: I have one process doing something largely resulting in meta-data reads (i.e. traversing a very large directory tree). Will the kernel only issue sequential reads or will it be able to parallelise, e.g. reading indirect blocks? I don't know the answer to this off the top of my head... What will change if I scale all sizes by a factor of, say, 4, such that the full stripe exceeds MAXPHYS? RAIDframe won't be able to directly handle a request larger than MAXPHYS, so a single MAXPHYS-sized request to RAIDframe will end up accessing only a single component. The parallelism seen in A will still be there, but not the parallelism in B. Sorry if the answers to those questions are obvious from what has already been written so far for someone more familiar with the matter than I am. Later... Greg Oster
Re: RAIDframe performance vs. stripe size
On Fri, 11 May 2012 17:05:24 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: Thanks a lot for your detailed answers. Yes. Absolutely. Fine. As you can see, all of those span all 4 discs. Yes, that was perfectly clear to me. What I wasn't sure of was that the whole stack of subsystems involved would really be able to make use of that. Thanks for confirming it actually does. EF I have one process doing something largely resulting in meta-data EF reads (i.e. traversing a very large directory tree). Will the EF kernel only issue sequential reads or will it be able to EF parallelise, e.g. reading indirect blocks? GO I don't know the answer to this off the top of my head... Oops, any file-system experts round here? The parallelism seen in A will still be there, but not the parallelism in B. Thanks. One last, probably stupid, question: Why doesn't one use raw devices as RAIDframe components? Doesn't all data pass through the buffer cache twice when using block devices? I think you'd asked that before, but I didn't get to responding, and no-one else did either... The shortest answer is that back in the day when RAIDframe arrived it was made to handle 'underlying components' just like CCD did... and using the VOP_* interfaces meant things could be layered without worrying about what other devices or pseudo-devices were above or below the RAIDframe layer. Yes, things might get (initially) cached on a couple of different levels, but as those 'bp's get recycled it'll be the lower copies that get recycled first and you should just end up with only a single cached copy of popular items... I think it's one of those things where you trade a bit of duplication for flexibility. Does that help? Later... Greg Oster
Re: RAIDframe performance vs. stripe size
On Thu, 10 May 2012 17:46:38 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: Does anyone have some real-world experience with RAIDframe (Level 5) performance vs. stripe size? My impression would be that, with a not too large number of components (4+1, in my case), chances are rather low to spread simultaneous accesses to different physical discs, so the best choice seems the file system's block size. On the other hand, the Carnegie Mellon paper suggests something around 1/2*(throughput)*(access time), which would amount to more than a Megabyte with my discs. So, what are the real-world benefits of a large stripe size? My applications are home and mail directories exported via NFS. There have been various discussions of this on the mailing lists over the years... The one issue you'll find is that RAIDframe still suffers from the 64K MAXPHYS limitation -- RAIDframe will only be handed 64K at a time, and from there it hands chunks of that out to each component. So if you have a 4+1 RAID 5 set, if you make the stripe 32 blocks (16K) wide, that will allow a single 64K IO to hit all 4+1 disks at the same time. Of course, you'll want to make sure that your filesystem is aligned so that those 64K writes are stripe aligned, and you'll also probably want to use fairly large block sizes (16K/64K frag/block) too... Later... Greg Oster
Re: RAIDframe performance vs. stripe size
On Thu, 10 May 2012 18:59:42 +0200 Edgar Fuß e...@math.uni-bonn.de wrote: I don't know whether I'm getting this right. In my understanding, the benefit of a large stripe size lies in parallelisation: Correct. Suppose the stripe size is such that a file system block fits on a single disc, i.e. stripe size = (file system block size)*(number of effective discs). Then, if one (read) transfer makes disc A seek to block X, there is a good chance that the next transfer can be satisfied from disc B != A, making disc B seek to block Y in parallel. Or will this (issuing requests to different discs in parallel) not happen on NetBSD? Unless something issues the command to disc B to fetch block Y in parallel, it won't happen. That is, if you simply make a request for block X from disc A, there's nothing in the RAIDframe code that will have it automatically go looking to B for Y. What you're typically looking for in the parallelization is that a given IO will span all of the components. In that way, if you have n components, and the transfer would normally take t amount of time, then the total time gets reduced to just t/n (plus some overhead) because you are able to do IO to/from all components at the same time. RAIDframe will definitely do this whenever it can. Later... Greg Oster
Re: RAIDframe performance vs. stripe size
On Thu, 10 May 2012 13:23:24 -0400 Thor Lancelot Simon t...@panix.com wrote: On Thu, May 10, 2012 at 11:15:09AM -0600, Greg Oster wrote: What you're typically looking for in the parallelization is that a given IO will span all of the components. In that way, if you have n That's not what I'm typically looking for. You're describing the desideratum for a maximum-throughput application. Edgar is describing the desideratum for a minimum-latency application. No? I think what I describe still works for minimum-latency too... where it doesn't work is when your IO is so small that the time to actually transfer the data is totally dominated by the time to seek to the data. In that case you're better off in just going to a single component instead of having n components all moving their heads around to grab those few bytes (especially true where there are lots of simultaneous IOs happening). Of course, this is all still theoretical -- the best thing is to experiment with different RAID settings and real workloads to see what works best for the particular applications... Later... Greg Oster
Re: RAIDframe performance vs. stripe size
On Thu, 10 May 2012 14:06:11 -0400 Thor Lancelot Simon t...@panix.com wrote: On Thu, May 10, 2012 at 11:47:36AM -0600, Greg Oster wrote: On Thu, 10 May 2012 13:23:24 -0400 Thor Lancelot Simon t...@panix.com wrote: On Thu, May 10, 2012 at 11:15:09AM -0600, Greg Oster wrote: What you're typically looking for in the parallelization is that a given IO will span all of the components. In that way, if you have n That's not what I'm typically looking for. You're describing the desideratum for a maximum-throughput application. Edgar is describing the desideratum for a minimum-latency application. No? I think what I describe still works for minimum-latency too... where it doesn't work is when your IO is so small that the time to actually transfer the data is totally dominated by the time to seek to the data. What if I have 8 simultaneous, unrelated streams of I/O, on a 9 data-disk set? That's the lots of simultaneous IOs happening part of the bit you cut out: In that case you're better off in just going to a single component instead of having n components all moving their heads around to grab those few bytes (especially true where there are lots of simultaneous IOs happening). Like, say, 8 CVS clients all at different points fetching a repository that is too big to fit in RAM? If the I/Os are all smaller than a stripe size, the heads should be able to service them in parallel. Doing reads, where somehow each of the 8 reads is magically on a disk independent of the others, is a pretty specific use-case ;) If you were talking about writes here, then you still have the Read-modify-write thing to contend with -- and now instead of just doing IO to a single disk, you're moving two other disk heads to read the old data and old parity, and then moving them back again to do the write... Reads I agree - you can get those done in parallel with no interference (assuming the reading of the data is somehow aligned accordingly to evenly distribute the load to all disks). 
Write to anything other than a full stripe, and it gets really expensive really fast. If they are stripe size or larger, they will have to be serviced in sequence -- it will take 8 times as long. Reads, given the configuration and assumptions you have suggested, would certainly take longer. Writes would be a different story (how the drive does caching might be the determining factor?). Writing as a stripe, each disk would end up with 8 IOs for each of the 8 writes -- 64 writes in all. Writing to a block on each of the 8 disks would actually end up with: a) 8 reads to each of the disks to fetch the old blocks, b) 8 reads to one of the other disks to get the old parity. This will also mess up the head positions for the reads happening in a), even though both a) and b) will fire at the same time. c) 8 writes to each of the disks to write out the new data and, d) 8 writes to each of the disks to write out the new parity. That's now 128 reads and 128 writes (and the writes have to wait on the reads!) to do the work that the striped system could do with just 64 writes. In practice, this is why I often layer a ccd with a huge (and prime) stripe size over RAIDframe. It's also a good use case for LVMs. But it should be possible to do it entirely at the RAID layer through proper stripe size selection. In this regard RAIDframe seems to be optimized for throughput alone. See my point before about optimizing for one's own particular workloads :) (e.g. how you might optimize for a mostly read-only CVS repository will be completely different from an application where the mixture of reads and writes is more balanced...) Later... Greg Oster
Re: disklabel on RAID disappeared
On Fri, 2 Mar 2012 20:04:56 +0100 Edgar Fuß edgar.f...@bn2.maus.net wrote: Help, there's something weird going on on our fileserver! I'm on vacation and had a colleague do this over the phone. Please CC me in replies because I don't have access to my regular mail. raid0 is level 1, sd0a sd1a raid1 is level 5, sd2a .. sd9a sd0/1 are scsibus0, targets 0/1 sd2..9 are scsibus1, targets 0..8 The machine panicked. After reboot, parity rewrite on raid0 succeeded and failed on raid1 because of a read error on sd2a. He did scsictl stop sd2, scsictl detach scsibus1 0 0, replaced sd2, scsictl scan scsibus1 0 0. Something strange must have happened and sd2 was async. He nevertheless started the reconstruction (raidctl -R sd2a raid1), but raidctl -S estimated 24 hours. I asked him to stop the reconstruction, but neither failing sd2 nor detaching scsibus1 0 0 stopped it. Shortly after, the machine panicked again. It came up with raid0 and raid1 configured correctly, but fsck raid1a failed. We now have no disklabel on raid1 (disklabel -r says something about not being able to read it and disklabel without -r shows the fabricated one). Since fsck raid1a said something like incorrect fs size, I assume the superblock of raid1a is still there, only the disklabel is broken. Any hints? He is currently running the reconstruction and we'll see whether the disklabel returns or what happens if we re-write it from the backup we have in /var. Panic messages and raid-related info from /var/log/messages would help here. Also the NetBSD version would help too. Is there a sane way to stop an on-going reconstruction? Hmm... no. May trying to stop it have corrupted the raid1 contents? No... I expect the disklabel for raid1 would have been physically living on sd2a, but it should have been recoverable from the data and parity on the remaining disks.
You don't say what arch you're on, but if you use 'dd if=/dev/rraid1a' to go hunting, do you find something that looks like a disklabel, or is it just garbage? (I'm guessing the latter...) Later... Greg Oster
Re: raidframe questions
On Thu, 23 Feb 2012 16:18:22 +0100 Jean-Yves Moulin j...@eileo.net wrote: Hi everybody, I was using two 500GB disks in a raid1 setup. One of my disks died. I replaced it with a bigger disk (600GB) and rebuilt the raid. Then, the second 500GB disk died. I replaced it too with a 600GB disk and rebuilt the raid. Now, I have two 600GB disks in my raid1, but the size is still shrunk to 500GB. I made a mistake when creating the RAID label on the 600GB disks: it takes the entire disk. I want to know if I can recover my lost 100GB... disklabel for sd0: a: 1172123505 63 RAID # (Cyl. 0*- 101045*) disklabel for sd1: a: 1172123505 63 RAID # (Cyl. 0*- 101045*) disklabel for raid0: d: 976772992 0 unused 0 0 # (Cyl. 0 - 953879*) And the raidframe message for the shrinking: Warning: truncating spare disk /dev/sd1a to 976772992 blocks (from 1172123441) I read that changing the raid size is not possible. But can I change the label size for sd0a and sd1a to the same size as raid0d (from 1172123505 to 976772992), and then, can I use the recovered space for a second raid array? I'd want to test this somewhere where the data doesn't matter as much, but yes, this should be doable. Note that you'd want to change the size from 1172123505 to 976773056 (not 976772992) because you need to leave room for the component label (64 reserved blocks at the beginning of the partition). Later... Greg Oster
Re: RAIDframe reconstruction
On Mon, 20 Feb 2012 03:20:32 +0100 Edgar Fuß e...@math.uni-bonn.de wrote: I have raid1 consisting of sd2a..sd9a. sd3a failed, and I did something presumably stupid: after detach-ing sd3 and scan-ing a replacement, instead of raidctl -R /dev/sd3a raid1, I did raidctl -a /dev/sd3a raid1, and, as that didn't start a reconstruction, raidctl -F /dev/sd3a raid1, which did start a reconstruction. After the reconstruction succeeded, RAIDframe seems to be confused about the state of sd3a: Hmm... interesting that what you did even worked... I'd have thought /dev/sd3a would have still been held 'open' by the RAID set, and that the hot-add would have failed. Components: /dev/sd2a: optimal /dev/sd3a: spared /dev/sd4a: optimal /dev/sd5a: optimal /dev/sd6a: optimal /dev/sd7a: optimal /dev/sd8a: optimal /dev/sd9a: optimal Spares: /dev/sd3a: used_spare How do I get out of this? Would unconfiguring raid1 and re-configuring it work? Or how can I make sure what has really been written to the component label of sd3a? If you're using the autoconfigure stuff (which you should be) then a simple reboot should fix things up. (it will write out the correct component label for sd3a as part of unconfiguring /dev/raid1 ) Later... Greg Oster
Re: New boothowto flag to prevent raid auto-root-configuration
On Mon, 18 Apr 2011 13:06:23 +0200 Klaus Heinz k.he...@aprelf.kh-22.de wrote: Martin Husemann wrote: as described in PR 44774 (see http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=44774), it is currently not possible to use a standard NetBSD install CD on a system which normally boots from raid (at least on i386, amd64 or sparc64, where a stock GENERIC kernel is used). It looks like this is a similar problem to the one I raised here http://mail-index.NetBSD.org/tech-kern/2009/10/31/msg006410.html Instead of providing RB_NO_ROOT_OVERRIDE I would prefer something that actually _lets_ me override everything else from boot.cfg. So how about adding another flag like RB_ROOT_EXPLICIT or RB_EXPLICIT_ROOT, with the idea being that the user has to explicitly specify what the root is going to be? I think that addresses 1) below and would hopefully handle the pxeboot and NFS root situations. Maybe something like RB_SET_ROOT with a required option of NFS or raid0 or wd0a or pxe or whatever would be the way to go? I've never liked the 'yank root away from the boot device that the system thinks it maybe booted from' hack in RAIDframe, so anything we can do to make it better is fine by me. We may need both RB_NO_ROOT_OVERRIDE and this new flag in order to get everything covered, and I'm fine with that too... Quoting Robert Elz from the mentioned thread: FWIW, I think the code to allow the user to override that at boot time would be a useful addition - it is possible to boot, raidctl -A yes or raidctl -A no, and then reboot, or perhaps boot -a on systems that support that, but that's a painful sequence of operations for what should be a simple task. It should always be 1. what the user explicitly asks for 2. what the kernel has built in 3. hacks like raidctl -A root (or perhaps similar things for cgd etc) 4. where I think I came from ciao Klaus Later... Greg Oster
Re: partitionSizeHi in raidframe component label, take 2 (wasRe: partitionSizeHi in raidframe component label)
On Fri, 18 Mar 2011 20:57:49 +1100 matthew green m...@eterna.com.au wrote: this patch seems to fix my problem. i've moved the call to fix the label into rf_reasonable_label() itself, right before the valid return, and made the fix up also fix the partitionSizeHi. greg, what do you think? Looks good to me. Go for it. Later... Greg Oster
Re: Problems with raidframe under NetBSD-5.1/i386
On Fri, 18 Feb 2011 13:09:18 -0800 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: Hello. It's been a while since I had an opportunity to work on this problem. However, I have figured out the trouble. While the error is mine, I do have a couple of questions as to why I didn't discover it sooner. It turns out that I had fat-fingered the disktab entry I used to disklabel the component disks such that the start of the raid partition was at offset 0 relative to the entire disk, rather than offset 63, which is what I normally use to work around PC BIOS routines and the like. Once I figured that out, the error I was getting made sense. With this in mind, my question and suggestion are as follows: 1. It makes sense to me that I would get an EROFS error if I try to reconstruct to a protected portion of the component disk. What doesn't make sense to me is why I could create the working raid set in the first place. Why didn't I run into this error when writing the initial component labels? Another symptom of this issue, although I didn't know about it at the time, is that components of my newly created raid sets would fail with an i/o failure, without any apparent whining from the component disk itself. I think now that this was because the raid driver was trying to update some portion of the component label and failing in the same way. OK, my bad for getting my offsets wrong in the disklabel for the component disks, but can't we make it so this fails immediately upon raid creation rather than having the trouble exhibit itself as apparently unexplained component disk failures? I really don't get why the creation of the raid set would have succeeded before, but not afterwards. Was the RAID set created in single-user mode or from sysinst or something? Is there some 'securelevel' thing coming into play? I'm just guessing here, as this makes no sense to me :( (The thing is: RAIDframe shouldn't be touching any of those 'protected' areas of the disk anyway...
the first 64 blocks are reserved, with the component label and such being at the half-way point. So even if you used an offset of 0, it would have only been looking to touch blocks 32 and 33 (for parity logging), so unless something is protecting all of the first 63 blocks it shouldn't be complaining :( ) 2. I'd like to suggest the following quick patch to the raid driver to help make the diagnosis of component failures easier. Thoughts? The patch looks fine, and quite useful. Please commit. Later... Greg Oster
Re: raid of queue length 0
On Mon, 21 Feb 2011 13:02:05 +0900 (JST) enami tsugutomo tsugutomo.en...@jp.sony.com wrote: Hi, all. It is possible to create a raid with a queue (fifo) length of 0 (see the raidctl -G output below), and it looks as if it works, but reconstruction stalls once it is interfered with by normal I/O. Does such a configuration make sense? Otherwise raidctl(8) shouldn't allow it. It doesn't make sense. Configuration should fail for a queue length of 0. The best place to fix this is probably in rf_driver.c, right after the line: raidPtr->maxOutstanding = cfgPtr->maxOutstandingDiskReqs; In there, something like: if (raidPtr->maxOutstanding <= 0) { DO_RAID_FAIL(); return 1; } will probably do the trick. (The return code probably needs to be changed, and perhaps some sort of warning should be printed, but that's the general idea...) Later... Greg Oster # raidctl -G raid1
# raidctl config file for /dev/rraid1d
START array
# numRow numCol numSpare
1 2 0
START disks
/dev/vnd0a
/dev/vnd1a
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
128 1 1 1
START queue
fifo 0
Re: Problems with raidframe under NetBSD-5.1/i386
On Thu, 20 Jan 2011 17:28:21 -0800 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: hello. I got sidetracked from this problem for a while, but I'm back to looking at it as I have time. I think I may have been barking up the wrong tree with respect to the problem I'm having reconstructing to raidframe disks with wedges on the raid sets. Putting a little extra info into the error messages yields: raid2: initiating in-place reconstruction on column 4 raid2: Recon write failed (status 30 (0x1e))! raid2: reconstruction failed. If that status number, taken from the second argument of rf_ReconWriteDoneProc(), is an error from /usr/include/sys/errno.h, then I'm getting EROFS when I try to reconstruct the disk. Hmmm... strange... Wouldn't that seem to imply that raidframe is trying to write over some protected portion of one of the components, probably the one I can't reconstruct to? Each of the components has a BSD disklabel on it, and I know that the raid set actually begins 64 sectors from the start of the partition in which the raid set resides. However, is similar space set aside at the end of the raid? That is, does the raid set extend all the way to the end of its partition or does it leave some space at the end as well? No, it doesn't. The RAID set will use the remainder of the component, but only up to a multiple of whatever the stripe width is... (that is, the RAID set will always end on a complete stripe.) Here's the thought. I noticed when I was reading through the wedge code that there's a reference to searching for backup gpt tables, and that one of the backups is stored at the end of the media passed to the wedge discovery code. Since the broken component is the last component in the raid set, I wonder if the wedge discovery code is marking the sectors containing the gpt table at the end of the raid set as protected, but for the disk itself, rather than the raid set?
I want to say that this is only a theory at the moment, based on a quick diagnostic enhancement to the error messages, but I can't think of another reason why I'd be getting that error. I'm going to be in and out of the office over the next week, but I'll try to see if I can capture the block numbers that are being written when the error occurs. I think I can do that with a debug kernel I have built for the purpose. Again, this problem exists under 5.0, not just 5.1, so it predates Jed's changes. If anyone has any other thoughts as to why I'd be getting EROFS on a raid component when trying to reconstruct to it, but not when I create the raid, I'm all ears. So when one builds a regular filesystem on a wedge, do they end up with the same problem with 'data' at the end of the wedge? If one does a dd to the wedge, does it report write errors before the end of the wedge? I really need to get my test box up-to-speed again, but that's going to have to wait a few more weeks... Later... Greg Oster On Jan 7, 3:22pm, Brian Buhrow wrote: } Subject: Re: Problems with raidframe under NetBSD-5.1/i386 } hello Greg. Regarding problem 1, the inability to reconstruct disks } in raid sets with wedges in them, I confess I don't understand the vnode } stuff entirely, but rf_getdisksize() in rf_netbsdkintf.c looks suspicious } to me. I'm a little unclear, but it looks like it tries to get the disk } size a number of ways, including by checking for a possible wedge on the } component. I wonder if that's what's sending the reference count too high? } -thanks } -Brian } } On Jan 7, 2:17pm, Greg Oster wrote: } } Subject: Re: Problems with raidframe under NetBSD-5.1/i386 } } On Fri, 7 Jan 2011 05:34:11 -0800 } } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: } } } } hello. OK. Still more info. There seem to be two bugs here: } } } } 1.
Raid sets with gpt partition tables in the raid set are not able } } to reconstruct failed components because, for some reason, the failed } } component is still marked open by the system even after the raidframe } } code has marked it dead. Still looking into the fix for that one. } } } } Is this just with autoconfig sets, or with non-autoconfig sets too? } } When RF marks a disk as 'dead', it only does so internally, and doesn't } } write anything to the 'dead' disk. It also doesn't even try to close } } the disk (maybe it should?). Where it does try to close the disk is } } when you do a reconstruct-in-place -- there, it will close the disk } } before re-opening it... } } } } rf_netbsdkintf.c:rf_close_component() should take care of closing a } } component, but does something Special need to be done for wedges there? } } } } 2. Raid sets with gpt partition tables on them cannot be } } unconfigured and reconfigured without rebooting. This is because } } dkwedge_delall() is not called during the raid shutdown process. I } } have a patch
Re: Problems with raidframe under NetBSD-5.1/i386
On Fri, 7 Jan 2011 15:22:03 -0800 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: hello Greg. Regarding problem 1, the inability to reconstruct disks in raid sets with wedges in them, I confess I don't understand the vnode stuff entirely, but rf_getdisksize() in rf_netbsdkintf.c looks suspicious to me. I'm a little unclear, but it looks like it tries to get the disk size a number of ways, including by checking for a possible wedge on the component. I wonder if that's what's sending the reference count too high? -thanks In rf_reconstruct.c:rf_ReconstructInPlace() we have this: retcode = VOP_IOCTL(vp, DIOCGPART, &dpart, FREAD, curlwp->l_cred); I think this will fail for wedges... it should be doing: retcode = VOP_IOCTL(vp, DIOCGWEDGEINFO, &dkw, FREAD, l->l_cred); for the wedge case (see rf_getdisksize()). Now: since the kernel prints: raid2: initiating in-place reconstruction on column 4 raid2: Recon write failed! raid2: reconstruction failed. it's somehow making it past that point... but maybe with the wrong values?? (is there an old label on the disk or something??? ) Later... Greg Oster On Jan 7, 2:17pm, Greg Oster wrote: } Subject: Re: Problems with raidframe under NetBSD-5.1/i386 } On Fri, 7 Jan 2011 05:34:11 -0800 } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: } }hello. OK. Still more info. There seem to be two bugs here: } } 1. Raid sets with gpt partition tables in the raid set are not able } to reconstruct failed components because, for some reason, the failed } component is still marked open by the system even after the raidframe } code has marked it dead. Still looking into the fix for that one. } } Is this just with autoconfig sets, or with non-autoconfig sets too? } When RF marks a disk as 'dead', it only does so internally, and doesn't } write anything to the 'dead' disk. It also doesn't even try to close } the disk (maybe it should?).
Where it does try to close the disk is } when you do a reconstruct-in-place -- there, it will close the disk } before re-opening it... } } rf_netbsdkintf.c:rf_close_component() should take care of closing a } component, but does something Special need to be done for wedges there? } } 2. Raid sets with gpt partition tables on them cannot be } unconfigured and reconfigured without rebooting. This is because } dkwedge_delall() is not called during the raid shutdown process. I } have a patch for this issue which seems to work fine. See the } following output: } [snip] } } Here's the patch. Note that this is against NetBSD-5.0 sources, but } it should be clean for 5.1, and, i'm guessing, -current as well. } } Ah, good! Thanks for your help with this. I see Christos has already } commited your changes too. (Thanks, Christos!) } } Later... } } Greg Oster -- End of excerpt from Greg Oster Later... Greg Oster
Re: Problems with raidframe under NetBSD-5.1/i386
On Thu, 6 Jan 2011 09:42:41 -0800 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: Hello. I have a box running NetBSD-5.1/i386 with kernel sources from 1/4/2011 which refuses to reconstruct to what looks like a perfectly good disk. The error messages are: Command: root# raidctl -R /dev/sd10a raid2 Error messages from the kernel: raid2: initiating in-place reconstruction on column 4 raid2: Recon write failed! raid2: reconstruction failed. Is there anything else in /var/log/messages about this? Did the component fail before with write errors? So, I realize this isn't a lot to go on, so I've been trying to build a kernel with some debugging in it. Configured a kernel with: options DEBUG options RF_DEBUG_RECON But the kernel won't build because the Dprintf statements that get turned on in the rf_reconstruct.c file when the second option is enabled cause gcc to emit warnings. Ug. Those should probably be fixed, although I don't suspect many people use them. So, rather than fixing the warnings, and potentially breaking something else, my question is, how can I turn off -Werror in the build process for just this kernel? Looking in the generated Makefile didn't provide enlightenment, and I don't really want to disable this option for my entire tree. But, I imagine, this is an easy and often-wanted thing to do. -thanks -Brian Later... Greg Oster
Re: Problems with raidframe under NetBSD-5.1/i386
On Thu, 6 Jan 2011 10:05:17 -0800 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: hello Greg. When the component failed, I just got an error: raid2: io error on /dev/sd10a, marking /dev/sd10a as failed! No messages from the drive itself. It's a SCSI disk attached to an LSI mpt(4) card, whose driver I'm trying to improve in the event of problematic disk behavior. However, this drive is good, I believe, and I can read and write to the disk itself without problems. To all sectors, or just through the filesystem? I suspect I'm having kmem allocation problems inside RAIDframe, I've been here before, :), but I'm not sure of that. I did test with NetBSD-5.0 with kernel sources from June 2009, which is what I've been running in production around here, just to make sure the new parity map stuff wasn't the problem. It doesn't reconstruct there either. So, whatever is wrong, it's an old problem, and probably a pilot error at that. Hmm.. so if you bump up the size of NKMEMPAGES (or whatever) does the reconstruction succeed? How large are these components, and how much RAM is in the system? Do you know the magic to turn off -Werror for individual kernel builds? Not off the top of my head, no... Later... Greg Oster On Jan 6, 11:50am, Greg Oster wrote: } Subject: Re: Problems with raidframe under NetBSD-5.1/i386 } On Thu, 6 Jan 2011 09:42:41 -0800 } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: } }Hello. I have a box running NetBSD-5.1/i386 with kernel } sources from 1/4/2011 which refuses to reconstruct to what looks like } a perfectly good disk. The error messages are: } } Command: } root# raidctl -R /dev/sd10a raid2 } } Error messages from the kernel: } raid2: initiating in-place reconstruction on column 4 } raid2: Recon write failed! } raid2: reconstruction failed. } } Is there anything else in /var/log/messages about this? Did the } component fail before with write errors?
} } So, I realize this isn't a lot to go on, so I've been trying to build } a kernel with some debugging in it. } } Configured a kernel with: } options DEBUG } options RF_DEBUG_RECON } } But the kernel won't build because the Dprintf statements that get } turned on in the rf_reconstruct.c file when the second option is } enabled causes gcc to emit warnings. } } Ug. Those should probably be fixed, although I don't suspect many } people use them } }So, rather than fixing the warnings, and potentially breaking } something else, my question is, how can I turn off -werror in the } build process for just this kernel? Looking in the generated } Makefile didn't provide enlightenment and I don't really want to } disable this option for my entire tree. But, I imagine, this is an } easy and often-wanted thing to do. -thanks } -Brian } } } Later... } } Greg Oster -- End of excerpt from Greg Oster Later... Greg Oster
Re: Problems with raidframe under NetBSD-5.1/i386
On Thu, 6 Jan 2011 18:33:58 -0800 buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote: Hello. OK. I have more information; perhaps this is a known issue. If not, I can file a bug. Please, do file a PR... this is a new one. The problem seems to be that if you partition a raid set with gpt instead of disklabel and a component of that raid set fails, the underlying component is held open even after raidframe declares it dead. Thus, when you try to ask raidframe to do a reconstruct on the dead component, it can't open the component because the component is busy. I think the culprit is in src/sys/dev/raidframe/rf_netbsdkintf.c:rf_find_raid_components() where in the: if (wedge) { ... ac_list = rf_get_component(ac_list, dev, vp, device_xname(dv), dkw.dkw_size); continue; } case that little continue is not letting the execution reach the: /* don't need this any more. We'll allocate it again a little later if we really do... */ vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); VOP_CLOSE(vp, FREAD | FWRITE, NOCRED); vput(vp); code which would close the opened wedge. :( Both 5.1 and -current suffer from the same issue (though the code in -current is slightly different). Thanks for the investigation and report... Later... Greg Oster