Re: dumping on RAIDframe

2023-09-24 Thread Greg Oster




On 2023-09-20 12.41, Edgar Fuß wrote:

Didn't RAIDframe recently (for certain values of "recently") gain the function
to dump on a level 1 set? Should this work in -8?
swapctl -z says "dump device is raid0b" (and raid0 is a level 1 RAID), but
reboot 0x100 in DDB says
dumping to dev 18,1 (offset=1090767, size=8252262):

dump device not ready

What am I missing?

The offset (as reported by disklabel) of raid0b within raid0 is 2097152 (1G),
the partition size is 67108864 (32G), so maybe something's wrong with the
offset and size values (whatever unit they are in) DDB reports.


Dumping to a RAID 1 set is supported in -8.  But yes, none of those 
values seem to align with each other.  18,1 is 'raid0b' though, so that 
part seems correct.


Could you please file a PR on this, with details as to the various 
disklabels?  If this is broken, it should get addressed.


Thanks.

Later...

Greg Oster


Re: Scheduling problem - need some help here

2022-11-23 Thread Greg Oster



Hi Folks

A reminder that the issue in this thread (and recorded in this PR: 
http://gnats.netbsd.org/55415 ) is still very much outstanding and will 
result in the vax port being nearly unusable in NetBSD 10+.  There is a 
'fix' mentioned in the PR, but it's not clear it's the right solution.


Any/all thoughts on this are welcome.

Thanks.

Later...

Greg Oster

On 2020-07-28 06:01, Anders Magnusson wrote:

Hi,

On 2020-07-28 13:28, Nick Hudson wrote:

On 28/06/2020 16:11, Anders Magnusson wrote:

Hi,

there is a problem (on vax) that I do not really understand. Greg Oster
filed a PR on it (#55415).

A while ago ad@ removed the  "(ci)->ci_want_resched = 1;" from
cpu_need_resched() in vax/include/cpu.h.
And as I read the code (in kern_runq.c) it shouldn't be needed,
ci_want_resched should be set already when the macro cpu_need_resched()
is invoked.

But; without setting cpu_need_resched=1 the vax performs really bad (as
described in the PR).

cpu_need_resched may have multiple values nowadays, setting it to 1 will
effectively clear out other flags, which is probably what makes it work.

Anyone know what is going on here (and can explain it to me)?


I'm no expert here, but I think the expectation is that each platform
has its own method to signal "ast pending" and eventually call userret
(and preempt) when it's set - see setsoftast/aston.
VAX has hardware ASTs (AST is actually a VAX operation), which work so 
that if an AST is requested, then the next time an REI to userspace is 
executed it will take an AST trap instead and then reschedule.
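
For reference, a hedged sketch of roughly what the pre-removal macro may 
have looked like (this is not the actual vax/include/cpu.h, the argument 
list is simplified, and the mtpr() half is an assumption based on the 
cpu_signotify() definition quoted below):

/*
 * Sketch only: set the MI flag directly (the line ad@ removed) and post
 * a hardware AST by writing AST_OK to PR_ASTLVL, so the next REI to
 * userspace traps and ends up in userret()/preempt().
 */
#define cpu_need_resched(ci, flags)                                     \
do {                                                                    \
        (ci)->ci_want_resched = 1;      /* the removed line */          \
        mtpr(AST_OK, PR_ASTLVL);        /* post a hardware AST */       \
} while (/* CONSTCOND */ 0)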


As I don't understand vax I don't know what

197 #define cpu_signotify(l) mtpr(AST_OK,PR_ASTLVL)

is expected to do, but somehow it should result in userret() being 
called.
Yep, this is the way an AST is posted. Next time an REI is executed it 
will trap to the AST subroutine.


Other points are:

- vax cpu_need_resched doesn't seem to differentiate between locally
  running lwp and an lwp running on another cpu.
Most likely.  It's been 20 years since I wrote the MP code (and probably 
as long since anyone last tested it) and at that time LWPs didn't 
exist in NetBSD.  I would be surprised if it still worked :-)


- I can't see how hardclock would result in userret being called, but
  like I said - I don't know vax.
When it returns from hardclock (via REI) it directly traps to the AST 
handler instead if an AST is posted.

http://src.illumos.org/source/xref/netbsd-src/sys/arch/vax/vax/intvec.S#311

I believe ci_want_resched is an MI variable for the scheduler which is
why its use in vax cpu_need_resched got removed.

It shouldn't be needed, but obviously something breaks if it isn't added.

What I think may have happened is that someone may have optimized 
something in the MI code that expects a different behaviour than the VAX 
hardware ASTs have.  AFAIK VAX is (almost) the only port that has 
hardware ASTs.


Thanks for at least looking at this.

-- Ragge


Re: RAIDframe: what if a disc fails during copyback

2020-10-30 Thread Greg Oster

On 10/30/20 1:54 PM, Edgar Fuß wrote:

it locks out all other non-copyback IO in order to finish the job!

Oops!


Locking out all other IO is very poor... but if it's a small enough RAID set
you might be able to get away with the downtime for the copyback...

Certainly not.


You shouldn't need to reboot for this... the 'failing spared disk' and
'reconstruct to previous second disk' should work fine without reboot.

I still don't get this. What I have is:

Components:
/dev/sd5a: spared
/dev/sd6a: optimal
Spares:
/dev/sd7a: used_spare

So what am I supposed to do from here?


If you really want to get /dev/sd5a in use again, you can do:

 raidctl -f /dev/sd7a raidX
 raidctl -vR /dev/sd5a raidX

to do the fail of sd7a and rebuild of sd5a.  But unless you have a 
strong need to use sd5a I would do nothing and leave things as-is.  If 
you reboot at this point /dev/sd7a would show up as the first component 
and be marked as 'optimal'.


Later...

Greg Oster


Re: RAIDframe: what if a disc fails during copyback

2020-10-30 Thread Greg Oster

On 10/30/20 4:25 AM, Edgar Fuß wrote:

Thanks for the detailed answer.


it's still there, and it does work,

That's reassuring to know.


but it's not at all performant or system-friendly.

Just how bad is it?


It's been probably over a decade since I last tried it, but as I recall 
it locks out all other non-copyback IO in order to finish the job!



If you want the components labelled nicely, give the system a reboot

Re-booting our file server is something I like to avoid.


You'll like copyback even less then -- I'd say once you're done the 
reconstruct, just leave it, or reconstruct again to the 'repaired 
original' as I suggested...



and behaves very poorly.

Depending on how poorly, I could probably live with it (the RAID in question
is the small system one, not the large user data one).


Locking out all other IO is very poor... but if it's a small enough RAID 
set you might be able to get away with the downtime for the copyback...



In your case, what I'd do is just fail the spare, and initiate a reconstruct
to the original failed component.  (You still have the data on the spare if
something goes bad with the original good component.)

Hm, I guess I would need to re-boot and intervene manually in that case.
Just using the slow copyback looks preferable if it doesn't take more than
a day.


You shouldn't need to reboot for this... the 'failing spared disk' and 
'reconstruct to previous second disk' should work fine without reboot. 
(IIRC I've used a '3rd component' to make the primary/secondary 
components swap places.. just to test that, of course :) )



Probably I need to test this on another machine before.
I guess there's no way to initiate a reconstruction to a spare and failing
the specified component only /after/ the reconstruction has completed,
not before?


No, there's not, unfortunately. :(

Later...

Greg Oster


Re: Horrendous RAIDframe reconstruction performance

2020-06-28 Thread Greg Oster

On 6/28/20 7:31 PM, John Klos wrote:
Any thoughts about what's going on here? Is this because the drives 
are 512e drives? Three weeks is a LONG time to reconstruct.


So this turns out to be a failing drive. SMART doesn't show it's 
failing, but the one that's failing defaults to having the write cache 
off, and turning it on doesn't change the speed.


Yep, that will do it.

I guess it's still usable, in a limited way - I can only write at 5 or 6 
MB/sec, but I can read at 200 MB/sec. Maybe I'll use it in an m68k Mac.


Also, the autoconfigure works, but the forcing of root FS status didn't 
because I was testing it on a system that already had a RAIDframe with 
forced root. However, it still doesn't work on aarch64, but I'll recheck 
this after trying Jared's boot.cfg support.


Thanks, Greg, Michael and Edgar. I learned something :) I am still 
curious about whether I was seeing both good read and write speeds 
because writes weren't going to both drives. I suppose I assumed that 
all writes would go to both drives even while reconstructing, but I 
suppose that only happens when the writes are inside of the area which 
has already been reconstructed, yes?


Correct.  Writes will go to both on stripes where reconstruction has 
completed, but will only go to the 'good' device if reconstruction 
hasn't reached that stripe yet.
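
As a tiny illustration of that routing rule (purely a sketch; 'recon_head' 
and the write callbacks are invented names, not RAIDframe internals):

#include <stdint.h>

/*
 * Illustrative only: during a RAID 1 reconstruction, a write always goes
 * to the surviving component, and is mirrored to the component being
 * rebuilt only if that stripe has already been reconstructed.
 */
static void
mirror_write_sketch(uint64_t stripe, uint64_t recon_head,
    void (*write_good)(uint64_t), void (*write_rebuilding)(uint64_t))
{
        write_good(stripe);             /* surviving component always written */
        if (stripe < recon_head)        /* stripe already reconstructed */
                write_rebuilding(stripe);
        /* stripes at or beyond recon_head get their data copied when
           reconstruction reaches them */
}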


Later...

Greg Oster


Re: Horrendous RAIDframe reconstruction performance

2020-06-28 Thread Greg Oster

On 6/28/20 12:29 PM, Edgar Fuß wrote:

That's the reconstruction algorithm. It reads each stripe and if it
has a bad parity, the parity data gets rewritten.

That's the way parity re-write works. I thought reconstruction worked
differently. oster@?



Reconstruction does not do the "read", "compare", "write" operation like 
parity checking. It just does "read", "compute", "write".  In the case 
of RAID 1, "compute" does nothing ,and it's basically read from one 
component and write to the other.


Later...

Greg Oster


Re: Asymmetric Disk I/O priorities in a RAIDframe RAID1?

2018-05-28 Thread Greg Oster
On Mon, 28 May 2018 10:27:46 +0200
Hauke Fath <h...@spg.tu-darmstadt.de> wrote:

> All,
> 
> is there a way in RAIDframe to give a member of a RAID1 set read 
> priority, to e.g. favour reads from an SSD over rotating rust or
> iscsi?
> 
> The Linux mdadm(8) software raid config tool has the following option:
> 
> 
> "[...] devices listed in a --build, --create, or --add command will
> be flagged as 'write-mostly'. This is valid for RAID1 only and means
> that the 'md' driver will avoid reading from these devices if at all 
> possible. This can be useful if mirroring over a slow link."
> 
> 
> Can RAIDframe do anything similar?

No... RAIDframe basically selects the component with the shortest
queue.  If queue lengths are equal, then it tries to pick the component
where the data you want is 'nearest' to the last data item being
fetched in the queue (see rf_dagutils.c:rf_SelectMirrorDiskIdle()).
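
A simplified sketch of that heuristic (the real code is
rf_SelectMirrorDiskIdle() in rf_dagutils.c; the struct and field names
below are invented for illustration):

#include <stdint.h>

struct mirror_side_sketch {
        unsigned queue_len;     /* outstanding requests on this component */
        uint64_t last_blkno;    /* block number of the last queued request */
};

/* return 0 to read from side a, 1 to read from side b */
static int
select_mirror_sketch(const struct mirror_side_sketch *a,
    const struct mirror_side_sketch *b, uint64_t req_blkno)
{
        if (a->queue_len != b->queue_len)
                return a->queue_len < b->queue_len ? 0 : 1;

        /* equal queue lengths: pick the side whose last queued request
           is 'nearest' to the block being asked for */
        uint64_t da = a->last_blkno > req_blkno ?
            a->last_blkno - req_blkno : req_blkno - a->last_blkno;
        uint64_t db = b->last_blkno > req_blkno ?
            b->last_blkno - req_blkno : req_blkno - b->last_blkno;
        return da <= db ? 0 : 1;
}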

Later...

Greg Oster


Re: RAIDframe: passing component capabilities

2017-03-31 Thread Greg Oster
On Fri, 31 Mar 2017 17:15:38 +0200
Edgar Fuß <e...@math.uni-bonn.de> wrote:

> > given that RAIDframe (nor ccd, nor much else) has a general 'query
> > the underlying layers to ask about this capability' function.  
> Is there a ``neither'' missing between ``that'' and ``RAIDframe''?

Yes, sorry.

> > (NetBSD 8 refusing to configure a RAID set because of this is not an
> > option.)  
> Of course not. With my model, you would need to (re-)configure the
> RAID set with ``all components have SCSI tagged queueing'' in order
> for the RAID device to announce that capability. If one of the drives
> is SATA, that configuration fails. If you later try to replace a SCSI
> drive with a SATA one it fails like it fails when the replacement
> drive has insufficient capacity.
> It's just like with capacities: There's no need to announce the full
> component capacity to the set (well, in fact, you don't use the full
> drive capacity for the partition that constitutes the component), but
> the component needs to have at least the announced capacity (in fact,
> you need to be able to create a partition of sufficient size on the
> drive). With capabilities, there would also be no need to announce
> all the drive's capabilities, but a component (original or
> replacement) needs to have at least the announced capabilities.

That still requires RAIDframe then asking the components (or having
them report to RAIDframe when they are attached) about whether or not
they can do a certain thing, in order to decide whether or not the
reconfiguration succeeds or fails.

Later...

Greg Oster


Re: RAIDframe: passing component capabilities (was: Exposing FUA as alternative to DIOCCACHESYNC for WAPBL)

2017-03-31 Thread Greg Oster
On Wed, 29 Mar 2017 12:02:23 +0200
Edgar Fuß <e...@math.uni-bonn.de> wrote:

> EF> Some comments as I probably count as one of the larger WAPBL
> EF> consumers (we have ~150 employee's Home and Mail on NFS on
> EF> FFS2+WAPBL on RAIDframe on SAS):
> JD> I've not changed the code in RF to pass the cache flags, so the
> JD> patch doesn't actually enable FUA there. Mainly because disks
> JD> come and go and I'm not aware of mechanism to make WAPBL aware of
> JD> such changes. It
> TLS> I ran into this issue with tls-maxphys and got so frustrated I
> TLS> was actually considering simply panicing if a less-capable disk
> TLS> were used to replace a more-capable one.  
> Oops. What did you do in the end? What does Mr. RAIDframe say?
> 
> My (probably simplistic) idea would be to add a capabilities option
> to the configuration file, and just as you can't add a disc with
> insufficient capacity, you can't add one with insufficient
> capabilities. Of course, greater capabilities are to be ignored just
> as a larger capacity is.

FUA/maxphys/anything 'disk'-specific is a bit of a pain to deal with,
given that RAIDframe (nor ccd, nor much else) has a general 'query the
underlying layers to ask about this capability' function.

I see two major things here:
 1) Whatever we do can't break existing setups.  That is, if an
 underlying disk can't do FUA, then upper layers just need to Deal.
 (NetBSD 8 refusing to configure a RAID set because of this is not an
 option.)

 2) Whatever query mechanism is used must be device agnostic at the
 higher levels.  It needs to work for RAID, SAS, SCSI, SATA, HP-IB,
 etc, and leave it up to the lower levels to respond with the correct
 "Yes all devices I talk to (recursively) can do this" or "No,
 at least one of us can't do this" to the query.  And then it's up to
 the drivers to actually pass the appropriate flags and do the Right
 Things.
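
As a purely hypothetical sketch of point 2) (nothing like this exists
today, and every name below is invented), the recursive query boils down
to AND-ing the answers from everything underneath:

#define CAP_FUA                 0x01    /* made-up capability bits, */
#define CAP_TAGGED_QUEUE        0x02    /* e.g. cap_query_sketch(CAP_FUA, ...) */

/*
 * Hypothetical: each stacking driver (raid, ccd, dk, ...) answers a
 * capability query by asking its children and AND-ing the results, so a
 * single "no" anywhere below turns the whole answer into "no".
 */
static int
cap_query_sketch(int requested, const int *child_caps, int nchildren)
{
        int i, caps = requested;

        for (i = 0; i < nchildren; i++)
                caps &= child_caps[i];
        return caps;    /* bits still set are supported by every child */
}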

Later...

Greg Oster


Re: RAIDframe: cooldown nnn out of range

2016-09-01 Thread Greg Oster
On Thu, 1 Sep 2016 10:58:00 +0200
Edgar Fuß <e...@math.uni-bonn.de> wrote:

> Upon reboot after a clean shutdown, I yesterday got
>   raid0: cooldown 1944635974 out of range
> raidctl -s is happy. What does that message mean? Do I need to worry?

Good question.  It seems that the cooldown value is not in the 1-128
range...  (the cooldown is used to regulate the rate at which paritymap
updates get pushed onto the disk).  I'm not sure why it is that high,
or how it got that high :-(  I'm also not sure if the kernel got that
value from the paritymap on the disk or where... though I think if it
read a '0' for the cooldown from the on-disk parity map that it would
then use 'random garbage' from the newly initialized parity map if
coming through rf_paritymap_attach() :(  Fortunately, when the code
detects these 'out of range' things, it grabs a set of correct values,
and uses those going forward...

Later...

Greg Oster


Re: Is it feasible to implement full writes of stripes to raid using NVRAM memory in LFS?

2016-08-22 Thread Greg Oster
On Sun, 21 Aug 2016 10:20:07 -0400
Thor Lancelot Simon <t...@panix.com> wrote:

> On Fri, Aug 19, 2016 at 10:01:43PM +0200, Jose Luis Rodriguez Garcia
> wrote:
> > On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon  
> > >
> > > Perhaps, but I bet it'd be easier to implement a generic
> > > pseudodisk device that used NVRAM (fast SSD, etc -- just another
> > > disk device really) to buffer *all* writes to a given size and
> > > feed them out in that-size chunks.  Or to add support for that to
> > > RAIDframe.
> > >
> > > For bonus points, do what the better "hardware" RAID cards do,
> > > and if the inbound writes are already chunk-size or larger,
> > > bypass them around the buffering (it implies extra copies and
> > > waits after all).
> > >
> > > That would help LFS and much more.  And you can do it without
> > > having to touch the LFS code.  
> > 
> > Won't it be easier to add a layer that does these tasks in
> > the LFS code. It has the disadvantage that it would be used only
> > by  
> 
> I am guessing not.  The LFS code is very large and complex -- much
> more so than it needs to be.  It is many times the size of the
> original Sprite LFS code, which, frankly, worked better in almost all
> ways.  It represents (to me) a failed experiment at code and
> datastructure sharing with FFS (it is also worse, and larger, because
> Sprite's buffer cache and driver APIs were simpler than ours and
> better suited to LFS' needs).
> 
> It is so large and so complex that truly screamingly funny bugs like
> writing the blocks of a segment out in backwards order went
> undetected for long periods of time!
> 
> It might be possible to build something like this inside RAIDframe or
> LVM but I think it would share little code with any other component
> of those subsystems.  I would suggest building it as a standalone
> driver which takes a "data disk" and "cache disk" below and provides
> a "cached disk" above.  I actually think relatively little code
> should be required, and avoiding interaction with other existing
> filesystems or pseudodisks should keep it quite a bit simpler and
> cleaner.

Building this as a layer that allows arbitrary devices as either the
'main store' or the 'cache' would work well, and allow for all sorts of
flexibility.  What I don't know is how you'd glue that in to be a
device usable for /.  The RAIDframe code in that regard is already a
nightmare!

Perhaps something along the lines of the dk(4) driver, where one could
either use it as a stand-alone device, or hook into it to use the
caching features.. (e.g. 'register' the cache when raid0 is configured,
and then use/update the cache on reads/writes/etc to raid0)

Obviously this needs to be fleshed out a significant amount...

Later...

Greg Oster


Re: Is it feasible to implement full writes of stripes to raid using NVRAM memory in LFS?

2016-08-22 Thread Greg Oster
On Sat, 20 Aug 2016 03:20:51 +0200
Jose Luis Rodriguez Garcia <joseyl...@gmail.com> wrote:

> On Fri, Aug 19, 2016 at 5:27 PM, Thor Lancelot Simon <t...@panix.com>
> wrote:
> > On Thu, Aug 18, 2016 at 06:23:32PM +, Eduardo Horvath wrote:
> > chunks.  Or to add support for that to RAIDframe.  
> ...
> > That would help LFS and much more.  And you can do it without having
> > to touch the LFS code.
> >  
> I have been thinking about this, and I think that this is the best
> option. Although I would prefer to integrate it with LFS, as I said in
> my previous mail, adding it to RAIDframe means it can be used/tested
> by more people, and it is possible that more developers/testers get
> involved.  Integrating it inside LFS would surely be a one-man project
> that very possibly would never be finished.
> Another bonus of integrating it with RAIDframe is that it can resolve
> the RAID write-hole problem:
> http://www.raid-recovery-guide.com/raid5-write-hole.aspx
> I don't know whether NetBSD resolves the write-hole problem (resolving
> it has a performance penalty).

RAIDframe maintains a 'Parity status:', which indicates whether or not
all the parity is up-to-date.  Jed Davis did the GSoC work to add the
'parity map' stuff which significantly reduces the amount of effort
needed to ensure the parity is up-to-date after a crash.  (Basically
RAIDframe checks (and corrects) any parity blocks in any modified
regions of the RAID set.)
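
In a very reduced sketch (invented names; this is not the rf_paritymap.c
interface), the idea looks like:

#include <stdint.h>

#define PM_REGIONS      4096            /* regions covering the whole set (example) */

struct paritymap_sketch {
        uint8_t dirty[PM_REGIONS / 8];  /* one bit per region */
};

/* flag a region as dirty before any write lands in it */
static void
pm_mark_dirty(struct paritymap_sketch *pm, unsigned region)
{
        pm->dirty[region / 8] |= 1u << (region % 8);
}

/* after an unclean shutdown, only flagged regions get parity re-checked */
static int
pm_region_needs_check(const struct paritymap_sketch *pm, unsigned region)
{
        return (pm->dirty[region / 8] >> (region % 8)) & 1;
}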

Later...

Greg Oster


Re: WAPBL not locking enough?

2016-05-03 Thread Greg Oster
On Tue, 3 May 2016 10:02:44 +
co...@sdf.org wrote:

> Hi,
> 
> I fear that WAPBL should be locking wl_mtx before running
> wapbl_transaction_len (in wapbl_flush)
> 
> I suspect so because it performs a similar lock before
> looking at wl_bufcount in sys/kern/vfs_wapbl.c:1455-1460.
> 
> If true, how about this diff?
> 
> diff --git a/sys/kern/vfs_wapbl.c b/sys/kern/vfs_wapbl.c
> index 378b0ba..f765ce3 100644
> --- a/sys/kern/vfs_wapbl.c
> +++ b/sys/kern/vfs_wapbl.c
> @@ -1971,6 +1971,7 @@ wapbl_transaction_len(struct wapbl *wl)
> size_t len;
> int bph;
>  
> +   mutex_enter(&wl->wl_mtx);
>     /* Calculate number of blocks described in a blocklist header */
>     bph = (blocklen - offsetof(struct wapbl_wc_blocklist, wc_blocks)) /
>         sizeof(((struct wapbl_wc_blocklist *)0)->wc_blocks[0]);
> @@ -1981,6 +1982,7 @@ wapbl_transaction_len(struct wapbl *wl)
>     len += howmany(wl->wl_bufcount, bph) * blocklen;
>     len += howmany(wl->wl_dealloccnt, bph) * blocklen;
>     len += wapbl_transaction_inodes_len(wl);
> +   mutex_exit(&wl->wl_mtx);
>  
> return len;
>  }
>  

It's been a while since I looked at this, but you might want to see if 
this:

 rw_enter(&wl->wl_rwlock, RW_WRITER);

takes care of the locking that you are concerned about...

There certainly does seem to be mutex_enter/mutex_exits missing on the
DIAGNOSTIC code around line 981.

In any event you'll want more than just my comments, and a lot more
than 'a kernel with these changes boots' for testing :)

Later...

Greg Oster


Re: RAIDFrame changes

2016-01-03 Thread Greg Oster
On Sun, 27 Dec 2015 11:49:36 + (UTC)
mlel...@serpens.de (Michael van Elst) wrote:

> Hi,
> 
> now that RAIDFrame is usable as a module I have prepared
> a patch to refactor the driver to use the common dksubr code.
> 
> The patch can be found at
> 
>  http://ftp.netbsd.org/pub/NetBSD/misc/mlelstv/raidframe.diff
> 
> and it includes a few other fixes too:
> 
> - finding components now does proper kernel locking by using
>   bdev_strategy, previously it would trigger an assertion in the sd
>   driver (and maybe others).
> 
> - finding components now runs in two passes to prefer wedges over
>   raw partitions when the wedge starts at offset 0.
> 
> - defer RAIDFRAME_SHUTDOWN (raidctl -u operation) to raidclose.
>   - side effect is that 'raidctl -u' succeeds even for units that
> haven't been configured. The previous behaviour was to keep
> the embryonal unit and fail ioctl and close operations which
> prevents unloading of the module.
> 
> - moved raidput again to raid_detach() because raid_detach_unlocked()
>   is only called for initialized units.
> 
> - use common dksubr code
>   - the fake device softc now stores a pointer back to the real softc
>   - no longer uses a private bufq
>   - private disklabel and disk_busy code is gone.
> 
> - some extra messages
> 
> 
> 
> I will commit this in the next days if there is no objection.

Thanks for working on this, but "next days" doesn't provide any real
time during the "Holiday Season".  :(

Have these changes, especially those to raiddump(), been extensively
tested?  RF_PROTECTED_SECTORS used to be required in raiddump() so as
not to eat the component label.  raid_dumpblocks() now contains what
raiddump() used to, but raiddump() still has to be able to
select which of the underlying components is still alive, and not to
attempt to dump to anything other than a RAID 1 device.  I'm not seeing
the mechanisms by which those requirements are still met.
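
For reference, those constraints amount to something like the sketch
below (this is not the committed code; the RAID 1 test via
Layout.numDataCol and the use of bdev_dump() are assumptions):

static int
raiddump_sketch(RF_Raid_t *raidPtr, daddr_t blkno, void *va, size_t size)
{
        int col;

        if (raidPtr->Layout.numDataCol != 1)    /* only sane for RAID 1 */
                return EINVAL;

        for (col = 0; col < raidPtr->numCol; col++)
                if (raidPtr->Disks[col].status == rf_ds_optimal)
                        break;                  /* first still-alive component */
        if (col == raidPtr->numCol)
                return ENXIO;                   /* nothing left to dump to */

        /* skip RF_PROTECTED_SECTORS so the component label survives */
        return bdev_dump(raidPtr->Disks[col].dev,
            blkno + RF_PROTECTED_SECTORS, va, size);
}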

Methinks these changes should have been explicitly reviewed by the
RAIDframe maintainer *before* they were committed. :(

Later...

Greg Oster


Re: RAIDframe hot spares

2015-12-21 Thread Greg Oster
On Mon, 21 Dec 2015 16:49:12 +0100
Edgar Fuß <e...@math.uni-bonn.de> wrote:

> I have another question on RAIDframe, this time on hot spares (which
> I never used before).
> I was at A and B "optimal" and C "spare", and B failed. I would have
> expected RAIDframe to immediately start a reconstruction on C, but
> that didn't happen; I had to raidctl -F B first.
> 
> In case it matters, that setup was reached through the following
> events: A and B "optimal"
> B failed
> A "optimal", B "failed"
> raidctl -a C
> A "optimal", B "failed", C "spare"
> raidctl -F B
> A "optimal", B "spared", C "used_spare"
> raidctl -B
> A and B "optimal", C "spare"
> 
> Isn't the point of a *hot* spare to immediately reconstruct on it if
> neccessary?

The original code didn't support immediate reconstruction, and no-one
has bothered to add it since...  At this point you can approximate the
immediate reconstruction with a cron job that checks the status of the
RAID set, and then executes the appropriate rebuild.

Later...

Greg Oster


Re: raidctl -B syntax

2015-12-20 Thread Greg Oster
On Mon, 21 Dec 2015 00:25:02 +1100
matthew green <m...@eterna.com.au> wrote:

> > I am confident that an I/O error during reconstruction will result
> > in the reconstruction failing.
> 
> it does for RAID1.  i've not used RAID5 for years.

It does for RAID 5 as well.

> i had a disk failure, followed by the other side giving read
> errors while reconstructing.  my rebuild failed sort of
> appropriately (it would be nice if the re-rebuild would know
> where to restart from.)
> 
> (i managed to recover the failed blocks from the 2nd disk
> from the 1st one.  at least they managed to fail in different
> regions of the disk, obviating the need for more annoying
> methods of restore :-)

About a year ago I had to build a custom kernel to ignore IO errors on
the source disk...  A RAID 1 set had failed the second component, and
it turned out the first component also had read errors.  Fortunately,
the blocks with errors were not associated with any filesystem bits (a
full dump, for example, worked fine) and so I was able to recover
everything without having to go to backups. (i.e. custom kernel
recovered bits to new second component, and then the first component
got replaced as well!)

Later...

Greg Oster


Re: raidctl -B syntax

2015-12-20 Thread Greg Oster
On Sun, 20 Dec 2015 12:38:43 +0100
Edgar Fuß <e...@math.uni-bonn.de> wrote:

> I'm unsure what the "dev" argument to raidctl -B is: the spare or the
> original?

The 'dev' is the RAID device 

> Suppose I have a level 1 RAID with components A and B; B failed. I
> add C as a hot spare (raidctl -a C) and reconstruct on it (raidctl -F
> B) now I have A "optimal", B "spared" and C "used_spare".
> Then I find that B's failure must have been a glitch; do I raidctl -B
> B or raidctl -B C?

So one note about the '-B' option: IIRC, it actually blocks IO to
the RAID set while the copyback is happening.  I think the last time I
looked at this I was of the opinion that the copyback bits should just
be deprecated, as most people never use it, and it actually causes
serious performance issues (i.e. no IO can be done to the set while
copyback is in progress!).  But it's been a while... 

> I suppose that after the copyback, I'll have A and B "optimal" and C
> "spare", right?

I believe that is correct... it's been a long while since I used
copyback...

> What if, during the reconstruction, I get an I/O error on B. I hope
> the reconstruction will simply stop and leave me with A "optimal", B
> "spared" and C "used_spare", right?

Errors encountered during reconstruction are supposed to be gracefully
handled.  That is, if you have A and B in the RAID set, and B fails,
and you encounter an error when reconstructing from A to C, then A will
be left as optimal, B will still be failed, and C will still be a spare.

C will not get marked as 'used_spare' (and be eligible to
auto-configure as the second component) until the reconstruction is
100% successful.

Later...

Greg Oster


Re: RAIDframe: stripe unit per reconstruction unit

2013-10-18 Thread Greg Oster
On Fri, 18 Oct 2013 15:50:55 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

 The man page says:
 
 The stripe units per parity unit and stripe units per reconstruction
 unit are normally each set to 1.  While certain values above 1 are
 permitted, a discussion of valid values and the consequences of using
 anything other than 1 are outside the scope of this document.
 
 I noticed that reconstruction seems to read/write in units of one
 stripe unit, which in my case (being 4k) seems rather inefficient. Is
 it possible to accelerate reconstruction by using a figure larger than
 1 here?

If you go here: http://www.pdl.cmu.edu/RAIDframe/
and pull down the RAIDframe Manual and go to page 73 of that manual
you will find:

 When specifying SUsPerRU, set the number to 1 unless you are
 specifically implementing reconstruction under parity declustering; if
 so, you should read through the reconstruction code first.

The answer might be yes, but I don't know what the other implications
are.  I'd only recommend changing this value on a RAID set you don't
care about, with data you don't care about.  I'd also recommend reading
through the reconstruction code, looking for SUsPerRU...

Later...

Greg Oster


Re: high load, no bottleneck

2013-09-19 Thread Greg Oster
On Thu, 19 Sep 2013 10:29:55 -0400
chris...@zoulas.com (Christos Zoulas) wrote:

 On Sep 19,  8:13am, t...@panix.com (Thor Lancelot Simon) wrote:
 -- Subject: Re: high load, no bottleneck
 
 | On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote:
 |  Emmanuel Dreyfus m...@netbsd.org wrote:
 |  
 |   Thank you for saving my day. But now what happens?
 |   I note the SATA disks are in IDE emulation mode, and not AHCI.
 |   This is something I need to try changing:
 |  
 |  Switched to AHCI. Here is below how hard disks are discovered
 |  (the relevant raid is RAID1 on wd0 and wd1)
 |  
 |  In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads,
 |  on both 6.0 and -current. I assume there must be something bad
 |  with WAPBL/RAIDframe
 | 
 | There is at least one thing: RAIDframe doesn't allow enough
 | simultaneously pending transactions, so everything *really* backs
 | up behind the cache flush.
 | 
 | Fixing that would require allowing RAIDframe to eat more RAM.  Last
 | time I proposed that, I got a rather negative response here.
 
 sysctl to the rescue.

The appropriate 'bit to twiddle' is likely raidPtr->openings.
Increasing the value can be done while holding raidPtr->mutex.
Decreasing the value can also be done while holding raidPtr->mutex, but
will need some care if attempting to decrease it by more than the
number of outstanding IOs.
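
The adjustment itself would be tiny, something along these lines (a
sketch only; the rf_lock_mutex2()/rf_unlock_mutex2() wrapper names and
the hand-waving about in-flight IOs are assumptions):

static void
raid_set_openings_sketch(RF_Raid_t *raidPtr, int new_openings)
{
        rf_lock_mutex2(raidPtr->mutex);
        raidPtr->openings = new_openings;       /* trivial when increasing */
        rf_unlock_mutex2(raidPtr->mutex);
        /* decreasing below the number of outstanding IOs needs more
           care than this sketch shows */
}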

I'm happy to review any changes to this, but won't have time to code it
myself, unfortunately :( 

Later...

Greg Oster


Re: RAIDOUTSTANDING (was: high load, no bottleneck)

2013-09-19 Thread Greg Oster
On Thu, 19 Sep 2013 20:14:33 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

  options  RAIDOUTSTANDING=40 #try and enhance raid performance.
 Is there any downside to this other than memory usage?
 How much does one unit cost?

This is from the comment in src/sys/dev/raidframe/rf_netbsdkintf.c :

/*
 * Allow RAIDOUTSTANDING number of simultaneous IO's to this RAID device.
 * Be aware that large numbers can allow the driver to consume a lot of
 * kernel memory, especially on writes, and in degraded mode reads.
 *
 * For example: with a stripe width of 64 blocks (32k) and 5 disks,
 * a single 64K write will typically require 64K for the old data,
 * 64K for the old parity, and 64K for the new parity, for a total
 * of 192K (if the parity buffer is not re-used immediately).
 * Even if it is used immediately, that's still 128K, which when
 * multiplied by say 10 requests, is 1280K, *on top* of the 640K of
 * incoming data.
 *
 * Now in degraded mode, for example, a 64K read on the above setup may
 * require data reconstruction, which will require *all* of the 4
 * remaining disks to participate -- 4 * 32K/disk == 128K again.
 */

The amount of memory used is actually more than this, but the buffers
are the biggest consumer, and the easiest way to get a ball-park
estimate.  So if you have a RAID 5 set with 12 disks and 32K/disk for
the stripe width, then for a *single* degraded write of 32K you'd need:

 11*32K (for reads) + 32K (for the write) + 32K (for the parity write)

which is 416K.  If you want to allow 40 simultaneous requests, then you
need ~16MB of kernel memory.  Maybe not a big deal on a 32GB machine,
but 40 probably isn't a good default for a machine with 128MB RAM.
Also: remember that this is just one RAID set, and each additional RAID
set on the same machine could use that much memory too.
Also2: this really only matters the most in degraded operation.
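
The arithmetic above as a throwaway userland helper (not kernel code;
the numbers are just the example values from this mail):

#include <stdio.h>

int
main(void)
{
        const unsigned disks = 12;              /* components in the RAID 5 set */
        const unsigned width_kb = 32;           /* stripe width per component */
        const unsigned outstanding = 40;        /* RAIDOUTSTANDING */

        /* degraded write: (disks - 1) reads + data write + parity write */
        unsigned per_req_kb = (disks - 1) * width_kb + width_kb + width_kb;

        printf("per request: %uK, %u requests: ~%uM\n",
            per_req_kb, outstanding, per_req_kb * outstanding / 1024);
        return 0;
}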

I can't think, off-hand, of any downsides other than memory usage..

Later...

Greg Oster


Re: high load, no bottleneck

2013-09-19 Thread Greg Oster
On Thu, 19 Sep 2013 20:53:30 +0200
m...@netbsd.org (Emmanuel Dreyfus) wrote:

 Greg Oster os...@cs.usask.ca wrote:
 
  It's probably easier to do by raidctl right now.  I'm not opposed to
  having RAIDframe grow a sysctl interface as well if folks think that
  makes sense. The 'openings' value is currently set on a per-RAID
  basis, so a sysctl would need to be able to handle individual RAID
  sets as well as overall configuration parameters.
 
 IMO raidctl makes more sense here, as it is the place where one is
 looking for RAID stuff.
 
 While I am there: fsck takes an infinite time while RAIDframe is
 rebuilding parity. I need to renice the raidctl process that does it
 in order to complete fsck. Would raising the outstanding write value
 also help here?

Any additional load you have on the RAID set while rebuilding parity is
just going to make things worse...  What you really want to do is turn
on the parity logging stuff, and reduce the amount of effort spent
checking parity by orders of magnitude...

Later...

Greg Oster


Re: high load, no bottleneck

2013-09-19 Thread Greg Oster
On Thu, 19 Sep 2013 11:26:21 -0700 (PDT)
Paul Goyette p...@whooppee.com wrote:

 On Thu, 19 Sep 2013, Brian Buhrow wrote:
 
  The line I include in my config files is:
 
  options  RAIDOUTSTANDING=40 #try and enhance raid performance.
 
 Is this likely to have any impact on a system with multiple raid-1 
 mirrors?

Yes, it would, provided you have more than 6 concurrent IOs to each RAID
set..

Later...

Greg Oster


Re: high load, no bottleneck

2013-09-19 Thread Greg Oster
On Fri, 20 Sep 2013 01:37:20 +0200
m...@netbsd.org (Emmanuel Dreyfus) wrote:

 Greg Oster os...@cs.usask.ca wrote:
 
  Any additional load you have on the RAID set while rebuilding
  parity is just going to make things worse...  What you really want
  to do is turn on the parity logging stuff, and reduce the amount of
  effort spent checking parity by orders of magnitude...
 
 You mean raidctl -M yes, right?

Correct.

Later...

Greg Oster


Re: Unexpected RAIDframe behavior

2013-09-03 Thread Greg Oster
On Tue, 3 Sep 2013 16:59:28 + (UTC)
John Klos j...@ziaspace.com wrote:

  Parity Re-write is 79% complete.
 
  OK, so this is really more about how parity checking works than
  anything else (i guess.)
 
  for RAID1, it reads both disks and compares them, and if one
  fails it will write the master data.  (more generally, it
  reads all disks and if anything fails parity check it writes
  corrected parity back to it.)
 
 Ah, so a reboot caused RAIDframe to switch from reconstruction to
 parity creation. 

Hmm.. it shouldn't be 'switching'; if reconstruction finished but
you didn't have parity logging turned on, then you'd see the behaviour
you describe.  If reconstruction didn't finish, then you should see one
of the components still marked as 'failed'.

 That explains what was going on. However, it makes
 me wonder if the state of the RAID is not properly being maintained
 through reboot. I didn't really need all that non-zero data.

If the state of the RAID is not being maintained, then that's a bug,
and needs to be fixed right away.  To my knowledge, however, it does
maintain things correctly.  Feel free to file a PR with the specifics
of any failures in this regard...

Later...

Greg Oster


Re: Where is the component queue depth actually used in the raidframe system?

2013-03-14 Thread Greg Oster
On Thu, 14 Mar 2013 10:32:26 -0400
Thor Lancelot Simon t...@panix.com wrote:

 On Wed, Mar 13, 2013 at 09:36:07PM -0400, Thor Lancelot Simon wrote:
  On Wed, Mar 13, 2013 at 03:32:02PM -0700, Brian Buhrow wrote:
 hello.   What I'm seeing is that the underlying disks
   under both a raid1 set and a raid5 set are not seeing anymore
   than 8 active requests at once across the entire bus of disks.
   This leaves a lot of disk bandwidth unused, not to mention less
   than stellar disk performance.  I see that RAIDOUTSTANDING is
   defined as 6 if not otherwise defined, and this suggests that
   this is the limiting factor, rather than the actual number of
   requests allowed to be sent to a component's queue.
  
  It should be the sum of the number of openings on the underlying
  components, divided by the number of data disks in the set.  Well,
  roughly.  Getting it just right is a little harder than that, but I
  think it's obvious how.
 
 Actually, I think the simplest correct answer is that it should be the
 minimum number of openings presented by any individual underlying
 component. I cannot see any good reason why it should be either more
 nor less than that value.

Consider the case when a read spans two stripes...  Unfortunately, each
of those reads will be done independently, requiring two IOs for a given
disk, even though there is only one request.

The reason '6' was picked back in the day was that it seemed to offer
reasonable performance while not requiring a huge amount of memory to
be reserved for the kernel.  And part of the issue there was that
RAIDframe had no way to stop new requests from coming in and consuming
all kernel resources :(  '6' is probably a reasonable hack for older
machines, but if we can come up with something self-tuning I'm all for
it...  (Having this self-tuning is going to be even more critical when
MAXPHYS gets sent to the bitbucket and the amount of memory needed for
a given IO increases...)
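
A sketch of the 'minimum of the component openings' idea (illustrative
only; halving to account for stripe-spanning requests is only a rough
guess, not anything agreed on):

static int
raid_openings_sketch(const int *component_openings, int ncomponents)
{
        int i, m = component_openings[0];

        for (i = 1; i < ncomponents; i++)
                if (component_openings[i] < m)
                        m = component_openings[i];

        /* a single logical request that spans two stripes turns into two
           IOs per disk, so halve the result to stay conservative */
        return m > 1 ? m / 2 : 1;
}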

Later...

Greg Oster


Re: Where is the component queue depth actually used in the raidframe system?

2013-03-13 Thread Greg Oster
On Wed, 13 Mar 2013 10:47:18 -0700
Brian Buhrow buh...@nfbcal.org wrote:

   Hello greg.  In looking into a performance issue I'm having
 with some raid systems relative to their underlying disks, I became
 interested in seeing how the component queue depth affects the
 business of underlying disks.  To my surprise, it looks as if the
 queue depth as defined in the raidx.conf file is never used.  Is that
 really true?  The chain looks like: raidctl(8) sets 
 raidPtr->maxQueueDepth = cfgPtr->maxOutstandingDiskReqs;
 Then, in rf_netbsdkintf.c, we have:
 d_cfg->maxqdepth = raidPtr->maxQueueDepth;
 But I don't see where maxqdepth is ever used again.

Heh... Congrats!  I think you just found more 'leftovers' from when the
simulator bits were removed (before RAIDframe was imported to NetBSD).
In the simulation code maxQueueDepth was also assigned to
threadsPerDisk which was used to fire off multiple requests to the
simulated disk.  In the current code you are correct that maxqdepth is
not used in any real meaningful way...  Unfortunately, we can't just
rip it out without worrying about backward kernel-raidctl
compatibility :( 

When you set maxOutstandingDiskReqs you're really setting
maxOutstanding, and how that influences performance would be
interesting to find out :)  Just be aware that the more requests you
allow to be outstanding the more kernel memory you'll need to have
available... 

Later...

Greg Oster


Re: Raidframe and disk strategy

2012-10-17 Thread Greg Oster
On Wed, 17 Oct 2012 11:26:12 -0700
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

 On Oct 17, 12:03pm, Edgar Fuß wrote:
 } Subject: Re: Raidframe and disk strategy
 } Two more questions on the subject:
 } 
 }  sets the strategy of raidframe to the default strategy for the
 }  system, rather than fcfs.
 } How does that play with the usual ``fifo 100'' in the ``START
 } queue'' section?
 } 
 }  you can easily test various disk sorting strategies
 } Where can I find a description/discussion of the different
 } strategies?
 } 
 } Probably your patch is the cause for our occasional NFS hangs
 } having decreased from thirty seconds to a few seconds.
 -- End of excerpt from Edgar Fuß
 
 
   In answering your question, I find I have a couple for Greg.
 Depending on his comments to my notes below, you may have a couple of
 knobs you can tweak for performance to the raidframe system to get
 even more efficiency out of the system.
   It looks like you can select a number of disk queuing
 strategies within raidframe itself, something I didn't realize.
 There seem to be  5 choices: fifo, cvscan, sstf, scan and cscan.
 Fifo is the one we've been using for years, but the others appear to
 be compiled into the system.  Unless my understanding is completely
 wrong, cscan is the algorithm which most closely aligns with the
 priocscan buffer queue strategy and scan  matches the traditional BSD
 disksort buffer queue strategy.  To set a new disk queueing strategy
 for a given raid set, try the following:

See more below about the next 6 steps...

 1.  If the raid set is configured by  startup scripts at boot time,
 then edit your raid.conf file for that raid set and change the word
 fifo in the start queue section to one of the other choices listed
 above.
 
 2.  Unconfigure the raid and reconfigure it.
 
 3.  If your raid set is configured automatically by the kernel,
 construct a raid.conf file that matches the characteristics of your
 raid set, if you don't already have one, and change the word fifo
 in the start queue section to one of the choices listed above.
 
 4.  Turn off autoconfigure with raidctl -A no on the indicated raid
 set.
 
 5.  Unconfigure and reconfigure the raid set as you did in step 2
 above.
 
 6.  Turn on autoconfigure again with raidctl -A yes or raidctl -A root
 depending on whether your raid set is a root filesystem or just a
 raid set On the system.
 
   For greg:
 Have you played with any of these disk queueing strategies? 

Not in any meaningful way, and not in years.

 Do you know if they work or, more importantly, if they contain 
 huge disk eating bugs? 

My understanding is that they all work, though, again, I've not done
anything resembling exhaustive testing.

 Have you done any bench marking to compare their relative performance?

No.

The other unfortunate thing is that your 6 steps above won't be
permanent -- the component labels don't have a record of what the
queuing strategy currently is.  We probably need another 'raidctl'
option to set raidPtr->qType to whatever type is desired (if you'd
like to write this, you basically need to make sure all IO is quiesced,
do the switch, and then allow IO to happen again.  There is existing
code that does this start/stop, IIRC.)

I'm happy to look over any code additions folks might propose :) 

Later...

Greg Oster


Re: Raidframe and disk strategy

2012-10-17 Thread Greg Oster
On Wed, 17 Oct 2012 12:50:44 -0700
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

  hello.  Thanks for the reply.  I was just looking at the
 raidframe code to determine if the queueing strategy was permanent
 across reboots.  Also, I apologize Edgar, cscan is not a valid
 queueing strategy name. The  four choices are: fifo, cvscan, sstf and
 scan.
 
   As to the possibility of implementing persistent disk queuing
 strategies, I was thinking that the easiest way might be just to add
 an option to store the letter of the queuing strategy in the
 component label, and then having the autoconfigure code set the
 queueing strategy when it runs during the next reboot.  You can
 already achieve a change to the queuing strategy on a raid set by
 unconfiguring it and reconfiguring it. (Yes, that means you have to
 reboot for root raid sets or raid sets that can't be unmounted at run
 time, but that seems safer than trying to do this change on the fly,
 especially since it isn't something one would do often under most
 circumstances.)

Unconfiguring/reconfiguring is a pain...  I think what we really need
is an ioctl or similar to set the strategy (it can just pass in a
config struct with only the strategy field set -- that mechanism is
already there).  The ioctl code can then do the quiesce/change/go
song'n'dance and update component labels accordingly.  New code in the
component label code would read in the strategy (from a newly carved
out field from the component labels) and away we go.  There's lots of
free space remaning in the component labels, and this change would be
backwards compatible with older kernels.  New kernels seeing a '0' in
that spot would simply pick 'fifo' as before.  One question is whether
to use 0,1,2,3 to encode the strategy, or to use letters, or whatever.
But that's just a minor implementation detail.
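
Sketched out, the flow would be roughly the following (every *_sketch
name is hypothetical, and treating qType as a small integer is an
assumption):

/* 0 in the new label field means "not set", i.e. the old fifo behaviour */
enum rf_qstrategy_sketch {
        RF_QS_UNSET = 0, RF_QS_FIFO, RF_QS_CVSCAN, RF_QS_SSTF, RF_QS_SCAN
};

static int
raid_set_qstrategy_sketch(RF_Raid_t *raidPtr, int strategy)
{
        int error;

        if ((error = rf_quiesce_io_sketch(raidPtr)) != 0)
                return error;           /* stop new IO, drain in-flight IO */
        raidPtr->qType = strategy;      /* switch the disk queueing code */
        rf_update_labels_sketch(raidPtr, strategy); /* persist for autoconfig */
        rf_resume_io_sketch(raidPtr);   /* let IO flow again */
        return 0;
}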

   I've got a raid set that I use for backups that I've altered
 to use the scan strategy.  
 
 Thoughts?
 -Brian


Later...

Greg Oster


Re: Raidframe and disk strategy

2012-08-09 Thread Greg Oster
On Thu, 9 Aug 2012 09:45:18 -0700
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   Hello.  I'm not going to claim to be an expert at anything
 here, but grepping through the raidframe sources doesn't show me
 anything that says it sorts requests from the upper layers, nor that
 it sorts requests to the  underlying disks except in the case of when
 it's doing reconstruction or paritymap maintenance.  That said, and
 while I don't have any hard numbers yet, it looks like the patch I
 posted last night yields an instant 16% improvement on throughput on
 one of the backup servers I run. Could someone else try the patch and
 see if they see similar gains?  Below is a shell snippet I use
 in /etc/rc.local to set the strategy for all the attached raid sets
 on a system, in case that's useful for folks.

So there are a number of places sorting can occur here... 

 1) In the bufq_* strategy bits in RAIDframe.
 2) In the rf_diskqueue.c code, where we have fifo, cvscan, shortest
 seek time first, and two elevator algorithms as possible candidates
 for use in disk queueing.
 3) In the underlying component strategies (which may be another
 RAIDframe device, a CCD, vnd, sd, wd, or whatever).

The sorting stuff I was referring to in my previous email (where I said
it could be handled in the config file) was 2).  I wasn't even thinking
about 1).

Now if you're seeing a 16% performance boost, it's likely worth adding
the code you suggest -- it's not much code for a nice bump for folks
who might have a similar IO mix/setup.

Later...

Greg Oster


Re: Raidframe and disk strategy

2012-08-08 Thread Greg Oster
On Wed, 8 Aug 2012 15:07:24 -0700
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   hello.  I've been looking at some disk performance issues
 lately and trying to figure out if there's anything I can do to make
 it better.  (This is under NetBSD/I386 5.1_stable with sources from
 July 18 2012).  During the course of my investigations, I discovered
 the raidframe driver does not implement the DIOCSSTRATEGY or
 DIOCGSTRATEGY ioctls.  Even more interestingly, I notice it's set to
 use the fcfs strategy, and has been doing so since at least
 NetBSD-2.0.  The ccd(4) driver does the same thing. Presumably, the
 underlying disks can use what ever strategy they use for handling
 queued data, but I'm wondering if there is a particular reason the
 fcfs strategy was chosen for the raidframe driver as opposed to
 letting the system administrator pick the strategy?  My particular
 environment has a lot of unrelated reads and writes going on
 simultaneously, and it occurs to me that using a different disk
 strategy than fcfs might mitigate some of these issues.  Were bench
 marks done to pick the best strategy for raidframe and/or ccd or is
 there some other reason I'm missing that implementing a buffer queue
 strategy on top of these devices is a bad idea? -thanks -Brian

The FIFO strategy was the one that seemed to be the best tested at
the time, and since I didn't want to introduce any more variables,
that's the one I went with as a default.  Unfortunately, the RAID
labels currently don't specify the queuing strategy, so the
'autoconfig' sets won't do anything other than FIFO at this time.
Non-autoconfig sets certainly support any of the other queuing
methods in the config file, and that could certainly be used for
testing/benchmarking.

If you'd like to write up support for alternate queuing methods being
specified by the component labels let me know -- it'd be one less thing
on my TODO list :) 

Later...

Greg Oster


Re: why does raidframe retry I/O 5 times

2012-06-25 Thread Greg Oster
On Mon, 25 Jun 2012 11:11:37 +0200
Manuel Bouyer bou...@antioche.eu.org wrote:

 Hello,
 why does raidframe retry failed I/O 5 times ? the wd(4) driver already
 retries 5 times, which means a bad block is retried 25 times.
 While drivers are retrying the system is stalled ...

As I recall, for RAIDframe 5 was just chosen as an arbitrary number.  Do
you know if scsi or other disk drivers also retry 5 times too?  

Later...

Greg Oster


Re: RAIDframe performance vs. stripe size: Test Results

2012-06-12 Thread Greg Oster
On Tue, 12 Jun 2012 17:02:21 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

  Any comments on the results?
 Really no comments?

 Parity re-build:
 32     8       128
 6min   ~15min  5min


  My questions:
  Why does parity re-build take longer with smaller stripes? Is it
  really done one stripe at a time?

So a parity rebuild works by reading all the data and the
existing parity, computing the new parity, and then comparing the
existing parity with the new parity.  If they match, it's on to the
next stripe.  If they differ, the new parity is written out.  No, this
doesn't happen one stripe at a time -- it's much more parallel than
that.
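
The per-stripe check amounts to this (illustrative userland code, not
rf_RewriteParity(); write_parity() stands in for the actual parity write
and the data/parity blocks are assumed to be in memory already):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool
stripe_parity_ok(const unsigned char **data, size_t ndata,
    unsigned char *parity, unsigned char *scratch, size_t len,
    void (*write_parity)(const unsigned char *, size_t))
{
        size_t d, i;

        memset(scratch, 0, len);
        for (d = 0; d < ndata; d++)
                for (i = 0; i < len; i++)
                        scratch[i] ^= data[d][i];       /* XOR of all data blocks */

        if (memcmp(scratch, parity, len) == 0)
                return true;            /* parity already correct: no write */

        memcpy(parity, scratch, len);
        write_parity(parity, len);      /* written only when it differs */
        return false;
}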

What we don't know here is what the state of the array was when you
started the rebuild.  That is, was the parity 'mostly correct'
beforehand? (i.e. saving having to do a lot of the writes).  If it
really was doing one write per stripe, then it can still be much slower
with the smaller stripes -- there's way more overhead, and the amount
of work getting done with each IO is much smaller.

  Why does enabling quotas slow down extraction so much? The test
  data should be ordered by uid in the tar, so quota should be easily
  cachable. Why does the negative impact of atime updates decrease at
  larger block/stripe sizes?
 No answers to my questions either?

I don't know the answers to these questions. 

Later...

Greg Oster


Re: RAIDframe parity rebuild (was: RAIDframe performance vs. stripe size: Test Results)

2012-06-12 Thread Greg Oster
On Tue, 12 Jun 2012 18:34:52 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

  So a parity rebuild works by reading all the data and the
  existing parity, computing the new parity, and then comparing the
  existing parity with the new parity.  If they match, it's on to the
  next stripe.  If they differ, the new parity is written out.
 Oops.
 What's the point of not simply writing out the computed parity?

Writes are typically slower, so not having to do them means the rebuild
goes faster...

  What we don't know here is what the state of the array was when you
  started the rebuild.  That is, was the parity 'mostly correct'
  beforehand? (i.e. saving having to do a lot of the writes).
 I don't know either. If the discs contained a clean RAID set with a
 different SPSU, was the parity correct?

For a basic RAID 5 set yes, it would be.  The exception might be at the
very end of the RAID set where a smaller stripe size might allow you to
use a bit more of each component.

Later...

Greg Oster


Re: RAIDframe parity rebuild (was: RAIDframe performance vs. stripe size: Test Results)

2012-06-12 Thread Greg Oster
On Tue, 12 Jun 2012 13:20:27 -0400
Thor Lancelot Simon t...@panix.com wrote:

 On Tue, Jun 12, 2012 at 10:40:47AM -0600, Greg Oster wrote:
  On Tue, 12 Jun 2012 18:34:52 +0200
  Edgar Fuß e...@math.uni-bonn.de wrote:
  
So a parity rebuild works by reading all the data and the
existing parity, computing the new parity, and then comparing the
existing parity with the new parity.  If they match, it's on to
the next stripe.  If they differ, the new parity is written out.
   Oops.
   What's the point of not simply writing out the computed parity?
  
  Writes are typically slower, so not having to do them means the
  rebuild goes faster...
 
 Are writes to the underlying disk really typically slower?  It's easy
 to see why writes to the RAID set itself would be slower, but
 sequential disk write throughput is usually pretty darned close to --
 if not better than -- read throughput these days, isn't it?

It's been a while since I've checked, and the current generation of 2TB
and 3TB disks may be significantly better than the 1TB disks... I also
don't know what the disks are doing with their write-back/write-through
caches these days either.  

 If you don't know the set's blank, I guess you do have to read the
 existing data.  Maybe that limits how much win can really be had here.

That, and someone would have to change the code :)

Later...

Greg Oster


Re: Strange problem with raidframe under NetBSD-5.1

2012-06-12 Thread Greg Oster
On Tue, 12 Jun 2012 14:44:55 -0700
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   Hello.  I've just encountered a strange problem with
 raidframe under NetBSD-5.1 that I can't immediately explain.
 
  this machine has been running a raid set since 2007.  The raid
 set was originally constructed under NetBSD-3.  For the past year,
 it's been running 5.0_stable with sources from 
 July 2009 or so without a problem.  Last night, I installed
 NetBSD-5.1 with sources from May 23 2012 or so.  Now, the raid0 set
 fails the first component with an i/o error with no corresponding
 disk  errors underneath. Trying to reconstruct to the failed
 component also fails with an error of 22, invalid argument.  Looking
 at the dmesg output compared with the output of raidctl -s reveals
 the problem.  The size of the raid in the dmesg output is bogus, and,
 if the raid driver tries to write as many blocks as is reported by
 the configuration output, it will surely fail as it does. However,
 raidctl -g /dev/wd0a looks ok and the underlying disk label
 on /dev/wd0a looks ok as well. Where does the raid driver get the
 numbers it reports on bootup? Also, there is a second raid set on
 this machine, the second half of the same two drives, which was
 constructed at the same time.  It works fine with the new code.
 
   Below is the output of the boot sequence before the upgrade,
 and then the boot sequence after the upgrade.  Below that are the
 output of  raidctl -s raid0 and raidctl -g /dev/wd0a raid0.
   It looks to me like something is not zero'd out in the
 component label that should be, but some change in the raid code is
 no longer ignoring the noise in the component label.

Correct.

 Any ideas?

There was some code added a while back to handle components whose sizes
were larger than 32-bit.  But 5.1_stable should have the code to handle
those 'bogus' values in the component label and do the appropriate
thing (see rf_fix_old_label_size in rf_netbsdkintf.c version
1.250.4.11, for example).

What is your code rev for src/sys/dev/raidframe/rf_netbsdkintf.c ?

Later...

Greg Oster


Re: Strange problem with raidframe under NetBSD-5.1

2012-06-12 Thread Greg Oster
On Tue, 12 Jun 2012 16:11:03 -0700
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   hello Greg.  I just updated to the latest 5.1 tree but I
 don't see the change you note in that update.  I see the commit in
 the cvs logs, but it doesn't look like it made it into the NetBSD-5
 branch.  The latest version I see, even after combing through the
 source-changes archives on the www.netbsd.org site is ...2.44.8 which
 was a fix for a bug I reported with wedges and raidframe some time
 ago.  I could be missing something, and I probably am, but  it's not
 obvious to me.  Could you  look to see if you see it on the NetBSD-5
 branch? 

I don't think it's in the 5.1 tree.. The 1.250.4.11 version I quoted
is from the netbsd-5 branch...  The rev you actually want is 1.250.4.10
as from here:
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/raidframe/rf_netbsdkintf.c?only_with_tag=netbsd-5

But as mrg has pointed out, you need the partitionSizeHi fix too...

Later...

Greg Oster


 On Jun 12,  3:30pm, Brian Buhrow wrote:
 } Subject: Re: Strange problem with raidframe under NetBSD-5.1
 } Hello.  That appears to be the problem.  I thought I updated my 5.1
 } sources, but I've been doing so much patching, testing and patching
 } with respect to the ffs fixes, that I guess I didn't actually get the
 } latest sources.  Doing that now.  I think/hope that will fix me up.
 } 
 } -thanks
 } -Brian
 } On Jun 12,  4:14pm, Greg Oster wrote:
 } } Subject: Re: Strange problem with raidframe under NetBSD-5.1
 } } On Tue, 12 Jun 2012 14:44:55 -0700
 } } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:
 } } 
 } }  Hello.  I've just encountered a strange problem with
 } }  raidframe under NetBSD-5.1 that I can't immediately explain.
 } }  
 } }  this machine has been runing a raid set since 2007.
 The raid } }  set was originally constructed under NetBSD-3.  For
 the past year, } }  it's been running 5.0_stable with sources from 
 } }  July 2009 or so without a problem.  Last night, I
 installed } }  NetBSD-5.1 with sources from May 23 2012 or so.  Now,
 the raid0 set } }  fails the first component with an i/o error with
 no corresponding } }  disk  errors underneath. Trying to reconstruct
 to the failed } }  component also fails with an error of 22, invalid
 argument.  Looking } }  at the dmesg output compared with the output
 of raidctl -s reveals } }  the problem.  The size of the raid in the
 dmesg output is bogus, and, } }  if the raid driver tries to write
 as many blocks as is reported by } }  the configuration output, it
 will surely fail as it does. However, } }  raidctl -g /dev/wd0a
 looks ok and the underlying disk label } }  on /dev/wd0a looks ok as
 well. Where does the raid driver get the } }  numbers it reports on
 bootup? Also, there is a second raid set on } }  this machine, the
 second half of the same two drives, which was } }  constructed at
 the same time.  It works fine with the new code. } }  
 } }  Below is the output of the boot sequence before the
 upgrade, } }  and then the boot sequence after the upgrade.  Below
 that are the } }  output of  raidctl -s raid0 and raidctl
 -g /dev/wd0a raid0. } }  It looks to me like something is
 not zero'd out in the } }  component label that should be, but some
 change in the raid code is } }  no longer ignoring the noise in the
 component label. } } 
 } } Correct.
 } } 
 } }  Any ideas?
 } } 
 } } There was some code added a while back to handle components whose
 sizes } } were larger than 32-bit.  But 5.1_stable should have the
 code to handle } } those 'bogus' values in the component label and do
 the appropriate } } thing (see rf_fix_old_label_size in
 rf_netbsdkintf.c version } } 1.250.4.11, for example).
 } } 
 } } What is your code rev for src/sys/dev/raidframe/rf_netbsdkintf.c ?
 } } 
 } } Later...
 } } 
 } } Greg Oster
 } -- End of excerpt from Greg Oster
 } 
 } 
 -- End of excerpt from Brian Buhrow
 


Later...

Greg Oster


Re: RAIDframe performance vs. stripe size

2012-05-11 Thread Greg Oster
On Fri, 11 May 2012 12:48:08 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

  Edgar is describing the desideratum for a minimum-latency
  application.
 Yes, I'm looking for minimum latency.
 I've logged the current file server's disc business and the only
 time they really are busy is during the nightly backup. I suppose
 that mainly consists of random reads traversing the directory tree
 followed by random reads transferring the modified files (which, I
 suppose, are typically small).
 
 Of course I finally will have to experiment, but a) I'm used to
 trying to understand the theory first and b) the rate of experiments
 will probably be limited to one per day or less given the amount of
 real data I would have to re-transfer each time.
 
 Since my understanding of the NetBSD I/O system may be incorrect,
 incomplete or whatever, I better try to ask people first who know
 better:
 
 Given 6.0/amd64 and a RAID 5 accross 4+1 SAS discs.
 Suppose I have a 16k fsbsize and a stripe size such that there is one
 full fs block per disc (i.e. the stripe is 4*16k large). Suppose
 everything is correctly aligned (apart from where I say it isn't, of
 course).
 
 Suppose there is no disc I/O apart from that I describe.

Ok :) 

 Scenario A:
 I have two processes each reading a single file system block. Those
 two blocks happen to end up on two different discs (there is a 3/4
 probability for that being true). Will I end up with those two discs
 actually seeking in parallel?

Yes.  Absolutely.

 Scenario B:
 I have one process reading a 64k chunk that is 16k-aligned, but not
 64k-aligned (so it's not just reading a full stripe). Will I end up
 with four discs seeking and reading in parallel?

Yes.  Aligned or not, there will be 4 discs busy.  (think of it as
reading the last part of one stripe, and the first part of another.
Those are always aligned to be non-overlapping...)

 I.e. will this be
 degraded wrt. a stripe-aligned read?

No.  Consider the following layout:

 d0  d1  d2  d3  p0
 d5  d6  d7  p1  d4
 d10 d11 p2  d8  d9
 d15 p3  d12 d13 d14   
 p4  d16 d17 d18 d19

where dn is a 16K datablock n of the RAID set and px is the parity
block for stripe x.  (This is what the layout will be for your
configuration.)

If you read 64K that is stripe-aligned, then you would maybe be reading
d0-d3 or d4-d7 or d8-d11 or d12-d15 or d16-d19.  As you can see, all of
those span all 4 discs.  If you read something that isn't stripe-aligned
(e.g. d1-d4 or d7-d10 or d11-d14 or whatever) you also see that
those IOs are evenly distributed among the discs.
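To make the even distribution concrete, the toy program below reproduces
the mapping in the table above for a 4+1 set.  It assumes the parity
column rotates one step to the left per stripe and the data columns fill
in starting just after parity, which is what the diagram shows; it is not
RAIDframe's layout code.

    #include <stdio.h>

    #define NCOL  5               /* 4 data columns + 1 parity, as above */
    #define NDATA (NCOL - 1)

    /* Column holding parity for a given stripe (rotates left, per the table). */
    static int
    parity_col(int stripe)
    {
            return NCOL - 1 - (stripe % NCOL);
    }

    /* Column holding data block n (d0, d1, ...) in the layout above. */
    static int
    data_col(int n)
    {
            int stripe = n / NDATA;
            int i = n % NDATA;    /* position within the stripe */

            return (parity_col(stripe) + 1 + i) % NCOL;
    }

    int
    main(void)
    {
            int n;

            for (n = 0; n < 20; n++)
                    printf("d%-2d -> column %d (stripe %d)\n",
                        n, data_col(n), n / NDATA);
            return 0;
    }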

 Scenario C:
 I have one process doing something largely resulting in meta-data
 reads (i.e. traversing a very large directory tree). Will the kernel
 only issue sequential reads or will it be able to parallelise, e.g.
 reading indirect blocks?

I don't know the answer to this off the top of my head... 

 What will change if I scale all sizes by a factor of, say, 4, such
 that the full stripe exceeds MAXPHYS?

RAIDframe won't be able to directly handle a request larger than
MAXPHYS, so a single MAXPHYS-sized request to RAIDframe will end up
accessing only a single component.  The parallelism seen in A will
still be there, but not the parallelism in B.  

 Sorry if the answers to those question are obvious from what has
 already been written so far for someone more familiar with the matter
 than I am.


Later...

Greg Oster


Re: RAIDframe performance vs. stripe size

2012-05-11 Thread Greg Oster
On Fri, 11 May 2012 17:05:24 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

 Thanks a lot for your detailed answers.
 
  Yes.  Absolutely.
 Fine.
 
  As you can see, all of those span all 4 discs.
 Yes, that was perfectly clear to me. What I wasn't sure of was that
 the whole stack of subsystems involved would really be able to make
 use of that. Thanks for confirming it actually does.
 
 EF I have one process doing something largely resulting in meta-data
 EF reads (i.e. traversing a very large directory tree). Will the
 EF kernel only issue sequential reads or will it be able to
 EF parallelise, e.g. reading indirect blocks?
 GO I don't know the answer to this off the top of my head... 
 Oops, any file-system experts round here?
 
  The parallelism seen in A will still be there, but not the
  parallelism in B.  
 Thanks.
 
 One last, probably stupid, question: Why doesn't one use raw devices
 as RAIDframe components? Doesn't all data pass through the buffer
 cache twice when using block devices?

I think you'd asked that before, but I didn't get to responding, and
no-one else did either...

The shortest answer is that back in the day when RAIDframe arrived it
was made to handle 'underlying components' just like CCD did... and
using the VOP_* interfaces meant things could be layered without
worrying about what other devices or pseudo-devices were above or below
the RAIDframe layer.  Yes, things might get (initially) cached on a
couple of different levels, but as those 'bp's get recycled it'll be
the lower copies that get recycled first and you should just end up
with only a single cached copy of popular items...

I think it's one of those things where you trade a bit of duplication
for flexibility...

Does that help?

Later...

Greg Oster


Re: RAIDframe performance vs. stripe size

2012-05-10 Thread Greg Oster
On Thu, 10 May 2012 17:46:38 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

 Does anyone have some real-world experience with RAIDframe (Level 5)
 performance vs. stripe size?
 My impression would be that, with a not too large number of components
 (4+1, in my case), chances are rather low to spread simultaneous
 accesses to different physical discs, so the best choice seems the
 file system's block size.
 On the other hand, the Carnegie Mellon paper suggests something around
   1/2 * (throughput) * (access time),
 which would amount to more than a Megabyte with my discs.
 So, what are the real-world benefits of a large stripe size? My
 application are home and mail directories exported via NFS.

There have been various discussions of this on the mailing lists over
the years... 

The one issue you'll find is that RAIDframe still suffers from the 64K
MAXPHYS limitation -- RAIDframe will only be handed 64K at a time, and
from there it hands chunks of that out to each component.  So if you
have a 4+1 RAID 5 set, if you make the stripe unit 32 sectors (16K) wide,
that will allow a single 64K IO to hit all 4+1 disks at the same time.
Of course, you'll want to make sure that your filesystem is aligned so
that those 64K writes are stripe aligned, and you'll also probably want
to use fairly large block sizes (16K/64K frag/block) too...
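As a back-of-the-envelope check of those numbers (plain arithmetic,
assuming 512-byte sectors and the 64K MAXPHYS mentioned above -- not
RAIDframe code):

    #include <stdio.h>

    int
    main(void)
    {
            const int sector  = 512;
            const int maxphys = 64 * 1024;   /* the 64K I/O ceiling above */
            const int ndata   = 4;           /* data columns in a 4+1 set */

            int su_bytes    = maxphys / ndata;     /* per-component chunk */
            int sect_per_su = su_bytes / sector;   /* raidctl's sectPerSU */

            /* prints: stripe unit 16384 bytes (32 sectors), full stripe 65536 bytes */
            printf("stripe unit %d bytes (%d sectors), full stripe %d bytes\n",
                su_bytes, sect_per_su, su_bytes * ndata);
            return 0;
    }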

Later...

Greg Oster


Re: RAIDframe performance vs. stripe size

2012-05-10 Thread Greg Oster
On Thu, 10 May 2012 18:59:42 +0200
Edgar Fuß e...@math.uni-bonn.de wrote:

 I don't know whether I'm getting this right.
 
 In my understanding, the benefit of a large stripe size lies in
 parallelisation:

Correct.

 Suppose the stripe size is such that a file system
 block fits on a single disc, i.e. stripe size = (file system block
 size)*(number of effective discs). Then, if one (read) transfer
 makes disc A seek to block X, there is a good chance that the next
 transfer can be satisfied from disc B != A, making disc B seek to
 block Y in parallel. Or will this (issuing requests to different
 discs in parallel) not happen on NetBSD?

Unless something issues the command to disc B to fetch block Y in
parallel, it won't happen.  That is, if you simply make a request for
block X from disc A, there's nothing in the RAIDframe code that will
have it automatically go looking to B for Y.

What you're typically looking for in the parallelization is that a
given IO will span all of the components.  In that way, if you have n
components, and the transfer would normally take t amount of time, then
the total time gets reduced to just t/n (plus some overhead) because
you are able to do IO to/from all components at the same time.
RAIDframe will definitely do this whenever it can.

Later...

Greg Oster


Re: RAIDframe performance vs. stripe size

2012-05-10 Thread Greg Oster
On Thu, 10 May 2012 13:23:24 -0400
Thor Lancelot Simon t...@panix.com wrote:

 On Thu, May 10, 2012 at 11:15:09AM -0600, Greg Oster wrote:
  
  What you're typically looking for in the parallelization is that a
  given IO will span all of the components.  In that way, if you have
  n
 
 That's not what I'm typically looking for.  You're describing the
 desideratum for a maximum-throughput application.  Edgar is describing
 the desideratum for a minimum-latency application.  No?

I think what I describe still works for minimum-latency too...  where
it doesn't work is when your IO is so small that the time to actually
transfer the data is totally dominated by the time to seek to the data.
In that case you're better off in just going to a single component
instead of having n components all moving their heads around to grab
those few bytes (especially true where there are lots of simultaneous
IOs happening).

Of course, this is all still theoretical -- the best thing is to
experiment with different RAID settings and real workloads to see what
works best for the particular applications...

Later...

Greg Oster


Re: RAIDframe performance vs. stripe size

2012-05-10 Thread Greg Oster
On Thu, 10 May 2012 14:06:11 -0400
Thor Lancelot Simon t...@panix.com wrote:

 On Thu, May 10, 2012 at 11:47:36AM -0600, Greg Oster wrote:
  On Thu, 10 May 2012 13:23:24 -0400
  Thor Lancelot Simon t...@panix.com wrote:
  
   On Thu, May 10, 2012 at 11:15:09AM -0600, Greg Oster wrote:

What you're typically looking for in the parallelization is
that a given IO will span all of the components.  In that way,
if you have n
   
   That's not what I'm typically looking for.  You're describing the
   desideratum for a maximum-throughput application.  Edgar is
   describing the desideratum for a minimum-latency application.  No?
  
  I think what I describe still works for minimum-latency too...
  where it doesn't work is when your IO is so small that the time to
  actually transfer the data is totally dominated by the time to seek
  to the data.
 
 What if I have 8 simultaneous, unrelated streams of I/O, on a 9
 data-disk set? 

That's the "lots of simultaneous IOs happening" part of the bit
you cut out:

 In that case you're better off in just going to a single component
 instead of having n components all moving their heads around to grab
 those few bytes (especially true where there are lots of simultaneous
 IOs happening).

 Like, say, 8 CVS clients all at different points
 fetching a repository that is too big to fit in RAM?
 
 If the I/Os are all smaller than a stripe size, the heads should be
 able to service them in parallel.

Doing reads, where somehow each of the 8 reads is magically on a
disk independent of the others, is a pretty specific use-case ;) 

If you were talking about writes here, then you still have the
Read-modify-write thing to contend with -- and now instead of just
doing IO to a single disk, you're moving two other disk heads to read
the old data and old parity, and then moving them back again to do the
write... 

Reads I agree - you can get those done in parallel with no interference
(assuming the reads are somehow aligned so as to evenly distribute the
load across all disks). Write to anything other than a full stripe,
though, and it gets really expensive really fast...

 If they are stripe size or larger, they will have to be serviced in
 sequence -- it will take 8 times as long.

Reads, given the configuration and assumptions you have suggested,
would certainly take longer.

Writes would be a different story (how the drive does caching
might be the determining factor?).  

Writing as full stripes, each disk would end up with 8 IOs (one for each
of the 8 writes) -- 64 writes in all.

Writing to a block on each of the 8 disks would actually end up with:
 a) 8 reads to each of the disks to fetch the old blocks, 

 b) 8 reads to one of the other disks to get the old parity.  This
 will also mess up the head positions for the reads happening in a),
 even though both a) and b) will fire at the same time. 

 c) 8 writes to each of the disks to write out the new data and, 
 
 d) 8 writes to each of the disks to write out the new parity.

That's 128 reads and 128 writes (and the writes have to wait on the
reads!) to do the work that the striped system could do with just
64 writes...
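The same comparison in the textbook RAID 5 cost model (a sketch of the
general shape of the argument, not an exact replay of the counts above:
the 64 above counts only the data columns, while the model below also
counts the 8 parity writes of the full-stripe case):

    #include <stdio.h>

    struct io_cost { int reads; int writes; };

    /* Cost of writing nblocks as individual small (sub-stripe) writes:
       read old data + old parity, write new data + new parity. */
    static struct io_cost
    small_write_cost(int nblocks)
    {
            struct io_cost c = { 2 * nblocks, 2 * nblocks };
            return c;
    }

    /* Cost of writing nstripes full stripes on an ndata+1 set:
       parity is computed from the new data, so no reads are needed. */
    static struct io_cost
    full_stripe_cost(int nstripes, int ndata)
    {
            struct io_cost c = { 0, nstripes * (ndata + 1) };
            return c;
    }

    int
    main(void)
    {
            /* 8 streams each writing 8 blocks' worth on an 8-data-disk set */
            struct io_cost s = small_write_cost(8 * 8);
            struct io_cost f = full_stripe_cost(8, 8);

            printf("small writes:  %d reads + %d writes\n", s.reads, s.writes);
            printf("full stripes:  %d reads + %d writes\n", f.reads, f.writes);
            return 0;
    }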

 In practice, this is why I often layer a ccd with a huge (and prime)
 stripe size over RAIDframe.  It's also a good use case for LVMs.
 But it should be possible to do it entirely at the RAID layer through
 proper stripe size selection.  In this regard RAIDframe seems to be
 optimized for throughput alone.

See my point before about optimizing for one's own particular
workloads :)  (e.g. how you might optimize for a mostly read-only CVS
repository will be completely different from an application where the
mixture of reads and writes is more balanced...)

Later...

Greg Oster


Re: disklabel on RAID disappeared

2012-03-02 Thread Greg Oster
On Fri, 2 Mar 2012 20:04:56 +0100
Edgar Fuß edgar.f...@bn2.maus.net wrote:

 Help, there's something weird going on on our fileserver!
 I'm on vacation and had a colleague do this over the phone.
 Please CC me in replies because I don't have access to my regular
 mail.
 
 raid0 is level 1, sd0a sd1a
 raid1 is level 5, sd2a .. sd9a
 sd0/1 are scsibus0, targets 0/1
 sd2..9 are scsibus1, targets 0..8
 
 The machine paniced
 After reboot, parity rewrite on raid0 succeded and failed on raid1
 because of a read error on sd2a.
 He did scsictl stop sd2, scsictl detach scsibus1 0 0, replaced sd2,
 scsictl scan scsibus1 0 0. Something strange must have happened and
 sd2 was async.
 He nevertheless started the reconstruction (raidctl -R sd2a raid1),
 but raidctl -S estimated 24 hours.
 I asked him to stop the reconstruction, but neither failing sd2 nor
 detaching scsibus1 0 0 stopped it. Shortly after, the machine paniced
 again. It came up with raid0 and raid1 configured correctly, but fsck
 raid1a failed. We now have no disklabel on raid1 (disklabel -r says
 something about not being able to read it and disklabel without -r
 shows the fabricated one). Since fsck raid1a said something like
 incorrect fs size, I assume the superblock of raid1a is still
 there, only the disklabel is broken.
 
 Any hints? He is currently running the reconstruction and we'll see
 whether the disklabel returns or what happens if we re-write it from
 the backup we have in /var.

Panic messages and raid-related info from /var/log/messages would help
here.  Also the NetBSD version would help too.

 Is there a sane way to stop an on-going reconstruction?

Hmm... no.

 May trying to stop it have corrupted the raid1 contents?

No... I expect the disklabel for raid1 would have been physically
living on sd2a, but it should have been recoverable from the data and
parity on the remaining disks.  You don't say what arch you're on, but
if you use 'dd if=/dev/rraid1a' to go hunting, do you find something
that looks like a disklabel, or is it just garbage? (I'm guessing the
latter...)

Later...

Greg Oster


Re: raidframe questions

2012-02-23 Thread Greg Oster
On Thu, 23 Feb 2012 16:18:22 +0100
Jean-Yves Moulin j...@eileo.net wrote:

 Hi everybody,
 
 I was using two 500GB disks in a raid1 setup. One of my disks died. I
 replaced it with a bigger disk (600GB) and rebuilt the raid. Then the
 second 500GB disk died. I replaced it also with a 600GB disk and
 rebuilt the raid. Now I have two 600GB disks in my raid1 but the size
 is still shrunk to 500GB. I made a mistake when creating the RAID label
 on the 600GB disks: it takes the entire disk.
 
 I want to know if I can recover my lost 100GB...
 
 disklabel for sd0:
  a: 1172123505        63       RAID   # (Cyl. 0* - 101045*)
 disklabel for sd1:
  a: 1172123505        63       RAID   # (Cyl. 0* - 101045*)
 disklabel for raid0:
  d: 976772992          0     unused   0  0   # (Cyl. 0 - 953879*)
 
 And raidframe message for shrinking:
 Warning: truncating spare disk /dev/sd1a to 976772992 blocks (from
 1172123441)
 
 
 I read that changing raid size is not possible. But can I change
 label size for sd0a and sd1a to the same size as raid0d (from
 1172123505 to 976772992), and then, can I use the recovered space for
 a second raid array?

I'd want to test this somewhere where the data doesn't matter as much,
but yes, this should be doable.  Note that you'd want to change the
size from 1172123505 to 976773056 (not 976772992) because you need to
leave room for the component label (64 reserved blocks at the beginning
of the partition).
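A quick sanity check of that arithmetic (the 64-sector figure is the
reserved area mentioned above; the constant name here is just a label,
not necessarily the RAIDframe macro):

    #include <stdio.h>
    #include <stdint.h>

    #define RESERVED_SECTORS 64      /* reserved at the start of a component */

    int
    main(void)
    {
            uint64_t raid_sectors  = 976772992;    /* size of raid0d         */
            uint64_t new_component = raid_sectors + RESERVED_SECTORS;
            uint64_t old_component = 1172123505;   /* current sd0a/sd1a size */

            /* prints: shrink components to 976773056 sectors, ... */
            printf("shrink components to %llu sectors, freeing %llu sectors\n",
                (unsigned long long)new_component,
                (unsigned long long)(old_component - new_component));
            return 0;
    }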

Later...

Greg Oster


Re: RAIDframe reconstruction

2012-02-19 Thread Greg Oster
On Mon, 20 Feb 2012 03:20:32 +0100
Edgar Fuß e...@math.uni-bonn.de wrote:

 I have raid1 consisting of sd2a..sd9a. sd3a failed, and I did
 something presumably stupid: after detach-ing sd3 and scan-ing a
 replacement, instead of raidctl -R /dev/sd3a raid1, I did raidctl
 -a /dev/sd3a raid1, and, as that didn't start a reconstruction,
 raidctl -F /dev/sd3a raid1, which did start a reconstruction.
 
 After the reconstruction succeeded, RAIDframe seems to be confused
 about the state of sd3a:

Hmm... interesting that what you did even worked... I'd have
thought /dev/sd3a would have still been held 'open' by the RAID set,
and that the hot-add would have failed.

 Components:
/dev/sd2a: optimal
/dev/sd3a: spared
/dev/sd4a: optimal
/dev/sd5a: optimal
/dev/sd6a: optimal
/dev/sd7a: optimal
/dev/sd8a: optimal
/dev/sd9a: optimal
 Spares:
/dev/sd3a: used_spare

 
 How do I get out of this? Would unconfiguring raid1 and
 re-configuring it work? Or how can I make sure what has really been
 written to the component label of sd3a?

If you're using the autoconfigure stuff (which you should be) then a
simple reboot should fix things up. (it will write out the correct 
component label for sd3a as part of unconfiguring /dev/raid1 )

Later...

Greg Oster


Re: New boothowto flag to prevent raid auto-root-configuration

2011-04-18 Thread Greg Oster
On Mon, 18 Apr 2011 13:06:23 +0200
Klaus . Heinz k.he...@aprelf.kh-22.de wrote:

 Martin Husemann wrote:
 
  as described in PR 44774 (see
  http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=44774),
  it is currently not possible to use a standard NetBSD install CD on
  a system which normally boots from raid (at least on i386, amd64 or
  sparc64, where a stock GENERIC kernel is used).
 
 It looks like this is a similar problem to the one I raised here
 
   http://mail-index.NetBSD.org/tech-kern/2009/10/31/msg006410.html
 
 Instead of providing RB_NO_ROOT_OVERRIDE I would prefer something that
 actually _lets_ me override everything else from boot.cfg.

So how about adding another flag like RB_ROOT_EXPLICIT or
RB_EXPLICIT_ROOT with the idea being that the user has to explicitly
specify what the root is going to be?  I think that addresses 1) below
and would hopefully handle the pxeboot and NFS root situations.  Maybe
something like RB_SET_ROOT with a required option of NFS or raid0
or wd0a or pxe or whatever would be the way to go?

I've never liked the 'yank root away from the boot device that the
system thinks it maybe booted from' hack in RAIDframe, so anything we
can do to make it better is fine by me

We may need both RB_NO_ROOT_OVERRIDE and this new flag in order to get
everything covered, and I'm fine with that too...

 Quoting Robert Elz from the mentioned thread:
 
  FWIW, I think the code to allow the user to override that at boot
  time would be a useful addition - while it is possible to boot,
  raidctl -A yes or raidctl -A no, and then reboot, or perhaps
  boot -a on systems that support that, but that's a painful
  sequence of operations for what should be a simple task.
  
  It should always be
    1. what the user explicitly asks for
    2. what the kernel has built in
    3. hacks like raidctl -A root (or perhaps similar things for cgd etc)
    4. where I think I came from
 
 
 ciao
  Klaus


Later...

Greg Oster


Re: partitionSizeHi in raidframe component label, take 2 (was Re: partitionSizeHi in raidframe component label)

2011-03-18 Thread Greg Oster
On Fri, 18 Mar 2011 20:57:49 +1100
matthew green m...@eterna.com.au wrote:

 
 this patch seems to fix my problem.  i've moved the call to fix
 the label into rf_reasonable_label() itself, right before the
 valid return, and made the fix up also fix the partitionSizeHi.
 
 greg, what do you think?

Looks good to me.  Go for it.

Later...

Greg Oster


Re: Problems with raidframe under NetBSD-5.1/i386

2011-02-23 Thread Greg Oster
On Fri, 18 Feb 2011 13:09:18 -0800
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   Hello.  It's been a while since I had an opportunity to work
 on this problem.  However, I have figured out the trouble.  While the
 error is mine, I do have a couple of questions as to why I didn't
 discover it sooner.
   It turns out that I had fat fingered the disktab entry I used
 to disklabel the component disks such that the start of the raid
 partition was at offset 0 relative to the entire disk, rather than
 offset 63, which is what I normally use to work around PC BIOS
 routines and the like.  Once I figured that out, the error I was
 getting made sense. With this in mind, my question and suggestion are
 as follows:
 
 1.  It makes sense to me that I would get an EROFS error if I try to
 reconstruct to a protected portion of the component disk.  What
 doesn't make sense to me is why I could create the working raid set
 in the first place?  Why didn't I run into this error when writing
 the initial component labels?  Another symptom of this issue,
 although I didn't know about it at the time, is that components  of
 my newly created raid sets would fail with an i/o failure, without
 any apparent whining from the component disk itself. I think now that
 this was because the raid driver was trying to update some portion of
 the component label and failing in the same way. Ok, my bad for
 getting my offsets wrong in the disklabel for the component disks,
 but can't we make it so this fails immediately upon raid creation
 rather than having the trouble exhibit itself as apparently
 unexplained component disk failures?

I really don't get why the creation of the raid set would have
succeeded before, but not afterwards... Was the RAID set created in
single-user mode or from sysinst or something?  Is there some
'securelevel' thing coming into play?  I'm just guessing here, as this
makes no sense to me :(  (The thing is: RAIDframe shouldn't be touching
any of those 'protected' areas of the disk anyway... the first 64
blocks are reserved, with the component label and such being at the
half-way point.  So even if you used an offset of 0, it would have only
been looking to touch blocks 32 and 33 (for parity logging) so
unless something is protecting all of the first 63 blocks it shouldn't
be complaining :( )
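For reference, the geometry being described works out to something like
this (illustrative defines based on the description above -- 64 reserved
sectors at the front of the component with the label area at the halfway
point -- not the actual RAIDframe macro names):

    #define RESERVED_SECTORS   64                      /* reserved at the start  */
    #define LABEL_SECTOR       (RESERVED_SECTORS / 2)  /* label lives at 32..33  */
    #define DATA_START         RESERVED_SECTORS        /* RAID data starts at 64 */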

 2. I'd like to suggest the following quick patch to the raid driver
 to help make the diagnosis of component failures easier.  Thoughts?

The patch looks fine, and quite useful.  Please commit.

Later...

Greg Oster


Re: raid of queue length 0

2011-02-23 Thread Greg Oster
On Mon, 21 Feb 2011 13:02:05 +0900 (JST)
enami tsugutomo tsugutomo.en...@jp.sony.com wrote:

 Hi, all.
 
 It is possible to create a raid with a queue (fifo) length of 0 (see
 the raidctl -G output below), and it looks as if it works, but
 reconstruction stalls once normal I/O interferes with it.
 
 Does such configuration make sense?  Otherwise raidctl(8) shouldn't
 allow it.

It doesn't make sense.  Configuration should fail for a queue length of
0.  Best place to fix this is probably in rf_driver.c right after the
line:

 raidPtr->maxOutstanding = cfgPtr->maxOutstandingDiskReqs;

In there, something like:

  if (raidPtr->maxOutstanding <= 0) {
          DO_RAID_FAIL();
          return 1;
  }

will probably do the trick.  (return code probably needs to be changed,
and perhaps some sort of warning should be printed, but that's the
general idea...)
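Spelled out a little more fully, with the warning and return-code change
suggested above, the check might look something like this (a sketch
against the fragment quoted above, not a tested patch; the raidid field
used in the message is assumed):

    raidPtr->maxOutstanding = cfgPtr->maxOutstandingDiskReqs;

    if (raidPtr->maxOutstanding <= 0) {
            /* A zero-length fifo can never drain; refuse to configure. */
            printf("raid%d: invalid queue length %d\n",
                raidPtr->raidid, raidPtr->maxOutstanding);
            DO_RAID_FAIL();
            return (EINVAL);
    }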

Later...

Greg Oster

 # raidctl -G raid1  
 # raidctl config file for /dev/rraid1d
 
 START array
 # numRow numCol numSpare
 1 2 0
 
 START disks
 /dev/vnd0a
 /dev/vnd1a
 
 START layout
 # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_1
 128 1 1 1
 
 START queue
 fifo 0


Re: Problems with raidframe under NetBSD-5.1/i386

2011-01-21 Thread Greg Oster
On Thu, 20 Jan 2011 17:28:21 -0800
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   hello.  I got side tracked from this problem for a while, but
 I'm back to looking at it as I have time.
   I think I may have been barking up the wrong tree with
 respect to the problem I'm having reconstructing to raidframe disks
 with wedges on the raid sets.  Putting in a little extra info in the
 error messages yields: raid2: initiating in-place reconstruction on
 column 4 raid2: Recon write failed (status 30(0x1e)!
 raid2: reconstruction failed.
 
   If that status number, taken from the second argument of
 rf_ReconeWriteDoneProc() is an error from /usr/include/sys/errno.h,
 then I'm getting EROFS when I try to reconstruct the disk.  

Hmmm... strange...

 Wouldn't
 that seem to imply that raidframe is trying to write over some
 protected portion of one of the components, probably the one I can't
 reconstruct to? Each of the components has a BSD disklabel on it, and
 I know that the raid set actually begins 64 sectors from the start of
 the partition in which the raid set resides.  However, is a similar
 set-back done at the end of the raid?  That is, does the raid set
 extend all the way to the end of its partition or does it leave some
 space at the end for data as well? 

No, it doesn't.  The RAID set will use the remainder of the component,
but up to a multiple of whatever the stripe width is... (that is, the
RAID set will always end on a complete stripe.)
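In other words, per component (a sketch of the arithmetic, assuming the
usual 64-sector reserved area at the front; not the actual RAIDframe
sizing code):

    #include <stdint.h>

    #define RESERVED_SECTORS 64

    /* Sectors of a component the RAID set can use: everything after the
       reserved area, rounded down to a whole number of stripe units so
       the set ends on a complete stripe. */
    static uint64_t
    usable_sectors(uint64_t component_sectors, uint32_t sect_per_su)
    {
            uint64_t avail = component_sectors - RESERVED_SECTORS;

            return avail - (avail % sect_per_su);
    }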

 Here's the thought.  I notice when
 I was reading through the wedge code, that there's a reference to
 searching for backup gpt tables and that one of the backups is stored
 at the end of the media passed to the wedge discovery code.  Since
 the broken component is the last component  in the raid set, I wonder
 if the wedge discovery code is marking the sectors containing the gpt
 table at the end of the raid set as protected, but for the disk
 itself, rather than the raid set?  I want to say that this is only a
 theory at the moment, based on a quick diagnostic enhancement to the
 error messages, but I can't think of another reason why I'd be
 getting that error. I'm going to be in and out of the office over the
 next week, but I'll try to see if I can capture the block numbers
 that are attempting to be written when the error occurs.  I think I
 can do that with a debug kernel I have built for the purpose.  Again,
 this problem exists under 5.0, not just 5.1, so it predates Jed's
 changes. If anyone has any other thoughts as to why I'd be getting
 EROFS on a raid component when trying to reconstruct to it, but not
 when I  create the raid, I'm all ears.

So when one builds a regular filesystem on a wedge, do they end up with
the same problem with 'data' at the end of the wedge?  If one does a dd
to the wedge, does it report write errors before the end of the wedge?

I really need to get my test box up-to-speed again, but that's going to
have to wait a few more weeks...

Later...

Greg Oster


 On Jan 7,  3:22pm, Brian Buhrow wrote:
 } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
 } hello Greg.  Regarding problem 1, the inability to
 reconstruct disks } in raid sets with wedges in them, I confess I
 don't understand the vnode } stuff entirely, but rf_getdisksize() in
 rf_netbsdkintf.c looks suspicious } to me.  I'm a little unclear, but
 it looks like it tries to get the disk } size a number of ways,
 including by checking for a possible wedge on the } component.  I
 wonder if that's what's sending the reference count too high? }
 -thanks } -Brian
 } 
 } On Jan 7,  2:17pm, Greg Oster wrote:
 } } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
 } } On Fri, 7 Jan 2011 05:34:11 -0800
 } } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:
 } } 
 } }  hello.  OK.  Still more info.There seem to be two bugs
 here: } }  
 } }  1.  Raid sets with gpt partition tables in the raid set are not
 able } }  to reconstruct failed components because, for some reason,
 the failed } }  component is still marked open by the system even
 after the raidframe } }  code has marked it dead.  Still looking
 into the fix for that one. } } 
 } } Is this just with autoconfig sets, or with non-autoconfig sets
 too? } } When RF marks a disk as 'dead', it only does so internally,
 and doesn't } } write anything to the 'dead' disk.  It also doesn't
 even try to close } } the disk (maybe it should?).  Where it does try
 to close the disk is } } when you do a reconstruct-in-place -- there,
 it will close the disk } } before re-opening it... 
 } } 
 } } rf_netbsdkintf.c:rf_close_component() should take care of closing
 a } } component, but does something Special need to be done for
 wedges there? } } 
 } }  2.  Raid sets with gpt partition tables on them cannot be
 } }  unconfigured and reconfigured without rebooting.  This is
 because } }  dkwedge_delall() is not called during the raid shutdown
 process.  I } }  have a patch

Re: Problems with raidframe under NetBSD-5.1/i386

2011-01-07 Thread Greg Oster
On Fri, 7 Jan 2011 15:22:03 -0800
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   hello Greg.  Regarding problem 1, the inability to
 reconstruct disks in raid sets with wedges in them, I confess I don't
 understand the vnode stuff entirely, but rf_getdisksize() in
 rf_netbsdkintf.c looks suspicious to me.  I'm a little unclear, but
 it looks like it tries to get the disk size a number of ways,
 including by checking for a possible wedge on the component.  I
 wonder if that's what's sending the reference count too high? -thanks

In rf_reconstruct.c:rf_ReconstructInPlace() we have this:

retcode = VOP_IOCTL(vp, DIOCGPART, &dpart, FREAD,
    curlwp->l_cred);

I think will fail for wedges... it should be doing:

retcode = VOP_IOCTL(vp, DIOCGWEDGEINFO, &dkw, FREAD, l->l_cred);

for the wedge case (see rf_getdisksize()).  Now: since the kernel
prints:

 raid2: initiating in-place reconstruction on column 4
 raid2: Recon write failed!
 raid2: reconstruction failed.

it's somehow making it past that point... but maybe with the wrong
values?? (is there an old label on the disk or something??? )
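For comparison, the wedge-aware lookup in rf_getdisksize() does roughly
the following (paraphrased sketch, not the literal code; the ENOTTY
fallback and the partinfo field access are approximations):

    struct dkwedge_info dkw;
    struct partinfo dpart;
    uint64_t numsecs;
    int error;

    /* Try the wedge ioctl first... */
    error = VOP_IOCTL(vp, DIOCGWEDGEINFO, &dkw, FREAD, l->l_cred);
    if (error == 0) {
            numsecs = dkw.dkw_size;              /* wedge size in sectors */
    } else if (error == ENOTTY) {
            /* ...and fall back to the traditional partition ioctl. */
            error = VOP_IOCTL(vp, DIOCGPART, &dpart, FREAD, l->l_cred);
            if (error == 0)
                    numsecs = dpart.part->p_size;
    }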

Later...

Greg Oster

 On Jan 7,  2:17pm, Greg Oster wrote:
 } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
 } On Fri, 7 Jan 2011 05:34:11 -0800
 } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:
 } 
 }hello.  OK.  Still more info.There seem to be two bugs
 here: }  
 }  1.  Raid sets with gpt partition tables in the raid set are not
 able }  to reconstruct failed components because, for some reason,
 the failed }  component is still marked open by the system even
 after the raidframe }  code has marked it dead.  Still looking into
 the fix for that one. } 
 } Is this just with autoconfig sets, or with non-autoconfig sets too?
 } When RF marks a disk as 'dead', it only does so internally, and
 doesn't } write anything to the 'dead' disk.  It also doesn't even
 try to close } the disk (maybe it should?).  Where it does try to
 close the disk is } when you do a reconstruct-in-place -- there, it
 will close the disk } before re-opening it... 
 } 
 } rf_netbsdkintf.c:rf_close_component() should take care of closing a
 } component, but does something Special need to be done for wedges
 there? } 
 }  2.  Raid sets with gpt partition tables on them cannot be
 }  unconfigured and reconfigured without rebooting.  This is because
 }  dkwedge_delall() is not called during the raid shutdown process.
 I }  have a patch for this issue which seems to work fine.  See the
 }  following output:
 } [snip]
 }  
 }  Here's the patch.  Note that this is against NetBSD-5.0 sources,
 but }  it should be clean for 5.1, and, i'm guessing, -current as
 well. } 
 } Ah, good!  Thanks for your help with this.   I see Christos has
 already } commited your changes too. (Thanks, Christos!)
 } 
 } Later...
 } 
 } Greg Oster
 -- End of excerpt from Greg Oster
 


Later...

Greg Oster


Re: Problems with raidframe under NetBSD-5.1/i386

2011-01-06 Thread Greg Oster
On Thu, 6 Jan 2011 09:42:41 -0800
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   Hello.  I have a box running NetBSD-5.1/i386 with kernel
 sources from 1/4/2011 which refuses to reconstruct to what looks like
 a perfectly good disk.  The error messages are:
 
 Command:
 root# raidctl -R /dev/sd10a raid2
 
 Error messages from the kernel:
 raid2: initiating in-place reconstruction on column 4
 raid2: Recon write failed!
 raid2: reconstruction failed.

Is there anything else in /var/log/messages about this?  Did the
component fail before with write errors?

 So, I realize this isn't a lot to go on, so I've been trying to build
 a kernel with some debugging in it.
 
 Configured a kernel with:
 options DEBUG
 options   RF_DEBUG_RECON
 
 But the kernel won't build because the Dprintf statements that get
 turned on in the rf_reconstruct.c file when the second option is
 enabled cause gcc to emit warnings.

Ug.  Those should probably be fixed, although I don't suspect many
people use them...

   So, rather than fixing the warnings, and potentially breaking
 something else, my question is, how can I turn off -werror in the
 build process for just this kernel?  Looking in the generated
 Makefile didn't provide enlightenment and I don't really want to
 disable this option for my entire tree.  But, I imagine, this is an
 easy and often-wanted thing to do. -thanks
 -Brian


Later...

Greg Oster


Re: Problems with raidframe under NetBSD-5.1/i386

2011-01-06 Thread Greg Oster
On Thu, 6 Jan 2011 10:05:17 -0800
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   hello Greg.  When the component failed, I just got an error:
 raid2: io error on /dev/sd10a, marking /dev/sd10a as failed!
 No messages from the drive itself.  It's a SCSI disk attached to an
 LSI mpt(4) card, whose driver I'm trying to improve in the event of
 problematic disk behavior.  However, this drive is good, I believe,
 and I can read and write to the disk itself without problems.  

To all sectors, or just through the filesystem?

 I suspect I'm having kmem allocation problems inside RAIDframe (I've
 been here before :) ), but I'm not sure of that.  I did test with
 NetBSD-5.0 with kernel sources from June 2009, which is what I've
 been running in production around here, just to make sure the new
 parity map stuff wasn't the problem.  It doesn't reconstruct there
 either.  So, what ever is wrong, it's an old problem, and probably a
 pilot error at that.

Hmm.. so if you bump up the size of NKMEMPAGES (or whatever) does the
reconstruction succeed?  How large are these components, and how much
RAM in the system?

 Do you know the magic to turn off -werror for
 individual kernel builds?

Not off the top of my head, no...

Later...

Greg Oster

 On Jan 6, 11:50am, Greg Oster wrote:
 } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
 } On Thu, 6 Jan 2011 09:42:41 -0800
 } buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:
 } 
 }Hello.  I have a box running NetBSD-5.1/i386 with kernel
 }  sources from 1/4/2011 which refuses to reconstruct to what looks
 like }  a perfectly good disk.  The errormessages are:
 }  
 }  Command:
 }  root#raidctl -R /dev/sd10a raid2
 }  
 }  Error messages from the kernel:
 }  raid2: initiating in-place reconstruction on column 4
 }  raid2: Recon write failed!
 }  raid2: reconstruction failed.
 } 
 } Is there anything else in /var/log/messsages about this?  Did the
 } component fail before with write errors?
 } 
 }  So, I realize this isn't a lot to go on, so I've been trying to
 build }  a kernel with some debugging in it.
 }  
 }  Configured a kernel with:
 }  options DEBUG
 }  options   RF_DEBUG_RECON
 }  
 }  But the kernel won't build because the Dprintf statements that get
 }  turned on in the rf_reconstruct.c file when the second option is
 }  enabled causes gcc to emit warnings.  
 } 
 } Ug.  Those should probably be fixed, although I don't suspect many
 } people use them
 } 
 }So, rather than fixing the warnings, and potentially
 breaking }  something else, my question is, how can I turn off
 -werror in the }  build process for just this kernel?  Looking in
 the generated }  Makefile didn't provide enlightenment and I don't
 really want to }  disable this option for my entire tree.  But, I
 imagine, this is an }  easy and often-wanted thing to do. -thanks
 }  -Brian
 } 
 } 
 } Later...
 } 
 } Greg Oster
 -- End of excerpt from Greg Oster
 


Later...

Greg Oster


Re: Problems with raidframe under NetBSD-5.1/i386

2011-01-06 Thread Greg Oster
On Thu, 6 Jan 2011 18:33:58 -0800
buh...@lothlorien.nfbcal.org (Brian Buhrow) wrote:

   Hello.  Ok.  I have more information, perhaps this is a known
 issue. If not, I can file a bug.

Please, do file a PR... this is a new one.

   the problem seems to be that if you partition a raid set with
 gpt instead of disklabel and a component of that raid set fails, the
 underlying component is held open even after raidframe declares it
 dead. Thus, when you try to ask raidframe to do a reconstruct on the
 dead component, it can't open the component because the component is
 busy.

I think the culprit is in
src/sys/dev/raidframe/rf_netbsdkintf.c:rf_find_raid_components() where
in the: 

  if (wedge) {
          ...
          ac_list = rf_get_component(ac_list, dev,
              vp, device_xname(dv), dkw.dkw_size);
          continue;
  }

case that little continue is not letting the execution reach the:

/* don't need this any more.  We'll allocate it again
   a little later if we really do... */
vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
VOP_CLOSE(vp, FREAD | FWRITE, NOCRED);
vput(vp);

code which would close the opened wedge. :(  Both 5.1 and -current
suffer from the same issue (though the code in -current is slightly
different).
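Going by that analysis, the shape of a fix would be to give the wedge
branch the same close-and-release treatment before it continues --
something like the sketch below, which just restates what the skipped
code above does (this is what the analysis implies, not necessarily the
change that was actually committed):

    if (wedge) {
            ...
            ac_list = rf_get_component(ac_list, dev,
                vp, device_xname(dv), dkw.dkw_size);
            /* Don't leave the wedge held open: release the vnode here
               too, the way the non-wedge path does.  (Sketch only.) */
            vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
            VOP_CLOSE(vp, FREAD | FWRITE, NOCRED);
            vput(vp);
            continue;
    }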

Thanks for the investigation and report...

Later...

Greg Oster