Re: Beating a dead horse

2015-11-25 Thread Swift Griggs

On Wed, 25 Nov 2015, Andreas Gustafsson wrote:
They don't have sectors so much as flash pages, and the page size varies 
from device to device.


I'm curious about something, probably due to my ignorance of the full 
dynamics of the vfs(9) layer. Why is it that folks don't choose file 
system block sizes and partition offsets that share a least common 
factor with the hardware layer? I.e., let's say the hard disk uses 
4K pages, the file system uses 8K blocks, and the vendor recommends 
that you stay aligned to a 1GB boundary. Wouldn't operating on 8K 
blocks still satisfy the underlying device (since 8K operations are 
always divisible by 4K)? The 1GB alignment may not always be perfect, 
but the 8K ops below it would eventually stack up to exactly 1GB, too.


Is it all about waste at the file system layer due to some block 
operations being optimized for large devices and buffers but not being as 
applicable (or being downright wasteful) on smaller block devices?


I'm just asking so I can better follow the conversation you more 
experienced folks are having.


Thanks,
  Swift


Re: Beating a dead horse

2015-11-25 Thread Greg Oster
On Tue, 24 Nov 2015 21:57:50 -0553.75
"William A. Mahaffey III"  wrote:

> On 11/24/15 19:08, Robert Elz wrote:
> >  Date:Mon, 23 Nov 2015 11:18:48 -0553.75
> >  From:"William A. Mahaffey III" 
> >  Message-ID:  <5653492e.1090...@hiwaay.net>
> >

> >| The machine works well except for horribly slow I/O to the
> > RAID5 I setup
> >
> > What is your definition of "horribly slow" and are we talking read
> > or write or both ?
> 
> 4256EE1 # time dd if=/dev/zero of=/home/testfile bs=16k count=32768
> 32768+0 records in
> 32768+0 records out
> 536870912 bytes transferred in 22.475 secs (23887471 bytes/sec)
> 23.28 real 0.10 user 2.38 sys
> 4256EE1 #

Just to recap: You have a RAID set that is not 4K aligned with the
underlying disks.  Fixing that will certainly help, but there is a bit
more tweaking to be done:

 1) Your stripe size is 32 blocks - 16K.  So if you are writing in 16K
 chunks (as with the dd above), at best you're guaranteed to be
 re-reading the old data, re-reading the old parity, writing new data, and
 writing new parity for each block.  That's the 'small write problem'
 with RAID 5.  (at worst, that 16K might span two components, in which
 case you'll have to wait for three reads and three writes to complete,
 instead of two reads and two writes)

 2) Your filesystem block size is 32k, meaning a filesystem block write
 will never give you a full stripe write either.  See 1) for
 performance implications.

 3) If you want more write speed, then:
   a) get the RAID components and RAID 4K-aligned with the underlying
  disks.  Your 32 block stripe width is perfect. The number
  of data components is good.
   b) get the filesystem 4K-aligned
   c) use a 64K block size for the filesystem
   d) Use 'bs=64k' (or 'bs=10m') to see better IO performance
  
 4) As others have said, if you need high-performance writes, RAID 5 is
 probably not what you want, especially if you're not streaming writes
 in 64K chunks.
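
A rough sketch of what 3b)-3d) could look like in practice, assuming /home
sits on a wedge such as dk1 on top of raid2 (the actual wedge name here is
a guess) and that you're prepared to destroy and restore the data:

  umount /home
  newfs -O 2 -b 65536 -f 8192 /dev/rdk1
  mount /dev/dk1 /home
  time dd if=/dev/zero of=/home/testfile bs=64k count=32768

4K alignment of the RAID and of the wedge itself (3a and 3b) still has to
be checked separately; newfs can't fix that.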

Later...

Greg Oster


Re: Beating a dead horse

2015-11-25 Thread Greg Troxel

Swift Griggs  writes:

> I'm curious about something, probably due to my ignorance of the full
> dynamics of the vfs(9) layer. Why is it that folks don't choose file
> system block sizes and partition offsets that share a least common
> factor with the hardware layer? I.e., let's say the hard disk uses
> 4K pages, the file system uses 8K blocks, and the vendor recommends
> that you stay aligned to a 1GB boundary. Wouldn't operating on 8K
> blocks still satisfy the underlying device (since 8K operations are
> always divisible by 4K)? The 1GB alignment may not always be perfect,
> but the 8K ops below it would eventually stack up to exactly 1GB, too.

Good questions, and it boils down to a few things:

  - many devices don't have a way to report their underlying block
sizes.  For example, if you buy a 2T spinning disk, it will very
likely be one that has sectors that are actually 4K but an interface
of 512B sectors.  So if you read, it's fine because it gets a 4K
sector into the cache, and then hands you the piece you want.  And
when you write, if you write a 512-byte sector, it has to
read-modify-write.  Worse, if you write 4K or 8K but not lined up
(which you will if your fs has 8K blocks but aligned to 63), it has
to read-modify-write 2 sectors per write.

  - SSDs are even harder to figure out, as Andreas's helpful references
in response to my question show.

  - filesystems sometimes get moved around, and higher up it's even more
disconnected from the actual hardware

So there are two issues: alignment and filesystem block/frag size, and
both have to be ok.  For larger disks, UFS uses larger block sizes by
default (man newfs).  So that's ok, but alignment is messier.  We're
seeing smaller disks with 4K sectors or larger flash erase blocks and
512B interfaces now.
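
A quick way to check the alignment part, assuming wd0 is the disk in
question, is to look at each partition's starting sector and see whether
it is divisible by 8 (8 x 512 B = 4 KB):

  disklabel wd0 | awk '/^ [a-p]:/ { printf "%s start %s %s\n", $1, $3, ($3 % 8 == 0 ? "4K-aligned" : "not 4K-aligned") }'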

And, there are also disks with native 4K sectors, where the interface to
the computer transfers 4K chunks.  That avoids the alignment issue, but
requires filesystem/kernel support.  I am pretty sure netbsd-7 is ok
with that but I am not sure about earlier.

It would probably be possible to add a call into drivers to return this
info and propagate it up and have newfs/fdisk query it.  I am not sure
that all disks return the info reliably enough, and there are probably a lot of
details.  But it's more work and doesn't necessarily do better than
"just start at 2048 and use big blocks".  Certainly you are welcome to
read the code and think about it if this interests you - just explaining
why I think no one has done the work so far.

> Is it all about waste at the file system layer due to some block
> operations being optimized for large devices and buffers but not being
> as applicable (or being downright wasteful) on smaller block devices?

I guess you can put it that way, saying otherwise we would always start 
at 2048 and use 32K or even 64K blocks.   But I think part of it is 
inertia.  And the 63 start dates back to floppies that had 63-sector 
tracks - so it was actually aligned.





Re: Beating a dead horse

2015-11-25 Thread Swift Griggs

On Wed, 25 Nov 2015, Greg Troxel wrote:
So there are two issues: alignment and filesystem block/frag size, and 
both have to be ok.


Ahh, a key point to be certain.


So that's ok, but alignment is messier.


It sure seems that way!  :-)

 We're seeing smaller disks with 4K sectors or larger flash erase blocks 
and 512B interfaces now.


Those larger erase blocks (128k?!) would seem to be a big problem if you'd 
rather stick to a smaller alignment or file system block size. However, 
maybe I'm missing something about the interplay between things like inode 
sizes and any relationship with the erase block size. For example, if you 
were using an 8K block size, I'd think a 128k erase block size would blow 
things up nicely when you blew way past the 8k block you originally wanted 
to clear. Again, I feel I'm missing something, but let me say right away 
that I'm grateful for your detailed response and I learned much from it.


That avoids the alignment issue, but requires filesystem/kernel support. 
I am pretty sure netbsd-7 is ok with that but I am not sure about 
earlier.


Interesting. For some reason I have it in my head that many EMC SANs of 
various stripes (and I know for sure the XtremeIO stuff) use 4k block 
sizes. I've used NetBSD in the past in EMC environments and seen moderate 
(but not what I'd call "high") performance vis-a-vis other (Linux and/or 
Solaris) hosts. I wonder if this was a factor. I'll have to test with 
NetBSD 7, now. Sounds fun.


It would probably be possible to add a call into drivers to return this 
info and propagate it up and have newfs/fdisk query it.


That sounds pretty automagical (in a good way). However, as you point out 
it might be tedious for not enough payoff considering that using a 2048 
offset seems to address the issue.


Certainly you are welcome to read the code and think about it if this 
interests you - just explaining why I think no one has done the work so 
far.


I did, in fact, take a look at /usr/src/sbin/newfs/mkfs.c and 
/usr/include/ufs/ffs as well. I'm a kernel lightweight and I got all 
confused by discussions in the code about block fragments and minimum / 
maximum block sizes. Here I was thinking all that was static. I did learn 
that 4k seems to be the smallest allowable size (line 141 in fs.h #define 
MINBSIZE 4096). It sounds like that's the smallest since a cylinder group 
block struct needs a 4k space to be workable. I also noticed that 
superblock sizes seem to operate independently of the data block sizes. 
It's a great day for learning!


I guess you can put it that way, saying otherwise we would always start 
at 2048 and use 32K or even 64K blocks.


That doesn't seem too wasteful to me, but I'm ready for some embedded 
systems guy to lecture me.  :-)


But I think part of it is inertia.  And the 63 start dates back to 
floppies that had 63-sector tracks - so it was actually aligned.


Ah! I always wondered where that came from. I remember floppy disk 
controllers were bizarre animals. Physical characteristics were 
so important that you could easily create differences that'd result in a 
completely unreadable floppy. I.e., that was always a big issue trying to 
get my Amiga floppies to read on a PC. I had to buy a special dual-mode 
floppy controller et al. After all that hassle I still have some stupid 
emotional affinity for 3.5" floppies because of how much more wonderful 
they seemed over 5.25" floppies. Ugh... nostalgia can be painful sometimes 
:-P


Thanks again for the detailed response. I learned a ton from this 
discussion.


Thanks,
  Swift


Re: (unknown)

2015-11-25 Thread Christos Zoulas
In article <24245.1448479...@andromeda.noi.kre.to>,
Robert Elz   wrote:
>Date:Wed, 25 Nov 2015 12:25:50 -0600 (CST)
>From:"Jeremy C. Reed" 
>Message-ID:  
>
>  | Is this expected behavior? Undefined? A bug?
>
>parsedate is full of bugs.   I have a fix for some of them (not that
>one, as it depends more upon what it "should" do) that I've
>sent to a few people.
>
>I think the plan was for apb to incorporate it, but he seems to be MIA
>at the minute, so nothing has happened yet.
>
>I can send the patches to anyone who wants.

Send them here; preferably with unit-tests :-)

christos



Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 15:59:29 -0600
From:Greg Oster 
Message-ID:  <20151125155929.2a5f2...@mickey.usask.ca>

  |  time dd if=/dev/zero of=/home/testfile bs=64k count=32768
  |  time dd if=/dev/zero of=/home/testfile bs=10240k count=32768
  | 
  | so that at least you're sending 64K chunks to the disk...

Will that really work?   Wouldn't the filesystem divide the 64k writes
from dd into 32K file blocks, and write those to raidframe?   I doubt
those tests would be productive.

They'd need to write to the raw rraid0d (or rraid0a) to be effective
I think, and that would destroy all the data...

kre



Re: Beating a dead horse

2015-11-25 Thread William A. Mahaffey III

On 11/25/15 16:05, Greg Oster wrote:

On Thu, 26 Nov 2015 04:41:02 +0700
Robert Elz  wrote:


 Date:Wed, 25 Nov 2015 14:57:02 -0553.75
 From:"William A. Mahaffey III" 
 Message-ID:  <56561f54.5040...@hiwaay.net>


   |   f: 1886414256  67110912   RAID # (Cyl. 66578*- 1938020)

OK, 67110912 is a multiple of 2^11 (2048) which is just fine.
The size is a multiple of 2^4 (16) so that's OK too.

   |   128  7545656543  1  GPT part - NetBSD FFSv1/FFSv2

The 128 is what I was expecting, from the dk0 wedgeinfo, and that's
fine.  The size is weird, but I don't think it should give a problem.
Greg will be able to say what happens when there's a partial stripe
left over at the end of a raidframe array.

RAIDframe truncates to the last full stripe.


If you do ever decide to redo things, I'd make that size be a
mulltiple of 2048 too (maybe a multiple of 2048 - 128).  Wasting a
couple of thousand sectors (1 MB) won't hurt (and that's the max).


But overall I think your basic layout is fine, and there's no need to
adjust that.  The one thing that you need to do (if you really need
better performance, rather than just think you should have it - that
is, if you need it enough to re-init the filesystem) would be to
change the ffs block size, or change the raidframe stripe size, so
standard size block I/O turns into full size stripe I/O.

Doing that should improve performance.  Nothing else is likely to
help.

The first thing I would do is test with these:

  time dd if=/dev/zero of=/home/testfile bs=64k count=32768
  time dd if=/dev/zero of=/home/testfile bs=10240k count=32768

so that at least you're sending 64K chunks to the disk... After that,
64K blocks on the filesystem are going to be next, and that might be
more effort than it's worth, depending on the results of the above
dd's...

Later...

Greg Oster



4256EE1 # time dd if=/dev/zero of=/home/testfile bs=64k count=32768
32768+0 records in
32768+0 records out
2147483648 bytes transferred in 166.255 secs (12916806 bytes/sec)
  167.20 real 0.12 user 8.85 sys
4256EE1 #

The other command is still running, and will write out 320 GB by my count; 
is that as intended, or a typo :-) ? If as intended, I will leave it going 
& report back when it is done. BTW, I see much more of the above 13-ish 
MB/s than the 24-ish reported earlier. When I posted a few weeks ago I 
think I had about 18 MB/s, but 12-15 is much more common, apropos of 
nothing if it is nominally as expected. Thanks & TIA for any more 
insight ...


--

William A. Mahaffey III

 --

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
   -- Gen. George S. Patton Jr.



Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 19:08:59 -0553.75
From:"William A. Mahaffey III" 
Message-ID:  <56565a61.7080...@hiwaay.net>

  | The other command is still running, will write out 320 GB by my count, 
  | is that as intended, or a typo :-) ? If as wanted, I will leave it going 
  | & report back when it is done.

Kill it, those tests are testing precisely nothing.

If you want to try a slightly better test, try with bs=32k, so you are at
least having dd write file system sized blocks.   Won't help with raidframe
overheads, but at least you'll optimise the filesystem overheads as much
as possible.
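
For example (count picked here just to keep the test file at 1 GB):

  time dd if=/dev/zero of=/home/testfile bs=32k count=32768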

But I think now that it is clear that you could improve performance if you
rebuild the filesystem with -b 65536 (and -f one of 8192 16384 32768, take
your pick... 8192 would be most space saving).

The question is whether the improvement is really needed for real work,
as opposed to meaningless benchmarks (which dd really isn't anyway, if
you want a real benchmark pick something, perhaps bonnie, from the
benchmarks category in pkgsrc).   And whether it will be enough to be
worth the pain - unfortunately, there's no real way to know how much
improvement you'd get without doing the work first.
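
If you do go the bonnie route, a rough sketch, assuming pkgsrc lives in
/usr/pkgsrc (the -d/-s flags are from classic bonnie and worth checking
against the installed version):

  cd /usr/pkgsrc/benchmarks/bonnie && make install clean
  bonnie -d /home -s 2048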

Only you can answer those questions - you know what the work is, and
how effectively your system is coping as it is now configured.

As I said before, when it is me, I just ignore all of this, I don't care
what the i/o throughput is (or could be) because in practice there just
isn't enough i/o (on raid5 based filesystems) in my system to matter.
So I optimise for other things - of which the most valuable (to me) is
my time!

kre



[no subject]

2015-11-25 Thread Jeremy C. Reed
At Wed, 25 Nov 2015 10:00:00 + (UTC) I had a cron job run:

for tz in America/Los_Angeles America/Chicago America/New_York \
Asia/Tokyo Europe/Berlin ; do
TZ=$tz date -d "Wednesday 22:00utc" +"%A %B %d %I:%M %p %z %Z ${tz}"  ; 
done

This resulted in:

Wednesday November 25 12:00 PM -0800 PST America/Los_Angeles
Wednesday November 25 02:00 PM -0600 CST America/Chicago
Wednesday November 25 03:00 PM -0500 EST America/New_York
Wednesday December 02 05:00 AM +0900 JST Asia/Tokyo
Wednesday November 25 09:00 PM +0100 CET Europe/Berlin


Notice the December 02 above.

An easy workaround is to also add today's date to the -d parsedate 
string above.
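
A sketch of that workaround (whether parsedate accepts a leading
yyyy-mm-dd date in this exact form is an assumption):

  TZ=$tz date -d "$(date -u +%Y-%m-%d) 22:00utc" +"%A %B %d %I:%M %p %z %Z ${tz}"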

Is this expected behavior? Undefined? A bug?


[no subject]

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 12:25:50 -0600 (CST)
From:"Jeremy C. Reed" 
Message-ID:  

  | Is this expected behavior? Undefined? A bug?

parsedate is full of bugs.   I have a fix for some of them (not that
one, as it depends more upon what it "should" do) that I've
sent to a few people.

I think the plan was for apb to incorporate it, but he seems to be MIA
at the minute, so nothing has happened yet.

I can send the patches to anyone who wants.

kre



Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 12:29:15 -0500
From:Greg Troxel 
Message-ID:  


  | And, there are also disks with native 4K sectors, where the interface to
  | the computer transfers 4K chunks.  That avoids the alignment issue, but
  | requires filesystem/kernel support.  I am pretty sure netbsd-7 is ok
  | with that but I am not sure about earlier.

FFS is OK on NetBSD-7 (not sure about LFS or others, never tried them).
Raidframe might be (haven't looked) but both cgd and ccd are a mess...

I have been looking into cgd, I have exactly that kind of drive, and
want to put CGD on it.   It "works" but counts 4K sectors as if they
were 512 bytes each...  (ie: the cgd is 1/8 the size it should be).
Fixing that the easy way loses all the alignment (and since it is required,
not just inefficient, would probably cause breakage if you're not very very
careful).   Fixing it the right way is hard the way things are right now
(it is real easy to cause panics trying ... I know, been there, done that...)

  | It would probably be possible to add a call into drivers to return this
  | info and propagate it up and have newfs/fdisk query it.

Something like that is a part of what I am thinking about to make cgds
on 4k sector drives work properly (if/when I make cgd work, I'll probably
look at ccds though I have no personal requirement for those, but it
ought to just fall out of the cgd fixes I think).

One real issue is that on 4K sector discs, things like LABELSECTOR
change meaning (that is, they retain their meaning, but that meaning means
a different thing).   That is, you cannot simply dd a 512 byte/sector
drive to an equivalent sized 4k sector drive and have it have any hope
at all of working - the labels will be in the wrong place, and contain
the wrong data (disklabels on 4k sector drives are kind of unspecified,
but by analogy with MBRs and GPT need to count in units of sectors, not
DEV_BSIZE's)

The one big advantage of 4k sector drives (and the reason I suspect they
exist, as a different species than the drives with emulated 512 byte
sectors) is that they can grow much bigger without blowing the size limits
inherent in 32 bit based labels (like MBRs and disklabels).

That is, while we all have been assuming that for drives > 2TB GPT was
mandatory, it turns out it isn't, disklabels and/or MBR work just fine,
provided the sectors are 4K rather than 512...

I had not been thinking about passing this drive info all the way up to
newfs (and other userland tools) but there's no reason that couldn't be
done.

kre



Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 08:10:50 -0553.75
From:"William A. Mahaffey III" 
Message-ID:  <5655c020.5090...@hiwaay.net>

In addition to what I said in the previous message ...

  | Hmmm ... I thought that the RAID5 would write 1 parity byte & 4 data 
  | bytes in parallel, i.e. no '1 drive bottleneck'. AFAIUT, parity data is 
  | calculated for N data drives, 1 byte of parity data per N bytes written 
  | to the N data drives, then the N+1 bytes are written to the N+1 drives, no ?

As I understand it, that's how the mathematics works ... but (like many
things) the theory and the practice don't always cooperate quite like
that.  To see what really happens you have to imagine writing a single
disk block (because that's what raidframe gets from the filesystem layer)
in isolation.  Because of striping, you could imagine it in your model
as if you wanted to write just one byte.   You finish that, and then you
write one more byte, and finish that...

In reality it isn't done as bytes (that's too inefficient) but as stripes;
the principle is the same.


  | 4256EE1 # raidctl -s  raid2
  | Components:
  | /dev/wd0f: optimal
  | /dev/wd1f: optimal
  | /dev/wd2f: optimal
  | /dev/wd3f: optimal
  | /dev/wd4f: optimal
  | No spares.

The real reason I wanted to reply to this message is that last line.

wd5 is not being used as a spare.  I kind of suspected that might be the case.
(Parts of it might be used for raid0 or raid1, that's a whole different
question and not material here).

Raidframe autoconfigures the in-use components (assuming autoconfig is
enabled for the raid array, which it is for you ...)

  | Component label for /dev/wd0f:
  | Autoconfig: Yes

(same for the other components.)   But spares are not autoconfigured.
If you want a "hot" spare (one that will automatically be used if one
of the other components fails, so you get back the reliability as soon
as possible, in case a second component also fails) rather than a "cold"
spare (one waiting to be used, but which needs human intervention to
actually make it happen - which is what you have now), then you need
to arrange to add the spare after every reboot.

There is no current standard way to make that happen (most of us tend to
be counting pennies, and have no spare drives ready at all - we wait for one
to fail, then go buy its replacement only when required...), so I'd just add

raidctl -a /dev/wd5f raid2

in /etc/rc.local.   You might want to defer doing that though until after
having everything else sorted out - at the minute, wd5f is spare scratch
space, being used by nothing - you could make a ffs filesystem on it, to
measure the speed you can get to a single drive filesystem.  You would need
to alter its partition type from RAID to 4.2BSD in the disklabel first
(and then put it back again after you are done testing with the filesystem
and ready to make it back being a spare.)
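
A rough outline of that single-drive test, assuming wd5 is laid out like
the other drives and wd5f really is unused:

  disklabel -e wd5          # change the fstype of 'f' from RAID to 4.2BSD
  newfs -O 2 /dev/rwd5f
  mount /dev/wd5f /mnt
  time dd if=/dev/zero of=/mnt/testfile bs=64k count=32768
  umount /mnt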

Sometime or other we ought to either arrange for spares to autoconfig (but I
suspect that would be a job for Greg, and that probably means not anytime
soon...) or at least to have a standard rc.d script that would turn on any
configured (and unused) spares for autoconfigured raid sets, without needing
evil hacks like the one I just suggested sticking in rc.local...

Doing the second of those is probably within my abilities, so I might take
a crack at that one.

kre



Re:

2015-11-25 Thread gary
=> At Wed, 25 Nov 2015 10:00:00 + (UTC) I had a cron job run:
=>
=> for tz in America/Los_Angeles America/Chicago America/New_York \
=> Asia/Tokyo Europe/Berlin ; do
=> TZ=$tz date -d "Wednesday 22:00utc" +"%A %B %d %I:%M %p %z %Z ${tz}"  ;
=> done
=>
=> This resulted in:
=>
=> Wednesday November 25 12:00 PM -0800 PST America/Los_Angeles
=> Wednesday November 25 02:00 PM -0600 CST America/Chicago
=> Wednesday November 25 03:00 PM -0500 EST America/New_York
=> Wednesday December 02 05:00 AM +0900 JST Asia/Tokyo
=> Wednesday November 25 09:00 PM +0100 CET Europe/Berlin
=>
=>
=> Notice the December 02 above.
=>
=> An easy workaround is to also add today's date to the -d parsedate
=> string above.
=>
=> Is this expected behavior? Undefined? A bug?

   FWIW, I get similar results on my Linux box (Ubuntu 14.04).

Wednesday November 25 02:00 PM -0800 PST America/Los_Angeles
Wednesday November 25 04:00 PM -0600 CST America/Chicago
Wednesday November 25 05:00 PM -0500 EST America/New_York
Thursday December 03 07:00 AM +0900 JST Asia/Tokyo
Wednesday November 25 11:00 PM +0100 CET Europe/Berlin

  Gary Duzan





Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Thu, 26 Nov 2015 01:41:00 +0700
From:Robert Elz 
Message-ID:  <23815.1448476...@andromeda.noi.kre.to>

  | so I'd just add
  | 
  | raidctl -a /dev/wd5f raid2
  | 
  | in /etc/rc.local

Actually, a better way short term is probably to put your config file
for raid2 in /etc/raid2.conf and disable autoconfig for raid2 (for /home,
it is not really required ... but don't do this if there's any possibility
that the wdN numbers might change - that is, if you might add new drives
or rearrange the cabling in any way).

For now anyway, don't include raidN.conf files in /etc for any raidN's
that are autoconfigured.

To disable autoconfig
raidctl -A no raid2

And make sure you have
raidframe=yes
in /etc/rc.conf

Then /etc/rc.d/raidframe will config raid2 for you at each boot, and that
way of configuring does handle spares, so if /etc/raid2.conf says that
wd5f should be a spare, it will be.
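
For reference, such a raid2.conf might look roughly like this (a sketch
only; the layout line is based on the 32-sector stripe unit and five
components discussed earlier, and would need checking against the real
set before use):

START array
# numRow numCol numSpare
1 5 1

START disks
/dev/wd0f
/dev/wd1f
/dev/wd2f
/dev/wd3f
/dev/wd4f

START spare
/dev/wd5f

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5

START queue
fifo 100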

Still, do any of this only after you have finished using wd5f for any other
testing you want to do.

kre



Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 14:57:02 -0553.75
From:"William A. Mahaffey III" 
Message-ID:  <56561f54.5040...@hiwaay.net>


  |   f: 1886414256  67110912   RAID # (Cyl. 66578*- 1938020)

OK, 67110912 is a multiple of 2^11 (2048) which is just fine.
The size is a multiple of 2^4 (16) so that's OK too.

  |   128  7545656543  1  GPT part - NetBSD FFSv1/FFSv2

The 128 is what I was expecting, from the dk0 wedgeinfo, and that's
fine.  The size is weird, but I don't think it should give a problem.
Greg will be able to say what happens when there's a partial stripe
left over at the end of a raidframe array.

If you do ever decide to redo things, I'd make that size be a multiple of
2048 too (maybe a multiple of 2048 - 128).  Wasting a couple of thousand
sectors (1 MB) won't hurt (and that's the max).


But overall I think your basic layout is fine, and there's no need to adjust 
that.  The one thing that you need to do (if you really need better
performance, rather than just think you should have it - that is, if you
need it enough to re-init the filesystem) would be to change the ffs block 
size, or change the raidframe stripe size, so standard size block I/O turns
into full size stripe I/O.

Doing that should improve performance.  Nothing else is likely to help.

kre



Re: Beating a dead horse

2015-11-25 Thread Robert Elz
Date:Wed, 25 Nov 2015 13:20:14 -0700 (MST)
From:Swift Griggs 
Message-ID:  

  | I wonder if the same is true for LVM?

No idea.   I thought it should be easy enough to test, so I just tried
that ...  unfortunately, I cannot work out (with 5 minutes of research!)
how to make it work, so no luck...

lvm wasn't on my radar - while it seems to do a lot, encryption doesn't
appear to be included.  My 4k sector drive is a USB removable, which is
the kind of thing that mandates encryption, so I want/need cgd on it.
Anything else after that is just frills...  (there's just one of them,
now anyway, so raid etc isn't important, and it is mostly for backups,
so doesn't need lots of small data volumes, just one big place to
put dump files...)


First problem with testing lvm is that amd64 GENERIC (and consequently my
cut down version) doesn't have "pseudo-device dm" in it, so lvm isn't going
to work there at all.   That's easy to fix of course, so I did that.

But after that, I couldn't figure out how to make

lvm pvcreate dk13

work, it just said (something like this, paraphrasing, as that system isn't
running any more) "invalid device or disabled by filtering"

Initially dk13 was of type "cgd" in the GPT label.   I wasn't sure what
to make it for LVM, there doesn't seem to be a NetBSD UUID for that,
so I tried setting it to linux-lvm on the assumption that would work.
Changed nothing.

I was using a current kernel (7.99.21) from a couple of weeks ago.
I can upgrade to a newer one if that is likely to help (but I doubt it).

That's where I gave up.   If anyone has a recipe I can follow to create
an lvm volume on a GPT partitioned drive (I *will not* revert to MBR or
disklabel - those are "so 20th century") and can tell me just how to run a
test, I'm willing to try, anytime in the remainder of this week.

After that, I'm going to be away from my 4k sector drive for all of
December, so I won't be testing lvm, or working on cgd at all.   If anyone
else wants to fix cgd for 4k sector size discs during Dec, please go ahead.
So far all I've been doing is investigating, and thinking, I have nothing
productive to share yet...

  | Since it's relatively new, perhaps 
  | some of these issues were worked out in a more "modern" way that would 
  | properly take advantage of the NetBSD-7 kernel.

The kernel does some stuff right for non 512 byte sector disks, but still
does lots wrong.   It is still a mess in this area.

kre



Re: Beating a dead horse

2015-11-25 Thread Greg Oster
On Thu, 26 Nov 2015 04:41:02 +0700
Robert Elz  wrote:

> Date:Wed, 25 Nov 2015 14:57:02 -0553.75
> From:"William A. Mahaffey III" 
> Message-ID:  <56561f54.5040...@hiwaay.net>
> 
> 
>   |   f: 1886414256  67110912   RAID # (Cyl. 66578*- 1938020)
> 
> OK, 67110912 is a multiple of 2^11 (2048) which is just fine.
> The size is a multiple of 2^4 (16) so that's OK too.
> 
>   |   128  7545656543  1  GPT part - NetBSD FFSv1/FFSv2
> 
> The 128 is what I was expecting, from the dk0 wedgeinfo, and that's
> fine.  The size is weird, but I don't think it should give a problem.
> Greg will be able to say what happens when there's a partial stripe
> left over at the end of a raidframe array.

RAIDframe truncates to the last full stripe.

> If you do ever decide to redo things, I'd make that size be a
> multiple of 2048 too (maybe a multiple of 2048 - 128).  Wasting a
> couple of thousand sectors (1 MB) won't hurt (and that's the max).
> 
> 
> But overall I think your basic layout is fine, and there's no need to
> adjust that.  The one thing that you need to do (if you really need
> better performance, rather than just think you should have it - that
> is, if you need it enough to re-init the filesystem) would be to
> change the ffs block size, or change the raidframe stripe size, so
> standard size block I/O turns into full size stripe I/O.
> 
> Doing that should improve performance.  Nothing else is likely to
> help.

The first thing I would do is test with these:

 time dd if=/dev/zero of=/home/testfile bs=64k count=32768
 time dd if=/dev/zero of=/home/testfile bs=10240k count=32768

so that at least you're sending 64K chunks to the disk... After that,
64K blocks on the filesystem are going to be next, and that might be
more effort than it's worth, depending on the results of the above
dd's... 

Later...

Greg Oster


Re: Beating a dead horse

2015-11-25 Thread William A. Mahaffey III

On 11/25/15 12:14, Robert Elz wrote:

 Date:Wed, 25 Nov 2015 10:52:30 -0600
 From:Greg Oster 
 Message-ID:  <20151125105230.209c5...@mickey.usask.ca>

   | Just to recap: You have a RAID set that is not 4K aligned with the
   | underlying disks.

Actually, I think it is - though we haven't seen all the data yet to
prove it.  The unaligned assumption was based upon misreading fdisk
output.  That was incorrect - but doesn't actually mean that it is
properly aligned, it probably is OK, but we have not yet seen the
BSD disklabel for the data disks to be sure (we know the BSD fdisk
partition is OK, but that's then divided into pieces, and we have
not - not recently anyway - been told that breakup).   We also haven't
seen the filesystem layout within the raid.

Everything else makes sense, and William, that does come from a raid expert,
you can believe what Greg says, take what I say as more of a semi-literate
guess.

One question though, Greg you suggest 32 block stripe, and 64K filesys
block size.   Would 16 block stripe and 32K blocksize work as well?

So, William, two more command outputs needed ...

disklabel wd0

(or any of the drives that are all identical, if they are not all
identical, then disklabel for each different layout).   We need to
see where wd0f (and wd1f etc) start...

And, for the raid set itself,

gpt show raid2

kre



Roger that (6 HDD's, all identical, all identically partitioned/sliced):

4256EE1 # disklabel wd0
# /dev/rwd0d:
type: ESDI
disk: HGST HTS721010A9
label: disk0
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 1938021
total sectors: 1953525168
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

6 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:   33554432      2048       RAID                     # (Cyl.      2*-   33290*)
 c: 1953523120      2048     unused      0     0        # (Cyl.      2*-  1938020)
 d: 1953525168         0     unused      0     0        # (Cyl.      0 -  1938020)
 e:   33554432  33556480       swap                     # (Cyl.  33290*-   66578*)
 f: 1886414256  67110912       RAID                     # (Cyl.  66578*-  1938020)

4256EE1 # gpt show raid2
       start        size  index  contents
           0           1         PMBR
           1           1         Pri GPT header
           2          32         Pri GPT table
          34          94
         128  7545656543      1  GPT part - NetBSD FFSv1/FFSv2
  7545656671          32         Sec GPT table
  7545656703           1         Sec GPT header
4256EE1 #

--

William A. Mahaffey III

 --

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
   -- Gen. George S. Patton Jr.



Re: Beating a dead horse

2015-11-25 Thread William A. Mahaffey III

On 11/25/15 14:26, Swift Griggs wrote:

On Thu, 26 Nov 2015, Robert Elz wrote:
FFS is OK on NetBSD-7 (not sure about LFS or others, never tried 
them). Raidframe might be (haven't looked) but both cgd and ccd are a 
mess...


I wonder if the same is true for LVM? Since it's relatively new, 
perhaps some of these issues were worked out in a more "modern" way 
that would properly take advantage of the NetBSD-7 kernel.


Personally, as a long-time sysadmin, having LVM fully implemented is a 
huge plus (especially when it fully integrates RAID features without 
another abstraction layer to deal with). I know that RAIDFrame is 
capable and I've used it myself, but I see LVM as being one of the 
(very) few design-by-committee projects that ever amounted to squat. 
Perhaps they should have involved ISO or IEEE to properly ruin it. 
Nonetheless, irrespective of my warm fuzzy coming from familiarity or 
actual design wisdom, LVM makes a lot of sense to me, if for no other 
reason than I get to shorten the mental-linked-list of block storage 
management tools like RAIDFrame, GEOM, VxVM, XVM, LVM, SVM, ZFS, 
BTRFS, AdvFS, LSM, etc. Being a support engineer for many 
flavors of Unix ... I get around.



-Swift



While LVM may have been designed by committee, I am pretty sure it was 
originally an SGI committee, & seems pretty good to me as well. All of 
my old SGI stuff worked like clockwork as long as it functioned (snif ).



--

William A. Mahaffey III

 --

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
   -- Gen. George S. Patton Jr.



Re: Beating a dead horse

2015-11-25 Thread Greg Oster
On Thu, 26 Nov 2015 01:08:21 +0700
Robert Elz  wrote:

> Date:Wed, 25 Nov 2015 10:52:30 -0600
> From:Greg Oster 
> Message-ID:  <20151125105230.209c5...@mickey.usask.ca>
> 
>   | Just to recap: You have a RAID set that is not 4K aligned with the
>   | underlying disks. 
> 
> Actually, I think it is - though we haven't seen all the data yet to
> prove it.  The unaligned assumption was based upon misreading fdisk
> output.  That was incorrect - but doesn't actually mean that it is
> properly aligned, it probably is OK, but we have not yet seen the
> BSD disklabel for the data disks to be sure (we know the BSD fdisk
> partition is OK, but that's then divided into pieces, and we have
> not - not recently anyway - been told that breakup).   We also haven't
> seen the filesystem layout within the raid.
> 
> Everything else makes sense, and William, that does come from a raid
> expert, you can believe what Greg says, take what I say as more of a
> semi-literate guess.
> 
> One question though, Greg you suggest 32 block stripe, and 64K filesys
> block size.   Would 16 block stripe and 32K blocksize work as well?

Yes, though because the IO chunks are smaller the performance would
likely be slightly poorer.  (and a 64K write of data would end up
spanning two stripes, instead of one, meaning twice the locking,
twice the number of read requests, etc., given that those things
don't get combined at the component level by RAIDframe...)

> So, William, two more command outputs needed ...
> 
>   disklabel wd0
> 
> (or any of the drives that are all identical, if they are not all
> identical, then disklabel for each different layout).   We need to
> see where wd0f (and wd1f etc) start...
> 
> And, for the raid set itself,
> 
>   gpt show raid2

Yes.. the results of this, and of 'disklabel wd0' will get to the heart
of the issue..

Later...

Greg Oster


Re: Beating a dead horse

2015-11-25 Thread Swift Griggs

On Wed, 25 Nov 2015, William A. Mahaffey III wrote:
While LVM may have been designed by committee, I am pretty sure it was 
originally an SGI committee, & seems pretty good to me as well.


As a guy who still supports ancient Unix platforms every day, I'll tell 
you that IRIX categorically rocks in my opinion, too (at least as far as a 
commercial OS goes and excluding the horrible inst/overlay install 
process). However, IIRC, it was HP, IBM, and someone else (the open 
group?). At least that was the story I was told. Remember that IRIX has 
XVM, which works well, but isn't all that similar to LVM. AIX and HP still 
have strong LVM implementations.


All of my old SGI stuff worked like clockwork as long as it functioned 
(snif ).


You are going to make me choke up, here. I'm sitting here looking at my 
Tezro and O2 and I'm cursing Rick Belluzo's name. SGI had a period in the 
90's where just about everything it released was chock full of awesome and 
made other vendors' gear look 10 years old the day it was released. The 
Indy and O2 were especially wonderful in their day...


-Swift


Re: Beating a dead horse

2015-11-25 Thread William A. Mahaffey III

On 11/25/15 12:47, Robert Elz wrote:

The real reason I wanted to reply to this message is that last line.

wd5 is not being used as a spare.  I kind of suspected that might be the case.
(Parts of it might be used for raid0 or raid1, that's a whole different
question and not material here).

Raidframe autoconfigures the in-use components (assuming autoconfig is
enabled for the raid array, which it is for you ...)

   | Component label for /dev/wd0f:
   | Autoconfig: Yes

(same for the other components.)   But spares are not autoconfigured.
If you want a "hot" spare (one that will automatically be used if one
of the other components fails, so you get back the reliability as soon
as possible, in case a second component also fails) rather than a "cold"
spare (one waiting to be used, but which needs human intervention to
actually make it happen - which is what you have now), then you need
to arrange to add the spare after every reboot.

There is no current standard way to make that happen (most of us tend to
be counting pennies, and have no spare drives ready at all - we wait for one
to fail, then go buy its replacement only when required...), so I'd just add

raidctl -a /dev/wd5f raid2



Thanks, just did it. I am probably not going to try that test, sounds 
kinda involved. I am still digesting this & other input from this thread 
to decide what to do next. Thanks & TIA :-) 


--

William A. Mahaffey III

 --

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
   -- Gen. George S. Patton Jr.



Fan control on nForce430 MCP61 chipset

2015-11-25 Thread Tom Sweet
Greetings all,

I'm trying to gain control of the cooling fans, CPU and case, for a
DIY home NAS. They're running at 100% regardless of load or idle
duration.

Fans run high speed using Xpenology (Synology distro), FreeBSD LiveCD
from Install img, and NetBSD 6.1.5 and 7.0.  Fans under control with
current Windows client and server operating systems at existing BIOS
config settings.  Unlike many others in searches, I don't have
amdcputemp or others in dmesg output.  I do have functioning acpi
power button and have processor clock control with estd.

The board is a Gigabyte m61PME-s2p rev 1.0 with an nForce 430 (MCP 61)
chipset and an iTE IT8718 hardware monitoring chip [1].  That chip is
in /etc/envsys.conf.

Uncommenting the relevant lines and running envstat -c /etc/envsys.conf [2] returns:
envstat: device `itesio0' doesn't exist.  Adding lines for nfsmbc0,
nfsmb0, nfsmb0, and iic0 resulted in "device doesn't exist, fix config
file".  envstat -D returns "no drivers registered".

Reviewing dmesg [3], there is no itesio0 in the output.  But I do have
nfsmbc0, nfsmb0/1, and iic0/1.  man nfsmb(4) indicates support for NVIDIA
nForce 2/3/4 SMBus controllers.
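
A couple of read-only checks that might narrow it down (assuming /usr/src
is present, which sysbuild normally leaves around):

  grep -i itesio /usr/src/sys/arch/amd64/conf/GENERIC
  dmesg | grep -i -e itesio -e superio -e isa0

If GENERIC doesn't configure itesio0 at all, no amount of envsys.conf
editing will make the sensors appear.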

Running NetBSD 7.0 stable Generic Kernel, upgraded from 6.1.5 via
sysbuild / sysupgrade.  I'm brand new to NetBSD, and have essentially
no modern *nix experience.

I've exhausted my talent here, anybody else have a tip?

TIA,
Tom Sweet

[1] http://www.gigabyte.com/products/product-page.aspx?pid=3009#sp
[2]
# $NetBSD: envsys.conf,v 1.12 2008/04/26 13:02:35 xtraeme Exp $
#
# --
# Configuration file for envstat(8) and the envsys(4) framework.
# --
#
# Devices are specified in the first block, sensors in the second block,
# and properties inside of the sensor block:
#
#   foo0 {
#   prop0 = value;
#   sensor0 { ... }
#   }
#
# Properties must be separated by a semicolon character and assigned by
# using the equal character:
#
#   critical-capacity = 10;
#
# Please see the envsys.conf(5) manual page for a detailed explanation.
#
# --
#   CONFIGURATION PROPERTIES FOR SPECIFIC DRIVERS AND MOTHERBOARDS
# --
#
# The following configuration blocks will report the correct
# values for the specified motherboard and driver. If you have
# a different motherboard and verified the values are not correct
# please email me at .
#
# --
# ASUS M2N-E (IT8712F Super I/O)
# --
#
# itesio0 {
#   # Fixup rfact for the VCORE_A sensor.
#   sensor3 { rfact = 180; }
#
#   # Fixup rfact and change description (VCORE_B = +3.3V).
#   sensor4 { description = "+3.3 Voltage"; rfact = 200; }
#
#   # Change description (+3.3V, unused sensor).
#   sensor5 { description = "Unused"; }
#
#   # Fixup rfact and change description for the +5V sensor.
#   sensor6 { description = "+5 Voltage"; rfact = 349; }
#
#   # Fixup rfact and change description for the +12V sensor.
#   sensor7 { description = "+12 Voltage"; rfact = 850; }
# }
#
# --
# Gigabyte P35C-DS3R (IT8718F Super I/O)
# --
#
# itesio0 {
#   # Fixup rfact and change description for the VCore sensor.
#   sensor3 { description = "VCore Voltage"; rfact = 100; }
#
#   # Change description (VCORE_B is DDR).
#   sensor4 { description = "DDR Voltage"; }
#
#   # Fixup rfact and change description for the +12V sensor.
#   sensor7 { description = "+12 Voltage"; rfact = 11600; }
#
#   # Fixup rfact for the -12V sensor.
#   sensor9 { rfact = 900; }
# }
# ---
# Gigabyte M61PME-S2P rev. 1.0
# 
#
# nfsmbc0 {
#   # test
#   sensor1 { description = "test"; }
# }
# nfsmb0 {
#   # test 2
#   sensor0 { description = "nfsmb0"; }
# }
# nfsmb {
#   # test 3
#   sensor3 { description = "nfsmb"; }
# }
iic0 {
#   # test 4
sensor4 { description = "i2c4"; }
 }
[3]
NetBSD 7.0 (GENERIC.201509250726Z)
total memory = 2014 MB
avail memory = 1938 MB
kern.module.path=/stand/amd64/7.0/modules
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
Gigabyte Technology Co., Ltd. M61PME-S2P ( )
mainbus0 (root)
ACPI: RSDP 0xf65f0 14 (v00 GBT   )
ACPI: RSDT 0x7def3000 38 (v01 GBTNVDAACPI 42302E31 NVDA 01010101)
ACPI: FACP 0x7def3040 74 (v01 GBTNVDAACPI 42302E31 NVDA 01010101)
ACPI: DSDT 0x7def30c0 0045A5 (v01 GBTNVDAACPI 1000 MSFT 0300)
ACPI: FACS 0x7def 40
ACPI: SSDT 0x7def7780 0007BA (v01 PTLTD  POWERNOW 0001  LTP 0001)
ACPI: HPET 0x7def7f40 38 (v01 GBTNVDAACPI 42302E31 

Re: Beating a dead horse

2015-11-25 Thread William A. Mahaffey III

On 11/25/15 19:36, Robert Elz wrote:

 Date:Wed, 25 Nov 2015 19:08:59 -0553.75
 From:"William A. Mahaffey III" 
 Message-ID:  <56565a61.7080...@hiwaay.net>

   | The other command is still running, will write out 320 GB by my count,
   | is that as intended, or a typo :-) ? If as wanted, I will leave it going
   | & report back when it is done.

Kill it, those tests are testing precisely nothing.

If you want to try a slightly better test, try with bs=32k, so you are at
least having dd write file system sized blocks.   Won't help with raidframe
overheads, but at least you'll optimise the filesystem overheads as much
as possible.

But I think now that it is clear that you could improve performance if you
rebuild the filesystem with -b 65536 (and -f one of 8192 16384 32768, take
your pick... 8192 would be most space saving).

The question is whether the improvement is really needed for real work,
as opposed to meaningless benchmarks (which dd really isn't anyway, if
you want a real benchmark pick something, perhaps bonnie, from the
benchmarks category in pkgsrc).   And whether it will be enough to be
worth the pain - unfortunately, there's no real way to know how much
improvement you'd get without doing the work first.

Only you can answer those questions - you know what the work is, and
how effectively your system is coping as it is now configured.

As I said before, when it is me, I just ignore all of this, I don't care
what the i/o throughput is (or could be) because in practice there just
isn't enough i/o (on raid5 based filesystems) in my system to matter.
So I optimise for other things - of which the most valuable (to me) is
my time!

kre




Well, just for posterity:


4256EE1 # time dd if=/dev/zero of=/home/testfile bs=10240k count=32768
^C
22515+0 records in
22514+0 records out
236076400640 bytes transferred in 20550.826 secs (11487440 bytes/sec)
time: Command terminated abnormally.
20552.92 real 0.09 user  1592.38 sys
4256EE1 #


--

William A. Mahaffey III

 --

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
   -- Gen. George S. Patton Jr.



Re: Beating a dead horse

2015-11-25 Thread Andreas Gustafsson
Greg Troxel wrote:
> > I would go further than that.  Alignment is not only an issue with 4K
> > sector disks, but also with SSDs, USB sticks, and SD cards, all of
> > which are being deployed in sizes smaller than 128 GB even today.
> 
> I didn't realize that.  Do these devices have 4K native sectors?

They don't have sectors so much as flash pages, and the page size
varies from device to device.  For example, Intel Datacenter SSDs have
either 8 KB or 16 KB pages according to this document:

  
http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/ssd-partition-alignment-tech-brief.pdf

> So you mean 2048 for >= 1G?

Yes.

> Is there a clear motivation for 2048 vs 64?   Are there any known
> devices where that matters?

Intel specifically recommends 1 MB alignment (2048 sectors) in the
document linked above.  Also, 128 kB is a common erase block size in
consumer flash media, but I'm not sure how relevant that is to
partition alignment.

> But why is just changing 63 to 64 always an issue?  Or using that for
> >= 1G?   Other than complexity, and having 3 branches.

I don't have a problem with changing the 63 to 64 for the smallest
devices, or with having 3 branches.  I would just like an 80 GB SSD,
for example, to get get aligned to 1 MB per Intel's recommendation.
-- 
Andreas Gustafsson, g...@gson.org


Re: Beating a dead horse

2015-11-25 Thread William A. Mahaffey III

On 11/25/15 00:30, Robert Elz wrote:

 Date:Tue, 24 Nov 2015 21:57:50 -0553.75
 From:"William A. Mahaffey III" 
 Message-ID:  <56553074.9060...@hiwaay.net>


   | 4256EE1 # time dd if=/dev/zero of=/home/testfile bs=16k count=32768
   | 32768+0 records in
   | 32768+0 records out
   | 536870912 bytes transferred in 22.475 secs (23887471 bytes/sec)
   | 23.28 real 0.10 user 2.38 sys
   | 4256EE1 #
   |
   | i.e. about 24 MB/s.

I think I'd be happy enough with that, maybe it can be improved a little.

   | When I zero-out parts of these drive to reinitialize
   | them, I see ~120 MB/s for one drive.

Depending upon just how big those "parts" are, that number might be
an illusion.You need to be writing at least about as much as
you did in the test above to reduce the effects of write behind
(caching in the drive) etc.   Normally a "zero to reinit" write
doesn't need nearly that much (often just a few MB) - writing that
much would be just to the drive's cache, and measuring that speed
is just measuring DMA rate, and useless for anything.


   | RAID5 stripes I/O onto the data
   | drives, so I expect ~4X I/O speed w/ 4 data drives. With various
   | overheads/inefficiencies, I (think I) expect 350-400 MB/s writes.

That's not going to happen.   Every raid write (whatever raid level, except 0)
requires 2 parallel disc writes (at least) - you need that to get the
redundancy that is the R in the name - it can also require reads.

For raid 5, you write to the data drive (one of the 4 of them) and to the
parity drive - that is, all writes end up having a write to the parity
drive, so the upper limit on speed for a contiguous write is that of one
drive (a bit less probably, depending upon which controllers are in use,
as the data still needs to be transmitted twice, and each controller can
only be transferring for one drive at a time .. at least for the kinds of
disc controllers in consumer grade equipment.)   If both data and parity
happen to be using the same controller, the max rate will certainly be
less (measurably less, though perhaps not dramatically) than what you can
achieve with one drive.  If they're on different controllers, then in
ideal circumstances you might get close to the rate you can expect from
one drive.

For general purpose I/O (writes all over the filesystem, as you'd see in
normal operations) that's mitigated by there not really being one parity
drive, rather, all 5 drives (the 4 you think of as being the data drives,
and the one you think of as being the parity drive) perform as both data
and parity drives, for different segments of the raid, so there isn't
really (in normal operation) a one drive bottleneck -- but with 5
drives, and 2 writes needed for each I/O, the best you could possibly
do is twice as fast as a single drive in overall throughput.  In practice
you'd never see that however, real workloads just aren't going to be
just that conveniently spread out in just the right parts of the filesystems,
if you ever get faster than a single drive can achieve, I'd be surprised.
If you ever even approach what a single drive can achieve, I'd be surprised.


Hmmm ... I thought that the RAID5 would write 1 parity byte & 4 data 
bytes in parallel, i.e. no '1 drive bottleneck'. AFAIUT, parity data is 
calculated for N data drives, 1 byte of parity data per N bytes written 
to the N data drives, then the N+1 bytes are written to the N+1 drives, no ?




Now, in the above (aside from the possible measurement error in your 120MB/s)
I've been allowing you to think that's "what a single drive can achieve".
It isn't.  That's raw I/O onto the drive, and will run at the best possible
speed that the drive can handle - everything is optimised for that case, as
it is one of the (meaningless) standard benchmarks.

For real use, there are also filesystem overheads to consider, your raid
test was onto a file, on the raid, not onto the raw raid (though I wouldn't
expect that to be all that much faster, certainly not more than about 60MB/s
assuming the 120 MB/s is correct).

To get a more valid baseline, what you can actually expect to observe,
you need to be comparing apples to apples - that is, the one drive test
needs to also have a filesystem, and you need to be writing a file to it.

To test that, take your hot spare (ie: unused) drive, and build a ffs on
it instead of raid (unfortunately, I think you need to reboot to get it
out of being a raidframe spare first - as I recall, raidframe has no
"stop being a spare" operation ... it should have, but ...).  Just stop
it being added back as a hot spare (assuming you are actually doing that now,
spares don't get autoconfigured.)   (Wait till down below to see how to
do the raidctl -s to see if the hot spare is actually configured or not,
raidctl will tell you, once you run it properly).

Then build a ffs on the spare drive (after it is no longer a spare)
(you'll need to change the 

Re: Beating a dead horse

2015-11-25 Thread Greg Troxel

Robert Elz  writes:

> Date:Mon, 23 Nov 2015 11:18:48 -0553.75
> From:"William A. Mahaffey III" 
> Message-ID:  <5653492e.1090...@hiwaay.net>
>
> Much of what you wanted to know has been answered already I think, but
> not everything, so
>
> (in a different order than they were in your message)
>
>   | Also, why did my fdisk 
>   | choose those values when his chose apparently better ones ?
>
> There's a size threshold - drives smaller get the offset 63 stuff,
> and drives larger, 2048 ... the assumption is that on small drives
> you don't want to waste too much space, but on big ones a couple of
> thousand sectors is really irrelevant...

I can see that 2048 is a multiple of many more values, but I wonder
about changing 63 to 64, which is a multiple of 8 and thus good enough
for the 4K issue, and wastes only 512 bytes.  (In my experience drives
of 1T and maybe even 750G are showing up with 4K sectors.)
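
Quick check in sh, just to spell out the divisibility:

  $ echo $((63 % 8)) $((64 % 8))
  7 0

so a start of 63 straddles 4K sectors, while 64 lines up exactly.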

The other thing would be to change the alignment threshold to 128G.
Even that's big enough that 1M not used by default is not important.
And of course people who care can do whatever they want anyway.





Re: Beating a dead horse

2015-11-25 Thread Andreas Gustafsson
Greg Troxel wrote:
> The other thing would be to change the alignment threshold to 128G.
> Even that's big enough that 1M not used by default is not important.
> And of course people who care can do whatever they want anyway.

I would go further than that.  Alignment is not only an issue with 4K
sector disks, but also with SSDs, USB sticks, and SD cards, all of
which are being deployed in sizes smaller than 128 GB even today.

My vote is for a threshold of 1 G.  This means that worst case, we
could end up wasting as much as 0.1% of the capacity on alignment
(the horror!).
-- 
Andreas Gustafsson, g...@gson.org


Re: Beating a dead horse

2015-11-25 Thread Greg Troxel

Andreas Gustafsson  writes:

> Greg Troxel wrote:
>> The other thing would be to change the alignment threshold to 128G.
>> Even that's big enough that 1M not used by default is not important.
>> And of course people who care can do whatever they want anyway.
>
> I would go further than that.  Alignment is not only an issue with 4K
> sector disks, but also with SSDs, USB sticks, and SD cards, all of
> which are being deployed in sizes smaller than 128 GB even today.

I didn't realize that.  Do these devices have 4K native sectors?

> My vote is for a threshold of 1 G.  This means that worst case, we
> could end up wasting as much as 0.1% of the capacity on alignment
> (the horror!).

So you mean 2048 for >= 1G?  That's ok with me, too.

Is there a clear motivation for 2048 vs 64?   Are there any known
devices where that matters?

But why is just changing 63 to 64 always an issue?  Or using that for
>= 1G?   Other than complexity, and having 3 branches.



