Re: Beating a dead horse
On Wed, 25 Nov 2015, Andreas Gustafsson wrote: They don't have sectors so much as flash pages, and the page size varies from device to device.

I'm curious about something, probably due to ignorance of the full dynamics of the vfs(9) layer. Why is it that folks don't choose file system block sizes and partition offsets that are common multiples of the sizes at the hardware layer? I.e., let's say the hard disk uses 4K pages, the file system uses 8K blocks, and the vendor recommends that you stay aligned to a 1GB value. Wouldn't operating on 8K blocks still satisfy the underlying device (since 8K operations are always divisible by 4K)? And the 1GB alignment may not always be perfect, but the 8K ops below it would eventually stack to exactly 1GB, too. Is it all about waste at the file system layer, due to some block operations being optimized for large devices and buffers but not being as applicable (or being downright wasteful) on smaller block devices? I'm just asking so I can better follow the conversation you more experienced folks are having. Thanks, Swift
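Swift's divisibility argument can be sanity-checked with a little shell arithmetic (hypothetical sizes, just to illustrate the point; the offset is the classic 63-sector partition start discussed later in the thread):

```shell
# If every filesystem block is a multiple of the device page size,
# any block-aligned write is also page-aligned -- provided the
# partition itself starts on a page boundary.
page=4096               # hypothetical device page size
fsblock=8192            # hypothetical filesystem block size
offset=$((63 * 512))    # 63-sector partition start, in bytes

[ $((fsblock % page)) -eq 0 ] && echo "block size ok"
[ $((offset % page)) -eq 0 ] && echo "start aligned" || echo "start misaligned"
```

The block size checks out, but the 63-sector start does not, which is exactly where the trouble in this thread comes from.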
Re: Beating a dead horse
On Tue, 24 Nov 2015 21:57:50 -0553.75 "William A. Mahaffey III" wrote:

> On 11/24/15 19:08, Robert Elz wrote:
> > Date: Mon, 23 Nov 2015 11:18:48 -0553.75
> > From: "William A. Mahaffey III"
> > Message-ID: <5653492e.1090...@hiwaay.net>
> >
> > | The machine works well except for horribly slow I/O to the RAID5 I setup
> >
> > What is your definition of "horribly slow" and are we talking read
> > or write or both ?
>
> 4256EE1 # time dd if=/dev/zero of=/home/testfile bs=16k count=32768
> 32768+0 records in
> 32768+0 records out
> 536870912 bytes transferred in 22.475 secs (23887471 bytes/sec)
> 23.28 real 0.10 user 2.38 sys
> 4256EE1 #

Just to recap: You have a RAID set that is not 4K-aligned with the underlying disks. Fixing that will certainly help, but there is a bit more tweaking to be done:

1) Your stripe size is 32 blocks - 16K. So if you are writing in 16K chunks (as with the dd above), at best you're guaranteed to be reading the old data, reading the old parity, writing the new data, and writing the new parity for each block. That's the 'small write problem' with RAID 5. (At worst, that 16K might span two components, in which case you'll have to wait for three reads and three writes to complete, instead of two reads and two writes.)

2) Your filesystem block size is 32K, meaning a filesystem block write will never give you a full stripe write either. See 1) for the performance implications.

3) If you want more write speed, then:
 a) get the RAID components and the RAID 4K-aligned with the underlying disks. Your 32 block stripe width is perfect. The number of data components is good.
 b) get the filesystem 4K-aligned
 c) use a 64K block size for the filesystem
 d) use 'bs=64k' (or 'bs=10m') to see better IO performance

4) As others have said, if you need high-performance writes, RAID 5 is probably not what you want, especially if you're not streaming writes in 64K chunks.

Later... Greg Oster
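To see why 64K is the magic number in Greg's 3c): with a 32-sector (16K) stripe unit and 5 components (so 4 data disks per stripe), a full-stripe write works out to exactly one 64K filesystem block. A quick check:

```shell
sect_per_su=32     # stripe unit: 32 sectors * 512B = 16K
ndata=4            # 5 components, one stripe unit's worth of parity per stripe
full_stripe=$((sect_per_su * 512 * ndata))
echo "$full_stripe bytes"    # 65536 bytes = 64K
```

A 64K filesystem block write then replaces an entire stripe, so RAIDframe can compute the new parity from the data being written and skip the read-modify-write cycle entirely.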
Re: Beating a dead horse
Swift Griggs writes:

> I'm curious about something, probably due to ignorance of the full
> dynamics of the vfs(9) layer. Why is it that folks don't choose file
> system block sizes and partition offsets that are common multiples of
> the sizes at the hardware layer? I.e., let's say the hard disk uses 4K
> pages, the file system uses 8K blocks, and the vendor recommends that
> you stay aligned to a 1GB value. Wouldn't operating on 8K blocks still
> satisfy the underlying device (since 8K operations are always
> divisible by 4K)? And the 1GB alignment may not always be perfect, but
> the 8K ops below it would eventually stack to exactly 1GB, too.

Good questions, and it boils down to a few things:

- Many devices don't have a way to report their underlying block sizes. For example, if you buy a 2T spinning disk, it will very likely be one that has sectors that are actually 4K but an interface of 512B sectors. Reads are fine, because the drive pulls a 4K sector into its cache and then hands you the piece you want. But when you write a 512-byte sector, it has to read-modify-write. Worse, if you write 4K or 8K but not lined up (which you will if your fs has 8K blocks but is aligned to 63), it has to read-modify-write 2 sectors per write.

- SSDs are even harder to figure out, as Andreas's helpful references in response to my question show.

- Filesystems sometimes get moved around, and higher up the stack it's even more disconnected from the actual hardware.

So there are two issues: alignment and filesystem block/frag size, and both have to be ok. For larger disks, UFS uses larger block sizes by default (man newfs). So that's ok, but alignment is messier. We're seeing smaller disks with 4K sectors or larger flash erase blocks and 512B interfaces now. And there are also disks with native 4K sectors, where the interface to the computer transfers 4K chunks. That avoids the alignment issue, but requires filesystem/kernel support.
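The read-modify-write penalty described above is easy to see numerically. An 8K write starting at 512B-sector 63 (the traditional partition start) touches these physical 4K sectors, each of which spans 8 emulated 512B sectors:

```shell
start=63; nsec=16    # an 8K write = 16 emulated 512B sectors
first=$((start / 8))
last=$(((start + nsec - 1) / 8))
echo "spans physical sectors $first..$last"    # 7..9: three physical sectors
```

Three physical sectors are touched for a two-sector payload, and the two boundary sectors each need a read-modify-write.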
I am pretty sure netbsd-7 is ok with that, but I am not sure about earlier. It would probably be possible to add a call into drivers to return this info and propagate it up, and have newfs/fdisk query it. I am not sure that all disks report the info reliably enough, and there are probably a lot of details. But it's more work, and doesn't necessarily do better than "just start at 2048 and use big blocks". Certainly you are welcome to read the code and think about it if this interests you - just explaining why I think no one has done the work so far.

> Is it all about waste at the file system layer due to some block
> operations being optimized for large devices and buffers but not being
> as applicable (or being downright wasteful) on smaller block devices?

I guess you can put it that way, saying otherwise we would always start at 2048 and use 32K or even 64K blocks. But I think part of it is inertia. And the 63 start dates back to floppies that had 63-sector tracks - so it was actually aligned.
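The reason "just start at 2048" works as a catch-all: 2048 512-byte sectors is 1 MiB, which divides evenly by every common physical block and erase-block size. A quick check:

```shell
start_bytes=$((2048 * 512))          # 1 MiB partition start
for blk in 4096 8192 131072; do      # 4K sector, 8K page, 128K erase block
  [ $((start_bytes % blk)) -eq 0 ] && echo "aligned to $blk"
done
```

So without any way to query the device, a 1 MiB start is a safe bet for all of the layouts mentioned in this thread.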
Re: Beating a dead horse
On Wed, 25 Nov 2015, Greg Troxel wrote:

> So there are two issues: alignment and filesystem block/frag size, and
> both have to be ok.

Ahh, a key point to be certain.

> So that's ok, but alignment is messier.

It sure seems that way! :-)

> We're seeing smaller disks with 4K sectors or larger flash erase blocks
> and 512B interfaces now.

Those larger erase blocks (128k?!) would seem to be a big problem if you'd rather stick to a smaller alignment or file system block size. However, maybe I'm missing something about the interplay between things like inode sizes and any relationship with the erase block size. For example, if you were using an 8K block size, I'd think a 128k erase block size would blow things up nicely when you blew way past the 8k block you originally wanted to clear. Again, however, I feel I'm missing something, but let me say right away that I'm grateful for your detailed response and I learned much from it.

> That avoids the alignment issue, but requires filesystem/kernel
> support. I am pretty sure netbsd-7 is ok with that but I am not sure
> about earlier.

Interesting. For some reason I have it in my head that many EMC SANs of various stripes (and I know for sure the XtremeIO stuff) use 4k block sizes. I've used NetBSD in the past in EMC environments and seen moderate (but not what I'd call "high") performance vis-a-vis other (Linux and/or Solaris) hosts. I wonder if this was a factor. I'll have to test with NetBSD 7 now. Sounds fun.

> It would probably be possible to add a call into drivers to return this
> info and propagate it up and have newfs/fdisk query it.

That sounds pretty automagical (in a good way). However, as you point out, it might be tedious for not enough payoff, considering that using a 2048 offset seems to address the issue.

> Certainly you are welcome to read the code and think about it if this
> interests you - just explaining why I think no one has done the work so
> far.
I did, in fact, take a look at /usr/src/sbin/newfs/mkfs.c and /usr/include/ufs/ffs as well. I'm a kernel lightweight, and I got all confused by discussions in the code about block fragments and minimum / maximum block sizes. Here I was thinking all that was static. I did learn that 4k seems to be the smallest allowable size (line 141 in fs.h: #define MINBSIZE 4096). It sounds like that's the smallest since a cylinder group block struct needs a 4k space to be workable. I also noticed that superblock sizes seem to operate independently of the data block sizes. It's a great day for learning!

> I guess you can put it that way, saying otherwise we would always start
> at 2048 and use 32K or even 64K blocks.

That doesn't seem too wasteful to me, but I'm ready for some embedded systems guy to lecture me. :-)

> But I think part of it is inertia. And the 63 start dates back to
> floppies that had 63-sector tracks - so it was actually aligned.

Ah! I always wondered where that came from. I remember floppy disk controllers were bizarre animals. Physical characteristics were so important that you could easily create differences that'd result in a completely unreadable floppy. E.g., that was always a big issue trying to get my Amiga floppies to read on a PC; I had to buy a special dual-mode floppy controller et al. After all that hassle, I still have some stupid emotional affinity for 3.5" floppies because of how much more wonderful they seemed over 5.25" floppies. Ugh... nostalgia can be painful sometimes :-P

Thanks again for the detailed response. I learned a ton from this discussion. Thanks, Swift
Re: (unknown)
In article <24245.1448479...@andromeda.noi.kre.to>, Robert Elzwrote: >Date:Wed, 25 Nov 2015 12:25:50 -0600 (CST) >From:"Jeremy C. Reed" >Message-ID: > > | Is this expected behavior? Undefined? A bug? > >parsedate is full of bugs. I have a fix for some of them (not that >one, as that more does depend upon what it "should" do) that I've >sent to a few people. > >I think the plan was for apb to incorporate it, but he seems to be MIA >at the minute, so nothing has happened yet. > >I can send the patches to anyone who wants. Send them here; preferably with unit-tests :-) christos
Re: Beating a dead horse
Date:Wed, 25 Nov 2015 15:59:29 -0600 From:Greg OsterMessage-ID: <20151125155929.2a5f2...@mickey.usask.ca> | time dd if=/dev/zero of=/home/testfile bs=64k count=32768 | time dd if=/dev/zero of=/home/testfile bs=10240k count=32768 | | so that at least you're sending 64K chunks to the disk... Will that really work? Wouldn't the filesystem divide the 64k writes from dd into 32K file blocks, and write those to raidframe? I doubt those tests would be productive. They'd need to write to the raw rraid0d (or rraid0a) to be effective I think, and that would destroy all the data... kre
Re: Beating a dead horse
On 11/25/15 16:05, Greg Oster wrote:

On Thu, 26 Nov 2015 04:41:02 +0700 Robert Elz wrote:

Date: Wed, 25 Nov 2015 14:57:02 -0553.75 From: "William A. Mahaffey III" Message-ID: <56561f54.5040...@hiwaay.net>

| f: 1886414256 67110912 RAID # (Cyl. 66578*- 1938020)

OK, 67110912 is a multiple of 2^11 (2048) which is just fine. The size is a multiple of 2^4 (16) so that's OK too.

| 128 7545656543 1 GPT part - NetBSD FFSv1/FFSv2

The 128 is what I was expecting, from the dk0 wedgeinfo, and that's fine. The size is weird, but I don't think it should give a problem. Greg will be able to say what happens when there's a partial stripe left over at the end of a raidframe array.

RAIDframe truncates to the last full stripe.

If you do ever decide to redo things, I'd make that size be a multiple of 2048 too (maybe a multiple of 2048 - 128). Wasting a couple of thousand sectors (1 MB) won't hurt (and that's the max). But overall I think your basic layout is fine, and there's no need to adjust that. The one thing that you need to do (if you really need better performance, rather than just think you should have it - that is, if you need it enough to re-init the filesystem) would be to change the ffs block size, or change the raidframe stripe size, so standard size block I/O turns into full size stripe I/O. Doing that should improve performance. Nothing else is likely to help.

The first thing I would do is test with these:

time dd if=/dev/zero of=/home/testfile bs=64k count=32768
time dd if=/dev/zero of=/home/testfile bs=10240k count=32768

so that at least you're sending 64K chunks to the disk... After that, 64K blocks on the filesystem are going to be next, and that might be more effort than it's worth, depending on the results of the above dd's... Later...
Greg Oster

4256EE1 # time dd if=/dev/zero of=/home/testfile bs=64k count=32768
32768+0 records in
32768+0 records out
2147483648 bytes transferred in 166.255 secs (12916806 bytes/sec)
167.20 real 0.12 user 8.85 sys
4256EE1 #

The other command is still running; it will write out 320 GB by my count. Is that as intended, or a typo :-) ? If as wanted, I will leave it going & report back when it is done. BTW, I see much more of the above 13-ish MB/s than the 24-ish reported earlier. When I posted earlier (a few weeks ago) I think I had about 18 MB/s, but 12-15 is much more common, apropos of nothing, if it is nominally as expected. Thanks & TIA for any more insight

-- William A. Mahaffey III -- "The M1 Garand is without doubt the finest implement of war ever devised by man." -- Gen. George S. Patton Jr.
Re: Beating a dead horse
Date: Wed, 25 Nov 2015 19:08:59 -0553.75 From: "William A. Mahaffey III" Message-ID: <56565a61.7080...@hiwaay.net>

| The other command is still running, will write out 320 GB by my count,
| is that as intended, or a typo :-) ? If as wanted, I will leave it going
| & report back when it is done.

Kill it, those tests are testing precisely nothing. If you want to try a slightly better test, try with bs=32k, so you are at least having dd write file-system-sized blocks. Won't help with raidframe overheads, but at least you'll optimise the filesystem overheads as much as possible.

But I think now that it is clear that you could improve performance if you rebuild the filesystem with -b 65536 (and -f one of 8192 16384 32768, take your pick... 8192 would be most space saving). The question is whether the improvement is really needed for real work, as opposed to meaningless benchmarks (which dd really isn't anyway; if you want a real benchmark pick something, perhaps bonnie, from the benchmarks category in pkgsrc). And whether it will be enough to be worth the pain - unfortunately, there's no real way to know how much improvement you'd get without doing the work first. Only you can answer those questions - you know what the work is, and how effectively your system is coping as it is now configured.

As I said before, when it is me, I just ignore all of this. I don't care what the i/o throughput is (or could be) because in practice there just isn't enough i/o (on raid5 based filesystems) on my system to matter. So I optimise for other things - of which the most valuable (to me) is my time!

kre
[no subject]
At Wed, 25 Nov 2015 10:00:00 +0000 (UTC) I had a cron job run:

for tz in America/Los_Angeles America/Chicago America/New_York \
    Asia/Tokyo Europe/Berlin ; do
  TZ=$tz date -d "Wednesday 22:00utc" +"%A %B %d %I:%M %p %z %Z ${tz}" ;
done

This resulted in:

Wednesday November 25 12:00 PM -0800 PST America/Los_Angeles
Wednesday November 25 02:00 PM -0600 CST America/Chicago
Wednesday November 25 03:00 PM -0500 EST America/New_York
Wednesday December 02 05:00 AM +0900 JST Asia/Tokyo
Wednesday November 25 09:00 PM +0100 CET Europe/Berlin

Notice the December 02 above. An easy workaround is to also add today's date to the -d parsedate string above.

Is this expected behavior? Undefined? A bug?
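For the record, the "add today's date" workaround looks something like this (shown here with GNU date on Linux for reproducibility; NetBSD's date -d goes through parsedate(3), which is the code under discussion, so results there may differ):

```shell
# Pinning an explicit date removes the ambiguity of a bare "Wednesday",
# which otherwise may resolve to next week depending on the zone:
date -u -d "2015-11-25 22:00" +"%A %B %d %I:%M %p %Z"
```

With the date spelled out, every zone agrees on which Wednesday is meant.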
[no subject]
Date:Wed, 25 Nov 2015 12:25:50 -0600 (CST) From:"Jeremy C. Reed"Message-ID: | Is this expected behavior? Undefined? A bug? parsedate is full of bugs. I have a fix for some of them (not that one, as that more does depend upon what it "should" do) that I've sent to a few people. I think the plan was for apb to incorporate it, but he seems to be MIA at the minute, so nothing has happened yet. I can send the patches to anyone who wants. kre
Re: Beating a dead horse
Date:Wed, 25 Nov 2015 12:29:15 -0500 From:Greg TroxelMessage-ID: | And, there are also disks with native 4K sectors, where the interface to | the computer transfers 4K chunks. That avoids the alignment issue, but | requires filesystem/kernel support. I am pretty sure netbsd-7 is ok | with that but I am not sure about earlier. FFS is OK on NetBSD-7 (not sure about LFS or others, never tried them). Raidframe might be (haven't looked) but both cgd and ccd are a mess... I have been looking into cgd, I have exactly that kind of drive, and want to put CGD on it. It "works" but counts 4K sectors as if they were 512 bytes each... (ie: the cgd is 1/8 the size it should be). Fixing that the easy way loses all the alignment (and since it is required, not just inefficient, would probably cause breakage if you're not very very careful). Fixing it the right way is hard the way things are right now (it is real easy to cause panics trying ... I know, been there, done that...) | It would probably be possible to add a call into drivers to return this | info and propagate it up and have newfs/fdisk query it. Something like that is a part of what I am thinking about to make cgds on 4k sector drives work properly (if/when I make cgd work, I'll probably look at ccds though I have no personal requirement for those, but it ought to just fall out of the cgd fixes I think). One real issue, is that on 4K sector discs, things like LABELSECTOR change meaning (that is, they retain their meaning, but that meaning means a different thing). 
That is, you cannot simply dd a 512 byte/sector drive to an equivalent sized 4k sector drive and have it have any hope at all of working - the labels will be in the wrong place, and contain the wrong data (disklabels on 4k sector drives are kind of unspecified, but by analogy with MBRs and GPT need to count in units of sectors, not DEV_BSIZE's) The one big advantage of 4k sector drives (and the reason I suspect they exist, as a different species than the drives with emulated 512 byte sectors) is that they can grow much bigger without blowing the size limits inherent in 32 bit based labels (like MBRs and disklabels). That is, while we all have been assuming that for drives > 2TB GPT was mandatory, it turns out it isn't, disklabels and/or MBR work just fine, provided the sectors are 4K rather than 512... I had not been thinking about passing this drive info all the way up to newfs (and other userland tools) but there's no reason that couldn't be done. kre
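The size-limit point above is just 32-bit arithmetic: a label that counts sectors in 32-bit fields tops out at 2^32 sectors, so the addressable capacity scales with the sector size:

```shell
max_sectors=4294967296    # 2^32 sector numbers in a 32-bit label field
tib=1099511627776         # bytes per TiB
echo "$((max_sectors * 512 / tib)) TiB with 512B sectors"
echo "$((max_sectors * 4096 / tib)) TiB with 4K sectors"
```

Hence kre's observation: with native 4K sectors, MBR and disklabel reach well past the familiar 2 TiB ceiling without needing GPT.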
Re: Beating a dead horse
Date: Wed, 25 Nov 2015 08:10:50 -0553.75 From: "William A. Mahaffey III" Message-ID: <5655c020.5090...@hiwaay.net>

In addition to what I said in the previous message ...

| Hmmm ... I thought that the RAID5 would write 1 parity byte & 4 data
| bytes in parallel, i.e. no '1 drive bottleneck'. AFAIUT, parity data is
| calculated for N data drives, 1 byte of parity data per N bytes written
| to the N data drives, then the N+1 bytes are written to the N+1 drives, no ?

As I understand it, that's how the mathematics works ... but (like many things) the theory and the practice don't always cooperate quite like that. To see what really happens, you have to imagine writing a single disk block (because that's what raidframe gets from the filesystem layer) in isolation. Because of striping, you could imagine it in your model as if you wanted to write just one byte. You finish that, and then you write one more byte, and finish that... In reality it isn't done as bytes (that's too inefficient) but as stripes, but the principle is the same.

| 4256EE1 # raidctl -s raid2
| Components:
| /dev/wd0f: optimal
| /dev/wd1f: optimal
| /dev/wd2f: optimal
| /dev/wd3f: optimal
| /dev/wd4f: optimal
| No spares.

The real reason I wanted to reply to this message is that last line. wd5 is not being used as a spare. I kind of suspected that might be the case. (Parts of it might be used for raid0 or raid1; that's a whole different question and not material here.) Raidframe autoconfigures the in-use components (assuming autoconfig is enabled for the raid array, which it is for you ...)

| Component label for /dev/wd0f:
| Autoconfig: Yes

(same for the other components.) But spares are not autoconfigured.
If you want a "hot" spare (one that will automatically be used if one of the other components fails, so you get back the reliability as soon as possible, in case a second component also fails) rather than a "cold" spare (one waiting to be used, but which needs human intervention to actually make it happen - which is what you have now), then you need to arrange to add the spare after every reboot. There is no current standard way to make that happen (most of us tend to be counting pennies, and have no spare drives ready at all - we wait for one to fail, then go buy its replacement only when required...), so I'd just add

raidctl -a /dev/wd5f raid2

in /etc/rc.local. You might want to defer doing that though until after having everything else sorted out - at the minute, wd5f is spare scratch space, being used by nothing - you could make a ffs filesystem on it, to measure the speed you can get to a single drive filesystem. You would need to alter its partition type from RAID to 4.2BSD in the disklabel first (and then put it back again after you are done testing with the filesystem and ready to make it back into a spare.)

Sometime or other we ought to either arrange for spares to autoconfig (but I suspect that would be a job for Greg, and that probably means not anytime soon...) or at least to have a standard rc.d script that would turn on any configured (and unused) spares for autoconfigured raid sets, without needing evil hacks like the one I just suggested sticking in rc.local... Doing the second of those is probably within my abilities, so I might take a crack at that one.

kre
Re:
=> At Wed, 25 Nov 2015 10:00:00 +0000 (UTC) I had a cron job run:
=>
=> for tz in America/Los_Angeles America/Chicago America/New_York \
=> Asia/Tokyo Europe/Berlin ; do
=> TZ=$tz date -d "Wednesday 22:00utc" +"%A %B %d %I:%M %p %z %Z ${tz}" ;
=> done
=>
=> This resulted in:
=>
=> Wednesday November 25 12:00 PM -0800 PST America/Los_Angeles
=> Wednesday November 25 02:00 PM -0600 CST America/Chicago
=> Wednesday November 25 03:00 PM -0500 EST America/New_York
=> Wednesday December 02 05:00 AM +0900 JST Asia/Tokyo
=> Wednesday November 25 09:00 PM +0100 CET Europe/Berlin
=>
=> Notice the December 02 above.
=>
=> An easy workaround is to also add today's date to the -d parsedate
=> string above.
=>
=> Is this expected behavior? Undefined? A bug?

FWIW, I get similar results on my Linux box (Ubuntu 14.04).

Wednesday November 25 02:00 PM -0800 PST America/Los_Angeles
Wednesday November 25 04:00 PM -0600 CST America/Chicago
Wednesday November 25 05:00 PM -0500 EST America/New_York
Thursday December 03 07:00 AM +0900 JST Asia/Tokyo
Wednesday November 25 11:00 PM +0100 CET Europe/Berlin

Gary Duzan
Re: Beating a dead horse
Date: Thu, 26 Nov 2015 01:41:00 +0700 From: Robert Elz Message-ID: <23815.1448476...@andromeda.noi.kre.to>

| so I'd just add
|
| raidctl -a /dev/wd5f raid2
|
| in /etc/rc.local

Actually, a better way short term is probably to put your config file for raid2 in /etc/raid2.conf and disable autoconfig for raid2 (for /home, it is not really required ... but don't do this if there's any possibility that the wdN numbers might change - that is, if you might add new drives or rearrange the cabling in any way). For now anyway, don't include raidN.conf files in /etc for any raidN's that are autoconfigured.

To disable autoconfig:

raidctl -A no raid2

And make sure you have raidframe=yes in /etc/rc.conf. Then /etc/rc.d/raidframe will config raid2 for you at each boot, and that way of configuring does handle spares, so if /etc/raid2.conf says that wd5f should be a spare, it will be.

Still, do any of this only after you have finished using wd5f for any other testing you want to do.

kre
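For reference, a /etc/raid2.conf along the lines kre describes might look like this - a sketch only, assuming the 5-component, 32-sectors-per-stripe-unit RAID 5 set discussed earlier in the thread, with wd5f as the spare (check raidctl(8) on your system before trusting any of it):

```
START array
# numRow numCol numSpare
1 5 1

START disks
/dev/wd0f
/dev/wd1f
/dev/wd2f
/dev/wd3f
/dev/wd4f

START spare
/dev/wd5f

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
32 1 1 5

START queue
fifo 100
```

With this in place, /etc/rc.d/raidframe configures the set and attaches the spare at every boot, so no rc.local hack is needed.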
Re: Beating a dead horse
Date: Wed, 25 Nov 2015 14:57:02 -0553.75 From: "William A. Mahaffey III" Message-ID: <56561f54.5040...@hiwaay.net>

| f: 1886414256 67110912 RAID # (Cyl. 66578*- 1938020)

OK, 67110912 is a multiple of 2^11 (2048) which is just fine. The size is a multiple of 2^4 (16) so that's OK too.

| 128 7545656543 1 GPT part - NetBSD FFSv1/FFSv2

The 128 is what I was expecting, from the dk0 wedgeinfo, and that's fine. The size is weird, but I don't think it should give a problem. Greg will be able to say what happens when there's a partial stripe left over at the end of a raidframe array.

If you do ever decide to redo things, I'd make that size be a multiple of 2048 too (maybe a multiple of 2048 - 128). Wasting a couple of thousand sectors (1 MB) won't hurt (and that's the max).

But overall I think your basic layout is fine, and there's no need to adjust that. The one thing that you need to do (if you really need better performance, rather than just think you should have it - that is, if you need it enough to re-init the filesystem) would be to change the ffs block size, or change the raidframe stripe size, so standard size block I/O turns into full size stripe I/O. Doing that should improve performance. Nothing else is likely to help.

kre
Re: Beating a dead horse
Date: Wed, 25 Nov 2015 13:20:14 -0700 (MST) From: Swift Griggs Message-ID:

| I wonder if the same is true for LVM?

No idea. I thought it should be easy enough to test, so I just tried that ... unfortunately, I cannot work out (with 5 minutes of research!) how to make it work, so no luck...

lvm wasn't on my radar - while it seems to do a lot, encryption doesn't appear to be included. My 4k sector drive is a USB removable, which is the kind of thing that mandates encryption, so I want/need cgd on it. Anything else after that is just frills... (there's just one of them, now anyway, so raid etc isn't important, and it is mostly for backups, so doesn't need lots of small data volumes, just one big place to put dump files...)

First problem with testing lvm is that amd64 GENERIC (and consequently my cut down version) doesn't have "pseudo-device dm" in it, so lvm isn't going to work there at all. That's easy to fix of course, so I did that. But after that, I couldn't figure out how to make lvm pvcreate dk13 work; it just said (something like this - paraphrasing, as that system isn't running any more) "invalid device or disabled by filtering".

Initially dk13 was of type "cgd" in the GPT label. I wasn't sure what to make it for LVM; there doesn't seem to be a NetBSD UUID for that, so I tried setting it to linux-lvm on the assumption that would work. Changed nothing.

I was using a current kernel (7.99.21) from a couple of weeks ago. I can upgrade to a newer one if that is likely to help (but I doubt it).

That's where I gave up. If anyone has a recipe I can follow to create an lvm volume on a GPT partitioned drive (I *will not* revert to MBR or disklabel - those are "so 20th century") and can tell me just how to run a test, I'm willing to try, anytime in the remainder of this week. After that, I'm going to be away from my 4k sector drive for all of December, so I won't be testing lvm, or working on cgd at all.
If anyone else wants to fix cgd for 4k sector size discs during Dec, please go ahead. So far all I've been doing is investigating, and thinking, I have nothing productive to share yet... | Since it's relatively new, perhaps | some of these issues were worked out in a more "modern" way that would | properly take advantage of the NetBSD-7 kernel. The kernel does some stuff right for non 512 byte sector disks, but still does lots wrong. It is still a mess in this area. kre
Re: Beating a dead horse
On Thu, 26 Nov 2015 04:41:02 +0700 Robert Elz wrote:

> Date: Wed, 25 Nov 2015 14:57:02 -0553.75
> From: "William A. Mahaffey III"
> Message-ID: <56561f54.5040...@hiwaay.net>
>
> | f: 1886414256 67110912 RAID # (Cyl. 66578*- 1938020)
>
> OK, 67110912 is a multiple of 2^11 (2048) which is just fine.
> The size is a multiple of 2^4 (16) so that's OK too.
>
> | 128 7545656543 1 GPT part - NetBSD FFSv1/FFSv2
>
> The 128 is what I was expecting, from the dk0 wedgeinfo, and that's
> fine. The size is weird, but I don't think it should give a problem.
> Greg will be able to say what happens when there's a partial stripe
> left over at the end of a raidframe array.

RAIDframe truncates to the last full stripe.

> If you do ever decide to redo things, I'd make that size be a
> multiple of 2048 too (maybe a multiple of 2048 - 128). Wasting a
> couple of thousand sectors (1 MB) won't hurt (and that's the max).
>
> But overall I think your basic layout is fine, and there's no need to
> adjust that. The one thing that you need to do (if you really need
> better performance, rather than just think you should have it - that
> is, if you need it enough to re-init the filesystem) would be to
> change the ffs block size, or change the raidframe stripe size, so
> standard size block I/O turns into full size stripe I/O.
>
> Doing that should improve performance. Nothing else is likely to
> help.

The first thing I would do is test with these:

time dd if=/dev/zero of=/home/testfile bs=64k count=32768
time dd if=/dev/zero of=/home/testfile bs=10240k count=32768

so that at least you're sending 64K chunks to the disk... After that, 64K blocks on the filesystem are going to be next, and that might be more effort than it's worth, depending on the results of the above dd's...

Later...

Greg Oster
Re: Beating a dead horse
On 11/25/15 12:14, Robert Elz wrote:

Date: Wed, 25 Nov 2015 10:52:30 -0600 From: Greg Oster Message-ID: <20151125105230.209c5...@mickey.usask.ca>

| Just to recap: You have a RAID set that is not 4K aligned with the
| underlying disks.

Actually, I think it is - though we haven't seen all the data yet to prove it. The unaligned assumption was based upon misreading fdisk output. That was incorrect - but doesn't actually mean that it is properly aligned; it probably is OK, but we have not yet seen the BSD disklabel for the data disks to be sure (we know the BSD fdisk partition is OK, but that's then divided into pieces, and we have not - not recently anyway - been told that breakup). We also haven't seen the filesystem layout within the raid.

Everything else makes sense, and William, that does come from a raid expert; you can believe what Greg says, take what I say as more of a semi-literate guess.

One question though: Greg, you suggest a 32 block stripe and 64K filesys block size. Would a 16 block stripe and 32K block size work as well?

So, William, two more command outputs needed ...

disklabel wd0

(or any of the drives that are all identical; if they are not all identical, then disklabel for each different layout). We need to see where wd0f (and wd1f etc) start... And, for the raid set itself,

gpt show raid2

kre

Roger that (6 HDD's, all identical, all identically partitioned/sliced):

4256EE1 # disklabel wd0
# /dev/rwd0d:
type: ESDI
disk: HGST HTS721010A9
label: disk0
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 1938021
total sectors: 1953525168
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0

6 partitions:
#        size     offset  fstype [fsize bsize cpg/sgs]
 a:   33554432       2048    RAID                      # (Cyl.      2*-   33290*)
 c: 1953523120       2048  unused      0     0         # (Cyl.      2*-  1938020)
 d: 1953525168          0  unused      0     0         # (Cyl.      0 -  1938020)
 e:   33554432   33556480    swap                      # (Cyl.  33290*-   66578*)
 f: 1886414256   67110912    RAID                      # (Cyl.  66578*-  1938020)

4256EE1 # gpt show raid2
      start        size  index  contents
          0           1         PMBR
          1           1         Pri GPT header
          2          32         Pri GPT table
         34          94
        128  7545656543      1  GPT part - NetBSD FFSv1/FFSv2
 7545656671          32         Sec GPT table
 7545656703           1         Sec GPT header
4256EE1 #

-- William A. Mahaffey III -- "The M1 Garand is without doubt the finest implement of war ever devised by man." -- Gen. George S. Patton Jr.
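A quick way to confirm kre's arithmetic on those offsets: a partition on a disk with 4K physical sectors behind a 512B interface is aligned when its sector offset is divisible by 8 (8 * 512B = 4096B). Checking the starts shown above (wd0 a/e/f and the GPT wedge at 128):

```shell
for off in 2048 33556480 67110912 128; do
  if [ $((off % 8)) -eq 0 ]; then
    echo "$off: 4K-aligned"
  else
    echo "$off: misaligned"
  fi
done
```

All four offsets come out 4K-aligned, which supports kre's conclusion that the basic layout is fine and the remaining problem is block/stripe sizing, not alignment.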
Re: Beating a dead horse
On 11/25/15 14:26, Swift Griggs wrote:

    On Thu, 26 Nov 2015, Robert Elz wrote:

        FFS is OK on NetBSD-7 (not sure about LFS or others, never tried
        them).  Raidframe might be (haven't looked) but both cgd and ccd
        are a mess ...

    I wonder if the same is true for LVM?  Since it's relatively new,
    perhaps some of these issues were worked out in a more "modern" way
    that would properly take advantage of the NetBSD-7 kernel.

    Personally, as a long-time sysadmin, having LVM fully implemented is
    a huge plus (especially when it fully integrates RAID features
    without another abstraction layer to deal with).  I know that
    RAIDframe is capable and I've used it myself, but I see LVM as being
    one of the (very) few design-by-committee projects that ever
    amounted to squat.  Perhaps they should have involved ISO or IEEE to
    properly ruin it.

    Nonetheless, irrespective of whether my warm fuzzy comes from
    familiarity or actual design wisdom, LVM makes a lot of sense to me,
    if for no other reason than I get to shorten the mental linked list
    of block storage management tools like RAIDframe, GEOM, VxVM, XVM,
    LVM, SVM, ZFS, BTRFS, AdvFS, LSM, etc.  Being a support engineer for
    many flavors of Unix ... I get around.

    -Swift

While LVM may have been designed by committee, I am pretty sure it was
originally an SGI committee, & it seems pretty good to me as well.  All
of my old SGI stuff worked like clockwork as long as it functioned
(snif).

-- 
William A. Mahaffey III

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
           -- Gen. George S. Patton Jr.
Re: Beating a dead horse
On Thu, 26 Nov 2015 01:08:21 +0700, Robert Elz wrote:

> Date:        Wed, 25 Nov 2015 10:52:30 -0600
> From:        Greg Oster
> Message-ID:  <20151125105230.209c5...@mickey.usask.ca>
>
> | Just to recap: You have a RAID set that is not 4K aligned with the
> | underlying disks.
>
> Actually, I think it is - though we haven't seen all the data yet to
> prove it.  The unaligned assumption was based upon misreading fdisk
> output.  That was incorrect - but doesn't actually mean that it is
> properly aligned, it probably is OK, but we have not yet seen the
> BSD disklabel for the data disks to be sure (we know the BSD fdisk
> partition is OK, but that's then divided into pieces, and we have
> not - not recently anyway - been told that breakup).  We also haven't
> seen the filesystem layout within the raid.
>
> Everything else makes sense, and William, that does come from a raid
> expert, you can believe what Greg says, take what I say as more of a
> semi-literate guess.
>
> One question though, Greg you suggest 32 block stripe, and 64K filesys
> block size.  Would 16 block stripe and 32K blocksize work as well?

Yes, though because the IO chunks are smaller the performance would
likely be slightly poorer.  (And a 64K write of data would end up
spanning two stripes, instead of one, meaning twice the locking, twice
the number of read requests, etc., given that those things don't get
combined at the component level by RAIDframe ...)

> So, William, two more command outputs needed ...
>
>     disklabel wd0
>
> (or any of the drives that are all identical, if they are not all
> identical, then disklabel for each different layout).  We need to
> see where wd0f (and wd1f etc) start ...
>
> And, for the raid set itself,
>
>     gpt show raid2

Yes .. the results of this, and of 'disklabel wd0', will get to the
heart of the issue ..

Later...

Greg Oster
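Greg's point about a 64K write spanning one stripe versus two follows directly from the geometry. A sketch of the arithmetic (assuming 512-byte sectors and this array's 4 data components; `sect_per_su` here stands in for the RAIDframe stripe-unit size in sectors):

```python
SECTOR = 512  # bytes per sector (assumed)

def stripes_touched(write_bytes, sect_per_su, data_disks):
    """Full data-stripe size in bytes, and how many stripes a
    stripe-aligned write of write_bytes spans (best case)."""
    stripe_bytes = sect_per_su * SECTOR * data_disks
    return stripe_bytes, -(-write_bytes // stripe_bytes)  # ceiling division

for spu in (32, 16):
    sb, n = stripes_touched(64 * 1024, spu, 4)
    print(f"sectPerSU={spu}: full stripe = {sb // 1024} KiB, "
          f"a 64 KiB write spans {n} stripe(s)")
```

With a 32-sector stripe unit a 64 KiB write fills exactly one full stripe; with 16 it spans two, hence the doubled locking and read traffic.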
Re: Beating a dead horse
On Wed, 25 Nov 2015, William A. Mahaffey III wrote:

    While LVM may have been designed by committee, I am pretty sure it
    was originally an SGI committee, & seems pretty good to me as well.

As a guy who still supports ancient Unix platforms every day, I'll tell
you that IRIX categorically rocks in my opinion, too (at least as far
as a commercial OS goes, and excluding the horrible inst/overlay
install process).  However, IIRC, it was HP, IBM, and someone else (the
Open Group?).  At least that was the story I was told.  Remember that
IRIX has XVM, which works well, but isn't all that similar to LVM.  AIX
and HP-UX still have strong LVM implementations.

    All of my old SGI stuff worked like clockwork as long as it
    functioned (snif).

You are going to make me choke up, here.  I'm sitting here looking at
my Tezro and O2 and I'm cursing Rick Belluzzo's name.  SGI had a period
in the 90's where just about everything it released was chock full of
awesome and made other vendors' gear look 10 years old the day it was
released.  The Indy and O2 were especially wonderful in their day ...

-Swift
Re: Beating a dead horse
On 11/25/15 12:47, Robert Elz wrote:

    The real reason I wanted to reply to this message is that last
    line.  wd5 is not being used as a spare.

I kind of suspected that might be the case.

    (Parts of it might be used for raid0 or raid1, that's a whole
    different question and not material here).  Raidframe
    autoconfigures the in-use components (assuming autoconfig is
    enabled for the raid array, which it is for you ...)

    | Component label for /dev/wd0f:
    |    Autoconfig: Yes

    (same for the other components.)  But spares are not autoconfigured.
    If you want a "hot" spare (one that will automatically be used if
    one of the other components fails, so you get back the reliability
    as soon as possible, in case a second component also fails) rather
    than a "cold" spare (one waiting to be used, but which needs human
    intervention to actually make it happen - which is what you have
    now), then you need to arrange to add the spare after every reboot.
    There is no current standard way to make that happen (most of us
    tend to be counting pennies, and have no spare drives ready at all -
    we wait for one to fail, then go buy its replacement only when
    required ...), so I'd just add

        raidctl -a /dev/wd5f raid2

Thanks, just did it.  I am probably not going to try that test, sounds
kinda involved.  I am still digesting this & other input from this
thread to decide what to do next.  Thanks & TIA :-)

-- 
William A. Mahaffey III

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
           -- Gen. George S. Patton Jr.
Fan control on nForce430 MCP61 chipset
Greetings all,

I'm trying to gain control of the cooling fans, CPU and case, for a DIY
home NAS.  They're running at 100% regardless of load or idle duration.
Fans run at high speed under Xpenology (Synology distro), the FreeBSD
LiveCD from the install img, and NetBSD 6.1.5 and 7.0.  Fans are under
control with current Windows client and server operating systems at the
existing BIOS config settings.

Unlike many others in searches, I don't have amdcputemp or others in
dmesg output.  I do have a functioning acpi power button, and have
processor clock control with estd.

The board is a Gigabyte M61PME-S2P rev 1.0 with an nForce 430 (MCP61)
chipset and an iTE IT8718 hardware monitoring chip [1].  That chip is
in /etc/envsys.conf.  Uncommenting the relevant lines and running
'envstat -c /etc/envsys.conf' [2] returns:

    envstat: device `itesio0' doesn't exist.

Adding lines for nfsmbc0, nfsmb0, nfsmb0, and iic0 resulted in "device
doesn't exist, fix config file".  'envstat -D' returns "no drivers
registered".  On reviewing dmesg [3], there is no itesio0 in the
output, but I do have nfsmbc0, nfsmb0/1, and iic0/1.  man nfsmb(4)
indicates support for NVIDIA nForce 2/3/4 SMBus controllers.

Running the NetBSD 7.0 stable GENERIC kernel, upgraded from 6.1.5 via
sysbuild / sysupgrade.  I'm brand new to NetBSD, and have essentially
no modern *nix experience.  I've exhausted my talent here; anybody else
have a tip?

TIA,
Tom Sweet

[1] http://www.gigabyte.com/products/product-page.aspx?pid=3009#sp

[2]
# $NetBSD: envsys.conf,v 1.12 2008/04/26 13:02:35 xtraeme Exp $
#
# --
# Configuration file for envstat(8) and the envsys(4) framework.
# --
#
# Devices are specified in the first block, sensors in the second block,
# and properties inside of the sensor block:
#
#   foo0 {
#       prop0 = value;
#       sensor0 { ... }
#   }
#
# Properties must be separated by a semicolon character and assigned by
# using the equal character:
#
#   critical-capacity = 10;
#
# Please see the envsys.conf(5) manual page for a detailed explanation.
#
# --
# CONFIGURATION PROPERTIES FOR SPECIFIC DRIVERS AND MOTHERBOARDS
# --
#
# The following configuration blocks will report the correct
# values for the specified motherboard and driver.  If you have
# a different motherboard and verified the values are not correct
# please email me at.
#
# --
# ASUS M2N-E (IT8712F Super I/O)
# --
#
# itesio0 {
#       # Fixup rfact for the VCORE_A sensor.
#       sensor3 { rfact = 180; }
#
#       # Fixup rfact and change description (VCORE_B = +3.3V).
#       sensor4 { description = "+3.3 Voltage"; rfact = 200; }
#
#       # Change description (+3.3V, unused sensor).
#       sensor5 { description = "Unused"; }
#
#       # Fixup rfact and change description for the +5V sensor.
#       sensor6 { description = "+5 Voltage"; rfact = 349; }
#
#       # Fixup rfact and change description for the +12V sensor.
#       sensor7 { description = "+12 Voltage"; rfact = 850; }
# }
#
# --
# Gigabyte P35C-DS3R (IT8718F Super I/O)
# --
#
# itesio0 {
#       # Fixup rfact and change description for the VCore sensor.
#       sensor3 { description = "VCore Voltage"; rfact = 100; }
#
#       # Change description (VCORE_B is DDR).
#       sensor4 { description = "DDR Voltage"; }
#
#       # Fixup rfact and change description for the +12V sensor.
#       sensor7 { description = "+12 Voltage"; rfact = 11600; }
#
#       # Fixup rfact for the -12V sensor.
#       sensor9 { rfact = 900; }
# }
#
# ---
# Gigabyte M61PME-S2P rev. 1.0
#
# nfsmbc0 {
#       # test
#       sensor1 { description = "test"; }
# }
# nfsmb0 {
#       # test 2
#       sensor0 { description = "nfsmb0"; }
# }
# nfsmb {
#       # test 3
#       sensor3 { description = "nfsmb"; }
# }
iic0 {
        # test 4
        sensor4 { description = "i2c4"; }
}

[3]
NetBSD 7.0 (GENERIC.201509250726Z)
total memory = 2014 MB
avail memory = 1938 MB
kern.module.path=/stand/amd64/7.0/modules
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
Gigabyte Technology Co., Ltd. M61PME-S2P ( )
mainbus0 (root)
ACPI: RSDP 0xf65f0 14 (v00 GBT )
ACPI: RSDT 0x7def3000 38 (v01 GBTNVDAACPI 42302E31 NVDA 01010101)
ACPI: FACP 0x7def3040 74 (v01 GBTNVDAACPI 42302E31 NVDA 01010101)
ACPI: DSDT 0x7def30c0 0045A5 (v01 GBTNVDAACPI 1000 MSFT 0300)
ACPI: FACS 0x7def 40
ACPI: SSDT 0x7def7780 0007BA (v01 PTLTD POWERNOW 0001 LTP 0001)
ACPI: HPET 0x7def7f40 38 (v01 GBTNVDAACPI 42302E31
Re: Beating a dead horse
On 11/25/15 19:36, Robert Elz wrote:

    Date:        Wed, 25 Nov 2015 19:08:59 -0553.75
    From:        "William A. Mahaffey III"
    Message-ID:  <56565a61.7080...@hiwaay.net>

  | The other command is still running, will write out 320 GB by my
  | count, is that as intended, or a typo :-) ?  If as wanted, I will
  | leave it going & report back when it is done.

    Kill it, those tests are testing precisely nothing.  If you want to
    try a slightly better test, try with bs=32k, so you are at least
    having dd write filesystem-sized blocks.  Won't help with raidframe
    overheads, but at least you'll optimise the filesystem overheads as
    much as possible.

    But I think now that it is clear that you could improve performance
    if you rebuild the filesystem with -b 65536 (and -f one of 8192
    16384 32768, take your pick ... 8192 would be most space saving).
    The question is whether the improvement is really needed for real
    work, as opposed to meaningless benchmarks (which dd really isn't
    anyway; if you want a real benchmark pick something, perhaps bonnie,
    from the benchmarks category in pkgsrc).  And whether it will be
    enough to be worth the pain - unfortunately, there's no real way to
    know how much improvement you'd get without doing the work first.

    Only you can answer those questions - you know what the work is, and
    how effectively your system is coping as it is now configured.  As I
    said before, when it is me, I just ignore all of this; I don't care
    what the I/O throughput is (or could be), because in practice there
    just isn't enough I/O (on raid5 based filesystems) in my system to
    matter.  So I optimise for other things - of which the most valuable
    (to me) is my time!

    kre

Well, just for posterity:

4256EE1 # time dd if=/dev/zero of=/home/testfile bs=10240k count=32768
^C
22515+0 records in
22514+0 records out
236076400640 bytes transferred in 20550.826 secs (11487440 bytes/sec)
time: Command terminated abnormally.
    20552.92 real         0.09 user      1592.38 sys
4256EE1 #

-- 
William A. Mahaffey III

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
           -- Gen. George S. Patton Jr.
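For what it's worth, the rates dd reported in the two runs can be re-derived from the byte counts and elapsed times (plain arithmetic, nothing assumed beyond dd's own numbers):

```python
# bytes transferred and elapsed seconds, copied from the two dd runs
runs = {
    "bs=16k (earlier run)": (536_870_912, 22.475),
    "bs=10240k (killed run)": (236_076_400_640, 20550.826),
}
for name, (nbytes, secs) in runs.items():
    print(f"{name}: {nbytes / secs / 1e6:.1f} MB/s")
```

About 23.9 MB/s versus 11.5 MB/s: the huge-block run actually came out slower, consistent with kre's point that these dd runs measure very little.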
Re: Beating a dead horse
Greg Troxel wrote:
> > I would go further than that.  Alignment is not only an issue with
> > 4K sector disks, but also with SSDs, USB sticks, and SD cards, all
> > of which are being deployed in sizes smaller than 128 GB even today.
>
> I didn't realize that.  Do these devices have 4K native sectors?

They don't have sectors so much as flash pages, and the page size
varies from device to device.  For example, Intel Datacenter SSDs have
either 8 KB or 16 KB pages according to this document:

http://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/ssd-partition-alignment-tech-brief.pdf

> So you mean 2048 for >= 1G?

Yes.

> Is there a clear motivation for 2048 vs 64?  Are there any known
> devices where that matters?

Intel specifically recommends 1 MB alignment (2048 sectors) in the
document linked above.  Also, 128 kB is a common erase block size in
consumer flash media, but I'm not sure how relevant that is to
partition alignment.

> But why is just changing 63 to 64 always an issue?  Or using that for
> >= 1G?  Other than complexity, and having 3 branches.

I don't have a problem with changing the 63 to 64 for the smallest
devices, or with having 3 branches.  I would just like an 80 GB SSD,
for example, to get aligned to 1 MB per Intel's recommendation.

-- 
Andreas Gustafsson, g...@gson.org
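A 1 MB offset conveniently divides every flash geometry mentioned in this thread, which is presumably why Intel recommends it. A quick check (the unit sizes are the ones named above: 4K/8K/16K pages and 128 kB erase blocks):

```python
OFFSET = 1 << 20   # 1 MiB = 2048 sectors of 512 B
for unit in (4096, 8192, 16384, 128 * 1024):
    aligned = OFFSET % unit == 0
    print(f"{unit // 1024:3d} kB unit: {'aligned' if aligned else 'NOT aligned'}")
```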
Re: Beating a dead horse
On 11/25/15 00:30, Robert Elz wrote:

    Date:        Tue, 24 Nov 2015 21:57:50 -0553.75
    From:        "William A. Mahaffey III"
    Message-ID:  <56553074.9060...@hiwaay.net>

  | 4256EE1 # time dd if=/dev/zero of=/home/testfile bs=16k count=32768
  | 32768+0 records in
  | 32768+0 records out
  | 536870912 bytes transferred in 22.475 secs (23887471 bytes/sec)
  |        23.28 real         0.10 user         2.38 sys
  | 4256EE1 #
  |
  | i.e. about 24 MB/s.

    I think I'd be happy enough with that; maybe it can be improved a
    little.

  | When I zero-out parts of these drives to reinitialize them, I see
  | ~120 MB/s for one drive.

    Depending upon just how big those "parts" are, that number might be
    an illusion.  You need to be writing at least about as much as you
    did in the test above to reduce the effects of write-behind (caching
    in the drive) etc.  Normally a "zero to reinit" write doesn't need
    nearly that much (often just a few MB) - writing that much would be
    just to the drive's cache, and measuring that speed is just
    measuring DMA rate, and useless for anything.

  | RAID5 stripes I/O onto the data drives, so I expect ~4X I/O speed
  | w/ 4 data drives.  With various overheads/inefficiencies, I (think
  | I) expect 350-400 MB/s writes.

    That's not going to happen.  Every raid write (whatever raid level,
    except 0) requires 2 parallel disc writes (at least) - you need that
    to get the redundancy that is the R in the name - and it can also
    require reads.  For raid 5, you write to the data drive (one of the
    4 of them) and to the parity drive - that is, all writes end up
    having a write to the parity drive, so the upper limit on speed for
    a contiguous write is that of one drive (a bit less probably,
    depending upon which controllers are in use, as the data still needs
    to be transmitted twice, and each controller can only be
    transferring for one drive at a time ... at least for the kinds of
    disc controllers in consumer grade equipment).

    If both data and parity happen to be using the same controller, the
    max rate will certainly be less (measurably less, though perhaps not
    dramatically) than what you can achieve with one drive.  If they're
    on different controllers, then in ideal circumstances you might get
    close to the rate you can expect from one drive.

    For general purpose I/O (writes all over the filesystem, as you'd
    see in normal operations) that's mitigated by there not really being
    one parity drive; rather, all 5 drives (the 4 you think of as being
    the data drives, and the one you think of as being the parity drive)
    perform as both data and parity drives, for different segments of
    the raid, so there isn't really (in normal operation) a one-drive
    bottleneck -- but with 5 drives, and 2 writes needed for each I/O,
    the best you could possibly do is twice as fast as a single drive in
    overall throughput.  In practice you'd never see that however; real
    workloads just aren't going to be conveniently spread out in just
    the right parts of the filesystems.  If you ever get faster than a
    single drive can achieve, I'd be surprised.  If you ever even
    approach what a single drive can achieve, I'd be surprised.

Hmmm ... I thought that the RAID5 would write 1 parity byte & 4 data
bytes in parallel, i.e. no '1 drive bottleneck'.  AFAIUT, parity data
is calculated for N data drives, 1 byte of parity data per N bytes
written to the N data drives, then the N+1 bytes are written to the N+1
drives, no?

    Now, in the above (aside from the possible measurement error in your
    120 MB/s) I've been allowing you to think that's "what a single
    drive can achieve".  It isn't.  That's raw I/O onto the drive, and
    will run at the best possible speed that the drive can handle -
    everything is optimised for that case, as it is one of the
    (meaningless) standard benchmarks.

    For real use, there are also filesystem overheads to consider; your
    raid test was onto a file, on the raid, not onto the raw raid
    (though I wouldn't expect that to be all that much faster, certainly
    not more than about 60 MB/s, assuming the 120 MB/s is correct).

    To get a more valid baseline, what you can actually expect to
    observe, you need to be comparing apples to apples - that is, the
    one-drive test needs to also have a filesystem, and you need to be
    writing a file to it.  To test that, take your hot spare (ie:
    unused) drive, and build a ffs on it instead of raid (unfortunately,
    I think you need to reboot to get it out of being a raidframe spare
    first - as I recall, raidframe has no "stop being a spare" operation
    ... it should have, but ...).  Just stop it being added back as a
    hot spare (assuming you are actually doing that now; spares don't
    get autoconfigured).  (Wait till down below to see how to do the
    raidctl -s to see if the hot spare is actually configured or not;
    raidctl will tell you, once you run it properly.)  Then build a ffs
    on the spare drive (after it is no longer a spare) (you'll need to
    change the
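William's mental model of the parity calculation is right as far as it goes: parity is the XOR of the N data units in a stripe. What it misses is that a partial-stripe write still has to read before it writes. A toy sketch (a hypothetical 4-data-disk stripe with 4-byte stripe units, not RAIDframe code):

```python
from functools import reduce

# One parity unit per N data units, computed as the XOR of the data.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]      # 4 data components
xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))
parity = reduce(xor, data)

# Redundancy: any one lost unit is rebuilt from the rest plus parity.
rebuilt = reduce(xor, data[1:] + [parity])
assert rebuilt == data[0]

# The small-write problem: updating only data[0] still requires reading
# the old data and the old parity, then writing new data and new parity.
new = b"EEEE"
new_parity = xor(xor(parity, data[0]), new)      # fed by the 2 reads
assert new_parity == reduce(xor, [new] + data[1:])
print("small write: 2 reads + 2 writes, parity consistent")
```

So a full-stripe write can indeed stream N data units plus one parity unit in parallel, but any smaller write degenerates into the read-modify-write cycle Greg described earlier in the thread.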
Re: Beating a dead horse
Robert Elz writes:
> Date:        Mon, 23 Nov 2015 11:18:48 -0553.75
> From:        "William A. Mahaffey III"
> Message-ID:  <5653492e.1090...@hiwaay.net>
>
> Much of what you wanted to know has been answered already I think, but
> not everything, so ...
>
> (in a different order than they were in your message)
>
> | Also, why did my fdisk
> | choose those values when his chose apparently better ones ?
>
> There's a size threshold - drives smaller get the offset 63 stuff,
> and drives larger, 2048 ... the assumption is that on small drives
> you don't want to waste too much space, but on big ones a couple of
> thousand sectors is really irrelevant ...

I can see that 2048 is a multiple of many more values, but I wonder
about changing 63 to 64, which is a multiple of 8 and thus good enough
for the 4K issue, and wastes only 512 bytes.  (In my experience, drives
of 1T and maybe even 750G are showing up with 4K sectors.)

The other thing would be to change the alignment threshold to 128G.
Even that's big enough that 1M not used by default is not important.

And of course people who care can do whatever they want anyway.
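The arithmetic behind the 63-versus-64 observation, spelled out (512-byte sectors assumed):

```python
# An offset is 4K-clean when it is a multiple of 8 sectors (8 * 512 B).
for off in (63, 64, 2048):
    aligned = off % 8 == 0
    extra = (off - 63) * 512   # space given up relative to the old default
    print(f"offset {off:4d}: {'4K-aligned' if aligned else 'misaligned'}, "
          f"+{extra} bytes vs offset 63")
```

Moving from 63 to 64 costs a single extra sector (512 bytes) and fixes the 4K case; 2048 additionally satisfies the 1 MB alignment discussed elsewhere in the thread.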
Re: Beating a dead horse
Greg Troxel wrote:
> The other thing would be to change the alignment threshold to 128G.
> Even that's big enough that 1M not used by default is not important.
> And of course people who care can do whatever they want anyway.

I would go further than that.  Alignment is not only an issue with 4K
sector disks, but also with SSDs, USB sticks, and SD cards, all of
which are being deployed in sizes smaller than 128 GB even today.

My vote is for a threshold of 1 G.  This means that, worst case, we
could end up wasting as much as 0.1% of the capacity on alignment (the
horror!).

-- 
Andreas Gustafsson, g...@gson.org
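The 0.1% figure checks out. Worst-case loss to 1 MiB alignment as a fraction of device size (decimal GB assumed, as drive vendors use):

```python
ALIGN = 1 << 20               # 1 MiB alignment
for size_gb in (1, 80, 1000):
    size_bytes = size_gb * 10**9
    pct = ALIGN / size_bytes * 100
    print(f"{size_gb:4d} GB device: worst case {pct:.3f}% lost to alignment")
```

At the proposed 1 GB threshold the worst case is about 0.105%, shrinking linearly as devices get bigger.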
Re: Beating a dead horse
Andreas Gustafsson writes:
> Greg Troxel wrote:
>> The other thing would be to change the alignment threshold to 128G.
>> Even that's big enough that 1M not used by default is not important.
>> And of course people who care can do whatever they want anyway.
>
> I would go further than that.  Alignment is not only an issue with 4K
> sector disks, but also with SSDs, USB sticks, and SD cards, all of
> which are being deployed in sizes smaller than 128 GB even today.

I didn't realize that.  Do these devices have 4K native sectors?

> My vote is for a threshold of 1 G.  This means that worst case, we
> could end up wasting as much as 0.1% of the capacity on alignment
> (the horror!).

So you mean 2048 for >= 1G?  That's ok with me, too.

Is there a clear motivation for 2048 vs 64?  Are there any known
devices where that matters?

But why is just changing 63 to 64 always an issue?  Or using that for
>= 1G?  Other than complexity, and having 3 branches.