Re: high load, no bottleneck
Hello, on Saturday 28 September 2013 04:54:19 you wrote: Date: Sat, 28 Sep 2013 09:09:02 +0100 From: Roland C. Dowdeswell el...@imrryr.org Message-ID: 20130928080902.gg4...@roofdrak.imrryr.org | I thought quite some time ago that it probably makes sense for us | to make the installer mount everything async to extract the sets | because, after all, if the install fails you are rather likely to | simply re-install rather than manually figure out where it was and | fix it. I have had the same thought - the one downside for this usage is the sync time at umount - for novice users this would appear to be a hang/crash after the install was almost finished (since it might take minutes to complete), to which a natural response would be to push the reset button and try again... So, I think this would need to be an option to be used by those who know to expect that delay, and so default off. Or just throw up a warning - Hey, this can take a few minutes, don't freak out! have fun Michael
Re: high load, no bottleneck
Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? Do you mean all these requests *could* be honored on first flush? If so, then yes, I agree. Last flush, surely? There may be pending operations between the first FSYNC and the last which need to be performed before the last FSYNC is done. Looking at the code in XFS that gathers barrier events, it seems to be drawing a DAG... That's exactly what I'd expect: write--flush dependency _is_ a DAG, with barriers being chokepoints (I forget the graph-theory term for them) in the DAG. Mouse
Re: high load, no bottleneck
Date: Sat, 28 Sep 2013 14:24:32 +1000 From: matthew green m...@eterna.com.au Message-ID: 11701.1380342...@splode.eterna.com.au | -o async is very dangerous. there's not even the vaguest | guarantee that even fsck can help you after a crash in | that case... All true, still it is remarkably useful - I use it all the time (incidentally, while the man page says that -o log and -o async can't be used together, if they are, the result is a panic, rather than a more graceful error message ... I should point out that I saw this on a remount, adding -o async to a filesys that had been mounted -o log - with -o log in fstab ... I haven't been inclined to panic the system more by running more tests, just removed the -o log which wasn't really needed for that filesys, it is mostly either -o async or -o ro).

My strategy is to newfs a filesystem, mount it -o async, extract files into it (extracting a pkgsrc.tgz can be done in a few seconds if the system has sufficient ram for a large buffer cache - the subsequent sync / umount takes ages - but the combined time is still much less than any other strategy for filling a filesys with lots of files, and the filesys can be happily used in parallel with the sync - -o async helps less if the files are relatively big of course, but for pkgsrc it is ideal).

Should the system crash for any reason while all this is happening, I simply start again, from the newfs - that is so rare (NetBSD being mostly stable, and with a UPS to guard against power problems) that the extra delay work that might be required is irrelevant - and in any case, I suspect that I could newfs/mount -oasync/tar x/crash/newfs/mount/tar x faster than a simple tar x on a normally mounted filesystem, even with -o log.

I also mount the filesystem I use for pkg_comp sandboxes with -o async. Again, should the system crash, I don't care, simply newfs and make the sandbox(es) again. This vastly improves compile times (particularly cleanup times - a newfs followed by repopulating the sandbox is quite fast ... even a rm -fr on the sandbox, and repopulate, is MUCH faster than make clean on any sizeable package with many dependencies - I do that between package builds to guarantee no accidental undesired pollution.)

Of course, to do this, one must believe in filesystems as useful objects, and not simply a nuisance created out of the necessity of drives that were too small, which should be avoided wherever possible. Some of my systems have approaching 40 mounted filesystems - filesystems are first-class objects - they're the unit for mount options (like -o ro, -o async, and -o log), they're the unit for exports, they're the unit for dumps. Using them intelligently makes system management much more flexible.

We are still lacking some facilities that would make things even better, including filesystems that could easily grow/shrink as needed, so the argument about running out of space in one filesystem while there is plenty available in another could be ignored - it is the only argument against multiple filesystems with any real merit, and it is true only because we allow it to remain so, it doesn't have to be (Digital Unix's ADVFS proved that decades ago).
There's more that could be done to improve things - including handling fsck better at startup - the system should be able to come up multi-user before all filesystems are checked and mounted; only some subset (of the system with almost 40, I think it needs about 8 to function for 99% of its uses - the rest are specialised) is really needed, the rest should be checked and, when ready, mounted after the system is running -- -o log helps there, but isn't really enough (like for many of the filesystems I have, if they were never to become available, because of hardware failure or something, it should not prevent successful multi-user boot.) kre ps: I had been meaning to rant like this for some time, your message just provided the incentive today!
re: high load, no bottleneck
ps: I had been meaning to rant like this for some time, your message just provided the incentive today! :-) i will note that i'm also a fan of using -o async FFS mounts in the right place. i just wouldn't do it for a file server :-)
Re: high load, no bottleneck
On Sat, Sep 28, 2013 at 05:56:50PM +1000, matthew green wrote: ps: I had been meaning to rant like this for some time, your message just provided the incentive today! :-) i will note that i'm also a fan of using -o async FFS mounts in the right place. i just wouldn't do it for a file server :-) I thought quite some time ago that it probably makes sense for us to make the installer mount everything async to extract the sets because, after all, if the install fails you are rather likely to simply re-install rather than manually figure out where it was and fix it. -- Roland Dowdeswell http://Imrryr.ORG/~elric/
Re: high load, no bottleneck
Robert Elz k...@munnari.oz.au wrote: incidentally, while the man page says that -o log and -o async can't be used together, if they are, the result is a panic, rather than a more graceful error message ... This could be a real problem on a system that allows unprivileged users to mount thumb drives... -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
m...@netbsd.org (Emmanuel Dreyfus) writes: Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? That should be trivial to fix then. Don't flush if it isn't dirty.
Re: high load, no bottleneck
Michael van Elst mlel...@serpens.de wrote: Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? That should be trivial to fix then. Don't flush if it isn't dirty. Here is the backtrace I collected for one of the many stuck processes waiting for I/O:
turnstile_block
rw_vector_enter
wapbl_begin
ffs_write
VOP_WRITE
nfsrv_write
nfssvc_nfsd
sys_nfssvc
syscall
So we would add a we_dirty flag to struct wapbl_entry? When would it be set to 1? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
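For what "don't flush if it isn't dirty" might look like, here is a minimal sketch. Every name in it (wapbl_sketch, wl_gen_dirtied, wl_gen_flushed) is invented rather than taken from the real vfs_wapbl.c, and it is written as standalone userland C with a pthread mutex so it compiles on its own; in the kernel the same bookkeeping would live under the existing WAPBL locks. The assumption it encodes for "when would it be set": the dirty marker advances whenever a transaction queues new log data, and a flush whose generation has already been committed returns without issuing another disk cache flush.

#include <pthread.h>
#include <stdint.h>

/* Invented stand-in for the relevant bits of the WAPBL state. */
struct wapbl_sketch {
	pthread_mutex_t	wl_mtx;
	uint64_t	wl_gen_dirtied;	/* bumped when log data is queued */
	uint64_t	wl_gen_flushed;	/* last generation synced to disk */
};

int wapbl_sketch_write_and_sync(struct wapbl_sketch *wl);	/* the real work */

/* Called wherever a transaction adds entries to the in-memory log. */
void
wapbl_sketch_mark_dirty(struct wapbl_sketch *wl)
{
	pthread_mutex_lock(&wl->wl_mtx);
	wl->wl_gen_dirtied++;
	pthread_mutex_unlock(&wl->wl_mtx);
}

/* Called from an fsync path: skip the disk cache flush if nothing is dirty. */
int
wapbl_sketch_flush(struct wapbl_sketch *wl)
{
	uint64_t gen;
	int error;

	pthread_mutex_lock(&wl->wl_mtx);
	gen = wl->wl_gen_dirtied;
	if (gen == wl->wl_gen_flushed) {
		/* Already clean: no journal write, no cache flush. */
		pthread_mutex_unlock(&wl->wl_mtx);
		return 0;
	}
	pthread_mutex_unlock(&wl->wl_mtx);

	error = wapbl_sketch_write_and_sync(wl);

	pthread_mutex_lock(&wl->wl_mtx);
	if (error == 0 && gen > wl->wl_gen_flushed)
		wl->wl_gen_flushed = gen;	/* everything up to gen is now clean */
	pthread_mutex_unlock(&wl->wl_mtx);
	return error;
}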
Re: high load, no bottleneck
On Sat, Sep 28, 2013 at 06:25:22AM +0200, Emmanuel Dreyfus wrote: Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? Do you mean all these requests *could* be honored on first flush? If so, then yes, I agree. Unfortunately, it may be that we lost some of the framework necessary to do this when we reverted the softdep code. Looking at the code in XFS that gathers barrier events, it seems to be drawing a DAG...
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. Further testing shows that server with -o log / client with -o async has no performance problem. OTOH, the client sometimes complains about write errors. -o async seems dangerous. Too many fsync() calls causing too many WAPBL flushes seems confirmed to be the problem at work here. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Thor Lancelot Simon t...@panix.com wrote: It should be possible to gather those requests and commit many of them at once to disk with a single cache flush operation, rather than issuing a cache flush for each one. This is not unlike the problem with nfs3 in general, that many clients at once may issue WRITE RPCs followed by COMMIT RPCs, and the optimal behavior is to gather the COMMITs, service many at a time, then respond to them all - If I understand correctly, the current situation is that each NFS client fsync() causes a server WAPBL flush, while the other NFS clients' fsync()s are waiting. The situation is obvious with NFS, but it also probably exists with local I/O, when one VOP_FSYNC causes a WAPBL flush and lets other VOP_FSYNCs wait. Basically, if we have N pending VOP_FSYNC for a given filesystem, all these requests will be honoured on first flush, but they are serialized and will be acknowledged one by one, with the cost of a useless flush each time. Am I right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
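The gathering Thor describes is essentially group commit. Below is a sketch of the idea with invented names, written as standalone userland C with pthreads rather than the actual WAPBL or NFS server code: each caller notes the sequence number it needs on stable storage, one caller performs the physical flush for the whole batch, and every waiter covered by that batch returns without issuing a flush of its own.

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t gc_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gc_cv  = PTHREAD_COND_INITIALIZER;
static uint64_t gc_pending;	/* highest sequence asking for stable storage */
static uint64_t gc_done;	/* highest sequence known to be on stable storage */
static int gc_flushing;		/* nonzero while a flusher is running */

int flush_to_stable_storage(void);	/* stands in for the real cache flush */

int
group_fsync(void)
{
	int error = 0;

	pthread_mutex_lock(&gc_mtx);
	uint64_t want = ++gc_pending;

	while (gc_done < want) {
		if (!gc_flushing) {
			/* Become the flusher for everything queued so far. */
			uint64_t batch = gc_pending;

			gc_flushing = 1;
			pthread_mutex_unlock(&gc_mtx);
			error = flush_to_stable_storage();
			pthread_mutex_lock(&gc_mtx);
			gc_flushing = 0;
			if (error == 0 && batch > gc_done)
				gc_done = batch;	/* one flush covered the whole batch */
			pthread_cond_broadcast(&gc_cv);
			if (error != 0)
				break;
		} else {
			/* Someone else is flushing; their flush may already cover us. */
			pthread_cond_wait(&gc_cv, &gc_mtx);
		}
	}
	pthread_mutex_unlock(&gc_mtx);
	return error;
}

With this shape, N concurrent fsync() callers cost one flush instead of N serialized flushes, which is exactly the saving being discussed for both the NFS COMMIT case and the local VOP_FSYNC case.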
re: high load, no bottleneck
I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. Further testing shows that server with -o log / client with -o async has no performance problem. OTOH, the client sometimes complain about write errors. -o async seems dangerous. -o async is very dangerous. there's not even the vaguest guarantee that even fsck can help you after a crash in that case... .mrg.
Re: high load, no bottleneck
I tried moving a client NFS mount to async. [...] Further testing shows that server with -o log / client with -o async has no performance problem. OTOH, the client sometimes complain about write errors. -o async seems dangerous. -o async is very dangerous. there's not even the vaguest guarantee that even fsck can help you after a crash in that case... I think you're confusing -o async on the server-side (FFS, etc) mount with -o async on the client-side (NFS) mount. The former, yes, agreed (though fsck will be able to put things back together just about often enough to lull people into a false sense of security...). The latter, things aren't quite so dismal - depending on what the client is doing and what kind of damage it's prepared to tolerate, -o async on the NFS mount (client with -o async) may actually be a sane thing to do. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: async Assume that unstable write requests have actually been committed to stable storage on the server, and thus will not require resending in the event that the server crashes. Use of this option may improve performance but only at the risk of data loss if the server crashes. Note: this mount option will only be honored if the nfs.client.allow_async option in nfs.conf(5) is also enabled. I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. What are the consequences? I understand that if I use -o log server-side, I will still benefit from regular flushes. I will lose the guarantee that client fsync(2) pushes data to stable storage, but I will not have corrupted files on server crash. Is that right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Thu, 26 Sep 2013, Emmanuel Dreyfus wrote: Emmanuel Dreyfus m...@netbsd.org wrote: async Assume that unstable write requests have actually been committed to stable storage on the server, and thus will not require resending in the event that the server crashes. Use of this option may improve performance but only at the risk of data loss if the server crashes. Note: this mount option will only be honored if the nfs.client.allow_async option in nfs.conf(5) is also enabled. I tried moving a client NFS mount to async. The result is that the server never sees a filesync again from that client. What are the consequences? I understand that if I use -o log server-side, I will still benefit from regular flushes. I will lose the guarantee that client fsync(2) pushes data to stable storage, but I will not have corrupted files on server crash. Is that right? As I understand things, -o log (wapbl) doesn't guarantee file content, only fs metadata.
-------------------------------------------------------------------------
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:       |
| Customer Service | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com    |
| Network Engineer | 0786 F758 55DE 53BA 7731 | pgoyette at juniper.net |
| Kernel Developer |                          | pgoyette at netbsd.org  |
-------------------------------------------------------------------------
Re: high load, no bottleneck
I have no idea whether [several journal flushes per second] is high or low. It should be killing you. So the main question is who is issuing these small sync writes. As already mentioned, per filesync write you get a WAPBL journal flush which ends up in two disc flushes (one before and one after). This is a two-disk RAID 1 Wait, this is a Level 1 RAID? Then you don't have RMW. You may compare write throughput to read throughput.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: I have no idea whether [several journal flushes per second] is high or low. It should be killing you. So the main question is who is issuing these small sync writes. As already mentioned, per filesync write you get a WAPBL journal flush which ends up in two disc flushes (one before and one after). I understand the rationale, but I do not see how that could be fixed. We want fsync to do a disk sync, and clients are unlikely to be fixable. This is a two-disk RAID 1 Wait, this is a Level 1 RAID? Then you don't have RMW. RMW? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
We want fsync to do a disk sync, and client are unlikely to be fixable. In my case, the culprit was SQLite used by browsers and dropbox. As these were not fixable, I ended up writing a system that re-directs these SQLite files to local storage (http://www.math.uni-bonn.de/people/ef/dotcache). RMW? Read-Modify-Write. On a RAID 4/5, writing anything that's not an entire stripe needs either to read the rest of the stripe (to be able to compute the new parity) before writing the modified part and the parity; or it (if you modify less than half the stripe) reads both the old data and old parity to compute the new parity. You don't have that on RAID 1, of course.
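To illustrate the read-modify-write Edgar describes, here is the standard RAID 4/5 small-write parity update (generic parity math, not RAIDframe code): to rewrite one chunk without touching the rest of the stripe, the array reads the old data and the old parity, then folds the change into the parity with XOR before writing both back.

#include <stddef.h>
#include <stdint.h>

static void
rmw_update_parity(uint8_t *parity, const uint8_t *old_data,
    const uint8_t *new_data, size_t len)
{
	for (size_t i = 0; i < len; i++)
		/* new parity = old parity ^ old data ^ new data */
		parity[i] ^= old_data[i] ^ new_data[i];
}

Those two extra reads (old data and old parity) plus the two writes are why every small synchronous write costs a stripe RMW on RAID 4/5; a RAID 1 mirror just writes both copies, so it escapes this penalty.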
Re: high load, no bottleneck
http://www.math.uni-bonn.de/people/ef/dotcache/ has a typo in the first subheading Dotache :) On 24 September 2013 13:38, Edgar Fuß e...@math.uni-bonn.de wrote: We want fsync to do a disk sync, and client are unlikely to be fixable. In my case, the culprit was SQLite used by browsers and dropbox. As these were not fixable, I ended up writing a system that re-directs these SQLite files to local storage (http://www.math.uni-bonn.de/people/ef/dotcache). RMW? Read-Modify-Write. On a RAID 4/5, writing anything that's not an entire stripe needs either to read the rest of the stripe (to be able to compute the new parity) before writing the modified part and the parity; or it (if you modify less than half the stripe) reads both the old data and old parity to compute the new parity. You don't have that on RAID 1, of course.
Re: high load, no bottleneck
crap, apologies for the unchecked return address. In the interest of trying to make a relevant reply - doesn't nfs3 support differing COMMIT sync levels which could be leveraged for this? (assuming your server is stable :) aside I recall using NFS for file storage at Dreamworks in the late '90s and discovering that the reason the SGI file server boxes outperformed everything else is that they lied to the client and indicated data had been synced to disk as soon as it hit memory. Wonderful performance feature... until someone insisted on putting known buggy ATM drivers into production, which could give up to a GB of lost data when the fileservers panicked... /aside
Re: high load, no bottleneck
David Brownlee a...@netbsd.org wrote: In the interest of trying to make a relevant reply - doesn't nfs3 support differing COMMIT sync levels which could be leveraged for this? (assuming your server is stable :) Would it be this mount_nfs option? (from the MacOS X man page) async Assume that unstable write requests have actually been committed to stable storage on the server, and thus will not require resending in the event that the server crashes. Use of this option may improve performance but only at the risk of data loss if the server crashes. Note: this mount option will only be honored if the nfs.client.allow_async option in nfs.conf(5) is also enabled. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
EF However, the amount of filesync writes may still be the problem. EF The missing data (for me) is how often your WAPBL journal gets flushed ED How can that be retrieved? Look at the WAPBL debug output in syslog (which has time stamps). EF How large are your stripes, btw.? ED It is the sectPerSU in raidctl -s output, right? Multiplied by the number of data discs (i.e. discs minus one for level 5).
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: EF However, the amount of filesync writes may still be the problem. EF The missing data (for me) is how often your WAPBL journal gets flushed ED How can that be retrieved? Look at the WAPBL debug output in syslog (which has time stamps). min: 1 flush/s, max: 6 flush/s, mean: 3.2 flush/s, standard deviation: 0.33. I have no idea whether this is high or low. EF How large are your stripes, btw.? ED It is the sectPerSU in raidctl -s output, right? Multiplied by the number of data discs (i.e. discs minus one for level 5). This is a two-disk RAID 1 with this raidctl -s output:
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 972799936
The answer would therefore be 32 * 2 = 64 -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
I confirm. Indeed this is weird. Yes. I would try finding out who causes this. But how could small writes kill WAPBL performance? I don't think it would. However, the amount of filesync writes may still be the problem. The missing data (for me) is how often your WAPBL journal gets flushed (because of these filesync writes). If this happens once a second or even more often, it may completely stall your discs. One filesync write causes one journal flush which causes two disc cache flushes. Additionally, every single small sync write will cause one RAID stripe RMW. How large are your stripes, btw.?
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de writes: I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? Do you have physical access to the server during the test? Then you could have a look whether the discs are really idle (as iostat says) or busy (as disabling flushing mitigating the problem suggests). I have a WD Elements external USB drive, and it seems to take most of a second to execute a cache flush (via wapbl). It was basically unusable with rdiff-backup (which calls fsync, which does a log truncation) with cache flush enabled. The real fix was disabling the fsync in rdiff-backup, but disabling the wapbl cache flush helped.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: I don't think it would. However, the amount of filesync writes may still be the problem. The missing data (for me) is how often your WAPBL journal gets flushed (because of these filesync writes). How can that be retrieved? Additionally, every single small sync write will cause one RAID stripe RMW. How large are your stripes, btw.? It is the sectPerSU in raidctl -s output, right?
sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
Queue size: 100, blocksize: 512, numBlocks: 972799936
RAID Level: 1
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
ED 2908 getattr EF During which timeframe? ED 22.9 seconds. So that's 100 getattrs per second. Indeed [lots of 549-byte write requests] is weird. But how could small writes kill WAPBL performance? No idea. I think I'm out of luck now, but maybe it rings a bell with someone else. It would probably help finding out (with WAPBL logging) how often the journal flushes happen. I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? Do you have physical access to the server during the test? Then you could have a look whether the discs are really idle (as iostat says) or busy (as disabling flushing mitigating the problem suggests). Any experts to explain what exactly busy means in the iostat time sense? Maybe you also get some hint by trying to find out whether the problem is NFS or client related. Are you able to reproduce it locally? Can you make it happen (to a lesser extent, of course) with a single process?
Re: high load, no bottleneck
On Sat, Sep 21, 2013 at 11:44:26AM +0200, Edgar Fuß wrote: ED 2908 getattr EF During which timeframe? ED 22.9 seconds. So that's 100 getattrs per second. Indeed [lots of 549-byte write requests] is weird. But how could small writes kill WAPBL performance? No idea. I think I'm out of luck now, but maybe it rings a bell with someone else. It would probably help finding out (with WAPBL logging) how often the journal flushes happen. I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? I suspect that indeed, while a flush cache command is running, the disk is not considered busy. Only read and write commands are tracked. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: high load, no bottleneck
I suspect that indeed, while a flush cache command is running, the disk is not considered busy. Only read and write commands are tracked. Would it a) make sense b) be possible to implement that flushes are counted as busy?
Re: high load, no bottleneck
On Sat, Sep 21, 2013 at 12:01:45PM +0200, Edgar Fuß wrote: I suspect that indeed, while a flush cache command is running, the disk is not considered busy. Only read and write commands are tracked. Would it a) make sense b) be possible to implement that flushes are counted as busy? It would probably make sense. But it requires a bit of work: right now, disk_busy()/disk_unbusy() takes as argument the byte count and the op type (read/write). I think we should separate flushes from writes so that writes don't get counted twice (also the byte count cannot be guessed for flushes; we don't know how much data there is in the disk cache). Maybe we could count flushes as 0-byte writes, but I'm not sure that disk_busy()/disk_unbusy() would handle this properly. Or we could add a third operation type, but then disk_busy()/disk_unbusy() and userland tools would need to be changed to handle it ... -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
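A sketch of the "third operation type" idea, with invented names rather than the actual NetBSD disk(9)/iostat interface: the real disk_unbusy() takes a byte count plus a read/write flag, and since a flush moves no bytes, one possible shape is an explicit op enum and a separate flush counter that accounting tools could then report.

#include <stdint.h>

enum disk_op { DISK_OP_READ, DISK_OP_WRITE, DISK_OP_FLUSH };

struct disk_iostats {
	uint64_t rxfer, wxfer, fxfer;	/* completed reads / writes / flushes */
	uint64_t rbytes, wbytes;	/* flushes contribute no byte count */
};

static void
disk_unbusy_sketch(struct disk_iostats *ds, long bcount, enum disk_op op)
{
	switch (op) {
	case DISK_OP_READ:
		ds->rxfer++;
		ds->rbytes += (uint64_t)bcount;
		break;
	case DISK_OP_WRITE:
		ds->wxfer++;
		ds->wbytes += (uint64_t)bcount;
		break;
	case DISK_OP_FLUSH:
		ds->fxfer++;	/* counted toward busy time, but zero bytes */
		break;
	}
	/* busy-time bookkeeping (timestamps) left out of this sketch */
}

This avoids the double counting Manuel mentions for the "0-byte write" alternative, at the cost of touching every driver call site and the userland tools that read the statistics.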
Re: high load, no bottleneck
On Sat, Sep 21, 2013 at 09:58:30AM +0000, Michael van Elst wrote: e...@math.uni-bonn.de (Edgar Fuß) writes: I myself can't make sense out of the combination of -- vfs.wapbl.flush_disk_cache=0 mitigating the problem -- neither the RAID set nor its components showing busy in iostat Maybe during a flush, the discs are not regarded busy? busy means that the driver is executing I/O requests or waiting for results from such operations. The cache flush operation doesn't count by itself, but it slows down parallel I/O operations, so the disk appears busy as long as there are such I/O operations. At least for wd(4) (and I suspect it's true for sd(4) as well), new I/Os while a flush cache is running are stalled in the buffer queue, and so disk_busy() is only called once the flush cache has completed. So, the disk is effectively counted as unbusy while a cache flush is in progress. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: Maybe also you get some hint by trying to find out whether the problem is NFS or client related. Are you able to reproduce it locally? Can you make it happen (to a lesser extent, of course) with a single process? I tried various ways but I could not obtain the same phenomenon locally: if I get high load, it is because of CPU usage. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Here is an excerpt:
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        13.74    27   0.00    0.36
wd1           0.00     0   0.00    0.00        13.74    27   0.00    0.36
raid1         0.00     0   0.00    0.00        20.61    18   0.00    0.36
Provided raid1 is where the data lives, both the RAID and its components are quiet. Output of your first script: 2908 getattr During which timeframe? Output of your second script: 167 549 unstable 28 549 filesync So you mostly get 549-byte write requests? Could you manually double-check this? It sounds so weird that I'm afraid of an error in my script mis-interpreting your tcpdump's output format. In any case, I'm afraid you are not facing the kind of problems I inadvertently became a sort-of-expert in. Those 549-byte writes, should they prove real, may give (others) a hint of what's going wrong, though.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: Output of your first script: 2908 getattr During which timeframe? 22.9 seconds. Output of your second script: 167 549 unstable 28 549 filesync So you mostly get 549-byte write requests? Could you manually double-check this? I confirm. Indeed this is weird. But how could small writes kill WAPBL performance? The load does not get beyond 1 if mounted without -o log... -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
I re-enabled -o log and did the dd test again on NetBSD 6.0 with the patch you posted and vfs.wapbl.verbose_commit=2 I wouldn't expect anything interesting from this, but maybe hannken@ does. Running my stress test, which drives load to insane values: How often do these log flushes occur? During the stress phase, what does iostat -D -x -w 1 show for the raid and for the components, especially in the time column? During the stress, do you see small synchronous writes in NFS traffic? I attach two small shell scripts I wrote to extract statistics from tcpdump output. Both operate either on a pcap file, e.g. the output of tcpdump ... -s slen -w file port nfs (where I don't remember what the minimum slen is to capture all relevant info) or a textual tcpdump -vvv output (which is usually smaller), e.g. tcpdump ... -s 0 -vvv port nfs > file. As output, you get a list of NFS calls and a count (nfsstat) or a list of write calls ordered by size/sync and their counts.
nfsstat.sh Description: Bourne shell script
nfswritestat.sh Description: Bourne shell script
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote: Emmanuel Dreyfus m...@netbsd.org wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: Switched to AHCI. Here is below how hard disks are discovered (the relevant raid is RAID1 on wd0 and wd1) In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads, on both 6.0 and -current. I assume there must be something bad with WAPBL/RAIDframe There is at least one thing: RAIDframe doesn't allow enough simultaneously pending transactions, so everything *really* backs up behind the cache flush. Fixing that would require allowing RAIDframe to eat more RAM. Last time I proposed that, I got a rather negative response here. Thor
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote: Emmanuel Dreyfus m...@netbsd.org wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: Switched to AHCI. Here is below how hard disks are discovered (the relevant raid is RAID1 on wd0 and wd1) The other thing is, *if* we had support for more modern features (tagged queueing, FUA, etc.) in our ATA code, switching to AHCI mode could potentially have much more benefit. But it doesn't now -- and the development work required to add support for those features to the ATA code and to use them in FFS and WAPBL is not small. Thor
Re: high load, no bottleneck
On Thu, Sep 19, 2013 at 08:13:42AM -0400, Thor Lancelot Simon wrote: There is at least one thing: RAIDframe doesn't allow enough simultaneously pending transactions, so everything *really* backs up behind the cache flush. Fixing that would require allowing RAIDframe to eat more RAM. Last time I proposed that, I got a rather negative response here. It could be optional so that everyone is happy, couldn't it? -- Emmanuel Dreyfus m...@netbsd.org
Re: high load, no bottleneck
On Thu, 19 Sep 2013 10:29:55 -0400 chris...@zoulas.com (Christos Zoulas) wrote: On Sep 19, 8:13am, t...@panix.com (Thor Lancelot Simon) wrote: -- Subject: Re: high load, no bottleneck | On Wed, Sep 18, 2013 at 06:03:11PM +0200, Emmanuel Dreyfus wrote: | Emmanuel Dreyfus m...@netbsd.org wrote: | | Thank you for saving my day. But now what happens? | I note the SATA disks are in IDE emulation mode, and not AHCI. This is | something I need to try changing: | | Switched to AHCI. Here is below how hard disks are discovered (the relevant raid | is RAID1 on wd0 and wd1) | | In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads, on both 6.0 | and -current. I assume there must be something bad with WAPBL/RAIDframe | | There is at least one thing: RAIDframe doesn't allow enough simultaneously | pending transactions, so everything *really* backs up behind the cache flush. | | Fixing that would require allowing RAIDframe to eat more RAM. Last time I | proposed that, I got a rather negative response here. sysctl to the rescue. The appropriate 'bit to twiddle' is likely raidPtr->openings. Increasing the value can be done while holding raidPtr->mutex. Decreasing the value can also be done while holding raidPtr->mutex, but will need some care if attempting to decrease it by more than the number of outstanding IOs. I'm happy to review any changes to this, but won't have time to code it myself, unfortunately :( Later... Greg Oster
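Here is a sketch of the adjustment Greg outlines. The names are invented and a pthread mutex stands in for raidPtr->mutex so the snippet is self-contained; raising openings is unconditional, while lowering it below the number of outstanding I/Os is simply refused here (a real implementation might instead record the target and shrink as I/Os complete, which is the "some care" Greg mentions).

#include <errno.h>
#include <pthread.h>

/* Invented, simplified stand-in for the relevant RAIDframe fields. */
struct raid_sketch {
	pthread_mutex_t	mutex;		/* raidPtr->mutex in the real driver */
	int		openings;	/* max simultaneously pending I/Os */
	int		outstanding;	/* I/Os currently in flight */
};

int
raid_set_openings_sketch(struct raid_sketch *r, int new_openings)
{
	int error = 0;

	if (new_openings < 1)
		return EINVAL;

	pthread_mutex_lock(&r->mutex);
	if (new_openings >= r->openings || new_openings >= r->outstanding) {
		/* Growing, or shrinking while still above what is in flight. */
		r->openings = new_openings;
	} else {
		/* Would drop below the I/Os already issued; would need the
		 * deferred-shrink handling described above. */
		error = EBUSY;
	}
	pthread_mutex_unlock(&r->mutex);
	return error;
}

Whether the knob is exposed through raidctl or a sysctl, the locking rule is the same: the value is only ever changed under the per-set mutex.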
Re: high load, no bottleneck
Greg Oster os...@cs.usask.ca wrote: sysctl to the rescue. The appropriate 'bit to twiddle' is likely raidPtr->openings. Increasing the value can be done while holding raidPtr->mutex. Decreasing the value can also be done while holding raidPtr->mutex, but will need some care if attempting to decrease it by more than the number of outstanding IOs. This suggests that in my problem, RAIDframe would be the bottleneck, given too many concurrent I/Os sent by WAPBL. But how is it possible? Aren't WAPBL flushes serialized? The change you suggest would be set by raidctl rather than sysctl, right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Hello. thor's right. The raidframe driver defaults to a ridiculously low number of maximum outstanding transactions for today's environment. This is not a criticism of how the number was chosen initially, but things have changed. In my production kernels around here, I include the following option, which is a number I derived from a bit of empirical testing. I found that for arrays of raid5 disks, I didn't get much benefit with higher numbers, but numbers below this did show a marked decline in performance. For example, on an amd64 machine with 32G of ram, I have a raid5 set with 12 disks running on 2 mpt(4) buses. I get the following read and write numbers written to a filesystem with softdep enabled on top of a dk(4) wedge built on the raid5 set: (This is NetBSD-5.1)
test# dd if=/dev/zero of=testfile bs=64k count=65535
65535+0 records in
65535+0 records out
4294901760 bytes transferred in 125.486 secs (34226142 bytes/sec)
test# dd if=testfile of=/dev/null bs=64k count=65535
65535+0 records in
65535+0 records out
4294901760 bytes transferred in 5.994 secs (716533493 bytes/sec)
The line I include in my config files is:
options RAIDOUTSTANDING=40 #try and enhance raid performance.
Re: high load, no bottleneck
Hello. the worst case scenario is when a raid set is running in degraded mode. Greg sent me some notes on how to calculate the memory utilization in this instance. I'll go dig them out and send them along in a bit. In theory, if all your raid sets are in degraded mode at once, and i/o is busy, you could be highly impacted, since you can have up to 40 i/o's outstanding for each raid set with my configuration option. However, even on machines with multiple raid5 sets, with 2 of them running in degraded mode, I've not seen a memory bottleneck. I don't recommend this, of course, but sometimes stuff happens. In any case, except for the potential memory utilization, there's no down side to setting this number in the kernel and not worrying about it anymore. In fact, this is what I do for all our machines around here regardless of whether the machine is hosting raid1 sets, raid5 sets or a combination of the two. -Brian
Re: high load, no bottleneck
Greg Oster os...@cs.usask.ca wrote: It's probably easier to do by raidctl right now. I'm not opposed to having RAIDframe grow a sysctl interface as well if folks think that makes sense. The 'openings' value is currently set on a per-RAID basis, so a sysctl would need to be able to handle individual RAID sets as well as overall configuration parameters. IMO raidctl makes more sense here, as it is the place where one is looking for RAID stuff. While I am there: fsck takes an infinite time while RAIDframe is rebuilding parity. I need to renice the raidctl process that does it in order to complete fsck. Would raising the outstanding write value also help here? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Thu, 19 Sep 2013 20:53:30 +0200 m...@netbsd.org (Emmanuel Dreyfus) wrote: Greg Oster os...@cs.usask.ca wrote: It's probably easier to do by raidctl right now. I'm not opposed to having RAIDframe grow a sysctl interface as well if folks think that makes sense. The 'openings' value is currently set on a per-RAID basis, so a sysctl would need to be able to handle individual RAID sets as well as overall configuration parameters. IMO raidctl makes more sense here, as it is the place where one is looking for RAID stuff. While I am there: fsck takes an infinite time while RAIDframe is rebuilding parity. I need to renice the raidctl process that does it in order to complete fsck. Would raising the outstanding write value also help here? Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... Later... Greg Oster
Re: high load, no bottleneck
On Thu, 19 Sep 2013 11:26:21 -0700 (PDT) Paul Goyette p...@whooppee.com wrote: On Thu, 19 Sep 2013, Brian Buhrow wrote: The line I include in my config files is: options RAIDOUTSTANDING=40 #try and enhance raid performance. Is this likely to have any impact on a system with multiple raid-1 mirrors? Yes, it would, provided you have more than 6 concurrent IOs to each RAID set.. Later... Greg Oster
Re: high load, no bottleneck
On Sep 19, 11:35am, buh...@nfbcal.org (Brian Buhrow) wrote: -- Subject: Re: high load, no bottleneck | Hello. the worst case scenario is when a raid set is running in | degraded mode. Greg sent me some notes on how to calculate the memory | utilization in this instance. I'll go dig them out and send them along in | a bit. In theory, if all your raid sets are in degraded mode at once, and | i/o is busy, you could be highly impacted, since you can have up to 40 | i/o's outstanding for each raid set with my configuration option. However, | even on machines with multiple raid5 sets, with 2 of them running in | degraded mode, I've not seen a memory bottleneck. I don't recommend this, | of course, but somethimes stuff happens. In any case, except for the | potential memory utilization, there's no down side to setting this number | in the kernel and not worrying about it anymore. In fact, this is what I | do for all our machines around here regardless of whether the machine is | hosting raid1 sets, raid5 sets or a combination of the two. If we are going to add a sysctl, we might also put a different value for the raid-degraded condition? Ideally I prefer if things autotuned, but that is much more difficult. christos
Re: high load, no bottleneck
Greg Oster os...@cs.usask.ca wrote: Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... You mean raidctl -M yes, right? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Brian Buhrow buh...@nfbcal.org wrote: options RAIDOUTSTANDING=40 #try and enhance raid performance. I gave it a try, and even with RAIDOUTSTANDING set to 800 on a NetBSD-6.1 kernel, my stress test raises the load over 10 with -o log, whereas it remains below 1 without -o log. Therefore it must be something else. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: How often do these log flushes occur? On a 6.1 kernel with RAIDOUTSTANDING=800 and -o log. Stress test raises load to around 10. During the stress phase, what does iostat -D -x -w 1 show for the raid and for the components, especially in the time column? Here is an excerpt:
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        13.74    27   0.00    0.36
wd1           0.00     0   0.00    0.00        13.74    27   0.00    0.36
raid1         0.00     0   0.00    0.00        20.61    18   0.00    0.36
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.06    0.00        12.70   306   0.06    3.80
wd1           0.00     0   0.14    0.00        12.70   306   0.14    3.80
raid1         0.00     0   0.14    0.00        18.51   210   0.14    3.80
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.04    0.00        13.53   285   0.04    3.77
wd1           0.00     0   0.04    0.00        13.53   285   0.04    3.77
raid1         0.00     0   0.05    0.00        20.18   191   0.05    3.77
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.04    0.00        13.34   304   0.04    3.96
wd1          16.00     1   0.04    0.02        13.34   304   0.04    3.96
raid1        16.00     1   0.05    0.02        20.28   200   0.05    3.96
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.06    0.00        12.97   242   0.06    3.06
wd1           0.00     0   0.13    0.00        12.97   242   0.13    3.06
raid1         0.00     0   0.13    0.00        19.42   161   0.13    3.06
During the stress, do you see small synchronous writes in NFS traffic? Output of your first script:
2908 getattr
1140 lookup
969 access
273 fsstat
195 write
145 commit
110 create
102 setattr
94 remove
23 read
Output of your second script:
167 549 unstable
28 549 filesync
And here is the iostat without -o log (the load barely rises to 1):
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        14.86    48   0.00    0.69
wd1           0.00     0   0.00    0.00        14.86    48   0.00    0.69
raid1         0.00     0   0.00    0.00        22.30    32   0.00    0.69
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.07    0.00         7.75   227   0.07    1.72
wd1           0.00     0   0.06    0.00         7.75   227   0.06    1.72
raid1         0.00     0   0.09    0.00         7.85   224   0.09    1.72
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.01    0.00        14.08    50   0.01    0.69
wd1           2.00     1   0.01    0.00        14.08    50   0.01    0.69
raid1         2.00     1   0.02    0.00        32.64    22   0.02    0.69
device  read KB/t    r/s   time    MB/s  write KB/t    w/s   time    MB/s
wd0           0.00     0   0.00    0.00        10.40    20   0.00    0.20
wd1          16.00     1   0.05    0.02        10.40    20   0.05    0.20
raid1        16.00     1   0.05    0.02        14.86    14   0.05    0.20
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Fri, 20 Sep 2013 01:37:20 +0200 m...@netbsd.org (Emmanuel Dreyfus) wrote: Greg Oster os...@cs.usask.ca wrote: Any additional load you have on the RAID set while rebuilding parity is just going to make things worse... What you really want to do is turn on the parity logging stuff, and reduce the amount of effort spent checking parity by orders of magnitude... You mean raidctl -M yes, right? Correct. Later... Greg Oster
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 03:34:19AM +0200, Emmanuel Dreyfus wrote: Christos Zoulas chris...@zoulas.com wrote: On large filesystems with many files fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL. But you just told me that I will need a fsck after crash now I am running with vfs.wapbl.flush_disk_cache=0 so I wonder if I should not just mount without -o log. What are WAPBL benefits when running with vfs.wapbl.flush_disk_cache=0? For a NFS server, I'm not sure there's any benefit ... -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: high load, no bottleneck
On Tue, Sep 17, 2013 at 09:48:49PM +0200, Emmanuel Dreyfus wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: In AHCI mode, you might be able to use ordered tags or force unit access (does SATA have this concept per command?) to force individual transactions or series of transactions out, rather than flushing out all the data every time just to get the metadata into the journal on-disk. But that would take some work on our ATA subsystem.
Re: high load, no bottleneck
Thor Lancelot Simon t...@panix.com wrote: In AHCI mode, you might be able to use ordered tags or force unit access (does SATA have this concept per command?) to force individual transactions or series of transactions out, rather than flushing out all the data every time just to get the metadata into the journal on-disk. But that would take some work on our ATA subsystem. On another machine, I already had performance problems with SATA controllers in IDE emulation mode (recognized as piixide), which disappeared when selecting AHCI mode in the BIOS (turning it to be recognized as ahcisata). I will report on this as soon as the server is idle enough to be rebooted. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Sep 18, 3:34am, m...@netbsd.org (Emmanuel Dreyfus) wrote: -- Subject: Re: high load, no bottleneck | Christos Zoulas chris...@zoulas.com wrote: | | On large filesystems with many files fsck can take a really long time after | a crash. In my personal experience power outages are much less frequent than | crashes (I crash quite a lot since I always fiddle with things). If you | don't care about fsck time, you don't need WAPBL. | | But you just told me that I will need a fsck after crash now I am | running with vfs.wapbl.flush_disk_cache=0 so I wonder if I should not | just mount without -o log. What are WAPBL benefits when running with | vfs.wapbl.flush_disk_cache=0? You *might* need an fsck after power loss. If you crash and the disk syncs then you should be OK if the disk flushed (which it probably did if you see syncing disks after the panic). christos
Re: high load, no bottleneck
Christos Zoulas chris...@zoulas.com wrote: You *might* need an fsck after power loss. If you crash and the disk syncs then you should be OK if the disk flushed (which it probably did if you see syncing disks after the panic). I am not sure I ever encountered a crash where syncing disks after panic did not lock up the machine forever :-) -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: Thank you for saving my day. But now what happens? I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing: Switched to AHCI. Below is how the hard disks are discovered (the relevant raid is RAID1 on wd0 and wd1). In this setup, with vfs.wapbl.flush_disk_cache=1 I still get high loads, on both 6.0 and -current. I assume there must be something bad with WAPBL/RAIDframe.
ahcisata0 at pci0 dev 31 function 2: vendor 0x8086 product 0x2922 (rev. 0x02)
ahcisata0: interrupting at ioapic0 pin 17
ahcisata0: AHCI revision 1.20, 4 ports, 32 slots, CAP 0xe322ffe3<SXS,EMS,CCCS,PSC,SSC,PMD,SPM,ISS=0x2=Gen2,SCLO,SAL,SSNTF,SNCQ,S64A>
atabus2 at ahcisata0 channel 0
atabus3 at ahcisata0 channel 1
atabus4 at ahcisata0 channel 2
atabus5 at ahcisata0 channel 3
ahcisata0 port 0: device present, speed: 3.0Gb/s
ahcisata0 port 1: device present, speed: 3.0Gb/s
ahcisata0 port 2: device present, speed: 3.0Gb/s
ahcisata0 port 3: device present, speed: 3.0Gb/s
wd0 at atabus2 drive 0
wd0: WDC WD5000AAJS-55A8B0
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(ahcisata0:0:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd1 at atabus3 drive 0
wd1: WDC WD5000AAJS-55A8B0
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd2 at atabus4 drive 0
wd2: WDC WD5000AAJS-55A8B0
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
wd3 at atabus5 drive 0
wd3: WDC WD5000AADS-00S9B0
wd3: drive supports 16-sector PIO transfers, LBA48 addressing
wd3: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd3: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd3(ahcisata0:3:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
In this setup, vfs.wapbl.flush_disk_cache=1 still get high loads, on both 6.0 and -current. I assume there must be something bad with WAPBL/RAIDframe Everything up to and including 6.0 is broken in this respect. Thanks to hannken@, 6.1 does align journal flushes. How fast can you write to the file system in question? Does your NFS load include a large amount of small synchronous (filesync) write operations? The attached patch (by hannken@) and vfs.wapbl.verbose_commit=2 will tell you how long the journal flushes take. Don't activate (i.e. set verbose_commit) for longer than a few seconds without monitoring syslog size.
Index: vfs_wapbl.c
===================================================================
RCS file: /cvsroot/src/sys/kern/vfs_wapbl.c,v
retrieving revision 1.51.2.2
diff -u -r1.51.2.2 vfs_wapbl.c
--- vfs_wapbl.c 2 Jan 2013 23:23:15 -0000      1.51.2.2
+++ vfs_wapbl.c 18 Sep 2013 16:19:40 -0000
@@ -1456,6 +1456,8 @@
 	size_t flushsize;
 	size_t reserved;
 	int error = 0;
+	struct bintime start_time;
+	flushsize = 0;
 
 	/*
 	 * Do a quick check to see if a full flush can be skipped
@@ -1479,6 +1481,7 @@
 	 * if we want to call flush from inside a transaction
 	 */
 	rw_enter(&wl->wl_rwlock, RW_WRITER);
+	bintime(&start_time);
 	wl->wl_flush(wl->wl_mount, wl->wl_deallocblks,
 	    wl->wl_dealloclens, wl->wl_dealloccnt);
@@ -1712,6 +1715,24 @@
 	}
 #endif
 
+	if (wapbl_verbose_commit)
+	{
+		struct bintime d;
+		struct timespec ts;
+		int kbsec, msec;
+
+		bintime(&d);
+		bintime_sub(&d, &start_time);
+		bintime2timespec(&d, &ts);
+		msec = ts.tv_nsec / 1000000 + 1000 * ts.tv_sec;
+		if (msec == 0)
+			kbsec = 0;
+		else
+			kbsec = flushsize / msec;
+		printf("%s %zu bytes %d.%03d secs %d.%03d MB/sec\n",
+		    wl->wl_mount->mnt_stat.f_mntonname, flushsize,
+		    msec / 1000, msec % 1000, kbsec / 1000, kbsec % 1000);
+	}
 	rw_exit(&wl->wl_rwlock);
 	return error;
 }
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: How fast can you write to the file system in question? What test do you want me to perform? Does your NFS load include a large amount of small synchronous (filesync) write operations? Yes, I run 24 concurrent tar -czf as a test. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
EF How fast can you write to the file system in question? ED What test do you want me to perform? dd if=/dev/zero bs=64k EF Does your NFS load include a large amount of small synchronous (filesync) EF write operations? ED Yes, I run 24 concurrent tar -czf as a test. But those shouldn't do small synchronous writes, should they? Anyway, hannken@'s logging patch should reveal if slow log flushing is indeed the problem. P.S.: With us, the log flush alignment patch helped a lot, but a bunch of NFS clients running Thunderbird still locked up the file server. SQLite loves to do 4k sync writes, which kill WAPBL. I ended up writing a system that relocated the Mozilla profiles to local volatile storage, periodically syncing them to NFS. Oh, and Dropbox also uses SQLite.
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: EF How fast can you write to the file system in question? ED What test do you want me to perform? dd if=/dev/zero bs=64k
helvede# dd if=/dev/zero bs=64k of=out count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 18.365 secs (35685270 bytes/sec)
Note I removed -o log. EF Does your NFS load include a large amount of small synchronous (filesync) EF write operations? ED Yes, I run 24 concurrent tar -czf as a test. But those shouldn't do small synchronous writes, should they? Indeed they should not. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Yes, I run 24 concurent tar -czf as a test. But those shouldn't do small synchronous writes, should they? Depends. Is the filesystem mounted noatime (or read-only)? If not, there are going to be atime updates, and don't all inode updates get done synchronously? Or am I misunderstanding something? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: high load, no bottleneck
Edgar Fuß e...@math.uni-bonn.de wrote: 35685270 bytes/sec That's OK. Note I removed -o log Shouldn't make a difference, I think. I re-enabled -o log and did the dd test again on NetBSD 6.0 with the patch you posted and vfs.wapbl.verbose_commit=2
# dd if=/dev/zero bs=64k of=out count=10000
10000+0 records in
10000+0 records out
655360000 bytes transferred in 17.331 secs (37814321 bytes/sec)
kernel log:
wapbl_flush: 1379547406.828798896 this transaction = 168960 bytes
wapbl_cache_sync: 1: dev 0x1208 0.141346518
wapbl_cache_sync: 2: dev 0x1208 0.052974853
/ 168960 bytes 0.468 secs 0.361 MB/sec
wapbl_flush: 1379547409.299850220 this transaction = 323072 bytes
wapbl_cache_sync: 1: dev 0x120c 0.253761121
wapbl_cache_sync: 2: dev 0x120c 0.022943043
/home 323072 bytes 0.719 secs 0.449 MB/sec
wapbl_flush: 1379547417.023140298 this transaction = 136192 bytes
wapbl_cache_sync: 1: dev 0x1208 0.239226048
wapbl_cache_sync: 2: dev 0x1208 0.058130346
/ 136192 bytes 0.480 secs 0.283 MB/sec
wapbl_flush: 1379547420.514618291 this transaction = 338944 bytes
wapbl_cache_sync: 1: dev 0x120c 0.207321357
wapbl_cache_sync: 2: dev 0x120c 0.022987705
/home 338944 bytes 0.563 secs 0.602 MB/sec
Running my stress test, which drives load to insane values:
wapbl_flush: 1379547625.571507421 this transaction = 373760 bytes
wapbl_cache_sync: 1: dev 0x120c 0.099539954
wapbl_cache_sync: 2: dev 0x120c 0.009327683
/home 373760 bytes 0.132 secs 2.831 MB/sec
wapbl_flush: 1379547625.741582309 this transaction = 341504 bytes
wapbl_cache_sync: 1: dev 0x120c 0.136495561
wapbl_cache_sync: 2: dev 0x120c 0.009992682
/home 341504 bytes 0.168 secs 2.032 MB/sec
wapbl_flush: 1379547625.921656139 this transaction = 273408 bytes
wapbl_cache_sync: 1: dev 0x120c 0.147418701
wapbl_cache_sync: 2: dev 0x120c 0.010158968
/home 273408 bytes 0.178 secs 1.536 MB/sec
wapbl_flush: 1379547626.111737705 this transaction = 375808 bytes
wapbl_cache_sync: 1: dev 0x120c 0.116060286
wapbl_cache_sync: 2: dev 0x120c 0.006006430
/home 375808 bytes 0.159 secs 2.363 MB/sec
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Sep 17, 2013, at 5:39 PM, Emmanuel Dreyfus m...@netbsd.org wrote: It was suggested to me that this would be better posted on tech-kern. It happened on NetBSD-6.0, and I tried to upgrade the kernel to -current, with the same result. On Tue, Sep 17, 2013 at 12:54:59PM +0000, Emmanuel Dreyfus wrote: I have a NFS server that exhibits a high load (20-30) when supporting about 30 clients, while there is no apparent bottleneck: low disk activity, CPU idle most of the time, plenty of available RAM. Of course service is crappy, with many timeouts. Any hint of what can be going on? I found the bottleneck. ps does not show it because it happens within the different threads of nfsd. DDB tells me that almost all nfsd threads are waiting on tstile with this backtrace:
turnstile_block
rw_vector_enter
genfs_lock
VOP_LOCK
vn_lock
vget
ufs_ihashget
ffs_vget
ufs_fhtovp
VFS_FHTOVP
nfsrv_fhtovp
nfsrv_write
nfssvc_nfsd
sys_nfssvc
What are your clients doing? Which vnode(s) are your nfsd threads waiting on (first arg to vn_lock)? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: high load, no bottleneck
J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote: What are your clients doing? MacOS X machines opening sessions kill the server. I can reproduce the problem with just concurrent tar -czf on the NFS volume. Which vnode(s) are your nfsd threads waiting on (first arg to vn_lock)? Here is an example:
vn_lock(c5a24b08,2,0,c03a238e,4,c4ce9ed4,6,2ec4bcfb,3d90d5,0) at netbsd:vn_lock+0x7c
db{0}> show vnode c5a24b08
OBJECT 0xc5a24b08: locked=0, pgops=0xc0b185a8, npages=1720, refs=16
VNODE flags 0x4030<MPSAFE,LOCKSWORK,ONWORKLST>
mp 0xc4a14000 numoutput 0 size 0x6f writesize 0x6f
data 0xc5a25d74 writecount 0 holdcnt 2
tag VT_UFS(1) type VREG(1) mount 0xc4a14000 typedata 0xc4fe5480
v_lock 0xc5a24bac
db{0}> show ncache c5a24b08
name .nfs.2005108f.1174 parent user22
After some experiments, it seems -current is much more resistant: I cannot raise the load to 40 on it. -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Emmanuel Dreyfus m...@netbsd.org wrote: db{0}> show vnode c5a24b08 OBJECT 0xc5a24b08: locked=0, pgops=0xc0b185a8, npages=1720, refs=16 VNODE flags 0x4030<MPSAFE,LOCKSWORK,ONWORKLST> mp 0xc4a14000 numoutput 0 size 0x6f writesize 0x6f data 0xc5a25d74 writecount 0 holdcnt 2 tag VT_UFS(1) type VREG(1) mount 0xc4a14000 typedata 0xc4fe5480 v_lock 0xc5a24bac While many threads are waiting, another nfsd thread holds the lock with this backtrace:
turnstile_block
rw_vector_enter
wapbl_begin
ffs_write
VOP_WRITE
nfsrv_write
nfssvc_nfsd
sys_nfssvc
syscall
I understand it is waiting for another process to complete I/O before it can enter the rwlock in wapbl_begin. I have a first-class suspect with this other nfsd thread, which is engaged in I/O:
sleepq_block
wdc_exec_command
wd_flushcache
wdioctl
bdev_ioctl
spec_ioctl
VOP_IOCTL
rf_sync_component_caches
raidioctl
bdev_ioctl
spec_ioctl
VOP_IOCTL
wapbl_cache_sync
Is it a nasty interaction between RAIDframe, NFS and WAPBL? -- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
In article 1l9czcn.y6kr35aruvzvm%m...@netbsd.org, Emmanuel Dreyfus m...@netbsd.org wrote:

> Emmanuel Dreyfus m...@netbsd.org wrote:
> > db{0} show vnode c5a24b08
> > OBJECT 0xc5a24b08: locked=0, pgops=0xc0b185a8, npages=1720, refs=16
> > VNODE flags 0x4030<MPSAFE,LOCKSWORK,ONWORKLST>
> > mp 0xc4a14000 numoutput 0 size 0x6f writesize 0x6f
> > data 0xc5a25d74 writecount 0 holdcnt 2
> > tag VT_UFS(1) type VREG(1) mount 0xc4a14000 typedata 0xc4fe5480
> > v_lock 0xc5a24bac
>
> While many threads are waiting, another nfsd thread holds that vnode lock and is itself blocked with this backtrace:
>
> turnstile_block rw_vector_enter wapbl_begin ffs_write VOP_WRITE nfsrv_write nfssvc_nfsd sys_nfssvc syscall
>
> I understand it is waiting for another process to complete I/O before it can acquire the rwlock in wapbl_begin.
>
> I have a first-class suspect in this other nfsd thread, which is engaged in I/O:
>
> sleepq_block wdc_exec_command wd_flushcache wdioctl bdev_ioctl spec_ioctl VOP_IOCTL rf_sync_component_caches raidioctl bdev_ioctl spec_ioctl VOP_IOCTL wapbl_cache_sync
>
> Is it a nasty interaction between RAIDframe, NFS and WAPBL?

My suggestion is to try:

sysctl -w vfs.wapbl.flush_disk_cache=0

for now...

christos
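If that workaround does the trick, it can be applied at runtime and made permanent; a small sketch, assuming the stock /etc/sysctl.conf mechanism that is processed at boot:

# Apply the workaround now and read the value back.
sysctl -w vfs.wapbl.flush_disk_cache=0
sysctl vfs.wapbl.flush_disk_cache
# Persist it across reboots.
echo vfs.wapbl.flush_disk_cache=0 >> /etc/sysctl.conf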
Re: high load, no bottleneck
It was suggested that this would be better posted on tech-kern. It happened on NetBSD-6.0, and I tried upgrading the kernel to -current, with the same result.

On Tue, Sep 17, 2013 at 12:54:59PM +, Emmanuel Dreyfus wrote:

> I have an NFS server that exhibits a high load (20-30) when serving about 30 clients, while there is no apparent bottleneck: low disk activity, CPU idle most of the time, plenty of available RAM. Of course service is crappy, with many timeouts. Any hint about what could be going on?

I found the bottleneck. ps does not show it because it happens within the different threads of nfsd. DDB tells me that almost all nfsd threads are waiting on tstile with this backtrace:

turnstile_block
rw_vector_enter
genfs_lock
VOP_LOCK
vn_lock
vget
ufs_ihashget
ffs_vget
ufs_fhtovp
VFS_FHTOVP
nfsrv_fhtovp
nfsrv_write
nfssvc_nfsd
sys_nfssvc

-- Emmanuel Dreyfus m...@netbsd.org
Re: high load, no bottleneck
On Sep 17, 9:48pm, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: high load, no bottleneck

| Excellent: the load does not go over 2 now (compared to 50).
|
| Thank you for saving my day. But now what happens?
| I note the SATA disks are in IDE emulation mode, and not AHCI. This is
| something I need to try changing:

What happens highly depends on the drive (how frequently it flushes its cache to disk internally and how long it keeps data in the cache), but it is never good. The best-case scenario would be that WAPBL writes are ordered properly and that a cache flush is only sent occasionally, between transactionally safe metadata commit points, but it seems that this is not happening (because we are getting too many flushes).

The case to worry about is the scenario where the machine suddenly loses power and the data never makes it to the physical media, getting lost from the cache. In this case you might end up with a filesystem that has inconsistent metadata, so the next reboot might end up causing a panic when the filesystem is used. The solution there is to reboot and force an fsck. If you have a UPS I would not worry too much about it; even if your system panics, the kernel should still issue the flush commands to the disk.

BTW, I hope that everyone realizes that WAPBL deals only with metadata and not the actual file data, so if you crash/lose power you typically end up with garbage in the active files (usually bits and pieces from other files, or NULs).

christos
Re: high load, no bottleneck
Christos Zoulas chris...@astron.com wrote:

> My suggestion is to try:
> sysctl -w vfs.wapbl.flush_disk_cache=0
> for now...

Excellent: the load does not go over 2 now (compared to 50).

Thank you for saving my day. But now what happens?

I note the SATA disks are in IDE emulation mode, and not AHCI. This is something I need to try changing:

piixide0 at pci0 dev 31 function 2: Intel 82801I Serial ATA Controller (ICH9) (rev. 0x02)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at ioapic0 pin 14
piixide0: secondary channel configured to compatibility mode
piixide0: secondary channel interrupting at ioapic0 pin 15
atabus2 at piixide0 channel 0
wd0 at atabus2 drive 0
atabus3 at piixide0 channel 1
wd2 at atabus3 drive 0

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
Christos Zoulas chris...@zoulas.com wrote:

> The case to worry about is the scenario where the machine suddenly loses power and the data never makes it to the physical media, getting lost from the cache. In this case you might end up with a filesystem that has inconsistent metadata, so the next reboot might end up causing a panic when the filesystem is used. The solution there is to reboot and force an fsck.

It seems the system would be better off without WAPBL enabled in this case. Is there any benefit left?

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Sep 18, 2:22am, m...@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: high load, no bottleneck

| The case to worry about is the scenario where the machine
| suddenly loses power and the data never makes it to the physical media,
| getting lost from the cache. In this case you might end up with a
| filesystem that has inconsistent metadata, so the next reboot might
| end up causing a panic when the filesystem is used. The solution there
| is to reboot and force an fsck.
|
| It seems the system would be better off without WAPBL enabled in this case.
| Is there any benefit left?

On large filesystems with many files, fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL. Another easy thing you could try is to put the WAPBL log on a flash drive and re-enable the cache flushes.

christos
Re: high load, no bottleneck
hello. How do you move the WAPBL log to a drive other than the one on which the filesystem that's being logged resides? In other words, I thought the log lived on the same media as the filesystem. Is that not the case?

On Sep 17, 8:34pm, Christos Zoulas wrote:
} Subject: Re: high load, no bottleneck
} On Sep 18, 2:22am, m...@netbsd.org (Emmanuel Dreyfus) wrote:
} -- Subject: Re: high load, no bottleneck
}
} | The case to worry about is the scenario where the machine
} | suddenly loses power and the data never makes it to the physical media,
} | getting lost from the cache. In this case you might end up with a
} | filesystem that has inconsistent metadata, so the next reboot might
} | end up causing a panic when the filesystem is used. The solution there
} | is to reboot and force an fsck.
} |
} | It seems the system would be better off without WAPBL enabled in this case.
} | Is there any benefit left?
}
} On large filesystems with many files, fsck can take a really long time after
} a crash. In my personal experience power outages are much less frequent than
} crashes (I crash quite a lot since I always fiddle with things). If you
} don't care about fsck time, you don't need WAPBL. Another easy thing you could
} try is to put the WAPBL log on a flash drive and re-enable the cache flushes.
}
} christos
-- End of excerpt from Christos Zoulas
Re: high load, no bottleneck
Christos Zoulas chris...@zoulas.com wrote:

> On large filesystems with many files, fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL.

But you just told me that I will need an fsck after a crash now that I am running with vfs.wapbl.flush_disk_cache=0, so I wonder whether I should not just mount without -o log. What are WAPBL's benefits when running with vfs.wapbl.flush_disk_cache=0?

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org
Re: high load, no bottleneck
On Wed, Sep 18, 2013 at 03:34:19AM +0200, Emmanuel Dreyfus wrote:
> Christos Zoulas chris...@zoulas.com wrote:
> > On large filesystems with many files, fsck can take a really long time after a crash. In my personal experience power outages are much less frequent than crashes (I crash quite a lot since I always fiddle with things). If you don't care about fsck time, you don't need WAPBL.
>
> But you just told me that I will need an fsck after a crash now that I am running with vfs.wapbl.flush_disk_cache=0, so I wonder whether I should not just mount without -o log. What are WAPBL's benefits when running with vfs.wapbl.flush_disk_cache=0?

To the extent it's correct (which may vary), it's much faster. The downside is that without the cache flushing there's some chance that fsck won't be able to repair things afterwards. The only real solution is to figure out why it's being slow.

(As far as I know WAPBL doesn't support an external journal, and even if it did, putting one on an SSD that doesn't have power-failure recovery is worse than useless.)

-- David A. Holland dholl...@netbsd.org
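Given that risk, the conservative habit after any unclean shutdown while running with flushes disabled is to force a full check rather than trust the quick journal replay; a sketch, with /dev/rraid0e as a made-up name for the partition behind /home (on a logged FFS, fsck normally just replays the log, so -f is what forces the real pass):

# Boot single-user or unmount the filesystem first, then force a full
# check: -f checks even if the log replay says clean, -y auto-answers
# the repair questions.
umount /home
fsck_ffs -f -y /dev/rraid0e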
Re: high load, no bottleneck
David Holland dholland-t...@netbsd.org wrote:

> The downside is that without the cache flushing there's some chance that fsck won't be able to repair things afterwards.

This is scary. If this is a WAPBL/NFS/RAIDframe interaction, I think I'd rather dump the RAID than give up the insurance of a working fsck after a crash.

-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org