Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Robert Milkowski wrote On 06/28/06 15:52:
> Hello Neil,
>
> Wednesday, June 21, 2006, 8:15:54 PM, you wrote:
>
> NP> Robert Milkowski wrote On 06/21/06 11:09:
> NP>> Hello Neil,
> NP>> Why is this option available then? (Yes, that's a loaded question.)
> NP> I wouldn't call it an option, but an internal debugging switch that I
> NP> originally added to allow progress when initially integrating the ZIL.
> NP> As Roch says, it really shouldn't ever be set (as it does negate POSIX
> NP> synchronous semantics). Nor should it be mentioned to a customer.
> NP> In fact I'm inclined to now remove it - however, it does still have a
> NP> use, as it helped root-cause this problem.
>
> Isn't it similar to the unsupported fastfs for ufs?
>
> NP> It is similar in the sense that it speeds up the file system.
> NP> Using fastfs can be much more dangerous though, as it can lead
> NP> to a badly corrupted file system: writing of metadata is delayed
> NP> and written out of order. Whereas disabling the ZIL does not affect
> NP> the integrity of the fs. The transaction group model of ZFS gives
> NP> consistency in the event of a crash/power failure. However, any data
> NP> that was promised to be on stable storage may not be unless the
> NP> transaction group committed (an operation that is started every 5s).
> NP>
> NP> We once had plans to add a mount option to allow the admin
> NP> to control the ZIL. Here's a brief section of the RFE (6280630):
> NP>
> NP>   sync={deferred,standard,forced}
> NP>
> NP>   Controls synchronous semantics for the dataset.
> NP>
> NP>   When set to 'standard' (the default), synchronous operations
> NP>   such as fsync(3C) behave precisely as defined in fcntl.h(3HEAD).
> NP>
> NP>   When set to 'deferred', requests for synchronous semantics
> NP>   are ignored. However, ZFS still guarantees that ordering
> NP>   is preserved -- that is, consecutive operations reach stable
> NP>   storage in order. (If a thread performs operation A followed
> NP>   by operation B, then the moment that B reaches stable storage,
> NP>   A is guaranteed to be on stable storage as well.) ZFS also
> NP>   guarantees that all operations will be scheduled for write to
> NP>   stable storage within a few seconds, so that an unexpected
> NP>   power loss only takes the last few seconds of change with it.
> NP>
> NP>   When set to 'forced', all operations become synchronous.
> NP>   No operation will return until all previous operations
> NP>   have been committed to stable storage. This option can be
> NP>   useful if an application is found to depend on synchronous
> NP>   semantics without actually requesting them; otherwise, it
> NP>   will just make everything slow, and is not recommended.
> NP>
> NP> Of course we would need to stress the dangers of setting 'deferred'.
> NP> What do you guys think?
>
> I think it would be really useful. I have found myself many times in
> situations where such features (like fastfs) were my last-resort help.

The overwhelming consensus was that it would be useful. So I'll go ahead
and put that on my to-do list.

> The same with txg_time - in some cases tuning it could probably be
> useful. Instead of playing with mdb it would be much better put into
> zpool/zfs or another utility (and if possible made per-fs, not per-host).

This one I'm less sure about. I have certainly tuned txg_time myself to
force certain situations, but I wouldn't be happy exposing the inner
workings of ZFS - which may well change.

Neil
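For concreteness, this is how the proposed property might be driven if
RFE 6280630 were implemented as written. A sketch only: the sync=
property did not exist at the time of this thread, and 'tank/build' is
a made-up dataset name.

  # Hypothetical usage of the sync= property proposed in RFE 6280630.
  zfs set sync=deferred tank/build   # ignore synchronous semantics (dangerous)
  zfs get sync tank/build            # inspect the current setting
  zfs set sync=standard tank/build   # restore POSIX-conformant behavior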
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
So if you have a single thread doing open/write/close of 8K files and you
get 1.25MB/sec, that tells me you have something like a 6ms I/O latency,
which also looks reasonable. What does iostat -x (client side) say for
svc_t? 400ms seems high for the workload _and_ doesn't match my formula,
so I don't like it ;-) A quick look at your script looks fine though; but
something just does not compute here.

Why this formula (which applies to any single-threaded NFS client app
working on small files): even if the open and write parts are infinitely
fast, on close(2) NFS must ensure that data is sent to disk. So at a
minimum every close(2) must wait one I/O latency. During that wait the
single-threaded client application will not initiate the following
open/write/close sequence. At best you get one file output per I/O
latency. The I/O latency is the one seen by the client and includes the
network part, but that should be small compared to the physical I/O.

-r
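As a sanity check of the formula, the implied per-close latency falls out
directly from the numbers in this thread (8K files at 1.25MB/sec); a
quick sketch:

  # throughput ~= (avg file size) / (per-close I/O latency), so
  # latency ~= (avg file size) / throughput
  awk 'BEGIN {
      filesize_mb = 8 / 1024        # 8K files, in MB
      throughput  = 1.25            # observed MB/sec over NFS
      printf("implied I/O latency: %.1f ms per file\n",
             filesize_mb / throughput * 1000)
  }'

This prints "implied I/O latency: 6.3 ms per file", which matches the
~6ms figure above.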
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
To clarify what has just been stated: with the zil disabled I got
4MB/sec; with the zil enabled I get 1.25MB/sec.

On 6/23/06, Tao Chen [EMAIL PROTECTED] wrote:
> On 6/23/06, Roch [EMAIL PROTECTED] wrote:
>> On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
>>> On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote:
>>>>> a test against the same iscsi targets using linux and XFS and the
>>>>> NFS server implementation there gave me 1.25MB/sec writes. I was
>>>>> about to throw in the towel and deem ZFS/NFS as unusable until B41
>>>>> came along and at least gave me 1.25MB/sec.
>>>> That's still super slow -- is this over a 10Mb link or something?
>>>> Jeff
>> I think the performance is in line with expectation for a small-file,
>> single-threaded, open/write/close NFS workload (nfs must commit on
>> close). Therefore I expect: (avg file size) / (I/O latency).
>> Joe, does this formula approach the 1.25 MB/s?
>
> Joe sent me another set of DTrace output (biorpt.sh.rec.gz), running
> 105 seconds with zil_disable=1. I generated a graph using Grace
> (rec.gif). The interesting parts for me:
>
> 1) How I/O response time (at bdev level) changes in a pattern.
> 2) Both iSCSI (sd2) and local (sd1) storage follow the same pattern
>    and have almost identical latency on average.
> 3) The latency is very high, both on average and at peaks. Although low
>    throughput is expected given a large number of small files, I don't
>    expect such high latency, and of course 1.25MB/s is too low; even
>    after turning on zil_disable, I see 4MB/s in this data set.
>
> I/O sizes at bdev level are actually pretty decent: mostly (75%) 128KB.
>
> Here's a summary:
>
> # biorpt -i biorpt.sh.rec
> Generating report from biorpt.sh.rec ...
>
> === Top 5 I/O types ===
> DEVICE  T  BLKs  COUNT
> sd1     W   256   3122
> sd2     W   256   3118
> sd1     W     2    164
> sd2     W     2    151
> sd2     W     3    123
>
> === Top 5 worst I/O response time ===
> DEVICE  T  BLKs     OFFSET   TIMESTAMP  TIME.ms
> sd1     W   256  529562656  104.322170  3316.90
> sd1     W   256  529563424  104.322185  3281.97
> sd2     W   256  521152480  104.262081  3262.49
> sd2     W   256  521152736  104.262102  3258.56
> sd1     W   256  529562912  104.262091  3249.85
>
> === Top 5 Devices with largest number of I/Os ===
> DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
> sd1        7    2.70   0   4169  440.62  409  4176    0%
> sd2        6    0.25   0   4131  444.79  407  4137    0%
> cmdk0      5   21.50   0    138    0.82    0   143   11%
>
> === Top 5 Devices with largest amount of data transfer ===
> DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB  Tol.MB  MB/s
> sd1        7    2.70   0   4169  440.62  409     409     4
> sd2        6    0.25   0   4131  444.79  407     407     4
> cmdk0      5   21.50   0    138    0.82    0       0     0
>
> Tao
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/23/06, Roch [EMAIL PROTECTED] wrote:
> Joe Little writes:
>> On 6/22/06, Bill Moore [EMAIL PROTECTED] wrote:
>>> Hey Joe. We're working on some ZFS changes in this area, and if you
>>> could run an experiment for us, that would be great. Just do this:
>>>
>>>   echo 'zil_disable/W1' | mdb -kw
>>>
>>> We're working on some fixes to the ZIL so it won't be a bottleneck
>>> when fsyncs come around. The above command will let us know what
>>> kind of improvement is on the table. After our fixes you could get
>>> from 30-80% of that improvement, but this would be a good data point.
>>>
>>> This change makes ZFS ignore the iSCSI/NFS fsync requests, but we
>>> still push out a txg every 5 seconds. So at most, your disk will be
>>> 5 seconds out of date compared to what it should be. It's a pretty
>>> small window, but it all depends on your appetite for such windows. :)
>>>
>>> After running the above command, you'll need to unmount/mount the
>>> filesystem in order for the change to take effect. If you don't have
>>> time, no big deal.
>>>
>>> --Bill
>>>
>>> On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
>>>> On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote:
>>>>>> a test against the same iscsi targets using linux and XFS and the
>>>>>> NFS server implementation there gave me 1.25MB/sec writes. I was
>>>>>> about to throw in the towel and deem ZFS/NFS as unusable until
>>>>>> B41 came along and at least gave me 1.25MB/sec.
>>>>> That's still super slow -- is this over a 10Mb link or something?
>>>>> Jeff
> I think the performance is in line with expectation for a small-file,
> single-threaded, open/write/close NFS workload (nfs must commit on
> close). Therefore I expect: (avg file size) / (I/O latency).
> Joe, does this formula approach the 1.25 MB/s?

To this day, I still don't know how to calculate the I/O latency.
Average file size is always expected to be close to kernel page size for
NASes -- 4-8k. Always tune for that.

>>>> Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to
>>>> the iscsi target, and single gig-e link (nge) to the NFS clients,
>>>> who are gig-e. Sun Ultra20 or AMD Quad Opteron, again with no
>>>> difference. Again, the issue is the multiple fsyncs that NFS
>>>> requires, and likely the serialization of those iscsi requests.
>>>> Apparently, there is a basic latency in iscsi that one could
>>>> improve upon with FC, but we are definitely in the all-ethernet/
>>>> iscsi camp for multi-building storage pool growth and don't have
>>>> interest in an FC-based SAN.
>> Well, following Bill's advice and the previous note on disabling the
>> zil, I ran my test on a B38 opteron initiator and, if you do a time
>> on the copy from the client, 6250 8k files transfer at 6MB/sec now.
>> If you watch the entire commit on the backend using zpool iostat 1,
>> I see that it takes a few more seconds, and the actual rate there is
>> 4MB/sec. Beats my best of 1.25MB/sec, and this is not B41.
>
> Joe, you know this but for the benefit of others, I have to highlight
> that running any NFS server this way may cause silent data corruption
> from the client's point of view. Whenever a server keeps data in RAM
> this way and does not commit it to stable storage upon request from
> clients, that opens a time window for corruption. So a client writes
> to a page, then reads the same page, and if the server suffered a
> crash in between, the data may not match. So this is performance at
> the expense of data integrity.
>
> -r

Yes.. ZFS in its normal mode has better data integrity. However, this
may be a more ideal tradeoff if you have specific read/write patterns.
In my case, I'm going to use ZFS initially for my tier2 storage, with
nightly write periods (needs to be a short-duration rsync from tier1)
and mostly read periods throughout the rest of the day. I'd love to use
ZFS as a tier1 service as well, but then you'd have to perform as a
NetApp does. Same tricks, same NVRAM or initial write to local stable
storage before writing to backend storage. 6MB/sec is closer to expected
behavior for first tier, at the expense of reliability. I don't know
what the answer is for Sun to make ZFS 1st-tier quality with their NFS
implementation and its sync happiness.
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Joe Little wrote:
> On 6/23/06, Roch [EMAIL PROTECTED] wrote:
>> Joe, you know this but for the benefit of others, I have to highlight
>> that running any NFS server this way may cause silent data corruption
>> from the client's point of view. Whenever a server keeps data in RAM
>> this way and does not commit it to stable storage upon request from
>> clients, that opens a time window for corruption. So a client writes
>> to a page, then reads the same page, and if the server suffered a
>> crash in between, the data may not match. So this is performance at
>> the expense of data integrity.

I agree; as a RAS guy this line of reasoning makes me nervous... I've
never known anyone who regularly made this trade-off and didn't get
burned.

> Yes.. ZFS in its normal mode has better data integrity. However, this
> may be a more ideal tradeoff if you have specific read/write patterns.

The only pattern this makes sense for is the write-only pattern. That
pattern has near-zero utility.

> In my case, I'm going to use ZFS initially for my tier2 storage, with
> nightly write periods (needs to be a short-duration rsync from tier1)
> and mostly read periods throughout the rest of the day. I'd love to
> use ZFS as a tier1 service as well, but then you'd have to perform as
> a NetApp does. Same tricks, same NVRAM or initial write to local
> stable storage before writing to backend storage. 6MB/sec is closer to
> expected behavior for first tier, at the expense of reliability. I
> don't know what the answer is for Sun to make ZFS 1st-tier quality
> with their NFS implementation and its sync happiness.

I know the answer will not compromise data integrity.
-- richard
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
I should copy this to the list.

-------- Forwarded message --------
On 6/23/06, Joe Little [EMAIL PROTECTED] wrote:
> I can post back to Roch what this latency is. I think the latency is a
> constant regardless of the zil or not. All that I do by disabling the
> zil is that I'm able to submit larger chunks at a time (faster) than
> doing 1k or worse blocks 3 times per file (the NFS fsync penalty).

Please send the script (I attached a modified version) along with the
result. They need to see how it works to trust (or dispute) the result.
Rule #1 in performance tuning is do not trust the report from an
unproven tool :)

I have some comments on the output below.

> This is for a bit longer (16 trees of 6250 8k files, again with zil
> disabled):
>
> Generating report from biorpt.sh.rec ...
>
> === Top 5 I/O types ===
> DEVICE  T  BLKs  COUNT
> sd2     W   256   3095
> sd1     W   256   2843
> sd1     W     2    201
> sd2     W     2    197
> sd1     W    32    185

This part tells me the majority of I/Os are 128KB writes on sd2 and sd1.

> === Top 5 worst I/O response time ===
> DEVICE  T  BLKs     OFFSET   TIMESTAMP  TIME.ms
> sd2     W   175  529070671   85.933843  3559.55
> sd1     W   256  521097680   47.561918  3097.21
> sd1     W   256  521151968   54.944253  3090.42
> sd1     W   256  521152224   54.944207  3090.23
> sd1     W    64  521152480   54.944241  3090.21

Longest response times are more than 3 seconds, ouch.

> === Top 5 Devices with largest number of I/Os ===
> DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
> sd1        6    0.34   0   4948  387.88  413  4954    0%
> sd2        6    0.25   0   4230  387.07  405  4236    0%
> cmdk0     23    8.11   0    152    0.84    0   175   10%

Average response time of 300ms is bad. I calculate the SEEK rate on a
512-byte block basis; since I/Os are mostly 128K, the seek rate is less
than 1% (0), in other words I consider this mostly sequential I/O. I
guess it's debatable whether a 512-byte-based calculation is meaningful.

> === Top 5 Devices with largest amount of data transfer ===
> DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB  Tol.MB  MB/s
> sd1        6    0.34   0   4948  387.88  413     413     4
> sd2        6    0.25   0   4230  387.07  405     405     4
> cmdk0     23    8.11   0    152    0.84    0       0     0
>
> === Report saved in biorpt.sh.rec.rpt ===

I calculate the MB/s on a per-second basis, meaning as long as there's
at least one finished I/O on the device in a second, that second is used
in calculating throughput.

Tao

[Attachment: biorpt.sh - Bourne shell script]
Re: Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/23/06, Richard Elling [EMAIL PROTECTED] wrote:
> comment on analysis below...
> Tao Chen wrote:
>> === Top 5 Devices with largest number of I/Os ===
>> DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
>> sd1        6    0.34   0   4948  387.88  413  4954    0%
>> sd2        6    0.25   0   4230  387.07  405  4236    0%
>> cmdk0     23    8.11   0    152    0.84    0   175   10%
>>
>> Average response time of 300ms is bad.
>
> Average is totally useless with this sort of a distribution. I'd
> suggest using a statistical package to explore the distribution. Just
> a few 3-second latencies will skew the average quite a lot.
> -- richard

A summary report is nothing more than an indication of issues, or
non-issues. So I agree that an average is just that, an average.
However, a few 3-second latencies will not spoil the result too much
when there are more than 4000 I/Os sampled.

The script saves the raw data in a .rec file, so you can run whatever
statistical tool you have against it. I am currently more worried about
how accurate and useful the raw data is, which is generated by a DTrace
command in the script. The raw record is in this format:

- Timestamp (sec.microsec)
- DeviceName
- W/R
- BLK_NO (offset)
- BLK_CNT (I/O size)
- IO_Time (I/O elapsed time, msec.xx)

Tao
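For example, a quick percentile summary over the .rec file could look
like the sketch below; it assumes whitespace-separated records in the
field order just listed, with IO_Time as the sixth and last field.

  # Median/p99/max of I/O elapsed time (ms) from a biorpt.sh .rec file.
  sort -n -k6 biorpt.sh.rec | awk '
      { t[NR] = $6 }
      END {
          if (NR == 0) exit
          printf("I/Os: %d  median: %.2f ms  p99: %.2f ms  max: %.2f ms\n",
                 NR, t[int(NR * 0.50) + 1], t[int(NR * 0.99) + 1], t[NR])
      }'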
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
How about the 'deferred' option being on a leased basis, with a deadline
to revert to normal behavior; at most 24hrs at a time. Console output
every time the option is enabled.

-r

Torrey McMahon writes:
> Neil Perrin wrote:
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
> That's the key: Be very explicit about what the option does and the
> side effects.
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
> I can think of a use case for deferred: improving the efficiency of a
> large mega-transaction/batch job such as a nightly build.

Yum Yum!! We could even build this into nightly(1) once we have user
delegation to create clones. nightly(1) would zfs clone, zfs set
reservation=, zfs set sync=deferred, and when it is done, release the
reservation, unset deferred, and snapshot; a sketch of the flow follows.

When can we have it?

-- Darren J Moffat
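A sketch of that flow, assuming the proposed sync= property and clone
delegation were available; the dataset names and the build command are
hypothetical:

  zfs clone tank/ws@clean tank/nightly     # start from a known-good snapshot
  zfs set reservation=10G tank/nightly     # guarantee space for the build
  zfs set sync=deferred tank/nightly       # defer fsync()s for the duration
  run_nightly_build /tank/nightly          # hypothetical build step
  zfs set sync=standard tank/nightly       # revert to POSIX semantics
  lockfs -f /tank/nightly                  # one big sync at the very end
  zfs set reservation=none tank/nightly    # release the reservation
  zfs snapshot tank/nightly@done           # capture the finished build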
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>>> Of course we would need to stress the dangers of setting 'deferred'.
>>> What do you guys think?
>> I can think of a use case for deferred: improving the efficiency of a
>> large mega-transaction/batch job such as a nightly build.
> Yum Yum!! We could even build this into nightly(1) once we have user
> delegation to create clones. nightly(1) would zfs clone, zfs set
> reservation=, zfs set sync=deferred, and when it is done, release the
> reservation, unset deferred, and snapshot.
> When can we have it?

Before we get too far down that path, has anyone timed a nightly build
with and without zil_disable set?

Dana
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 03:55, Roch wrote:
> How about the 'deferred' option being on a leased basis, with a
> deadline to revert to normal behavior; at most 24hrs at a time.

why?

> Console output every time the option is enabled.

in general, no. error messages to the console should be reserved for
truly frightening events, and this simply isn't one of them.

- Bill
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld writes:
> On Thu, 2006-06-22 at 03:55, Roch wrote:
>> How about the 'deferred' option being on a leased basis, with a
>> deadline to revert to normal behavior; at most 24hrs at a time.
> why?

I'll trust your judgement over mine on this, so I won't press. But it
was mentioned that this would be useful to implement a time-bounded huge
meta-transaction such as a build. Given that we eventually do want to
have a point where we know that data is on stable storage, I figured we
could say upfront what the time scale is.

Is there a sync command that targets an individual FS?

-r
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 13:01, Roch wrote:
> Is there a sync command that targets an individual FS?

Yes. lockfs -f

- Bill
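For example (the mountpoint here is hypothetical):

  # Flush everything pending on one mounted filesystem to stable storage.
  lockfs -f /tank/fs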
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Thu, 2006-06-22 at 13:01, Roch wrote:
>> Is there a sync command that targets an individual FS?
> Yes. lockfs -f

Does lockfs work with ZFS? The man page appears to indicate it is very
UFS-specific.

-- Darren J Moffat
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, 2006-06-22 at 13:19, Darren J Moffat wrote:
>> Yes. lockfs -f
> Does lockfs work with ZFS? The man page appears to indicate it is very
> UFS-specific.

all of lockfs does not. but, if truss is to be believed, the ioctl used
by lockfs -f appears to. or at least, it returns without error.

- Bill
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 06:19:20PM +0100, Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Thu, 2006-06-22 at 13:01, Roch wrote:
>>> Is there a sync command that targets an individual FS?
>> Yes. lockfs -f
> Does lockfs work with ZFS? The man page appears to indicate it is very
> UFS-specific.

Well, it just ends up doing an ioctl(), which ZFS recognizes:

# dtrace -n 'syscall::ioctl:entry/pid == $target/{self->on = 1}' \
    -n 'fbt:::/self->on/{}' \
    -n 'syscall::ioctl:return/self->on/{self->on = 0}' \
    -F -c 'lockfs -f /aux1'
dtrace: description 'syscall::ioctl:entry' matched 1 probe
dtrace: description 'fbt:::' matched 44321 probes
dtrace: description 'syscall::ioctl:return' matched 1 probe
dtrace: pid 151072 has exited
CPU FUNCTION
  0  -> ioctl
  0    -> getf
  0      -> set_active_fd
  0      <- set_active_fd
  0    <- getf
  0    -> get_udatamodel
  0    <- get_udatamodel
  0    -> fop_ioctl
  0      -> zfs_ioctl
  0      <- zfs_ioctl
  0      -> zfs_sync
  0        -> zil_commit
  0        <- zil_commit
  0      <- zfs_sync
  0    <- fop_ioctl
  0    -> releasef
  0      -> clear_active_fd
  0      <- clear_active_fd
  0      -> cv_broadcast
  0      <- cv_broadcast
  0    <- releasef
  0  <- ioctl

So the sync happens.

Cheers,
- jonathan

-- Jonathan Adams, Solaris Kernel Development
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
As I recall, the zfs sync is, unlike UFS, synchronous.

-r
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Yep. ZFS supports the ioctl (_FIOFFS) which 'lockfs -f' issues.

-- Prabahar.

Darren J Moffat wrote:
> Bill Sommerfeld wrote:
>> On Thu, 2006-06-22 at 13:01, Roch wrote:
>>> Is there a sync command that targets an individual FS?
>> Yes. lockfs -f
> Does lockfs work with ZFS? The man page appears to indicate it is very
> UFS-specific.
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
> a test against the same iscsi targets using linux and XFS and the NFS
> server implementation there gave me 1.25MB/sec writes. I was about to
> throw in the towel and deem ZFS/NFS as unusable until B41 came along
> and at least gave me 1.25MB/sec.

That's still super slow -- is this over a 10Mb link or something?

Jeff
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote:
>> a test against the same iscsi targets using linux and XFS and the NFS
>> server implementation there gave me 1.25MB/sec writes. I was about to
>> throw in the towel and deem ZFS/NFS as unusable until B41 came along
>> and at least gave me 1.25MB/sec.
> That's still super slow -- is this over a 10Mb link or something?
> Jeff

Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the
iscsi target, and single gig-e link (nge) to the NFS clients, who are
gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference.

Again, the issue is the multiple fsyncs that NFS requires, and likely
the serialization of those iscsi requests. Apparently, there is a basic
latency in iscsi that one could improve upon with FC, but we are
definitely in the all-ethernet/iscsi camp for multi-building storage
pool growth and don't have interest in an FC-based SAN.
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
> Again, the issue is the multiple fsyncs that NFS requires, and likely
> the serialization of those iscsi requests. Apparently, there is a
> basic latency in iscsi that one could improve upon with FC, but we are
> definitely in the all-ethernet/iscsi camp for multi-building storage
> pool growth and don't have interest in an FC-based SAN.

There may be two things going on here. One is the fsync()s. So many
fsync()s, required by the semantics of NFS, serialize I/O too much
(which, with a client doing single-threaded things, kills), and, looking
at Roch's "Dynamics of ZFS" blog entry, perhaps there's some aliasing of
fsync() pressure as memory pressure/disk I/O saturation. From Roch's
blog entry:

  CAVEAT: The current state of the ZIL is such that if there is a lot
  of pending data in a Filesystem (written to the FS, not yet output to
  disk) and a process issues an fsync() for one of its files, then all
  pending operations will have to be sent to disk before the
  synchronous command can complete. This can lead to unexpected
  performance characteristics. Code is under review.

Since NFS is doing so many fsyncs, this, I reason, can force many writes
into txg+2, thus triggering throttling of the NFS server threads.

Nico
--
Re: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/22/06, Bill Moore [EMAIL PROTECTED] wrote:
> Hey Joe. We're working on some ZFS changes in this area, and if you
> could run an experiment for us, that would be great. Just do this:
>
>   echo 'zil_disable/W1' | mdb -kw
>
> We're working on some fixes to the ZIL so it won't be a bottleneck
> when fsyncs come around. The above command will let us know what kind
> of improvement is on the table. After our fixes you could get from
> 30-80% of that improvement, but this would be a good data point.
>
> This change makes ZFS ignore the iSCSI/NFS fsync requests, but we
> still push out a txg every 5 seconds. So at most, your disk will be 5
> seconds out of date compared to what it should be. It's a pretty small
> window, but it all depends on your appetite for such windows. :)
>
> After running the above command, you'll need to unmount/mount the
> filesystem in order for the change to take effect. If you don't have
> time, no big deal.
>
> --Bill
>
> On Thu, Jun 22, 2006 at 04:22:22PM -0700, Joe Little wrote:
>> On 6/22/06, Jeff Bonwick [EMAIL PROTECTED] wrote:
>>>> a test against the same iscsi targets using linux and XFS and the
>>>> NFS server implementation there gave me 1.25MB/sec writes. I was
>>>> about to throw in the towel and deem ZFS/NFS as unusable until B41
>>>> came along and at least gave me 1.25MB/sec.
>>> That's still super slow -- is this over a 10Mb link or something?
>>> Jeff
>> Nope, gig-e link (single e1000g, or aggregate, doesn't matter) to the
>> iscsi target, and single gig-e link (nge) to the NFS clients, who are
>> gig-e. Sun Ultra20 or AMD Quad Opteron, again with no difference.
>> Again, the issue is the multiple fsyncs that NFS requires, and likely
>> the serialization of those iscsi requests. Apparently, there is a
>> basic latency in iscsi that one could improve upon with FC, but we
>> are definitely in the all-ethernet/iscsi camp for multi-building
>> storage pool growth and don't have interest in an FC-based SAN.

Well, following Bill's advice and the previous note on disabling the
zil, I ran my test on a B38 opteron initiator and, if you do a time on
the copy from the client, 6250 8k files transfer at 6MB/sec now. If you
watch the entire commit on the backend using zpool iostat 1, I see that
it takes a few more seconds, and the actual rate there is 4MB/sec. Beats
my best of 1.25MB/sec, and this is not B41.
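For anyone repeating the experiment, the switch can be inspected and set
as below; the /etc/system line is the persistent form used elsewhere in
this thread. Again, this negates synchronous semantics and is for
testing only.

  echo 'zil_disable/D' | mdb -k     # print the current value
  echo 'zil_disable/W1' | mdb -kw   # disable the ZIL on the live kernel
  # Persistent form, applied at boot via /etc/system:
  #   set zfs:zil_disable=1
  # Remember to unmount/mount the filesystem for the change to take effect.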
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
The vi we were doing was on a 2-line file. If you just vi a new file,
add one line, and exit, it would take 15 minutes in fdsynch. On the
recommendation of a workaround we set

  set zfs:zil_disable=1

and after the reboot the fdsynch is now 0.1 seconds. Now I have no idea
if it was this setting or the fact that we went through a reboot.
Whatever the root cause, we are now back to a well-behaved file system.

thanks
sean

Roch wrote:
> 15 minutes to do a fdsync is way outside the slowdown usually seen.
> The footprint for 6413510 is that when a huge amount of data is being
> written non-synchronously and a fsync comes in for the same
> filesystem, then all the non-synchronous data is also forced out
> synchronously. So is there a lot of data being written during the vi?
>
> vi will write the whole file (in 4K chunks) and fsync it (based on a
> single experiment). So for a large-file vi, on quit, we have lots of
> data to sync in and of itself. But because of 6413510 we potentially
> have to sync lots of other data written by other applications.
>
> Now take a Niagara with lots of available CPUs and lots of free memory
> (32GB maybe?) running some 'tar x' in parallel. A huge chunk of the
> 32GB can end up as dirty. I say too much so because of lack of
> throttling:
>
>   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205
>   6429205 each zpool needs to monitor its throughput and throttle
>   heavy writers
>
> Then vi :q; fsyncs; and all of the pending data must sync. So we have
> extra data to sync because of:
>
>   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6413510
>   6413510 zfs: writing to ZFS filesystem slows down fsync() on other
>   files in the same FS
>
> Furthermore, we can be slowed by this:
>
>   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6440499
>   6440499 zil should avoid txg_wait_synced() and use dmu_sync() to
>   issue parallel IOs...
>
> Note: 6440499 is now fixed in the gate.
>
> And finally all this data goes to a single disk. Worse, a slice of a
> disk. Since it's just a slice, ZFS can't enable the write cache. Then
> if there is no tag queue (is there?) we will handle everything one I/O
> at a time. If it's a SATA drive we have other issues... I think we've
> hit it all here.
>
> So can this lead to a 15 min fsync? I can't swear to it - actually I
> won't be convinced myself before I convince you - but we do have
> things to chew on already.
>
> Do I recall that this is about a 1GB file in vi? :wq'ing out of a 1 GB
> vi session on a 50MB/sec disk will take 20sec when everything hums and
> there is no other traffic involved. With no write cache / no tag
> queue, maybe 10X more.
>
> -r

-- Sean Meighan, Mgr ITSM Engineering, Sun Microsystems, Inc.
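A quick check of Roch's arithmetic, using only the figures from his
message:

  # 1 GB streamed at 50 MB/sec:
  awk 'BEGIN { printf("%.0f seconds\n", 1024 / 50) }'   # ~20 seconds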
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Well this does look more and more like a duplicate of:

  6413510 zfs: writing to ZFS filesystem slows down fsync() on other
  files in the same FS

Neil
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Sean Meighan writes:
> The vi we were doing was on a 2-line file. If you just vi a new file,
> add one line, and exit, it would take 15 minutes in fdsynch. On the
> recommendation of a workaround we set
>   set zfs:zil_disable=1
> and after the reboot the fdsynch is now 0.1 seconds. Now I have no
> idea if it was this setting or the fact that we went through a reboot.
> Whatever the root cause, we are now back to a well-behaved file system.

Well behaved... in appearance only! Maybe it's nice to validate a
hypothesis, but you should not run with this option set, ever. It
disables O_DSYNC and fsync() and I don't know what else. Bad idea, bad.

-r
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Roch wrote:
> Sean Meighan writes:
>> The vi we were doing was on a 2-line file. If you just vi a new file,
>> add one line, and exit, it would take 15 minutes in fdsynch. On the
>> recommendation of a workaround we set
>>   set zfs:zil_disable=1
>> and after the reboot the fdsynch is now 0.1 seconds. Now I have no
>> idea if it was this setting or the fact that we went through a
>> reboot. Whatever the root cause, we are now back to a well-behaved
>> file system.
> Well behaved... in appearance only! Maybe it's nice to validate a
> hypothesis, but you should not run with this option set, ever. It
> disables O_DSYNC and fsync() and I don't know what else. Bad idea, bad.

Why is this option available then? (Yes, that's a loaded question.)
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Torrey McMahon wrote On 06/21/06 10:29:
> Roch wrote:
>> Sean Meighan writes:
>>> The vi we were doing was on a 2-line file. If you just vi a new
>>> file, add one line, and exit, it would take 15 minutes in fdsynch.
>>> On the recommendation of a workaround we set
>>>   set zfs:zil_disable=1
>>> and after the reboot the fdsynch is now 0.1 seconds. Now I have no
>>> idea if it was this setting or the fact that we went through a
>>> reboot. Whatever the root cause, we are now back to a well-behaved
>>> file system.
>> Well behaved... in appearance only! Maybe it's nice to validate a
>> hypothesis, but you should not run with this option set, ever. It
>> disables O_DSYNC and fsync() and I don't know what else. Bad idea,
>> bad.
> Why is this option available then? (Yes, that's a loaded question.)

I wouldn't call it an option, but an internal debugging switch that I
originally added to allow progress when initially integrating the ZIL.
As Roch says, it really shouldn't ever be set (as it does negate POSIX
synchronous semantics). Nor should it be mentioned to a customer. In
fact I'm inclined to now remove it - however, it does still have a use,
as it helped root-cause this problem.

Neil
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Wed, Jun 21, 2006 at 10:41:50AM -0600, Neil Perrin wrote:
>> Why is this option available then? (Yes, that's a loaded question.)
> I wouldn't call it an option, but an internal debugging switch that I
> originally added to allow progress when initially integrating the ZIL.
> As Roch says, it really shouldn't ever be set (as it does negate POSIX
> synchronous semantics). Nor should it be mentioned to a customer. In
> fact I'm inclined to now remove it - however, it does still have a
> use, as it helped root-cause this problem.

Rename it to zil_disable_danger_will_robinson :)
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Nicolas Williams wrote:
> On Wed, Jun 21, 2006 at 10:41:50AM -0600, Neil Perrin wrote:
>>> Why is this option available then? (Yes, that's a loaded question.)
>> I wouldn't call it an option, but an internal debugging switch that I
>> originally added to allow progress when initially integrating the
>> ZIL. As Roch says, it really shouldn't ever be set (as it does negate
>> POSIX synchronous semantics). Nor should it be mentioned to a
>> customer. In fact I'm inclined to now remove it - however, it does
>> still have a use, as it helped root-cause this problem.
> Rename it to zil_disable_danger_will_robinson

The sad truth is that debugging bits tend to survive into production,
and then we get escalations that go something like, "I set this variable
in /etc/system and now I'm {getting data corruption, weird behavior, an
odd rash, ...}" The fewer the better, imho. If it can be removed, great.
If not, then maybe something for the tunables guide.
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Robert Milkowski wrote On 06/21/06 11:09:
> Hello Neil,
>
>> Why is this option available then? (Yes, that's a loaded question.)
> NP> I wouldn't call it an option, but an internal debugging switch
> NP> that I originally added to allow progress when initially
> NP> integrating the ZIL. As Roch says, it really shouldn't ever be set
> NP> (as it does negate POSIX synchronous semantics). Nor should it be
> NP> mentioned to a customer. In fact I'm inclined to now remove it -
> NP> however, it does still have a use, as it helped root-cause this
> NP> problem.
>
> Isn't it similar to the unsupported fastfs for ufs?

It is similar in the sense that it speeds up the file system. Using
fastfs can be much more dangerous though, as it can lead to a badly
corrupted file system: writing of metadata is delayed and written out of
order. Whereas disabling the ZIL does not affect the integrity of the
fs. The transaction group model of ZFS gives consistency in the event of
a crash/power failure. However, any data that was promised to be on
stable storage may not be unless the transaction group committed (an
operation that is started every 5s).

We once had plans to add a mount option to allow the admin to control
the ZIL. Here's a brief section of the RFE (6280630):

  sync={deferred,standard,forced}

  Controls synchronous semantics for the dataset.

  When set to 'standard' (the default), synchronous operations such as
  fsync(3C) behave precisely as defined in fcntl.h(3HEAD).

  When set to 'deferred', requests for synchronous semantics are
  ignored. However, ZFS still guarantees that ordering is preserved --
  that is, consecutive operations reach stable storage in order. (If a
  thread performs operation A followed by operation B, then the moment
  that B reaches stable storage, A is guaranteed to be on stable
  storage as well.) ZFS also guarantees that all operations will be
  scheduled for write to stable storage within a few seconds, so that
  an unexpected power loss only takes the last few seconds of change
  with it.

  When set to 'forced', all operations become synchronous. No operation
  will return until all previous operations have been committed to
  stable storage. This option can be useful if an application is found
  to depend on synchronous semantics without actually requesting them;
  otherwise, it will just make everything slow, and is not recommended.

Of course we would need to stress the dangers of setting 'deferred'.
What do you guys think?

Neil.
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?

I can think of a use case for deferred: improving the efficiency of a
large mega-transaction/batch job such as a nightly build.

You create an initially empty or cloned dedicated filesystem for the
build, and start it off, and won't look inside until it completes. If
the build machine crashes in the middle of the build, you're going to
nuke it all and start over, because that's lower risk than assuming you
can pick up where it left off.

Now, it happens that a bunch of tools used during a build invoke fsync.
But in the context of a full nightly build that effort is wasted. All
you need is one big "sync everything" at the very end, either by using a
command like sync or lockfs -f, or as a side effect of reverting from
sync=deferred to sync=standard.

- Bill
RE: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Did I miss something on this thread? Was the root cause of the 15-minute
fsync ever actually determined?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of eric kustarz
Sent: Wednesday, June 21, 2006 2:12 PM
To: [EMAIL PROTECTED]
Cc: zfs-discuss@opensolaris.org; Torrey McMahon
Subject: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved

Neil Perrin wrote:
> [the sync={deferred,standard,forced} proposal from RFE 6280630, quoted
> in full earlier in this thread, ending with:]
> Of course we would need to stress the dangers of setting 'deferred'.
> What do you guys think?
> Neil.

Scares me, and it seems we should wait until people are demanding it and
we *have* to do it (if that time ever comes) - that is, until we can't
squeeze any more performance gain out of the 'standard' method.

If problems do occur because of 'deferred' mode, once I wrap up zpool
history, we'll have the fact that they set this logged to disk.

eric
Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
Bill Sommerfeld wrote:
> On Wed, 2006-06-21 at 14:15, Neil Perrin wrote:
>> Of course we would need to stress the dangers of setting 'deferred'.
>> What do you guys think?
> I can think of a use case for deferred: improving the efficiency of a
> large mega-transaction/batch job such as a nightly build.
>
> You create an initially empty or cloned dedicated filesystem for the
> build, and start it off, and won't look inside until it completes. If
> the build machine crashes in the middle of the build, you're going to
> nuke it all and start over, because that's lower risk than assuming
> you can pick up where it left off.
>
> Now, it happens that a bunch of tools used during a build invoke
> fsync. But in the context of a full nightly build that effort is
> wasted. All you need is one big "sync everything" at the very end,
> either by using a command like sync or lockfs -f, or as a side effect
> of reverting from sync=deferred to sync=standard.

Can I give support for this use case? Or does it take someone like
Casper Dik with 'fastfs' to come along later and provide a utility that
lets people make the filesystem do what they want it to? [Still annoyed
that it took me so long to find out about fastfs - hell, the Solaris 8
or 9 OS installation process was using the same IOCTL as fastfs uses,
but for some reason end users still have to find fastfs out on the Net
somewhere instead of getting it with the OS.]

If the ZFS docs state why it's not for general use, then what's to
separate this from the zillion other ways that a cavalier sysadmin can
bork their data (or indeed their whole machine)? Otherwise, why even let
people create a striped zpool vdev without redundancy - it's just an
accident waiting to happen, right? We must save people from themselves!
Think of the children! ;-)

-Jason =:^/

-- [EMAIL PROTECTED]
   ANU Supercomputer Facility, APAC Grid Program
   Leonard Huxley Bldg 56, Mills Road
   Australian National University, Canberra, ACT, 0200, Australia