Re: [zfs-discuss] ZFS write throttling
On 2/15/08, Roch Bourbonnais [EMAIL PROTECTED] wrote:

On 15 Feb 08, at 11:38, Philip Beevers wrote: [...] Obviously this isn't good behaviour, but it's particularly unfortunate given that this checkpoint is stuff that I don't want to retain in any kind of cache anyway - in fact, preferably I wouldn't pollute the ARC with it in the first place. But it seems directio(3C) doesn't work with ZFS (unsurprisingly, as I guess this is implemented in segmap), and madvise(..., MADV_DONTNEED) doesn't drop data from the ARC (again, I guess, as it's working on segmap/segvn).

Of course, limiting the ARC size to something fairly small makes it behave much better. But this isn't really the answer. I also tried using O_DSYNC, which stops the pathological behaviour but makes things pretty slow - I only get a maximum of about 20MBytes/sec, which is obviously much less than the hardware can sustain.

It sounds like we could do with different write throttling behaviour to head this sort of thing off. Of course, the ideal would be to have some way of telling ZFS not to bother keeping pages in the ARC. The latter appears to be bug 6429855. But the underlying behaviour doesn't really seem desirable; are there plans afoot to do any work on ZFS write throttling to address this kind of thing?

Throttling is being addressed: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6429205 BTW, the new code will adjust write speed to disk speed very quickly. You will not see those ultra-fast initial checkpoints. Is this a concern?

I'll wait for more details on how you address this. Maybe a blog? Like this one: http://blogs.technet.com/markrussinovich/archive/2008/02/04/2826167.aspx ("Inside Vista SP1 File Copy Improvements"):

One of the biggest problems with the engine's implementation is that for copies involving lots of data, the Cache Manager write-behind thread on the target system often can't keep up with the rate at which data is written and cached in memory. That causes the data to fill up memory, possibly forcing other useful code and data out, and eventually, the target system's memory to become a tunnel through which all the copied data flows at a rate limited by the disk.

Sounds familiar? ;-)

Tao
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
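The O_DSYNC workaround Philip tried can be sketched briefly. This is a minimal, hypothetical Python illustration (not the original checkpoint code; the path and sizes are made up) of why synchronous-write semantics throttle the writer to disk speed: each write() returns only after the data reaches stable storage, so the writer can never get far ahead of the disk and flood the ARC.

```python
import os

def checkpoint_sync(path, chunks):
    """Write chunks with synchronous-write semantics (O_DSYNC).

    Each os.write() blocks until the data is on stable storage, so the
    writer runs at disk speed instead of filling the cache in one large
    burst and stalling later when the cache must be flushed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC,
                 0o644)
    try:
        for chunk in chunks:
            os.write(fd, chunk)
    finally:
        os.close(fd)

# Hypothetical checkpoint: four 8 KB chunks written synchronously.
checkpoint_sync("/tmp/ckpt.dat", [b"x" * 8192 for _ in range(4)])
print(os.path.getsize("/tmp/ckpt.dat"))
```

This trades the pathological burst-then-stall pattern for steady but lower throughput, which matches the ~20 MBytes/sec Philip observed.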
[zfs-discuss] Apple Time Machine
I am reading the live coverage of the WWDC keynote here: http://www.macrumorslive.com/web/

They talked about a new feature in OS X/Leopard: Time Machine. Does it sound like instant snapshot and rollback to you? I don't know how else this can be implemented.

10:37 am - with time machine, you can get those files back by entering a date or time
10:35 am - ever had time where you work on a doc and you do a save as and overwrote the wrong one?
10:35 am - coolest part - and reason we call it that - whole new way of backing up files
10:35 am - backup to HD, or server
10:35 am - can restore everything, or just one file at a time
10:34 am - can be right where you were when the HD drive
10:34 am - automatically backs up mac - you change a file, it automatically backs up photos, music, documents, files, folders, everything - then you can restore everything
10:34 am - plan to change all of that - Time Machine
10:33 am - how many use automated software to stay always backed up? only 4%

Tao
Re: [zfs-discuss] Apple Time Machine
On 8/7/06, Tim Foster [EMAIL PROTECTED] wrote:

David Magda wrote: Well, they've ported DTrace: ...now built into Mac OS X Leopard. Xray. Because it's 2006.

Uh, right, and they're actually shipping it in 2007. Apple marketing. Anyone want to start printing t-shirts: "DTrace, Time Machine in OpenSolaris. Because we had it in 2005."

Looks like Time Machine is implemented using HFS+: "To make Time Machine work, Mac users will need to use a separate HFS+ compatible non-bootable hard drive," Croll said. http://news.com.com/New+Apple+feature+sends+users+back+in+time/2100-1046_3-6103007.html?tag=nefd.top

As Eric said earlier, it's a standard backup, incremental after the first one, a versioning system of some sort.
Re: [zfs-discuss] Apple Time Machine
On 8/7/06, Eric Schrock [EMAIL PROTECTED] wrote:

On Mon, Aug 07, 2006 at 01:19:14PM -1000, David J. Orman wrote: (actually did they give OpenSolaris a name check at all when they mentioned DTrace?) Nope, not that I can see. Apple's pretty notorious for that kind of oversight. I used to work for them; I know first-hand how hat-tipping doesn't occur very often.

Before this progresses much further, it's worth noting that all of team DTrace is at WWDC, has met with Apple engineers previously, and will be involved in one or more presentations today. So while the marketing department may not include OpenSolaris in the high-level overview, Apple is not ignoring the roots of DTrace, and will not be hiding this fact from their developers (not that they could). - Eric

Cool. Let's see how it works out in the long run: will we (the OpenSolaris community) get anything back from Apple and its community, how well does CDDL work in the real world, etc. (we all know what happened to Darwin/BSD). In terms of openness, Sun and Apple are going in opposite directions IMHO; interesting situation :)

Tao
Re: [zfs-discuss] Apple Time Machine
On 8/7/06, Robert Gordon [EMAIL PROTECTED] wrote:

On Aug 7, 2006, at 7:17 PM, Tao Chen wrote: In terms of openness, Sun and Apple are going in opposite directions IMHO, interesting situation :) Tao

Apple just released the Darwin kernel code xnu-792-10.96, the equivalent of 10.4.7 for Intel machines. -- Robert.

You're right, I just saw the announcement: http://lists.apple.com/archives/Darwin-dev/2006/Aug/msg00067.html

A good move.
Re: [zfs-discuss] Apple Time Machine
On 8/7/06, Bryan Cantrill [EMAIL PROTECTED] wrote:

We've had a great relationship with Apple at the engineering level -- and indeed, Team DTrace just got back from dinner with the Apple engineers involved with the port. More details here: http://blogs.sun.com/roller/page/bmc?entry=dtrace_on_mac_os_x

Your blog should be digged/slashdotted/osnews'ed/whatever'ed :) On a different note (sorry, this is already off-topic for zfs-discuss), your previous blog happens to be "DTrace on FreeBSD, update" - are these efforts shared at all between OS X and FreeBSD?

Tao
Re: [zfs-discuss] ZFS benchmarks w/8 disk raid - Quirky results, any thoughts?
On 7/17/06, Jonathan Wheeler [EMAIL PROTECTED] wrote:

Hi All, I've just built an 8 disk zfs storage box, and I'm in the testing phase before I put it into production. I've run into some unusual results, and I was hoping the community could offer some suggestions. I've basically made the switch to Solaris on the promises of ZFS alone (yes I'm that excited about it!), so naturally I'm looking forward to some great performance - but it appears I'm going to need some help finding all of it.

One major concern Jonathan has is the 7-raidz write performance. (I see no big surprise in the 'read' results.) The really interesting numbers happen at 7 disks - it's slower than with 4, in all tests. I randomly picked 3 results from his several runs:

               -Per Char-   --Block---   -Rewrite--
          MB   K/sec %CPU   K/sec  %CPU  K/sec %CPU
4-disk  8196   57965 67.9   123268 27.6  78712 17.1
7-disk  8196   49454 57.1    92149 20.1  73013 16.0
8-disk  8196   61345 70.7   139259 28.5  89545 20.8

I looked at the corresponding dtrace data for the 7- and 8-raidz cases. (Should have also asked for 4-raidz data. Jonathan, you can still send 4-raidz data to me offline.)

In 7-raidz, each disk had writes in two sizes, 214-block or 85-block, equally:

DEVICE  BLKs  COUNT
sd1       85  27855
         214  27882
sd2       85  27854
         214  27868
sd3       85  27849
         214  27884
...

In 8-raidz, sd1,3,5,7 had either 220- or 221-block writes, equally; sd2,4,6,8 had 100% 146-block writes:

DEVICE  BLKs  COUNT
sd1      220  16325
         221  16338
sd2      146  49001
sd3      220  16335
         221  16333
sd4      146  49005
sd5      220  16340
         221  16324
sd6      146  49001
sd7      220  16332
         221  16333
sd8      146  49009

In terms of average write response time, in 7-raidz:

DEVICE  WRITE  AVG.ms
------  -----  ------
sd1     63990   54.03
sd2     64000   53.65
sd3     63898   55.48
sd4     64190   54.14
sd5     64091   54.81
sd6     63967   57.83
sd7     64092   54.19

in 8-raidz:

DEVICE  WRITE  AVG.ms
------  -----  ------
sd1     42276    6.64
sd2     58467   19.66
sd3     42287    6.24
sd4     55198   20.01
sd5     42285    6.64
sd6     58409   22.90
sd7     42235    6.88
sd8     54967   24.46

At the bdev level, 8-raidz shows much better turnaround time than 7-raidz, while disks 1,3,5,7 (larger writes) are better than 2,4,6,8 (smaller writes). So 8-raidz wins by larger writes and much better response time for each write, but why these two differences? And why the disparity between odd- and even-numbered disks within 8-raidz?

Tao
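The per-device summaries above are simple aggregations over the raw trace records. As a sketch of that reduction, here is a small Python helper (the records are made-up sample data, not Jonathan's actual trace) that turns (device, response_ms) write samples into the count/average table shape shown above:

```python
from collections import defaultdict

def avg_response(records):
    """Aggregate (device, response_ms) samples into per-device
    (write_count, average_ms) pairs, as in the AVG.ms tables."""
    total = defaultdict(float)
    count = defaultdict(int)
    for dev, ms in records:
        total[dev] += ms
        count[dev] += 1
    return {dev: (count[dev], total[dev] / count[dev]) for dev in count}

# Synthetic sample: two fast writes on sd1, two slow writes on sd2.
records = [("sd1", 6.0), ("sd1", 7.0), ("sd2", 20.0), ("sd2", 19.0)]
for dev, (n, avg) in sorted(avg_response(records).items()):
    print(f"{dev}  WRITE {n}  AVG.ms {avg:.2f}")
```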
Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
I should copy this to the list.

---------- Forwarded message ----------
On 6/23/06, Joe Little [EMAIL PROTECTED] wrote:

I can post back to Roch what this latency is. I think the latency is a constant regardless of the zil or not. All that I do by disabling the zil is that I'm able to submit larger chunks at a time (faster) than doing 1k or worse blocks 3 times per file (the NFS fsync penalty).

Please send the script (I attached a modified version) along with the result. They need to see how it works to trust (or dispute) the result. Rule #1 in performance tuning is do not trust the report from an unproven tool :) I have some comments on the output below.

This is for a bit longer (16 trees of 6250 8k files, again with zil disabled):

Generating report from biorpt.sh.rec ...

=== Top 5 I/O types ===

DEVICE  T  BLKs  COUNT
------  -  ----  -----
sd2     W   256   3095
sd1     W   256   2843
sd1     W     2    201
sd2     W     2    197
sd1     W    32    185

This part tells me the majority of I/Os are 128KB writes on sd2 and sd1.

=== Top 5 worst I/O response time ===

DEVICE  T  BLKs     OFFSET  TIMESTAMP  TIME.ms
------  -  ----  ---------  ---------  -------
sd2     W   175  529070671  85.933843  3559.55
sd1     W   256  521097680  47.561918  3097.21
sd1     W   256  521151968  54.944253  3090.42
sd1     W   256  521152224  54.944207  3090.23
sd1     W    64  521152480  54.944241  3090.21

The longest response times are more than 3 seconds, ouch.

=== Top 5 Devices with largest number of I/Os ===

DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
------  ----  ------  --  -----  ------  ---  ----  ----
sd1        6    0.34   0   4948  387.88  413  4954    0%
sd2        6    0.25   0   4230  387.07  405  4236    0%
cmdk0     23    8.11   0    152    0.84    0   175   10%

Average response time of 300ms is bad. I calculate the SEEK rate on a 512-byte block basis; since I/Os are mostly 128K, the seek rate is less than 1% (0), in other words I consider this as mostly sequential I/O. I guess it's debatable whether a 512-byte-based calculation is meaningful.

=== Top 5 Devices with largest amount of data transfer ===

DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB  Tol.MB  MB/s
------  ----  ------  --  -----  ------  ---  ------  ----
sd1        6    0.34   0   4948  387.88  413     413     4
sd2        6    0.25   0   4230  387.07  405     405     4
cmdk0     23    8.11   0    152    0.84    0       0     0

=== Report saved in biorpt.sh.rec.rpt ===

I calculate the MB/s on a per-second basis, meaning as long as there's at least one finished I/O on the device in a second, that second is used in calculating throughput.

Tao

biorpt.sh Description: Bourne shell script
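The per-second throughput rule described above can be sketched as follows. This is a hypothetical Python rendering of the rule (biorpt.sh itself is a Bourne shell/DTrace script; the sample records here are invented): a second counts toward the divisor only if at least one I/O completed in it.

```python
from collections import defaultdict

def mb_per_active_second(records):
    """Compute MB/s over 'active' seconds only: a second is counted
    in the divisor only if at least one I/O finished during it."""
    per_sec = defaultdict(int)
    for ts, nbytes in records:
        per_sec[int(ts)] += nbytes       # bucket completions by second
    active = len(per_sec)                # seconds with >= 1 finished I/O
    total_mb = sum(per_sec.values()) / (1024 * 1024)
    return total_mb / active if active else 0.0

# 4 MB completed across two active seconds (10 and 12); second 11 is
# idle and therefore excluded from the divisor.
recs = [(10.1, 2 << 20), (10.7, 1 << 20), (12.3, 1 << 20)]
print(mb_per_active_second(recs))  # -> 2.0
```

Note the design choice this encodes: idle gaps don't drag the reported throughput down, so the figure reflects how fast the device moved data while it was actually busy.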
Re: Fwd: Re: [zfs-discuss] 15 minute fdsync problem and ZFS: Solved
On 6/23/06, Richard Elling [EMAIL PROTECTED] wrote: comment on analysis below...

Tao Chen wrote:

=== Top 5 Devices with largest number of I/Os ===

DEVICE  READ  AVG.ms  MB  WRITE  AVG.ms   MB   IOs  SEEK
------  ----  ------  --  -----  ------  ---  ----  ----
sd1        6    0.34   0   4948  387.88  413  4954    0%
sd2        6    0.25   0   4230  387.07  405  4236    0%
cmdk0     23    8.11   0    152    0.84    0   175   10%

Average response time of 300ms is bad.

Average is totally useless with this sort of a distribution. I'd suggest using a statistical package to explore the distribution. Just a few 3-second latencies will skew the average quite a lot. -- richard

A summary report is nothing more than an indication of issues, or non-issues. So I agree that an average is just that, an average. However, a few 3-second latencies will not spoil the result too much when there are more than 4000 I/Os sampled.

The script saves the raw data in a .rec file, so you can run whatever statistics tool you have against it. I am currently more worried about how accurate and useful the raw data is, which is generated from a DTrace command in it. The raw record is in this format:

- Timestamp (sec.microsec)
- DeviceName
- W/R
- BLK_NO (offset)
- BLK_CNT (I/O size)
- IO_Time (I/O elapsed time, msec.xx)

Tao
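Both points in the exchange above can be checked numerically. A small sketch with synthetic latencies (not taken from the actual .rec file): with ~4000 samples, a handful of 3-second outliers shifts the mean only modestly, yet a percentile view of the distribution, as Richard suggests, exposes them immediately.

```python
def mean(xs):
    return sum(xs) / len(xs)

def percentile(xs, p):
    """Nearest-rank percentile of a sample (simple, no interpolation)."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100.0 * len(xs)))]

# 4000 "normal" 20 ms write I/Os plus 5 three-second outliers.
lat = [20.0] * 4000 + [3000.0] * 5

print(round(mean(lat), 1))                        # mean barely moves
print(percentile(lat, 50), percentile(lat, 99.9)) # tail tells the story
```

The mean lands near 23.7 ms (versus 20 ms without outliers), while the 99.9th percentile sits at 3000 ms: the average survives the outliers, but only the distribution shows the stalls.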
Re: [zfs-discuss] ZFS and databases
On 5/12/06, Roch Bourbonnais - Performance Engineering [EMAIL PROTECTED] wrote:

From: Gregory Shaw [EMAIL PROTECTED]: Regarding directio and quickio, is there a way with ZFS to skip the system buffer cache? I've seen big benefits for using directio when the data files have been segregated from the log files.

Were the benefits coming from extra concurrency (no single writer lock)?

Does DIO bypass the writer lock on Solaris? Not on AIX, which uses CIO (concurrent I/O) to bypass managing locks at the filesystem level: http://oracle.ittoolbox.com/white-papers/improving-database-performance-with-aix-concurrent-io-2582

or avoiding the extra copy to the page cache?

Certainly. Also to avoid VM overhead (a DB does like raw devices).

or from too much readahead that is not used before pages need to be recycled?

Not sure what you mean (avoid unnecessary readahead?).

ZFS already has the concurrency.

Interesting, I would like to find out more on this.

The page cache copy is really rather cheap

VM as a whole is certainly not cheap.

and I assert somewhat necessary to insure data integrity.

Not following you.

The extra readahead is somewhat of a bug in UFS (read 2 pages, get a maxcontig chunk (1MB)).

Ouch.

ZFS is new; conventional wisdom may or may not apply.

This (zfs-discuss) is the place where we can be enlightened :-)

Tao
Re: [zfs-discuss] ZFS and databases
On 5/11/06, Peter Rival [EMAIL PROTECTED] wrote:

Richard Elling wrote: Oracle will zero-fill the tablespace with 128kByte iops -- it is not sparse. I've got a scar. Has this changed in the past few years? Multiple parallel tablespace creates is usually a big pain point for filesystem/cache interaction, and also fragmentation once in a while. The latter ZFS should take care of; the former, well, I dunno.

The purpose of a zero-filled tablespace is to prevent fragmentation by future writes, in the case when multiple tablespaces are being updated/filled on the same disk, correct? This becomes pointless on ZFS, since it never overwrites the same pre-allocated block, i.e. the tablespace becomes fragmented in that case no matter what.

Also, in order to write a partial update to a new block, zfs needs the rest of the original block, hence the notion by Roch: partial writes to blocks that are not in cache are much slower than writes to blocks that are. Fortunately, I think a DB almost always does aligned full-block I/O, is that right?

Tao
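The aligned-write point above can be made concrete with a small sketch. This is a hypothetical Python illustration (the 8K block size is an arbitrary example, not a ZFS constant): a write that does not cover whole blocks forces a read-modify-write, because the old block contents must be fetched before the new block image can be written out, while an aligned full-block write never needs the old data.

```python
def needs_read_modify_write(offset, length, block=8192):
    """True if a write is not aligned to whole blocks, meaning the
    existing block contents must be read back (read-modify-write)
    before the updated block can be written to a new location."""
    return offset % block != 0 or length % block != 0

print(needs_read_modify_write(0, 8192))     # full aligned block: no RMW
print(needs_read_modify_write(4096, 8192))  # straddles two blocks: RMW
print(needs_read_modify_write(8192, 4096))  # partial block: RMW
```

This is why a database that issues aligned full-block I/O sidesteps the slow path Roch describes, even when the target blocks are not in cache.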