Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On 01/08/12 18:21, Bob Friesenhahn wrote:
> Something else to be aware of is that even if you don't have a dedicated
> ZIL device, zfs will create a ZIL using devices in the main pool so

Terminology nit: the log device is a SLOG. Every ZFS dataset has a ZIL.
Where the ZIL writes for a given dataset go (SLOG or main pool devices) is
determined by a combination of things including (but not limited to) the
presence of a SLOG device, the logbias property and the size of the data.

--
Darren J Moffat
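To make those knobs concrete, a minimal sketch (pool, device and dataset
names here are invented):

  # attach a separate log (SLOG) device to an existing pool
  zpool add tank log c1t2d0

  # per-dataset ZIL placement hint: latency (default, prefers the SLOG)
  # or throughput (ZIL writes go to the main pool devices instead)
  zfs set logbias=throughput tank/db
  zfs get logbias tank/db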
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
2012-01-08 5:45, Richard Elling wrote:
> I think you will see a tradeoff on the read side of the mixed read/write
> workload. Sync writes have higher priority than reads so the order of I/O
> sent to the disk will appear to be very random and not significantly
> coalesced. This is the pathological worst case workload for a HDD.

I guess this is what I'm trying to combat when thinking about a dedicated
ZIL (SLOG device) in order to reduce the pool's fragmentation.

It is my understanding (which may be wrong and often is) that without a
dedicated SLOG:

1) Sync writes will land on disk randomly into the nearest (to the disk
   heads) available blocks, in order to have them committed ASAP;

2) Coalesced writes (at TXG sync) may have intermixed data and metadata
   blocks, of which the metadata may soon expire due to whatever updates,
   snapshots or deletions involve the blocks this metadata references.
   If this is true, then after a while there will be many available
   cheese-holes from expired metadata among larger data blocks.

3) Now, this might be further complicated (or relieved) if the metadata
   blocks are stored in separate groupings from the bulk user-data, which
   I don't know about yet. In that case it would be easier for ZFS to
   prefetch metadata from disk in one IO (as we discussed in another
   thread), as well as to effectively reuse the small cheese-holes from
   freed older metadata blocks.

---

If any of the above is true, then it is my blind expectation that a
dedicated ZIL/SLOG area would decrease fragmentation, at least due to sync
writes of metadata (and possibly of data) into the nearest HDD locations.
Again, this is based on my possibly wrong understanding that the blocks
committed to a SLOG would be neatly recommitted to the main pool during a
TXG close with coalesced writes.

I do understand the argument that if the SLOG is dedicated from a certain
area on the same HDD, then in fact this would be slowing down the writes
by creating more random IO and extra seeks. But as a trade-off I hope for
more linear, faster reads, including pool import, scrubbing and ZDB walks,
and less fragmented free space.

Is there any truth to these words? ;)

Thanks,
//Jim Klimov
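For reference, the mechanical side of the experiment is a one-liner; a
rough sketch, assuming a slice s3 has already been carved out on the same
disk with format(1M) (device names are invented, untested):

  # dedicate a slice of the same HDD as a log device
  zpool add tank log c0t0d0s3
  zpool status tank

Whether this helps or hurts overall is exactly the question above.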
Re: [zfs-discuss] zfs defragmentation via resilvering?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>
> To put things in proper perspective, with 128K filesystem blocks, the
> worst case file fragmentation as a percentage is 0.39%
> (100*1/((128*1024)/512)). On a Microsoft Windows system, the defragger
> might suggest that defragmentation is not warranted for this percentage
> level.

I don't think that's correct...

Suppose you write a 1G file to disk. It is a database store. Now you start
running your db server. It starts performing transactions all over the
place. It overwrites the middle 4k of the file, and it overwrites 512b
somewhere else, and so on.

Since this is COW, each one of these little writes in the middle of the
file will actually get mapped to unused sectors of disk. Depending on how
quickly they're happening, they may be aggregated as writes... But that's
not going to help the sequential read speed of the file later, when you
stop your db server and try to sequentially copy your file for backup
purposes.

In the pathological worst case, you would write a file that takes up half
of the disk. Then you would snapshot it, and overwrite it in random order,
using the smallest possible block size. Now your disk is 100% full, and if
you read that file, you will be performing worst-case random IO spanning
50% of the total disk space. Granted, this is not a very realistic case,
but it is the worst case, and it's really, really bad for read performance.
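If anyone wants to reproduce the effect on a scratch dataset, the access
pattern is easy to mimic with dd; a rough sketch (paths and offsets are
made up, don't point it at anything you care about):

  # write a ~1G file, then COW-overwrite 4K in the middle of it
  dd if=/dev/urandom of=/tank/scratch/bigfile bs=1024k count=1024
  dd if=/dev/urandom of=/tank/scratch/bigfile bs=4k count=1 \
     seek=131072 conv=notrunc

Repeat the second command with different seek offsets, then time a
sequential read of the file (dd to /dev/null) before and after to see the
fragmentation penalty.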
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> 1) Sync writes will land on disk randomly into nearest (to disk heads)
> available blocks, in order to have them committed ASAP;

This is true - but you need to make the distinction - if you don't have a
dedicated slog, and you haven't disabled the zil, then the sync writes
you're talking about land in dedicated zil sectors of the disk. This is
write-only space; consider it temporary. The only time it will ever be
read is after an ungraceful system reboot, when the system will scan these
sectors to see if anything is there.

As soon as the sync writes are written to the zil, they become async
writes, which are buffered in memory with all the other async writes, and
they will be written *again* into permanent storage in the main pool. At
that point, the previously written copy in the zil becomes irrelevant.

> If any of the above is true, then it is my blind expectation that a
> dedicated ZIL/SLOG area would decrease fragmentation at least due to
> sync writes

Sync writes to the zil aren't causing fragmentation, because they're only
temporary writes as long as they're in sync mode. Then they become async
mode, and they will be aggregated with all the other async writes.

This isn't saying fragmentation doesn't happen. It's just saying there's
no special relationship between sync mode and fragmentation.
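If you do have a separate log device, this two-stage behaviour is easy to
watch with per-vdev statistics; a quick sketch (pool name invented):

  # the log vdev takes the sync bursts; the main vdevs show the
  # aggregated TXG writes a few seconds later
  zpool iostat -v tank 1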
Re: [zfs-discuss] zfs defragmentation via resilvering?
On Jan 9, 2012, at 5:44 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>>
>> To put things in proper perspective, with 128K filesystem blocks, the
>> worst case file fragmentation as a percentage is 0.39%
>> (100*1/((128*1024)/512)). On a Microsoft Windows system, the defragger
>> might suggest that defragmentation is not warranted for this
>> percentage level.
>
> I don't think that's correct...
>
> Suppose you write a 1G file to disk. It is a database store. Now you
> start running your db server. It starts performing transactions all
> over the place. It overwrites the middle 4k of the file, and it
> overwrites 512b somewhere else, and so on.

It depends on the database, but many (eg Oracle database) are COW and
write fixed block sizes, so your example does not apply.

> Since this is COW, each one of these little writes in the middle of the
> file will actually get mapped to unused sectors of disk. Depending on
> how quickly they're happening, they may be aggregated as writes... But
> that's not going to help the sequential read speed of the file, later
> when you stop your db server and try to sequentially copy your file for
> backup purposes.

Those who expect to get sequential performance out of HDDs usually end up
being sad :-(

Interestingly, if you run Oracle database on top of ZFS on top of SSDs,
then you have COW over COW over COW. Now all we need is a bull! :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
Re: [zfs-discuss] zfs read-ahead and L2ARC
On 01/08/12 20:10, Jim Klimov wrote:
> Is it true or false that: ZFS might skip the cache and go to disks for
> streaming reads?

I don't believe this was ever suggested. Instead, if data is not already
in the file system cache and a large read is made from disk, should the
file system put this data into the cache?

BTW, I chose the term streaming to be a subset of sequential, where the
access pattern is sequential but at what appear to be artificial time
intervals. The suggested pre-read of the entire file would be a simple
sequential read done as quickly as the hardware allows.
Re: [zfs-discuss] zfs read-ahead and L2ARC
Thanks for the replies, some more questions follow. Your answers below
seem to contradict each other somewhat. Is it true that:

1) The VDEV cache before b70 used to contain a full copy of prefetched
   disk contents,
2) The VDEV cache since b70 analyzes the prefetched sectors and only
   keeps metadata blocks,
3) The VDEV cache since b148 is disabled by default?

So in fact currently we only have file-level intelligent prefetching?

On my older systems I ran "kstat -p zfs:0:vdev_cache_stats" and saw
hit/miss ratios ranging from 30% to 70%. On the oi_148a box I do indeed
see all-zeros.

While I do understand the implications of VDEV-caching lots of disks on
systems with inadequate RAM, I tend to find this feature useful on smaller
systems - like home-NASes. It is essentially free in terms of mechanical
seeks, as well as in RAM (what is 60-100Mb for a small box at home?), and
any nonzero hit ratio that speeds up the system seems justifiable ;)

I've tried playing with the options on my oi_148a LiveUSB repair boot, and
got varying results: the VDEV cache is indeed disabled by default, but can
be enabled. My system is scrubbing now, so it's got a few cache hits
(about 10%) right away.

root@openindiana:~# echo zfs_vdev_cache_size/W0t1000 | mdb -kw
zfs_vdev_cache_size: 0 = 0x989680
root@openindiana:~# kstat -p zfs:0:vdev_cache_stats
zfs:0:vdev_cache_stats:class        misc
zfs:0:vdev_cache_stats:crtime       65.042318652
zfs:0:vdev_cache_stats:delegations  72
zfs:0:vdev_cache_stats:hits         11
zfs:0:vdev_cache_stats:misses       158
zfs:0:vdev_cache_stats:snaptime     114232.782154249

However, trying to increase the prefetch size hung my system almost
immediately (in a couple of seconds). I'm away from it now, so I'll ask
for a photo of the console screen :)

root@openindiana:~# echo zfs_vdev_cache_max/W0t16384 | mdb -kw
zfs_vdev_cache_max:    0x4000 = 0x4000
root@openindiana:~# echo zfs_vdev_cache_bshift/W0t20 | mdb -kw
zfs_vdev_cache_bshift: 0x10 = 0x14

So there are deeper questions:

1) As of illumos bug #175 (as well as OpenSolaris b148 and, if known,
   Solaris 11), is the vdev prefetch feature *removed* from the codebase
   (not as of oi_148a, but what about others?), or disabled by default
   (i.e. the limit is preset to 0, tune it yourself)?
2) If it is only disabled, are there solid plans to remove it, or can we
   vote to keep it for those interested? :)
3) If the feature is present and gets enabled, how would VDEV prefetch
   play along with file prefetch, again? ;)
4) Is there some tunable (after b70) to enable prefetching and keeping of
   user-data as well (not only metadata)? Perhaps only so that I could
   test it with my use-patterns to make sure that caching generic sectors
   is useless for me, and I really should revert to caching only metadata?
5) Would it make sense to increase zfs_vdev_cache_bshift? For example,
   when I tried to set it to 20 and prefetch a whole 1MB of data, why
   would that cause the system to die? Can it increase cache hit ratios
   (if it works)?
6) Does the VDEV cache keep ZFS blocks or disk sectors? For example, on
   my 4k disks the blocks are 4k, even though there are a few hundred
   bytes worth of data in metadata blocks and 3+KB of slack space.
7) Modern HDDs often have 32-64Mb of DRAM cache onboard. Is there any
   reason to match the VDEV cache size with that in any way (1:1, 2:1,
   etc.)?

Thanks again,
//Jim Klimov

2012-01-09 6:06, Richard Elling wrote:
> On Jan 8, 2012, at 5:10 PM, Jim Klimov wrote:
>> 2012-01-09 4:14, Richard Elling wrote:
>>> On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
>>>> I wonder if it is possible (currently or in the future as an RFE) to
>>>> tell ZFS to automatically read-ahead some files and cache them in
>>>> RAM and/or L2ARC?
>>>
>>> See discussions on the ZFS intelligent prefetch algorithm. I think
>>> Ben Rockwood's description is the best general description:
>>> http://www.cuddletech.com/blog/pivot/entry.php?id=1040
>>> And a more engineer-focused description is at:
>>> http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
>>> -- richard
>>
>> Thanks for the pointers. While I've seen those articles (in fact, one
>> of the two non-spam comments in Ben's blog was mine), rehashing the
>> basics is always useful ;)
>>
>> Still, how does VDEV prefetch play along with File-level Prefetch?
>
> Trick question… it doesn't. vdev prefetching is disabled in opensolaris
> b148, illumos, and Solaris 11 releases. The benefits of having the vdev
> cache for large numbers of disks does not appear to justify the cost.
> See
> http://wesunsolve.net/bugid/id/6684116
> https://www.illumos.org/issues/175
>
>> For example, if ZFS prefetched 64K from disk at the SPA level, and
>> those sectors luckily happen to contain next blocks of a
>> streaming-read file, would the file-level prefetch take the data from
>> RAM cache or still request them from the disk?
>
> As of b70, vdev_cache only contains metadata. See
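A side note for anyone following along: changes made with mdb -kw do not
survive a reboot; the usual way to make such tunables persistent is
/etc/system. A rough sketch, assuming these tunables still exist in your
build (the values are purely illustrative, not recommendations):

  * /etc/system: re-enable and size the vdev cache (illustrative values)
  set zfs:zfs_vdev_cache_size=10485760
  set zfs:zfs_vdev_cache_max=16384
  set zfs:zfs_vdev_cache_bshift=16

A reboot is required for /etc/system settings to take effect.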
Re: [zfs-discuss] zfs read-ahead and L2ARC
On 01/08/12 10:15, John Martin wrote:
> I believe Joerg Moellenkamp published a discussion several years ago on
> how the L1ARC attempts to deal with the pollution of the cache by large
> streaming reads, but I don't have a bookmark handy (nor the knowledge
> of whether the behavior is still accurate).

http://www.c0t0d0s0.org/archives/5329-Some-insight-into-the-read-cache-of-ZFS-or-The-ARC.html
Re: [zfs-discuss] zfs read-ahead and L2ARC
2012-01-09 18:15, John Martin wrote:
> On 01/08/12 20:10, Jim Klimov wrote:
>> Is it true or false that: ZFS might skip the cache and go to disks for
>> streaming reads?

(The more I think about it, the more senseless this sentence seems, and I
might have just confused it with ZIL writes of bulk data.)

> I don't believe this was ever suggested. Instead, if data is not already
> in the file system cache and a large read is made from disk, should the
> file system put this data into the cache?

Hmmm... perhaps THIS is what I could have mistaken it for... Thus the
correct version of the question goes like this: is it true or false that
some large reads from disk can be deemed by ZFS as too big and rare to
cache in ARC? If yes, what conditions are checked to mark a read as such?
Can this behavior be disabled in order to try and cache every read
(further subject to normal eviction due to MRU/MFU/memory pressure and
other considerations)?

Thanks again,
//Jim Klimov
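A related, if cruder, knob is the per-dataset cache policy; it controls
whether data is eligible for the ARC/L2ARC at all, rather than the
size-based heuristics asked about above. A quick sketch (dataset name
invented):

  # cache data and metadata (the default), metadata only, or nothing
  zfs get primarycache,secondarycache tank/media
  zfs set primarycache=all tank/media
  zfs set secondarycache=metadata tank/media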
Re: [zfs-discuss] zfs defragmentation via resilvering?
On Mon, 9 Jan 2012, Edward Ned Harvey wrote:
> I don't think that's correct...

But it is! :-)

> Suppose you write a 1G file to disk. It is a database store. Now you
> start running your db server. It starts performing transactions all
> over the place. It overwrites the middle 4k of the file, and it
> overwrites 512b somewhere else, and so on.
>
> Since this is COW, each one of these little writes in the middle of the
> file will actually get mapped to unused sectors of disk. Depending on
> how quickly they're happening, they may be aggregated

Oops. I see an error in the above. Other than tail blocks, or due to
compression, zfs will not write a COW data block smaller than the zfs
filesystem blocksize. If the blocksize is 128K, then updating just one
byte in that 128K block results in writing a whole new 128K block. This is
pretty significant write-amplification, but the resulting fragmentation is
still limited by the 128K block size. Remember that any fragmentation
calculation needs to be based on the disk's minimum read (i.e. sector)
size.

However, it is worth remembering that it is common to set the block size
to a much smaller value than the default (e.g. 8K) if the filesystem is
going to support a database. In that case it is possible for there to be
fragmentation for every 8K of data. The worst-case fragmentation
percentage for 8K blocks (and 512-byte sectors) is 6.25%
(100*1/((8*1024)/512)). That would be a high enough percentage that the
Microsoft Windows defragger would recommend defragging the disk.

Metadata chunks cannot be any smaller than the disk's sector size (e.g.
512 bytes or 4K bytes). Metadata can be seen as contributing to
fragmentation, which is why it is so valuable to cache it. If the metadata
is not conveniently close to the data, then it may result in a big ugly
disk seek (same impact as data fragmentation) to read it.

In summary, with zfs's default 128K block size, data fragmentation is not
a significant issue. If the zfs filesystem block size is reduced to a much
smaller value (e.g. 8K), then it can become a significant issue. As
Richard Elling points out, a database layered on top of zfs may already be
fragmented by design.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
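The arithmetic behind those percentages, for anyone who wants to plug in
their own record and sector sizes (a quick check with bc):

  # worst-case fragmentation % = 100 * sector_size / block_size
  echo "scale=2; 100*512/(128*1024)" | bc   # 128K blocks -> .39
  echo "scale=2; 100*512/(8*1024)" | bc     # 8K blocks   -> 6.25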
Re: [zfs-discuss] zfs defragmentation via resilvering?
2012-01-09 19:14, Bob Friesenhahn wrote:
> In summary, with zfs's default 128K block size, data fragmentation is
> not a significant issue. If the zfs filesystem block size is reduced to
> a much smaller value (e.g. 8K), then it can become a significant issue.
> As Richard Elling points out, a database layered on top of zfs may
> already be fragmented by design.

I THINK there is some fallacy in your discussion: I've seen 128K referred
to as the maximum filesystem block size, i.e. for large streaming writes.
For smaller writes ZFS adapts with smaller blocks. I am not sure how it
would rewrite a few bytes inside a larger block - split it into many
smaller ones, or COW all 128K. Intermixing variable-sized indivisible
blocks can in turn lead to more fragmentation than would otherwise be
expected/possible ;)

Fixed block sizes are used (only?) for volume datasets.

> If the metadata is not conveniently close to the data, then it may
> result in a big ugly disk seek (same impact as data fragmentation) to
> read it.

Also I'm not sure about this argument. If the VDEV prefetch does not slurp
in data blocks, then by the time metadata is discovered in the blocks read
from disk and the data block locations are determined, the disk may have
rotated away from the head, so at least one rotational delay is incurred
even if the metadata is immediately followed by the data it references...
no?

//Jim
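For completeness, the property in question is per-dataset and easy to
inspect or override; a small sketch with invented names (recordsize is an
upper bound for file blocks, while zvols use a fixed volblocksize chosen
at creation time):

  # current maximum file block size (default 128K)
  zfs get recordsize tank/fs

  # typical tuning for a database dataset
  zfs set recordsize=8K tank/db

  # volumes use a fixed block size set at creation
  zfs create -V 10G -o volblocksize=8K tank/vol1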
Re: [zfs-discuss] Thinking about spliting a zpool in system and data
On 07/01/12 13:39, Jim Klimov wrote:
> I have transitioned a number of systems roughly by the same procedure
> as you've outlined. Sadly, my notes are not in English so they wouldn't
> be of much help directly;

Yes, my Russian is rusty :-).

I have bitten the bullet and spent 3-4 days doing the migration. I wrote
the details here:

http://www.jcea.es/artic/solaris_zfs_split.htm

The page is written in Spanish, but the terminal transcriptions should be
useful for everybody.

In the process, maybe somebody finds this interesting too:

http://www.jcea.es/artic/zfs_flash01.htm

Sorry, Spanish only too.

> Overall, your plan seems okay and has more failsafes than we've had -
> because longer downtimes were affordable ;) However, when doing such
> low-level stuff, you should make sure that you have remote access to
> your systems (ILOM, KVM, etc.; remotely-controlled PDUs for externally
> enforced

Yes, the migration I did had plenty of safety points (you can go back if
something doesn't work) and, most of the time, the system was in a state
able to survive an accidental reboot. Downtime was minimal, less than an
hour in total (several reboots to validate configurations before
proceeding).

I am quite pleased with the uneventful migration, but I planned it quite
carefully. I was worried about hitting bugs in Solaris/ZFS, though, but it
was very smooth.

The machine is hosted remotely, but yes, I have remote KVM. I can't boot
from remote media, but I have an OpenIndiana release on the SSD, with
VirtualBox installed and the Solaris 10 Update 10 release ISO, just in
case :-).

The only suspicious thing is that I keep swap (32GB) and dump (4GB) in the
data zpool, instead of in the system pool. Seems to work OK. Crossing my
fingers for the next Live Upgrade :-).

I read your message after I had migrated, but it was very interesting.
Thanks for taking the time to write it!

Have a nice 2012.

--
Jesus Cea Avion
j...@jcea.es - http://www.jcea.es/
jabber / xmpp:j...@jabber.org
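For anyone copying the swap/dump-on-the-data-pool arrangement, the moving
parts look roughly like this (a sketch with invented pool and volume
names; check swap(1M) and dumpadm(1M) on your release):

  # swap on a zvol in the data pool
  zfs create -V 32G data/swap
  swap -a /dev/zvol/dsk/data/swap

  # dedicated dump device on another zvol
  zfs create -V 4G data/dump
  dumpadm -d /dev/zvol/dsk/data/dump

An entry in /etc/vfstab makes the swap device persistent across reboots.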