Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
2012-01-09 6:25, Richard Elling wrote:
> Note: more analysis of the GPFS implementations is needed, but that
> will take more time than I'll spend this evening :-) Quick hits below...

Good to hear you might look into it after all ;)

>> but at the end of the day, if we've got a 12-hour rebuild (fairly
>> conservative in the days of 2TB SATA drives), the performance
>> degradation is going to be very real for end-users.
>
> I'd like to see some data on this for modern ZFS implementations
> (post Summer 2010).

Is "scrubbing performance" irrelevant in this discussion? I think that,
in general, scrubbing is the read half of a larger rebuild process, at
least for a single-vdev pool, so rebuilds take about as long or longer.
Am I wrong?

In my home-NAS case, a raidz2 pool of six 2TB drives, filled to 76%,
consistently takes 85 hours to scrub. No SSDs are involved - no L2ARC,
no ZIL devices. According to iostat, the HDDs are often utilized at
100% with a random-IO load, yielding from 500KB/s to 2-3MB/s at about
80-100 IOPS per disk (I have a scrub going on at this moment). This
system variably runs oi_148a (LiveUSB recovery) and oi_151a when
alive ;)

HTH,
//Jim Klimov
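For readers who want to reproduce the kind of observation Jim reports -
scrub progress plus per-disk utilization - the usual illumos/Solaris
commands are sketched below; the pool name "tank" is an example:

  # Scrub progress and ETA appear in the scrub: line of the status output
  zpool status tank

  # Extended per-device statistics, refreshed every 10 seconds:
  # %b is per-disk utilization, r/s and kr/s give the IOPS and
  # throughput figures of the sort Jim quotes
  iostat -xn 10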
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
Note: more analysis of the GPFS implementations is needed, but that
will take more time than I'll spend this evening :-) Quick hits
below...

On Jan 7, 2012, at 7:15 PM, Tim Cook wrote:
> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling wrote:
> Hi Jim,
>
> On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
> > Hello all,
> >
> > I have a new idea up for discussion.
> >
> > Several RAID systems have implemented "spread" spare drives, in the
> > sense that there is not an idling disk waiting to receive a burst
> > of resilver data filling it up; instead, the capacity of the spare
> > disk is spread among all drives in the array. As a result, the
> > healthy array gets one more spindle and works a little faster, and
> > rebuild times are often decreased since more spindles can
> > participate in repairs at the same time.
>
> Xiotech has a distributed, relocatable model, but the FRU is the
> whole ISE. There have been other implementations of more distributed
> RAIDness in the past (RAID-1E, etc).
>
> The big question is whether they are worth the effort. Spares solve a
> serviceability problem and only impact availability in an indirect
> manner. For single-parity solutions, spares can make a big difference
> in MTTDL, but have almost no impact on MTTDL for double-parity
> solutions (eg. raidz2).
>
> I disagree. Dedicated spares impact far more than availability.
> During a rebuild, performance is, in general, abysmal.

In ZFS, there is a resilver throttle that is designed to ensure that
resilvering activity does not impact interactive performance. Do you
have data that suggests otherwise?

> ZIL and L2ARC will obviously help (L2ARC more than ZIL),

ZIL makes zero impact on resilver. I'll have to check to see if L2ARC
is still used, but due to the nature of the ARC design, read-once
workloads like backup or resilver do not tend to negatively impact
frequently used data.

> but at the end of the day, if we've got a 12-hour rebuild (fairly
> conservative in the days of 2TB SATA drives), the performance
> degradation is going to be very real for end-users.

I'd like to see some data on this for modern ZFS implementations
(post Summer 2010).

> With distributed parity and spares, you should in theory be able to
> cut this down an order of magnitude.
>
> I feel as though you're brushing this off as not a big deal when it's
> an EXTREMELY big deal (in my mind). In my opinion you can't just
> approach this from an MTTDL perspective, you also need to take into
> account user experience. Just because I haven't lost data doesn't
> mean the system isn't (essentially) unavailable (sorry for the double
> negative and repeated parentheses). If I can't use the system due to
> performance being a fraction of what it is during normal production,
> it might as well be an outage.

So we have a method to analyze the ability of a system to perform
during degradation: performability. This can be applied to computer
systems, and we've done some analysis specifically on RAID arrays.
See also
http://www.springerlink.com/content/267851748348k382/
http://blogs.oracle.com/relling/tags/performability
Hence my comment about "doing some math" :-)

> > I don't think I've seen such an idea proposed for ZFS, and I do
> > wonder if it is at all possible with variable-width stripes?
> > Although if the disk is sliced into 200 metaslabs or so,
> > implementing a spread spare is a no-brainer as well.
>
> Put some thoughts down on paper and work through the math. If it all
> works out, let's implement it!
> -- richard
>
> I realize it's not intentional Richard, but that response is more
> than a bit condescending. If he could just put it down on paper and
> code something up, I strongly doubt he would be posting his thoughts
> here. He would be posting results. The intention of his post, as far
> as I can tell, is to perhaps inspire someone who CAN just write down
> the math and write up the code to do so. Or at least to have them
> review his thoughts and give him a dev's perspective on how viable
> bringing something like this to ZFS is. I fear responses like "the
> code is there, figure it out" make the *aris community no better
> than the linux one.

When I talk about spares in tutorials, we discuss various tradeoffs
and how to analyse the systems. Interestingly, for the GPFS case, the
mirrors example clearly shows the benefit of declustered RAID.
However, the triple-parity example (similar to raidz3) is not so
persuasive. If you have raidz3 + spares, then why not go ahead and do
raidz4? In the tutorial we work through a raidz2 + spare vs raidz3
case, and the raidz3 case is better in both performance and
dependability without sacrificing space (an unusual condition!) It is
not very difficult to add a raidz4 or indeed any number of additional
parity levels, but there is a point of diminishing returns, usually
when some other system component becomes more critical than the RAID
protection. So,
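For readers following along, one common first-order MTTDL model (of
the sort Richard has blogged about) can be worked out with a throwaway
awk script. The MTBF, MTTR and N values below are illustrative
assumptions, not measurements, and the model ignores correlated
failures:

  # Rough MTTDL sketch: independent failures, MTBF=1e6 h, MTTR=12 h,
  # N=8 disks. raidz2 loses data when a 3rd disk fails while two are
  # being repaired; raidz3 when a 4th fails. Illustrative model only.
  awk 'BEGIN { mtbf=1e6; mttr=12; n=8;
    printf "raidz2 MTTDL ~ %.2g years\n",
      mtbf^3 / (n*(n-1)*(n-2)*mttr^2) / 8760;
    printf "raidz3 MTTDL ~ %.2g years\n",
      mtbf^4 / (n*(n-1)*(n-2)*(n-3)*mttr^3) / 8760 }'

Note how the extra parity level multiplies MTTDL by roughly
MTBF/(N*MTTR), which is why spares add so little on top of raidz2: the
repair window is already tiny compared to the time to a third failure.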
Re: [zfs-discuss] zfs read-ahead and L2ARC
On Jan 8, 2012, at 5:10 PM, Jim Klimov wrote:
> 2012-01-09 4:14, Richard Elling wrote:
>> On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
>>
>>> I wonder if it is possible (currently or in the future as an RFE)
>>> to tell ZFS to automatically read-ahead some files and cache them
>>> in RAM and/or L2ARC?
>>
>> See discussions on the ZFS intelligent prefetch algorithm. I think
>> Ben Rockwood's description is the best general description:
>> http://www.cuddletech.com/blog/pivot/entry.php?id=1040
>>
>> And a more engineer-focused description is at:
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
>> -- richard
>
> Thanks for the pointers. While I've seen those articles (in fact, one
> of the two non-spam comments in Ben's blog was mine), rehashing the
> basics is always useful ;)
>
> Still, how does vdev prefetch play along with file-level prefetch?

Trick question... it doesn't. vdev prefetching is disabled in
opensolaris b148, illumos, and Solaris 11 releases. The benefit of
having the vdev cache for large numbers of disks does not appear to
justify the cost. See
http://wesunsolve.net/bugid/id/6684116
https://www.illumos.org/issues/175

> For example, if ZFS prefetched 64K from disk at the SPA level, and
> those sectors luckily happen to contain the "next" blocks of a
> streaming-read file, would the file-level prefetch take the data from
> the RAM cache or still request it from the disk?

As of b70, vdev_cache only contains metadata. See
http://wesunsolve.net/bugid/id/6437054

> In what cases would it make sense to increase zfs_vdev_cache_size?
> Does it apply to all disks combined, or to each disk (or even
> slice/partition) separately?

It applies to each leaf vdev.

> In fact, this reading got me thinking that I might have a fundamental
> misunderstanding lately; hence a couple of new yes-no questions:
>
> Is it true or false that ZFS might skip the cache and go to disks for
> "streaming" reads? (The more I think about it, the more senseless
> this sentence seems, and I might have just confused it with ZIL
> writes of bulk data.)

Unless the primarycache property is set to none, reads will look in
the ARC first.

> Is it true or false that ARC might evict cached blocks based on age
> (without new reads or other processes requiring the RAM space)?

False. Evictions occur when needed. NB: I'm not sure of the status of
the Solaris 11 ARC no-grow issue. As that code is not open sourced,
and we know that Oracle rewrote some of the ARC code, all bets are
off.

> And I guess the generic answer to my original question regarding
> intelligent pre-fetching of whole files is that this should be done
> by scripts outside ZFS itself, and that read-prefetch as well as
> ARC/L2ARC are all in place already. So if no other IOs occur, the
> disks may spin down... if only not for those "nasty" writes that may
> sporadically occur and which I'd love to see pushed out to dedicated
> ZILs ;)

I've set up external prefetching for specific use cases. Spin-down is
another can of worms...
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
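For anyone who still wants to experiment with the tunable Jim asked
about, it has historically been set in /etc/system; a sketch with an
illustrative value follows (and note it is moot on the releases
Richard lists, where vdev prefetching is disabled):

  * /etc/system fragment - illustrative only. zfs_vdev_cache_size
  * applies per leaf vdev, so the memory cost multiplies by the number
  * of disks; the historical default was 10 MB per vdev.
  set zfs:zfs_vdev_cache_size=10485760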
Re: [zfs-discuss] zfs read-ahead and L2ARC
2012-01-09 4:14, Richard Elling wrote:
> On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
>> I wonder if it is possible (currently or in the future as an RFE)
>> to tell ZFS to automatically read-ahead some files and cache them
>> in RAM and/or L2ARC?
>
> See discussions on the ZFS intelligent prefetch algorithm. I think
> Ben Rockwood's description is the best general description:
> http://www.cuddletech.com/blog/pivot/entry.php?id=1040
>
> And a more engineer-focused description is at:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
> -- richard

Thanks for the pointers. While I've seen those articles (in fact, one
of the two non-spam comments in Ben's blog was mine), rehashing the
basics is always useful ;)

Still, how does vdev prefetch play along with file-level prefetch? For
example, if ZFS prefetched 64K from disk at the SPA level, and those
sectors luckily happen to contain the "next" blocks of a
streaming-read file, would the file-level prefetch take the data from
the RAM cache or still request it from the disk?

In what cases would it make sense to increase zfs_vdev_cache_size?
Does it apply to all disks combined, or to each disk (or even
slice/partition) separately?

In fact, this reading got me thinking that I might have a fundamental
misunderstanding lately; hence a couple of new yes-no questions:

Is it true or false that ZFS might skip the cache and go to disks for
"streaming" reads? (The more I think about it, the more senseless this
sentence seems, and I might have just confused it with ZIL writes of
bulk data.)

Is it true or false that ARC might evict cached blocks based on age
(without new reads or other processes requiring the RAM space)?

And I guess the generic answer to my original question regarding
intelligent pre-fetching of whole files is that this should be done by
scripts outside ZFS itself, and that read-prefetch as well as
ARC/L2ARC are all in place already. So if no other IOs occur, the
disks may spin down... if only not for those "nasty" writes that may
sporadically occur and which I'd love to see pushed out to dedicated
ZILs ;)

Thanks,
//Jim
Re: [zfs-discuss] zfs read-ahead and L2ARC
On Jan 7, 2012, at 8:59 AM, Jim Klimov wrote:
> I wonder if it is possible (currently or in the future as an RFE)
> to tell ZFS to automatically read-ahead some files and cache them
> in RAM and/or L2ARC?

See discussions on the ZFS intelligent prefetch algorithm. I think Ben
Rockwood's description is the best general description:
http://www.cuddletech.com/blog/pivot/entry.php?id=1040

And a more engineer-focused description is at:
http://www.solarisinternals.com/wiki/index.php/ZFS_Performance#Intelligent_prefetch
 -- richard

> One use-case would be for home-NAS setups where multimedia (video
> files or catalogs of images/music) are viewed from a ZFS box. For
> example, if a user wants to watch a film, or listen to a playlist of
> MP3s, or push photos to a wall display (photo frame, etc.), the
> storage box "should" read ahead all required data from HDDs and save
> it in ARC/L2ARC. Then the HDDs can spin down for hours while the
> pre-fetched gigabytes of data are used by consumers from the cache.
> End-users get peace, quiet and less electricity used while they
> enjoy their multimedia entertainment ;)
>
> Is it possible? If not, how hard would it be to implement?
>
> In terms of scripting, would it suffice to detect reads (i.e. with
> DTrace) and read the files to /dev/null to get them cached along
> with all required metadata (so that mechanical HDDs are not required
> for reads afterwards)?
>
> Thanks,
> //Jim Klimov

--
ZFS and performance consulting
http://www.RichardElling.com
illumos meetup, Jan 10, 2012, Menlo Park, CA
http://www.meetup.com/illumos-User-Group/events/41665962/
[zfs-discuss] Pool faulted in a bad way
Hello,

I have been asked to take a look at a pool on an old OSOL 2009.06
host. It has been left unattended for a long time and was found in a
FAULTED state. Two of the disks in the raidz2 pool seem to have
failed; one has been replaced by a spare, and the other one is
UNAVAIL. The machine was restarted, and the damaged disks were removed
to make it possible to access the pool without it hanging on I/O
errors.

Now, I have no indication that more than two disks have failed, and
one of them seems to have been replaced by the spare. I would
therefore have expected the pool to be in a working state even with
two failed disks and some bad data on the remaining disks, since
metadata has additional replication.

This is the current state of the pool, unable to be imported (at least
with 2009.06):

    pool: tank
   state: FAULTED
  status: One or more devices could not be opened. There are
          insufficient replicas for the pool to continue functioning.
  action: Attach the missing device and online it using 'zpool online'.
     see: http://www.sun.com/msg/ZFS-8000-3C
   scrub: none requested
  config:

          NAME           STATE     READ WRITE CKSUM
          tank           FAULTED       0     0     1  corrupted data
            raidz2       DEGRADED      0     0     6
              c12t0d0    ONLINE        0     0     0
              c12t1d0    ONLINE        0     0     0
              spare      ONLINE        0     0     0
                c12t2d0  ONLINE        0     0     0
                c12t7d0  ONLINE        0     0     0
              c12t3d0    ONLINE        0     0     0
              c12t4d0    ONLINE        0     0     0
              c12t5d0    ONLINE        0     0     0
              c12t6d0    UNAVAIL       0     0     0  cannot open

If we look at the status, there is a mismatch between the status
message, which states that insufficient replicas are available, and
the status of the disks. More troublesome is the "corrupted data"
status for the whole pool. I also get "bad config type 16 for stats"
from zdb.

What can possibly cause something like this - a faulty controller? Is
there any way to recover (UB rollback with OI perhaps?) The server has
ECC memory and another pool that is still working fine. The controller
is an ARECA 1280.

And some output from zdb:

  # zdb tank | more
  zdb: can't open tank: I/O error
      version=14
      name='tank'
      state=0
      txg=0
      pool_guid=17315487329998392945
      hostid=8783846
      hostname='storage'
      vdev_tree
          type='root'
          id=0
          guid=17315487329998392945
          bad config type 16 for stats
          children[0]
              type='raidz'
              id=0
              guid=14250359679717261360
              nparity=2
              metaslab_array=24
              metaslab_shift=37
              ashift=9
              asize=14002698321920
              is_log=0

  root@storage:~# zdb tank
      version=14
      name='tank'
      state=0
      txg=0
      pool_guid=17315487329998392945
      hostid=8783846
      hostname='storage'
      vdev_tree
          type='root'
          id=0
          guid=17315487329998392945
          bad config type 16 for stats
          children[0]
              type='raidz'
              id=0
              guid=14250359679717261360
              nparity=2
              metaslab_array=24
              metaslab_shift=37
              ashift=9
              asize=14002698321920
              is_log=0
              bad config type 16 for stats
              children[0]
                  type='disk'
                  id=0
                  guid=5644370057710608379
                  path='/dev/dsk/c12t0d0s0'
                  devid='id1,sd@x001b4d23002bb800/a'
                  phys_path='/pci@0,0/pci8086,25f8@4/pci8086,370@0/pci17d3,1260@e/disk@0,0:a'
                  whole_disk=1
                  DTL=154
                  bad config type 16 for stats
              children[1]
                  type='disk'
                  id=1
                  guid=7134885674951774601
                  path='/dev/dsk/c12t1d0s0'
                  devid='id1,sd@x001b4d23002bb810/a'
                  phys_path='/pci@0,0/pci8086,25f8@4/pci8086,370@0/pci17d3,1260@e/disk@1,0:a'
                  whole_disk=1
                  DTL=153
                  bad config type 16 for stats
              children[2]
                  type='spare'
                  id=2
                  guid=7434068041432431375
                  whole_disk=0
                  bad config type 16 for stats
                  children[0]
                      type='disk'
                      id=0
                      guid=5913529661608977121
                      path='/dev/dsk/c12t2d0s0'
                      devid='id1,sd@x001b4d23002bb820/a'
                      ph
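On the recovery question: 2009.06 predates it, but later bits (e.g.
OpenIndiana) added a recovery-mode import that attempts to roll back
to an earlier uberblock/txg, which is exactly the "UB rollback" asked
about. A sketch, assuming the disks can be attached to a newer host:

  # Dry run first: report whether a recovery-mode (txg rollback)
  # import would succeed, and what would be lost, without doing it
  zpool import -nF tank

  # Then attempt the real recovery import
  zpool import -F tank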
Re: [zfs-discuss] zfs read-ahead and L2ARC
2012-01-09 0:29, John Martin wrote:
> On 01/08/12 11:30, Jim Klimov wrote:
>> However for smaller servers, such as home NASes which have about one
>> user overall, pre-reading and caching files even for a single use
>> might be an objective per se - just to let the hard disks spin down.
>> Say, if I sit down to watch a movie from my NAS, it is likely that
>> for 90 or 120 minutes there will be no other IO initiated by me. The
>> movie file can be pre-read in a few seconds, and then most of the
>> storage system can go to sleep. I can't find such home-NAS usage
>> uncommon, because I am my own example user - so I see this pattern
>> often ;)
>
> Isn't this just a more extreme case of prediction?

Probably it is, and this is probably not a task for ZFS alone, but for
logic outside it. There are some requirements that ZFS should meet in
order for this to work, though. Details follow...

> In addition to the file system knowing there will only be one client
> reading 90-120 minutes of (HD?) video that will fit in the memory of
> a small(er) server, now the hard drive power management code also
> knows there won't be another access for 90-120 minutes, so it is OK
> to spin down the hard drive(s).

Well, in the original post I did suggest that the prediction logic
might go into scripting or some other user-level tool. And it should,
really, to keep the kernel clean and slim.

The "predictor" might be as simple as a DTrace file-access monitor
which would "cat" or "tar" files into /dev/null. I.e. if it detected
access to "*.(avi|mkv|wmv)", it should cat the file; if it detected
"*.(mp3|ogg|jpg)", it should tar the parent directory. That might be
dumb and still sufficiently efficient ;)

However, for such use-cases this tool would need some "guarantees"
from ZFS. One is that the read-ahead data will find its way into the
caches and won't be evicted for no reason (when there's no other RAM
pressure). This means the tool should be able to read all the data and
metadata required by ZFS, so that no more disk access is needed once
it's all in cache. It might require a tunable in ZFS for home-NAS
users to disable the current no-caching policy for detected streaming
reads: we need the opposite of that behavior.

Another part is HDD power management, which reportedly works in
Solaris, allowing disks to spin down when there has been no access for
some time. Probably there is a syscall to do this on demand as well...

On a side note, for home NASes or other not-heavily-used storage
servers, it would be wonderful to be able to cache small writes onto
ZIL devices, if present, and not flush them to the main pool until
some megabyte limit is reached (i.e. the ZIL is full) or a pool
export/import event occurs. This would allow the main disk arrays to
remain idle for a long time while small sporadic writes initiated by
the OS (logs, atimes, web-browser cache files, whatever) are
persistently stored in the ZIL. Essentially, this would be like
setting TXG-commit times to practical infinity, and actually
committing based on byte-count limits. One possible difference would
be not streaming larger writes to pool disks at once, but also storing
them in the dedicated ZIL.
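A minimal sketch of the predictor Jim describes, assuming an illumos
box with DTrace available; the media path and file patterns are made
up for illustration, and a real tool would want de-duplication of
repeat opens and some rate limiting:

  #!/bin/sh
  # Watch file opens; pre-read matching media files (or their parent
  # directory) into the ARC so the disks can idle afterwards.
  dtrace -q -n 'syscall::open*:entry { printf("%s\n", copyinstr(arg0)); }' |
  while read f; do
    case "$f" in
      *.avi|*.mkv|*.wmv)
        cat "$f" > /dev/null &                      # warm one file
        ;;
      *.mp3|*.ogg|*.jpg)
        tar cf - "$(dirname "$f")" > /dev/null &    # warm the directory
        ;;
    esac
  done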
Re: [zfs-discuss] zfs read-ahead and L2ARC
On 01/08/12 11:30, Jim Klimov wrote:
> However for smaller servers, such as home NASes which have about one
> user overall, pre-reading and caching files even for a single use
> might be an objective per se - just to let the hard disks spin down.
> Say, if I sit down to watch a movie from my NAS, it is likely that
> for 90 or 120 minutes there will be no other IO initiated by me. The
> movie file can be pre-read in a few seconds, and then most of the
> storage system can go to sleep.

Isn't this just a more extreme case of prediction? In addition to the
file system knowing there will only be one client reading 90-120
minutes of (HD?) video that will fit in the memory of a small(er)
server, now the hard drive power management code also knows there
won't be another access for 90-120 minutes, so it is OK to spin down
the hard drive(s).
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
On Sun, Jan 08, 2012 at 06:59:57AM +0400, Jim Klimov wrote:
> 2012-01-08 5:37, Richard Elling wrote:
>> The big question is whether they are worth the effort. Spares solve
>> a serviceability problem and only impact availability in an
>> indirect manner. For single-parity solutions, spares can make a big
>> difference in MTTDL, but have almost no impact on MTTDL for
>> double-parity solutions (eg. raidz2).
>
> Well, regarding this part: in the presentation linked in my OP, the
> IBM presenter suggests that for a 6-disk raid10 (3 mirrors) with one
> spare drive - overall a 7-disk set - there are these options for
> "critical" hits to data redundancy when one of the drives dies:
>
> 1) Traditional RAID - one full disk is a mirror of another full
>    disk; 100% of a disk's size is "critical" and has to be
>    replicated onto a spare drive ASAP;
>
> 2) Declustered RAID - all 7 disks are used for 2 unique data blocks
>    from the "original" setup and one spare block (I am not sure I
>    described it well in words; his diagram shows it better); if a
>    single disk dies, only 1/7 worth of disk size is critical (not
>    redundant) and can be fixed faster.
>
>    For their typical 47-disk sets of RAID-7-like redundancy, under
>    1% of the data becomes critical when 3 disks die at once, which
>    is (deemed) unlikely as is.
>
> Apparently, in the GPFS layout, MTTDL is much higher than in
> raid10+spare with all other stats being similar.
>
> I am not sure I'm ready (or qualified) to sit down and present the
> math right now - I just heard some ideas that I considered worth
> sharing and discussing ;)

Thanks for the video link (http://www.youtube.com/watch?v=2g5rx4gP6yU).
It's very interesting!

GPFS Native RAID seems to be more advanced than current ZFS, and it
even has rebalancing implemented (the infamous missing zfs
bp-rewrite). It'd definitely be interesting to have something like
this implemented in ZFS.

-- Pasi
Re: [zfs-discuss] zfs defragmentation via resilvering?
On Sat, 7 Jan 2012, Jim Klimov wrote:
> I understand that relatively high fragmentation is inherent to ZFS
> due to its COW and possible intermixing of metadata and data blocks
> (of which metadata path blocks are likely to expire and get freed
> relatively quickly).

To put things in proper perspective, with 128K filesystem blocks, the
worst-case file fragmentation as a percentage is 0.39%
(100*1/((128*1024)/512)). On a Microsoft Windows system, the defragger
might suggest that defragmentation is not warranted at this percentage
level.

> Finally, what would the gurus say - does fragmentation pose a heavy
> problem on nearly-filled-up pools made of spinning HDDs (I believe
> so, at least judging from those performance degradation problems
> writing to 80+%-filled pools), and can fragmentation be effectively
> combatted on ZFS at all (with or without BP rewrite)?

There are different types of fragmentation. The fragmentation which
causes a slowdown when writing to an almost-full pool is fragmentation
of the free list/area (causing zfs to take longer to find free space
to write to), as opposed to fragmentation of the files themselves. The
files themselves will still not be fragmented any more severely than
the zfs blocksize. However, there are seeks and there are *seeks*, and
some seeks take longer than others, so some forms of fragmentation are
worse than others. When the free space is fragmented into smaller
blocks, there is necessarily more file fragmentation when the file is
written.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
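For reference, the pool fill level behind the slowdown Jim mentions is
easy to keep an eye on; the pool name below is an example:

  # The CAP column reports the percentage of pool capacity in use;
  # the often-cited rule of thumb is to stay below ~80% on spinning
  # disks to avoid free-space-fragmentation slowdowns
  zpool list tank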
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> If the performance of the outer tracks is better than the
> performance of the inner tracks due to limitations of magnetic
> density or rotation speed (not being limited by the head speed or
> bus speed), then the sequential performance of the drive should
> increase linearly with the radius, going toward the outer tracks:
> track circumference c = 2*pi*r.

Decrease, because the outer tracks are the lower-numbered tracks; they
have the same density, but they are larger.

> So, small variations of sequential performance are possible, jumping
> from track to track, but based on what I've seen, the maximum
> performance difference from the absolute slowest track to the
> absolute fastest track (which may or may not have any relation to
> inner vs outer) ... maximum variation on par with a 10% performance
> difference.

I've noticed a change of 50% in speed or more between the lower and
the higher numbers (60MB/s to 30MB/s).

In benchmark land, they short-stroke disks for better performance; I
believe the Pillar boxes do similar tricks under the covers (if you
want more performance, it gives you the faster tracks).

Casper
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
On Sat, 7 Jan 2012, Edward Ned Harvey wrote:
> If you don't split out your ZIL separate from the storage pool, zfs
> already chooses disk blocks that it believes to be optimized for
> minimal access time. In fact, I believe, zfs will dedicate a few
> sectors at the low end, a few at the high end, and various other
> locations scattered throughout the pool, so whatever the current
> head position, it tries to go to the closest "landing zone" that's
> available for ZIL writes. If anything, splitting out your ZIL to a
> different partition might actually hurt your performance.

Something else to be aware of is that even if you don't have a
dedicated ZIL device, zfs will create a ZIL using devices in the main
pool, so there is always a ZIL, even if you don't see it.

Also, the ZIL is only used to record pending small writes. Larger
writes (I think 128K or more) are written to their pre-allocated final
location in the main pool. This choice is made since the purpose of
the ZIL is to minimize random I/O to disk, and writing large amounts
of data to the ZIL would create a bandwidth bottleneck.

There are postings by Matt Ahrens to this list (and elsewhere) which
provide an accurate description of how the ZIL works.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
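The small/large cutover Bob describes is governed by a kernel tunable;
for the record, its historical default is 32K rather than 128K. A
sketch of how it has been set, with the default shown as an
illustrative value:

  * /etc/system fragment - illustrative only. Synchronous writes
  * larger than zfs_immediate_write_sz are written to their final
  * pool location, with the ZIL logging only a pointer to them
  * (default 0x8000 = 32K).
  set zfs:zfs_immediate_write_sz=0x8000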
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
2012-01-08 18:56, Edward Ned Harvey wrote:
>> From: Richard Elling [mailto:richard.ell...@gmail.com]
>>
>> Disagree. My data, and the vendor specs, continue to show different
>> sequential media bandwidth speed for inner vs outer cylinders.
>
> Any reference?

Well, Richard's data matches mine from tests of my HDDs at home: I
read in some 10GB blocks at different offsets (dd > /dev/null), and
"linear" speeds dropped from about 150MB/s to about 80-100MB/s. This
was tested on a relatively modern 2TB Seagate drive.

Random IOs are still crappy on mechanical drives, often under 10MB/s ;)

//Jim
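A sketch of the kind of test Jim describes, for an illumos/Solaris
box; the device name and offsets are examples, and the raw (rdsk)
device is read to bypass filesystem caching:

  # Read 1 GB at several offsets across the disk and compare the
  # elapsed times; Solaris dd accepts iseek= to skip input blocks.
  for off_gb in 0 500 1000 1500; do
    ptime dd if=/dev/rdsk/c12t0d0 of=/dev/null bs=1024k \
        iseek=`expr $off_gb \* 1024` count=1024
  done

On a drive with zoned recording, the runs at low offsets (outer
cylinders) should finish noticeably faster than the ones near the end
of the disk.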
Re: [zfs-discuss] zfs read-ahead and L2ARC
2012-01-08 19:15, John Martin wrote:
> On 01/08/12 09:30, Edward Ned Harvey wrote:
>> In the case of your MP3 collection... Probably the only thing you
>> can do is to write a script which will simply go read all the files
>> you predict will be read soon. The key here is the prediction -
>> there's no way ZFS or solaris, or any other OS in the present day,
>> is going to intelligently predict which files you'll be requesting
>> soon.
>
> The other prediction is whether the blocks will be reused. If the
> blocks of a streaming read are only used once, then it may be
> wasteful for a file system to allow these blocks to be placed in the
> cache. If a file system purposely chooses not to cache streaming
> reads, manually scheduling a "pre-read" of particular files may
> simply cause the file to be read from disk twice: on the manual
> pre-read and when it is read again by the actual application.

Well, this point is valid for intensively-used servers - but there
such blocks might just get evicted from the caches by newer and/or
more-frequently-used blocks anyway.

However, for smaller servers, such as home NASes which have about one
user overall, pre-reading and caching files even for a single use
might be an objective per se - just to let the hard disks spin down.
Say, if I sit down to watch a movie from my NAS, it is likely that for
90 or 120 minutes there will be no other IO initiated by me. The movie
file can be pre-read in a few seconds, and then most of the storage
system can go to sleep.

//Jim
Re: [zfs-discuss] zfs read-ahead and L2ARC
On 01/08/12 09:30, Edward Ned Harvey wrote:
> In the case of your MP3 collection... Probably the only thing you
> can do is to write a script which will simply go read all the files
> you predict will be read soon. The key here is the prediction -
> there's no way ZFS or solaris, or any other OS in the present day,
> is going to intelligently predict which files you'll be requesting
> soon.

The other prediction is whether the blocks will be reused. If the
blocks of a streaming read are only used once, then it may be wasteful
for a file system to allow these blocks to be placed in the cache. If
a file system purposely chooses not to cache streaming reads, manually
scheduling a "pre-read" of particular files may simply cause the file
to be read from disk twice: on the manual pre-read and when it is read
again by the actual application.

I believe Joerg Moellenkamp published a discussion several years ago
on how the L1ARC attempts to deal with the pollution of the cache by
large streaming reads, but I don't have a bookmark handy (nor the
knowledge of whether the behavior is still accurate).
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
> From: Richard Elling [mailto:richard.ell...@gmail.com]
>
>> Also, the concept of "faster tracks of the HDD" is also incorrect.
>> Yes, there was a time when HDD speeds were limited by rotational
>> speed and magnetic density, so the outer tracks of the disk could
>> serve up more data because more magnetic material passed over the
>> head in each rotation. But nowadays, the hard drive sequential
>> speed is limited by the head speed, which is invariably right
>> around 1Gbps. So the inner and outer sectors of the HDD are equally
>> fast - the outer sectors are actually less magnetically dense
>> because the head can't handle it. And the random IO speed is
>> limited by head seek + rotational latency, where seek is typically
>> several times longer than latency.
>
> Disagree. My data, and the vendor specs, continue to show different
> sequential media bandwidth speed for inner vs outer cylinders.

Any reference?

I know, as I sit and dd from some disk | pv > /dev/null, it will tell
me something like 1.0Gbps... I periodically check its progress while
it's running, and while it varies a little (say, sometimes 1.0, 1.1,
1.2) it goes up and down throughout the process. There is no
noticeable difference between the early, mid, and late behavior,
sequentially reading the whole disk.

If the performance of the outer tracks is better than the performance
of the inner tracks due to limitations of magnetic density or rotation
speed (not being limited by the head speed or bus speed), then the
sequential performance of the drive should increase linearly with the
radius, going toward the outer tracks: track circumference c = 2*pi*r.

It is my belief, based on specs I've previously looked at, that mfgrs
break the drive down into zones. So, something like the inner 20% of
the tracks will have magnetic layout pattern A, the next 20% will have
magnetic layout pattern B, and so forth... Within a single magnetic
layout pattern, jumping from individual track to individual track can
yield a difference in performance, but it's not a huge step from one
to the next. And when you transition from layout pattern to layout
pattern, the pattern just repeats itself again. They're trying to
optimize: to a first order, ensure the performance limitations are
mostly caused by head and/or bus speed. If those are the bottlenecks,
let them be the bottlenecks, and at least solve all the other problems
that are solvable.

So, small variations of sequential performance are possible, jumping
from track to track, but based on what I've seen, the maximum
performance difference from the absolute slowest track to the absolute
fastest track (which may or may not have any relation to inner vs
outer) is on par with a 10% performance difference. Nowhere near
proportional to the radius.

> OTOH, you're not trying to get high performance from an HDD are you?
> That game is over.

Lots of us still have to live with HDDs, due to capacity and cost
requirements. We accept a relative definition of "high performance,"
and still want to get all the performance we can out of whatever
device we're using, even if there exists a faster device somewhere in
the world.

Also, for sequential performance, HDDs are on par with, and often
better than, SSDs. (For now.) While many SSDs publish specs including
something like "220 MB/s", which is higher than HDDs can reach... SSDs
publish their maximum performance, which is not typical performance.
After you use them for a month, they slow down, often to half or worse
of the speed they originally ran at. Which is... as I say... on par
with, or worse than, the sequential speed of an HDD.

Even crappy SSDs can have random IO worse than HDDs. Just benchmark
any high-cost top-tier USB3 flash memory stick, and you'll see what I
mean. ;-) The only SSDs that are faster than HDDs in any way are
*actual* internal sas/sata/etc SSDs, which are faster than HDDs in
terms of random IOPS and maybe sequential.
Re: [zfs-discuss] zfs read-ahead and L2ARC
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Jim Klimov
>
> I wonder if it is possible (currently or in the future as an RFE)
> to tell ZFS to automatically read-ahead some files and cache them
> in RAM and/or L2ARC?
>
> One use-case would be for home-NAS setups where multimedia (video
> files or catalogs of images/music) are viewed from a ZFS box. For
> example, if a user wants to watch a film, or listen to a playlist
> of MP3s, or push photos to a wall display (photo frame, etc.), the
> storage box "should" read ahead all required data from HDDs and
> save it in ARC/L2ARC. Then the HDDs can spin down for hours while
> the pre-fetched gigabytes of data are used by consumers from the
> cache. End-users get peace, quiet and less electricity used while
> they enjoy their multimedia entertainment ;)

This whole subject is important and useful - and not unique to ZFS.
The whole question is: how can the system predict which things are
going to be requested next?

In the case of a video, there's a big file which is likely to be read
sequentially. I don't know how far read-ahead currently reads, but it
is surely only smart enough to stay within a single file. If the
read-ahead buffer starts to get low, and the disks have been spun
down, I don't know how low the buffer gets before it triggers more
read-ahead. But at least in the case of streaming video files, there's
a very realistic possibility that something like the existing
read-ahead can do what you want.

In the case of your MP3 collection... Probably the only thing you can
do is to write a script which will simply go read all the files you
predict will be read soon. The key here is the prediction - there's no
way ZFS or solaris, or any other OS in the present day, is going to
intelligently predict which files you'll be requesting soon. But you,
the user, who knows your usage patterns, might be able to make these
predictions and request to cache them. The request is simply telling
the system to start reading those files now. So it's very easy to
cache, as long as you know what to cache.
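A trivial sketch of such a user-driven pre-read, with made-up paths;
run it from cron (or by hand) before the usual viewing hours, and it
simply pulls the predicted files through the ARC:

  #!/bin/sh
  # Warm the cache for tonight's predicted reads; paths are examples.
  for f in /tank/media/movies/tonight.mkv /tank/media/music/playlist/*
  do
    cat "$f" > /dev/null
  done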
Re: [zfs-discuss] ZFS and spread-spares (kinda like GPFS declustered RAID)?
First of all, I would like to thank Bob, Richard and Tim for at least
taking the time to look at this proposal and responding ;)

It is also encouraging to see that 2 of 3 responders consider this
idea at least worth pondering and discussing, as it appeals to their
direct interest. Even Richard was not dismissive of it ;)

Finally, as Tim was right to note, I am not a kernel developer (and
won't become one as good as those present on this list). Of course, I
could "pull the blanket onto my side" and say that I'd try to write
that code myself... but it would probably be a long wait, like that
for "BP rewrite" - because I already have quite a few commitments and
responsibilities as an admin and, recently, as a parent (yay!)

So, I guess, my piece of the pie is currently limited to RFEs and bug
reports... and working in IT for a software development company, I
believe (or hope) that's not a useless part of the process ;)

I do believe that ZFS technology is amazing - despite some
shortcomings that are still present - and I do want to see it
flourish... ASAP! :^)

//Jim

2012-01-08 7:15, Tim Cook wrote:
> On Sat, Jan 7, 2012 at 7:37 PM, Richard Elling
> <mailto:richard.ell...@gmail.com> wrote:
>
>     Hi Jim,
>
>     On Jan 6, 2012, at 3:33 PM, Jim Klimov wrote:
>     > Hello all,
>     >
>     > I have a new idea up for discussion.
>     > ...
>
> I disagree. Dedicated spares impact far more than availability.
> During a rebuild performance is, in general, abysmal. ... If I can't
> use the system due to performance being a fraction of what it is
> during normal production, it might as well be an outage.
>
>     > I don't think I've seen such an idea proposed for ZFS, and I
>     > do wonder if it is at all possible with variable-width
>     > stripes? Although if the disk is sliced into 200 metaslabs or
>     > so, implementing a spread spare is a no-brainer as well.
>
>     Put some thoughts down on paper and work through the math. If it
>     all works out, let's implement it!
>     -- richard
>
> I realize it's not intentional Richard, but that response is more
> than a bit condescending. If he could just put it down on paper and
> code something up, I strongly doubt he would be posting his thoughts
> here. He would be posting results. The intention of his post, as far
> as I can tell, is to perhaps inspire someone who CAN just write down
> the math and write up the code to do so. Or at least to have them
> review his thoughts and give him a dev's perspective on how viable
> bringing something like this to ZFS is. I fear responses like "the
> code is there, figure it out" make the *aris community no better
> than the linux one.
>
>     > What do you think - can and should such ideas find their way
>     > into ZFS? Or why not? Perhaps from theoretical or real-life
>     > experience with such storage approaches?
>     >
>     > //Jim Klimov
>
> As always, feel free to tell me why my rant is completely off base ;)
>
> --Tim