Re: [zfs-discuss] Re: Big IOs overhead due to ZFS?
On Thu, Jun 01, 2006 at 02:46:32PM +0200, Robert Milkowski wrote:
> btw: what differences will there be between raidz1 and raidz2? I guess two
> checksums will be stored, so one loses approximately the space of two disks
> in a raidz2 group. Any other things?

The difference between raidz1 and raidz2 is just that the latter is resilient
against losing 2 disks rather than just 1. If you have a total of 5 disks in a
raidz1 stripe, your optimal capacity will be 4/5ths of the raw capacity of the
disks, whereas it would be 3/5ths with raidz2. Consider, however, that you'll
typically use larger stripes with raidz2, so you aren't necessarily going to
lose any capacity depending on how you configure your pool.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
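The capacity arithmetic above is easy to check for yourself; this is just the fraction math from the paragraph, not anything ZFS computes for you (the function name is made up for illustration):

```python
def usable_fraction(total_disks, parity_disks):
    """Fraction of raw capacity left for data in one raidz vdev."""
    return (total_disks - parity_disks) / total_disks

# 5-disk raidz1 vs. 5-disk raidz2, and a wider raidz2 stripe that
# recovers the same efficiency as the narrow raidz1.
print(usable_fraction(5, 1))   # 0.8 (4/5ths)
print(usable_fraction(5, 2))   # 0.6 (3/5ths)
print(usable_fraction(10, 2))  # 0.8 again
```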
Re: [zfs-discuss] Expanding raidz2
On Wed, Jul 12, 2006 at 02:45:40PM -0700, Darren Dunham wrote:
> There may be several parity sectors per row, so adding another column
> doesn't work. But presumably it would be possible to use additional
> columns for future writes?

I guess that could be made to work, but then the data on disk becomes much
(much much) more difficult to interpret, because you have some rows which are
effectively one width and others which are another (ad infinitum). It also
doesn't really address the issue: presumably you want to add space because
the disks are getting full, but this scheme, as you mention, only applies the
new width to empty rows.

I'm not sure I even agree with the notion that this is a real problem (and if
it is, I don't think it's easily solved). Stripe widths are a function of the
expected failure rate and fault domains of the system, which tend to be
static in nature. A coarser solution would be to create a new pool and zfs
send/zfs recv the filesystems from the old pool.

Adam
Re: [zfs-discuss] Apple Time Machine
Needless to say, this was a pretty interesting piece of the keynote from a
technical point of view, and it had quite a few of us scratching our heads.
After talking to some Apple engineers, it seems like what they're doing is
more or less this: when a file is modified, the kernel fires off an event
which a user-land daemon listens for. Every so often, the user-land daemon
does something like a snapshot of the affected portions of the filesystem
with hard links (including hard links to directories -- I'm not making this
up). That might be a bit off, but it's the impression I was left with.

Anyhow: very slick UI, somewhat dubious back end, and an interesting
possibility for integration with ZFS.

Adam

On Mon, Aug 07, 2006 at 12:08:17PM -0700, Eric Schrock wrote:
> Yeah, I just noticed this line:
>
>   Backup Time: Time Machine will back up every night at midnight, unless
>   you select a different time from this menu.
>
> So this is just standard backups, with a (very) slick GUI layered on top.
> From the impression of the text-only rumor feed, it sounded more
> impressive from a filesystem implementation perspective. Still, the GUI
> integration is pretty nice, and implies that their backups are in some
> easily accessed form. Otherwise, extracting hundreds of files from a
> compressed stream would induce too much delay for the interactive stuff
> they describe.
>
> - Eric
>
> On Mon, Aug 07, 2006 at 08:58:15AM -1000, David J. Orman wrote:
> > Reading that site, it sounds EXACTLY like snapshots. It doesn't sound
> > like it requires a second disk; it just gives you the option of backing
> > up to one. Sounds like it snapshots once a day (configurable) and then
> > sends the snapshot to another drive/server if you request it to do so.
> > Looks like they just made snapshots accessible to desktop users. Pretty
> > impressive how they did the GUI work, too.
Re: [zfs-discuss] in-kernel gzip compression
On Thu, Aug 17, 2006 at 10:00:32AM -0700, Matthew Ahrens wrote:
> (Actually, I think that an RLE compression algorithm for metadata is a
> higher priority, but if someone from the community wants to step up, we
> won't turn your code away!)

Is RLE likely to be more efficient for metadata? Have you taken a stab at
estimating the comparative benefits?

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 10:05:01AM +, Ceri Davies wrote:
> On Wed, Nov 01, 2006 at 01:33:33AM -0800, Adam Leventhal wrote:
> > Rick McNeal and I have been working on building support for sharing
> > ZVOLs as iSCSI targets directly into ZFS. Below is the proposal I'll be
> > submitting to PSARC. Comments and suggestions are welcome.
>
> It looks great and I'd love to see it implemented.

It's implemented! This is the end of the process, not the beginning ;-)
I expect it will be in OpenSolaris by the end of November.

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 12:18:36PM +0200, Cyril Plisko wrote:
> > Note again that all configuration information is stored with the
> > dataset. As with NFS shared filesystems, iSCSI targets imported on a
> > different system will be shared appropriately.
>
> Does that mean that if I manage the iSCSI target via iscsitadm after it
> is shared via zfs shareiscsi=on, and then 'zpool export' and 'zpool
> import' on some other host, all the customization done via iscsitadm
> will be preserved?

No. Modifications to the target must be made through zfs(1M), not through
iscsitadm(1M), if you want them to be persistent. This is similar to sharing
ZFS filesystems via NFS: you can use share(1M), but it doesn't affect the
persistent properties of the dataset.

What properties are you specifically interested in modifying?

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
> > What properties are you specifically interested in modifying?
>
> The LUN, for example. How would I configure the LUN via the zfs command?

You can't. Forgive my ignorance about how iSCSI is deployed, but why would
you want/need to change the LUN?

Adam

On Wed, Nov 01, 2006 at 01:36:05PM +0200, Cyril Plisko wrote:
> [...]
>
> --
> Regards, Cyril
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 01:17:02PM -0500, Torrey McMahon wrote:
> Is there going to be a method to override that on the import? I can see a
> situation where you want to import the pool for some kind of maintenance
> procedure, but you don't want the iSCSI target to fire up automagically.

There isn't -- to my knowledge -- a way to do this today for NFS shares.
This would be a reasonable RFE that would apply to both NFS and iSCSI.

> Also, what if I don't have the iSCSI target packages on the node I'm
> importing to? Error messages? Nothing?

You'll get an error message reporting that the dataset could not be shared.

Adam
Re: [zfs-discuss] Re: [storage-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 12:22:46PM -0500, Matty wrote:
> This is super useful! Will ACLs and aliases be stored as properties?
> Could you post the list of available iSCSI properties to the list?

We're still investigating ACL and iSNS support. The alias will always be the
name of the dataset, but we've considered making that an option you could set
in the 'shareiscsi' property ('alias=blah', for example). The iSCSI
properties I was referring to are the private metadata for the target daemon,
such as the IQN.

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 09:25:26PM +0200, Cyril Plisko wrote:
> On 11/1/06, Adam Leventhal <[EMAIL PROTECTED]> wrote:
> > > What properties are you specifically interested in modifying?
> >
> > The LUN, for example. How would I configure the LUN via the zfs command?
> >
> > You can't. Forgive my ignorance about how iSCSI is deployed, but why
> > would you want/need to change the LUN?
>
> Well, with iSCSI specifically it is of less importance, since one can
> easily create multiple units identified by means other than the LUN. I am,
> however, trying to look ahead to FC SCSI target functionality mirroring
> that of iSCSI (AFAIK it is on Rick's roadmap [and I really do not mind
> helping]). In the FC world it is essentially the only way to have multiple
> units on a particular FC port.
>
> Can we do something similar to the NFS case, where sharenfs can be on,
> off, or something else, in which case it is a list of options? Would this
> technique be applicable to shareiscsi too?

Absolutely. We would, however, like to be conservative about adding options,
only doing so when it meets a specific need. As you noted, there's no real
requirement to be able to set the LUN.

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 04:00:43PM -0500, Torrey McMahon wrote:
> Let's say server A has the pool with NFS shared, or iSCSI shared, volumes.
> Server A exports the pool or goes down. Server B imports the pool. Which
> clients would still be active on the filesystem(s)? The ones that were
> mounting it when it was on Server A?

Clients would need to explicitly change the server they're contacting unless
the new server also took over the IP address, hostname, etc.

Adam
[zfs-discuss] Re: [storage-discuss] ZFS/iSCSI target integration
On Thu, Nov 02, 2006 at 12:10:06AM -0800, eric kustarz wrote:
> > Like the 'sharenfs' property, 'shareiscsi' indicates if a ZVOL should be
> > exported as an iSCSI target. The acceptable values for this property are
> > 'on', 'off', and 'direct'. In the future, we may support other target
> > types (for example, 'tape'). The default is 'off'. This property may be
> > set on filesystems, but has no direct effect; this is to allow ZVOLs
> > created under the ZFS hierarchy to inherit a default. For example, an
> > administrator may want ZVOLs to be shared by default, and so set
> > 'shareiscsi=on' for the pool.
>
> hey adam, what's 'direct' mean?

It's iSCSI target lingo for vanilla disk emulation.

Adam
Re: [zfs-discuss] raid-z random read performance
I don't think you'd see the same performance benefits on RAID-Z, since parity
isn't always on the same disk. Are you seeing hot/cool disks?

Adam

On Sun, Nov 05, 2006 at 04:03:18PM +0100, Pawel Jakub Dawidek wrote:
> In my opinion RAID-Z is closer to RAID-3 than to RAID-5. In RAID-3 you do
> only full-stripe writes/reads, which is also the case for RAID-Z.
>
> What I found while working on the RAID-3 implementation for FreeBSD was
> that for small RAID-3 arrays there is a way to speed up random reads by up
> to 40% by using the parity component in a round-robin fashion. For example
> (DiskP stands for the parity component):
>
>   Disk0 Disk1 Disk2 Disk3 DiskP
>
> And now when I get a read request I do:
>
>   Request number  Components
>   0               Disk0+Disk1+Disk2+Disk3
>   1               Disk1+Disk2+Disk3+(Disk1^Disk2^Disk3^DiskP)
>   2               Disk2+Disk3+(Disk2^Disk3^DiskP^Disk0)+Disk0
>   3               Disk3+(Disk3^DiskP^Disk0^Disk1)+Disk0+Disk1
>   etc.
>
>   + - concatenation
>   ^ - XOR
>
> In other words, for every read request a different component is skipped.
> It was still a bit slower than RAID-5, though. And of course writes in
> RAID-3 (and probably RAID-Z) are much, much faster.
>
> -- Pawel Jakub Dawidek
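Pawel's round-robin trick can be sketched in a few lines. This is a purely illustrative model (single-byte "components" with made-up values, one "sector" per disk) showing why skipping a different data component per request still returns the right data: the skipped component is recoverable as the XOR of the parity component and the remaining data components.

```python
from functools import reduce
from operator import xor

# Four data components plus a parity component, as in the 4+1 example above.
data = [0x11, 0x22, 0x33, 0x44]   # one "sector" per data disk
parity = reduce(xor, data)        # DiskP = Disk0 ^ Disk1 ^ Disk2 ^ Disk3

def read_stripe(request_number):
    """Full-stripe read that skips one data component per request,
    rebuilding it from parity so all five spindles share the load."""
    skip = request_number % len(data)
    rebuilt = reduce(xor, [parity] + [d for i, d in enumerate(data) if i != skip])
    return [rebuilt if i == skip else d for i, d in enumerate(data)]

# Every request returns the same stripe contents, whichever disk was skipped.
assert all(read_stripe(n) == data for n in range(8))
```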
Re: [zfs-discuss] ZFS/iSCSI target integration
Thanks for all the feedback. This PSARC case was approved yesterday and will
be integrated relatively soon.

Adam

On Wed, Nov 01, 2006 at 01:33:33AM -0800, Adam Leventhal wrote:
> Rick McNeal and I have been working on building support for sharing ZVOLs
> as iSCSI targets directly into ZFS. Below is the proposal I'll be
> submitting to PSARC. Comments and suggestions are welcome.
>
> Adam
>
> ---8<---
>
> iSCSI/ZFS Integration
>
> A. Overview
>
> The goal of this project is to couple ZFS with the iSCSI target in
> Solaris, specifically to make it as easy to create and export ZVOLs via
> iSCSI as it is to create and export ZFS filesystems via NFS. We will add
> two new ZFS properties to support this feature.
>
> shareiscsi
>
>   Like the 'sharenfs' property, 'shareiscsi' indicates if a ZVOL should be
>   exported as an iSCSI target. The acceptable values for this property are
>   'on', 'off', and 'direct'. In the future, we may support other target
>   types (for example, 'tape'). The default is 'off'. This property may be
>   set on filesystems, but has no direct effect; this is to allow ZVOLs
>   created under the ZFS hierarchy to inherit a default. For example, an
>   administrator may want ZVOLs to be shared by default, and so set
>   'shareiscsi=on' for the pool.
>
> iscsioptions
>
>   This property, which is hidden by default, is used by the iSCSI target
>   daemon to store persistent information such as the IQN. The contents are
>   not intended for users or external consumers.
>
> B. Examples
>
> iSCSI targets are simple to create with the zfs(1M) command:
>
>   # zfs create -V 100M pool/volumes/v1
>   # zfs set shareiscsi=on pool/volumes/v1
>   # iscsitadm list target
>   Target: pool/volumes/v1
>       iSCSI Name: iqn.1986-03.com.sun:02:4db92521-f5dc-cde4-9cd5-a3f6f567220a
>       Connections: 0
>
> Renaming the ZVOL has the expected result for the iSCSI target:
>
>   # zfs rename pool/volumes/v1 pool/volumes/stuff
>   # iscsitadm list target
>   Target: pool/volumes/stuff
>       iSCSI Name: iqn.1986-03.com.sun:02:4db92521-f5dc-cde4-9cd5-a3f6f567220a
>       Connections: 0
>
> Note that per the iSCSI specification (RFC 3720), the iSCSI Name is
> unchanged after the ZVOL is renamed.
>
> Exporting a pool containing a shared ZVOL will cause the target to be
> removed; importing a pool containing a shared ZVOL will cause the target
> to be shared:
>
>   # zpool export pool
>   # iscsitadm list target
>   # zpool import pool
>   # iscsitadm list target
>   Target: pool/volumes/stuff
>       iSCSI Name: iqn.1986-03.com.sun:02:4db92521-f5dc-cde4-9cd5-a3f6f567220a
>       Connections: 0
>
> Note again that all configuration information is stored with the dataset.
> As with NFS shared filesystems, iSCSI targets imported on a different
> system will be shared appropriately.
>
> ---8<---
Re: [zfs-discuss] Solaris 11/06 + iscsi integration
Hey Robert,

The iSCSI target is targeting Solaris 10 update 4. There wasn't any issue
with the target; rather, it was the timing of its integration into Nevada,
and the sheer quantity of projects targeting update 3.

Adam

On Thu, Dec 14, 2006 at 02:39:17PM -0500, Robert Petkus wrote:
> Folks, just wondering why iSCSI target disk support didn't make it into
> the latest Solaris release. Were there any problems?
>
> Robert
Re: [zfs-discuss] odd versus even
Hey Peter,

If I recall correctly, the result was that there is a very slight
space-efficiency benefit to using a multiple of 2 disks for raidz1 vdevs and
a multiple of 3 for raidz2 -- doing this can reduce the number of 'skipped'
blocks. That said, the advantage is very slight and is only really relevant
when the blocksize or recordsize is relatively close to the number of bytes
in a stripe.

Adam

On Thu, Jan 04, 2007 at 11:17:26PM +, Peter Tribble wrote:
> I'm being a bit of a dunderhead at the moment, and neither the site search
> nor google are picking up the information I seek...
>
> I'm setting up a thumper, and I'm sure I recall some discussion of the
> optimal number of drives in raidz1 and raidz2 vdevs. I also recall that it
> was something like you would want an even number of disks for raidz1 and
> an odd number for raidz2 (so you always have an odd number of data
> drives). Have I remembered this correctly, or am I going delusional? And,
> if it is the case, what is the reasoning behind it?
>
> Thanks,
> -Peter Tribble
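The 'skipped block' accounting can be sketched roughly as follows. This is a simplified model, not the actual allocator code: it assumes one parity sector per stripe row and that allocations are rounded up to a multiple of (nparity + 1) sectors so freed segments are never unusably small. Treat the numbers as illustrative.

```python
def skipped_sectors(data_sectors, ndisks, nparity):
    """Padding ('skipped') sectors for one raidz block, under the
    simplified accounting described above."""
    ndata = ndisks - nparity
    nrows = -(-data_sectors // ndata)          # ceil: rows the data spans
    asize = data_sectors + nrows * nparity     # data plus parity sectors
    mult = nparity + 1
    return -(-asize // mult) * mult - asize    # round-up waste

# A 15-sector block on raidz1: with 3 data disks the allocation comes out
# even, while 4 data disks forces a pad sector.
assert skipped_sectors(15, 4, 1) == 0   # 4-disk raidz1: no padding
assert skipped_sectors(15, 5, 1) == 1   # 5-disk raidz1: one skipped sector
```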
Re: [zfs-discuss] Can you turn on zfs compression when the fs is already populated?
For what it's worth, there is a plan to allow data to be rewritten during a
scrub so that you can enable compression for extant data. No ETA, but it's
on the roadmap. In fact, I was recently reminded that I filed a bug on this
in 2004:

  5029294 there should be a way to compress an extant file system

Adam

On Wed, Jan 24, 2007 at 06:50:22PM +0100, [EMAIL PROTECTED] wrote:
> > I have an 800GB raidz2 zfs filesystem. It already has approx 142GB of
> > data. Can I simply turn on compression at this point, or do you need to
> > start with compression at creation time? If I turn on compression now,
> > what happens to the existing data?
>
> Yes. Nothing.
>
> Casper
Re: [zfs-discuss] Re: Re: Adding my own compression to zfs
On Mon, Jan 29, 2007 at 02:39:13PM -0800, roland wrote:
> > # zfs get compressratio
> > NAME       PROPERTY       VALUE  SOURCE
> > pool/gzip  compressratio  3.27x  -
> > pool/lzjb  compressratio  1.89x  -
>
> this looks MUCH better than i would have ever expected for smaller files.
> any real-world data on how good or bad compressratio gets with lots of
> very small but well-compressible files, for example some (evil for those
> solaris evangelists) untarred linux source tree? i'm rather excited how
> effectively gzip will compress here.
>
> for comparison:
>
>   sun1:/comptest # bzcat /tmp/linux-2.6.19.2.tar.bz2 | tar xvf -
>   --snip--
>   sun1:/comptest # du -s -k *
>   143895  linux-2.6.19.2
>   1       pax_global_header
>   sun1:/comptest # du -s -k --apparent-size *
>   224282  linux-2.6.19.2
>   1       pax_global_header
>   sun1:/comptest # zfs get compressratio comptest
>   NAME      PROPERTY       VALUE  SOURCE
>   comptest  compressratio  1.79x  -

Don't start sending me your favorite files to compress (it really should
work about the same as gzip), but here's the result for the above (I found
a tar file that's about 235MB uncompressed):

  # du -ks linux-2.6.19.2/
  80087   linux-2.6.19.2
  # zfs get compressratio pool/gzip
  NAME       PROPERTY       VALUE  SOURCE
  pool/gzip  compressratio  3.40x  -

Doing a gzip with the default compression level (6 -- the same setting I'm
using in ZFS) yields a file that's about 52MB. The small files are hurting a
bit here, but it's still pretty good -- and considerably better than LZJB.

Adam
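If you want to estimate what gzip will do for your own data without building a pool, plain zlib at level 6 is the same deflate algorithm at the same level as ZFS's 'gzip' setting. A quick sketch (the sample text is made up; real ratios depend entirely on your data, so treat this as a method, not a benchmark):

```python
import zlib

# Some repetitive, source-code-like input.
sample = b"static int zfs_ioctl(dev_t dev, int cmd, intptr_t arg);\n" * 200
compressed = zlib.compress(sample, 6)   # level 6 == ZFS's plain 'gzip'
print(f"compressratio ~ {len(sample) / len(compressed):.2f}x")
```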
Re: [zfs-discuss] Re: Need help making lsof work with ZFS
On Wed, Feb 14, 2007 at 01:56:33PM -0700, Matthew Ahrens wrote:
> These files are not shipped with Solaris 10. You can find them in
> OpenSolaris: usr/src/uts/common/fs/zfs/sys/
>
> The interfaces in these files are not supported, and may change without
> notice at any time.

Even if they're not supported, shouldn't the header files be shipped so
people can make sense of kernel data structure types?

Adam
Re: [zfs-discuss] zfs received vol not appearing on iscsi target list
On Sat, Feb 24, 2007 at 09:29:48PM +1300, Nicholas Lee wrote:
> I'm not really a Solaris expert, but I would have expected vol4 to appear
> on the iSCSI target list automatically. Is there a way to refresh the
> target list? Or is this a bug?

Hi Nicholas,

This is a bug either in ZFS or in the iSCSI target. Please file a bug.

Adam
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Tue, Mar 20, 2007 at 06:01:28PM -0400, Brian H. Nelson wrote:
> Why does this happen? Is it a bug? I know there is a recommendation of 20%
> free space for good performance, but that thought never occurred to me
> when this machine was set up (zvols only, no zfs proper).

It sounds like this bug:

  6430003 record size needs to affect zvol reservation size on RAID-Z

Adam
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Wed, Mar 21, 2007 at 01:36:10AM +0100, Robert Milkowski wrote:
> btw: I assume that the compression level will be hard-coded after all,
> right?

Nope. You'll be able to choose from gzip-N with N ranging from 1 to 9, just
like gzip(1).

Adam
[zfs-discuss] gzip compression support
I recently integrated this fix into ON:

  6536606 gzip compression for ZFS

With this, ZFS now supports gzip compression. To enable gzip compression,
just set the 'compression' property to 'gzip' (or 'gzip-N' where N=1..9).
Existing pools will need to be upgraded in order to use this feature, and,
yes, this is the second ZFS version number update this week. Recall that
once you've upgraded a pool, older software will no longer be able to access
it, regardless of whether you're using the gzip compression algorithm.

I did some very simple tests to look at relative size and time requirements:

  http://blogs.sun.com/ahl/entry/gzip_for_zfs_update

I've also asked Roch Bourbonnais and Richard Elling to do some more
extensive tests.

Adam

From zfs(1M):

  compression=on | off | lzjb | gzip | gzip-N

    Controls the compression algorithm used for this dataset. The lzjb
    compression algorithm is optimized for performance while providing
    decent data compression. Setting compression to on uses the lzjb
    compression algorithm. The gzip compression algorithm uses the same
    compression as the gzip(1) command. You can specify the gzip level by
    using the value gzip-N, where N is an integer from 1 (fastest) to 9
    (best compression ratio). Currently, gzip is equivalent to gzip-6
    (which is also the default for gzip(1)).

    This property can also be referred to by its shortened column name,
    compress.
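To get a feel for how the N in gzip-N trades time for size before setting it on a dataset, you can sweep zlib levels 1 through 9 over a sample of your data (zlib level N is the same deflate setting as gzip-N; the payload here is synthetic, so only the method carries over):

```python
import time
import zlib

payload = bytes(range(256)) * 512   # 128 KiB of highly repetitive data

sizes = {}
for level in range(1, 10):          # gzip-1 .. gzip-9
    start = time.perf_counter()
    sizes[level] = len(zlib.compress(payload, level))
    elapsed = time.perf_counter() - start
    print(f"gzip-{level}: {sizes[level]:6d} bytes, {elapsed * 1e3:6.2f} ms")
```

On most inputs the gains flatten out well before level 9, which is why gzip-6 is the usual default.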
Re: [zfs-discuss] gzip compression support
On Fri, Mar 23, 2007 at 11:41:21AM -0700, Rich Teer wrote:
> > I recently integrated this fix into ON:
> >
> >   6536606 gzip compression for ZFS
>
> Cool! Can you recall into which build it went?

I put it back yesterday, so it will be in build 62.

Adam
Re: [zfs-discuss] ZFS over iSCSI question
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
> > I'm in a way still hoping that it's an iSCSI-related problem, as
> > detecting dead hosts in a network can be a non-trivial problem and it
> > takes quite some time for TCP to time out and inform the upper layers.
> > Just a guess/hope here that FC-AL, etc. do better in this case.
>
> iscsi doesn't use TCP, does it? Anyway, the problem is really transport
> independent.

It does use TCP. Were you thinking of UDP?

Adam
Re: [zfs-discuss] Convert raidz
On Mon, Apr 02, 2007 at 12:37:24AM -0700, homerun wrote:
> Is it possible to convert a live 3-disk zpool from raidz to raidz2? And is
> it possible to add 1 new disk to a raidz configuration without backups and
> recreating the zpool from scratch?

The reason that's not possible is that RAID-Z uses a variable stripe width.
This solves some problems (notably the RAID-5 write hole [1]), but it means
that a given 'stripe' over N disks in a raidz1 configuration may contain as
many as floor(N/2) parity blocks -- clearly a single additional disk
wouldn't be sufficient to grow the stripe properly.

It would be possible to have a different type of RAID-Z where stripes were
variable-width to avoid the RAID-5 write hole, but the remainder of the
stripe was left unused. This would allow users to add an additional parity
disk (or several, if we ever implement further redundancy) to an existing
configuration, BUT would potentially make much less efficient use of
storage.

Adam

[1] http://blogs.sun.com/bonwick/entry/raid_z
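To see where the floor(N/2) figure comes from: under variable stripe width every block carries its own parity, so a row packed with the smallest possible blocks on raidz1 alternates parity and data sectors. A toy illustration of that packing (not the real allocator):

```python
def max_parity_in_row(ndisks, nparity=1):
    """Pack one raidz row with the smallest possible blocks: each block is
    `nparity` parity sectors plus one data sector. Returns how many of the
    row's sectors end up being parity."""
    per_block = nparity + 1            # sectors consumed by each tiny block
    blocks = ndisks // per_block       # how many such blocks fit in the row
    return blocks * nparity

# A row across 5 disks can hold floor(5/2) = 2 parity sectors on raidz1.
assert max_parity_in_row(5) == 5 // 2
```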
Re: [zfs-discuss] Setting up for zfsboot
On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:
> - RAID-Z is _very_ slow when one disk is broken.

Do you have data on this? The reconstruction should be relatively cheap,
especially when compared with the initial disk access.

Adam
Re: [zfs-discuss] Setting up for zfsboot
On Wed, Apr 04, 2007 at 11:04:06PM +0200, Robert Milkowski wrote:
> If I stop all activity to the x4500 with a pool made of several raidz2
> vdevs and then issue a spare attach, I get really poor performance
> (1-2MB/s) on a pool with lots of relatively small files.

Does that mean the spare is resilvering while you collect the performance
data? I think a fair test would be to compare the performance of a fully
functional RAID-Z stripe against one with a missing (absent) device.

Adam
Re: [zfs-discuss] ZFS and Linux
On Thu, Apr 12, 2007 at 06:59:45PM -0300, Toby Thain wrote:
> > Hey, then just don't *keep on* asking to relicense ZFS (and anything
> > else) to GPL.
>
> I never would. But it would be horrifying to imagine it relicensed to BSD.
> (Hello, Microsoft, you just got yourself a competitive filesystem.)

There's nothing today preventing Microsoft (or Apple) from sticking ZFS into
their OS. They'd just have to release the (minimal) diffs to ZFS-specific
files.

Adam
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
On Thu, May 03, 2007 at 11:43:49AM -0500, [EMAIL PROTECTED] wrote:
> I think this may be a premature leap -- it is still undetermined if we are
> running up against an as-yet-unknown bug in the kernel implementation of
> gzip used for this compression type. From my understanding, the gzip code
> has been reused from an older kernel implementation; it may be possible
> that this code has some issues with kernel stuttering when used for zfs
> compression that were not exposed by its original usage. If it turns out
> that it is just a case of a high-CPU trade-off for buying faster
> compression times, then the talk of a tunable may make sense (if it is
> even possible given the constraints of the gzip code in kernelspace).

The in-kernel version of zlib is the latest version (1.2.3). It's not
surprising that we're spending all of our time in zlib if the machine is
being driven by I/O. There are outstanding problems with compression in the
ZIO pipeline that may contribute to the bursty behavior.

Adam
Re: [zfs-discuss] iscsitadm local_name in ZFS
That would be a great RFE. Currently the iSCSI alias is the dataset name,
which should help with identification.

Adam

On Fri, May 04, 2007 at 02:02:34PM +0200, cedric briner wrote:
> cedric briner wrote:
> > hello dear community,
> >
> > Is there a way to have a ``local_name'', as defined in iscsitadm.1m,
> > when you shareiscsi a zvol? This would give an even easier way to
> > identify a device through its IQN.
> >
> > Ced.
>
> Okay, no reply from you, so... maybe I didn't make myself well understood.
> Let me try to re-explain what I mean: when you use a zvol and enable
> shareiscsi, could you add a suffix to the IQN (iSCSI Qualified Name)?
> This suffix would be given by me and would help me to identify which IQN
> corresponds to which zvol: it is just a more human-readable tag on an IQN.
>
> Similarly, this tag is also given when you use iscsitadm, and in the man
> page of iscsitadm it is called a local_name:
>
>   iscsitadm create target -b /dev/dsk/c0d0s5 tiger
>   iscsitadm create target -b /dev/dsk/c0d0s5 hd-1
>
> tiger and hd-1 are local_names.
>
> Ced.
> --
> Cedric BRINER
> Geneva - Switzerland
Re: [zfs-discuss] Re: ZFS over a layered driver interface
Try 'trace((int)arg1);' -- 4294967295 is the unsigned representation of -1. Adam On Mon, May 14, 2007 at 09:57:23AM -0700, Shweta Krishnan wrote: Thanks Eric and Manoj. Here's what ldi_get_size() returns: bash-3.00# dtrace -n 'fbt::ldi_get_size:return{trace(arg1);}' -c 'zpool create adsl-pool /dev/layerzfsminor1' dtrace: description 'fbt::ldi_get_size:return' matched 1 probe cannot create 'adsl-pool': invalid argument for this pool operation dtrace: pid 2582 has exited CPU ID FUNCTION:NAME 0 20927 ldi_get_size:return 4294967295 This is strange because I looked at the code for ldi_get_size() and the only possible return values in the code are DDI_SUCCESS (0) and DDI_FAILURE (-1). Looks like what I'm looking at either isn't the return value, or some bad address is being reached. Any hints? Thanks, Swetha. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
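The unsigned-versus-signed confusion Adam points out is worth spelling out: DDI_FAILURE (-1) stored in a 32-bit value reads back as 2^32 - 1 when interpreted as unsigned. A quick illustration (in Python, for brevity):

```python
# DDI_FAILURE is -1; masked to 32 bits it reads back as 2**32 - 1
# when interpreted as unsigned -- the 4294967295 in the dtrace output.
unsigned = -1 & 0xFFFFFFFF
print(unsigned)  # 4294967295

# Reinterpreting the value as signed recovers -1, which is what the
# (int) cast in 'trace((int)arg1);' accomplishes in the D script.
signed = unsigned - 2**32 if unsigned >= 2**31 else unsigned
print(signed)  # -1
```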
Re: [zfs-discuss] RFE: ISCSI alias when shareiscsi=on
Right now -- as I'm sure you have noticed -- we use the dataset name for the alias. To let users explicitly set the alias we could add a new property as you suggest or allow other options for the existing shareiscsi property: shareiscsi='alias=potato' This would sort of match what we do for the sharenfs property. Adam On Thu, May 24, 2007 at 02:39:24PM +0200, cedric briner wrote: Starting from this thread: http://www.opensolaris.org/jive/thread.jspa?messageID=118786#118786 I would love to have the possibility to set an iSCSI alias when doing a shareiscsi=on on ZFS. This would greatly facilitate identifying where an IQN is hosted. The iSCSI alias is defined in RFC 3721, e.g. http://www.apps.ietf.org/rfc/rfc3721.html#sec-2 and the CLI could be something like: zfs set shareiscsi=on shareiscsiname=iscsi_alias tank Ced. -- Cedric BRINER Geneva - Switzerland ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X Leopard to use ZFS
On Thu, Jun 07, 2007 at 08:38:10PM -0300, Toby Thain wrote: When should we expect Solaris kernel under OS X? 10.6? 10.7? :-) I'm sure Jonathan will be announcing that soon. ;-) Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: LZO compression?
Those are interesting results. Does this mean you've already written LZO support into ZFS? If not, that would be a great next step -- licensing issues can be sorted out later... Adam On Sat, Jun 16, 2007 at 04:40:48AM -0700, roland wrote: btw - is there some way to directly compare lzjb vs lzo compression - to see which performs better and uses less cpu? here are the numbers from my little benchmark:

            time         ratio
    lzo     6m39.603s    2.99x
    gzip    7m46.875s    3.41x
    lzjb    7m7.600s     1.79x

i'm just curious about these numbers - with lzo i got better speed and better compression in comparison to lzjb. nothing against lzjb compression - it's pretty nice - but why not take a closer look here? maybe there is some room for improvement. roland This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Mac OS X 10.5 read-only support for ZFS
On Sun, Jun 17, 2007 at 09:38:51PM -0700, Anton B. Rang wrote: Sector errors on DVD are not uncommon. Writing a DVD in ZFS format with duplicated data blocks would help protect against that problem, at the cost of 50% or so disk space. That sounds like a lot, but with BluRay etc. coming along, maybe paying a 50% penalty isn't too bad. (And if ZFS eventually supports RAID on a single disk, the penalty would be less.) It would be an interesting project to create some software that took a directory (or ZFS filesystem) to be written to a CD or DVD and optimized the layout for redundancy. That is, choose the compression method (if any), and then, in effect, partition the CD for RAID-Z or mirroring to stretch the data to fill the entire disc. It wouldn't necessarily be all that efficient to access, but it would give you resiliency against media errors. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to take advantage of PSARC 2007/171: ZFS Separate Intent Log
Flash SSDs typically boast a huge number of _read_ IOPS (thousands), but very few write IOPS (tens). The write throughput numbers quoted are almost certainly for non-synchronous writes whose latency can easily be in the millisecond range. STEC makes an interesting device which offers fast _synchronous_ writes on an SSD, but at a pretty steep cost. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal
This is a great idea. I'd like to add a couple of suggestions: It might be interesting to focus on compression algorithms which are optimized for particular workloads and data types, an Oracle database for example. It might be worthwhile to have some sort of adaptive compression whereby ZFS could choose a compression algorithm based on its detection of the type of data being stored. Adam On Thu, Jul 05, 2007 at 08:29:38PM -0300, Domingos Soares wrote: Below follows a proposal for a new opensolaris project. Of course, this is open to change since I just wrote down some ideas I had months ago, while researching the topic as a graduate student in Computer Science, and since I'm not an opensolaris/ZFS expert at all. I would really appreciate any suggestions or comments. PROJECT PROPOSAL: ZFS Compression Algorithms. The main purpose of this project is the development of new compression schemes for the ZFS file system. We plan to start with the development of a fast implementation of a Burrows-Wheeler Transform (BWT) based algorithm. BWT is an outstanding tool and the currently known lossless compression algorithms based on it outperform the compression ratio of algorithms derived from the well-known Ziv-Lempel algorithm, while being a little more time and space expensive. Therefore, there is space for improvement: recent results show that the running time and space needs of such algorithms can be significantly reduced, and the same results suggest that BWT is likely to become the new standard in compression algorithms[1]. Suffix sorting (i.e. the problem of sorting the suffixes of a given string) is the main bottleneck of BWT, and really significant progress has been made in this area since the first algorithms of Manber and Myers[2] and Larsson and Sadakane[3], notably the new linear-time algorithms of Karkkainen and Sanders[4]; Kim, Sim and Park[5]; and Ko and Aluru[6], and also the promising O(n log n) algorithm of Karkkainen and Burkhardt[7]. 
As a conjecture, we believe that some intrinsic properties of ZFS and file systems in general (e.g. sparseness and data entropy in blocks) could be exploited in order to produce brand new and really efficient compression algorithms, as well as the adaptation of existing ones to the task. The study might be extended to the analysis of data in specific applications (e.g. web servers, mail servers and others) in order to develop compression schemes for specific environments and/or modify the existing Ziv-Lempel based scheme to deal better with such environments. [1] The Burrows-Wheeler Transform: Theory and Practice. Manzini, Giovanni. Proc. 24th Int. Symposium on Mathematical Foundations of Computer Science [2] Suffix Arrays: A New Method for On-Line String Searches. Manber, Udi and Myers, Eugene W. SIAM Journal on Computing, Vol. 22 Issue 5. 1990 [3] Faster suffix sorting. Larsson, N Jasper and Sadakane, Kunihiko. Technical report, Department of Computer Science, Lund University, 1999 [4] Simple Linear Work Suffix Array Construction. Karkkainen, Juha and Sanders, Peter. Proc. 13th International Conference on Automata, Languages and Programming, 2003 [5] Linear-time construction of suffix arrays. D.K. Kim, J.S. Sim, H. Park, K. Park. CPM, LNCS, Vol. 2676, 2003 [6] Space efficient linear time construction of suffix arrays. P. Ko and S. Aluru. CPM 2003 [7] Fast Lightweight Suffix Array Construction and Checking. Burkhardt, Stefan and Kärkkäinen, Juha. 14th Annual Symposium, CPM 2003. Domingos Soares Neto University of Sao Paulo Institute of Mathematics and Statistics ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
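Since the proposal centers on the Burrows-Wheeler Transform, a minimal sketch may help readers unfamiliar with it. This naive version materializes and sorts every rotation -- the suffix-array constructions cited above ([2]-[7]) exist precisely to avoid that cost -- and assumes the input contains no NUL byte:

```python
def bwt(s: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations, take the last column."""
    s += sentinel  # unique terminator makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def ibwt(t: str, sentinel: str = "\0") -> str:
    """Invert the BWT by repeatedly prepending the transform and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

text = "banana"
transformed = bwt(text)
print(transformed.replace("\0", "$"))  # clusters of equal characters
assert ibwt(transformed) == text
```

The transform itself compresses nothing; its value is that it groups similar characters together, so a simple back-end coder (move-to-front plus run-length or entropy coding) becomes very effective.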
Re: [zfs-discuss] zfs vol issue?.
On Thu, Aug 16, 2007 at 05:20:25AM -0700, ramprakash wrote: #zfs mount -a 1. mounts c again. 2. but not vol1.. [ ie /dev/zvol/dsk/mytank/b/c does not contain vol1 ] Is this the normal behavior or is it a bug? That looks like a bug. Please file it. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirrored zpool across network
On Sun, Aug 19, 2007 at 05:45:18PM -0700, Mark wrote: Basically, the setup is a large volume of Hi-Def video being streamed from a camera onto an editing timeline. This will be written to a network share. Due to the large amounts of data, ZFS is a really good option for us. But we need a backup. We need to do it on generic hardware (I was thinking AMD64 with an array of large 7200rpm hard drives), and therefore I think I'm going to have one box mirroring the other box. They will be connected by gigabit ethernet. So my question is how do I mirror one raidz array across the network to the other? One big decision you need to make in this scenario is whether you want true synchronous replication or if asynchronous replication, possibly with some time bound, is acceptable. For the former, each byte must traverse the network before it is acknowledged to the client; for the latter, data is written locally and then transmitted shortly after that. Synchronous replication obviously imposes a much larger performance hit, but asynchronous replication means you may lose data over some recent period (but the data will always be consistent). Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote: And here are the results: RAIDZ: Number of READ requests: 4. Number of WRITE requests: 0. Number of bytes to transmit: 695678976. Number of processes: 8. Bytes per second: 1305213 Requests per second: 75 RAID5: Number of READ requests: 4. Number of WRITE requests: 0. Number of bytes to transmit: 695678976. Number of processes: 8. Bytes per second: 2749719 Requests per second: 158 I'm a bit surprised by these results. Assuming relatively large blocks written, RAID-Z and RAID-5 should be laid out on disk very similarly resulting in similar read performance. Did you compare the I/O characteristic of both? Was the bottleneck in the software or the hardware? Very interesting experiment... Adam -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On Wed, Nov 07, 2007 at 01:47:04PM -0800, can you guess? wrote: I do consider the RAID-Z design to be somewhat brain-damaged [...] How so? In my opinion, it seems like a cure for the brain damage of RAID-5. Adam -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote: How so? In my opinion, it seems like a cure for the brain damage of RAID-5. Nope. A decent RAID-5 hardware implementation has no 'write hole' to worry about, and one can make a software implementation similarly robust with some effort (e.g., by using a transaction log to protect the data-plus-parity double-update or by using COW mechanisms like ZFS's in a more intelligent manner). Can you reference a software RAID implementation which implements a solution to the write hole and performs well? My understanding (and this is based on what I've been told by people more knowledgeable in this domain than I am) is that software RAID has suffered from being unable to provide both correctness and acceptable performance. The part of RAID-Z that's brain-damaged is its concurrent-small-to-medium-sized-access performance (at least up to request sizes equal to the largest block size that ZFS supports, and arguably somewhat beyond that): while conventional RAID-5 can satisfy N+1 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in parallel (though the latter also take an extra rev to complete), RAID-Z can satisfy only one small-to-medium access request at a time (well, plus a smidge for read accesses if it doesn't verify the parity) - effectively providing RAID-3-style performance. Brain damage seems a bit of an alarmist label. While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all? Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days. 
The easiest way to fix ZFS's deficiency in this area would probably be to map each group of N blocks in a file as a stripe with its own parity - which would have the added benefit of removing any need to handle parity groups at the disk level (this would, incidentally, not be a bad idea to use for mirroring as well, if my impression is correct that there's a remnant of LVM-style internal management there). While this wouldn't allow use of parity RAID for very small files, in most installations they really don't occupy much space compared to that used by large files so this should not constitute a significant drawback. I don't really think this would be feasible given how ZFS is stratified today, but go ahead and prove me wrong: here are the instructions for bringing over a copy of the source code: http://www.opensolaris.org/os/community/tools/scm - ahl -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expanding a RAIDZ based Pool...
On Mon, Dec 10, 2007 at 03:59:22PM +, Karl Pielorz wrote: e.g. If I build a RAIDZ pool with 5 * 400Gb drives, and later add a 6th 400Gb drive to this pool, will its space instantly be available to volumes using that pool? (I can't quite see this working myself) Hi Karl, You can't currently expand the width of a RAID-Z stripe. It has been considered, but implementing that would require a fairly substantial change in the way RAID-Z works. Sun's current ZFS priorities are elsewhere, but there's nothing preventing an interested member of the community from undertaking this project... Adam -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz in zfs questions
2. in a raidz do all the disks have to be the same size? Disks don't have to be the same size, but only as much space will be used on the larger disks as is available on the smallest disk. In other words, there's no benefit to be gained from this approach. Related question: Does a raidz have to be either only full disks or only slices, or can it be mixed? E.g., can you do a 3-way raidz with 2 complete disks and one slice (of equal size to the disks) on a 3rd, larger, disk? Sure. One could do this, but it's kind of a hack. I imagine you'd like to do something like match a disk of size N with another disk of size 2N and use RAID-Z to turn them into a single vdev. At that point it's probably a better idea to build a striped vdev and use ditto blocks to do your data redundancy by setting copies=2. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
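The capacity rule described here -- each member contributes only as much as the smallest disk -- reduces to simple arithmetic. A sketch (a simplified model that ignores metadata, padding, and allocation overhead):

```python
def raidz_usable_bytes(disk_sizes, parity=1):
    """Approximate usable capacity of a RAID-Z vdev with mixed disk sizes.

    Every member contributes only as much space as the smallest disk,
    and `parity` disks' worth of that goes to parity.
    """
    n = len(disk_sizes)
    if n <= parity:
        raise ValueError("need more disks than parity devices")
    per_disk = min(disk_sizes)
    return per_disk * (n - parity)

GB = 10**9
# Two 500 GB disks plus a 1 TB disk: the extra 500 GB on the large
# disk is simply unused, so usable space is (3 - 1) * 500 GB.
print(raidz_usable_bytes([500 * GB, 500 * GB, 1000 * GB]) // GB)  # 1000
```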
Re: [zfs-discuss] Mixing RAIDZ and RAIDZ2 zvols in the same zpool
On Wed, Mar 12, 2008 at 09:59:53PM +0000, A Darren Dunham wrote: It's not *bad*, but as far as I'm concerned, it's wasted space. You have to deal with the pool as a whole as having single-disk redundancy for failure modes. So the fact that one section of it has two-disk redundancy doesn't give you anything in failure planning. And you can't assign filesystems or particular data to that vdev, so the added redundancy can't be concentrated anywhere. Well, one can imagine a situation where two different types of disks have different failure probabilities such that the same reliability could be garnered with one using single-parity RAID as with the other using double-parity RAID. That said, it would be a fairly uncommon scenario. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Per filesystem scrub
On Mar 31, 2008, at 10:41 AM, kristof wrote: I would be very happy having a filesystem-based zfs scrub. We have an 18TB zpool; it takes more than 2 days to do the scrub. Since we cannot take snapshots during the scrub, this is unacceptable. While per-dataset scrubbing would certainly be a coarse-grained solution to your problem, work is underway to address the problematic interaction between scrubs and snapshots. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Algorithm for expanding RAID-Z
After hearing many vehement requests for expanding RAID-Z vdevs, Matt Ahrens and I sat down a few weeks ago to figure out a mechanism that would work. While Sun isn't committing resources to implementing a solution, I've written up our ideas here: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z I'd encourage anyone interested in getting involved with ZFS development to take a look. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic ZFS maintenance?
On Mon, Apr 21, 2008 at 10:41:35AM +1200, Ian Collins wrote: Sam wrote: I have a 10x500 disc file server with ZFS+, do I need to perform any sort of periodic maintenance to the filesystem to keep it in tip top shape? No, but if there are problems, a periodic scrub will tip you off sooner rather than later. Well, tip you off _and_ correct the problems if possible. I believe a long- standing RFE has been to scrub periodically in the background to ensure that correctable problems don't turn into uncorrectable ones. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
On Fri, May 16, 2008 at 03:12:02PM -0700, Zlotnick Fred wrote: The issue with CIFS is not just complexity; it's the total amount of incompatible change in the kernel that we had to make in order to make the CIFS protocol a first class citizen in Solaris. This includes changes in the VFS layer which would break all S10 file systems. So in a very real sense CIFS simply cannot be backported to S10. However, the same arguments were made explaining the difficulty of backporting ZFS and GRUB boot to Solaris 10. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD reliability, wear levelling, warranty period
On Jun 11, 2008, at 1:16 AM, Al Hopper wrote: But... if you look broadly at the current SSD product offerings, you see: a) lower than expected performance - particularly in regard to write IOPS (I/O Ops per Second) True. Flash is quite asymmetric in its performance characteristics. That said, the L2ARC has been specially designed to play well with the natural strengths and weaknesses of flash. and b) warranty periods that are typically 1 year - with the (currently rare) exception of products that are offered with a 5 year warranty. You'll see a new class of SSDs -- eSSDs -- designed for the enterprise with longer warranties and more write/erase cycles. Further, ZFS will do its part by not killing the write/erase cycles of the L2ARC by constantly streaming as fast as possible. You should see lifetimes in the 3-5 year range on typical flash. Obviously, for SSD products to live up to the current marketing hype, they need to deliver superior performance and *reliability*. Everyone I know *wants* one or more SSD devices - but they also have the expectation that those devices will come with a warranty at least equivalent to current hard disk drives (3 or 5 years). I don't disagree entirely, but as a cache device flash actually can be fairly unreliable and we'll pick it up in ZFS. So ... I'm interested in learning from anyone on this list, and, in particular, from Team ZFS, what the reality is regarding SSD reliability. Obviously Sun employees are not going to compromise their employment and divulge upcoming product specific data - but there must be *some* data (white papers etc) in the public domain that would provide some relevant technical data?? A typical high-end SSD can sustain 100k write/erase cycles so you can do some simple math to see that a 128GB device written to at a rate of 150M/s will last nearly 3 years. 
Again, note that unreliable devices will result in a performance degradation when you fail a checksum in the L2ARC, but the data will still be valid out of the main storage pool. You're going to see much more on this in the next few months. I made a post to my blog that probably won't answer your questions directly, but may help inform you about what we have in mind. http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
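The back-of-the-envelope endurance math in the previous message is easy to check (a simplified model that assumes perfect wear leveling and continuous writing at the stated rate):

```python
def ssd_lifetime_years(capacity_bytes, cycles, write_rate_bps):
    """Years until the rated write/erase cycles are exhausted, assuming
    perfect wear leveling and sustained writes at write_rate_bps."""
    total_writable = capacity_bytes * cycles  # total bytes the device can absorb
    seconds = total_writable / write_rate_bps
    return seconds / (365 * 24 * 3600)

# The example from the message: a 128 GB device rated for 100k cycles,
# written to continuously at 150 MB/s.
years = ssd_lifetime_years(128 * 10**9, 100_000, 150 * 10**6)
print(f"{years:.1f} years")  # 2.7 years -- "nearly 3"
```

Real workloads rarely sustain anything close to a device's full write bandwidth, which is why practical lifetimes land comfortably in the 3-5 year range quoted above.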
Re: [zfs-discuss] SSD reliability, wear levelling, warranty period
On Wed, Jun 11, 2008 at 01:51:17PM -0500, Al Hopper wrote: I think that I'll (personally) avoid the initial rush-to-market comsumer level products by vendors with no track record of high tech software development - let alone those who probably can't afford the PhD level talent it takes to get the wear leveling algorithms correct - and then to implement them correctly. Instead I'll wait for a Sun product - from a company with a track record of proven design and *implementation* for enterprise level products (software and hardware). Wear leveling is actually a fairly mature technology. I'm more concerned with what will happen as people continue pushing these devices out of the consumer space and into the enterprise where stuff like failure modes and reliability matters in a completely different way. If my iPod sucks that's a hassle, but it's a different matter if an SSD hangs an I/O request on my enterprise system. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Ideal Setup: RAID-5, Areca, etc!
But, is there a performance boost with mirroring the drives? That is what I'm unsure of. Mirroring will provide a boost on reads, since the system can read from both sides of the mirror. It will not provide an increase on writes, since the system needs to wait for both halves of the mirror to finish. It could be slightly slower than a single raid5. That's not strictly correct. Mirroring will, in fact, deliver better IOPS for both reads and writes. For reads, as Brandon stated, mirroring will deliver better performance because it can distribute the reads between both devices. For writes, however, RAID-Z with an N+1 wide stripe will divide the data into N+1 chunks, and reads will need to access the N data chunks. This reduces the total IOPS by a factor of N+1 for writes and N for reads, whereas mirroring reduces the IOPS by a factor of 2 for writes and not at all for reads. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
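The IOPS factors in that argument can be sketched as a toy model (the per-disk number is made up for illustration; real pools also benefit from caching, prefetch, and write aggregation, none of which this captures):

```python
def pool_iops(per_disk_iops, disks, layout):
    """Random-IOPS model for a pool of `disks` identical drives.

    mirror: reads fan out across both halves of each pair; every
            write must land on both halves.
    raidz:  an N+1-wide stripe ties up all members per logical I/O,
            so the whole group behaves like one spindle.
    """
    if layout == "mirror":
        pairs = disks // 2
        return {"read": disks * per_disk_iops,   # each side serves reads
                "write": pairs * per_disk_iops}  # both halves per write
    if layout == "raidz":
        return {"read": per_disk_iops,           # full-stripe reads
                "write": per_disk_iops}          # full-stripe writes
    raise ValueError(f"unknown layout: {layout}")

# 6 disks at an assumed 150 random IOPS per spindle
print(pool_iops(150, 6, "mirror"))  # {'read': 900, 'write': 450}
print(pool_iops(150, 6, "raidz"))   # {'read': 150, 'write': 150}
```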
Re: [zfs-discuss] Which is better for root ZFS: mlc or slc SSD?
For a root device it doesn't matter that much. You're not going to be writing to the device at a high data rate so write/erase cycles don't factor much (SLC can sustain about a factor of 10 more). With MLC you'll get 2-4x the capacity for the same price, but again that doesn't matter much for a root device. Performance is typically a bit better with SLC -- especially on the write side -- but it's not such a huge difference. The reason you'd use a flash SSD for a boot device is power (with maybe a dash of performance), and either SLC or MLC will do just fine. Adam On Sep 24, 2008, at 11:41 AM, Erik Trimble wrote: I was under the impression that MLC is the preferred type of SSD, but I want to prevent myself from having a think-o. I'm looking to get (2) SSDs to use as my boot drive. It looks like I can get 32GB SSDs composed of either SLC or MLC for roughly equal pricing. Which would be the better technology? (I'll worry about rated access times/etc of the drives, I'm just wondering about general tech for an OS boot drive usage...) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)
So what are the downsides to this? If both nodes were to crash and I used the same technique to recreate the ramdisk I would lose any transactions in the slog at the time of the crash, but the physical disk image is still in a consistent state right (just not from my apps point of view)? You would lose transactions, but the pool would still reflect a consistent state. So is this idea completely crazy? On the contrary; it's very clever. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
On Nov 11, 2008, at 9:38 AM, Bryan Cantrill wrote: Just to throw some ice-cold water on this: 1. It's highly unlikely that we will ever support the x4500 -- only the x4540 is a real possibility. And to warm things up a bit: there's already an upgrade path from the x4500 to the x4540 so that would be required before any upgrade to the equivalent of the Sun Storage 7210. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
On Nov 11, 2008, at 10:41 AM, Brent Jones wrote: Wish I could get my hands on a beta of this GUI... Take a look at the VMware version that you can run on any machine: http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
Is this software available for people who already have thumpers? We're considering offering an upgrade path for people with existing thumpers. Given the feedback we've been hearing, it seems very likely that we will. No word yet on pricing or availability. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] continuous replication
On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote: That is _not_ active-active, that is active-passive. If you have a active-active system I can access the same data via both controllers at the same time. I can't if it works like you just described. You can't call it active-active just because different volumes are controlled by different controllers. Most active-passive RAID controllers can do that. The data sheet talks about active-active clusters, how does that work? What the Sun Storage 7000 Series does would more accurately be described as dual active-passive. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Storage 7000
On Mon, Nov 17, 2008 at 12:35:38PM -0600, Tim wrote: I'm not sure if this is the right place for the question or not, but I'll throw it out there anyways. Does anyone know, if you create your pool(s) with a system running fishworks, can that pool later be imported by a standard solaris system? IE: If for some reason the head running fishworks were to go away, could I attach the JBOD/disks to a system running snv/mainline solaris/whatever, and import the pool to get at the data? Or is the zfs underneath fishworks proprietary as well? Yes. The Sun Storage 7000 Series uses the same ZFS that's in OpenSolaris today. A pool created on the appliance could potentially be imported on an OpenSolaris system; that is, of course, not explicitly supported in the service contract. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Storage 7000
Would be interesting to hear more about how Fishworks differs from OpenSolaris, what build it is based on, what package mechanism you are using (IPS already?), and other differences... I'm sure these details will be examined in the coming weeks on the blogs of members of the Fishworks team. Keep an eye on blogs.sun.com/fishworks. A little off topic: Do you know when the SSDs used in the Storage 7000 will be available for the rest of us? I don't think they will be, but it will be possible to purchase them as replacement parts. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Storage 7000
On Tue, Nov 18, 2008 at 09:09:07AM -0800, Andre Lue wrote: Is the web interface on the appliance available for download or will it make it to opensolaris sometime in the near future? It's not, and it's unlikely to make it to OpenSolaris. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Comparison between the S-TEC Zeus and the Intel X25-E ??
The Intel part does about a fourth as many synchronous write IOPS at best. Adam On Jan 16, 2009, at 5:34 PM, Erik Trimble wrote: I'm looking at the newly-orderable (via Sun) STEC Zeus SSDs, and they're outrageously priced. http://www.stec-inc.com/product/zeusssd.php I just looked at the Intel X25-E series, and they look comparable in performance. At about 20% of the cost. http://www.intel.com/design/flash/nand/extreme/index.htm Can anyone enlighten me as to any possible difference between an STEC Zeus and an Intel X25-E ? I mean, other than those associated with the fact that you can't get the Intel one orderable through Sun right now. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
Right, which is an absolutely piss poor design decision and why every major storage vendor right-sizes drives. What happens if I have an old maxtor drive in my pool whose 500g is just slightly larger than every other mfg on the market? You know, the one who is no longer making their own drives since being purchased by seagate. I can't replace the drive anymore? *GREAT*. Sun does right-size our drives. Are we talking about replacing a device bought from Sun with another device bought from Sun? If these are just drives that fell off the back of some truck, you may not have that assurance. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks in each RAIDZ group
The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups. Odd that the Sun Unified Storage 7000 products do not allow you to control this, it appears to put all the hdd's into one group. At least on the 7110 we are evaluating there is no control to allow multiple groups/different raid types. Our experience has shown that that initial guess of 3-9 per parity device was surprisingly narrow. We see similar performance out to much wider stripes which, of course, offer the user more usable capacity. We don't allow you to manually set the RAID stripe widths on the 7000 series boxes because frankly the stripe width is an implementation detail. If you want the best performance, choose mirroring; capacity, double-parity RAID; for something in the middle, we offer 3+1 single-parity RAID. Other than that you're micro-optimizing for gains that would hardly be measurable given the architecture of the Hybrid Storage Pool. Recall that unlike other products in the same space, we get our IOPS from flash rather than from a bazillion spindles spinning at 15,000 RPM. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
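As a back-of-the-envelope illustration of the capacity trade-off described above, here is a sketch of the usable-capacity fraction for a few redundancy profiles. The group sizes are examples chosen for illustration, not the appliance's actual layouts.

```python
# Usable-capacity fraction of one RAID group; group sizes below are
# illustrative examples, not the 7000 series' actual stripe layouts.
def usable_fraction(data_disks, parity_disks):
    """Fraction of raw capacity available for data."""
    return data_disks / (data_disks + parity_disks)

print(usable_fraction(1, 1))   # mirroring: 0.5
print(usable_fraction(3, 1))   # 3+1 single-parity RAID: 0.75
print(usable_fraction(9, 2))   # a wider double-parity stripe, e.g. 9+2: ~0.818
```

Wider stripes give the user more usable capacity, which is the point made above about the 3-9 guess being surprisingly narrow.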
Re: [zfs-discuss] replace same sized disk fails with too small error
Since it's done in software by HDS, NetApp, and EMC, that's complete bullshit. Forcing people to spend 3x the money for a Sun drive that's identical to the seagate OEM version is also bullshit and a piss-poor answer. I didn't know that HDS, NetApp, and EMC all allow users to replace their drives with stuff they've bought at Fry's. Is this still covered by their service plan or would this only be in an unsupported config? Thanks. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
Since it's done in software by HDS, NetApp, and EMC, that's complete bullshit. Forcing people to spend 3x the money for a Sun drive that's identical to the seagate OEM version is also bullshit and a piss-poor answer. I didn't know that HDS, NetApp, and EMC all allow users to replace their drives with stuff they've bought at Fry's. Is this still covered by their service plan or would this only be in an unsupported config? So because an enterprise vendor requires you to use their drives in their array, suddenly zfs can't right-size? Vendor requirements have absolutely nothing to do with their right-sizing, and everything to do with them wanting your money. Sorry, I must have missed your point. I thought that you were saying that HDS, NetApp, and EMC had a different model. Were you merely saying that the software in those vendors' products operates differently than ZFS? Are you telling me zfs is deficient to the point it can't handle basic right-sizing like a 15$ sata raid adapter? How do their $15 SATA RAID adapters solve the problem? The more details you can provide, the better, obviously. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks in each RAIDZ group
BWAHAHAHAHA. That's a good one. You don't need to setup your raid, that's micro-managing, we'll do that. Remember that one time when I talked about limiting snapshots to protect a user from themselves, and you joined into the fray of people calling me a troll? I don't remember this, but I don't doubt it. Can you feel the irony oozing out between your lips, or are you completely oblivious to it? The irony would be that on one hand I object to artificial limitations to business-critical features while on the other hand I think that users don't need to tweak settings that add complexity and little to no value? They seem very different to me, so I suppose the answer to your question is: no I cannot feel the irony oozing out between my lips, and yes I'm oblivious to the same. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 01:35:22PM -0600, Tim wrote: Are you telling me zfs is deficient to the point it can't handle basic right-sizing like a 15$ sata raid adapter? How do their $15 SATA RAID adapters solve the problem? The more details you can provide, the better, obviously. They short-stroke the disk so that when you buy a new 500GB drive that isn't the exact same number of blocks you aren't screwed. It's a design choice to be both sane, and to make the end-user's life easier. You know, sort of like you not letting people choose their raid layout... Drive vendors, it would seem, have an incentive to make their 500GB drives as small as possible. Should ZFS then choose some amount of padding at the end of each device and chop it off as insurance against a slightly smaller drive? How much of the device should it chop off? Conversely, should users have the option to use the full extent of the drives they've paid for, say, if they're using a vendor that already provides that guarantee? You know, sort of like you not letting people choose their raid layout... Yes, I'm not saying it shouldn't be done. I'm asking what the right answer might be. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
And again, I say take a look at the market today, figure out a percentage, and call it done. I don't think you'll find a lot of users crying foul over losing 1% of their drive space when they don't already cry foul over the false advertising that is drive sizes today. Perhaps it's quaint, but 5GB still seems like a lot to me to throw away. In any case, you might as well can ZFS entirely because it's not really fair that users are losing disk space to raid and metadata... see where this argument is going? Well, I see where this _specious_ argument is going. I have two disks in one of my systems... both maxtor 500GB drives, purchased at the same time shortly after the buyout. One is a rebadged Seagate, one is a true, made in China Maxtor. Different block numbers... same model drive, purchased at the same time. Wasn't zfs supposed to be about using software to make up for deficiencies in hardware? It would seem this request is exactly that... That's a fair point, and I do encourage you to file an RFE, but a) Sun has already solved this problem in a different way as a company with our products and b) users already have the ability to right-size drives. Perhaps a better solution would be to handle the procedure of replacing a disk with a slightly smaller one by migrating data and then treating the extant disks as slightly smaller as well. This would have the advantage of being far more dynamic and of only applying the space tax in situations where it actually applies. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
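The "chop off a percentage" idea being debated amounts to something like the sketch below. The 1% reserve is the figure from the discussion and the 512-byte sector is an assumption for illustration; this is not anything ZFS actually does.

```python
# Sketch of right-sizing: reserve a fixed percentage at the end of each
# device so a replacement drive that is a few blocks smaller still fits.
# The 1% default is from the discussion above, not from any ZFS source.
SECTOR = 512

def right_sized_sectors(raw_bytes, reserve_pct=1):
    """Usable sectors after chopping off a safety margin."""
    usable_bytes = raw_bytes * (100 - reserve_pct) // 100
    return usable_bytes // SECTOR

# A nominal 500GB drive with a 1% reserve.
print(right_sized_sectors(500 * 10**9))   # 966796875 sectors
```

The 5GB given up on that 500GB drive is exactly the cost Adam calls quaint but real.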
Re: [zfs-discuss] SSD - slow down with age
On Feb 14, 2009, at 12:45 PM, Nicholas Lee wrote: A useful article about long term use of the Intel SSD X25-M: http://www.pcper.com/article.php?aid=669 - Long-term performance analysis of Intel Mainstream SSDs. Would a zfs cache (ZIL or ARC) based on a SSD device see this kind of issue? Maybe a periodic scrub via a full disk erase would be a useful process. Indeed SSDs can have certain properties that would cause their performance to degrade over time. We've seen this to varying degrees with different devices we've tested in our lab. We're working on adapting our use of SSDs with ZFS as a ZIL device, an L2ARC device, and eventually as primary storage. We'll first focus on the specific SSDs we certify for use in our general purpose servers and the Sun Storage 7000 series, and help influence the industry to move to standards that we can then use. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS 15K drives as L2ARC
After all this discussion, I am not sure if anyone adequately answered the original poster's question as to whether a 2540 with SAS 15K drives would provide substantial synchronous write throughput improvement when used as an L2ARC device. I was under the impression that the L2ARC was to speed up reads, as it allows things to be cached on something faster than disks (usually MLC SSDs). Offloading the ZIL is what handles synchronous writes, isn't it? How would adding an L2ARC speed up writes? You're absolutely right. The L2ARC is for accelerating reads only and will not affect write performance. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110 questions
On Thu, Jun 18, 2009 at 11:51:44AM -0400, Dan Pritts wrote: I'm curious about a couple things that would be unsupported. Specifically, whether they are not supported if they have specifically been crippled in the software. We have not crippled the software in any way, but we have designed an appliance with some specific uses. Doing things from the Solaris shell by hand may damage your system and void your support contract. 1) SSD's I can imagine buying an intel SSD, slotting it into the 7110, and using it as a ZFS L2ARC (? i mean the equivalent of readzilla) That's not supported, it won't work easily, and if you get it working you'll be out of luck if you have a problem. 2) expandability I can imagine buying a SAS card and a JBOD and hooking it up to the 7110; it has plenty of PCI slots. Ditto. finally, one question - I presume that I need to devote a pair of disks to the OS, so I really only get 14 disks for data. Correct? That's right. We market the 7110 as either 2TB = 146GB x 14 or 4.2TB = 300GB x 14 raw capacity. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110 questions
Hey Lawrence, Make sure you're running the latest software update. Note that this forum is not the appropriate place to discuss support issues. Please contact your official Sun support channel. Adam On Thu, Jun 18, 2009 at 12:06:02PM -0700, lawrence ho wrote: We have a 7110 on try and buy program. We tried using the 7110 with XEN Server 5 over iSCSI and NFS. Nothing seems to solve the slow write problem. Within the VM, we observed around 8MB/s on writes. Read performance is fantastic. Some troubleshooting was done with local SUN rep. The conclusion is that 7110 does not have write cache in forms of SSD or controller DRAM write cache. The solution from SUN is to buy StorageTek or 7000 series model with SSD write cache. Adam, please advise if there are any fixes for 7110. I am still shopping for SAN and would rather buy a 7110 than a StorageTek or something else. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] triple-parity: RAID-Z3
Hey Bob, MTTDL analysis shows that given normal environmental conditions, the MTTDL of RAID-Z2 is already much longer than the life of the computer or the attendant human. Of course sometimes one encounters unusual conditions where additional redundancy is desired. To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver? I do think that it is worthwhile to be able to add another parity disk to an existing raidz vdev but I don't know how much work that entails. It entails a bunch of work: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z Matt Ahrens is working on a key component after which it should all be possible. Zfs development seems to be overwhelmed with marketing-driven requirements lately and it is time to get back to brass tacks and make sure that the parts already developed are truly enterprise-grade. While I don't disagree that the focus for ZFS should be ensuring enterprise-class reliability and performance, let me assure you that requirements are driven by the market and not by marketing. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
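The question about a second failure during resilver can be sketched with simple probability. The 3% annualized failure rate (AFR) below is an assumed figure for illustration, not a number from the thread; the 24-hour window is the one mentioned above.

```python
# Back-of-the-envelope sketch: given an assumed per-drive annualized
# failure rate (AFR) -- the 3% is illustrative, not from the thread --
# what are the odds that at least one surviving drive fails during a
# 24-hour resilver window?
def p_second_failure(n_remaining, afr=0.03, window_hours=24):
    # Per-drive probability of failing within the window.
    p_one = afr * window_hours / (365 * 24)
    # Probability that at least one of n_remaining drives fails.
    return 1 - (1 - p_one) ** n_remaining

# A group that lost one of ten drives: nine survivors at risk.
print(p_second_failure(9))
```

Note this ignores latent bit errors, which the message above argues are already the more likely way to lose data during a long resilver.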
Re: [zfs-discuss] triple-parity: RAID-Z3
which gap? 'RAID-Z should mind the gap on writes' ? I believe this is in reference to the raid 5 write hole, described here: http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance It's not. So I'm not sure what the 'RAID-Z should mind the gap on writes' comment is getting at either. Clarification? I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly. Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about. Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks. I hope that's clear; if it's not, stay tuned for the aforementioned blog post. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
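The rounding just described can be sketched in a few lines. This is an illustration of the arithmetic, not the actual ZFS allocation code; parity here is computed per row of data columns, and the total is rounded up to a multiple of nparity + 1.

```python
import math

# Sketch of RAID-Z allocation rounding (not the actual ZFS code):
# the total of data + parity sectors is rounded up to a multiple of
# nparity + 1, and the remainder shows up as "skipped" sectors.
def raidz_allocation(data_sectors, ndisks, nparity):
    """Return (total sectors allocated, skipped sectors)."""
    ndata = ndisks - nparity
    parity = math.ceil(data_sectors / ndata) * nparity
    used = data_sectors + parity
    total = math.ceil(used / (nparity + 1)) * (nparity + 1)
    return total, total - used

# On a 3-disk raidz1, a 2-sector write needs 2 data + 1 parity = 3
# sectors, rounded up to 4 -- one skipped sector.
print(raidz_allocation(2, 3, 1))   # (4, 1)
```

Small synchronous writes hit this rounding constantly, which is why they interact badly with write aggregation.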
Re: [zfs-discuss] triple-parity: RAID-Z3
Don't hear about triple-parity RAID that often: Author: Adam Leventhal Repository: /hg/onnv/onnv-gate Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651 Total changesets: 1 Log message: 6854612 triple-parity RAID-Z http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612 (Via Blog O' Matty.) Would be curious to see performance characteristics. I just blogged about triple-parity RAID-Z (raidz3): http://blogs.sun.com/ahl/entry/triple_parity_raid_z As for performance, on the system I was using (a max config Sun Storage 7410), I saw about a 25% improvement to 1GB/s for a streaming write workload. YMMV, but I'd be interested in hearing your results. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] triple-parity: RAID-Z3
Don't hear about triple-parity RAID that often: I agree completely. In fact, I have wondered (probably in these forums), why we don't bite the bullet and make a generic raidzN, where N is any number >= 0. I agree, but raidzN isn't simple to implement and it's potentially difficult to get it to perform well. That said, it's something I intend to bring to ZFS in the next year or so. If memory serves, the second parity is calculated using Reed-Solomon which implies that any number of parity devices is possible. True; it's a degenerate case. In fact, get rid of mirroring, because it clearly is a variant of raidz with two devices. Want three way mirroring? Call that raidz2 with three devices. The truth is that a generic raidzN would roll up everything: striping, mirroring, parity raid, double parity, etc. into a single format with one parameter. That's an interesting thought, but there are some advantages to calling out mirroring for example as its own vdev type. As has been pointed out, reading from either side of the mirror involves no computation whereas reading from a RAID-Z 1+2 for example would involve more computation. This would complicate the calculus of balancing read operations over the mirror devices. Let's not stop there, though. Once we have any number of parity devices, why can't I add a parity device to an array? That should be simple enough with a scrub to set the parity. In fact, what is to stop me from removing a parity device? Once again, I think the code would make this rather easy. With RAID-Z stripes can be of variable width meaning that, say, a single row in a 4+2 configuration might have two stripes of 1+2. In other words, there might not be enough space in the new parity device. I did write up the steps that would be needed to support RAID-Z expansion; you can find it here: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z Ok, back to the real world. 
The one downside to triple parity is that I recall the code discovered the corrupt block by excluding it from the stripe, reconstructing the stripe and comparing that with the checksum. In other words, for a given cost of X to compute a stripe and a number P of corrupt blocks, the cost of reading a stripe is approximately X^P. More corrupt blocks would radically slow down the system. With raidz2, the maximum number of corrupt blocks would be two, putting a cap on how costly the read can be. Computing the additional parity of triple-parity RAID-Z is slightly more expensive, but not much -- it's just bitwise operations. Recovering from a read failure is identical (and performs identically) to raidz1 or raidz2 until you actually have sustained three failures. In that case, performance is slower as more computation is involved -- but aren't you just happy to get your data back? If there is silent data corruption, then and only then can you encounter the O(n^3) algorithm that you alluded to, but only as a last resort. If we don't know what drives failed, we try to reconstruct your data by assuming that one drive, then two drives, then three drives are returning bad data. For raidz1, this was a linear operation; raidz2, quadratic; now raidz3 is N-cubed. There's really no way around it. Fortunately with proper scrubbing encountering data corruption in one stripe on three different drives is highly unlikely. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
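The linear/quadratic/cubic growth just described comes from trying combinations of candidate bad drives. A small sketch of the worst-case count, for illustration only:

```python
from math import comb

# Sketch of the combinatorial search described above: when the corrupt
# drives aren't known, reconstruction tries every combination of 1, then
# 2, then up to nparity candidate bad drives until the checksum matches.
def reconstruction_attempts(nchildren, nparity):
    """Worst-case number of drive combinations to try."""
    return sum(comb(nchildren, nbad) for nbad in range(1, nparity + 1))

print(reconstruction_attempts(9, 1))   # 9: linear in the stripe width
print(reconstruction_attempts(9, 2))   # 45: adds C(9,2) = 36
print(reconstruction_attempts(9, 3))   # 129: adds C(9,3) = 84
```

As noted above, this worst case is only reached on silent corruption, as a last resort.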
Re: [zfs-discuss] triple-parity: RAID-Z3
Robert, On Fri, Jul 24, 2009 at 12:59:01AM +0100, Robert Milkowski wrote: To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver? I wish it was so good with raid-zN. In real life, at least from my experience, it can take several days to resilver a disk for vdevs in raid-z2 made of 11x sata disk drives with real data. While the way zfs synchronizes data is way faster under some circumstances it is also much slower under others. IIRC some builds ago there were some fixes integrated so maybe it is different now. Absolutely. I was talking more or less about optimal timing. I realize that due to the priorities within ZFS and real world loads that it can take far longer. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD (SLC) for cache...
My question is about SSD, and the differences between using SLC for readzillas instead of MLC. Sun uses MLCs for Readzillas for their 7000 series. I would think that if SLCs (which are generally more expensive) were really needed, they would be used. That's not entirely accurate. In the 7410 and 7310 today (the members of the Sun Storage 7000 series that support Readzilla) we use SLC SSDs. We're exploring the use of MLC. Perhaps someone on the Fishworks team could give more details, but going by what I've read and seen, MLCs should be sufficient for the L2ARC. Save your money. That's our assessment, but it's highly dependent on the specific characteristics of the MLC NAND itself, the SSD controller, and, of course, the workload. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Hey Gary, There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience. Adam On Aug 25, 2009, at 5:29 AM, Gary Gendel wrote: I have a 5-500GB disk Raid-Z pool that has been producing checksum errors right after upgrading SXCE to build 121. They seem to be randomly occurring on all 5 disks, so it doesn't look like a disk failure situation. Repeatedly running a scrub on the pool randomly repairs between 20 and a few hundred checksum errors. Since I hadn't physically touched the machine, it seems a very strong coincidence that it started right after I upgraded to 121. This machine is a SunFire v20z with a Marvell SATA 8-port controller (the same one as in the original thumper). I've seen this kind of problem way back around build 40-50 ish, but haven't seen it after that until now. Anyone else experiencing this problem or knows how to isolate the problem definitively? Thanks, Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?
Will BP rewrite allow adding a drive to raidz1 to get raidz2? And what is the status of BP rewrite? Far away? Not started yet? Planning? BP rewrite is an important component technology, but there's a bunch beyond that. It's not a high priority right now for us at Sun. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?
Hi David, BP rewrite is an important component technology, but there's a bunch beyond that. It's not a high priority right now for us at Sun. What's the bug / RFE number for it? (So those of us with contracts can add a request for it.) I don't have the number handy, but while it might be satisfying to add another request for it, Matt is already cranking on it as fast as he can and more requests for it are likely to have the opposite of the intended effect. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Hi James, After investigating this problem a bit I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124. Apologies for the inconvenience. Adam On Aug 28, 2009, at 8:20 PM, James Lever wrote: On 28/08/2009, at 3:23 AM, Adam Leventhal wrote: There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience. Are the errors being generated likely to cause any significant problem running 121 with a RAID-Z volume or should users of RAID-Z* wait until this issue is resolved? cheers, James -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Hey Bob, I have seen few people more prone to unsubstantiated conjecture than you. The raidz checksum code was recently reworked to add raidz3. It seems likely that a subtle bug was added at that time. That appears to be the case. I'm investigating the problem and hope to have an update to the list either later today or tomorrow. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110: Would it self upgrade the system zpool?
Hi Trevor, We intentionally install the system pool with an old ZFS version and don't provide the ability to upgrade. We don't need or use (or even expose) any of the features of the newer versions so using a newer version would only create problems rolling back to earlier releases. Adam On Sep 2, 2009, at 7:01 PM, Trevor Pretty wrote: Just Curious The 7110 I've on loan has an old zpool. I *assume* because it's been upgraded and it gives me the ability to downgrade. Anybody know if I delete the old version of Amber Road whether the pool would then upgrade (I don't want to do it as I want to show the up/downgrade feature). OS pool:- pool: system state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. And yes I may have invalidated my support. If you have a 7000 box don't ask me how to access the system like this, you can see the warning. Remember I've a loan box and am just being nosey, a sort of looking under the bonnet and going OOOHHH an engine, but being too scared to even pull the dip stick :-)

+---------------------------------------------------------------------------+
| You are entering the operating system shell. By confirming this action in |
| the appliance shell you have agreed that THIS ACTION MAY VOID ANY SUPPORT |
| AGREEMENT. If you do not agree to this -- or do not otherwise understand  |
| what you are doing -- you should type exit at the shell prompt. EVERY     |
| COMMAND THAT YOU EXECUTE HERE IS AUDITED, and support personnel may use   |
| this audit trail to substantiate invalidating your support contract. The  |
| operating system shell is NOT a supported mechanism for managing this     |
| appliance, and COMMANDS EXECUTED HERE MAY DO IRREPARABLE HARM.            |
|                                                                           |
| NOTHING SHOULD BE ATTEMPTED HERE BY UNTRAINED SUPPORT PERSONNEL UNDER ANY |
| CIRCUMSTANCES. This appliance is a non-traditional operating system       |
| environment, and expertise in a traditional operating system environment  |
| in NO WAY constitutes training for supporting this appliance. THOSE WITH  |
| EXPERTISE IN OTHER SYSTEMS -- HOWEVER SUPERFICIALLY SIMILAR -- ARE MORE   |
| LIKELY TO MISTAKENLY EXECUTE OPERATIONS HERE THAT WILL DO IRREPARABLE     |
| HARM. Unless you have been explicitly trained on supporting this          |
| appliance via the operating system shell, you should immediately return   |
| to the appliance shell.                                                   |
|                                                                           |
| Type exit now to return to the appliance shell.                           |
+---------------------------------------------------------------------------+

Trevor www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
 ___ ___ ___
|   |   |   |        P = parity
| P | D | D |  LBAs  D = data
|___|___|___|   |    X = skipped sector
|   |   |   |   |
| X | P | D |   v
|___|___|___|
|   |   |   |
| D | X |   |
|___|___|___|

The logic for the optional IOs effectively (though not literally) in this case would fill in the next LBA on the disk with a 0:

 ___ ___ ___
|   |   |   |        P = parity
| P | D | D |  LBAs  D = data
|___|___|___|   |    X = skipped sector
|   |   |   |   |    0 = zero-data from aggregation
| 0 | P | D |   v
|___|___|___|
|   |   |   |
| D | X |   |
|___|___|___|

We can see the problem when the parity undergoes the swap described above:

    disks
  0   1   2
 ___ ___ ___
|   |   |   |        P = parity
| D | P | D |  LBAs  D = data
|___|___|___|   |    X = skipped sector
|   |   |   |   |    0 = zero-data from aggregation
| X | 0 | P |   v
|___|___|___|
|   |   |   |
| D | X |   |
|___|___|___|

Note that the 0 incorrectly is also swapped, thus inadvertently overwriting a data sector in the subsequent stripe. This only occurs if there is IO aggregation, making it much more likely with small, synchronous IOs. It's also only possible with an odd number (N) of child vdevs since, to induce the problem, the size of the data written must consume a multiple of N-1 sectors _and_ the total number of sectors used for data and parity must be odd (to create the need for a skipped sector). The number of data sectors is simply size / 512 and the number of parity sectors is ceil(size / 512 / (N-1)).

  1) size / 512 = K * (N-1)
  2) size / 512 + ceil(size / 512 / (N-1)) is odd

therefore

  K * (N-1) + K = K * N is odd

If N is even, K * N cannot be odd and therefore the situation cannot arise. If N is odd, it is possible to satisfy (1) and (2). -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
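Conditions (1) and (2) above can be checked by brute force. This is a sketch of the parity argument only, not the RAID-Z code itself:

```python
import math

# Brute-force check of the argument above: with N child vdevs and single
# parity, a data + parity total that is odd (requiring a skipped sector)
# can only arise when N is odd.
def needs_skipped_sector(data_sectors, n):
    """True if data + parity sectors total an odd count."""
    parity = math.ceil(data_sectors / (n - 1))
    return (data_sectors + parity) % 2 == 1

for n in (3, 4, 5, 6, 7, 8):
    # Condition (1): data must be a multiple of N-1 sectors.
    possible = any(needs_skipped_sector(k * (n - 1), n) for k in range(1, 21))
    print(n, possible)   # True only for odd N
```

This matches the conclusion above: even child counts cannot hit the bug, odd ones can.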
Re: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hey Simon,

Thanks for the info on this. Some people, including myself, reported seeing checksum errors within mirrors too. Is it considered that these checksum errors within mirrors could also be related to this bug, or is there another bug related to checksum errors within mirrors that I should take a look at?

Absolutely not. That is an unrelated issue. This problem is isolated to RAID-Z.

And good luck with the fix for build 124. Are we talking days or weeks for the fix to be available, do you think? :)

Days or hours.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] RAIDZ versus mirrored
On Thu, Sep 17, 2009 at 01:32:43PM +0200, Eugen Leitl wrote:

reasons), you will lose 2 disks worth of storage to parity, leaving 12 disks worth of data. With raid10 you will lose half, 7 disks to parity/redundancy. With two raidz2 sets, you will get (5+2)+(5+2), that is 5+5 disks worth of storage and 2+2 disks worth of redundancy. The actual redundancy/parity is spread over all disks, not like raid3 which has a dedicated parity disk.

So raidz3 has a dedicated parity disk? I couldn't see that from skimming http://blogs.sun.com/ahl/entry/triple_parity_raid_z

Note that Tomas was talking about RAID-3, not raidz3. To summarize the RAID levels:

  RAID-0  striping
  RAID-1  mirror
  RAID-2  ECC (basically not used)
  RAID-3  bit-interleaved parity (basically not used)
  RAID-4  block-interleaved parity
  RAID-5  block-interleaved distributed parity
  RAID-6  block-interleaved double distributed parity

raidz1 is most like RAID-5; raidz2 is most like RAID-6. There's no RAID level that covers more than two parity disks; raidz3 is most like RAID-6, but with triple distributed parity.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] Checksums
On Fri, Oct 23, 2009 at 06:55:41PM -0500, Tim Cook wrote:

So, from what I gather, even though the documentation appears to state otherwise, default checksums have been changed to SHA256. Making that assumption, I have two questions.

That's false. The default checksum has changed from fletcher2 to fletcher4; that is to say, the definition of the value of 'on' has changed.

First, is the default updated from fletcher2 to SHA256 automatically for a pool that was created with an older version of zfs and then upgraded to the latest? Second, would all of the blocks be re-checksummed with a zfs send/receive on the receiving side?

As with all property changes, new writes get the new properties. Old data is not rewritten.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] Checksums
Thank you for the correction. My next question is, do you happen to know what the overhead difference between fletcher4 and SHA256 is? Is the checksumming multi-threaded in nature? I know my fileserver has a lot of spare cpu cycles, but it would be good to know if I'm going to take a substantial hit in throughput moving from one to the other.

Tim,

That all really depends on your specific system and workload. As with any performance-related matter, experimentation is vital for making your final decision.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] zfs code and fishworks fork
With that said I'm concerned that there appears to be a fork between the opensource version of ZFS and ZFS that is part of the Sun/Oracle FishWorks 7nnn series appliances. I understand (implicitly) that Sun (/Oracle) as a commercial concern, is free to choose their own priorities in terms of how they use their own IP (Intellectual Property) -- in this case, the source for the ZFS filesystem.

Hey Al,

I'm unaware of specific plans for management either at Sun or at Oracle, but from an engineering perspective suffice it to say that it is simpler and therefore more cost effective to develop for a single, unified code base, to amortize the cost of testing those modifications, and to leverage the enthusiastic ZFS community to assist with the development and testing of ZFS. Again, this isn't official policy, just the simple facts on the ground from engineering.

I'm not sure what would lead you to believe that there is a fork between the open source / OpenSolaris ZFS and what we have in Fishworks. Indeed, we've made efforts to make sure there is a single ZFS for the reason stated above. Any differences that exist are quickly migrated to ON, as you can see from the consistent work of Eric Schrock.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] will deduplication know about old blocks?
Hi Kjetil,

Unfortunately, dedup will only apply to data written after the setting is enabled. That also means that new blocks cannot dedup against old blocks regardless of how they were written. There is therefore no way to prepare your pool for dedup -- you just have to enable it when you have the new bits.

Adam

On Dec 9, 2009, at 3:40 AM, Kjetil Torgrim Homme wrote:

I'm planning to try out deduplication in the near future, but started wondering if I can prepare for it on my servers. One thing which struck me was that I should change the checksum algorithm to sha256 as soon as possible. But I wonder -- is that sufficient? Will the dedup code know about old blocks when I store new data? Let's say I have an existing file img0.jpg. I turn on dedup, and copy it twice, to img0a.jpg and img0b.jpg. Will all three files refer to the same block(s), or will only img0a and img0b share blocks?

-- Kjetil T. Homme, Redpill Linpro AS - Changing the game

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] will deduplication know about old blocks?
What happens if you snapshot, send, destroy, recreate (with dedup on this time around) and then write the contents of the cloned snapshot to the various places in the pool -- which properties are in the ascendancy here? The host pool or the contents of the clone? The host pool, I assume, because clone contents are (in this scenario) just some new data?

The dedup property applies to all writes, so the settings for the pool of origin don't matter -- just those on the destination pool.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings
Hi Giridhar,

The size reported by ls can include things like holes in the file. What space usage does the zfs(1M) command report for the filesystem?

Adam

On Dec 16, 2009, at 10:33 PM, Giridhar K R wrote:

Hi, Reposting as I have not gotten any response. Here is the issue. I created a zpool with 64k recordsize and enabled dedup on it:

  zpool create -O recordsize=64k TestPool device1
  zfs set dedup=on TestPool

I copied files onto this pool over nfs from a windows client. Here is the output of zpool list:

  NAME       SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH  ALTROOT
  TestPool   696G  19.1G   677G    2%  1.13x  ONLINE  -

I ran ls -l /TestPool and saw the total size reported as 51,193,782,290 bytes. The alloc size reported by zpool along with the DEDUP of 1.13x does not add up to 51,193,782,290 bytes. According to the DEDUP (dedup ratio) the amount of data copied is 21.58G (19.1G * 1.13). Here is the output from zdb -DD:

  DDT-sha256-zap-duplicate: 33536 entries, size 272 on disk, 140 in core
  DDT-sha256-zap-unique: 278241 entries, size 274 on disk, 142 in core

  DDT histogram (aggregated over all DDTs):

  bucket             allocated                       referenced
  ______  ______________________________  ______________________________
  refcnt  blocks  LSIZE   PSIZE   DSIZE   blocks  LSIZE   PSIZE   DSIZE
  ------  ------  -----   -----   -----   ------  -----   -----   -----
       1    272K  17.0G   17.0G   17.0G     272K  17.0G   17.0G   17.0G
       2   32.7K  2.05G   2.05G   2.05G    65.6K  4.10G   4.10G   4.10G
       4      15   960K    960K    960K       71  4.44M   4.44M   4.44M
       8       4   256K    256K    256K       53  3.31M   3.31M   3.31M
      16       1    64K     64K     64K       16     1M      1M      1M
     512       1    64K     64K     64K      854  53.4M   53.4M   53.4M
      1K       1    64K     64K     64K    1.08K  69.1M   69.1M   69.1M
      4K       1    64K     64K     64K    5.33K   341M    341M    341M
   Total    304K  19.0G   19.0G   19.0G     345K  21.5G   21.5G   21.5G

  dedup = 1.13, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.13

Am I missing something? Your inputs are much appreciated.
Thanks, Giri

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings
Thanks for the response, Adam. Are you talking about zfs list? It displays 19.6 as allocated space. What does ZFS treat as a hole, and how does it identify one?

ZFS will compress blocks of zeros down to nothing and treat them like sparse files. 19.6 is pretty close to your computed value. Does your pool happen to be 10+1 RAID-Z?

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
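For what it's worth, the Total row of the zdb -DD histogram already accounts for the reported ratio; a quick sketch of the arithmetic (numbers copied from that output):

```python
# Totals from the DDT histogram: space actually allocated after dedup
# (Total, PSIZE) vs. space referenced by all block pointers (Total, LSIZE).
alloc_gb = 19.0
refd_gb = 21.5

# The dedup ratio is simply referenced over allocated.
ratio = refd_gb / alloc_gb
print(f"dedup = {ratio:.2f}")   # 1.13, matching zpool list and zdb -DD
```

The gap between this 21.5G and the ~47.7 GiB that ls reported is what the hole/sparse-file question above is probing at.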
Re: [zfs-discuss] raidz data loss stories?
Hey James,

Personally, I think mirroring is safer (and 3-way mirroring) than raidz/z2/5. All my boot-from-zfs systems have 3-way mirrored root/usr/var disks (using 9 disks) but all my data partitions are 2-way mirrors (usually 8 disks or more and a spare.)

Double-parity (or triple-parity) RAID is certainly more resilient against some failure modes than 2-way mirroring. For example, bit errors can arise at a certain rate from disks. In the case of a disk failure in a mirror, it's possible to encounter a bit error such that data is lost. I recently wrote an article for ACM Queue that examines recent trends in hard drives and makes the case for triple-parity RAID. It's at least peripherally relevant to this conversation:

http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
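To put rough numbers on the bit-error concern, here is a back-of-the-envelope sketch (the 1-in-10^15 unrecoverable-error rate is an illustrative spec-sheet figure I'm assuming, not a measurement of any particular drive):

```python
def p_error_during_resilver(capacity_bytes, bits_per_error=1e15):
    """Chance of hitting at least one unrecoverable bit error while
    reading the entire surviving half of a 2-way mirror during a
    resilver, assuming independent per-bit errors."""
    bits = capacity_bytes * 8
    per_bit = 1.0 / bits_per_error
    return 1.0 - (1.0 - per_bit) ** bits

# Reading a full 1 TB mirror half: roughly a 0.8% chance per resilver.
# The risk grows with drive capacity, which is the article's point.
print(p_error_during_resilver(1e12))
```

With double or triple parity, a single such error during reconstruction is still correctable, which is what makes the extra parity worth its cost as drives get larger.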
Re: [zfs-discuss] raidz data loss stories?
Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.

That's not entirely true, is it?

* RAIDZ is RAID5 + checksum + COW
* RAIDZ2 is RAID6 + checksum + COW
* A stack of mirror vdevs is RAID10 + checksum + COW

Others have noted that RAID-Z isn't really the same as RAID-5, and RAID-Z2 isn't the same as RAID-6, because RAID-5 and RAID-6 define not just the number of parity disks (which would have made far more sense in my mind) but also include in the definition a notion of how the data and parity are laid out. The RAID levels were used to describe groupings of existing implementations and conflate things like the number of parity devices with, say, how parity is distributed across devices. For example, RAID-Z1 lays out data most like RAID-3 (a single block is carved up and spread across many disks) but distributes parity as RAID-5 requires, though in a different manner. It's an unfortunate state of affairs, which is why further RAID levels should identify only the most salient aspect (the number of parity devices), or we should use unambiguous terms like single-parity and double-parity RAID.

If we can compare apples and oranges, would your recommendation (use raidz2 and/or raidz3) be the same when comparing to a mirror with the same number of drives? In other words, does a 2-drive mirror compare to raidz1 the same as a 3-drive mirror compares to raidz2 and a 4-drive mirror compares to raidz3? If you were an enterprise (in other words, cared about perf), why would you ever use raidz instead of throwing more drives at the problem and doing mirroring with identical parity?

You're right that a mirror is a degenerate form of raidz1, for example, but mirrors allow for specific optimizations. While the redundancy would be the same, the performance would not.
Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] raidz stripe size (not stripe width)
Hi Brad,

RAID-Z will carve up the 8K block into chunks at the granularity of the sector size -- today 512 bytes, but soon going to 4K. In this case a 9-disk RAID-Z vdev will look like this:

  | P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
  | P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

1K per device, with an additional 1K for parity.

Adam

On Jan 4, 2010, at 3:17 PM, Brad wrote:

If an 8K file system block is written on a 9-disk raidz vdev, how is the data distributed (written) between all devices in the vdev, since a zfs write is one continuous IO operation? Is it distributed evenly (1.125KB) per device?

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
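The carving above can be sketched as arithmetic (a simplified model of the layout shown; it ignores the skipped-sector padding discussed elsewhere on this list):

```python
import math

def raidz_chunks(block_bytes, n_disks, nparity=1, sector=512):
    """How raidz splits one logical block: rows of n_disks columns,
    each row holding nparity parity sectors plus data sectors."""
    data_sectors = block_bytes // sector         # 8K / 512 = 16
    data_cols = n_disks - nparity                # 8 data columns per row
    rows = math.ceil(data_sectors / data_cols)   # 2 rows
    parity_sectors = rows * nparity              # 2 parity sectors
    return rows, data_sectors, parity_sectors

# 2 rows of 16 data sectors (1K per data disk) plus 2 parity sectors (1K)
print(raidz_chunks(8192, 9))   # (2, 16, 2)
```

So the data is spread at sector granularity rather than in nine equal 1.125KB slices, which is what the question was getting at.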
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Hey Chris,

The DDRdrive X1 OpenSolaris device driver is now complete; please join us in our first-ever ZFS Intent Log (ZIL) beta test program. A select number of X1s are available for loan; preferred candidates would have a validation background and/or a true passion for torturing new hardware/drivers :-) We are singularly focused on the ZIL device market, so a test environment bound by synchronous writes is required. The beta program will provide extensive technical support and a unique opportunity to have direct interaction with the product designers.

Congratulations! This is great news for ZFS. I'll be very interested to see the results members of the community can get with your device as part of their pool. COMSTAR iSCSI performance should be dramatically improved in particular.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl