Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 20.10.2012 22:24, Tim Cook wrote:
On Sat, Oct 20, 2012 at 2:54 AM, Arne Jansen <sensi...@gmx.net> wrote:
On 10/20/2012 01:10 AM, Tim Cook wrote:
On Fri, Oct 19, 2012 at 3:46 PM, Arne Jansen <sensi...@gmx.net> wrote:
On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen <sensi...@gmx.net> wrote:

We have finished a beta version of the feature. A webrev for it can be found here: http://cr.illumos.org/~webrev/sensille/fits-send/ It adds a command 'zfs fits-send'. The resulting streams can currently only be received on btrfs, but more receivers will follow. It would be great if anyone interested could give it some testing and/or review. If there are no objections, I'll send a formal webrev soon.

Please don't bother changing libzfs (and proliferating the copypasta there) -- do it like lzc_send().

ok. It would be easier though if zfs_send would also already use the new style. Is it in the pipeline already?

Likewise, zfs_ioc_fits_send should use the new-style API. See the comment at the beginning of zfs_ioctl.c. I'm not a fan of the name FITS, but I suppose somebody else already named the format. If we are going to follow someone else's format, though, it at least needs to be well-documented. Where can we find the documentation?

FYI, the #1 Google hit for FITS: http://en.wikipedia.org/wiki/FITS #3 hit: http://code.google.com/p/fits/ Both have to do with file formats. The entire first page of Google results for "FITS format" and "FITS file format" is related to these two formats. "FITS btrfs" didn't return anything specific to the file format, either.

It's not too late to change it, but I have a hard time coming up with some better name. Also, the format is still very new and I'm sure it'll need some adjustments. -arne

--matt

I'm sure we can come up with something. Are you planning on this being solely for ZFS, or a larger architecture for replication in both directions in the future?

We have senders for zfs and btrfs. The planned receiver will be mostly filesystem agnostic and can work on a much broader range. It basically only needs to know how to create snapshots and where to store a bit of meta information. It would be great if more filesystems would join on the sending side, but I have no involvement there. I see no basic problem in choosing a name that's already in use. Especially with file extensions, most will already be taken. How about something with 'portable' and 'backup', like pib or pibs? 'i' for incremental. -Arne

Re-using names generally isn't a big deal, but in this case the existing name is a technology that's extremely similar to what you're doing - which WILL cause a ton of confusion in the userbase, and make troubleshooting far more difficult when searching google/etc. looking for links to documents that are applicable. Maybe something like far - filesystem agnostic replication?

I like that one. It has a nice connotation of 'remote'. So 'far' it be. Thanks! -Arne

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 22.10.2012 06:32, Matthew Ahrens wrote:
On Sat, Oct 20, 2012 at 1:24 PM, Tim Cook <t...@cook.ms> wrote:
On Sat, Oct 20, 2012 at 2:54 AM, Arne Jansen <sensi...@gmx.net> wrote:
On 10/20/2012 01:10 AM, Tim Cook wrote:
On Fri, Oct 19, 2012 at 3:46 PM, Arne Jansen <sensi...@gmx.net> wrote:
On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen <sensi...@gmx.net> wrote:

We have finished a beta version of the feature. A webrev for it can be found here: http://cr.illumos.org/~webrev/sensille/fits-send/ It adds a command 'zfs fits-send'. The resulting streams can currently only be received on btrfs, but more receivers will follow. It would be great if anyone interested could give it some testing and/or review. If there are no objections, I'll send a formal webrev soon.

Please don't bother changing libzfs (and proliferating the copypasta there) -- do it like lzc_send().

ok. It would be easier though if zfs_send would also already use the new style. Is it in the pipeline already?

Likewise, zfs_ioc_fits_send should use the new-style API. See the comment at the beginning of zfs_ioctl.c. I'm not a fan of the name FITS, but I suppose somebody else already named the format. If we are going to follow someone else's format, though, it at least needs to be well-documented. Where can we find the documentation?

FYI, the #1 Google hit for FITS: http://en.wikipedia.org/wiki/FITS #3 hit: http://code.google.com/p/fits/ Both have to do with file formats. The entire first page of Google results for "FITS format" and "FITS file format" is related to these two formats. "FITS btrfs" didn't return anything specific to the file format, either.

It's not too late to change it, but I have a hard time coming up with some better name. Also, the format is still very new and I'm sure it'll need some adjustments. -arne

--matt

I'm sure we can come up with something. Are you planning on this being solely for ZFS, or a larger architecture for replication in both directions in the future?

We have senders for zfs and btrfs. The planned receiver will be mostly filesystem agnostic and can work on a much broader range. It basically only needs to know how to create snapshots and where to store a bit of meta information. It would be great if more filesystems would join on the sending side, but I have no involvement there. I see no basic problem in choosing a name that's already in use. Especially with file extensions, most will already be taken. How about something with 'portable' and 'backup', like pib or pibs? 'i' for incremental. -Arne

Re-using names generally isn't a big deal, but in this case the existing name is a technology that's extremely similar to what you're doing - which WILL cause a ton of confusion in the userbase, and make troubleshooting far more difficult when searching google/etc. looking for links to documents that are applicable. Maybe something like far - filesystem agnostic replication?

All else being equal, I like this name (FAR). It ends in AR like several other archive formats (TAR, WAR, JAR). Plus, not a lot of false positives when googling around for it. However, if compatibility with the existing format is an explicit goal, we should use the same name, and the btrfs authors may be averse to changing the name.

There's really nothing to keep. In the btrfs world, like in the zfs world, the stream has no special name; it's just a 'btrfs send stream', like the 'zfs send stream'. The necessity for a name only arises from the wish to build a bridge between the worlds. The author
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 10/20/2012 01:10 AM, Tim Cook wrote:
On Fri, Oct 19, 2012 at 3:46 PM, Arne Jansen <sensi...@gmx.net> wrote:
On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen <sensi...@gmx.net> wrote:

We have finished a beta version of the feature. A webrev for it can be found here: http://cr.illumos.org/~webrev/sensille/fits-send/ It adds a command 'zfs fits-send'. The resulting streams can currently only be received on btrfs, but more receivers will follow. It would be great if anyone interested could give it some testing and/or review. If there are no objections, I'll send a formal webrev soon.

Please don't bother changing libzfs (and proliferating the copypasta there) -- do it like lzc_send().

ok. It would be easier though if zfs_send would also already use the new style. Is it in the pipeline already?

Likewise, zfs_ioc_fits_send should use the new-style API. See the comment at the beginning of zfs_ioctl.c. I'm not a fan of the name FITS, but I suppose somebody else already named the format. If we are going to follow someone else's format, though, it at least needs to be well-documented. Where can we find the documentation?

FYI, the #1 Google hit for FITS: http://en.wikipedia.org/wiki/FITS #3 hit: http://code.google.com/p/fits/ Both have to do with file formats. The entire first page of Google results for "FITS format" and "FITS file format" is related to these two formats. "FITS btrfs" didn't return anything specific to the file format, either.

It's not too late to change it, but I have a hard time coming up with some better name. Also, the format is still very new and I'm sure it'll need some adjustments. -arne

--matt

I'm sure we can come up with something. Are you planning on this being solely for ZFS, or a larger architecture for replication in both directions in the future?

We have senders for zfs and btrfs. The planned receiver will be mostly filesystem agnostic and can work on a much broader range. It basically only needs to know how to create snapshots and where to store a bit of meta information. It would be great if more filesystems would join on the sending side, but I have no involvement there. I see no basic problem in choosing a name that's already in use. Especially with file extensions, most will already be taken. How about something with 'portable' and 'backup', like pib or pibs? 'i' for incremental. -Arne

--Tim
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 10/20/2012 01:21 AM, Matthew Ahrens wrote:
On Fri, Oct 19, 2012 at 1:46 PM, Arne Jansen <sensi...@gmx.net> wrote:
On 10/19/2012 09:58 PM, Matthew Ahrens wrote:

Please don't bother changing libzfs (and proliferating the copypasta there) -- do it like lzc_send().

ok. It would be easier though if zfs_send would also already use the new style. Is it in the pipeline already? Likewise, zfs_ioc_fits_send should use the new-style API. See the comment at the beginning of zfs_ioctl.c.

I'm saying to use lzc_send() as an example, rather than zfs_send(). lzc_send() already uses the new style. I don't see how your job would be made easier by converting zfs_send().

Yeah, but the zfs util still uses the old version.

It would be nice to convert ZFS_IOC_SEND to the new IOCTL format someday, but I don't think that the complexities of zfs_send() would be appropriate for libzfs_core. Programmatic consumers typically know exactly what snapshots they want sent and would prefer the clean error handling of lzc_send().

What I meant was: if you want the full-blown zfs send functionality with the ton of options, it would be much easier to reuse the existing logic and only call *_send_fits instead of *_send when requested. If you're content with just the -i option I've currently implemented, it's certainly easy to convert. I for my part have mostly programmatic consumers. -Arne

--matt
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 19.10.2012 10:47, Joerg Schilling wrote:
Arne Jansen <sensi...@gmx.net> wrote:
On 10/18/2012 10:19 PM, Andrew Gabriel wrote:
Arne Jansen wrote: We have finished a beta version of the feature.

What does FITS stand for?

Filesystem Incremental Transport Stream (or Filesystem Independent Transport Stream)

Is this an attempt to create a competitor to TAR?

Not really. We'd have preferred tar if it had been powerful enough. It's more an alternative to rsync for incremental updates. I really like the send/receive feature and want to make it available for cross-platform syncs. Arne

Jörg
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 19.10.2012 11:16, Irek Szczesniak wrote:
On Wed, Oct 17, 2012 at 2:29 PM, Arne Jansen <sensi...@gmx.net> wrote:

We have finished a beta version of the feature. A webrev for it can be found here: http://cr.illumos.org/~webrev/sensille/fits-send/ It adds a command 'zfs fits-send'. The resulting streams can currently only be received on btrfs, but more receivers will follow. It would be great if anyone interested could give it some testing and/or review. If there are no objections, I'll send a formal webrev soon.

Why are you trying to reinvent the wheel? AFAIK some tar versions and AT&T AST pax support deltas based on a standard (I'll have to dig out the exact specification, but from looking at it you did double work).

I haven't done the research myself, but the result was that pax would have needed significant extension; I don't have the details, though. If you dig out a format already in use that supports everything we need (like sharing data between files, needed for btrfs reflinks), it should be easy to change the format. Stuffing the data into a specific format is not an essential part of the work and can be changed with a limited amount of effort. -Arne

Irek
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 19.10.2012 12:17, Joerg Schilling wrote:
Arne Jansen <sensi...@gmx.net> wrote:

Is this an attempt to create a competitor to TAR?

Not really. We'd have preferred tar if it had been powerful enough. It's more an alternative to rsync for incremental updates. I really like the send/receive feature and want to make it available for cross-platform syncs.

TAR with the star extensions, which are also implemented by many other recent TAR programs, should do. What are you missing?

As said, I've not done the research myself, but operations that come to mind include:
- partial updates of files
- sparse files
- punch hole
- truncate
- rename
- referencing parts of other files as the data to write (reflinks)
- create snapshot

Does star support these operations? Are they part of any standard? Also, are chmod/chown/set atime/mtime possible on existing files? I'd be happy to find a well-maintained receiver/format to send the data in; this way I wouldn't need to care about a cross-platform receiver :) -Arne

Jörg
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 19.10.2012 13:53, Joerg Schilling wrote:
Arne Jansen <sensi...@gmx.net> wrote:
On 19.10.2012 12:17, Joerg Schilling wrote:
Arne Jansen <sensi...@gmx.net> wrote:

Is this an attempt to create a competitor to TAR?

Not really. We'd have preferred tar if it had been powerful enough. It's more an alternative to rsync for incremental updates. I really like the send/receive feature and want to make it available for cross-platform syncs.

TAR with the star extensions, which are also implemented by many other recent TAR programs, should do. What are you missing?

As said, I've not done the research myself, but operations that come to mind include:

- partial updates of files

How do you intend to detect this _after_ the original file was updated?

The patch is a kernel-side patch. It detects changes by efficiently comparing zfs snapshots. If you change a block in a file, we'll send only the changed block.

- sparse files

Supported in an efficient way by star.

- punch hole

As this is a specific case of a sparse file, it could be added.

While 'sparse file' meant the initial transport of a sparse file, 'punch hole' is meant in the incremental context. Linux supports punching holes in files.

- truncate

See above.

- rename

Part of the incremental restore architecture from star, but needs a restore symbol table on the receiving system.

Here, the sending side determines renames.

- referencing parts of other files as the data to write (reflinks)

There is no user-space interface to detect this; why do you need it?

As above, it is detected in the kernel. In the case of btrfs, each data block has back-references to all its users. In zfs the problem does not exist in this form; nevertheless, a common format has to be able to transport this.

- create snapshot

star supports incrementals. Or do you mean that a snapshot should be set up on the receiving site?

A snapshot on the receiving side. The aim is to exactly replicate a number of snapshots, just as zfs send/receive does.

Alexander Block (the author of btrfs send/receive) commented in another subthread and explained the motivation to move away from tar/pax. -Arne

Does star support these operations? Are they part of any standard? Also, are chmod/chown/set atime/mtime possible on existing files?

star allows one to call: star -x -xmeta to _only_ extract metadata from a normal tar archive, and it allows one to create a specific metadata-only archive via star -c -meta

Jörg
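The sender logic discussed above - comparing two snapshots in the kernel and emitting only the changed pieces - can be pictured with a toy user-space model. Here plain dicts mapping path to contents stand in for snapshots, and the opcode tuples are invented for illustration; this is not the actual FITS/FAR record format:

```python
def diff_snapshots(old, new):
    """Derive an incremental command stream from two snapshots,
    modeled as dicts of path -> file contents. The real sender walks
    the on-disk block trees; this only mirrors the kinds of records
    (partial write, truncate, create, unlink) the stream must carry."""
    cmds = []
    for path in sorted(set(old) | set(new)):
        if path not in old:
            cmds.append(("create", path))
            cmds.append(("write", path, 0, new[path]))
        elif path not in new:
            cmds.append(("unlink", path))
        elif old[path] != new[path]:
            a, b = old[path], new[path]
            if len(b) < len(a):
                cmds.append(("truncate", path, len(b)))
            n = min(len(a), len(b))
            # first offset where the two versions differ
            start = next((i for i in range(n) if a[i] != b[i]), n)
            if b[start:]:
                # send only the changed region, not the whole file
                cmds.append(("write", path, start, b[start:]))
    cmds.append(("snapshot",))  # recreate the snapshot on the receiver
    return cmds

old = {"a": b"hello world"}
new = {"a": b"hello there", "b": b"x"}
print(diff_snapshots(old, new))
```

A real implementation compares block pointers rather than bytes, which is what makes the kernel-side approach efficient; the byte scan here is only for demonstration.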
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 10/19/2012 09:58 PM, Matthew Ahrens wrote:
On Wed, Oct 17, 2012 at 5:29 AM, Arne Jansen <sensi...@gmx.net> wrote:

We have finished a beta version of the feature. A webrev for it can be found here: http://cr.illumos.org/~webrev/sensille/fits-send/ It adds a command 'zfs fits-send'. The resulting streams can currently only be received on btrfs, but more receivers will follow. It would be great if anyone interested could give it some testing and/or review. If there are no objections, I'll send a formal webrev soon.

Please don't bother changing libzfs (and proliferating the copypasta there) -- do it like lzc_send().

ok. It would be easier though if zfs_send would also already use the new style. Is it in the pipeline already?

Likewise, zfs_ioc_fits_send should use the new-style API. See the comment at the beginning of zfs_ioctl.c. I'm not a fan of the name FITS, but I suppose somebody else already named the format. If we are going to follow someone else's format, though, it at least needs to be well-documented. Where can we find the documentation?

FYI, the #1 Google hit for FITS: http://en.wikipedia.org/wiki/FITS #3 hit: http://code.google.com/p/fits/ Both have to do with file formats. The entire first page of Google results for "FITS format" and "FITS file format" is related to these two formats. "FITS btrfs" didn't return anything specific to the file format, either.

It's not too late to change it, but I have a hard time coming up with some better name. Also, the format is still very new and I'm sure it'll need some adjustments. -arne

--matt
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
On 10/18/2012 10:19 PM, Andrew Gabriel wrote:
Arne Jansen wrote: We have finished a beta version of the feature.

What does FITS stand for?

Filesystem Incremental Transport Stream (or Filesystem Independent Transport Stream)
Re: [zfs-discuss] [zfs] portable zfs send streams (preview webrev)
We have finished a beta version of the feature. A webrev for it can be found here: http://cr.illumos.org/~webrev/sensille/fits-send/

It adds a command 'zfs fits-send'. The resulting streams can currently only be received on btrfs, but more receivers will follow. It would be great if anyone interested could give it some testing and/or review. If there are no objections, I'll send a formal webrev soon. Thanks, Arne

On 10.10.2012 21:38, Arne Jansen wrote:

Hi, we're currently working on a feature to send zfs streams in a portable format that can be received on any filesystem. It is a kernel patch that generates the stream directly from the kernel, analogous to what zfs send does. The stream format is based on the btrfs send format. The basic idea is to just send commands like mkdir, unlink, create, write, etc. For incremental sends it's the same idea. The receiving side is user mode only, so it's very easy to port it to any system. If the receiving side has the capability to create snapshots, those from the sending side will get recreated. If not, you still have the benefit of fast incremental transfers.

My question is whether there's any interest in this feature and whether it has a chance of getting accepted. Also, would it be acceptable to start with a working version and add performance optimizations later on? -Arne
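The command-stream idea described above is easy to picture with a toy receiver. A minimal user-mode sketch, where the opcodes and tuple encoding are invented for illustration (the real stream is a binary format defined by btrfs send, not shown here):

```python
import os
import tempfile

def replay(stream, root):
    """Apply a stream of filesystem commands below `root`. Because it
    only needs ordinary syscalls, such a receiver is portable to any
    filesystem - the core point of a user-mode receiving side."""
    for op, *args in stream:
        if op == "mkdir":
            os.mkdir(os.path.join(root, args[0]))
        elif op == "create":
            open(os.path.join(root, args[0]), "wb").close()
        elif op == "write":  # (path, offset, data): partial file update
            with open(os.path.join(root, args[0]), "r+b") as f:
                f.seek(args[1])
                f.write(args[2])
        elif op == "unlink":
            os.unlink(os.path.join(root, args[0]))
        elif op == "rename":
            os.rename(os.path.join(root, args[0]),
                      os.path.join(root, args[1]))
        else:
            raise ValueError("unknown opcode: %s" % op)

root = tempfile.mkdtemp()
replay([("mkdir", "d"),
        ("create", "d/a"),
        ("write", "d/a", 0, b"hello"),
        ("rename", "d/a", "d/b")], root)
with open(os.path.join(root, "d/b"), "rb") as f:
    print(f.read())  # b'hello'
```

A receiver that also knows how to create snapshots (zfs, btrfs) can additionally honor a "snapshot" record; on plain filesystems that record would simply be skipped, which matches the fallback behavior described in the proposal.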
[zfs-discuss] NFS performance near zero on a very full pool
Hi,

currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal.

What I found out so far: it seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write can take up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero.

It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be. What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8. Thanks, Arne
Re: [zfs-discuss] NFS performance near zero on a very full pool
Hi Neil,

Neil Perrin wrote: NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge.

I think this is not what we saw, for two reasons: a) we have a mirrored slog device, and according to zpool iostat -v only 16MB out of 4GB were in use; b) it didn't seem like the txg was being closed early. Rather, it kept approximately the 30-second intervals.

Internally we came up with a different explanation, without any backing that it might be correct: when the pool reaches 96%, zfs goes into a 'self defense' mode. Instead of allocating a block for the ZIL, every write turns synchronous and has to wait for the txg to finish naturally. The reasoning behind this might be that even if a ZIL device is available, there might not be enough space left to commit the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above 96%. While this might be proper for small pools, on large pools 4% is still some TB of free space, so there should be an upper limit of maybe 10GB on this hidden reserve. Also, this sudden switch of behavior is completely unexpected and at least under-documented.

Most (maybe all?) file systems perform badly when out of space. I believe we give a recommended free size, and I thought it was 90%.

In this situation, not only writes suffered; as a side effect, reads also came to a nearly complete halt. -- Arne

Neil.

On 09/09/10 09:00, Arne Jansen wrote: Hi, currently I'm trying to debug a very strange phenomenon on a nearly full pool (96%). Here are the symptoms: over NFS, a find on the pool takes a very long time, up to 30s (!) for each file. Locally, the performance is quite normal. What I found out so far: it seems that every nfs write (rfs3_write) blocks until the txg is flushed. This means a write can take up to 30 seconds. During this time, the nfs calls block, occupying all NFS server threads. With all server threads blocked, all other OPs (LOOKUP, GETATTR, ...) have to wait until the writes finish, bringing the performance of the server effectively down to zero. It may be that the trigger for this behavior is around 95%. I managed to bring the pool down to 95%, and now the writes get served continuously, as they should be. What is the explanation for this behaviour? Is it intentional, and can the threshold be tuned? I experienced this on Sol10 U8. Thanks, Arne
Re: [zfs-discuss] NFS performance near zero on a very full pool
Richard Elling wrote:
On Sep 9, 2010, at 10:09 AM, Arne Jansen wrote:

Hi Neil, Neil Perrin wrote: NFS often demands its transactions are stable before returning. This forces ZFS to do the system call synchronously. Usually the ZIL (code) allocates and writes a new block in the intent log chain to achieve this. If it ever fails to allocate a block (of the size requested), it is forced to close the txg containing the system call. Yes, this can be extremely slow, but there is no other option for the ZIL. I'm surprised the wait is 30 seconds. I would expect much less, but finding room for the rest of the txg data and metadata would also be a challenge.

I think this is not what we saw, for two reasons: a) we have a mirrored slog device, and according to zpool iostat -v only 16MB out of 4GB were in use; b) it didn't seem like the txg was being closed early. Rather, it kept approximately the 30-second intervals. Internally we came up with a different explanation, without any backing that it might be correct: when the pool reaches 96%, zfs goes into a 'self defense' mode. Instead of allocating a block for the ZIL, every write turns synchronous and has to wait for the txg to finish naturally. The reasoning behind this might be that even if a ZIL device is available, there might not be enough space left to commit the ZIL to the pool. To prevent this, zfs doesn't use the ZIL when the pool is above 96%. While this might be proper for small pools, on large pools 4% is still some TB of free space, so there should be an upper limit of maybe 10GB on this hidden reserve.

I do not believe this is correct. At 96% the first-fit algorithm changes to best-fit, and ganging can be expected. This has nothing to do with the ZIL. There is already a reserve set aside for metadata and the ZIL so that you can remove files when the file system is 100% full. This reserve is 32 MB or 1/64 of the pool size.

Maybe it is some side effect of this change of allocation scheme. But I'm very sure about what I saw. The change was drastic and abrupt. I had a dtrace script running that measured the time for rfs3_write to complete. With the pool at 96% I saw a burst of writes every 30 seconds, with completion times of up to 30s. With the pool at 95%, I saw a continuous stream of writes with completion times of mostly a few microseconds.

In this situation, not only writes suffered; as a side effect, reads also came to a nearly complete halt.

If you have atime=on, then reads create writes.

atime is off. The impact on reads/lookups/getattr came imho because all server threads had been occupied by blocking writes for a prolonged time. I'll try to reproduce this on a test machine. -- Arne
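The thread-starvation effect described in this thread - every synchronous write occupying an NFS server thread for up to a whole txg interval, so that cheap reads queue behind them - can be reproduced with a toy queueing model. The thread count and timings below are illustrative, not measured values from the original system:

```python
import heapq

def serve(jobs, nthreads):
    """Tiny queueing sketch. `jobs` is a list of (arrival, service_time)
    pairs processed in order; returns each job's completion time given
    `nthreads` server threads. Models how writes that block for a full
    txg (~30 s service time) starve sub-millisecond reads behind them."""
    free = [0.0] * nthreads           # time at which each thread frees up
    heapq.heapify(free)
    done = []
    for arrival, service in jobs:
        t = max(arrival, heapq.heappop(free))  # wait for a free thread
        heapq.heappush(free, t + service)
        done.append(t + service)
    return done

# 16 server threads, 16 writes each blocking 30 s on the txg, then a 1 ms read
jobs = [(0.0, 30.0)] * 16 + [(0.1, 0.001)]
print(serve(jobs, 16)[-1])  # the read completes only after ~30 s
```

With the writes served quickly (as when the ZIL is in use), the same read completes almost immediately, which matches the before/after behavior reported at the 95%/96% boundary.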
Re: [zfs-discuss] what is zfs doing during a log resilver?
Giovanni Tirloni wrote:
On Thu, Sep 2, 2010 at 10:18 AM, Jeff Bacon <ba...@walleyesoftware.com> wrote:

So, when you add a log device to a pool, it initiates a resilver. What is it actually doing, though? Isn't the slog a copy of the in-memory intent log? Wouldn't it just simply replicate the data that's in the other log, checked against what's in RAM? And presumably there isn't that much data in the slog, so there isn't that much to check? Or is it just doing a generic resilver for the sake of argument because you changed something?

Good question. Here it takes a little over 1 hour to resilver a 32GB SSD in a mirror. I've always wondered what exactly it was doing, since it was supposed to be 30 seconds' worth of data. It also generates lots of checksum errors.

Here it takes more than 2 days to resilver a failed slog SSD. I'd also expect it to finish in a few seconds... It seems it resilvers the whole pool, 35T worth of data on 22 spindles (RAID-Z2). We don't get any errors during resilver. -- Arne
Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision
Edward Ned Harvey wrote:
From: Robert Milkowski <mi...@task.gda.pl>

[In raidz] The issue is that each zfs filesystem block is basically spread across n-1 devices. So every time you want to read back a single fs block, you need to wait for all n-1 devices to provide you with a part of it - and keep in mind that in zfs you can't get a partial block even if that's what you are asking for, as zfs has to check the checksum of the entire fs block.

Can anyone else confirm or deny the correctness of this statement? If you read a small file from a raidz volume, do you have to wait for every single disk to return a small chunk of the blocksize? I know this is true for large files which require more than one block, obviously, but does even a small file get spread out across multiple disks? This may be the way it's currently implemented, but it's not a mathematical requirement. It is possible, if desired, to implement raid parity and still allow small files to be written entirely on a single disk, without losing redundancy - thus providing the redundancy and the large-file performance (both of which are already present in raidz), and also optimizing small-file random operations, which may not already be optimized in raidz.

As I understand it, that's the whole point of raidz. Each block is its own stripe. If necessary, the block gets broken down into 512-byte chunks to spread it as wide as possible. Each block gets its own parity added. So if the array is too wide for the block to be spread to all disks, you also lose space, because the stripe is not full and parity gets added to that small stripe. That means if you only write 512-byte blocks, each write writes 3 blocks to disk, so the net capacity goes down to one third, regardless of how many disks you have in your raid group.
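The space overhead being debated follows from the raidz allocation rule: data sectors, plus one parity sector per stripe row, padded up to a multiple of nparity + 1 so that leftover gaps stay allocatable. A sketch of that arithmetic (my reading of the allocation rule, worth double-checking against the source): by this rule a single 512-byte block costs two sectors on raidz1 and three on raidz2, so the "one third" figure above corresponds to the raidz2 case.

```python
def raidz_asize(psize, ndisks, nparity, sector=512):
    """Bytes actually allocated for one logical block on a raidz vdev:
    data sectors, plus nparity parity sectors per (partial) stripe row,
    rounded up to a multiple of nparity + 1 sectors."""
    data = -(-psize // sector)                        # ceil: data sectors
    rows = -(-data // (ndisks - nparity))             # stripe rows used
    total = data + nparity * rows                     # add parity sectors
    padded = -(-total // (nparity + 1)) * (nparity + 1)
    return padded * sector

# one 512-byte block on 8-disk raidz1: 1 data + 1 parity sector (half usable)
print(raidz_asize(512, 8, 1))     # 1024
# the same block on raidz2: 1 data + 2 parity sectors (one third usable)
print(raidz_asize(512, 8, 2))     # 1536
# a 128K block on 8-disk raidz1: 256 data + 37 parity sectors + 1 pad
print(raidz_asize(131072, 8, 1))  # 150528
```

Either way the qualitative point in the thread holds: for small blocks the parity-per-block overhead dominates, and the effective capacity no longer depends on the group width.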
Re: [zfs-discuss] 1tb SATA drives
Jordan McQuown wrote:
I'm curious to know what other people are running for HDs in white-box systems. I'm currently looking at Seagate Barracudas and Hitachi Deskstars, in the 1TB models. These will be attached to an LSI expander in an SC847E2 chassis, driven by an LSI 9211-8i HBA. This system will be used as a large storage array for backups and archiving. Thanks, Jordan

I wouldn't recommend using desktop drives in a server RAID. They can't handle the vibrations that are present in a server well. I'd recommend at least the Seagate Constellation or the Hitachi Ultrastar, though I haven't tested the Deskstar myself. --Arne
Re: [zfs-discuss] Hashing files rapidly on ZFS
Daniel Carosone wrote: Something similar would be useful, and much more readily achievable, from ZFS for such an application, and many others. Rather than a way to compare reliably between two files for identity, I'd like a way to compare identity of a single file between two points in time. If my application can tell quickly that the file content is unaltered since the last time I saw the file, I can avoid rehashing the content and use a stored value. If I can achieve this result for a whole directory tree, even better. This would be great for any kind of archiving software. Aren't zfs checksums already suited to solve this? If a file changes, its dnode's checksum changes, then the checksum of the directory it is in, and so forth all the way up to the uberblock. There may be ways a checksum changes without a real change in the file's content, but the other way round should hold: if the checksum didn't change, the file didn't change. So the only missing link is a way to determine zfs's checksum for a file/directory/dataset. Am I missing something here? Of course atime updates should be turned off, otherwise the checksum will get changed by the archiving agent.
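Since zfs doesn't expose its per-file checksums, archiving software typically falls back to a stat-based change test before rehashing. A minimal sketch of my own (the cache layout is hypothetical, not from any existing tool):

```python
import hashlib
import os

def hash_if_changed(path, cache):
    """Rehash a file only when its stat signature (inode, size,
    mtime) changed since the last visit; otherwise reuse the
    stored digest."""
    st = os.stat(path)
    sig = (st.st_ino, st.st_size, st.st_mtime_ns)
    entry = cache.get(path)
    if entry is not None and entry[0] == sig:
        return entry[1]          # unchanged: skip rereading the content
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    cache[path] = (sig, h.hexdigest())
    return h.hexdigest()
```

Unlike a real zfs-checksum lookup this can be fooled by mtime games, but it captures the "don't rehash unchanged files" idea.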
Re: [zfs-discuss] What happens when unmirrored ZIL log device is removed ungracefully
Edward Ned Harvey wrote: Due to recent experiences, and discussion on this list, my colleague and I performed some tests: Using solaris 10, fully upgraded. (zpool 15 is latest, which does not have log device removal, which was introduced in zpool 19) In any way possible, you lose an unmirrored log device, and the OS will crash, and the whole zpool is permanently gone, even after reboots. I'm a bit confused. I tried hard, but haven't been able to reproduce this using Sol10U8. I have a mirrored slog device. While putting it under load doing synchronous file creations, we pulled the power cords and unplugged the slog devices. After powering on, zfs imported the pool, but prompted us to acknowledge the missing slog devices with zpool clear. After that the pool was accessible again. That's exactly how it should be. What am I doing wrong here? The system is on a different pool using different disks. One peculiarity I noted though: when pulling both slog devices from the running machine, zpool status reports 1 file error. In my understanding this should not happen, as the file data is written from memory and not from the contents of the zil. It seems the reported write error from the slog device somehow led to a corrupted file. Thanks, Arne
Re: [zfs-discuss] OCZ Vertex 2 Pro performance numbers
Geoff Nordli wrote: Is this the one (http://www.ocztechnology.com/products/solid-state-drives/2-5--sata-ii/maximum-performance-enterprise-solid-state-drives/ocz-vertex-2-pro-series-sata-ii-2-5--ssd-.html) with the built-in supercap? Yes. Geoff
[zfs-discuss] OCZ Vertex 2 Pro performance numbers
Now the test for the Vertex 2 Pro. This was fun. For more explanation please see the thread "Crucial RealSSD C300 and cache flush?" This time I made sure the device is attached via 3GBit SATA. This is also only a short test. I'll retest after some weeks of usage.

cache enabled, 32 buffers, 64k blocks:
linear write, random data: 96 MB/s
linear read, random data: 206 MB/s
linear write, zero data: 234 MB/s
linear read, zero data: 255 MB/s
random write, random data: 84 MB/s
random read, random data: 180 MB/s
random write, zero data: 224 MB/s
random read, zero data: 190 MB/s

cache enabled, 32 buffers, 4k blocks:
linear write, random data: 93 MB/s
linear read, random data: 138 MB/s
linear write, zero data: 113 MB/s
linear read, zero data: 141 MB/s
random write, random data: 41 MB/s (10300 ops/s)
random read, random data: 76 MB/s (19000 ops/s)
random write, zero data: 54 MB/s (13800 ops/s)
random read, zero data: 91 MB/s (22800 ops/s)

cache enabled, 1 buffer, 4k blocks:
linear write, random data: 62 MB/s (15700 ops/s)
linear read, random data: 32 MB/s (8000 ops/s)
linear write, zero data: 64 MB/s (16100 ops/s)
linear read, zero data: 45 MB/s (11300 ops/s)
random write, random data: 14 MB/s (3400 ops/s)
random read, random data: 22 MB/s (5600 ops/s)
random write, zero data: 19 MB/s (4500 ops/s)
random read, zero data: 21 MB/s (5100 ops/s)

cache enabled, 1 buffer, 4k blocks, with cache flushes:
linear write, random data, flush after every write: 5700 ops/s
linear write, zero data, flush after every write: 5700 ops/s
linear write, random data, flush after every 4th write: 8500 ops/s
linear write, zero data, flush after every 4th write: 8500 ops/s

Some remarks: The random op numbers have to be read with care:
- reading occurs in the same order as the writing before
- the ops are not aligned to any specific boundary
The device also passed the write-loss test: after 5 repeats no data has been lost.
It doesn't make any difference whether the cache is enabled or disabled, so it might be worth tuning zfs to not issue cache flushes. Conclusion: This device will make an excellent slog device. I'll order them today ;) --Arne
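For reference, on Solaris-derived systems the knob for this is the zfs_nocacheflush tunable; a sketch of the /etc/system entry, only safe when every device in every pool (slog included) has a power-protected write cache:

```
* /etc/system: stop ZFS from issuing cache-flush commands to
* devices.  Dangerous unless all write caches are power-protected
* (supercap or battery), since it defeats ZIL durability otherwise.
set zfs:zfs_nocacheflush = 1
```

A reboot is needed for /etc/system changes to take effect.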
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Hi, Roy Sigurd Karlsbakk wrote: Crucial RealSSD C300 has been released and is showing good numbers for use as ZIL and L2ARC. Does anyone know if this unit flushes its cache on request, as opposed to Intel units etc? I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and did some quick testing. Here are the numbers first, some explanation follows below:

cache enabled, 32 buffers:
linear read, 64k blocks: 134 MB/s
random read, 64k blocks: 134 MB/s
linear read, 4k blocks: 87 MB/s
random read, 4k blocks: 87 MB/s
linear write, 64k blocks: 107 MB/s
random write, 64k blocks: 110 MB/s
linear write, 4k blocks: 76 MB/s
random write, 4k blocks: 32 MB/s

cache enabled, 1 buffer:
linear write, 4k blocks: 51 MB/s (12800 ops/s)
random write, 4k blocks: 7 MB/s (1750 ops/s)
linear write, 64k blocks: 106 MB/s (1610 ops/s)
random write, 64k blocks: 59 MB/s (920 ops/s)

cache disabled, 1 buffer:
linear write, 4k blocks: 4.2 MB/s (1050 ops/s)
random write, 4k blocks: 3.9 MB/s (980 ops/s)
linear write, 64k blocks: 40 MB/s (650 ops/s)
random write, 64k blocks: 40 MB/s (650 ops/s)

cache disabled, 32 buffers:
linear write, 4k blocks: 4.5 MB/s (1120 ops/s)
random write, 4k blocks: 4.2 MB/s (1050 ops/s)
linear write, 64k blocks: 43 MB/s (680 ops/s)
random write, 64k blocks: 44 MB/s (690 ops/s)

cache enabled, 1 buffer, with cache flushes:
linear write, 4k blocks, flush after every write: 1.5 MB/s (385 writes/s)
linear write, 4k blocks, flush after every 4th write: 4.2 MB/s (1120 writes/s)

The numbers are rough numbers read quickly from iostat, so please don't multiply block size by ops and compare with the bandwidth given ;) The test operates directly on top of LDI, just like ZFS.
- "nk blocks" means the size of each read/write given to the device driver
- "n buffers" means the number of buffers I keep in flight; this is to keep the command queue of the device busy
- "cache flush" means a synchronous ioctl DKIOCFLUSHWRITECACHE
These numbers contain a few surprises (at least for me).
The biggest surprise is that with the cache disabled one cannot get good data rates with small blocks, even if one keeps the command queue filled. This is completely different from what I've seen from hard drives. Also the IOPS rate with cache flushes is quite low; 385 is not much better than a 15k hdd, while the latter scales better. On the other hand, from the large drop in performance when using flushes one could infer that they indeed flush properly, but I haven't built a test setup for that yet. Conclusion: From the measurements I'd infer the device makes a good L2ARC, but for a slog device the latency is too high and it doesn't scale well. I'll do similar tests on an X25 and an OCZ Vertex 2 Pro as soon as they arrive. If there are numbers you are missing please tell me, I'll measure them if possible. Also please ask if there are questions regarding the test setup. -- Arne
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Arne Jansen wrote: Hi, Roy Sigurd Karlsbakk wrote: Crucial RealSSD C300 has been released and is showing good numbers for use as ZIL and L2ARC. Does anyone know if this unit flushes its cache on request, as opposed to Intel units etc? I had a chance to get my hands on a Crucial RealSSD C300/128GB yesterday and did some quick testing. [full benchmark numbers and test description quoted in the previous message] After taemun alerted me that the linear read/write numbers are too low I found a bottleneck: the controller decided to connect the SSD with only 1.5GBit. I have to check if we can jumper it to at least 3GBit. To connect it with 6GBit we need some new cables, so this might take some time. The main purpose of this test was to evaluate the SSD with respect to usage as a slog device, and I think the connection speed doesn't affect this. Nevertheless I'll repeat the tests as soon as we've solved the issues. Sorry. --Arne
Re: [zfs-discuss] Crucial RealSSD C300 and cache flush?
Arne Jansen wrote: Hi, Roy Sigurd Karlsbakk wrote: Crucial RealSSD C300 has been released and is showing good numbers for use as ZIL and L2ARC. Does anyone know if this unit flushes its cache on request, as opposed to Intel units etc? On the other hand, from the large drop in performance when using flushes one could infer that they indeed flush properly, but I haven't built a test setup for that yet. Result from the cache flush test: while doing synchronous writes at full speed we pulled the device from the system and compared the contents afterwards. Result: no writes lost. We repeated the test several times. Cross check: we also pulled the device while writing with the cache enabled, and it lost 8 writes. So I'd say, yes, it flushes its cache on request. -- Arne
Re: [zfs-discuss] raid-z - not even iops distribution
Ross Walker wrote: Raidz is definitely made for sequential IO patterns, not random. To get good random IO with raidz you need a zpool with X raidz vdevs, where X = desired IOPS / IOPS of a single drive. I have seen statements like this repeated several times, though I haven't been able to find an in-depth discussion of why this is the case. From what I've gathered, every block (what is the correct term for this? zio block?) written is spread across the whole raid-z. But in what units? Will a 4k write be split into 512-byte writes? And in the opposite direction, does every block need to be read fully, even if only parts of it are being requested, because the checksum needs to be checked? Will the parity be read, too? If this is all the case, I can see why raid-z effectively reduces the random-read performance of an array to that of one device. Thanks, Arne
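Ross's rule of thumb can be written down directly. A small illustrative helper of my own, not from any zfs tool:

```python
import math

def raidz_vdevs_needed(target_iops, drive_iops):
    """Rule of thumb from the thread: a raidz vdev delivers roughly
    the random-read IOPS of a single drive, because every block is
    striped across (and read back from) all data disks in the vdev.
    So the pool needs ceil(target / per-drive) raidz vdevs."""
    return math.ceil(target_iops / drive_iops)

# e.g. 2000 random-read IOPS from 7.2k SATA drives (~100 IOPS each)
# would call for a pool of 20 raidz vdevs
```

The per-drive IOPS figure is an assumption you'd measure for your own disks.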
Re: [zfs-discuss] One dataset per user?
Paul B. Henson wrote: On Sun, 20 Jun 2010, Arne Jansen wrote: In my experience the boot time mainly depends on the number of datasets, not the number of snapshots. 200 datasets is fairly easy (we have 7000, but did some boot-time tuning). What kind of boot tuning are you referring to? We've got about 8k filesystems on an x4500, it takes about 2 hours for a full boot cycle which is kind of annoying. The majority of that time is taken up with NFS sharing, which currently scales very poorly :(. As you said most of the time is spent for nfs sharing, but mounting also isn't as fast as it could be. We found that the zfs utility is very inefficient as it does a lot of unnecessary and costly checks. We set mountpoint to legacy and handle mounting/sharing ourselves in a massively parallel fashion (50 processes). Using the system utilities makes things a lot better, but you can speed up sharing a lot more by setting the SHARE_NOINUSE_CHECK environment variable before invoking share(1M). With this you should be able to share your tree in about 10 seconds. Good luck, Arne Thanks...
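A sketch of the massively parallel mounting described above; the dataset and mountpoint names are hypothetical, and the command runner is injectable so the structure can be exercised without a real pool:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def parallel_mount(datasets, workers=50, run=subprocess.run):
    """Mount datasets that have mountpoint=legacy using a pool of
    workers, sidestepping the slow sequential 'zfs mount -a' path."""
    def mount_one(ds):
        # mount each dataset under /export/<last path component>
        target = "/export/" + ds.rsplit("/", 1)[-1]
        return run(["mount", "-F", "zfs", ds, target])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # results come back in the same order as the input datasets
        return list(pool.map(mount_one, datasets))
```

The same worker-pool shape applies to the share(1M) invocations.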
Re: [zfs-discuss] One dataset per user?
Arne Jansen wrote: Paul B. Henson wrote: [discussion of parallel mounting and sharing with SHARE_NOINUSE_CHECK, quoted in full in the previous message] I forgot the disclaimer: you can crash your machine if you call share with improper arguments while this flag is set. IIRC it skips the check of whether the fs is already shared, so it cannot handle a re-share properly. Good luck, Arne
Re: [zfs-discuss] One dataset per user?
David Magda wrote: On Jun 21, 2010, at 05:00, Roy Sigurd Karlsbakk wrote: So far the plan is to keep it in one pool for design and administration simplicity. Why would you want to split up (net) 40TB into more pools? Seems to me that'll mess up things a bit, having to split up SSDs for use on different pools, losing the flexibility of a common pool etc. Why? If different groups or areas have different I/O characteristics, for one. If in one case (users) you want responsiveness, you could go with striped mirrors. However, if departments have lots of data, it may be worthwhile to put it on a RAID-Z pool for better storage efficiency. Especially if the characteristics are different I find it a good idea to mix all of them on one set of spindles. This way you have lots of spindles for fast access and lots of space for the sake of space. If you divide the available spindles into two sets, you will have much fewer spindles available for the responsiveness goal. I don't think taking them into a mirror can compensate for that. --Arne
Re: [zfs-discuss] SLOG striping?
Roy Sigurd Karlsbakk wrote: Hi all I plan to setup a new system with four Crucial RealSSD 256GB SSDs for both SLOG and L2ARC. The plan is to use four small slices for the SLOG, striping two mirrors. I have seen questions in here about the theoretical benefit of doing this, but I haven't seen any answers, just some doubt about the effect. Does anyone know if this will help gaining performance? Or will it be bad? I'm planning to do something similar, though I only want to install 2 devices. Some thoughts I had so far: - mirroring l2arc won't gain anything, as it doesn't contain any information that cannot be rebuilt if a device is lost. Further, if a device is lost, the system just uses the remaining devices. So I wouldn't waste any space mirroring l2arc, I'll just stripe them. - the purpose of a zil device is to reduce latency. Throughput is probably not an issue, especially if you configure your pool so that large writes go to the main pool. As 2 devices don't have a lower latency than one, I see no real point in striping slog devices. - for slog you need an SSD with a supercap, which is significantly more expensive than one without. I'll try the OCZ Vertex 2 Pro in the next few days and can give a report on how it performs. For L2ARC cheap MLC SSDs will do. So if I had the chance to buy 4 devices, I'd probably buy 2 different sets: 2 cheap large L2ARC devices and 2 fast supercapped small ones. The 2 slog devices would go into a mirror, the L2ARC devices in a stripe. I'd probably take the remaining space of the slog devices into the stripe, too, though this might affect write performance. Just my thoughts... -- Arne Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
Re: [zfs-discuss] SLOG striping?
Roy Sigurd Karlsbakk wrote: - mirroring l2arc won't gain anything, as it doesn't contain any information that cannot be rebuilt if a device is lost. Further, if a device is lost, the system just uses the remaining devices. So I wouldn't waste any space mirroring l2arc, I'll just stripe them. I don't plan to attempt to mirror L2ARC. Even the docs say it's unsupported, so no point in that. Oops, makes sense ;) - For slog you need an SSD with a supercap, which is significantly more expensive than without. I'll try the OCZ Vertex 2 Pro in the next few days and can give a report on how it performs. For L2ARC cheap MLC SSDs will do. hm... Last I checked those OCZ Vertexes were on both the large and expensive side. What do you pay for a couple of small ones? We'll be installing 48 gigs of memory in this box, but I doubt we'll need more than 4GB SLOG in terms of traffic. 50GB for 400 Euro. They are MLC flash, but, as someone in a different thread pointed out, they have 3 years of warranty ;) My hope is that they last long enough until cheaper options become available. My major concern is that if I buy two identical models they'll break on the same day. This is not purely hypothetical. If they internally just count the write cycles and trigger a SMART fail when a certain threshold is reached, exactly this will happen. -- Arne
Re: [zfs-discuss] does sharing an SSD as slog and l2arc reduces its life span?
Wes Felter wrote: On 6/19/10 3:56 AM, Arne Jansen wrote: while thinking about using the OCZ Vertex 2 Pro SSD (which according to spec page has supercaps built in) as a shared slog and L2ARC device IMO it might be better to use the smallest (50GB, maybe overprovisioned down to ~20GB) Vertex 2 Pro as slog and a much cheaper SSD (X25-M) as L2ARC. No budget for this. Lucky if I can get the budget for the Vertex 2 Pro. But if this sharing works (thanks to static wear leveling) it should be sufficient to leave 10-20% space. As Bob Friesenhahn said, you're assuming dynamic wear leveling but modern SSDs also use static wear leveling, so this problem doesn't exist. (Note that in this context the terms dynamic and static may not mean what you think they mean.) Thanks for the term. Yes, this makes sense. -- Arne
Re: [zfs-discuss] One dataset per user?
Roy Sigurd Karlsbakk wrote: I have read people are having problems with lengthy boot times with lots of datasets. We're planning to do extensive snapshotting on this system, so there might be close to a hundred snapshots per dataset, perhaps more. With 200 users and perhaps 10-20 shared department datasets, the number of filesystems, snapshots included, will be around 20k or more. In my experience the boot time mainly depends on the number of datasets, not the number of snapshots. 200 datasets is fairly easy (we have 7000, but did some boot-time tuning). Will trying such a setup be betting on help from some god, or is it doable? The box we're planning to use will have 48 gigs of memory and about 1TB L2ARC (shared with SLOG, we just use some slices for that). Try. The main problem with having many snapshots is the time used for zfs list, because it has to scrape all the information from disk, but with so much RAM/L2ARC that shouldn't be a problem here. Another thing to consider is the frequency with which you plan to take the snapshots, and whether you want individual schedules for each dataset. Taking a snapshot is a heavy-weight operation as it terminates the current txg. Btw, what did you plan to use as L2ARC/slog? --Arne Vennlige hilsener / Best regards roy
[zfs-discuss] does sharing an SSD as slog and l2arc reduces its life span?
Hi, I don't know if it's already been discussed here, but while thinking about using the OCZ Vertex 2 Pro SSD (which according to the spec page has supercaps built in) as a shared slog and L2ARC device, it struck me that this might not be such a good idea. Because this SSD is MLC based, write cycles are an issue here, though I can't find any number in their spec. Why do I think it might be a bad idea: L2ARC is quite static in comparison with the ZIL, and L2ARC takes all the space it can get. But if 90% of the device is nearly statically allocated, the device's possibilities for wear leveling are very restricted. If the ZIL is heavily used, the same 10% of the device gets written over and over again, reducing the life span by 90%. Is there some fundamental flaw in this line of thought? Thanks, Arne
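The back-of-the-envelope argument can be made concrete. A rough endurance model of my own; the P/E cycle count and write rate are made-up assumptions, since the spec gives no numbers:

```python
def slog_endurance_days(capacity_gb, leveled_fraction, pe_cycles, zil_mb_s):
    """Lifetime estimate when only 'leveled_fraction' of the flash
    takes part in wear leveling (dynamic-only leveling, with the
    rest statically occupied by L2ARC data)."""
    leveled_bytes = capacity_gb * 1e9 * leveled_fraction
    total_write_budget = leveled_bytes * pe_cycles  # bytes writable
    return total_write_budget / (zil_mb_s * 1e6) / 86400

# with 90% of the device statically occupied, the estimate drops
# to exactly a tenth of the whole-device figure:
full = slog_endurance_days(50, 1.0, 10000, 80)
shared = slog_endurance_days(50, 0.1, 10000, 80)
```

With static wear leveling, as pointed out elsewhere in the thread, the whole device participates regardless of allocation and the penalty disappears.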
Re: [zfs-discuss] Erratic behavior on 24T zpool
Curtis E. Combs Jr. wrote: Sure. And hey, maybe I just need some context to know what's normal IO for the zpool. It just...feels...slow, sometimes. It's hard to explain. I attached a log of iostat -xn 1 while doing mkfile 10g testfile on the zpool, as well as your dd with the bs set really high. When I Ctrl-C'ed the dd it said 460M/sec. Like I said, maybe I just need some context... These iostats don't match the creation of any large files. What are you doing there? It looks more like 512-byte random writes... Are you generating the load locally or remotely? On Fri, Jun 18, 2010 at 5:36 AM, Arne Jansen sensi...@gmx.net wrote: artiepen wrote: 40MB/sec is the best that it gets. Really, the average is 5. I see 4, 5, 2, and 6 almost 10x as many times as I see 40MB/sec. It really only bumps up to 40 very rarely. As far as random vs. sequential: correct me if I'm wrong, but if I used dd to make files from /dev/zero, wouldn't that be sequential? I measure with zpool iostat 2 in another ssh session while making files of various sizes. This is a test system. I'm wondering, now, if I should just reconfigure with maybe 7 disks and add another spare. It seems to be the general consensus that bigger raid pools = worse performance. I thought the opposite was true... A quick test on a system with 21 1TB SATA drives in a single RAIDZ2 group shows a performance of about 400MB/s with a single dd, blocksize=1048576. Creating a 10G file with mkfile takes 25 seconds also. So I'd say basically there is nothing wrong with the zpool configuration. Can you paste some iostat -xn 1 output while your test is running? --Arne
Re: [zfs-discuss] Erratic behavior on 24T zpool
Curtis E. Combs Jr. wrote: Um... I started 2 commands in 2 separate ssh sessions: in ssh session one: iostat -xn 1 stats; in ssh session two: mkfile 10g testfile. When the mkfile was finished I did the dd command... on the same zpool1 and zfs filesystem. That's it, really. No, this doesn't match. Did you enable compression or dedup?
Re: [zfs-discuss] VXFS to ZFS Quota
David Magda wrote: On Fri, June 18, 2010 08:29, Sendil wrote: I can create 400+ file systems, one for each user, but will this affect my system performance during system boot up? Is this recommended, or is any alternative available for this issue? You can create a dataset for each user, and then set a per-dataset quota for each one: quota=size | none As a side note, you do not need to worry about creating 400 filesystems. A few thousand are no problem. --Arne
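Creating one dataset per user with a quota is just a loop. A sketch that only builds the command lines (the pool and path names are hypothetical):

```python
def user_dataset_cmds(pool, users, quota="10G"):
    """Build 'zfs create' command lines, one dataset per user,
    each with a per-dataset quota set at creation time."""
    return [
        ["zfs", "create", "-o", f"quota={quota}", f"{pool}/home/{user}"]
        for user in users
    ]
```

Feed the lists to subprocess.run on the server, or join them into a shell script.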
Re: [zfs-discuss] Erratic behavior on 24T zpool
Sandon Van Ness wrote: Sounds to me like something is wrong, as on my 20 disk backup machine with 20 1TB disks in a single raidz2 vdev I get the following with dd on sequential reads/writes:

writes:
r...@opensolaris: 11:36 AM :/data# dd bs=1M count=100000 if=/dev/zero of=./100gb.bin
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 233.257 s, 450 MB/s

reads:
r...@opensolaris: 11:44 AM :/data# dd bs=1M if=./100gb.bin of=/dev/null
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 131.051 s, 800 MB/s

zpool iostat pool 10 gives me about the same values that dd gives me. Maybe you have a bad drive somewhere? Which areca controller are you using? Maybe you can pull the smart info off the drives from a linux boot cd, as some of the controllers support that. Could be a bad drive somewhere. Didn't he say he already gets 400MB/s from dd, but zpool iostat only shows a few MB/s? What does zpool iostat show, the value before or after dedup? Curtis, to see if your physical setup is ok you should turn off dedup and measure again. Otherwise you only measure the power of your machine to dedup /dev/zero. --Arne
Re: [zfs-discuss] SSDs adequate ZIL devices?
Christopher George wrote: So why buy SSD for ZIL at all? For the record, not all SSDs ignore cache flushes. There are at least two SSDs sold today that guarantee synchronous write semantics; the Sun/Oracle LogZilla and the DDRdrive X1. LogZilla? Are these those STEC thingies? For the price of those I can buy a battery-backed RAID controller and a few conventional drives. For ZIL this will probably do better at a lower price than STEC. The DDRdrive I wouldn't call a flash drive but rather an NVRAM card. NVRAM cards are the proper way to go for ZIL. Someone should build one for $600, PCIe x1 would be sufficient. Xilinx has some nice Spartans :) Also, I believe it is more accurate to describe the root cause as not power protecting on-board volatile caches. As the X25-E does implement the ATA FLUSH CACHE command, but does not have the required power protection to avoid transaction (data) loss. You could say the same about hard drives. They also just need proper protection for their volatile cache... --Arne Best regards, Christopher George Founder/CTO www.ddrdrive.com
Re: [zfs-discuss] SSDs adequate ZIL devices?
Arve Paalsrud wrote: Not to forget the Deneva Reliability disks from OCZ that just got released. See http://www.oczenterprise.com/details/ocz-deneva-reliability-2-5-emlc-ssd.html The Deneva Reliability family features a built-in supercapacitor (SF-1500 models) that acts as a temporary power backup in the event of sudden power loss, and enables the drive to complete its task, ensuring no data loss. This one looks really interesting. No price to be found though, and no detail about how many write cycles they can stand. --Arne
Re: [zfs-discuss] At what level does the “.zfs” directory exist?
David Markey wrote: I have done a similar deployment; however, we gave each student their own ZFS filesystem, each of which had a .zfs directory in it. Don't host 50k filesystems on a single pool. It's more pain than it's worth. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] At what level does the “.zfs” directory exist?
MichaelHoy wrote: I’ve posted a query regarding the visibility of snapshots via CIFS here (http://opensolaris.org/jive/thread.jspa?threadID=130577&tstart=0); however, I’m beginning to suspect that it may be a more fundamental ZFS question, so I’m asking the same question here. At what level does the “.zfs” directory exist? If the “.zfs” subdirectory only exists as the direct child of the mount point, can someone suggest how I can make it visible lower down without requiring me (even if it were possible for 50k users) to make each user’s home folder a filesystem? By way of background, I’m looking at the possibility of hosting our students’ personal file space on OpenSolaris, since the capacities required go well beyond my budget to keep investing in our NetApp kit. So far I’ve managed to implement the same functionality; however, the visibility of the snapshots to allow self-service file restores is a real issue which may prevent me from going forward on this platform. I’d appreciate any suggestions. Do you only want to share the filesystem via CIFS? Have you had a look at the shadow_copy2 extension for Samba? It maps the snapshots so Windows can access them via Previous Versions in Explorer’s context menu. --Arne Thanks Michael. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
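For reference, the shadow_copy2 setup Arne mentions is a few lines of smb.conf. A minimal sketch, assuming a Samba build that ships the shadow_copy2 VFS module and snapshots named like "hourly-2010.06.16-12.00.00" (the share name and the shadow:format pattern here are illustrative — adjust them to whatever your snapshot tool actually produces):

```ini
; smb.conf sketch -- expose ZFS snapshots as Windows "Previous Versions"
[homes]
    vfs objects = shadow_copy2
    ; ZFS exposes snapshots under .zfs/snapshot of each filesystem root
    shadow:snapdir = .zfs/snapshot
    shadow:sort = desc
    shadow:format = hourly-%Y.%m.%d-%H.%M.%S
    shadow:localtime = yes
```

Note this maps snapshots per share, so it works even when the “.zfs” directory only exists at the filesystem root rather than in each user's home folder.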
Re: [zfs-discuss] SSDs adequate ZIL devices?
David Magda wrote: On Wed, June 16, 2010 11:02, David Magda wrote: [...] Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD numbers see: s/suck/such/ ah, I tried to make sense from 'suck' in the sense of 'just writing sequentially' or something like that ;) :) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSDs adequate ZIL devices?
David Magda wrote: On Wed, June 16, 2010 10:44, Arne Jansen wrote: David Magda wrote: I'm not sure you'd get the same latency and IOPS with disk that you can with a good SSD: http://blogs.sun.com/brendan/entry/slog_screenshots [...] Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or main pool. Because ZIL issues nearly sequential writes, due to the NVRAM protection of the RAID controller the disk can leave the write cache enabled. This means the disk can write essentially at full speed, meaning 150MB/s for a 15k drive. 114000 4k writes/s are 456MB/s, so 3 spindles should do. Yes, I understood it as suck, and that link is for ZIL. For L2ARC SSD numbers see: http://blogs.sun.com/brendan/entry/l2arc_screenshots oops, sorry, I should at least have scrolled down a bit on your link... Nevertheless, I don't find it improbable to reach numbers like that with a proper RAID setup. Of course it will take more space and power. Maybe someone has done some testing on this. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSDs adequate ZIL devices?
Bob Friesenhahn wrote: On Wed, 16 Jun 2010, Arne Jansen wrote: Please keep in mind I'm talking about a usage as ZIL, not as L2ARC or main pool. Because ZIL issues nearly sequential writes, due to the NVRAM protection of the RAID controller the disk can leave the write cache enabled. This means the disk can write essentially at full speed, meaning 150MB/s for a 15k drive. 114000 4k writes/s are 456MB/s, so 3 spindles should do. Huh? What does the battery-backed memory of a RAID controller have to do with the unprotected memory of a hard drive? This does not compute. You're right, I took a wrong turn there. Of course the RAID controller disables the write cache of the disks. But because the controller ACKs each write immediately (as long as it has buffer left), the requests can be queued in the disk. This enables the disk to write continuously. I double-checked before posting: I can nearly saturate a 15k disk if I make full use of the 32 queue slots, giving 137 MB/s or 34k IOPS. Times 3 nearly matches the above-mentioned 114k IOPS :) Thanks, Arne The flushes that the RAID controller acks need to be ultimately delivered to the disk or else there WILL be data loss. The RAID controller should not purge its own record until the disk reports that it has flushed its cache. Once the RAID controller's cache is full, then it should start stalling writes. Bob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
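The queue-depth measurement above is internally consistent: 137 MB/s spread over ~34k IOPS works out to roughly 4 KB per write, i.e. the same 4k sync-write workload quoted earlier in the thread:

```shell
# Derive the per-write size implied by the measured rate and IOPS.
mbps=137      # measured sustained throughput at queue depth 32
iops=34000    # measured writes/s
echo "$((mbps * 1000000 / iops)) bytes per write"   # roughly 4 KB
```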
Re: [zfs-discuss] SSDs adequate ZIL devices?
David Magda wrote: On Wed, June 16, 2010 15:15, Arne Jansen wrote: I double-checked before posting: I can nearly saturate a 15k disk if I make full use of the 32 queue slots, giving 137 MB/s or 34k IOPS. Times 3 nearly matches the above-mentioned 114k IOPS :) 34K*3 = 102K. 12K isn't anything to sneeze at :) So you'll need six disks to do what one SSD does: three spindles, and two (mirrored) disks on each spindle for redundancy (drives are riskier than SSDs). ok, 4 spindles, we already have a RAID controller available :) But personally I trust drives more than SSDs. Are the 114k with mirrored or striped logzillas? In any case there are two of them, so I'd double that RAID-controller setup also, still being cheaper than the STEC devices. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] SSDs adequate ZIL devices?
There have been many threads in the past asking about ZIL devices. Most of them end up recommending the Intel X-25 as an adequate device. Nevertheless, there is always the warning about them not heeding cache flushes. But what use is a ZIL that ignores cache flushes? If I'm willing to tolerate that (I'm not), I can just as well take a mechanical drive and force ZFS not to issue cache flushes to it. In this case it can easily compete with an SSD in regard to IOPS and bandwidth. In case of a power failure I will likely lose about as many writes as I do with SSDs, a few milliseconds' worth. So why buy an SSD for ZIL at all? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
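For reference, the "force ZFS not to issue cache flushes" experiment can be done on Solaris-derived systems with the zfs_nocacheflush tunable from the Evil Tuning Guide. Shown purely to illustrate the argument: with this set, any device with a volatile write cache can lose committed data on power failure, which is exactly the failure mode under discussion.

```
* /etc/system fragment -- illustration only; safe ONLY when every
* pool and log device has a non-volatile (battery/capacitor backed)
* write cache. Tells ZFS to stop issuing cache-flush commands.
set zfs:zfs_nocacheflush = 1
```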
Re: [zfs-discuss] SSDs adequate ZIL devices?
Bob Friesenhahn wrote: On Tue, 15 Jun 2010, Arne Jansen wrote: In case of a power failure I will likely lose about as many writes as I do with SSDs, a few milliseconds. I agree with your concerns, but the data loss may span as much as 30 seconds rather than just a few milliseconds. Wait, I'm talking about using SSD for ZIL vs. using a dedicated hard drive for ZIL which is configured to ignore cache flushes. Do you say I can lose 30 seconds also if I use a badly behaving SSD? Using an SSD as the ZIL allows zfs to turn a synchronous write into a normal batched async write which is scheduled for the next TXG. Zfs intentionally postpones writes. Without the SSD, zfs needs to write to an intent log in the main pool (consuming precious IOPS) or write directly to the main pool (consuming precious response latency). Battery-backed RAM in the adaptor card or storage array can do almost as well as the SSD as long as the amount of data does not overrun the limited write cache. Bob ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshots, txgs and performance
Marcelo Leal wrote: Hello there, I think you should share it with the list, if you can, seems like an interesting work. ZFS has some issues with snapshots and spa_sync performance for snapshots deletion. I'm a bit reluctant to post it to the list where it can still be found years from now. Because the module is not compiled directly into ZFS but is a separate module that makes heavy use of internal structures of ZFS, it is designed for a specific version of ZFS (Solaris U8). It might still load without problems for years, but already in the next Solaris version it might wreak havoc because of a changed kernel structure. A much better way would be to have a similar operation integrated into the official source tree. I could try to build a patch if it has a chance of getting accepted. Until then, I have no problem with sharing it off-list. --Arne Thanks Leal [ http://www.eall.com.br/blog ] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] size of slog device
Hi, I know it's been discussed here more than once, and I read the Evil Tuning Guide, but I didn't find a definitive statement: There is absolutely no sense in having slog devices larger than the main memory, because the excess will never be used, right? ZFS will rather flush the txg to disk than read back from the ZIL? So there is a guideline to have enough slog to hold about 10 seconds of ZIL traffic, but the absolute maximum useful value is the size of main memory. Is this correct? Thanks, Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] size of slog device
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Arne Jansen There is absolutely no sense in having slog devices larger than the main memory, because the excess will never be used, right? Also: A TXG is guaranteed to flush within 30 sec. Let's suppose you have a super-fast device, which is able to log 8Gbit/sec (which is unrealistic). That's 1GByte/sec, unrealistically, theoretically possible, at best. You do the math. ;-) That being said, it's difficult to buy an SSD smaller than 32G. So what are you going to do? I'm still building my rotational-write-delay-eliminating driver and am trying to figure out how much space I can waste on the underlying device without ever running into problems. I need half the physical memory, or, under the assumption that it might be tunable, a maximum of my physical memory. It's good to know a hard upper limit. The more I can waste, the faster the device will be. Also, to stay in your line of argumentation, this super-fast slog is most probably a DRAM-based, battery-backed solution. In that case it will make a difference whether you buy 8 or 32GB ;) --Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
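Ned's bound can be made explicit: under his (deliberately generous) assumptions, the most data that can ever sit in the slog is the sustained sync-write rate times the maximum txg interval:

```shell
# Upper bound on useful slog size = write rate x max txg interval.
rate_gb_per_s=1   # the hypothetical 8 Gbit/s = 1 GByte/s device
txg_secs=30       # a txg is flushed within 30 s at the latest
echo "$((rate_gb_per_s * txg_secs)) GB is the most the slog can ever hold"
```

Even this fantasy device never accumulates more than ~30 GB, and real workloads stay far below that, which is why the RAM-based caps discussed in this thread are the tighter limit.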
Re: [zfs-discuss] size of slog device
Roy Sigurd Karlsbakk wrote: There is absolutely no sense in having slog devices larger than the main memory, because the excess will never be used, right? ZFS will rather flush the txg to disk than read back from the ZIL? So there is a guideline to have enough slog to hold about 10 seconds of ZIL traffic, but the absolute maximum useful value is the size of main memory. Is this correct? ZFS uses at most RAM/2 for the ZIL Thanks! ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] panic after zfs mount
Thomas Nau wrote: Dear all, We ran into a nasty problem the other day. One of our mirrored zpools hosts several ZFS filesystems. After a reboot (all filesystems mounted and in use at that time) the machine panicked (console output further down). After detaching one of the mirrors, the pool fortunately imported automatically in a faulted state without mounting the filesystems. Offlining the unplugged device and clearing the fault allowed us to disable auto-mounting the filesystems. Going through them one by one, all but one mounted OK. That one again triggered a panic. We left mounting on that one disabled for now to be back in production after pulling data from the backup tapes. Scrubbing didn't show any error, so any idea what's behind the problem? Any chance to fix the FS? We had the same problem. Victor pointed me to http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6742788 with a workaround to mount the filesystem read-only to save the data. I still hope to figure out the chain of events that causes this. Did you use any extended attributes on this filesystem? -- Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are recursive snapshot destroy and rename atomic too?
Darren J Moffat wrote: But the following document says: Recursive ZFS snapshots are created quickly as one atomic operation. The snapshots are created together (all at once) or not created at all. http://docs.sun.com/app/docs/doc/819-5461/gdfdt?a=view I've looked at the code again - I misread part of it - it does appear to be all-or-nothing both for the create and the destroy. I read the code differently: zfs destroy does the iteration in the zfs utility, not even in libzfs. The ioctl doesn't even have a recurse flag. --Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Are recursive snapshot destroy and rename atomic too?
Darren J Moffat wrote: On 11/06/2010 11:42, Arne Jansen wrote: Darren J Moffat wrote: But the following document says: Recursive ZFS snapshots are created quickly as one atomic operation. The snapshots are created together (all at once) or not created at all. http://docs.sun.com/app/docs/doc/819-5461/gdfdt?a=view I've looked at the code again - I misread part of it - it does appear to be all-or-nothing both for the create and the destroy. I read the code differently: zfs destroy does the iteration in the zfs utility, not even in libzfs. The ioctl doesn't even have a recurse flag. In zfs_do_destroy(), if recurse is set we call zfs_destroy_snaps() http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libzfs/common/libzfs_dataset.c#zfs_destroy_snaps which makes a single ioctl call, ZFS_IOC_DESTROY_SNAPS. Ah! There's an extra ioctl for recursive snapshot deletion; I missed that part. Many thanks! This will also help me with my multi-snapshot-destroy module. I should at least have checked with a simple truss... You made my day :) --Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
Andrey Kuzmin wrote: On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote: Andrey Kuzmin wrote: As to your results, it sounds almost too good to be true. As Bob has pointed out, the h/w design targeted hundreds of IOPS, and it was hard to believe it can scale 100x. Fantastic. Hundreds of IOPS is not quite true, even with hard drives. I just tested a Hitachi 15k drive and it handles 67000 512-byte linear writes/s, cache enabled. Linear? Maybe sequential? Aren't these synonyms? Linear as opposed to random. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
Andrey Kuzmin wrote: Well, I'm more accustomed to sequential vs. random, but YMMV. As to 67000 512-byte writes (this sounds suspiciously close to 32 MB fitting into cache), did you have write-back enabled? It's a sustained number, so it shouldn't matter. Regards, Andrey On Fri, Jun 11, 2010 at 12:03 AM, Arne Jansen sensi...@gmx.net wrote: Andrey Kuzmin wrote: On Thu, Jun 10, 2010 at 11:51 PM, Arne Jansen sensi...@gmx.net wrote: Andrey Kuzmin wrote: As to your results, it sounds almost too good to be true. As Bob has pointed out, the h/w design targeted hundreds of IOPS, and it was hard to believe it can scale 100x. Fantastic. Hundreds of IOPS is not quite true, even with hard drives. I just tested a Hitachi 15k drive and it handles 67000 512-byte linear writes/s, cache enabled. Linear? Maybe sequential? Aren't these synonyms? Linear as opposed to random. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20
Andrey Kuzmin wrote: As to your results, it sounds almost too good to be true. As Bob has pointed out, the h/w design targeted hundreds of IOPS, and it was hard to believe it can scale 100x. Fantastic. Hundreds of IOPS is not quite true, even with hard drives. I just tested a Hitachi 15k drive and it handles 67000 512-byte linear writes/s, cache enabled. --Arne Regards, Andrey On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski mi...@task.gda.pl wrote: On 21/10/2009 03:54, Bob Friesenhahn wrote: I would be interested to know how many IOPS an OS like Solaris is able to push through a single device interface. The normal driver stack is likely limited as to how many IOPS it can sustain for a given LUN, since the driver stack is optimized for high-latency devices like disk drives. If you are creating a driver stack, the design decisions you make when requests will be satisfied in about 12ms would be much different than if requests are satisfied in 50us. Limitations of existing software stacks are likely reasons why Sun is designing hardware with more device interfaces and more independent devices. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshots, txgs and performance
thomas wrote: Very interesting. This could be useful for a number of us. Would you be willing to share your work? No problem. I'll contact you off-list. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshots, txgs and performance
Arne Jansen wrote: Hi, I have a setup with thousands of filesystems, each containing several snapshots. For a good percentage of these filesystems I want to create a snapshot once every hour, for others once every 2 hours, and so forth. I built some tools to do this, no problem so far. While examining disk load on the system, I found out that load jumps up whenever the snapshot creation process is running. Delving a bit deeper, it seems that every snapshot starts a new txg. This seems to be quite costly, about 100-150 I/Os. This takes about one second, so I can create one snapshot per second. Creating all necessary snapshots for one hour takes about 45 minutes. During this time the disks are at 70% utilization and txgs are back-to-back. So I need to optimize this. Looking at the code, it seems that recursive snapshots are collected into a single txg. So my aim is to collect all necessary snapshots into a single txg, too. Using libzfs or ioctl I haven't found any way to do this. I cannot just use recursive snapshots, because not all filesystems need to be snapshotted. Same with snapshot deletion. My idea is to write a small kernel module that roughly duplicates the code of zfs_ioc_snapshot, but instead of being fully recursive gets passed a list of filesystems to snapshot. My questions: - is there any easier way to bring down disk load and accelerate snapshot creation? - are there any arguments why my approach isn't feasible? Just as a follow-up: the module works smoothly for creations. If I take 100 snapshots at a time, the gain is roughly a factor of 100 :) Because there are no recursive deletions in kernel space (recursive deletions are handled by libzfs), I haven't found an easy way to do likewise for delete. Maybe someday when I get a better grip on the source... --Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
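The factor-100 gain is what the txg arithmetic predicts. A rough model, using the figures from the post above (one snapshot per second unbatched, so ~2700 snapshots in a 45-minute run, and roughly one second per txg sync):

```shell
snapshots=2700   # ~45 minutes' worth at one snapshot (= one txg) per second
per_txg=100      # snapshots batched into a single txg by the module
txg_secs=1       # approximate cost of one txg sync

echo "unbatched: $((snapshots * txg_secs / 60)) minutes"
echo "batched:   $((snapshots / per_txg * txg_secs)) seconds"
```

With batching, the hourly run drops from ~45 minutes of back-to-back txgs to under half a minute.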
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Brent Jones wrote: I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. A few days ago I posted to nfs-discuss with a proposal to add some mount/share options to change the semantics of an NFS-mounted filesystem so that they parallel those of a local filesystem. The main point is that data gets flushed to stable storage only if the client explicitly requests it via fsync or O_DSYNC, not implicitly with every close(). That would give you the performance you are seeking without sacrificing data integrity for applications that need it. I get the impression that I'm not the only one who could be interested in that ;) -Arne You'd be better off getting NetApp ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Snapshots, txgs and performance
Hi, I have a setup with thousands of filesystems, each containing several snapshots. For a good percentage of these filesystems I want to create a snapshot once every hour, for others once every 2 hours, and so forth. I built some tools to do this, no problem so far. While examining disk load on the system, I found out that load jumps up whenever the snapshot creation process is running. Delving a bit deeper, it seems that every snapshot starts a new txg. This seems to be quite costly, about 100-150 I/Os. This takes about one second, so I can create one snapshot per second. Creating all necessary snapshots for one hour takes about 45 minutes. During this time the disks are at 70% utilization and txgs are back-to-back. So I need to optimize this. Looking at the code, it seems that recursive snapshots are collected into a single txg. So my aim is to collect all necessary snapshots into a single txg, too. Using libzfs or ioctl I haven't found any way to do this. I cannot just use recursive snapshots, because not all filesystems need to be snapshotted. Same with snapshot deletion. My idea is to write a small kernel module that roughly duplicates the code of zfs_ioc_snapshot, but instead of being fully recursive gets passed a list of filesystems to snapshot. My questions: - is there any easier way to bring down disk load and accelerate snapshot creation? - are there any arguments why my approach isn't feasible? Thanks for any hints. -Arne ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Resilver speed
Hi, I have a pool of 22 1TB SATA disks in a RAIDZ3 configuration. It is filled with files of an average size of 2MB. I filled it randomly to resemble the expected workload in production use. Problems arise when I try to scrub/resilver this pool. This operation takes the better part of a week (!). During this time the disk being resilvered is at 100% utilisation with 300 writes/s, but only 3MB/s, which is only about 3% of its best-case performance. Having a window of one week with degraded redundancy is intolerable. It is quite likely that one loses more disks during this period, eventually leading to a total loss of the pool, not to mention the degraded performance during this period. In fact, in previous tests I lost a pool in a 6x11 RAIDZ2 configuration. I skimmed through the code of resilver and found out that it just enumerates all objects in the pool and checks them one by one, with maxinflight I/O requests in parallel. Because this does not take the on-disk order of the data into account, it leads to this pathological performance. I also found bug 6678033, stating that a prefetch might fix this. Now my questions: 1) Are there tunings that could speed up resilver, possibly with a negative effect on normal performance? I thought of raising the recordsize to the expected file size of 2MB. Could this help? 2) What is the state of the fix? When will it be ready? 3) Do you have any configuration hints for setting up a pool layout which might help resilver performance? (aside from using hardware RAID instead of RAIDZ) Thanks for any hints. sensille -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
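To put the reported numbers in perspective, a back-of-the-envelope sketch from the figures above: 300 writes/s at 3 MB/s means each resilver I/O moves only about 10 KB, and rewriting a full 1 TB member disk at that rate takes days (decimal units assumed):

```shell
mbps=3            # observed resilver throughput
writes_per_s=300  # observed write rate on the resilvering disk
disk_mb=1000000   # 1 TB disk expressed in MB

echo "$((mbps * 1000000 / writes_per_s)) bytes per resilver write"
echo "$((disk_mb / mbps / 86400)) whole days to rewrite the disk"
```

The integer division truncates the last figure (the exact value is ~3.9 days), which is consistent with "the better part of a week" once normal pool load competes for the same spindle.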