[zfs-discuss] questions about block sizes
Hi,

ZFS can use block sizes up to 128k. If a block is stored compressed, its contents will be larger than the on-disk size once decompressed. So, can the decompressed data be larger than 128k? And if so, does the same hold for metadata? In other words, can I have a 128k block on disk containing, for instance, compressed blkptr_t data (an indirect block) that decompresses to more than 1024 blkptr_t? If I had a very large amount of free space I could try this and see, but since I don't, I thought I'd ask here.

thanks,
max
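One way to poke at this on an existing pool is zdb. This is only a sketch: the pool name, dataset name, and object number below are made up, and the exact output varies by build.

    # dump object 7 of tank/fs, including its indirect block tree
    zdb -ddddd tank/fs 7

In the "Indirect blocks" section, each entry shows a logical/physical size pair (e.g. 4000L/400P, in hex), so a compressed indirect block shows up as one whose physical size on disk is smaller than its logical size.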
Re: [zfs-discuss] backup for x4500?
On Wed, Apr 16, 2008 at 2:12 PM, Anna Langley [EMAIL PROTECTED] wrote:
> I've just joined this list, and am trying to understand the state of play
> with using free backup solutions for ZFS, specifically on a Sun x4500.
> ...
> Does anyone here have experience of this with multi-TB filesystems and any
> of these solutions that they'd be willing to share with me please?

My experience so far is that once you get past a terabyte and 10 million files, any backup software struggles. (I've largely been involved with commercial solutions, as we already have them. They struggle as well.)

Generally, handling data volumes on this scale seems to require some way of partitioning them into more easily digestible chunks: either into separate filesystems (ZFS makes this easy) or, if that isn't possible, by structuring the data on a large filesystem into some sort of hierarchy, so that the top-level directories break it up into smaller chunks. (Some sort of hashing scheme appears to be indicated. Unfortunately our applications fall into two classes: everything in one huge directory, or a hashing scheme that results in many thousands of top-level directories.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
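A minimal sketch of the kind of hashed layout being described. The two-hex-digit prefix, paths, and file name are all made up, and a real scheme would pick the fan-out to suit the data:

    # Place each file under a top-level bucket derived from a hash of its
    # name, so one huge flat directory becomes 256 smaller chunks that can
    # be backed up separately. (md5sum here; Solaris has digest -a md5.)
    f=somefile.dat
    bucket=$(printf '%s' "$f" | md5sum | cut -c1-2)   # e.g. "3f"
    mkdir -p /tank/data/$bucket
    mv "$f" /tank/data/$bucket/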
Re: [zfs-discuss] questions about block sizes
> ZFS can use block sizes up to 128k. If the data is compressed, then this
> size will be larger when decompressed.

ZFS allows you to use variable block sizes (any power of 2 from 512 bytes to 128k), and as far as I know a compressed block is written into the smallest size it fits.

-mg
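For what it's worth, you can watch this from the outside via the compressratio property; the dataset name below is just an example:

    zfs set compression=on tank/fs
    zfs set recordsize=128k tank/fs
    cp /some/large/textfile /tank/fs/
    zfs get recordsize,compressratio tank/fs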
Re: [zfs-discuss] Solaris 10U5 ZFS features?
On Sun, Apr 20, 2008 at 5:48 AM, Vincent Fox [EMAIL PROTECTED] wrote:
> I would hope at least it has that giant FSYNC patch for ZFS already
> present? We ran into this issue and it nearly killed Solaris here in our
> Data Center as a product it was such a bad experience. Fix was in 127728
> (x86) and 127729 (Sparc).

I think you have sparc and x86 swapped over. Looking at an S10U5 box I have here, 127728-06 is integrated.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [zfs-discuss] backup for x4500?
On Sun, 20 Apr 2008, Peter Tribble wrote:
>> Does anyone here have experience of this with multi-TB filesystems and any
>> of these solutions that they'd be willing to share with me please?
>
> My experience so far is that once you get past a terabyte and 10 million
> files, any backup software struggles.

What is the cause of the struggling? Does the backup host run short of RAM or CPU? If backups are incremental, is a large portion of the time spent determining the changes to be backed up? What is the relative cost of many small files vs large files?

How does 'zfs send' performance compare with a traditional incremental backup system?

Bob
======================================
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS raidz write performance: what to expect from SATA drives on ICH9R (A
Hi,

First of all, my apologies for some of my posts appearing two or even three times here. The forum seems to be acting up; although I received a Java exception for those double postings and they never appeared yesterday, they apparently still made it through eventually.

Back on topic: I fruitlessly tried to extract higher write speeds from the Seagate drives using an Addonics Silicon Image 3124 based SATA controller. I got exactly the same 21 MB/s for each drive (booted from a Knoppix CD). I was planning on contacting Seagate support about this, but in the meantime I absolutely had to start using this system, even if it meant low write speeds.

So I installed Solaris on a 1GB CF card and wanted to start configuring ZFS. I noticed that the first SATA disk was still shown with a different label by the format command (see my other post somewhere here). I tried to get rid of all disk labels (unsuccessfully), so I decided to boot Knoppix again and zero out the start and end sectors manually (erasing all GPT data).

Back to Solaris. I ran

    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0

and tried a dd while monitoring with iostat -xn 1 to see the effect of not having a slice as part of the zpool (write cache etc.). I was seeing write speeds in excess of 50MB/s per drive! Whoa! I didn't understand this at all, because 5 minutes earlier I couldn't get more than 21MB/s in Linux using block sizes up to 1048576 bytes. How could this be? I decided to destroy the zpool and try to dd from Linux once more. This is when my jaw dropped to the floor:

    [EMAIL PROTECTED]:~# dd if=/dev/zero of=/dev/sda bs=4096
    250916+0 records in
    250915+0 records out
    1027747840 bytes (1.0 GB) copied, 10.0172 s, 103 MB/s

Finally, the write speed one should expect from these drives, according to various reviews around the web. I still get a healthy 52MB/s at the end of the disk:

    # dd if=/dev/zero of=/dev/sda bs=4096 seek=18300
    dd: writing `/dev/sda': No space left on device
    143647+0 records in
    143646+0 records out
    588374016 bytes (588 MB) copied, 11.2223 s, 52.4 MB/s

But how is it possible that I didn't get these speeds earlier? This may be part of the explanation:

    [EMAIL PROTECTED]:~# dd if=/dev/zero of=/dev/sda bs=2048
    101909+0 records in
    101909+0 records out
    208709632 bytes (209 MB) copied, 9.32228 s, 22.4 MB/s

Could it be that the firmware in these drives has issues with write requests of 2048 bytes and smaller? There must be more to it, though, because I'm absolutely sure that I used larger block sizes when testing with Linux earlier (like 16384, 65536 and 1048576). It's impossible to tell, but maybe there was something fishy going on which was fixed by zeroing parts of the drives. I absolutely cannot explain it otherwise.

Anyway, I'm still not seeing much more than 50MB/s per drive from ZFS, but I suspect the 2048 vs 4096 byte write block size effect may be influencing this. Having a slice as part of the pool earlier perhaps magnified this behavior as well. Caching or swap problems are certainly not issues now. Any thoughts?

I certainly want to thank everyone once more for your co-operation!

Greetings,
Pascal
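To isolate the block-size effect, a sweep along these lines could be run against a scratch disk. This is only a sketch: /dev/sdX is a placeholder, and the run destroys whatever is on that disk.

    #!/bin/sh
    # Write ~512MB at each block size and report dd's throughput line.
    # oflag=direct bypasses the Linux page cache so the drive itself is measured.
    for bs in 2048 4096 16384 65536 1048576; do
        echo "bs=$bs"
        dd if=/dev/zero of=/dev/sdX bs=$bs count=$((536870912 / bs)) oflag=direct 2>&1 | tail -1
    done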
Re: [zfs-discuss] backup for x4500?
On Sun, Apr 20, 2008 at 4:39 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote:
> On Sun, 20 Apr 2008, Peter Tribble wrote:
>> My experience so far is that once you get past a terabyte and 10 million
>> files, any backup software struggles.
>
> What is the cause of the struggling? Does the backup host run short of RAM
> or CPU? If backups are incremental, is a large portion of the time spent
> determining the changes to be backed up? What is the relative cost of many
> small files vs large files?

It's just the fact that, while the backup completes, it can take over 24 hours. Clearly that takes you well over any backup window. It's not so much that the backup software is defective; it's an indication that traditional notions of backup need to be rethought.

I have one small (200G) filesystem that takes an hour to do an incremental with no changes. (After a while, it was obvious we don't need to do that every night.)

The real killer, I think, is the sheer number of files. For us, 10 million files isn't excessive. I have one filesystem that's likely to have getting on for 200 million files by the time the project finishes. (Gulp!)

> How does 'zfs send' performance compare with a traditional incremental
> backup system?

I haven't done that particular comparison. (zfs send isn't useful for backup on its own - it doesn't span tapes and doesn't hold an index of the files.) But I have compared it against various varieties of tar for moving data between machines, and the performance of 'zfs send' wasn't particularly good - I ended up using tar instead. (Maybe lots of smallish files again.)

For incrementals, it may be useful. But that presumes a replicated configuration (preferably with the other node at a DR site), rather than use in backups.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
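For anyone wanting to repeat that comparison, the two pipelines look roughly like this; the host, dataset, and path names are made up:

    # zfs send of a snapshot to another machine
    zfs snapshot tank/data@xfer
    zfs send tank/data@xfer | ssh otherhost zfs recv backup/data

    # roughly equivalent tar-over-ssh copy of the live filesystem
    cd /tank/data && tar cf - . | ssh otherhost 'cd /backup/data && tar xf -'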
Re: [zfs-discuss] backup for x4500?
On Sun, 20 Apr 2008, Peter Tribble wrote:
>> What is the cause of the struggling? Does the backup host run short of RAM
>> or CPU? If backups are incremental, is a large portion of the time spent
>> determining the changes to be backed up? What is the relative cost of many
>> small files vs large files?
>
> It's just the fact that, while the backup completes, it can take over 24
> hours. Clearly that takes you well over any backup window. It's not so much
> that the backup software is defective; it's an indication that traditional
> notions of backup need to be rethought.

There is no doubt about that. However, there are organizations with hundreds of terabytes online and they manage to survive somehow. I receive bug reports from people with 600K files in a single subdirectory. Terabyte-sized USB drives are available now. When you say that the backup can take over 24 hours, are you talking only about the initial backup, or incrementals as well?

> I have one small (200G) filesystem that takes an hour to do an incremental
> with no changes. (After a while, it was obvious we don't need to do that
> every night.)

That is pretty outrageous. It seems that your backup software is suspect, since it must be severely assaulting the filesystem. I am using 'rsync' (version 3.0) to do disk-to-disk network backups (with differencing) to a large Firewire type drive and have not noticed any performance issues. I do not have 10 million files though (I have about half of that).

Since zfs supports really efficient snapshots, a backup system which is aware of snapshots can take snapshots and then back up safely even if the initial dump takes several days. Really smart software could perform both the initial dump and an incremental dump simultaneously. The minimum useful incremental backup interval would still be limited to the time required to do one incremental backup.

Bob
======================================
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
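A minimal sketch of the snapshot-aware approach, running rsync against a snapshot so the source stays stable however long the copy takes; the dataset and destination names are made up:

    zfs snapshot tank/data@nightly
    # .zfs is hidden unless snapdir=visible, but the path is still accessible
    rsync -a --delete /tank/data/.zfs/snapshot/nightly/ /backup/data/
    zfs destroy tank/data@nightly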
[zfs-discuss] Periodic ZFS maintenance?
I have a 10x500 disc file server with ZFS+. Do I need to perform any sort of periodic maintenance on the filesystem to keep it in tip-top shape?

Sam
Re: [zfs-discuss] Solaris 10U5 ZFS features?
Peter Tribble wrote:
> On Sun, Apr 20, 2008 at 5:48 AM, Vincent Fox [EMAIL PROTECTED] wrote:
>> I would hope at least it has that giant FSYNC patch for ZFS already
>> present? We ran into this issue and it nearly killed Solaris here in our
>> Data Center as a product it was such a bad experience. Fix was in 127728
>> (x86) and 127729 (Sparc).
>
> I think you have sparc and x86 swapped over. Looking at an S10U5 box I have
> here, 127728-06 is integrated.

Correct (127728 is sparc and 127729 is x86). They're in the respective patch clusters now, as well as 10u5.

Rob++
-- 
Internet: [EMAIL PROTECTED]
Life: [EMAIL PROTECTED]
"They couldn't hit an elephant at this distance."
    -- Major General John Sedgwick
Re: [zfs-discuss] Periodic ZFS maintenance?
Sam wrote:
> I have a 10x500 disc file server with ZFS+. Do I need to perform any sort
> of periodic maintenance on the filesystem to keep it in tip-top shape?

No, but if there are problems, a periodic scrub will tip you off sooner rather than later.

Ian
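For example (the pool name is just a placeholder):

    zpool scrub tank        # read and verify every block, repairing what it can
    zpool status -v tank    # shows scrub progress and any errors found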
Re: [zfs-discuss] zfs filesystem metadata checksum
Thank you, this is exactly what I was looking for. This is for remote replication, so it looks like I am out of luck. Bummer.

Asa

On Apr 14, 2008, at 4:09 PM, Jeff Bonwick wrote:
> Not at present, but it's a good RFE. Unfortunately it won't be quite as
> simple as just adding an ioctl to report the dnode checksum. To see why,
> consider a file with one level of indirection: that is, it consists of a
> dnode, a single indirect block, and several data blocks. The indirect block
> contains the checksums of all the data blocks -- handy. The dnode contains
> the checksum of the indirect block -- but that's not so handy, because the
> indirect block contains more than just checksums; it also contains pointers
> to blocks, which are specific to the physical layout of the data on your
> machine. If you did remote replication using zfs send | ssh elsewhere zfs
> recv, the dnode checksum on 'elsewhere' would not be the same.
>
> Jeff
>
> On Tue, Apr 08, 2008 at 01:45:16PM -0700, asa wrote:
>> Hello all. I am looking to be able to verify my zfs backups in the most
>> minimal way, i.e. without having to md5 the whole volume. Is there a way
>> to get a checksum for a snapshot and compare it to another zfs volume,
>> containing all the same blocks, and verify they contain the same
>> information? Even when I destroy the snapshot on the source? Kind of like:
>>
>> zfs create tank/myfs
>> dd if=/dev/urandom bs=128k count=1000 of=/tank/myfs/TESTFILE
>> zfs snapshot tank/[EMAIL PROTECTED]
>> zfs send tank/[EMAIL PROTECTED] | zfs recv tank/myfs_BACKUP
>> zfs destroy tank/[EMAIL PROTECTED]
>> zfs snapshot tank/[EMAIL PROTECTED]
>> someCheckSumVodooFunc(tank/myfs)
>> someCheckSumVodooFunc(tank/myfs_BACKUP)
>>
>> Is there some zdb hackery which results in a metadata checksum usable in
>> this scenario?
>>
>> Thank you all!
>>
>> Asa
>> zfs worshiper
>> Berkeley, CA
Re: [zfs-discuss] Periodic ZFS maintenance?
On Mon, Apr 21, 2008 at 10:41:35AM +1200, Ian Collins wrote:
> Sam wrote:
>> I have a 10x500 disc file server with ZFS+. Do I need to perform any sort
>> of periodic maintenance on the filesystem to keep it in tip-top shape?
>
> No, but if there are problems, a periodic scrub will tip you off sooner
> rather than later.

Well, tip you off _and_ correct the problems if possible. I believe a long-standing RFE has been to scrub periodically in the background to ensure that correctable problems don't turn into uncorrectable ones.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
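Until something like that RFE exists, a root crontab entry is one way to approximate it; the pool name and schedule below are only examples:

    # scrub the pool every Sunday at 03:00
    0 3 * * 0 /usr/sbin/zpool scrub tank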