Re: Summary: [zfs-discuss] Poor man's backup by attaching/detaching mirror drives on a _striped_ pool?
Constantin Gonzalez wrote:

> The supported alternative would be zfs snapshot, then zfs send/receive, but this introduces the complexity of snapshot management which makes it less simple, thus less appealing to the clone-addicted admin. ... IMHO, we should investigate if something like "zpool clone" would be useful. It could be implemented as a script that recursively snapshots the source pool, then zfs send/receives it to the destination pool, then copies all properties, but the actual reason why people do mirror splitting in the first place is because of its simplicity. A "zpool clone" or a "zpool send/receive" command would be even simpler and less error-prone than the tradition of splitting mirrors, plus it could be implemented more efficiently and more reliably than a script, thus bringing real additional value to administrators.

I agree that this is the best solution. I am working on zfs send -r (RFE filed but id not handy), which will provide the features you describe above.

--matt
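For illustration, a very rough sketch of the script-based "zpool clone" described above; the pool names (tank, backup), the snapshot name and the property are invented, and since a recursive send did not exist yet, each filesystem has to be sent individually:

  # snapshot the entire source pool
  zfs snapshot -r tank@clone1
  # send each filesystem into the destination pool
  zfs send tank@clone1      | zfs receive backup/tank
  zfs send tank/home@clone1 | zfs receive backup/tank/home
  # copy properties by hand, one per filesystem, e.g.
  zfs set compression=on backup/tank/home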
Re: [zfs-discuss] Who modified my ZFS receive destination?
Constantin Gonzalez wrote:

> But at some point, zfs receive says "cannot receive: destination has been modified since most recent snapshot". I am pretty sure nobody changed anything at my destination filesystem and I also tried rolling back to an earlier snapshot on the destination filesystem to make it clean again.

As Eric noted, you should use 'zfs recv -F' to do a rollback if necessary. Also, you could use dtrace to figure out when the modification occurred, and by whom. We are also working on 'zfs diffs' (RFE filed but id not handy), which would be able to tell you what was modified.

--matt
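A sketch of both suggestions; the dataset names and mountpoint (tank/fs, backup/fs, /backup/fs) are invented for illustration:

  # force the destination back to its most recent snapshot while receiving the increment
  zfs send -i tank/fs@snap1 tank/fs@snap2 | zfs receive -F backup/fs

  # watch for anything writing under the destination mountpoint between receives
  dtrace -n 'syscall::write:entry /fds[arg0].fi_mount == "/backup/fs"/ { @[execname, pid] = count(); }'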
Re: [zfs-discuss] Re: Benchmarking
Anton B. Rang wrote:

>> I time mkfile'ing a 1 gb file on ufs and copying it [...] then did the same thing on each zfs partition. Then I took snapshots, copied files, more snapshots, keeping timings all the way. [ ... ] Is this a sufficient, valid test?
>
> If your applications do that -- manipulate large files, primarily copying them -- then it may be. If your applications have other access patterns, probably not. If you're concerned about whether you should put ZFS into production, then you should put it onto your test system and run your real applications on it for a while to qualify it (just as you should for any other file system or hardware).

I couldn't agree more. That said, I would be extremely surprised if the presence of snapshots or clones had any impact whatsoever on the performance of accessing a given filesystem. I've never seen anything like that.

--matt
Re: [zfs-discuss] crash
Opensolaris Aserver wrote:

> We tried to replicate a snapshot via the built-in send receive zfs tools. ...
> ZFS: bad checksum (read on unknown off 0: zio 3017b300 [L0 ZFS plain file] 2L/2P DVA[0]=0:3b98ed1e800:25800 fletcher2 uncompressed LE contiguous birth=806063 fill=1 cksum=a487e32d ...
> errors: Permanent errors have been detected in the following files:
> stor/[EMAIL PROTECTED]:01:00:/1003/kreos11/HB1030/C_Root/Documents and Settings/bvp/My Documents/My Pictures/confidential/tconfidential/confidential/96 ...
> Soon we decided to destroy this snapshot, and then started another replication. This time the server crashed again :-(

So, some of your data has been lost due to hardware failure, where the hardware has silently corrupted your data. ZFS has detected this. If you were to read this data (other than via 'zfs send'), you would get EIO, and as you note, 'zpool status' shows which files are affected. The 'zfs send' protocol isn't able to tell the other side "this part of this file is corrupt", so it panics. This is a bug.

The reason you're seeing the panic when 'zfs send'-ing the next snapshot is that the (corrupt) data is shared between multiple snapshots. You can work around this by deleting or overwriting the affected files, then taking and sending a new snapshot.

--matt
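A minimal sketch of that workaround; the pool name (stor), snapshot name and remote host are placeholders, and the damaged paths are whatever 'zpool status -v' actually reports:

  # list the files with permanent errors
  zpool status -v stor
  # delete or overwrite each affected file, then snapshot and send the clean state
  rm "/stor/.../My Pictures/.../96"
  zfs snapshot -r stor@clean1
  zfs send stor@clean1 | ssh backuphost zfs receive backuppool/stor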
Re: [zfs-discuss] Permanently removing vdevs from a pool
Robert Milkowski wrote:

> Hello George,
> Friday, April 20, 2007, 7:37:52 AM, you wrote:
> GW> This is a high priority for us and is actively being worked.
> GW> Vague enough for you. :-) Sorry I can't give you anything more exact than that.
>
> Can you at least give us the feature list being developed? Some answers to questions like:
> 1. evacuating a vdev resulting in a smaller pool for all raid configs - ?
> 2. adding new vdev and rewriting all existing data to new larger stripe - ?
> 3. expanding stripe width for raid-z1 and raid-z2 - ?
> 4. live conversion between different raid kinds on the same disk set - ?

No, you will not be able to change the number of disks in a raid-z set (I think that answers questions 1-4). There is no plan to implement this feature.

> 5. live data migration from one disk set to another - ?

Yes -- just add the new disk set, then remove the old disk set.

> 6. rewriting data in a dataset (not entire pool) after changing some parameters like compression, encryption, ditto blocks, ... so it will affect also already written data in a dataset. This should be both pool wise and data set wise - ?

Yes.

> 7. de-fragmentation of a pool - ?

Yes.

> 8. anything else ?

--matt
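A hypothetical sketch of the migration described in the answer to question 5, with invented device names; note that at the time of this thread 'zpool remove' only handled hot spares, so the second step is speculative:

  # add the new disk set as an additional top-level vdev
  zpool add tank raidz c5t0d0 c5t1d0 c5t2d0
  # the missing piece is evacuating and removing the old vdev; once implemented
  # it might look roughly like:
  #   zpool remove tank raidz-0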
Re: [zfs-discuss] Permanently removing vdevs from a pool
Matty wrote:

> On 4/20/07, George Wilson wrote:
>> This is a high priority for us and is actively being worked. Vague enough for you. :-) Sorry I can't give you anything more exact than that.
>
> Hi George,
> If ZFS is supposed to be part of opensolaris, then why can't the community get additional details?

What additional details would you like? We are not withholding anything -- George answered the question to the best of his knowledge. We simply aren't sure when exactly this feature will be completed.

--matt
[zfs-discuss] Re: Re: Re: gzip compression throttles system?
A couple more questions here.

[mpstat]
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0 3109  3616  316  196    5   17   48   45   245    0  85   0  15
   1    0   0 3127  3797  592  217    4   17   63   46   176    0  84   0  15
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0 3051  3529  277  201    2   14   25   48   216    0  83   0  17
   1    0   0 3065  3739  606  195    2   14   37   47   153    0  82   0  17
 CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
   0    0   0 3011  3538  316  242    3   26   16   52   202    0  81   0  19
   1    0   0 3019  3698  578  269    4   25   23   56   309    0  83   0  17
 ...

The largest numbers from mpstat are for interrupts and cross calls. What does intrstat(1M) show? Have you run dtrace to determine the most frequent cross-callers?

As far as I understand it, we have these frequent cross calls because

1. the test was run on an x86 MP machine
2. the kernel zmod / gzip code allocates and frees four big chunks of memory (4 * 65544 bytes) per zio_write_compress (gzip) call [1]

Freeing these big memory chunks generates lots of cross calls, because page table entries for that memory are invalidated on all cpus (cores). Of course this effect cannot be observed on a uniprocessor machine (one cpu / core). And apparently it isn't the root cause for the bad interactive performance with this test; the bad interactive performance can also be observed on single-cpu / single-core x86 machines.

A possible optimization for MP machines: use some kind of kmem_cache for the gzip buffers, so that these buffers could be reused between gzip compression calls.

[1] allocations per zio_write_compress() / gzip_compress() call:

  1   6642  kobj_alloc:entry  sz 5936,  fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   6642  kobj_alloc:entry  sz 65544, fl 1001
  1   5769  kobj_free:entry   fffeeb307000: sz 65544
  1   5769  kobj_free:entry   fffeeb2f5000: sz 65544
  1   5769  kobj_free:entry   fffeeb2e3000: sz 65544
  1   5769  kobj_free:entry   fffeeb2d1000: sz 65544
  1   5769  kobj_free:entry   fffed1c42000: sz 5936
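The "most frequent cross-callers" question above can be answered with a stock DTrace one-liner; this is a generic sketch rather than anything from the original thread:

  # count cross-calls by kernel stack (run for a few seconds, then Ctrl-C)
  dtrace -n 'sysinfo:::xcalls { @[stack()] = count(); }'

  # or simply by the process on whose behalf they occur
  dtrace -n 'sysinfo:::xcalls { @[execname] = count(); }'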
[zfs-discuss] Re: Re: Re: gzip compression throttles system?
with recent bits ZFS compression is now handled concurrently, with many CPUs working on different records. So this load will burn more CPUs and achieve its results (compression) faster. So the observed pauses should be consistent with that of a load generating high system time. The assumption is that compression now goes faster than when it was single-threaded. Is this undesirable? We might seek a way to slow down compression in order to limit the system load.

According to this dtrace script

#!/usr/sbin/dtrace -s

sdt:genunix::taskq-enqueue
/((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)&`zio_write_compress/
{
        @where[stack()] = count();
}

tick-5s
{
        printa(@where);
        trunc(@where);
}

... I see bursts of ~ 1000 zio_write_compress() [gzip] taskq calls enqueued into the spa_zio_issue taskq by zfs`spa_sync() and its children:

  0  76337  :tick-5s
  ...
              zfs`zio_next_stage+0xa1
              zfs`zio_wait_for_children+0x5d
              zfs`zio_wait_children_ready+0x20
              zfs`zio_next_stage_async+0xbb
              zfs`zio_nowait+0x11
              zfs`dbuf_sync_leaf+0x1b3
              zfs`dbuf_sync_list+0x51
              zfs`dbuf_sync_indirect+0xcd
              zfs`dbuf_sync_list+0x5e
              zfs`dbuf_sync_indirect+0xcd
              zfs`dbuf_sync_list+0x5e
              zfs`dnode_sync+0x214
              zfs`dmu_objset_sync_dnodes+0x55
              zfs`dmu_objset_sync+0x13d
              zfs`dsl_dataset_sync+0x42
              zfs`dsl_pool_sync+0xb5
              zfs`spa_sync+0x1c5
              zfs`txg_sync_thread+0x19a
              unix`thread_start+0x8
             1092

  0  76337  :tick-5s

It seems that after such a batch of compress requests is submitted to the spa_zio_issue taskq, the kernel is busy for several seconds working on these taskq entries. It seems that this blocks all other taskq activity inside the kernel...

This dtrace script counts the number of zio_write_compress() calls enqueued / execed by the kernel per second:

#!/usr/sbin/dtrace -qs

sdt:genunix::taskq-enqueue
/((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)&`zio_write_compress/
{
        this->tqe = (taskq_ent_t *)arg1;
        @enq[this->tqe->tqent_func] = count();
}

sdt:genunix::taskq-exec-end
/((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)&`zio_write_compress/
{
        this->tqe = (taskq_ent_t *)arg1;
        @exec[this->tqe->tqent_func] = count();
}

tick-1s
{
        /* printf("%Y\n", walltimestamp); */
        printf("TS(sec): %u\n", timestamp / 1000000000);
        printa("enqueue %a: %@d\n", @enq);
        printa("exec    %a: %@d\n", @exec);
        trunc(@enq);
        trunc(@exec);
}

I see bursts of zio_write_compress() calls enqueued / execed, and periods of time where no zio_write_compress() taskq calls are enqueued or execed.
10# ~jk/src/dtrace/zpool_gzip7.d
TS(sec): 7829
TS(sec): 7830
TS(sec): 7831
TS(sec): 7832
TS(sec): 7833
TS(sec): 7834
TS(sec): 7835
enqueue zfs`zio_write_compress: 1330
exec    zfs`zio_write_compress: 1330
TS(sec): 7836
TS(sec): 7837
TS(sec): 7838
TS(sec): 7839
TS(sec): 7840
TS(sec): 7841
TS(sec): 7842
TS(sec): 7843
TS(sec): 7844
enqueue zfs`zio_write_compress: 1116
exec    zfs`zio_write_compress: 1116
TS(sec): 7845
TS(sec): 7846
TS(sec): 7847
TS(sec): 7848
TS(sec): 7849
TS(sec): 7850
TS(sec): 7851
TS(sec): 7852
TS(sec): 7853
TS(sec): 7854
TS(sec): 7855
TS(sec): 7856
TS(sec): 7857
enqueue zfs`zio_write_compress: 932
exec    zfs`zio_write_compress: 932
TS(sec): 7858
TS(sec): 7859
TS(sec): 7860
TS(sec): 7861
TS(sec): 7862
TS(sec): 7863
TS(sec): 7864
TS(sec): 7865
TS(sec): 7866
TS(sec): 7867
enqueue zfs`zio_write_compress: 5
exec    zfs`zio_write_compress: 5
TS(sec): 7868
enqueue zfs`zio_write_compress: 774
exec    zfs`zio_write_compress: 774
TS(sec): 7869
TS(sec): 7870
TS(sec): 7871
TS(sec): 7872
TS(sec): 7873
TS(sec): 7874
TS(sec): 7875
TS(sec): 7876
enqueue zfs`zio_write_compress: 653
exec    zfs`zio_write_compress: 653
TS(sec): 7877
TS(sec): 7878
TS(sec): 7879
TS(sec): 7880
TS(sec): 7881

And a final dtrace script, which monitors scheduler activity while filling a gzip compressed pool:

#!/usr/sbin/dtrace -qs

sched:::off-cpu,
sched:::on-cpu,
sched:::remain-cpu,
sched:::preempt
{
        /* @[probename, stack()] = count(); */
        @[probename] = count();
}

tick-1s
{
        printf("%Y", walltimestamp);
        printa(@);
        trunc(@);
}

It shows periods of time with absolutely *no* scheduling activity (I guess this is when the spa_zio_issue taskq is working on such a big batch of submitted gzip compression calls):

21# ~jk/src/dtrace/zpool_gzip9.d
2007 May  6 21:38:12
  preempt          13
  off-cpu         808
  on-cpu          808
2007
[zfs-discuss] zdb -l goes wild about the labels
running a recent patched s10 system, zfs version 3, attempting to dump the label information using zdb when the pool is online doesn't seem to give a reasonable information, any particular reason for this ?

# zpool status
  pool: blade-mirror-pool
 state: ONLINE
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        blade-mirror-pool    ONLINE       0     0     0
          mirror             ONLINE       0     0     0
            c2t12d0          ONLINE       0     0     0
            c2t13d0          ONLINE       0     0     0

errors: No known data errors

  pool: blade-single-pool
 state: ONLINE
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        blade-single-pool    ONLINE       0     0     0
          c2t14d0            ONLINE       0     0     0

errors: No known data errors

# zdb -l /dev/dsk/c2t12d0
LABEL 0
LABEL 1
failed to unpack label 1
LABEL 2
LABEL 3

# zdb -l /dev/rdsk/c2t12d0
LABEL 0
LABEL 1
failed to unpack label 1
LABEL 2
LABEL 3

# zdb -l /dev/dsk/c2t14d0
LABEL 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3
#
Re: [zfs-discuss] zdb -l goes wild about the labels
On May 7, 2007, at 7:11 AM, Frank Batschulat wrote:

> running a recent patched s10 system, zfs version 3, attempting to dump the label information using zdb when the pool is online doesn't seem to give a reasonable information, any particular reason for this ?
>
> [zpool status output for blade-mirror-pool and blade-single-pool quoted in full in the previous message]
>
> # zdb -l /dev/dsk/c2t12d0

Try giving it:

# zdb -l /dev/dsk/c2t12d0s0

eric

> LABEL 0
> LABEL 1
> failed to unpack label 1
> LABEL 2
> LABEL 3
>
> # zdb -l /dev/rdsk/c2t12d0
> LABEL 0
> LABEL 1
> failed to unpack label 1
> LABEL 2
> LABEL 3
>
> # zdb -l /dev/dsk/c2t14d0
> LABEL 0
> LABEL 1
> failed to unpack label 1
> LABEL 2
> failed to unpack label 2
> LABEL 3
> failed to unpack label 3
> #
[zfs-discuss] Zpool, RaidZ how it spreads its disk load?
Greetings learned ZFS geeks and gurus,

Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I do see.

I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me a 465G volume.

# zpool create tp raidz c4t600 ... (8 disks worth of zpool)
# zfs create tp/pool
# zfs set recordsize=8k tp/pool
# zfs set mountpoint=/pool tp/pool

I then create a 100G data file that is created by sequentially writing 64k blocks to the test data file. When I then issue a

# zpool iostat -v tp 10

I see the following strange behaviour. I see anywhere from up to 16 iterations (ie 160 seconds) of the following, where there are only writes to 2 of the 8 disks:

                                      capacity     operations    bandwidth
pool                                used  avail   read  write   read  write
----------------------------------  -----  -----  -----  -----  -----  -----
testpool                            29.7G   514G      0  2.76K      0  22.1M
  raidz1                            29.7G   514G      0  2.76K      0  22.1M
    c4t600C0FF00A74531B659C5C00d0s6     -      -      0      0      0      0
    c4t600C0FF00A74533F3CF1AD00d0s6     -      -      0      0      0      0
    c4t600C0FF00A74534C5560FB00d0s6     -      -      0      0      0      0
    c4t600C0FF00A74535E50E5A400d0s6     -      -      0  1.38K      0  2.76M
    c4t600C0FF00A74537C1C061500d0s6     -      -      0      0      0      0
    c4t600C0FF00A745343B08C4B00d0s6     -      -      0      0      0      0
    c4t600C0FF00A745379CB90B600d0s6     -      -      0      0      0      0
    c4t600C0FF00A74530237AA9300d0s6     -      -      0  1.38K      0  2.76M
----------------------------------  -----  -----  -----  -----  -----  -----

During these periods, my data file does not grow in size, but then I see writes to all of the disks like the following:

                                      capacity     operations    bandwidth
pool                                used  avail   read  write   read  write
----------------------------------  -----  -----  -----  -----  -----  -----
testpool                            64.0G   480G      0  1.45K      0  11.6M
  raidz1                            64.0G   480G      0  1.45K      0  11.6M
    c4t600C0FF00A74531B659C5C00d0s6     -      -      0    246      0  8.22M
    c4t600C0FF00A74533F3CF1AD00d0s6     -      -      0    220      0  8.23M
    c4t600C0FF00A74534C5560FB00d0s6     -      -      0    254      0  8.20M
    c4t600C0FF00A74535E50E5A400d0s6     -      -      0    740      0  1.45M
    c4t600C0FF00A74537C1C061500d0s6     -      -      0    299      0  8.21M
    c4t600C0FF00A745343B08C4B00d0s6     -      -      0    284      0  8.21M
    c4t600C0FF00A745379CB90B600d0s6     -      -      0    266      0  8.22M
    c4t600C0FF00A74530237AA9300d0s6     -      -      0    740      0  1.45M
----------------------------------  -----  -----  -----  -----  -----  -----

And my data file will increase in size, but also notice, in the above, those disks that were being written to before have a load that is consistent with the previous example.

For background, the server and the storage are dedicated solely to this testing, and there are no other applications being run at this time. I thought that RaidZ would spread its load across all disks somewhat evenly. Can someone explain this result? I can consistently reproduce it as well.

Thanks
-Tony
[zfs-discuss] Re: Zpool, RaidZ how it spreads its disk load?
Something I was wondering about myself. What does the raidz toplevel (pseudo?) device do? Does it just indicate to the SPA, or whatever module is responsible, to additionally generate parity? The thing I'd like to know is if variable block sizes, dynamic striping et al still applies to a single RAIDZ device, too. Thanks! -mg
Re: [zfs-discuss] Motley group of discs?
Hi Lee,

You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/

Consider this setup for your other disks, which are: 250, 200 and 160 GB drives, and an external USB 2.0 600 GB drive

250GB = disk1
200GB = disk2
160GB = disk3
600GB = disk4 (spare)

I include a spare in this setup because you want to be protected from a disk failure. Since the replacement disk must be equal to or larger than the disk to replace, I think this is the best (safest) solution.

zpool create pool raidz disk1 disk2 disk3 spare disk4

This setup provides less capacity but better safety, which is probably important for older disks. Because of the spare disk requirement (must be equal to or larger in size), I don't see a better arrangement. I hope someone else can provide one.

Your questions remind me that I need to provide add'l information about the current ZFS spare feature...

Thanks,
Cindy

Lee Fyock wrote:

I didn't mean to kick up a fuss. I'm reasonably zfs-savvy in that I've been reading about it for a year or more. I'm a Mac developer and general geek; I'm excited about zfs because it's new and cool. At some point I'll replace my old desktop machine with something new and better -- probably when Unreal Tournament 2007 arrives, necessitating a faster processor and better graphics card. :-) In the mean time, I'd like to hang out with the system and drives I have. As mike said, my understanding is that zfs would provide error correction until a disc fails, if the setup is properly done. That's the setup for which I'm requesting a recommendation. I won't even be able to use zfs until Leopard arrives in October, but I want to bone up so I'll be ready when it does. Money isn't an issue here, but neither is creating an optimal zfs system. I'm curious what the right zfs configuration is for the system I have. Thanks! Lee

On May 4, 2007, at 7:41 PM, Al Hopper wrote:

On Fri, 4 May 2007, mike wrote: Isn't the benefit of ZFS that it will allow you to use even the most unreliable disks and be able to inform you when they are attempting to corrupt your data?

Yes - I won't argue that ZFS can be applied exactly as you state above. However, ZFS is no substitute for bad practices that include:
- not proactively replacing mechanical components *before* they fail
- not having maintenance policies in place

To me it sounds like he is a SOHO user; may not have a lot of funds to go out and swap hardware on a whim like a company might.

You may be right - but you're simply guessing. The original system probably cost around $3k (?? I could be wrong). So what I'm suggesting, that he spend ~ $300, represents ~ 10% of the original system cost. Since the OP asked for advice, I've given him the best advice I can come up with. I've also encountered many users who don't keep up to date with current computer hardware capabilities and pricing, and who may be completely unaware that you can purchase two 500Gb disk drives, with a 5 year warranty, for around $300. And possibly less if you check out Frys weekly bargain disk drive offers. Now consider the total cost of ownership solution I recommended: 500 gigabytes of storage, coupled with ZFS, which translates into $60/year for 5 years of error free storage capability. Can life get any better than this! :) Now contrast my recommendation with what you propose - re-targeting a bunch of older disk drives, which incorporate older, less reliable technology, with a view to saving money. How much is your time worth?
How many hours will it take you to recover from a failure of one of these older drives and the accompanying increased risk of data loss? If the ZFS-savvy OP comes back to this list and says "Al's solution is too expensive" I'm perfectly willing to rethink my recommendation. For now, I believe it to be the best recommendation I can devise.

ZFS in my opinion is well-suited for those without access to continuously upgraded hardware and expensive fault-tolerant hardware-based solutions. It is ideal for home installations where people think their data is safe until the disk completely dies. I don't know how many non-savvy people I have helped over the years who have no data protection, and ZFS could offer them at least some fault-tolerance and protection against corruption, and could help notify them when it is time to shut off their computer and call someone to come swap out their disk and move their data to a fresh drive before it's completely failed...

Agreed. One piece-of-the-puzzle that's missing right now, IMHO, is a reliable, two port, low-cost PCI SATA disk controller. A solid/de-bugged 3124 driver would go a long way to ZFS-enabling a bunch of cost-constrained ZFS users. And, while I'm working this hardware wish list, please ... a PCI-Express based version of the SuperMicro AOC-SAT2-MV8 8-port Marvell based disk controller
Re: [zfs-discuss] Zpool, RaidZ how it spreads its disk load?
On 5/7/07, Tony Galway wrote:

> Greetings learned ZFS geeks and gurus,
> Yet another question comes from my continued ZFS performance testing. This has to do with zpool iostat, and the strangeness that I do see. I've created an eight (8) disk raidz pool from a Sun 3510 fibre array giving me a 465G volume.
>
> # zpool create tp raidz c4t600 ... (8 disks worth of zpool)
> # zfs create tp/pool
> # zfs set recordsize=8k tp/pool
> # zfs set mountpoint=/pool tp/pool

This is a known problem, and is an interaction between the alignment requirements imposed by RAID-Z and the small recordsize you have chosen. You may effectively avoid it in most situations by choosing a RAID-Z stripe width of 2^n+1. For a fixed record size, this will work perfectly well.

Even so, there will still be cases where small files will cause problems for RAID-Z. While it does not affect many people right now, I think it will become a more serious issue when disks move to 4k sectors.

I think the reason for the alignment constraint was to ensure that the stranded space was accounted for, otherwise it would cause problems as the pool fills up. (Consider a 3 device RAID-Z, where only one data sector and one parity sector are written; the third sector in that stripe is essentially dead space.)

Would it be possible (or worthwhile) to make the allocator aware of this dead space, rather than imposing the alignment requirements? Something like a concept of "tentatively allocated" space in the allocator, which would be managed based on the requirements of the vdev. Using such a mechanism, it could coalesce the space if possible for allocations. Of course, it would also have to convert the misaligned bits back into tentatively allocated space when blocks are freed.

While I expect this may require changes which would not easily be backward compatible, the alignment on RAID-Z has always felt a bit wrong. While the more severe effects can be addressed by also writing out the dead space, that will not address uneven placement of data and parity across the stripes.

Any thoughts?

Chris
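The "2^n+1" suggestion above, written out for the same 8K recordsize (device names abbreviated, and this is only an illustration of the width, not a tuning recommendation): with 4 data disks plus 1 parity disk, each 8K record splits evenly into four 2K columns, so no roundup padding is needed.

  # 5-wide raidz = 2^2 data disks + 1 parity disk
  zpool create tp raidz c4t...1d0 c4t...2d0 c4t...3d0 c4t...4d0 c4t...5d0
  zfs create tp/pool
  zfs set recordsize=8k tp/pool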
[zfs-discuss] Re: Motley group of discs?
Given the odd sizes of your drives, there might not be one, unless you are willing to sacrifice capacity.

I think for the SOHO and home user scenarios, it might be of advantage if the disk drivers offered unified APIs to read out and interpret disk drive diagnostics, like SMART on ATA and whatever there is for SCSI/SAS, so that ZFS can react on it. Be it automatically invoking spare discs or showing warnings in the pool status. Or even automatically evacuating the device (given that ZFS will support it at some point) depending on the severity, should there be enough space on the other disks. For instance going top to bottom through the filesystems by importance, which would however require an "importance" attribute.

-mg
[zfs-discuss] Re: Zpool, RaidZ how it spreads its disk load?
What are these alignment requirements? I would have thought that at the lowest level, parity stripes would have been allocated traditionally, while treating the remaining usable space like a JBOD the level above, thus not being subject to any constraints (apart from when getting close to the parity stripe boundaries).

-mg
Re: [zfs-discuss] Motley group of discs?
Cindy,

Thanks so much for the response -- this is the first one that I consider an actual answer. :-)

I'm still unclear on exactly what I end up with. I apologize in advance for my ignorance -- the ZFS admin guide assumes knowledge that I don't yet have.

I assume that disk4 is a hot spare, so if one of the other disks dies, it'll kick into active use. Is data immediately replicated from the other surviving disks to disk4?

What usable capacity do I end up with? 160 GB (the smallest disk) * 3? Or less, because raidz has parity overhead? Or more, because that overhead can be stored on the larger disks?

If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then?

Thanks!
Lee

On May 7, 2007, at 2:44 PM, [EMAIL PROTECTED] wrote:

> Hi Lee,
> You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/
> Consider this setup for your other disks, which are: 250, 200 and 160 GB drives, and an external USB 2.0 600 GB drive
> 250GB = disk1
> 200GB = disk2
> 160GB = disk3
> 600GB = disk4 (spare)
> I include a spare in this setup because you want to be protected from a disk failure. Since the replacement disk must be equal to or larger than the disk to replace, I think this is the best (safest) solution.
> zpool create pool raidz disk1 disk2 disk3 spare disk4
> This setup provides less capacity but better safety, which is probably important for older disks. Because of the spare disk requirement (must be equal to or larger in size), I don't see a better arrangement. I hope someone else can provide one.
> Your questions remind me that I need to provide add'l information about the current ZFS spare feature...
> Thanks, Cindy
Re: [zfs-discuss] Motley group of discs?
On 7-May-07, at 3:44 PM, [EMAIL PROTECTED] wrote:

> Hi Lee,
> You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/

Bearing in mind that his machine is a G4 PowerPC. When Solaris 10 is ported to this platform, please let me know, too.

--Toby
Re: [zfs-discuss] Motley group of discs?
Toby Thain wrote:

> On 7-May-07, at 3:44 PM, [EMAIL PROTECTED] wrote:
>> Hi Lee,
>> You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/
> Bearing in mind that his machine is a G4 PowerPC. When Solaris 10 is ported to this platform, please let me know, too.

For Solaris on PowerPC, it's probably easiest to just monitor this project: http://www.opensolaris.org/os/community/power_pc/

-Luke
Re: [zfs-discuss] Zpool, RaidZ how it spreads its disk load?
On 5/7/07, Chris Csanady wrote:

> On 5/7/07, Tony Galway wrote:
>> [8-disk raidz test setup with recordsize=8k, quoted in full in the previous messages]
>
> This is a known problem, and is an interaction between the alignment requirements imposed by RAID-Z and the small recordsize you have chosen. You may effectively avoid it in most situations by choosing a RAID-Z stripe width of 2^n+1. For a fixed record size, this will work perfectly well.

Well, an alignment issue may be the case for the second iostat output, but not for the first. I'd suspect in the first case the I/O being seen is the syncing of the transaction group and associated block pointers to the RAID (though I could be very wrong on this).

Also, I'm not entirely sure about your formula (how can you choose a stripe width that's not a power of 2?). For an 8 disk single-parity RAID, data is going to be written to 7 disks and parity to 1. If each disk block is 512 bytes, then 128 disk blocks will be written for each 64k filesystem block. This will require 18 rows (and a bit of the 19th) on the 7 data disks. Therefore we have a requirement for 128 blocks of data + 19 blocks of parity = 147 blocks. Now if we take into account the alignment requirement, it says that the number of blocks written must equal a multiple of (nparity + 1). So 148 blocks will be written. 148 % 8 = 4. This means that on each successive 64k write the 'extra' roundup block will alternate between one disk and another 4 disks apart (which happens to be just what we see).

> Even so, there will still be cases where small files will cause problems for RAID-Z. While it does not affect many people right now, I think it will become a more serious issue when disks move to 4k sectors.

True. But when disks move to 4k sectors they will be on the order of terabytes in size. It would probably be more pain than it's worth to try to efficiently pack these. (And it's very likely that your filesystem and per-file block size will be at least 4k.)

> I think the reason for the alignment constraint was to ensure that the stranded space was accounted for, otherwise it would cause problems as the pool fills up. (Consider a 3 device RAID-Z, where only one data sector and one parity sector are written; the third sector in that stripe is essentially dead space.)

Indeed. As Adam explained here:

http://www.opensolaris.org/jive/thread.jspa?threadID=26115&tstart=0

it specifically pertains to what happens if you allow an odd number of disk blocks to be written: you then free that block and try to fill the space with 512-byte fs blocks -- you get a single 512-byte hole that you can't fill.

> Would it be possible (or worthwhile) to make the allocator aware of this dead space, rather than imposing the alignment requirements? Something like a concept of "tentatively allocated" space in the allocator, which would be managed based on the requirements of the vdev. Using such a mechanism, it could coalesce the space if possible for allocations. Of course, it would also have to convert the misaligned bits back into tentatively allocated space when blocks are freed.

It would add complexity, and this roundup only occurs in the RAID-Z vdev.
As the metaslab/space allocator doesn't have any idea about the on-disk layout, it wouldn't be able to say whether successive single free blocks in the space map are on the same/different disks -- and this would further add to the complexity of data/parity allocation within the RAID-Z vdev itself.

> While I expect this may require changes which would not easily be backward compatible, the alignment on RAID-Z has always felt a bit wrong. While the more severe effects can be addressed by also writing out the dead space, that will not address uneven placement of data and parity across the stripes.

I've also had issues with this (under a slightly different guise). I've implemented a rather naive raidz implementation based on the current implementation which allows you to use all the disk space on an array of mismatched disks. What I've done is use the grid portion of the block pointer to specify a RAID 'version' number (of which you are currently allowed 255 (0 being reserved for the current layout)). I've then organized it such that metaslab_init is specialised in the raidz vdev (a la vdev_raidz_asize()) and allocates the metaslab as before, but forces a new metaslab when a boundary is reached that would alter the number of disks in a stripe. This increases the number of metaslabs by O(number of disks). It also means that you need to psize_to_asize slightly later in the metaslab allocation
Re: [zfs-discuss] Motley group of discs?
Lee,

Yes, the hot spare (disk4) should kick in if another disk in the pool fails, and yes, the data is moved to disk4.

You are correct: 160 GB (the smallest disk) * 3, + raidz parity info.

Here's the size of a raidz pool comprised of 3 136-GB disks:

# zpool list
NAME    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
pool    408G     98K    408G     0%  ONLINE     -
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
pool  89.9K   267G  32.6K  /pool

The pool is 408GB in size but usable space in the pool is 267GB.

If you added the 600GB disk to the pool, then you'll still lose out on the extra capacity because of the smaller disks, which is why I suggested using it as a spare.

Regarding this:

> If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then?

I don't have any add'l ideas but I still recommend going with a spare.

Cindy

Lee Fyock wrote:

> Cindy,
> Thanks so much for the response -- this is the first one that I consider an actual answer. :-)
> I'm still unclear on exactly what I end up with. I apologize in advance for my ignorance -- the ZFS admin guide assumes knowledge that I don't yet have.
> I assume that disk4 is a hot spare, so if one of the other disks dies, it'll kick into active use. Is data immediately replicated from the other surviving disks to disk4?
> What usable capacity do I end up with? 160 GB (the smallest disk) * 3? Or less, because raidz has parity overhead? Or more, because that overhead can be stored on the larger disks?
> If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then?
> Thanks!
> Lee
> On May 7, 2007, at 2:44 PM, [EMAIL PROTECTED] wrote:
>> [Cindy's raidz + spare suggestion, quoted in full in the previous messages]
Re: [zfs-discuss] Motley group of discs? (doing it right, or right now)
I think it will be in the next.next (10.6) OSX, we just need to get apple to stop playing with their silly cell phone (that I cant help but want, damn them!). I have similar situation at home, but what I do is use Solaris 10 on a cheapish x86 box with 6 400gb IDE/SATA disks, I then make them into ISCSI targets and use that free GlobalSAN initiator ([EMAIL PROTECTED]). I once was like you, had 5 USB/Firewire drives hanging off everything and eventually I just got fed up with the mess of cables and wall warts. Perhaps my method of putting redundant and fast storage isn't as easy to achieve to everyone else. If you want more details about my setup, just email me directly, I don't mind :) -Andy On 5/7/07 4:48 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Lee, Yes, the hot spare (disk4) should kick if another disk in the pool fails and yes, the data is moved to disk4. You are correct: 160 GB (the smallest disk) * 3 + raidz parity info Here's the size of raidz pool comprised of 3 136-GB disks: # zpool list NAMESIZEUSED AVAILCAP HEALTH ALTROOT pool408G 98K408G 0% ONLINE - # zfs list NAME USED AVAIL REFER MOUNTPOINT pool 89.9K 267G 32.6K /pool The pool is 408GB in size but usable space in the pool is 267GB. If you added the 600GB disk to the pool, then you'll still lose out on the extra capacity because of the smaller disks, which is why I suggested using it as a spare. Regarding this: If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then? I don't have any add'l ideas but I still recommend going with a spare. Cindy Lee Fyock wrote: Cindy, Thanks so much for the response -- this is the first one that I consider an actual answer. :-) I'm still unclear on exactly what I end up with. I apologize in advance for my ignorance -- the ZFS admin guide assumes knowledge that I don't yet have. I assume that disk4 is a hot spare, so if one of the other disks die, it'll kick into active use. Is data immediately replicated from the other surviving disks to disk4? What usable capacity do I end up with? 160 GB (the smallest disk) * 3? Or less, because raidz has parity overhead? Or more, because that overhead can be stored on the larger disks? If I didn't need a hot spare, but instead could live with running out and buying a new drive to add on as soon as one fails, what configuration would I use then? Thanks! Lee On May 7, 2007, at 2:44 PM, [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Hi Lee, You can decide whether you want to use ZFS for a root file system now. You can find this info here: http://opensolaris.org/os/community/zfs/boot/ Consider this setup for your other disks, which are: 250, 200 and 160 GB drives, and an external USB 2.0 600 GB drive 250GB = disk1 200GB = disk2 160GB = disk3 600GB = disk4 (spare) I include a spare in this setup because you want to be protected from a disk failure. Since the replacement disk must be equal to or larger than the disk to replace, I think this is best (safest) solution. zpool create pool raidz disk1 disk2 disk3 spare disk4 This setup provides less capacity but better safety, which is probably important for older disks. Because of the spare disk requirement (must be equal to or larger in size), I don't see a better arrangement. I hope someone else can provide one. Your questions remind me that I need to provide add'l information about the current ZFS spare feature... 
Thanks, Cindy
[zfs-discuss] ZFS raid on removable media for backups/temporary use possible?
I've been using long SATA cables routed out through the case to a home built chassis with its own power supply for a year now. Not even eSATA. That part works well. Substitute this for USB/Firewire/SCSI/USB thumb drives. It's really the same problem. Ok, now you want to deal with a ZFS zpool raid on multiple(?) removable drives. How well does ZFS work on removable media? In a RAID configuration? Are there issues with matching device names to disks?
Re: [zfs-discuss] ZFS raid on removable media for backups/temporary use possible?
Tom Buskey wrote:

> How well does ZFS work on removable media? In a RAID configuration? Are there issues with matching device names to disks?

I've had a zpool with 4 250Gb IDE drives in three places recently:

- in an external 4-bay Firewire case, attached to a Sparc box
- inside a dual-Opteron white box, connected to a 2-channel add-in IDE controller
- inside the dual-Opteron, connected via 4 IDE-to-SATA convertors to the motherboard's built-in SATA controller

In each case, once 'format' found the drives, ZFS was easily able to import the pool without any fuss or issues. Performance was miserable when running off the add-in IDE controller, but great in the other two cases. As far as I can see, this stuff generally just works.

Rob T
Re: [zfs-discuss] ZFS raid on removable media for backups/temporary use possible?
There's a video put out by some Sun people in Germany (IIRC); they made several 4-device RAID-Zs on 3 USB hubs using a total of 12 USB thumbdrives. At one point they pulled all the USB sticks, shuffled them and then re-imported the pool. Worked like butter.

Corey

On May 7, 2007, at 1:30 PM, Tom Buskey wrote:

> I've been using long SATA cables routed out through the case to a home built chassis with its own power supply for a year now. Not even eSATA. That part works well. Substitute this for USB/Firewire/SCSI/USB thumb drives. It's really the same problem. Ok, now you want to deal with a ZFS zpool raid on multiple(?) removable drives. How well does ZFS work on removable media? In a RAID configuration? Are there issues with matching device names to disks?
[zfs-discuss] Boot disk clone with zpool present
I'm hoping that this is simpler than I think it is. :-)

We routinely clone our boot disks using a fairly simple script that:

1) Copies the source disk's partition layout to the target disk using prtvtoc, fmthard and installboot.
2) Using a list, runs newfs against the target slice and a ufsdump of the source slice piped to a ufsrestore of the target slice.

The result is a bootable clone of the source disk. Granted, there are vulnerabilities with using ufsdump on a mounted file system, but it works for us.

We're now looking at using ZFS file systems for /usr, /var, /opt, /export/home, etc., leaving the root file system (/) as UFS and swap as a bare slice as it is now.

I've successfully created an Alternate Root Pool and have replicated the ZFS file systems from another source zpool into the Alternate Root Pool using zfs send and zfs receive. Right now, I'm doing this without the benefit of a bootable system to play with. I'm experimenting with just ordinary file systems, not /usr, /opt, etc.

Now comes the chicken-and-egg part. I think I would have to fix up the mount points of the newly copied ZFS file systems on the Alternate Root Pool so that they remain set to /usr, /opt, etc. By the way, would these file systems have to be legacy mount points? It seems like they would have to be.

Here's the part that makes my head hurt: If I've created this Alternate Root Pool on a separate disk slice and populated it and exported it, and I've replicated a UFS root (/) file system on that same disk but in slice 0, how does that zpool get connected when I try to boot that cloned disk?

Fundamentally, the question is, how does one replicate a boot/system disk that contains zpool(s) for file systems other than the root file system? This is fairly straightforward with UFS file system technology. The addition of zpool identity seems to complicate the issue considerably.

Thank you very much for any advice or clarification.
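For reference, a stripped-down sketch of the kind of UFS clone script described in steps 1 and 2 above (the device names c0t0d0 / c1t0d0 and the mount point /clone are invented; the real script would loop over a slice list):

  # copy the partition table and make the target bootable (SPARC shown)
  prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c1t0d0s2
  installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c1t0d0s0

  # for each slice in the list: newfs, mount, then ufsdump | ufsrestore
  newfs /dev/rdsk/c1t0d0s0
  mount /dev/dsk/c1t0d0s0 /clone
  ufsdump 0f - /dev/rdsk/c0t0d0s0 | (cd /clone && ufsrestore rf -)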
Re: [zfs-discuss] ZFS Support for remote mirroring
Aaron Newcomb wrote:

> Does ZFS support any type of remote mirroring? It seems at present my only two options to achieve this would be Sun Cluster or Availability Suite. I thought that this functionality was in the works, but I haven't heard anything lately.

You could put something together using iSCSI, or zfs send/recv.

--matt
Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
Pawel Jakub Dawidek wrote:

> This is what I see on Solaris (hole is 4GB):
>
> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
> real       23.7
> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
> real       21.2
> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
> real       31.4
> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
> real     7:32.2

This is probably because the time to execute this on ZFS is dominated by per-systemcall costs, rather than per-byte costs. You are doing 32x more system calls with the 4k blocksize, and it is taking 20x longer. That said, I could be wrong, and yowtch, that's much slower than I'd like!

--matt
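One way to check the "dominated by per-syscall cost" theory (a generic sketch, not something from the thread):

  # per-syscall counts and CPU time for the slow case
  truss -c dd if=/zfs/hole of=/dev/null bs=4k

  # or count read(2) calls while the test runs
  dtrace -n 'syscall::read:entry /execname == "dd"/ { @ = count(); }'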
Re: [zfs-discuss] Motley group of discs? (doing it right, or right now)
On 7-May-07, at 5:27 PM, Andy Lubel wrote:

> I think it will be in the next.next (10.6) OSX,

<baselessSpeculation> Well, the iPhone forced a few months schedule slip, perhaps *instead of* dropping features? </baselessSpeculation>

Mind you, I wouldn't be particularly surprised if ZFS wasn't in 10.5. Just so long as we get it eventually :-)

***suppresses giggle at MS whose schedule slipped years AND dropped any interesting features***
Re: [zfs-discuss] Boot disk clone with zpool present
Mark V. Dalton wrote:

> I'm hoping that this is simpler than I think it is. :-) We routinely clone our boot disks using a fairly simple script that:
> 1) Copies the source disk's partition layout to the target disk using prtvtoc, fmthard and installboot.

Danger Will Robinson! Disks can and do have different sizes, even disks with the same (Sun) part number. This causes difficulties or inefficiencies when you blindly copy the partition table like this. You will be better off using a script which creates the new partition map based upon the actual geometry and your desired configuration. This really isn't hard, but it does become site specific. Hint: use bc.

> 2) Using a list, runs newfs against the target slice and a ufsdump of the source slice piped to a ufsrestore of the target slice.

Yep, been doing that for decades. Actually, cpio is generally easier.

> The result is a bootable clone of the source disk. Granted, there are vulnerabilities with using ufsdump on a mounted file system but it works for us.

Actually, cpio is generally easier.

> We're now looking at using ZFS file systems for /usr, /var, /opt, /export/home, etc., leaving the root file system (/) as UFS and swap as a bare slice as it is now.

Actually, cpio is generally easier.

> I've successfully created an Alternate Root Pool and have replicated the ZFS file systems from another source zpool into the Alternate Root Pool using zfs send and zfs receive. Right now, I'm doing this without the benefit of a bootable system to play with. I'm experimenting with just ordinary file systems, not /usr, /opt, etc.
>
> Now comes the chicken-and-egg part. I think I would have to fix up the mount points of the newly copied ZFS file systems on the Alternate Root Pool so that they remain set to /usr, /opt, etc. By the way, would these file systems have to be legacy mount points? It seems like they would have to be.
>
> Here's the part that makes my head hurt: If I've created this Alternate Root Pool on a separate disk slice and populated it and exported it, and I've replicated a UFS root (/) file system on that same disk but in slice 0, how does that zpool get connected when I try to boot that cloned disk?
>
> Fundamentally, the question is, how does one replicate a boot/system disk that contains zpool(s) for file systems other than the root file system? This is fairly straightforward with UFS file system technology. The addition of zpool identity seems to complicate the issue considerably.

IMHO, it is more straightforward with ZFS, but I'm biased :-). For information see the ZFS boot pages:

http://www.opensolaris.org/os/community/zfs/boot/

How far you can go with this today depends on whether you're using SPARC or x86.

-- richard
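The "cpio is generally easier" remark refers to the usual trick of copying a mounted UFS filesystem with find | cpio instead of ufsdump; a generic sketch (the /clone mount point is invented):

  # copy the root filesystem onto an already newfs'ed and mounted target slice
  cd / && find . -xdev -depth -print | cpio -pdm /clone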
Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
> Pawel Jakub Dawidek wrote:
>> This is what I see on Solaris (hole is 4GB):
>>
>> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
>> real       23.7
>> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
>> real       21.2
>> # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
>> real       31.4
>> # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
>> real     7:32.2
>
> This is probably because the time to execute this on ZFS is dominated by per-systemcall costs, rather than per-byte costs. You are doing 32x more system calls with the 4k blocksize, and it is taking 20x longer. That said, I could be wrong, and yowtch, that's much slower than I'd like!

You missed my earlier post where I showed accessing a hole file takes much longer than accessing a regular data file for blocksize of 4k and below. I will repeat the most dramatic difference:

                 ZFS                    UFS2
             Elapsed   System       Elapsed   System
md5 SPACY     210.01    77.46        337.51    25.54
md5 HOLEY     856.39   801.21         82.11    28.31

I used md5 because all but a couple of syscalls are for reading the file (with a buffer of 1K). dd would make an equal number of calls for writing. For both file systems and both cases the filesize is the same, but SPACY has 10GB allocated while HOLEY was created with "truncate -s 10G HOLEY". Look at the system times. On UFS2 system time is a little bit more for the HOLEY case because it has to clear a block. On ZFS it is over 10 times more! Something is very wrong.
Re: [zfs-discuss] ZFS Support for remote mirroring
ZFS send/receive?? I am not familiar with this feature. Is there a doc I can reference?

Thanks,
Aaron Newcomb
Sr. Systems Engineer
Sun Microsystems
[EMAIL PROTECTED]
Cell: 513-238-9511
Office: 513-562-4409

Matthew Ahrens wrote:

> Aaron Newcomb wrote:
>> Does ZFS support any type of remote mirroring? It seems at present my only two options to achieve this would be Sun Cluster or Availability Suite. I thought that this functionality was in the works, but I haven't heard anything lately.
> You could put something together using iSCSI, or zfs send/recv.
> --matt
Re: [zfs-discuss] ZFS Support for remote mirroring
Matthew Ahrens wrote:

> Aaron Newcomb wrote:
>> Does ZFS support any type of remote mirroring? It seems at present my only two options to achieve this would be Sun Cluster or Availability Suite. I thought that this functionality was in the works, but I haven't heard anything lately.
> You could put something together using iSCSI, or zfs send/recv.

I think the definition of "remote mirror" is up for grabs here, but in my mind remote mirror means the remote node has an always up-to-date copy of the primary data set, modulo any transactions in flight. AVS, aka remote mirror, aka sndr, is usually used for this kind of work on the host. Storage arrays have things like, ahem, remote mirror, truecopy, srdf, etc.
[zfs-discuss] Re: ZFS Support for remote mirroring
I guess when we are defining a mirror, are you talking about a synchronous mirror or an asynchronous mirror?

As stated earlier, if you are looking for an asynchronous mirror and do not want to use AVS, you can use zfs send and receive and craft a fairly simple script that runs constantly and updates a remote filesystem. zfs send takes a snapshot and turns it into a datastream to standard out, while zfs receive takes a stdin datastream and outputs it to a zfs filesystem. The zfs send and receive structures are only limited to your creativity. One example use might be the following:

zfs send pool/fs1@snap1 | ssh remote_hostname zfs receive remotepool/fs2

That would get you your initial copy, then you would have to take a snap and do incrementals from there on in with something like

zfs send -i pool/fs1@snap1 pool/fs1@snap2 | ssh remote_hostname zfs receive remotepool/fs2

Note that the filesystem at the other end (fs2 in this case) will be a live filesystem that you can use anytime. Now with that incremental commandline, you might run into a bug that is well known and you can find a workaround in these forums, so I won't get into it, but your script would have to incorporate the workaround, which would basically run a "zfs rollback" command on the remote host before you propagate the incremental changes.

~Bryan
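A hedged sketch of the rollback workaround mentioned at the end, using the same illustrative names as above; the receiving side is rolled back to the last common snapshot before each incremental is applied:

  # on (or via ssh to) the remote host, discard any local changes since the last received snapshot
  ssh remote_hostname zfs rollback remotepool/fs2@snap1
  # then send the next increment
  zfs send -i pool/fs1@snap1 pool/fs1@snap2 | ssh remote_hostname zfs receive remotepool/fs2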
[zfs-discuss] Re: Motley group of discs?
Well, since we are talking about home use: I never tried it as a spare, but if you want to get real nutty, do the setup cindys suggested but format the 600GB drive as UFS or some other filesystem and then try to create a 250GB file device as a spare on that UFS drive. It will give you redundancy and not waste all the space on the 600GB drive. ZFS allows the use of file devices instead of hardware devices, e.g.

zpool create test /tmp/testfiledevice

If you do it, let us know how it goes :)
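Spelled out a bit further, and purely as an untested sketch with invented paths (note also the caveat about file-backed vdevs in the next reply):

  # on the UFS-formatted 600GB drive, reserve a 250GB file...
  mkfile 250000m /bigdisk/sparefile
  # ...and offer it to the pool as a hot spare
  zpool add pool spare /bigdisk/sparefile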
Re: [zfs-discuss] Re: Motley group of discs?
Bryan Wagoner wrote:

> Well, since we are talking about home use: I never tried it as a spare, but if you want to get real nutty, do the setup cindys suggested but format the 600GB drive as UFS or some other filesystem and then try to create a 250GB file device as a spare on that UFS drive. It will give you redundancy and not waste all the space on the 600GB drive. ZFS allows the use of file devices instead of hardware devices, e.g.
>
> zpool create test /tmp/testfiledevice

However, I do not believe it is safe to use files under UFS as ZFS vdevs. ZFS expects data to be flushed and, IIRC, UFS does not guarantee that for regular files. Search the archives for more info.

That said, you can certainly divide the 600 GByte disk into 3 slices. Later, you can always replace a slice with a different, bigger slice to grow.

-- richard
[zfs-discuss] Benchmark which models ISP workloads
This benchmark models real-world workload faced by many ISPs worldwide every day:

http://untroubled.org/benchmarking/2004-04/

Would appreciate if the ZFS team or the Performance group could take a look at it. I've run this myself on b61 (minor mods to the driver program), but obviously Team ZFS or the performance team may be interested in comparing results with different operating systems.
[zfs-discuss] Re: Samba and ZFS ACL Question
> Have there been any new developments regarding the availability of vfs_zfsacl.c? Jeb, were you able to get a copy of Jiri's work-in-progress? I need this ASAP (as I'm sure most everyone watching this thread does)...

me too... A.S.A.P.!!!

-- leon