[zfs-discuss] how to remove disk from raid0
Hi, I have the following configuration: 3 disks of 1Gb each in raid0, all disks in one zfs pool, so the raid is 3Gb total, with 1.5Gb free and 1.5Gb used. I have some questions:

1. If I don't plan to use 3 disks in the pool any more, how can I remove one of them?

2. Imagine one disk fails. I want to replace it, but I no longer have a 1Gb disk, only a 2Gb one. I replace the 1Gb disk with the 2Gb one, and after some time I want to put a 1Gb disk back (as it was before). With the replace command I get the error: device is too small. How can I return the pool to its original state?

Thank you.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] how to remove disk from raid0
On Tue, Oct 11, 2011 at 9:25 AM, KES kes-...@yandex.ua wrote:
> Hi I have the next configuration: 3 disk 1Gb in raid0 all disks in zfs pool [...]
> 1. If I don't plan to use 3 disks in the pool any more, how can I remove one of them?
> 2. Imagine one disk fails. [...] With the replace command I get the error: device is too small. How can I return the pool to its original state?

Simply put, current zfs can only be extended; shrinking is not possible. At least until the mythical block pointer rewrite is actually written.

--
O ascii ribbon campaign - stop html mail - www.asciiribbon.org
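The "copy out, copy in" workaround can be sketched roughly as below. Pool and dataset names are hypothetical, you need scratch storage large enough to hold the data, and this is a sketch of the idea, not a tested procedure:

```shell
# 1. Snapshot everything and copy it out to spare storage.
zfs snapshot -r tank@migrate
zfs send -R tank@migrate | zfs recv -F spare/tank

# 2. Destroy and recreate the pool with only the disks you want to keep.
zpool destroy tank
zpool create tank c0t0d0 c0t1d0

# 3. Copy the data back in.
zfs send -R spare/tank@migrate | zfs recv -F tank
```

Since a raid0-style (striped) pool has no redundancy, verify the copied-out data before destroying the original pool.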
Re: [zfs-discuss] commercial zfs-based storage replication software?
Have you looked at the time-slider functionality that is already in Solaris? There is a GUI for configuring the snapshots, and time-slider can be configured to do a 'zfs send' or 'rsync'. The GUI doesn't have the ability to set the 'zfs recv' command, but that is set one time in the SMF service properties.

--
Darren J Moffat
Re: [zfs-discuss] Any info about System attributes
On 09/26/11 20:03, Jesus Cea wrote:
> # zpool upgrade -v
> [...]
> 24 System attributes
> [...]

This is really an on-disk format issue rather than something that the end user or admin can use directly. These are special on-disk blocks for storing file system metadata attributes when there isn't enough space in the bonus buffer area of the on-disk version of the dnode. This can be necessary in some cases if a file has a very large and complex ACL and also has other attributes set, such as the ones for CIFS compatibility.

They are also always used if the filesystem is encrypted, so that all metadata is in the system attribute (also known as spill) block rather than in the dnode. This is required because we need the dnode in the clear, since it contains block pointers and other information needed to navigate the pool, while we never want file system metadata to be in the clear.

--
Darren J Moffat
Re: [zfs-discuss] how to remove disk from raid0
On Oct 11, 2011, at 2:25 AM, KES kes-...@yandex.ua wrote:
> Hi I have the next configuration: 3 disk 1Gb in raid0 all disks in zfs pool

We recommend protecting the data. Friends don't let friends use raid-0.
nit: we tend to refer to disk size in bytes (B), not bits (b)

> freespace on so raid is 1.5Gb and 1.5Gb is used. so I have some questions: 1. If I don plan to use 3 disks in pool any more. How can I remove one of it?

Copy out, copy in. Using sparse volumes or file systems can help you manage this task cost effectively. The mythical block pointer rewrite is a form of copy out, copy in.

> 2. Imaine one disk has failures. I want to replace it, but now I do not have disk 1Gb and have only 2Gb [...] with replace command i have error: device is too small How to return pool into beginning state?

Partition the disk so that the size of the replacement partition on the 2GB disk is exactly the same size (in blocks) as the 1GB disk.
 -- richard
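The partitioning suggestion hinges on block counts: the "device is too small" error goes away when the replacement slice has at least as many blocks as the old device. A quick sketch of the arithmetic, assuming 512-byte sectors and treating "1Gb" as 1 GiB:

```shell
# Sectors in a 1 GiB disk at 512 bytes/sector: this is the slice size
# (in blocks) to aim for when partitioning the 2GB replacement disk
# with format(1M).
gib=$((1024 * 1024 * 1024))
echo $((gib / 512))   # prints 2097152
```

In practice, read the exact sector count of the original 1GB disk from prtvtoc(1M) output rather than computing it from the nominal size, since vendors' "1GB" disks vary slightly.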
[zfs-discuss] OpenStorage Summit email blast
Subject: FYI on Storage Event

Just an FYI on storage: I just learned that an OpenStorage Summit is happening in San Jose during the last week of October. Some great speakers are presenting and some really interesting topics will be addressed, including Korea Telecom on public cloud storage, Intel on converged storage, and a discussion of how VMware designed a public cloud to host its Hands-on Lab during the VMworld 2011 event held in Las Vegas. There will also be a presentation on best practices for ZFS / OpenSolaris. What a great opportunity! Check it out and register here.
[zfs-discuss] ZFS issue on read performance
Hi,

I'm not familiar with ZFS, so I'll try to give you as much info as I can about our environment. We are using a ZFS pool as a VLS for a backup server (Sun V445, Solaris 10), and we are faced with very low read performance, whilst write performance is much better, i.e.: up to 40GB/h to migrate data onto LTO-3 tape from disk, and up to 100GB/h to unstage data from LTO-3 tape to disk, either with Time Navigator 4.2 software or directly using dd commands.

We have tuned ZFS parameters for ARC and disabled prefetch, but performance is still poor. If we dd from disk to RAM or tape, it's very slow, but if we dd from tape or RAM to disk, it's faster. I can't figure out why. I've read other posts related to this, but I'm not sure what kind of tuning can be made. As for the disks, I have no idea how our System team created the ZFS volume.

Can you help?

Thank you,
David
Re: [zfs-discuss] tuning zfs_arc_min
On Oct 6, 2011, at 5:19 AM, Frank Van Damme frank.vanda...@gmail.com wrote:
> Hello, quick and stupid question: I'm breaking my head over how to tune zfs_arc_min on a running system. There must be some magic word to pipe into mdb -kw but I forgot it. I tried /etc/system but it's still at the old value after reboot:
>
> ZFS Tunables (/etc/system):
>         set zfs:zfs_arc_min = 0x20
>         set zfs:zfs_arc_meta_limit=0x1

It is not uncommon to tune arc meta limit. But I've not seen a case where tuning arc min is justified, especially for a storage server. Can you explain your reasoning?
 -- richard

> ARC Size:
>         Current Size:            1314 MB (arcsize)
>         Target Size (Adaptive):  5102 MB (c)
>         Min Size (Hard Limit):   2048 MB (zfs_arc_min)
>         Max Size (Hard Limit):   5102 MB (zfs_arc_max)
>
> I could use the memory now since I'm running out of it, trying to delete a large snapshot :-/
>
> --
> No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
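For reference, a well-formed /etc/system fragment takes full hex byte counts. The values below are arbitrary examples for illustration, not recommendations:

```shell
* /etc/system fragment -- example values only
* 0x20000000 = 512MB, 0x10000000 = 256MB
set zfs:zfs_arc_min = 0x20000000
set zfs:zfs_arc_meta_limit = 0x10000000
```

Changes here only take effect at the next boot, and only once the boot archive reflects the edited file.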
Re: [zfs-discuss] ZFS issue on read performance
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of deg...@free.fr
> We are using a ZFS pool as a VLS for a backup server (Sun V445 Solaris 10), and we are faced with very low read performance [...] We have tuned ZFS parameters for ARC and disabled prefetch but performance is poor. If we dd from disk to RAM or tape, it's very slow, but if we dd from tape or RAM to disk, it's faster. [...] Can you help?

Normally, even a single cheap disk in the dumbest configuration should vastly outperform an LTO-3 tape device. And 100 GB/h is nowhere near what you should expect, unless you're using highly fragmented or scattered small files. In the optimal configuration, you'll read/write something like 1Gbit/sec per disk until you saturate your controller; let's just pick rough numbers and say 6Gbit/sec = 2.7 TB per hour. So there's a ballpark to think about.

Next things next. I am highly skeptical of dd. I constantly get weird performance problems when using dd, especially when reading/writing tapes. Instead, this is a good benchmark for how fast your disks can actually go in the present configuration:

        zfs send somefilesystem@somesnap | pv -i 30 > /dev/null

(You might have to install pv, for example using opencsw or blastwave. If you don't have pv and don't want to install it, you might instead time 'zfs send somefilesystem@somesnap | wc -c', so you can get the total size and the total time.)

Expect the performance to go up and down, so watch it a while. Or wait for it to complete and then you'll have the average.

Also... in what way are you using dd? dd is not really an appropriate tool for backing up a ZFS filesystem. Well, there are some corner cases where it might be ok, but generally speaking, no. So the *very* first question you should be asking is probably not about the bad performance you're seeing, but about the validity of your backup technique.
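The 6Gbit/sec ballpark above can be sanity-checked with integer shell arithmetic (8 bits per byte, decimal units):

```shell
# 6 Gbit/s sustained for an hour, in (decimal) GB
echo $(( 6 * 3600 / 8 ))   # prints 2700, i.e. roughly 2.7 TB per hour
```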
Re: [zfs-discuss] ZFS issue on read performance
On Tue, Oct 11, 2011 at 6:25 AM, deg...@free.fr wrote:
> I'm not familiar with ZFS stuff [...] For disks concern, I have no idea on how our System team created the ZFS volume. Can you help?

If you can, please post the output from `zpool status` so we know what your configuration is. There are many ways to configure a zpool, some of which have horrible read performance.

We are using zfs as backend storage for NetBackup and we do not see the disk storage as the bottleneck except when copying from disk to tape (LTO-3), and that depends on the backup images. We regularly see 75-100 MB/sec throughput disk to tape for large backup images. I rarely see LTO-3 drives writing any faster than 100 MB/sec. 100 MB/sec. is about 350 GB/hr; 75 MB/sec. is about 260 GB/hr. Our disk stage zpool is configured for capacity and reliability, not performance.
  pool: nbu-ds0
 state: ONLINE
 scrub: scrub completed after 7h9m with 0 errors on Thu Sep 29 16:25:56 2011
config:

        NAME                       STATE     READ WRITE CKSUM
        nbu-ds0                    ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c3t5000C5001A67AB63d0  ONLINE       0     0     0
            c3t5000C5001A671685d0  ONLINE       0     0     0
            c3t5000C5001A670DE6d0  ONLINE       0     0     0
            c3t5000C5001A66CDA4d0  ONLINE       0     0     0
            c3t5000C5001A66A43Bd0  ONLINE       0     0     0
            c3t5000C5001A66994Dd0  ONLINE       0     0     0
            c3t5000C5001A663062d0  ONLINE       0     0     0
            c3t5000C5001A659F79d0  ONLINE       0     0     0
            c3t5000C5001A6591B2d0  ONLINE       0     0     0
            c3t5000C5001A658481d0  ONLINE       0     0     0
            c3t5000C5001A4C47C8d0  ONLINE       0     0     0
          raidz2-1                 ONLINE       0     0     0
            c3t5000C5001A6548A2d0  ONLINE       0     0     0
            c3t5000C5001A6546AAd0  ONLINE       0     0     0
            c3t5000C5001A65400Ed0  ONLINE       0     0     0
            c3t5000C5001A653B70d0  ONLINE       0     0     0
            c3t5000C5001A6531F5d0  ONLINE       0     0     0
            c3t5000C5001A64332Ed0  ONLINE       0     0     0
            c3t5000C500112A5AF8d0  ONLINE       0     0     0
            c3t5000C5001A5D61A8d0  ONLINE       0     0     0
            c3t5000C5001A5C5EA9d0  ONLINE       0     0     0
            c3t5000C5001A55F7A6d0  ONLINE       0     0     0  114K repaired
            c3t5000C5001A5347FEd0  ONLINE       0     0     0
        spares
          c3t5000C5001A485C88d0    AVAIL
          c3t5000C50026A0EC78d0    AVAIL

errors: No known data errors

--
{1-2-3-4-5-6-7-}
Paul Kraus
- Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
- Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
- Technical Advisor, RPI Players
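The MB/sec-to-GB/hr figures quoted above can be sanity-checked with a little shell arithmetic (using 1 GB = 1024 MB, which is how the ~350 and ~260 figures work out):

```shell
# Convert sustained tape throughput in MB/s to GB/hr
for rate in 100 75; do
  echo "$rate MB/s is about $(( rate * 3600 / 1024 )) GB/hr"
done
# prints:
#   100 MB/s is about 351 GB/hr
#   75 MB/s is about 263 GB/hr
```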
Re: [zfs-discuss] tuning zfs_arc_min
2011/10/11 Richard Elling richard.ell...@gmail.com:
>> ZFS Tunables (/etc/system):
>>         set zfs:zfs_arc_min = 0x20
>>         set zfs:zfs_arc_meta_limit=0x1
>
> It is not uncommon to tune arc meta limit. But I've not seen a case where tuning arc min is justified, especially for a storage server. Can you explain your reasoning?

Honestly? I don't remember. It might be a leftover setting from a year ago. By now, I've figured out that I need to update the boot archive in order for the new setting to take effect at boot time, which apparently involves booting in safe mode.

--
Frank Van Damme
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
[zfs-discuss] weird bug with Seagate 3TB USB3 drive
Banging my head against a Seagate 3TB USB3 drive. Its marketing name is:
Seagate Expansion 3 TB USB 3.0 Desktop External Hard Drive STAY3000102

format(1M) shows it identifying itself as: Seagate-External-SG11-2.73TB

Under both Solaris 10 and Solaris 11x, I receive the evil message:
| I/O request is not aligned with 4096 disk sector size.
| It is handled through Read Modify Write but the performance is very low.

However, that's not my big issue, as I will use the zpool-12 hack. My big issue is that once I zpool(1M) export the pool from my W2100z running S10 or my Ultra 40 running S11x, I can't import it. I thought it was a weird USB connectivity issue, but I can run format - analyze - read merrily.

Anyone seen this bug?

John
groenv...@acm.org
Re: [zfs-discuss] tuning zfs_arc_min
On Oct 11, 2011, at 2:03 PM, Frank Van Damme wrote:
> Honestly? I don't remember. might be a leftover setting from a year ago. by now, I figured out I need to update the boot archive in order for the new setting to have effect at boot time which apparently involves booting in safe mode.

The archive should be updated when you reboot. Or you can run bootadm update-archive anytime.

At boot, the zfs_arc_min is copied into arc_c_min, overriding the default setting. You can see the current value via kstat:

        # kstat -p zfs:0:arcstats:c_min
        zfs:0:arcstats:c_min    389202432

This is the smallest size that the ARC will shrink to when asked to shrink because other applications need memory.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
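Putting the pieces of this thread together, a live-system sketch of reading and changing the ARC floor follows. The mdb value is an arbitrary example, and mdb -kw writes directly into kernel memory, so treat this with care:

```shell
# Read the current ARC floor
kstat -p zfs:0:arcstats:c_min

# Poke a new 8-byte value into arc_c_min on the running kernel
# (0x10000000 = 256MB, an example value only); takes effect immediately
echo 'arc_c_min/Z 0x10000000' | mdb -kw

# Confirm the new value
kstat -p zfs:0:arcstats:c_min
```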
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
> Hello all, ZFS developers have for a long time stated that ZFS is not intended, at least not in the near term, for clustered environments (that is, having a pool safely imported by several nodes simultaneously). However, many people on the forums have wished for ZFS features in clusters.

...and UFS before ZFS… I'd wager that every file system has this RFE in its wish list :-)

> I have some ideas, at least for a limited implementation of clustering, which may be useful at least in some areas. If it is not my fantasy and if it is realistic to make, this might be a good start for further optimisation of ZFS clustering for other uses. For one use-case example, I would talk about VM farms with VM migration. In case of shared storage, the physical hosts need only migrate the VM RAM without copying gigabytes of data between their individual storages. Such copying makes less sense when the hosts' storage is mounted off the same NAS/SAN box(es), because:
>
> * it only wastes bandwidth moving bits around the same storage, and

This is why the best solutions use snapshots… no moving of data, and you get the added benefit of shared ARC: increasing the logical working set size does not increase the physical working set size.

> * IP networking speed (NFS/SMB copying) may be less than that of a dedicated storage net between the hosts and storage (SAS, FC, etc.)

Disk access is not bandwidth bound by the channel.

> * with a pre-configured disk layout from one storage box into LUNs for several hosts, more slack space is wasted than with having a single pool for several hosts, all using the same free pool space;

...and you die by latency of metadata traffic.

> * it is also less scalable (i.e. if we lay out the whole SAN for 5 hosts, it would be problematic to add a 6th server) - but it won't be a problem when the single pool consumes the whole SAN and is available to all server nodes.

Are you assuming disk access is faster than RAM access?
> One feature of this use-case is that specific datasets within the potentially common pool on the NAS/SAN are still dedicated to certain physical hosts. This would be similar to serving iSCSI volumes or NFS datasets with individual VMs from a NAS box - just with a faster connection over SAS/FC. Hopefully this allows for some shortcuts in the clustered ZFS implementation, while such solutions would still be useful in practice.

I'm still missing the connection of the problem to the solution. The problem, as I see it today: disks are slow and not getting faster. SSDs are fast and getting faster, with ever lower $/IOP. Almost all VM environments and most general-purpose environments are overprovisioned for bandwidth and underprovisioned for latency. The Achilles' heel of solutions that cluster for bandwidth (eg lustre, QFS, pNFS, Gluster, GFS, etc) is that you have to trade off latency. But latency is what we need, so perhaps not the best architectural solution?

> So, one version of the solution would be to have a single host which imports the pool in read-write mode (i.e. the first one which boots), and the other hosts would write through it (like iSCSI or whatever; maybe using SAS or FC to connect between reader and writer hosts). However, they would read directly from the ZFS pool using the full SAN bandwidth. WRITES would be consistent because only one node writes data to the active ZFS block tree, using more or less the same code and algorithms as already exist. In order for READS to be consistent, the reader nodes need only rely on whatever latest TXG they know of, and on the cached results of their more recent writes (between the last TXG these nodes know of and the current state). Here's where this use-case's bonus comes in: the node which currently uses a certain dataset and issues writes for it is the only one expected to write there - so even if its knowledge of the pool is some TXGs behind, it does not matter.
> In order to stay up to date and know the current TXG completely, the reader nodes should regularly read in the ZIL data (anyway available and accessible as part of the pool) and expire changed entries from their local caches. :-) If for some reason a reader node has lost track of the pool for too long, so that the ZIL data is not sufficient to update from the known in-RAM TXG to the current on-disk TXG, the full read-only import can be done again (keeping track of newer TXGs appearing while the RO import is being done). Thanks to ZFS COW, nodes can expect that on-disk data (as pointed to by block addresses/numbers) does not change. So in the worst case, nodes would read outdated data a few TXGs old - but not completely invalid data.
>
> The second version of the solution is more or less the same, except that all nodes can write to the pool hardware directly, using some dedicated block ranges owned by one node at a time. This would work much like a ZIL containing both data and metadata. Perhaps
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Tue, Oct 11, 2011 at 11:15 PM, Richard Elling richard.ell...@gmail.com wrote:
> On Oct 9, 2011, at 10:28 AM, Jim Klimov wrote:
>> ZFS developers have for a long time stated that ZFS is not intended, at least not in near term, for clustered environments [...] However, many people on forums have wished having ZFS features in clusters.
>
> ...and UFS before ZFS… I'd wager that every file system has this RFE in its wish list :-)

Except the ones that already have it! :)

Nico
--
Re: [zfs-discuss] Wanted: sanity check for a clustered ZFS idea
On Sun, Oct 9, 2011 at 12:28 PM, Jim Klimov jimkli...@cos.ru wrote:
> So, one version of the solution would be to have a single host which imports the pool in read-write mode (i.e. the first one which boots), and other hosts would write thru it (like iSCSI or whatever; maybe using SAS or FC to connect between reader and writer hosts). However they would read directly from the ZFS pool using the full SAN bandwidth.

You need to do more than simply assign a node for writes. You need to send write and lock requests to one node. And then you need to figure out what to do about POSIX write visibility rules (i.e., when a write should be visible to other readers). I think you'd basically end up not meeting POSIX in this regard, just like NFS, though perhaps not with close-to-open semantics.

I don't think ZFS is the beast you're looking for. You want something more like Lustre, GPFS, and so on. I suppose someone might surprise us one day with properly clustered ZFS, but I think it'd be more likely that the filesystem would be ZFS-like, not ZFS proper.

> Second version of the solution is more or less the same, except that all nodes can write to the pool hardware directly using some dedicated block ranges owned by one node at a time. This would work much like a ZIL containing both data and metadata. Perhaps these ranges would be whole metaslabs or some other ranges as agreed between the master node and other nodes.

This is much hairier. You need consistency. If two processes on different nodes are writing to the same file, then you need to *internally* lock around all those writes so that the on-disk structure ends up being sane. There are a number of things you could do here, such as, for example, having a per-node log and one node coalescing them (possibly one node per file, but even then one node has to be the master of every txg). And still you need to be careful about POSIX semantics.

That does not come for free in any design: you will need something like the Lustre DLM (distributed lock manager), or else you'll have to give up on POSIX. There's a hefty price to be paid for POSIX semantics in a clustered environment. You'd do well to read up on Lustre's experience in detail. And not just Lustre; that would be just the start. I caution you that this is not a simple project.

Nico
--