Re: [zfs-discuss] DDT sync?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>
> So here's what I'm going to do. With arc_meta_limit at 7680M, of which 100M
> was consumed "naturally," that leaves me 7580 to play with. Call it 7500M.
> Divide by 412 bytes, it means I'll hit a brick wall when I reach a little
> over 19M blocks. Which means if I set my recordsize to 32K, I'll hit that
> limit around 582G disk space consumed. That is my hypothesis, and now
> beginning the test.

Well, this is interesting. With 7580MB theoretically available for the DDT in ARC, the expectation was that 19M DDT entries would finally max out the ARC, and then I'd jump off a performance cliff and start seeing a bunch of pool reads killing my write performance. In reality, what I saw was:

* Up to a million blocks, the performance difference with/without dedup was basically negligible. Write time with dedup = 1x write time without dedup.

* After a million, the dedup write time consistently reached 2x the native write time. This happened when my ARC became full of user data (not metadata).

* As the number of unique blocks in the pool increased, the dedup write time gradually deviated from the non-dedup write time: 2x, 3x, 4x. I got a consistent 4x longer write time with dedup enabled after the pool reached 22.5M blocks.

* And then it jumped off a cliff. 24M blocks was the last datapoint I was able to collect: 28x slower writes with dedup (4966 sec to write 3G, as compared to 178 sec), and for the first time, a nonzero rm time. All the way up until now, even with dedup, the rm time was zero. But now it was 72 sec.

* I waited another 6 hours and never got another data point. So I found the limit where the pool becomes unusably slow.

At a cursory look, you might say this supported the hypothesis. You might say "24M compared to 19M, that's not too far off. This could be accounted for by using the 376-byte size of ddt_entry_t instead of the 412-byte size apparently measured... That would adjust the hypothesis to 21.1M blocks." But I don't think that's quite fair, because my arc_meta_used never got above 5,159, and I never saw the massive read overload that was predicted to be the cause of failure. In fact, starting from 0.4M to 0.5M blocks (early, early, early on), from that point onward I always had 40-50 reads for every 250 writes, right to the bitter end. And my ARC is full of user data, not metadata.

So the conclusions I'm drawing are:

(1) If you don't tweak arc_meta_limit and you want to enable dedup, you're toast. But if you do tweak arc_meta_limit, you might reasonably expect dedup to perform 3x to 4x slower on unique data... And based on results that I haven't talked about here yet, dedup performs 3x to 4x faster on duplicate data. So if you have 50% or more duplicate data (dedup ratio 2x or higher), and you have plenty of memory and tweak it, then your performance with dedup could be comparable to, or even faster than, running without dedup. Of course, depending on your data patterns and usage patterns. YMMV.

(2) The above is pretty much the best you can do if your server is going to be a "normal" server handling both reads and writes. Because data and metadata are both stored in the ARC, the data has a tendency to push the metadata out. But there is a special use case: suppose you only care about write performance and saving disk space. For example, suppose you're the destination server of a backup policy. You only do writes, so you don't care about keeping data in cache; you want to enable dedup to save cost on backup disks, and you only care about keeping metadata in ARC. So: set primarycache=metadata. I'll go test this now.
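For reference, the sizing arithmetic from the top of this message can be sketched as a quick shell calculation (the 7500M headroom, the 412-byte per-entry figure and the 32K recordsize are the numbers quoted above; integer division rounds down):

```shell
#!/bin/sh
# Back-of-envelope DDT capacity, using the figures quoted above.
ARC_META_MB=7500      # arc_meta_limit headroom assumed usable for the DDT
ENTRY_BYTES=412       # apparent in-ARC size of one DDT entry
RECORDSIZE_KB=32      # recordsize used for the test

# Unique blocks that fit before the DDT exhausts the metadata headroom:
ENTRIES=$(( ARC_META_MB * 1024 * 1024 / ENTRY_BYTES ))
# Pool space consumed by that many unique blocks, in GB:
LIMIT_GB=$(( ENTRIES * RECORDSIZE_KB / 1024 / 1024 ))

echo "DDT entries that fit: $ENTRIES"
echo "Expected wall at:     ${LIMIT_GB}G of unique data"
```

Swapping in the 376-byte ddt_entry_t size for ENTRY_BYTES gives the adjusted 21.1M-entry figure the same way.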
The hypothesis is that my arc_meta_used should actually climb up to the arc_meta_limit before I start hitting any disk reads, so my write performance with/without dedup should be pretty much equal up to that point. I'm sacrificing the potential read benefit of caching data in ARC in order to hopefully gain write performance - so write performance can be just as good with dedup enabled or disabled. In fact, if there's much duplicate data, the dedup write performance in this case should be significantly better than without dedup.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is another drive worth anything?
On Wed, Jun 1, 2011 at 7:06 AM, Bill Sommerfeld wrote:
> On 05/31/11 09:01, Anonymous wrote:
>> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
>> root mirror. I have another 500G drive same as the other two. Is there any
>> way to use this disk to good advantage in this box? I don't think I need any
>> more redundancy, I would like to increase performance if possible. I have
>> only one SATA port left so I can only use 3 drives total unless I buy a PCI
>> card. Would you please advise me. Many thanks.
>
> I'd use the extra SATA port for an ssd, and use that ssd for some
> combination of boot/root, ZIL, and L2ARC.
>
> I have a couple systems in this configuration now and have been quite
> happy with the config. While slicing an ssd and using one slice for
> root, one slice for zil, and one slice for l2arc isn't optimal from a
> performance standpoint and won't scale up to a larger configuration, it
> is a noticeable improvement from a 2-disk mirror.
>
> I used an 80G intel X25-M, with 1G for zil, with the rest split roughly
> 50:50 between root pool and l2arc for the data pool.

Does anyone have a benchmark or historical data on how reliable an SSD is nowadays? Cheap-ish SandForce-based MLC SSDs usually say they support 1 million write cycles, and that they have some kind of wear-leveling. How does this translate when it's used as L2ARC? Can we expect something like a one-year or three-year lifetime when the pool is relatively busy?

-- Fajar
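For a rough feel for the L2ARC-wear question, a back-of-envelope estimate can be sketched as below. Every figure here is an assumption for illustration, not a measurement - consumer MLC endurance is more commonly quoted in the low thousands of program/erase cycles than in millions:

```shell
#!/bin/sh
# Back-of-envelope SSD lifetime as L2ARC.  All inputs are assumptions;
# substitute real figures for the actual device and workload.
L2ARC_GB=40           # assumed size of the L2ARC slice
PE_CYCLES=5000        # assumed MLC program/erase endurance per cell
FILL_MB_S=10          # assumed sustained L2ARC fill rate, MB/s

# Total data the slice can absorb with ideal wear-leveling, in MB:
TOTAL_MB=$(( L2ARC_GB * 1024 * PE_CYCLES ))
# Lifetime at the assumed fill rate:
LIFE_S=$(( TOTAL_MB / FILL_MB_S ))
LIFE_DAYS=$(( LIFE_S / 86400 ))
echo "Estimated lifetime: ${LIFE_DAYS} days"
```

Under these deliberately pessimistic, always-busy assumptions the slice lasts well under a year; a less busy pool, a larger device, or higher-endurance flash stretches that accordingly.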
Re: [zfs-discuss] Is another drive worth anything?
On Wed, Jun 01, 2011 at 05:45:14AM +0400, Jim Klimov wrote:
> Also, in a mirroring scenario is there any good reason to keep a warm spare
> instead of making a three-way mirror right away (beside energy saving)?
> Rebuild times and non-redundant windows can be decreased considerably ;)

Perhaps where the spare may be used for any of several pools, whichever has a failure first. That's not relevant to this case: here, if the drive is warm, it might as well be live.

My point was that even as a cold spare it is worth something, and that the SATA port may be worth more, since the OP is more interested in performance than extra redundancy.

-- Dan.
Re: [zfs-discuss] JBOD recommendation for ZFS usage
Thomas,

You can consider the DataON DNS-1600 (4U, 24-bay 3.5" 6Gb/s SAS JBOD). It is a good fit for ZFS storage as an alternative to the J4400. http://dataonstorage.com/dns-1600

We recommend using native SAS drives, such as the Seagate Constellation ES 2TB SAS, to connect two hosts for a fail-over cluster. The following is a setup diagram of an HA failover cluster with Nexenta; the same configuration can be applied to Solaris, OpenSolaris and OpenIndiana: http://dataonstorage.com/nexentaha

We also have DSM (Disk Shelf Management tool) available for Solaris 10 and Nexenta to help identify failed disks and JBODs. You can also check the status of all FRUs: http://dataonstorage.com/dsm

FYI, we have a reseller in Germany. If you need additional info, let me know!

Rocky

-----Original Message-----
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Thomas Nau
Sent: Sunday, May 29, 2011 11:07 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] JBOD recommendation for ZFS usage

Dear all

Sorry if it's kind of off-topic for the list, but after talking to lots of vendors I'm running out of ideas... We are looking for JBOD systems which

(1) hold 20+ 3.5" SATA drives
(2) are rack mountable
(3) have all the nice hot-swap stuff
(4) allow 2 hosts to connect via SAS (4+ lanes per host) and see all
    available drives as disks, not as a RAID volume. In a perfect world
    both hosts would connect using two independent SAS connectors each.

The box will be used in a ZFS-based Solaris fileserver in a fail-over cluster setup. Only one host will access a drive at any given time. It seems that a lot of vendors offer JBODs, but so far I haven't found one in Germany which handles (4). Any hints?
Re: [zfs-discuss] Is another drive worth anything?
> If it is powered on, then it is a warm spare :-)
> Warm spares are a good idea. For some platforms, you can spin down the
> disk so it doesn't waste energy.

But I should note that we've had issues with a hot spare disk added to rpool in particular, preventing boots on Solaris 10u8. It turned out to be a known bug which may have since been fixed...

Also, in a mirroring scenario is there any good reason to keep a warm spare instead of making a three-way mirror right away (beside energy saving)? Rebuild times and non-redundant windows can be decreased considerably ;)

//Jim
Re: [zfs-discuss] Is another drive worth anything?
On May 31, 2011, at 5:16 PM, Daniel Carosone wrote:
> Namely, leave the third drive on the shelf as a cold spare, and use
> the third sata connector for an ssd, as L2ARC, ZIL or even possibly
> both (which will affect selection of which device to use).

If it is powered on, then it is a warm spare :-)

Warm spares are a good idea. For some platforms, you can spin down the disk so it doesn't waste energy.

-- richard
Re: [zfs-discuss] Is another drive worth anything?
> What about writes?

Writes in a mirror are no faster than the slowest disk: all two or three drives must commit a block before it is considered written (in sync write mode), and likewise for TXG sync, though with some optimization from caching and write-coalescing.

//Jim
Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE
On Tue, May 31, 2011 at 05:32:47PM +0100, Matt Keenan wrote:
> Jim,
>
> Thanks for the response. I've nearly got it working, but I'm coming up
> against a hostid issue.
>
> Here are the steps I'm going through:
>
> - At the end of auto-install, on the client just installed, before I
>   manually reboot, I do the following:
>   $ beadm mount solaris /a
>   $ zpool export data
>   $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
>   $ beadm umount solaris
>   $ reboot
>
> - Before rebooting I check /a/etc/zfs/zpool.cache and it does contain
>   references to "data".
>
> - On reboot, the automatic import of data is attempted, however the
>   following message is displayed:
>
>   WARNING: pool 'data' could not be loaded as it was last accessed by
>   another system (host: ai-client hostid: 0x87a4a4). See
>   http://www.sun.com/msg/ZFS-8000-EY.
>
> - The hostid on the booted client is:
>   $ hostid
>   000c32eb
>
> As I don't control the import command on boot, I cannot simply add a "-f"
> to force the import. Any ideas on what else I can do here?

Can you simply export the pool again before rebooting, but after the cachefile in /a has been unmounted?

-- Dan.
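Dan's suggestion would amount to something like the sequence below (a sketch, not tested here). The idea is that the second export clears the pool's "last accessed by another system" state on disk, while the cachefile already written under /a still carries the pool configuration for the first boot:

```shell
# Sketch of the adjusted end-of-install sequence (untested):
beadm mount solaris /a
zpool export data
zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
beadm umount solaris
zpool export data    # export again: on-disk state is now "exported",
                     # so the hostid check on first boot should not trip
reboot
```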
Re: [zfs-discuss] Is another drive worth anything?
On Wed, Jun 01, 2011 at 10:16:28AM +1000, Daniel Carosone wrote:
> On Tue, May 31, 2011 at 06:57:53PM -0400, Edward Ned Harvey wrote:
> > If you make it a 3-way mirror, your write performance will be unaffected,
> > but your read performance will increase 50% over a 2-way mirror. All 3
> > drives can read different data simultaneously for the net effect of 3x a
> > single disk read performance.
>
> This would be my recommendation too, but for the sake of completeness,
> there are other options that may provide better performance
> improvement (at a cost) depending on your needs.

In fact, I should state it even more clearly: do this, since there is very little reason not to. Measure the benefit. Move on to the other things if the benefit is not enough. When doing so, consider what kind of benefit you're looking for.

> Namely, leave the third drive on the shelf as a cold spare, and use
> the third sata connector for an ssd, as L2ARC, ZIL or even possibly
> both (which will affect selection of which device to use).
>
> L2ARC is likely to improve read latency (on average) even more than a
> third submirror. ZIL will be unmirrored, but may improve writes at an
> acceptable risk for development system. If this risk is acceptable,
> you may wish to consider whether setting sync=disabled is also
> acceptable at least for certain datasets.
>
> Finally, if you're considering spending money, can you increase the
> RAM instead? If so, do that first.

-- Dan.
Re: [zfs-discuss] Is another drive worth anything?
On Tue, May 31, 2011 at 06:57:53PM -0400, Edward Ned Harvey wrote:
> If you make it a 3-way mirror, your write performance will be unaffected,
> but your read performance will increase 50% over a 2-way mirror. All 3
> drives can read different data simultaneously for the net effect of 3x a
> single disk read performance.

This would be my recommendation too, but for the sake of completeness, there are other options that may provide better performance improvement (at a cost) depending on your needs.

Namely, leave the third drive on the shelf as a cold spare, and use the third SATA connector for an ssd, as L2ARC, ZIL or even possibly both (which will affect the selection of which device to use).

L2ARC is likely to improve read latency (on average) even more than a third submirror. The ZIL will be unmirrored, but may improve writes at an acceptable risk for a development system. If this risk is acceptable, you may wish to consider whether setting sync=disabled is also acceptable, at least for certain datasets.

Finally, if you're considering spending money, can you increase the RAM instead? If so, do that first.

-- Dan.
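For concreteness, the options Dan lists map to commands along these lines. This is a sketch only: the pool and device names are hypothetical placeholders, and how to slice the ssd between the two roles is the selection question he mentions:

```shell
# Hypothetical ssd on the third SATA port, sliced as s0/s1 (names invented):
zpool add tank cache c2t0d0s0     # L2ARC slice: improves read latency
zpool add tank log   c2t0d0s1     # unmirrored ZIL (slog): faster sync writes

# Accepting a similar risk in software instead, per dataset:
zfs set sync=disabled tank/scratch
```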
Re: [zfs-discuss] Is another drive worth anything?
On 05/31/11 09:01, Anonymous wrote:
> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there any
> way to use this disk to good advantage in this box? I don't think I need any
> more redundancy, I would like to increase performance if possible. I have
> only one SATA port left so I can only use 3 drives total unless I buy a PCI
> card. Would you please advise me. Many thanks.

I'd use the extra SATA port for an ssd, and use that ssd for some combination of boot/root, ZIL, and L2ARC.

I have a couple of systems in this configuration now and have been quite happy with it. While slicing an ssd and using one slice for root, one slice for ZIL, and one slice for L2ARC isn't optimal from a performance standpoint and won't scale up to a larger configuration, it is a noticeable improvement over a 2-disk mirror.

I used an 80G Intel X25-M, with 1G for the ZIL and the rest split roughly 50:50 between the root pool and L2ARC for the data pool.

- Bill
Re: [zfs-discuss] Is another drive worth anything?
On May 31, 2011, at 19:00, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>>
>> Theoretically, you'll get a 50% read increase, but I doubt it'll be that
>> high in practice.

What about writes?
Re: [zfs-discuss] Is another drive worth anything?
On Tue, 31 May 2011, Edward Ned Harvey wrote:
> If you make it a 3-way mirror, your write performance will be unaffected,
> but your read performance will increase 50% over a 2-way mirror. All 3
> drives can read different data simultaneously for the net effect of 3x a
> single disk read performance.

I think that a read performance increase of (at most) 33.3% is more correct. You might obtain (at most) 50% over one disk by mirroring it. ZFS makes a random selection of which disk to read from in a mirror set, so the improvement is not truly linear.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Is another drive worth anything?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
>
> Theoretically, you'll get a 50% read increase, but I doubt it'll be that
> high in practice.

In my benchmarking, I found that a 2-way mirror reads at 1.97x the speed of a single disk, and a 3-way mirror reads at 2.91x a single disk.
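Those measurements track the simple model (an N-way mirror reads at roughly N times one disk) closely, and the relative gain of the third submirror works out near the predicted 50%:

```shell
#!/bin/sh
# Relative read gain of a 3-way over a 2-way mirror, from the measured
# speedups quoted above (awk, since sh arithmetic is integer-only).
awk 'BEGIN {
    two_way   = 1.97   # measured: 2-way mirror vs a single disk
    three_way = 2.91   # measured: 3-way mirror vs a single disk
    printf "3-way over 2-way: +%.0f%%\n", (three_way / two_way - 1) * 100
}'
```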
Re: [zfs-discuss] Is another drive worth anything?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Anonymous
>
> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there any
> way to use this disk to good advantage in this box? I don't think I need any
> more redundancy, I would like to increase performance if possible. I have
> only one SATA port left so I can only use 3 drives total unless I buy a PCI
> card. Would you please advise me. Many thanks.

If you make it a 3-way mirror, your write performance will be unaffected, but your read performance will increase 50% over a 2-way mirror. All 3 drives can read different data simultaneously, for the net effect of 3x a single disk's read performance.
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On 31 May, 2011 - Gertjan Oude Lohuis sent me these 0,9K bytes:

> On 05/31/2011 03:52 PM, Tomas Ögren wrote:
>> I've done a not too scientific test on reboot times for Solaris 10 vs 11
>> with regard to many filesystems...
>>
>> http://www8.cs.umu.se/~stric/tmp/zfs-many.png
>>
>> As the picture shows, don't try 10,000 filesystems with nfs on sol10.
>> Creating more filesystems gets slower and slower the more you have as
>> well.
>
> Since all filesystems would be shared via NFS, this clearly is a no-go :).
> Thanks!
>
>> On a different setup, we have about 750 datasets where we would like to
>> use a single recursive snapshot, but when doing that all file access
>> will be frozen for varying amounts of time
>
> What version of ZFS are you using? Like Matthew Ahrens said: version 27
> has a fix for this.

22, Solaris 10.

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On May 31, 2011, at 2:29 PM, Gertjan Oude Lohuis wrote:
> On 05/31/2011 03:52 PM, Tomas Ögren wrote:
>> I've done a not too scientific test on reboot times for Solaris 10 vs 11
>> with regard to many filesystems...
>>
>> http://www8.cs.umu.se/~stric/tmp/zfs-many.png
>>
>> As the picture shows, don't try 10,000 filesystems with nfs on sol10.
>> Creating more filesystems gets slower and slower the more you have as
>> well.
>
> Since all filesystems would be shared via NFS, this clearly is a no-go :).
> Thanks!

If you search the archives, you will find that the people who tried to do this in the past were more successful with legacy NFS export methods than with the sharenfs property in ZFS.

-- richard
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On 05/31/2011 12:26 PM, Khushil Dep wrote:
> Generally snapshots are quick operations, but 10,000 such operations
> would, I believe, take enough time to complete as to present operational
> issues - breaking these into sets would alleviate some? Perhaps if you
> are starting to run into many thousands of filesystems you would need to
> re-examine your rationale in creating so many.

Thanks for your feedback! My rationale is this: I have a lot of hosting accounts which have databases. These databases need to be backed up, preferably with mysqldump, and there needs to be historic data. I would like to use ZFS snapshots for this. However, there are some variables that need to be taken into account:

* Different hosting plans offer different backup schedules: every 3 hours or every 24 hours. Backups might be kept 3, 14 or 30 days. These schedules thus need to be on separate storage; otherwise I can't create a matching snapshot schedule to create and rotate snapshots.

* Databases are hosted on multiple database servers and are frequently migrated between them. I could create a ZFS filesystem for each server, but if a hosting account is migrated, all its backups would be 'lost'.

Having one filesystem per hosting account would have solved nearly all the disadvantages I could think of, but I don't think it is going to work, sadly. I'll have to make some choices :).

Regards,
Gertjan Oude Lohuis
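One way to express the "one filesystem per backup plan, rather than per account" compromise is to group accounts under a dataset per plan and drive the rotation from cron. A sketch, with hypothetical pool, dataset and path names throughout:

```shell
#!/bin/sh
# Sketch: datasets grouped per backup plan (names hypothetical), e.g.
#   tank/backups/every3h/<account>   tank/backups/daily/<account>
# so a single recursive snapshot per plan covers every account on that plan.
# From cron, e.g.:  0 */3 * * *  /usr/local/bin/snap-plan.sh every3h 24
PLAN=$1               # plan name, e.g. "every3h"
KEEP=$2               # number of snapshots to retain for this plan
DS="tank/backups/${PLAN}"
STAMP=$(date +%Y%m%d-%H%M)

zfs snapshot -r "${DS}@${STAMP}"

# Rotate: recursively destroy the oldest snapshots beyond the retention count.
COUNT=$(zfs list -H -t snapshot -d 1 -o name "$DS" | wc -l)
EXCESS=$(( COUNT - KEEP ))
if [ "$EXCESS" -gt 0 ]; then
    zfs list -H -t snapshot -d 1 -o name -s creation "$DS" |
        head -n "$EXCESS" |
        while read SNAP; do
            zfs destroy -r "${DS}@${SNAP##*@}"
        done
fi
```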
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On 05/31/2011 03:52 PM, Tomas Ögren wrote:
> I've done a not too scientific test on reboot times for Solaris 10 vs 11
> with regard to many filesystems...
>
> http://www8.cs.umu.se/~stric/tmp/zfs-many.png
>
> As the picture shows, don't try 10,000 filesystems with nfs on sol10.
> Creating more filesystems gets slower and slower the more you have as
> well.

Since all filesystems would be shared via NFS, this clearly is a no-go :). Thanks!

> On a different setup, we have about 750 datasets where we would like to
> use a single recursive snapshot, but when doing that all file access
> will be frozen for varying amounts of time

What version of ZFS are you using? Like Matthew Ahrens said: version 27 has a fix for this.
Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE
I've written a possible solution using the svc-system-config SMF service, where on first boot it will "import -f" a specified list of pools, and it does work. I was hoping to find a cleaner solution via zpool.cache, but if there's no way to achieve it, I guess I'll have to stick with the other solution.

I even tried simply copying /etc/zfs/zpool.cache to /a/etc/zfs/zpool.cache and not exporting/importing the data pool at all; however, this gave the same hostid problem.

Thanks for your help.

cheers
Matt

Jim Klimov wrote:
> Actually if you need beadm to "know" about the data pool, it might be
> beneficial to mix both approaches - yours with bemount, and init-script
> to enforce the pool import on that first boot...
> HTH, //Jim Klimov
Re: [zfs-discuss] Is another drive worth anything?
> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there any
> way to use this disk to good advantage in this box? I don't think I need any
> more redundancy, I would like to increase performance if possible. I have
> only one SATA port left so I can only use 3 drives total unless I buy a PCI
> card. Would you please advise me. Many thanks.

A third drive in the mirror (aka a three-way mirror) will increase read performance from the pool, as ZFS reads from all drives in a mirror. Theoretically, you'll get a 50% read increase, but I doubt it'll be that high in practice.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Is another drive worth anything?
> Hi. I have a development system on Intel commodity hardware with a 500G ZFS
> root mirror. I have another 500G drive same as the other two. Is there any
> way to use this disk to good advantage in this box? I don't think I need any
> more redundancy, I would like to increase performance if possible. I have
> only one SATA port left so I can only use 3 drives total unless I buy a PCI
> card. Would you please advise me. Many thanks.

Well, you can use this drive as a separate "scratch area" - a single-disk pool without redundancy. You'd have a separate spindle for some dedicated tasks, with data you're okay with losing.

You can also make the rpool a three-way mirror, which may increase read speeds if you have enough concurrency. And when one drive breaks, your rpool is still mirrored.

HTH,
//Jim Klimov
Re: [zfs-discuss] not sure how to make filesystems
Alas, I have some notes on the subject of migration from UFS to ZFS with split filesystems (separate /usr, /var and some /var/* subdirs), but they are an unpublished document in Russian ;) Here I will outline some main points, but will probably have omitted some others :( Hope this helps anyway...

Splitting off /usr and /var/* subdirs into separate datasets has been a varying success (it worked in Solaris 10 and OpenSolaris SXCE, failed in OpenIndiana) and may cause issues during the first reboots after OS upgrades and after some repair reboots (system tools don't expect such a layout), but separating /var as a single dataset is supported. Paths like /export and /opt are not involved as "system root", so these can be implemented any way you want, including storage on a separate "data pool".

With the ZFS root in place, you can either create a swap volume inside the ZFS root pool, or use a dedicated partition for swapping, or do both. With a dedicated partition you might control where on disk it is located (faster/slower tracks), but you dedicate this space to swapping only, whether or not it is needed. With volumes you can relatively easily resize the swap area.

/tmp is usually implemented as a "tmpfs" filesystem, and as such it is stored in virtual memory, which is spread between RAM and swap areas; its contents are lost on reboot - but you don't really care much about that implementation detail. In your vfstab file you just have this line:

# grep tmp /etc/vfstab
swap    -       /tmp    tmpfs   -       yes     -

In short, you might not want to involve LU in this at all: after a successful migration has been tested, you're likely to kill the UFS partition and use it as part of the ZFS root pool mirror. After that you would want to start the LU history from scratch, by naming this ZFS-rooted copy of your installation the initial boot environment, and later LUpgrade it to newer releases.

Data migration itself is rather simple: you create the ZFS pool named "rpool" in an available slice (i.e.
c0t1d0s0) and in that rpool you create and mount the needed hierarchy of filesystem datasets (compatible with LU/beadm expectations). Then you copy over all the file data from UFS into your hierarchy (ufsdump/ufsrestore or Sun cpio preferred - to keep the ACL data), then enable booting of the ZFS root (zpool set bootfs=), and test if it works ;)

# format
...   (create the slice #0 on c0t1d0 of appropriate size - see below)
# zpool create -f -R /a rpool c0t1d0s0
# zfs create -o mountpoint=legacy rpool/ROOT
# zfs create -o mountpoint=/ rpool/ROOT/sol10u8
# zfs create -o compression=on rpool/ROOT/sol10u8/var
# zfs create -o compression=on rpool/ROOT/sol10u8/opt
# zfs create rpool/export
# zfs create -o compression=on rpool/export/home
# zpool set bootfs=rpool/ROOT/sol10u8 rpool
# zpool set failmode=continue rpool

Optionally create the swap and dump areas, i.e.:

# zfs create -V2g rpool/dump
# zfs create -V2g rpool/swap

If all goes well (and I didn't make typing mistakes) you should have the hierarchy mounted under /a. Check with "df -k" to be sure...

One way to copy - with ufsdump:

# cd /a && ( ufsdump 0f - / | ufsrestore -rf - )
# cd /a/var && ( ufsdump 0f - /var | ufsrestore -rf - )
# cd /a/opt && ( ufsdump 0f - /opt | ufsrestore -rf - )
# cd /a/export/home && ( ufsdump 0f - /export/home | ufsrestore -rf - )

Another way - with Sun cpio:

# cd /a
# mkdir -p tmp proc devices var/run system/contract system/object etc/svc/volatile
# touch etc/mnttab etc/dfs/sharetab
# cd / && ( /usr/bin/find . var opt export/home -xdev -depth -print | /usr/bin/cpio -Ppvdm /a )

Review the /a/etc/vfstab file; you probably need to comment away the explicit mountpoints for your new datasets, including root.
It might get to look like this:

# cat /etc/vfstab
#device                  device          mount             FS       fsck  mount    mount
#to mount                to fsck         point             type     pass  at boot  options
#
/devices                 -               /devices          devfs    -     no       -
/proc                    -               /proc             proc     -     no       -
ctfs                     -               /system/contract  ctfs     -     no       -
objfs                    -               /system/object    objfs    -     no       -
sharefs                  -               /etc/dfs/sharetab sharefs  -     no       -
fd                       -               /dev/fd           fd       -     no       -
swap                     -               /tmp              tmpfs    -     yes      -
/dev/zvol/dsk/rpool/swap -               -                 swap     -     no       -

Finally, install the right bootloader for the current OS.

* In case of GRUB:

# /a/sbin/installgrub /a/boot/grub/stage1 /a/boot/grub/stage2 /dev/rdsk/c0t1d0s0
# mkdir -p /a/rpool/boot/grub
# cp /boot/grub/menu.lst /a/rpool/boot/grub

Review and update the GRUB menu file as needed. Note that the current disk wh
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On Tue, May 31, 2011 at 6:52 AM, Tomas Ögren wrote:
>
> On a different setup, we have about 750 datasets where we would like to
> use a single recursive snapshot, but when doing that all file access
> will be frozen for varying amounts of time (sometimes half an hour or
> way more). Splitting it up into ~30 subsets, doing recursive snapshots
> over those instead has decreased the total snapshot time greatly and cut
> the "frozen time" down to single digit seconds instead of minutes or
> hours.

If you can upgrade to zpool version 27 or later, you should see much, much less "frozen time" when doing a "zfs snapshot -r" of thousands of filesystems.

--matt
[zfs-discuss] Is another drive worth anything?
Hi. I have a development system on Intel commodity hardware with a 500G ZFS root mirror. I have another 500G drive, the same as the other two. Is there any way to use this disk to good advantage in this box? I don't think I need any more redundancy, but I would like to increase performance if possible. I have only one SATA port left, so I can only use 3 drives total unless I buy a PCI card. Would you please advise me? Many thanks. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE
Actually, if you need beadm to "know" about the data pool, it might be beneficial to mix both approaches - yours with beadm mount, and an init script to enforce the pool import on that first boot... HTH, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Importing Corrupted zpool never ends
I had several pool corruptions on my test box recently, and recovery imports did indeed take a large part of a week (the process starved my 8GB of RAM, so the system hung and had to be reset with the hardware reset button - and this contributed to the large timeframe). Luckily for me, these import attempts were cumulative, so after a while the system began working. It seems that the system had crashed during a major deletion operation and needed more time to find and release the deferred-free blocks. Not sure if my success would apply to your situation, though.

iostat speeds can vary during pool maintenance operations (i.e. scrub, and probably import and zdb walks too) depending on (metadata) fragmentation, CPU load, etc. A more relevant metric here is %busy for the disks.

While researching my problem I found many older posts indicating that this is "normal"; however, setting some kernel values with mdb may help speed up the process and/or have it succeed. To be short here, I can suggest that you read my recent threads from that timeframe:

* (OI_148a) ZFS hangs when accessing the pool. How to trace what's happening? http://opensolaris.org/jive/thread.jspa?messageID=515689
* Questions on ZFS pool as a volume in another ZFS pool - details my system's setup http://opensolaris.org/jive/thread.jspa?threadID=138604&tstart=0

Since the system froze often by dropping into swap-hell, I had to create a watchdog which would initiate an ungraceful reboot when the conditions were "right". My FreeRAM-Watchdog code, a compiled i386 binary and a primitive SMF service wrapper can be found here: http://thumper.cos.ru/~jim/freeram-watchdog-20110531-smf.tgz

Other related forum threads:

* zpool import hangs indefinitely (retry post in parts; too long?)
http://opensolaris.org/jive/thread.jspa?threadID=131237 * zpool import hangs http://opensolaris.org/jive/thread.jspa?threadID=70205&tstart=15 - Original Message - From: Christian Becker Date: Tuesday, May 31, 2011 18:02 Subject: [zfs-discuss] Importing Corrupted zpool never ends To: zfs-discuss@opensolaris.org > Hi There, > I need to import an corrupted ZPOOL after double-Crash (Mainboard and one > HDD) on a different system. > It is a RAIDZ1 - 3 HDDs - only 2 are working. > > Problem: spool import -f poolname runs and runs and runs. Looking after > iostat (not zpool iostat) it is doing something - but what? And why does it > last so long (2x 1.5TB - Atom System). > > iostat seems to read and write with something about 500kB/s - I hope that it > doesn't work through the whole 1500GB - that would need 40 Days... > > Hope someone could help me. > > Thanks allot > Chris > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- ++ || | Климов Евгений, Jim Klimov | | технический директор CTO | | ЗАО "ЦОС и ВТ" JSC COS&HT | || | +7-903-7705859 (cellular) mailto:jimkli...@cos.ru | |CC:ad...@cos.ru,jimkli...@gmail.com | ++ | () ascii ribbon campaign - against html mail | | /\- against microsoft attachments | ++ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
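The FreeRAM-Watchdog mentioned above is a C program, but its control loop is simple. A runnable toy sketch of the same idea - the threshold, the fake readings, and the echoed action are all stand-ins; the real tool samples kernel memory statistics and forces an immediate reboot rather than printing a message:

```shell
#!/bin/sh
# Watchdog sketch: if free RAM stays below a floor for several
# consecutive polls, trigger the reset action.
FLOOR=5000   # pages; placeholder threshold
NEED=3       # consecutive low readings required to fire
STREAK=0
for free in 9000 4000 3000 2000; do   # fake successive free-RAM samples
  if [ "$free" -lt "$FLOOR" ]; then
    STREAK=$((STREAK + 1))
  else
    STREAK=0
  fi
  if [ "$STREAK" -ge "$NEED" ]; then
    echo "would reboot now (free=$free pages)"
  fi
done
```

Requiring several consecutive low readings avoids rebooting on a momentary dip while the box is still usable.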
Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE
I should have seen that coming, but didn't ;)

I think in this case I would go with a different approach: don't import the data pool in the AI instance and save it to zpool.cache. Instead, make sure it is cleanly exported from the AI instance, and in the installed system create a self-destructing init script or SMF service. As an init script it might go like this:

#!/bin/sh
# /etc/rc2.d/S00importdatapool
[ "$1" = start ] && zpool import -f datapool && rm -f "$0"

Or you can try setting the hostid in a persistent manner (perhaps via eeprom emulation in /boot/solaris/bootenv.rc ?)

- Original Message - From: Matt Keenan Date: Tuesday, May 31, 2011 21:02 Subject: Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE To: j...@cos.ru Cc: zfs-discuss@opensolaris.org

> Jim,
>
> Thanks for the response, I've nearly got it working, coming up against a
> hostid issue.
>
> Here's the steps I'm going through :
>
> - At end of auto-install, on the client just installed, before I manually
> reboot I do the following :
>   $ beadm mount solaris /a
>   $ zpool export data
>   $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data
>   $ beadm umount solaris
>   $ reboot
>
> - Before rebooting I check /a/etc/zfs/zpool.cache and it does contain
> references to "data".
>
> - On reboot, the automatic import of data is attempted however the following
> message is displayed :
>
> WARNING: pool 'data' could not be loaded as it was last accessed by
> another system (host: ai-client hostid: 0x87a4a4). See
> http://www.sun.com/msg/ZFS-8000-EY.
>
> - Host id on booted client is :
>   $ hostid
>   000c32eb
>
> As I don't control the import command on boot I cannot simply add a "-f"
> to force the import, any ideas on what else I can do here ?
> > cheers > > Matt > > On 05/27/11 13:43, Jim Klimov wrote: > > Did you try it as a single command, somewhat like: > > > > zpool create -R /a -o cachefile=/a/etc/zfs/zpool.cache mypool c3d0 > > Using altroots and cachefile(=none) explicitly is a nearly- > documented> way to avoid caching pools which you would not want > to see after > > reboot, i.e. removable media. > > I think that after the AI is done and before reboot you might > want to > > reset the altroot property to point to root (or be undefined) > so that > > the data pool is mounted into your new rpools hierarchy and not > > under "/a/mypool" again ;) > > And if your AI setup does not use the data pool, you might be better > > off not using altroot at all, maybe... > > > > - Original Message - > > From: Matt Keenan > > Date: Friday, May 27, 2011 13:25 > > Subject: [zfs-discuss] Ensure Newly created pool is imported > > automatically in new BE > > To: zfs-discuss@opensolaris.org > > > > > Hi, > > > > > > Trying to ensure a newly created data pool gets import on boot > > > into a > > > new BE. > > > > > > Scenario : > > >Just completed a AI install, and on the client > > > before I reboot I want > > > to create a data pool, and have this pool automatically imported > > > on boot > > > into the newly installed AI Boot Env. > > > > > >Trying to use the -R altroot option to > zpool create > > > to achieve this or > > > the zpool set -o cachefile property, but having no luck, and > > > would like > > > some advice on what the best means of achieving this would be. 
> > > > > > When the install completes, we have a default root pool > "rpool", which > > > contains a single default boot environment, rpool/ROOT/solaris > > > > > > This is mounted on /a so I tried : > > > zpool create -R /a mypool c3d0 > > > > > > Also tried : > > > zpool create mypool c3d0 > > > zpool set -o cachefile=/a mypool > > > > > > I can clearly see /a/etc/zfs/zpool.cache contains information > > > for rpool, > > > but it does not get any information about mypool. I would expect > > > this > > > file to contain some reference to mypool. So I tried : > > > zpool set -o > cachefile=/a/etc/zfs/zpool.cache> > > > > Which fails. > > > > > > Any advice would be great. > > > > > > cheers > > > > > > Matt > > > ___ > > > zfs-discuss mailing list > > > zfs-discuss@opensolaris.org > > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- > > > > ++ > > > || > > | Климов > Евгений, Jim Klimov | > > | технический > директор CTO | > > | ЗАО "ЦОС и > ВТ" JSC COS&HT | > > > || > > | +7-903-7705859 > (cellular) mailto:jimkli...@cos.ru | > > > |CC:ad...@cos.ru,jimkli...@gmail.com | > > ++ > > | () ascii ribbon campaign - against html
Re: [zfs-discuss] Experiences with 10.000+ filesystems
In general, you may need to keep data in one dataset if it is somehow related (i.e. the backup of a specific machine or program, a user's home, etc.) and if you plan to manage it in a consistent manner. For example, CIFS shares cannot be nested, so for a unitary share (like "distribs") you would probably want one dataset. Also, you can only have hardlinks within one FS dataset, so if you maintain different views into a distribution set (i.e. sorted by vendor or sorted by software type) and you do it with hardlinks, you need one dataset as well. If you often move (link and unlink) files around, i.e. from an "incoming" directory to final storage, you may or may not want to have that "incoming" in the same dataset; this depends on some other considerations too.

You want to split datasets when you need them to have different features and perhaps different uses: to have them as separate shares, to enforce separate quotas and reservations, perhaps to delegate administration to particular OS users (i.e. let a user manage snapshots of his own homedir) and/or local zones. Don't forget about individual dataset properties (i.e. you may want compression for source code files but not for a multimedia collection), snapshots and clones, etc.

> 2. space management (we have wasted space in some pools while others
> are starved)

Well, that's a reason to decrease the number of pools, but not of datasets ;)

> 3. tool speed
>
> I do not have good numbers for time to do some of these operations
> as we are down to under 200 datasets (1/3 of the way through the
> migration to the new layout). I do have log entries that point to
> about a minute to complete a `zfs list` operation.
>
> > Would I run into any problems when snapshots are taken (almost)
> > simultaneously from multiple filesystems at once?
>
> Our logs show snapshot creation time at 2 seconds or less, but we
> do not try to do them all at once, we walk the list of datasets and
> process (snapshot and replicate) each in turn.
I can partially relate to that. We have a Thumper system running OpenSolaris SXCE snv_177, with a separate dataset for each user's home directory, for backups of each individual remote machine, for each VM image, each local zone, etc. - in particular so as to have separate snapshot histories and the possibility to clone what we need to. Its relatively numerous filesystems (about 350) may or may not be a problem depending on the tool used. For example, a typical import of the main pool may take up to 8 minutes when in safe mode, but many of the delays seem to be related to attempts to share_nfs and share_cifs while the network is down ;)

Auto-snapshots are on, and listing them is indeed rather long:

[root@thumper ~]# time zfs list -tall -r pond | wc -l
56528

real    0m18.146s
user    0m7.360s
sys     0m10.084s

[root@thumper ~]# time zfs list -tvolume -r pond | wc -l
5

real    0m0.096s
user    0m0.025s
sys     0m0.073s

[root@thumper ~]# time zfs list -tfilesystem -r pond | wc -l
353

real    0m0.123s
user    0m0.052s
sys     0m0.073s

Some operations like listing the filesystems SEEM slow due to the terminal, but in fact are rather quick:

[root@thumper ~]# time df -k | wc -l
363

real    0m2.104s
user    0m0.094s
sys     0m0.183s

However, low-level system programs may have problems with multiple FSes; one known troublemaker is LiveUpgrade. Jens Elkner published a wonderful set of patches for Solaris 10 and OpenSolaris to limit LU's interest to just the filesystems that the admin knows to be relevant for the OS upgrade (they also fix mount order and other known bugs of that LU software release):

* http://iws.cs.uni-magdeburg.de/~elkner/luc/lutrouble.html

True, 10000+ FSes is not something I would have seen, so some tools (especially legacy ones) may break at the sheer number of mountpoints :)

One of my own tricks for cleaning snapshots, i.e.
to free up pool space quickly when the pool is starved, is to use parallel "zfs destroy" invocations like this (note the ampersand):

# zfs list -t snapshot -r pond/export/home/user | grep @zfs-auto-snap | awk '{print $1}' | \
    while read Z ; do zfs destroy "$Z" & done

This may spawn several thousand processes (if called for the root dataset), but they often complete in just 1-2 minutes instead of hours for a one-by-one series of calls; I guess because this way many ZFS metadata operations are requested in a small timeframe and get coalesced into a few big writes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
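If your xargs supports -P (a non-POSIX flag, but present in GNU findutils and most modern BSD-derived userlands - check yours), the same effect can be had with bounded parallelism instead of thousands of simultaneous background jobs. A sketch with "zfs destroy" stubbed out by echo so it can be dry-run anywhere (snapshot names are placeholders):

```shell
#!/bin/sh
# Bounded-parallel destroys: xargs keeps at most 8 "zfs destroy"
# commands in flight at once. Remove the "echo" stub (and feed the
# list from "zfs list -H -o name -t snapshot ...") on a real system.
printf '%s\n' \
  pond/export/home/user@zfs-auto-snap_1 \
  pond/export/home/user@zfs-auto-snap_2 \
  pond/export/home/user@zfs-auto-snap_3 |
  xargs -n1 -P8 echo zfs destroy
```

This should retain most of the coalescing benefit while avoiding the several-thousand-process spike of the ampersand loop.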
Re: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE
Jim, Thanks for the response, I've nearly got it working, coming up against a hostid issue. Here's the steps I'm going through : - At end of auto-install, on the client just installed before I manually reboot I do the following : $ beadm mount solaris /a $ zpool export data $ zpool import -R /a -N -o cachefile=/a/etc/zfs/zpool.cache data $ beadm umount solaris $ reboot - Before rebooting I check /a/etc/zfs/zpool.cache and it does contain references to "data". - On reboot, the automatic import of data is attempted however following message is displayed : WARNING: pool 'data' could not be loaded as it was last accessed by another system (host: ai-client hostid: 0x87a4a4). See http://www.sun.com/msg/ZFS-8000-EY. - Host id on booted client is : $ hostid 000c32eb As I don't control the import command on boot i cannot simply add a "-f" to force the import, any ideas on what else I can do here ? cheers Matt On 05/27/11 13:43, Jim Klimov wrote: Did you try it as a single command, somewhat like: zpool create -R /a -o cachefile=/a/etc/zfs/zpool.cache mypool c3d0 Using altroots and cachefile(=none) explicitly is a nearly-documented way to avoid caching pools which you would not want to see after reboot, i.e. removable media. I think that after the AI is done and before reboot you might want to reset the altroot property to point to root (or be undefined) so that the data pool is mounted into your new rpools hierarchy and not under "/a/mypool" again ;) And if your AI setup does not use the data pool, you might be better off not using altroot at all, maybe... - Original Message - From: Matt Keenan Date: Friday, May 27, 2011 13:25 Subject: [zfs-discuss] Ensure Newly created pool is imported automatically in new BE To: zfs-discuss@opensolaris.org > Hi, > > Trying to ensure a newly created data pool gets import on boot > into a > new BE. 
> > Scenario : >Just completed a AI install, and on the client > before I reboot I want > to create a data pool, and have this pool automatically imported > on boot > into the newly installed AI Boot Env. > >Trying to use the -R altroot option to zpool create > to achieve this or > the zpool set -o cachefile property, but having no luck, and > would like > some advice on what the best means of achieving this would be. > > When the install completes, we have a default root pool "rpool", which > contains a single default boot environment, rpool/ROOT/solaris > > This is mounted on /a so I tried : > zpool create -R /a mypool c3d0 > > Also tried : > zpool create mypool c3d0 > zpool set -o cachefile=/a mypool > > I can clearly see /a/etc/zfs/zpool.cache contains information > for rpool, > but it does not get any information about mypool. I would expect > this > file to contain some reference to mypool. So I tried : > zpool set -o cachefile=/a/etc/zfs/zpool.cache > > Which fails. > > Any advice would be great. > > cheers > > Matt > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- ++ || | Климов Евгений, Jim Klimov | | технический директор CTO | | ЗАО "ЦОС и ВТ" JSC COS&HT | || | +7-903-7705859 (cellular) mailto:jimkli...@cos.ru | |CC:ad...@cos.ru,jimkli...@gmail.com | ++ | () ascii ribbon campaign - against html mail | | /\- against microsoft attachments | ++ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On Tue, May 31 at 8:52, Paul Kraus wrote: When we initially configured a large (20TB) files server about 5 years ago, we went with multiple zpools and multiple datasets (zfs) in each zpool. Currently we have 17 zpools and about 280 datasets. Nowhere near the 10,000+ you intend. We are moving _away_ from the many dataset model to one zpool and one dataset. We are doing this for the following reasons: 1. manageability 2. space management (we have wasted space in some pools while others are starved) 3. tool speed I do not have good numbers for time to do some of these operations as we are down to under 200 datasets (1/3 of the way through the migration to the new layout). I do have log entries that point to about a minute to complete a `zfs list` operation. It would be interesting to see if you still had issues (#3) with 1 pool and your 280 datasets. It would definitely eliminate #2. -- Eric D. Mudama edmud...@bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Experiences with 10.000+ filesystems
Gertjan,

In addition to the comments directly replying to your post, we have had similar discussions previously on the zfs-discuss list. If you care to go and review the list archives, I can share that we have had similar discussions in at least the following time periods:

March 2006
May 2008
January 2010
February 2010

There may be (and probably are) more such threads in the list archives, but I know from my personal archives that these are good dates.

Hope this helps, Jerry

On 05/31/11 05:08, Gertjan Oude Lohuis wrote:
> "Filesystems are cheap" is one of ZFS's mottos. I'm wondering how far
> this goes. Does anyone have experience with having more than 10.000 ZFS
> filesystems? I know that mounting this many filesystems during boot
> will take considerable time. Are there any other disadvantages that I
> should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
> get/set'?
> Would I run into any problems when snapshots are taken (almost)
> simultaneously from multiple filesystems at once?
>
> Regards,
> Gertjan Oude Lohuis

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] not sure how to make filesystems
On 29/05/2011 19:55, Bill Palin wrote:
> I'm migrating some filesystems from UFS to ZFS and I'm not sure how to create a couple of them.
> I want to migrate /, /var, /opt, /export/home and also want swap and /tmp. I don't care about any of the others.
> The first disk, and the one with the UFS filesystems, is c0t0d0 and the 2nd disk is c0t1d0. I've been told that /tmp is supposed to be part of swap.
> So far I have:
> lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /var:/dev/dsk/c0t0d0s3:ufs -m /export/home:/dev/dsk/c0t0d0s5:ufs -m /opt:/dev/dsk/c0t0d0s4:ufs -m -:/dev/dsk/c0t1d0s2:swap -m /tmp:/dev/dsk/c0t1d0s3:swap -n zfsBE -p rootpool
> And then set quotas for them. Is this right?

Hi,

ZFS root is very different: you cannot have a mix of UFS filesystems and zvol-based swap at all, and lucreate is a bit restricted - you cannot split out /var. The only invocation that works is:

lucreate -n zfsBE -p rpool

where rpool is an SMI-labeled pool. To check for SMI, run format, select the rpool disk, then p, p, and check whether it lists cylinders (SMI); if not, run format -e on the disk and label it (delete rpool first if it already exists), then preferably (but not necessarily) put all the space in slice 0, so that rpool has the whole disk. After booting the zfsBE, you can modify the swap and dump zvols (look on Google for "zfs root swap").

Enda ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Importing Corrupted zpool never ends
Hi There,

I need to import a corrupted zpool after a double crash (mainboard and one HDD) on a different system. It is a RAIDZ1 - 3 HDDs - only 2 are working.

Problem: zpool import -f poolname runs and runs and runs. Watching iostat (not zpool iostat), it is doing something - but what? And why does it take so long (2x 1.5TB - Atom system)?

iostat seems to read and write at about 500kB/s - I hope that it doesn't work through the whole 1500GB - that would need 40 days...

Hope someone can help me.

Thanks a lot
Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
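For what it's worth, the 40-day figure is the right order of magnitude. A quick sanity check of the arithmetic (plain POSIX shell, nothing ZFS-specific; needs 64-bit shell arithmetic, which any modern sh has):

```shell
#!/bin/sh
# 1.5 TB of data walked at the ~500 kB/s seen in iostat.
bytes=1500000000000     # 1.5 TB (decimal)
rate=500000             # ~500 kB/s
secs=$((bytes / rate))
echo "$((secs / 86400)) days"   # ~34 days - same ballpark as the 40-day guess
```

Of course an import does not have to touch every byte of the disks, so this is an upper bound on the pessimistic reading, not a prediction.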
[zfs-discuss] not sure how to make filesystems
I'm migrating some filesystems from UFS to ZFS and I'm not sure how to create a couple of them. I want to migrate /, /var, /opt, /export/home and also want swap and /tmp. I don't care about any of the others. The first disk, and the one with the UFS filesystems, is c0t0d0 and the 2nd disk is c0t1d0. I've been told that /tmp is supposed to be part of swap. So far I have:

lucreate -m /:/dev/dsk/c0t0d0s0:ufs -m /var:/dev/dsk/c0t0d0s3:ufs -m /export/home:/dev/dsk/c0t0d0s5:ufs -m /opt:/dev/dsk/c0t0d0s4:ufs -m -:/dev/dsk/c0t1d0s2:swap -m /tmp:/dev/dsk/c0t1d0s3:swap -n zfsBE -p rootpool

And then set quotas for them. Is this right? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On 31 May, 2011 - Khushil Dep sent me these 4,5K bytes:

> The adage that I adhere to with ZFS features is "just because you can
> doesn't mean you should!". I would suspect that with that many
> filesystems the normal zfs-tools would also take an inordinate length
> of time to complete their operations - scale according to size.

I've done a not-too-scientific test on reboot times for Solaris 10 vs 11 with regard to many filesystems... Quad Xeon machines with a single raid10 and one boot environment. Using more BEs with LU in sol10 will make the situation even worse, as it's LU that's taking time (re)mounting all filesystems over and over and over and over again.

http://www8.cs.umu.se/~stric/tmp/zfs-many.png

As the picture shows, don't try 10000 filesystems with NFS on sol10. Creating more filesystems also gets slower and slower the more you have.

> Generally snapshots are quick operations but 10,000 such operations
> would I believe take enough time to complete as to present
> operational issues - breaking these into sets would alleviate some?
> Perhaps if you are starting to run into many thousands of filesystems
> you would need to re-examine your rationale in creating so many.

On a different setup, we have about 750 datasets where we would like to use a single recursive snapshot, but when doing that all file access will be frozen for varying amounts of time (sometimes half an hour or way more). Splitting it up into ~30 subsets and doing recursive snapshots over those instead has decreased the total snapshot time greatly and cut the "frozen time" down to single-digit seconds instead of minutes or hours.

> My 2c. YMMV.
>
> --
> Khush
>
> On Tuesday, 31 May 2011 at 11:08, Gertjan Oude Lohuis wrote:
>
> > "Filesystems are cheap" is one of ZFS's mottos. I'm wondering how far
> > this goes. Does anyone have experience with having more than 10.000 ZFS
> > filesystems? I know that mounting this many filesystems during boot
> > will take considerable time. Are there any other disadvantages that I
> > should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs
> > get/set'?
> > Would I run into any problems when snapshots are taken (almost)
> > simultaneously from multiple filesystems at once?
> >
> > Regards,
> > Gertjan Oude Lohuis
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org (mailto:zfs-discuss@opensolaris.org)
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Experiences with 10.000+ filesystems
On Tue, May 31, 2011 at 6:08 AM, Gertjan Oude Lohuis wrote: > "Filesystem are cheap" is one of ZFS's mottos. I'm wondering how far > this goes. Does anyone have experience with having more than 10.000 ZFS > filesystems? I know that mounting this many filesystems during boot > while take considerable time. Are there any other disadvantages that I > should be aware of? Are zfs-tools still usable, like 'zfs list', 'zfs > get/set'. When we initially configured a large (20TB) files server about 5 years ago, we went with multiple zpools and multiple datasets (zfs) in each zpool. Currently we have 17 zpools and about 280 datasets. Nowhere near the 10,000+ you intend. We are moving _away_ from the many dataset model to one zpool and one dataset. We are doing this for the following reasons: 1. manageability 2. space management (we have wasted space in some pools while others are starved) 3. tool speed I do not have good numbers for time to do some of these operations as we are down to under 200 datasets (1/3 of the way through the migration to the new layout). I do have log entries that point to about a minute to complete a `zfs list` operation. > Would I run into any problems when snapshots are taken (almost) > simultaneously from multiple filesystems at once? Our logs show snapshot creation time at 2 seconds or less, but we do not try to do them all at once, we walk the list of datasets and process (snapshot and replicate) each in turn. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
Interesting, although it makes sense ;)

Now I wonder about reliability (with large 2-3TB drives and long scrub/resilver/replace times): say I have 12 drives in my box. I can lay them out as 4*3-disk raidz1, 3*4-disk raidz1, or 1*12-disk raidz3 with nearly the same capacity (8-9 data disks plus parity). I see that with more vdevs the IOPS will grow - does this translate to better resilver and scrub times as well? Smaller raidz sets can be more easily spread over different controllers and JBOD boxes, which is also an interesting factor...

How good or bad is the expected reliability of 3*4-disk raidz1 vs. 1*12-disk raidz3 - that is, which tradeoff is better: more vdevs, or more parity (surviving the loss of ANY 3 disks rather than only the "right" 3)?

Thanks, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Fri, May 27, 2011 at 2:49 PM, Marty Scholes wrote: > For what it's worth, I ran a 22 disk home array as a single RAIDZ3 vdev > (19+3)for several > months and it was fine. These days I run a 32 disk array laid out as four > vdevs, each an > 8 disk RAIDZ2, i.e. 4x 6+2. I tested 40 drives in various configurations and determined that for random read workloads, the I/O scaled linearly with the number of vdevs, NOT the number of drives. See https://spreadsheets.google.com/a/kraus-haus.org/spreadsheet/pub?hl=en_US&hl=en_US&key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&output=html for results using raidz2 vdevs. I did not test sequential read performance here as our workload does not include any. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
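Jim's 12-drive layout question upthread plugs straight into this result: if random-read IOPS scale with vdev count, the three candidate layouts differ by up to a factor of four. A back-of-envelope sketch - the ~100 IOPS-per-vdev figure is a placeholder assumption for 7200 rpm members, not taken from the tests above:

```shell
#!/bin/sh
# Random-read IOPS scale with vdev count, so for 12 disks:
# 4x3-disk raidz1 -> 4 vdevs, 3x4-disk raidz1 -> 3, 1x12-disk raidz3 -> 1.
per_vdev=100   # placeholder IOPS per vdev
for layout in 4x3-disk-raidz1:4 3x4-disk-raidz1:3 1x12-disk-raidz3:1; do
  echo "${layout%:*}: ~$(( ${layout#*:} * per_vdev )) random-read IOPS"
done
```

This says nothing about resilver times or about surviving any-3-disk failures - only about the random-read scaling measured above.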
Re: [zfs-discuss] Question on ZFS iSCSI
> The volume is exported as whole disk. When given whole disk, zpool
> creates GPT partition table by default. You need to pass the partition
> (not the disk) to zdb.

Yes, that seems to be the problem. However, for the zfs volumes (/dev/zvol/rdsk/pool/dcpool) there seems to be no concept of partitions, etc. inside of them - these are defined only for the iSCSI representation, which I want to try and get rid of.

> In Linux you can use kpartx to make the partitions available. I don't
> know the equivalent command in Solaris.

Interesting... If only lofiadm could represent not a whole file, but a given "window" into it ;)

At least, trying loopback mounts as well as probing the zfs volume directly with "fdisk", "parted" and such reveals that there are no noticeable iSCSI service-data overheads in the addressable volume space:

# parted /dev/zvol/rdsk/pool/dcpool print
_device_probe_geometry: DKIOCG_PHYGEOM: Inappropriate ioctl for device
Model: Generic Ide (ide)
Disk /dev/zvol/rdsk/pool/dcpool: 4295GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      131kB   4295GB  4295GB  zfs
 9      4295GB  4295GB  8389kB

But lofiadm doesn't let me address that partition #1 as a separate device :(

Thanks, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Question on ZFS iSCSI
On Tue, May 31, 2011 at 5:47 PM, Jim Klimov wrote: > However it seems that there may be some extra data beside the zfs > pool in the actual volume (I'd at least expect an MBR or GPT, and > maybe some iSCSI service data as an overhead). One way or another, > the "dcpool" can not be found in the physical zfs volume: > > === > # zdb -l /dev/zvol/rdsk/pool/dcpool > > > LABEL 0 > > failed to unpack label 0 The volume is exported as whole disk. When given whole disk, zpool creates GPT partition table by default. You need to pass the partition (not the disk) to zdb. > So the questions are: > > 1) Is it possible to skip iSCSI-over-loopback in this configuration? Yes. Well, maybe. In Linux you can use kpartx to make the partitions available. I don't know the equivalent command in Solaris. -- Fajar ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Question on ZFS iSCSI
I have an oi_148a test box with a pool on physical HDDs, a volume in this pool shared over iSCSI with explicit commands (sbdadm and such), and this iSCSI target is initiated by the same box. In the resulting iSCSI device I have another ZFS pool, "dcpool".

Recently I found the iSCSI part to be a potential bottleneck in my pool operations and wanted to revert to using the ZFS volume directly as the backing store for "dcpool". However, it seems that there may be some extra data beside the zfs pool in the actual volume (I'd at least expect an MBR or GPT, and maybe some iSCSI service data as an overhead). One way or another, the "dcpool" cannot be found in the physical zfs volume:

===
# zdb -l /dev/zvol/rdsk/pool/dcpool
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3
===

So the questions are:

1) Is it possible to skip iSCSI-over-loopback in this configuration? Preferably I would just specify a fixed offset (at which byte in the volume the "dcpool" data starts), remove the iSCSI/networking overheads, and see if they are the bottlenecks.

2) This configuration "zpool -> iSCSI -> zvol" was initially proposed as preferable over direct volume access by Darren Moffat as the fully supported way; see the last comments here: http://blogs.oracle.com/darren/entry/compress_encrypt_checksum_deduplicate_with

I still wonder why - the overhead is deemed negligible, and there are more options quickly available, such as mounting the iSCSI device on another server? Now that I've hit the problem of reverting to direct volume access, this makes sense ;)

Thanks in advance for ideas or clarifications, //Jim Klimov ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Experiences with 10.000+ filesystems
The adage that I adhere to with ZFS features is "just because you can doesn't mean you should!". I would suspect that with that many filesystems the normal zfs tools would also take an inordinate length of time to complete their operations - they scale according to size. Generally snapshots are quick operations, but 10,000 such operations would, I believe, take enough time to complete as to present operational issues - would breaking these into sets alleviate some of that? Perhaps if you are starting to run into many thousands of filesystems you should re-examine your rationale for creating so many.

My 2c. YMMV.

-- Khush

On Tuesday, 31 May 2011 at 11:08, Gertjan Oude Lohuis wrote:
> "Filesystems are cheap" is one of ZFS's mottos. I'm wondering how far
> this goes. Does anyone have experience with having more than 10.000 ZFS
> filesystems? I know that mounting this many filesystems during boot
> will take considerable time. Are there any other disadvantages that I
> should be aware of? Are the zfs tools still usable, like 'zfs list' and
> 'zfs get/set'?
> Would I run into any problems when snapshots are taken (almost)
> simultaneously from multiple filesystems at once?
>
> Regards,
> Gertjan Oude Lohuis
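On batching: a recursive snapshot covers a whole subtree atomically, so one command can stand in for thousands of individual ones (a sketch; the pool and snapshot names are hypothetical):

```shell
# One atomic operation snapshotting every descendant filesystem,
# instead of 10,000 separate 'zfs snapshot' invocations:
zfs snapshot -r tank@nightly-2011-05-31

# Destroying the set is likewise a single recursive command:
zfs destroy -r tank@nightly-2011-05-31
```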
Re: [zfs-discuss] JBOD recommendation for ZFS usage
So, if I may, is this the correct summary of the answer to the original question (on a JBOD for a ZFS HA cluster)?

===
The SC847E26-RJBOD1 with dual-ported SAS drives is known to work in a failover HA storage scenario, allowing both servers (HBAs) access to each single SAS drive individually, so zpools can be configured from any disks regardless of which backplane they are connected to. HA clusterware, such as the NexentaStor HA-Cluster plugin, should be used to ensure that only one head node actually uses a given disk drive in an imported ZFS pool.
===

Is the indented statement correct? :)

Other questions: What clusterware would be encouraged now for OpenIndiana boxes? Also, in the case of clustered shared filesystems (like VMware VMFS), can these JBODs allow two different servers to access one drive simultaneously in a safe manner (do they do fencing, reservations and other SCSI magic)?

>> Following up on some of this forum's discussions, I read the
>> manuals on SuperMicro's SC847E26-RJBOD1 this weekend.
>
> We see quite a few of these in the NexentaStor installed base.
> The NexentaStor HA-Cluster plugin manages STONITH and reservations.
> I do not believe programming expanders or switches for
> clustering is the best approach.
> It is better to let the higher layers manage this.
> The cost of a SATA disk + SATA/SAS interposer is about the same
> as a native SAS drive. Native SAS makes a better solution.

--
Jim Klimov (Климов Евгений)
Technical Director (CTO), JSC "COS&HT" (ЗАО "ЦОС и ВТ")
+7-903-7705859 (cellular) mailto:jimkli...@cos.ru
CC: ad...@cos.ru, jimkli...@gmail.com
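The "SCSI magic" in question is SCSI-3 persistent reservations; with sg3_utils you can inspect and place them by hand, roughly like this (a sketch - the device path and key are hypothetical, and in practice the clusterware manages this for you):

```shell
# Read the registered keys and the current reservation on a shared drive
# (hypothetical device path):
sg_persist --in --read-keys /dev/rdsk/c3t0d0s2
sg_persist --in --read-reservation /dev/rdsk/c3t0d0s2

# Register a key for this host, then take a Write Exclusive reservation
# (type 1) so only this node may write to the disk:
sg_persist --out --register --param-sark=0xabc123 /dev/rdsk/c3t0d0s2
sg_persist --out --reserve --param-rk=0xabc123 --prout-type=1 /dev/rdsk/c3t0d0s2
```

Fencing a failed node then amounts to preempting its registration, which is essentially what the HA-Cluster plugin does under the hood.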
[zfs-discuss] Experiences with 10.000+ filesystems
"Filesystems are cheap" is one of ZFS's mottos. I'm wondering how far this goes. Does anyone have experience with having more than 10.000 ZFS filesystems? I know that mounting this many filesystems during boot will take considerable time. Are there any other disadvantages that I should be aware of? Are the zfs tools still usable, like 'zfs list' and 'zfs get/set'?

Would I run into any problems when snapshots are taken (almost) simultaneously from multiple filesystems at once?

Regards,
Gertjan Oude Lohuis
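One way to answer this empirically is to create filesystems in bulk on a scratch pool and time the tools at scale (a sketch; the pool name and count are hypothetical, and creation itself will take a while):

```shell
# Create 10,000 filesystems on a scratch pool (hypothetical name 'tank'):
for i in $(seq 1 10000); do
    zfs create tank/test/fs$i
done

# Time the common operations at this scale:
time zfs list -t filesystem > /dev/null
time zfs get -r mountpoint tank/test > /dev/null
time zfs snapshot -r tank/test@scale-test
```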