Re: [zfs-discuss] what have you been buying for slog and l2arc?
On Mon, Aug 6, 2012 at 2:15 PM, Stefan Ring wrote:
> So you're saying that SSDs don't generally flush data to stable medium
> when instructed to? So data written before an fsync is not guaranteed
> to be seen after a power-down?

It depends on the model. Consumer models are less likely to immediately flush. My understanding is that this is done in part to allow some write coalescing and reduce the number of P/E cycles. Enterprise models should either flush, or contain a supercapacitor that provides enough power for the drive to finish writing any data in its buffer.

> If that -- ignoring cache flush requests -- is the whole reason why
> SSDs are so fast, I'm glad I haven't got one yet.

They're fast for random reads and writes because they don't have seek latency. They're fast for sequential IO because they aren't limited by spindle speed.

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Can the ZFS "copies" attribute substitute HW disk redundancy?
On Mon, Jul 30, 2012 at 7:11 AM, GREGG WONDERLY wrote:
> I thought I understood that copies would not be on the same disk, I guess I
> need to go read up on this again.

ZFS attempts to put copies on separate devices, but there's no guarantee.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Persistent errors?
On Mon, Jun 18, 2012 at 3:55 PM, sol wrote:
> It seems as though every time I scrub my mirror I get a few megabytes of
> checksum errors on one disk (luckily corrected by the other). Is there some
> way of tracking down a problem which might be persistent?

Check the output of 'fmdump -eV'; it should have some (rather extensive) information.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Migration of a Thumper to bigger HDDs
On Thu, May 17, 2012 at 2:50 PM, Jim Klimov wrote:
> New question: if the snv_117 does see the 3Tb disks well,
> the matter of upgrading the OS becomes not so urgent - we
> might prefer to delay that until the next stable release
> of OpenIndiana or so.

There were some pretty major fixes and new features added between snv_117 and snv_134 (the last OpenSolaris release). It might be worth updating to snv_134 at the very least.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] checking/fixing busy locks for zfs send/receive
On Fri, Mar 16, 2012 at 2:35 PM, Philip Brown wrote:
> if there isn't a process visible doing this via ps, I'm wondering how
> one might check if a zfs filesystem or snapshot is rendered "busy" in
> this way, interfering with an unmount or destroy?
>
> I'm also wondering if this sort of thing can mean interference between
> some combination of multiple send/receives at the same time, on the
> same filesystem?

Look at 'zfs hold', 'zfs holds', and 'zfs release'. Sends and receives will place holds on snapshots to prevent them from being changed.
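A minimal sketch of checking for and clearing a hold (the snapshot name and hold tag here are hypothetical; 'zfs holds' shows the actual tag to pass to 'zfs release'):

# zfs holds tank/fs@snap1                  # list user holds on the snapshot
# zfs release .send-1234-0 tank/fs@snap1   # release a hold by its tag

Only do this for holds left behind by an aborted operation; releasing a hold out from under an in-flight send or receive is asking for trouble.

-B

-- Brandon High : bh...@freaks.com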
Re: [zfs-discuss] Compatibility of Hitachi Deskstar 7K3000 HDS723030ALA640 with ZFS
On Tue, Mar 6, 2012 at 2:40 AM, Koopmann, Jan-Peter wrote:
> Do you or anyone else have experience with the 3TB 5K3000 drives
> (namely HDS5C3030ALA630)? I am thinking of replacing my current 4*1TB drives
> with 4*3TB drives (home server). Any issues with TLER or the like?

I have been using 8 x 3TB 5k3000 in a raidz2 for about a year without issue. The Deskstar 3TB comes off the same production line as the Ultrastar 5k3000. I would avoid the 2TB and smaller 5k3000; they come off a separate production line.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Compatibility of Hitachi Deskstar 7K3000 HDS723030ALA640 with ZFS
On Mon, Mar 5, 2012 at 9:52 AM, luis Johnstone wrote:
> As far as I can tell, the Hitachi Deskstar 7K3000 (HDS723030ALA640) uses
> 512B sectors and so I presume does not suffer from such issues (because it
> doesn't lie about the physical layout of sectors on-platter)

Both the 7K3000 and 5K3000 drives have 512B physical sectors.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Server upgrade
On Wed, Feb 15, 2012 at 9:16 AM, David Dyer-Bennet wrote:
> Is there an upgrade path from (I think I'm running Solaris Express) to
> something modern? (That could be an Oracle distribution, or the free

There *was* an upgrade path from snv_134 to snv_151a (Solaris 11 Express) but I don't know if Oracle still supports it. There was an intermediate step or two along the way (snv_134b I think?) to move from OpenSolaris to Oracle Solaris.

As others mentioned, you could jump to OpenIndiana from your current version. You may not be able to move between OI and S11 in the future, so it's a somewhat important decision.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'
On Wed, Nov 23, 2011 at 11:43 AM, Harry Putnam wrote:
> OK, I'm out of escapes. or other tricks... other than using emacs but
> I haven't installed emacs as yet.
>
> I can just ignore them of course, until such time as I do get emacs
> installed, but by now I just want to know how it might be done from a
> shell prompt.

rm ./-c ./-O ./-k
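If your rm supports the POSIX end-of-options marker (Solaris /usr/bin/rm and GNU rm both should), this works as well:

rm -- -c -O -k

The '--' tells rm that everything after it is a filename, not an option.

-- Brandon High : bh...@freaks.com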
Re: [zfs-discuss] Replacement for X25-E
On Thu, Sep 22, 2011 at 12:53 PM, Ray Van Dolson wrote:
> It seems to perform similarly to the X-25E as well (3300 IOPS for
> random writes). Perhaps the drive can be overprovisioned as well?
>
> My impression was that Intel was classifying the 3xx series as
> non-Enterprise however. Even with the SLC.

I don't think the 311 has any over-provisioning (other than the 7% from GB -> GiB conversion). I believe it is an X25-E with only 5 channels populated. The upcoming enterprise models are MLC based and have greater over-provisioning AFAIK.

The 20GB 311 only costs ~ $100 though. The 100GB Intel 710 costs ~ $650. The 311 is a good choice for home or budget users, and it seems that the 710 is much bigger than it needs to be for slog devices.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deskstars and CCTL (aka TLER)
On Wed, Sep 7, 2011 at 7:40 PM, Daniel Carosone wrote:
> Looks like another positive for these drives over the "competition".
> The same appears to be the case for the 5k3000's as well (page 96 in
> that document).

Be careful with the smaller 5k3000 drives. The 1TB and 2TB drives are not manufactured on the same line as the Ultrastar and seem to have lower reliability. Only the 3TB 5k3000 shares specs with the Ultrastar 5k3000.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Replacement for X25-E
On Tue, Sep 20, 2011 at 12:21 AM, Markus Kovero wrote:
> Hi, I was wondering do you guys have any recommendations as replacement for
> Intel X25-E as it is being EOL'd? Mainly as for log device.

The Intel 311 seems like a good fit. It's a 20GB SLC device intended to act as a cache device with the Z68 chipset.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deskstars and CCTL (aka TLER)
On Wed, Sep 7, 2011 at 2:20 AM, Roy Sigurd Karlsbakk wrote:
> Does anyone know if this is possible from OI/Solaris, or if this needs to be
> done on driver level?

You should be able to do it via smartctl. The setting does not persist through power cycles, so you'll want to add it to a startup script.
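A sketch of what that startup script might run (the device path is hypothetical, and not every drive/firmware combination accepts the command; 70 means 7.0 seconds):

# smartctl -l scterc,70,70 /dev/rdsk/c0t0d0   # set read/write error recovery to 7s
# smartctl -l scterc /dev/rdsk/c0t0d0         # verify the current setting

-B

-- Brandon High : bh...@freaks.com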
Re: [zfs-discuss] ZFS raidz on top of hardware raid0
On Fri, Aug 12, 2011 at 6:34 PM, Tom Tang wrote:
> Suppose I want to build a 100-drive storage system, wondering if there is any
> disadvantages for me to setup 20 arrays of HW RAID0 (5 drives each), then
> setup ZFS file system on these 20 virtual drives and configure them as RAIDZ?

A 20-device wide raidz is a bad idea. Making those devices from stripes just compounds the issue. The biggest problem is that resilvering would be a nightmare, and you're practically guaranteed to have additional failures or read errors while degraded.

You would achieve better performance, error detection and recovery by using several top-level raidz. 20 x 5-disk raidz would give you very good read and write performance with decent resilver times and 20% overhead for redundancy. 10 x 10-disk raidz2 would give more protection, but a little less performance, and higher resilver times.
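A sketch of the 20 x 5-disk layout (device names are hypothetical; the raidz clause repeats once per top-level vdev, 20 in all):

# zpool create tank \
    raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
    raidz c1t5d0 c1t6d0 c1t7d0 c2t0d0 c2t1d0 \
    ...

Each 'raidz' keyword starts a new top-level vdev, and ZFS stripes across all of them.

-B

-- Brandon High : bh...@freaks.com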
Re: [zfs-discuss] Intel 320 as ZIL?
On Mon, Aug 15, 2011 at 2:07 PM, Ray Van Dolson wrote:
> Looks interesting... specs around the same as the old X-25E. We have
> heard however, that Intel will be announcing a true successor to their
> X-25E line shortly.

I think it's the 710 and 720 that you're referring to. The 710 is MLC-HET (high endurance) and will be in 100/200/300GB capacities. The 720 is SLC, but with a PCIe interface, and will be in 200/400GB capacities.

I don't imagine either will be very cheap.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Intel 320 as ZIL?
On Thu, Aug 11, 2011 at 1:00 PM, Ray Van Dolson wrote:
> Are any of you using the Intel 320 as ZIL? It's MLC based, but I
> understand its wear and performance characteristics can be bumped up
> significantly by increasing the overprovisioning to 20% (dropping
> usable capacity to 80%).

Intel recently added the 311, a small SLC-based drive for use as a temp cache with their Z68 platform. It's limited to 20GB, but it might be a better fit for use as a ZIL than the 320.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Disk IDs and DD
On Tue, Aug 9, 2011 at 8:20 AM, Paul Kraus wrote:
> Nothing to worry about here. Controller IDs (c) are assigned
> based on the order the kernel probes the hardware. On the SPARC
> systems you can usually change this in the firmware (OBP), but they
> really don't _mean_ anything (other than the kernel found c8 before it
> found c9).

If you're really bothered by the device names, you can rebuild the device map. There's no reason to do it unless you've had to replace hardware, etc. The steps are similar to these:
http://spiralbound.net/blog/2005/12/21/rebuilding-the-solaris-device-tree

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL
On Mon, Aug 1, 2011 at 4:27 PM, Daniel Carosone wrote:
> The other thing that can cause a storm of tiny IOs is dedup, and this
> effect can last long after space has been freed and/or dedup turned
> off, until all the blocks corresponding to DDT entries are rewritten.
> I wonder if this was involved here.

Using dedup on a pool that houses an Oracle DB is Doing It Wrong in so many ways...

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Exapnd ZFS storage.
On Wed, Aug 3, 2011 at 3:02 AM, Nix wrote:
> I have 4 disk with 1 TB of disk and I want to expand the zfs pool size.
>
> I have 2 more disk with 1 TB of size.
>
> Is it possible to expand the current RAIDz array with new disk?

You can't add the new drives to your current vdev. You can create another vdev to add to your pool though.

If you're adding another vdev, it should have the same geometry as your current one (ie: 4 drives). The zpool command will complain if you try to add a vdev with different geometry or redundancy, though you can force it with -f.
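A sketch of both cases (device names are hypothetical). Adding a matching second 4-drive raidz vdev:

# zpool add tank raidz c5t0d0 c5t1d0 c5t2d0 c5t3d0

With only your two new drives, the non-matching alternative would need the force flag, e.g.:

# zpool add -f tank mirror c5t0d0 c5t1d0

-B

-- Brandon High : bh...@freaks.com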
Re: [zfs-discuss] ZFS Fragmentation issue - examining the ZIL
On Mon, Aug 1, 2011 at 2:16 PM, Neil Perrin wrote:
> In general the blog's conclusion is correct. When file systems get full
> there is fragmentation (happens to all file systems) and for ZFS the pool
> uses gang blocks of smaller blocks when there are insufficient large blocks.

The blog doesn't mention how full the pool was. It's pretty well documented that performance takes a nosedive at a certain point.

A slow scrub is actually not related to the problems in the blog post, since there aren't a lot of writes during (or at least caused by) a scrub.

Fragmentation is a real issue with pools that are (or have been) very full. The data gets written out in fragments and has to be read back in the same order. If the mythical bp_rewrite code ever shows up, it will be possible to defrag a pool. But not yet.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] recover zpool with a new installation
On Tue, Jul 26, 2011 at 1:14 PM, Cindy Swearingen <cindy.swearin...@oracle.com> wrote:
> Yes, you can reinstall the OS on another disk and as long as the
> OS install doesn't touch the other pool's disks, your
> previous non-root pool should be intact. After the install
> is complete, just import the pool.

You can also use the Live CD or Live USB to access your pool or possibly fix your existing installation. You will have to force the zpool import with either a reinstall or a Live boot.
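A sketch of what that looks like from the new environment (the pool name is hypothetical; -f is needed because the pool is still marked as in use by the old install):

# zpool import         # list pools the system can see
# zpool import -f tank # force the import

-B

-- Brandon High : bh...@freaks.com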
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Tue, Jul 26, 2011 at 7:51 AM, David Dyer-Bennet wrote:
> "Processing" the request just means flagging the blocks, though, right?
> And the actual benefits only accrue if the garbage collection / block
> reshuffling background tasks get a chance to run?

I think that's right. TRIM just gives hints to the garbage collector that sectors are no longer in use. When the GC runs, it can more easily find flash blocks that aren't used, or combine several mostly-empty blocks and erase or otherwise free them for reuse later.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Tue, Jul 26, 2011 at 5:59 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
> like 4%, and for some reason (I don't know why) there's a benefit to
> optimizing on 8k pages. Which means no. If you overwrite a sector of a

From what I've heard it's due in large part to the FAT file system, since it's used in a lot of embedded systems as well as on flash cards. The FAT cluster size is 32k, so any flash block that's a multiple of 32k works well. Page sizes are usually 2k with a 128k erase block, 4k with a 256k erase block, or 4k with a 512k erase block.

It's also due to ECC reasons, since a larger block size allows more efficient ECC over a larger block of data. This is similar to the move to 4k sectors in magnetic drives.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Large scale performance query
On Sun, Jul 24, 2011 at 11:34 PM, Phil Harrison wrote:
> What kind of performance would you expect from this setup? I know we can
> multiply the base IOPS by 24 but what about max sequential read/write?

You should have a theoretical max close to 144x single-disk throughput. Each raidz3 has 6 "data drives" which can be read from simultaneously, multiplied by your 24 vdevs.

Of course, you'll hit your controllers' limits well before that. Even with a controller per JBOD, you'll be limited by the SAS connection. The 7k3000 has throughput from 115 - 150 MB/s, meaning each of your JBODs will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly 10 times the bandwidth of a single SAS 6g connection. Use multipathing if you can to increase the bandwidth to each JBOD.

Depending on the types of access that clients are performing, your cache devices may not be any help. If the data is read multiple times by multiple clients, then you'll see some benefit. If it's only being read infrequently or by one client, it probably won't help much at all. That said, if your access is mostly sequential then random access latency shouldn't affect you too much, and you will still have more bandwidth from your main storage pools than from the cache devices.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Replacing failed drive
On Fri, Jul 22, 2011 at 1:12 PM, Chris Dunbar - Earthside, LLC wrote:
> I have physically replaced the drive, but I have not partitioned it yet. I
> know there is a command to copy the layout from one disk to another and that
> has worked well for me in the past. I just have to find the command again.
> Once that is done, do I need to detach the spare before I run the replace
> command or does running the replace command automatically bump the spare out
> of service and put it back to being just a spare?

Since it isn't the rpool, you shouldn't have to partition the replacement drive. Since you've physically replaced the drive, you should just have to do:

# zpool replace tank c10t0d0

The pool should resilver, and I think the spare should automatically detach. If not,

# zpool remove tank c10t6d0

should take care of it.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] SSD vs "hybrid" drive - any advice?
On Thu, Jul 21, 2011 at 4:08 PM, Gordon Ross wrote:
> And then for about $400 one can get a 250GB SSD, such as:
> Crucial M4 CT256M4SSD2 2.5" 256GB SATA III MLC Internal Solid State
> Drive (SSD)
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820148443
>
> Anyone have experience with either one? (good or bad)

The hybrid drive might accelerate some operations. No guarantees, though. It's about as fast as a WD Velociraptor in some operations, and the same as the regular Seagate 500GB in others. There is a decent review of it at Anandtech.

The M4 is pretty decent, though the Vertex 3 and other Sandforce 2000-based drives beat it in benchmarks. Honestly though, you'll probably be very happy with any recent SSD, eg: C300, M4, Intel 320, Intel 510, Sandforce 1200-based (Vertex 2, Phoenix Pro, etc), Sandforce 2200-based (Vertex 3, Corsair Force GT, Patriot Wildfire, etc).

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] latest zpool version in solaris 11 express
On Mon, Jul 18, 2011 at 6:21 AM, Edward Ned Harvey wrote:
> Kidding aside, for anyone finding this thread at a later time, here's the
> answer. It sounds unnecessarily complex at first, but then I went through
> it ... Only took like a minute or two. It was exceptionally easy in fact.
> https://pkg-register.oracle.com

Do you need a support contract in order to access the certificate application? I'm getting the following error when I try to get a cert:

"There has been a problem with contacting the entitlement server. You will only be able to issue new certificates for public products. Please try again later"

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Zil on multiple usb keys
On Sun, Jul 17, 2011 at 12:13 PM, Edward Ned Harvey wrote:
> Actually, you can't do that. You can't make a vdev from other vdev's, and
> when it comes to striping and mirroring your only choice is to do it the
> right way.
>
> If you were REALLY trying to go out of your way to do it wrong somehow, I
> suppose you could probably make a zvol from a stripe, and then export it to
> yourself via iscsi, repeat with another zvol, and then mirror the two iscsi
> targets. ;-) You might even be able to do the same crazy thing with simply
> zvols and no iscsi... But either way you'd really be going out of your way
> to create a problem. ;-)

The right way to do it, um, incorrectly is to create a striped device using SVM, and use that as a vdev for your pool. So yes, you could create two 800GB stripes, and use them to create a ZFS mirror. But it would be a really bad idea.
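A sketch of that bad idea (slice names are hypothetical; SVM metadevices show up as ordinary block devices that zpool will happily accept):

# metainit d10 1 2 c1t0d0s0 c1t1d0s0   # one stripe of two slices
# metainit d11 1 2 c2t0d0s0 c2t1d0s0
# zpool create tank mirror /dev/md/dsk/d10 /dev/md/dsk/d11

Again: don't actually do this.

-B

-- Brandon High : bh...@freaks.com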
Re: [zfs-discuss] Replacement disks for Sun X4500
On Wed, Jul 6, 2011 at 10:12 PM, X4 User wrote:
> I am bumping this thread because I too have the same question ... can I put
> modern 3TB disks (hitachi deskstars) into an old x4500 ?

I have 8 x 3TB drives (Deskstar 5k3000) attached to a Supermicro AOC-SAT2-MV8 and it works fine. This card uses the same Marvell controller as the x4500.

Performance is fine if not slightly better than the WD10EADS drives that I replaced. Of course, the pool was about 92% full with the smaller drives ...

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pure SSD Pool
On Tue, Jul 12, 2011 at 12:14 PM, Eric Sproul wrote:
> I see, thanks for that explanation. So finding drives that keep more
> space in reserve is key to getting consistent performance under ZFS.

More spare area might give you more performance, but the big difference is the lifetime of the device. A device with more spare area can handle more writes. Within a given capacity range (eg: 50-64 GB drives built on 64 GiB of flash), the drive with more spare will last longer but may not offer a performance benefit.

Higher capacity drives will offer better performance because they have more flash channels to write to, and they should last longer because while the spare area is the same percentage of total capacity, it's numerically larger.

A "consumer" 240GB drive (256GiB flash) will have about 32GiB of spare area. An "enterprise" 50GB (64GiB flash) drive will have about 17GiB of spare area, or roughly 27% of the total capacity. Even though the consumer drive only sets aside ~ 13% for spare, it's so much larger that it will last longer at any given rate of writing. If you were to completely fill and re-fill each drive, the consumer drive will fail earlier, but you'd have to write nearly 5x as much data to fill it even once.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pure SSD Pool
On Tue, Jul 12, 2011 at 7:41 AM, Eric Sproul wrote:
> But that's exactly the problem-- ZFS being copy-on-write will
> eventually have written to all of the available LBA addresses on the
> drive, regardless of how much live data exists. It's the rate of
> change, in other words, rather than the absolute amount that gets us
> into trouble with SSDs. The SSD has no way of knowing what blocks

Most "enterprise" SSDs use something like 30% for spare area. So a drive with 128GiB (base 2) of flash will have 100GB (base 10) of available storage. A consumer level drive will have ~ 6% spare, or 128GiB of flash and 128GB of available storage. Some drives have 120GB available, but still have 128GiB of flash and therefore slightly more spare area. Controllers like the Sandforce that do some dedup can give you even more effective spare area, depending on the type of data.

When the OS starts reusing LBAs, the drive will re-map them into new flash blocks in the spare area and may perform garbage collection on the now partially used blocks. The effectiveness of this depends on how quickly the system is writing and how full the drive is.

I failed to mention earlier that ZFS's write aggregation is also helpful when used with flash drives since it can help to ensure that a whole flash block is written at once. Increasing the ashift value to 4k when the pool is created may also help.

> Now, others have hinted that certain controllers are better than
> others in the absence of TRIM, but I don't see how GC could know what
> blocks are available to be erased without information from the OS.

The changed LBAs are remapped rather than overwritten in place. The drive knows which LBAs in a flash block have been re-mapped, and can do garbage collection when the right criteria are met.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Pure SSD Pool
On Mon, Jul 11, 2011 at 7:03 AM, Eric Sproul wrote:
> Interesting-- what is the suspected impact of not having TRIM support?

There shouldn't be much, since zfs isn't changing data in place. Any drive with reasonable garbage collection (which is pretty much everything these days) should be fine until the volume gets very full.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Cannot format 2.5TB ext disk (EFI)
On Thu, Jun 23, 2011 at 1:20 PM, Richard Elling wrote:
> 2TB limit for 32-bit Solaris. If you hit this, then you'll find a lot of
> complaints at boot.
> By default, an Ultra-24 should boot 64-bit. Dunno about the HBA, though...

I think the limit is 1TB for 32-bit. I've tried to use 2TB drives on an Atom N270-based board and they were not recognized, but they worked fine under FreeBSD.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] JBOD recommendation for ZFS usage
On Mon, May 30, 2011 at 6:16 PM, Jim Klimov wrote:
> Also some articles stated that at one time there were
> single-port SAS drives, so there are at least two SAS
> connectors after all ;)

Nope, only one mechanical connector.

A dual port cable can be used with a single- or dual-ported SAS device, or with SATA drives. A single port cable can be used with a single- or dual-ported SAS device (although it will only use one port) or with a SATA drive. A SATA cable can be used with a SATA device.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] optimal layout for 8x 1 TByte SATA (consumer)
On Thu, May 26, 2011 at 9:34 AM, Eugen Leitl wrote:
> How bad would raidz2 do on mostly sequential writes and reads
> (Athlon64 single-core, 4 GByte RAM, FreeBSD 8.2)?

I was using a similar but slightly higher spec setup (quad-core cpu & 8 GB RAM) at home and didn't have any problems with an 8-drive raidz2, though my usage is fairly light. The system is more than fast enough to saturate gigabit ethernet for sequential reads and writes. My drives were WD10EADS "Green" drives.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] offline dedup
On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey wrote:
> Question: Is it possible, or can it easily become possible, to periodically
> dedup a pool instead of keeping dedup running all the time? It is easy to

I think it's been discussed before, and the conclusion is that it would require bp_rewrite. Offline (or deferred) dedup certainly seems more attractive given the current real-time performance.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On Tue, May 24, 2011 at 3:17 PM, Peter Jeremy wrote:
> I believe the various OSS projects that use ZFS have formed a working
> group to co-ordinate ZFS amongst themselves. I don't know if Oracle
> was invited to join (though given the way Oracle has behaved in all

Richard would probably know for certain. There will probably be a fork at some point to an OSS ZFS and an Oracle ZFS. Hopefully neither side will actively try to break compatibility.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On Tue, May 24, 2011 at 12:41 PM, Richard Elling wrote:
> There are many ZFS implementations, each evolving as the contributors desire.
> Diversity and innovation is a good thing.

... unless Oracle's zpool v30 is different than Nexenta's v30.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Monitoring disk seeks
On Thu, May 19, 2011 at 5:35 AM, Sašo Kiselkov wrote:
> I'd like to ask whether there is a way to monitor disk seeks. I have an
> application where many concurrent readers (>50) sequentially read a
> large dataset (>10T) at a fairly low speed (8-10 Mbit/s). I can monitor
> read/write ops using iostat, but that doesn't tell me how contiguous the
> data is, i.e. when iostat reports "500" read ops, does that translate to
> 500 seeks + 1 read per seek, or 50 seeks + 10 reads, etc? Thanks!

You can sort of do this with a DTrace script. Something like (forgive my crappy script, I've only poked at DTrace a few times):

#pragma D option quiet

io:::done
/ args[1]->dev_name == "sd" && args[1]->dev_instance < 11 /
{
        /* timestamp is in nanoseconds; print seconds.milliseconds */
        printf("%d.%03d,%s,%i,%s,%i\n",
            timestamp / 1000000000,
            (timestamp / 1000000) % 1000,
            args[1]->dev_statname,                     /* device, eg sd0 */
            args[0]->b_lblkno,                         /* starting LBA */
            (args[0]->b_flags & B_WRITE ? "W" : "R"),  /* direction */
            args[0]->b_bcount);                        /* I/O size in bytes */
}

For every completed IO, this should give you the timestamp, device name, start LBA, "R"ead or "W"rite and length of the IO. Large jumps in the LBA column between consecutive IOs on the same device are your seeks.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Solaris vs FreeBSD question
On Wed, May 18, 2011 at 5:47 AM, Paul Kraus wrote:
> P.S. If anyone here has a suggestion as to how to get Solaris to load
> I would love to hear it. I even tried disabling multi-cores (which
> makes the CPUs look like dual core instead of quad) with no change. I
> have not been able to get serial console redirect to work so I do not
> have a good log of the failures.

Have you checked your system in the HCL device tool at http://www.sun.com/bigadmin/hcl/hcts/device_detect.jsp ? It should be able to tell you which device is causing the problem. If I remember correctly, you can feed it the output of 'lspci -vv -n'.

You may have to disable some on-board devices to get through the installer, but I couldn't begin to guess which.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Reboots when importing old rpool
On Tue, May 17, 2011 at 11:10 AM, Hung-ShengTsao (Lao Tsao) Ph.D. wrote:
> may be do
> zpool import -R /a rpool

'zpool import -N' may work as well.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Still no way to recover a "corrupted" pool
On Mon, May 16, 2011 at 1:55 PM, Freddie Cash wrote:
> Would not import in Solaris 11 Express. :( Could not even find any
> pools to import. Even when using "zpool import -d /dev/dsk" or any
> other import commands. Most likely due to using a FreeBSD-specific
> method of labelling the disks.

I think someone solved this before by creating a directory and making symlinks to the correct partition/slices on each disk. Then you can use 'zpool import -d /tmp/foo' to do the import. eg:

# mkdir /tmp/fbsd   # create a temp directory to point to the p0 partitions of the relevant disks
# ln -s /dev/dsk/c8t1d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t2d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t3d0p0 /tmp/fbsd/
# ln -s /dev/dsk/c8t4d0p0 /tmp/fbsd/
# zpool import -d /tmp/fbsd/ $POOLNAME

I've never used FreeBSD so I can't offer any advice about which device name is correct or if this will work. Posts from February 2010 "Import zpool from FreeBSD in OpenSolaris" indicate that you want p0.

> It's just frustrating that it's still possible to corrupt a pool in
> such a way that "nuke and pave" is the only solution. Especially when

I'm not sure it was the only solution, it's just the one you followed.

> What's most frustrating is that this is the third time I've built this
> pool due to corruption like this, within three months. :(

You may have an underlying hardware problem, or there could be a bug in the FreeBSD implementation that you're tripping over.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 8:33 AM, Richard Elling wrote:
> As a rule of thumb, the resilvering disk is expected to max out at around
> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect
> the throttles or broken data path.

My system was doing far less than 80 IOPS during resilver when I recently upgraded the drives. The older and newer drives were both 5k RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to be super fast. The worst resilver was 50 hours, the best was about 20 hours.

This was just my home server, which is lightly used. The clients (2-3 CIFS clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS clients) are mostly idle and don't do a lot of writes.

Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things up a bit, which suggests that the default values may be too conservative for some environments.
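For reference, a sketch of tweaking those at runtime with mdb (the values are examples only; these are undocumented kernel tunables and can change between builds):

# echo zfs_resilver_delay/W0t0 | mdb -kw           # drop the resilver throttle
# echo zfs_resilver_min_time_ms/W0t5000 | mdb -kw  # more resilver time per txg

-B

-- Brandon High : bh...@freaks.com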
Re: [zfs-discuss] 350TB+ storage solution
On Sat, May 14, 2011 at 11:20 PM, John Doe wrote:
>> 171 Hitachi 7K3000 3TB
> I'd go for the more environmentally friendly Ultrastar 5K3000 version - with
> that many drives you won't mind the slower rotation but WILL notice a
> difference in power and cooling cost

A word of caution: The Hitachi Deskstar 5K3000 drives in 1TB and 2TB are different than the 3TB. The 1TB and 2TB are manufactured in China, and have a very high failure and DOA rate according to Newegg. The 3TB drives come off the same production line as the Ultrastar 5K3000 in Thailand and may be more reliable.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] 350TB+ storage solution
On Sun, May 15, 2011 at 10:14 PM, Richard Elling wrote:
> On May 15, 2011, at 10:18 AM, Jim Klimov wrote:
>> In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2)
>> or 18 (16+2) disks - the latter being mentioned in the original post.
>
> A similar theory was disproved back in 2006 or 2007. I'd be very surprised if
> there was a reliable way to predict the actual use patterns in advance. Features
> like compression and I/O coalescing improve performance, but make the old
> "rules of thumb" even more obsolete.

I thought that having data disks that were a power of two was still recommended, due to the way that ZFS splits records/blocks in a raidz vdev. Or are you responding to some other point?

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Tuning disk failure detection?
On Tue, May 10, 2011 at 9:18 AM, Ray Van Dolson wrote:
> My question is -- is there a way to tune the MPT driver or even ZFS
> itself to be more/less aggressive on what it sees as a "failure"
> scenario?

You didn't mention what drives you had attached, but I'm guessing they were normal "desktop" drives. I suspect (but can't confirm) that using enterprise drives with TLER / ERC / CCTL would have reported the failure up the stack faster than a consumer drive. The drives will report an error after 7 seconds rather than retry for several minutes.

You may be able to enable the feature on your drives, depending on the manufacturer and firmware revision.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] primarycache=metadata seems to force behaviour of secondarycache=metadata
On Mon, May 9, 2011 at 2:54 PM, Tomas Ögren wrote:
> Slightly off topic, but we had an IBM RS/6000 43P with a PowerPC 604e
> cpu, which had about 60MB/s memory bandwidth (which is kind of bad for a
> 332MHz cpu) and its disks could do 70-80MB/s or so.. in some other
> machine..

It wasn't that long ago when 66MB/s ATA was considered a waste because no drive could use that much bandwidth. These days a "slow" drive has max throughput greater than 110MB/s.

(OK, looking at some online reviews, it was about 13 years ago. Maybe I'm just old.)

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ZFS on HP MDS 600
On Mon, May 9, 2011 at 8:33 AM, Darren Honeyball wrote:
> I'm just mulling over the best configuration for this system - our work load
> is mostly writing millions of small files (around 50k) with occasional reads
> & we need to keep as much space as possible.

If space is a priority, then raidz or raidz2 are probably the best bets. If you're going to have a lot of random iops, then mirrors are best. You have some control over the performance : space ratio with raidz by adjusting the width of the raidz vdevs.

For instance, mirrors will provide 34TB of space and the best random iops. 24 x 3-disk raidz vdevs will have 48TB of space but still have pretty strong random iops performance. 13 x 5-disk raidz vdevs will give 52TB of space at the cost of lower random iops. Testing will help you find the best configuration for your environment.

> HP's recommendations for configuring the MDS 600 with ZFS is to let the P212
> do the raid functions (raid 1+0 is recommended here) by configuring each half
> of the MDS 600 as a single logical drive (35 drives) & then use a basic zfs
> pool on top to provide the zfs functionality - to me this would seem to lose
> a lot of the error checking functions of zfs?

If you configured the two logical drives as a mirror in ZFS, then you'd still have full protection. Your overhead would be really high though - 3/4 of your original capacity would be used for data protection if I understand the recommendation correctly. (You'd use 1/2 of the original capacity for RAID1 in the MDS, then 1/2 of the remaining for the ZFS mirror.) You could use a non-redundant pool in ZFS to reduce the overhead, but you sacrifice the self-healing properties of ZFS when you do that.

> Another option is to use raidz and let zfs handle the smart stuff - as the
> P212 doesn't support a true dumb JBOD function I'd need to create each drive
> as a single raid 0 logical drive - are there any drawbacks to doing this? Or
> would it be better to create slightly larger logical drives using say 2
> physical drives per logical drive?

Single-device logical drives are required when you can't configure a card or device as JBOD, and I believe it's usually the recommended solution. Once you have the LUNs created, you can use ZFS to create mirrors or raidz vdevs.

> I'm planning on having 2 hot spares - one in each side of the MDS 600, is it
> also worth using a dedicated ZIL spindle or 2?

It would depend on your workload. (How's that for helpful?) If you're experiencing a lot of synchronous writes, then a ZIL will help. If you aren't seeing a lot of sync writes, then a ZIL won't help.

The ZIL doesn't have to be very large, since it's flushed on a regular basis. From the Best Practices guide: "For a target throughput of X MB/sec and given that ZFS pushes transaction groups every 5 seconds (and have 2 outstanding), we also expect the ZIL to not grow beyond X MB/sec * 10 sec. So to service 100MB/sec of synchronous writes, 1 GB of log device should be sufficient."

If the MDS has a non-volatile cache, there should be little or no need to use a ZIL. However, some reports have shown ZFS with a ZIL to be faster than using non-volatile cache. You should test performance using your workload.

> Is it worth tweaking zfs_nocacheflush or zfs_vdev_max_pending?

As I mentioned above, if the MDS has a non-volatile cache, then setting zfs_nocacheflush might help performance. If you're exporting one LUN per device then you shouldn't need to adjust the max_pending. If you're exporting larger RAID10 luns from the MDS, then increasing the value might help for read workloads.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deduplication Memory Requirements
On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson wrote:
> We use dedupe on our VMware datastores and typically see 50% savings,
> often times more. We do of course keep "like" VM's on the same volume

I think NetApp uses 4k blocks by default, so the block size and alignment should match up for most filesystems and yield better savings. Your server's resource requirements for ZFS and dedup will be much higher due to the large DDT, as you initially suspected.

If bp_rewrite is ever completed and released, this might change. It should allow for offline dedup, which may make dedup usable in more situations.

> Apologies for devolving the conversation too much in the NetApp
> direction -- simply was a point of reference for me to get a better
> understanding of things on the ZFS side. :)

It's good to compare the two, since they have a pretty large overlap in functionality but sometimes very different implementations.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deduplication Memory Requirements
On Thu, May 5, 2011 at 8:50 PM, Edward Ned Harvey wrote:
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS. At this rate, it becomes
> increasingly difficult to get a justification to enable the dedup. But it's
> certainly possible.

You're forgetting that zvols use an 8k volblocksize by default. If you're currently exporting volumes with iSCSI it's only a 2x increase. The tradeoff is that you should have more duplicate blocks, and reap the rewards there. I'm fairly certain that it won't offset the large increase in the size of the DDT however. Dedup with zvols is probably never a good idea as a result.

Only if you're hosting your VM images in .vmdk files will you get 128k blocks. Of course, your chance of getting many identical blocks gets much, much smaller. You'll have to worry about the guests' block alignment in the context of the image file, since two identical files may not create identical blocks as seen from ZFS. This means you may get only fractional savings and have an enormous DDT.
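To sketch the arithmetic (assuming the oft-quoted ballpark of roughly 320 bytes of in-core footprint per DDT entry; the real figure varies by build):

1 TiB of unique data at 8 KiB blocks:   2^40 / 2^13 = ~134M entries x ~320 B = ~43 GB of DDT
1 TiB of unique data at 128 KiB blocks: 2^40 / 2^17 = ~8.4M entries x ~320 B = ~2.7 GB of DDT

Same data, 16x the table, before counting any savings from duplicate blocks.

-B

-- Brandon High : bh...@freaks.com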
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey wrote:
> Generally speaking, dedup doesn't work on VM images. (Same is true for ZFS
> or netapp or anything else.) Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS. If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks... Then even when you write duplicated data inside
> the guest, the host won't see it as a duplicated block.

A zvol with 4k blocks should give you decent results with Windows guests. Recent versions use 4k alignment by default and 4k blocks, so there should be lots of duplicates for a base OS image.

> There are some situations where dedup may help on VM images... For example
> if you're not using sparse files and you have a zero-filed disk... But in

compression=zle works even better for these cases, since it doesn't require DDT resources.

> Or if you're intimately familiar with both the guest & host filesystems, and
> you choose blocksizes carefully to make them align. But that seems
> complicated and likely to fail.

Using a 4k block size is a safe bet, since most OSs use a block size that is a multiple of 4k. It's the same reason that the new "Advanced Format" drives use 4k sectors.

Windows uses 4k alignment and 4k (or larger) clusters.

ext3/ext4 uses 1k, 2k, or 4k blocks. Filesystems over 512MB should use 4k by default. The block alignment is determined by the partitioning, so some care needs to be taken there.

zfs uses 'ashift' size blocks. I'm not sure what ashift works out to be when using a zvol though, so it could be as small as 512b but may be set to the same as the blocksize property.

ufs is 4k or 8k on x86 and 8k on sun4u. As with ext4, block alignment is determined by partitioning and slices.
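A sketch of creating such a zvol (pool and volume names are hypothetical; volblocksize can only be set at creation time):

# zfs create -V 40G -o volblocksize=4k tank/vm/win7-c

-B

-- Brandon High : bh...@freaks.com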
Re: [zfs-discuss] Quick zfs send -i performance questions
On Thu, May 5, 2011 at 11:17 AM, Giovanni Tirloni wrote:
> What I find curious is that it only happens with incrementals. Full
> send's go as fast as possible (monitored with mbuffer). I was just wondering
> if other people have seen it, if there is a bug (b111 is quite old), etc.

I missed that you were using b111 earlier. That's probably a large part of the problem. There were a lot of performance and reliability improvements between b111 and b134, and there have been more between b134 and b148 (OI) or b151 (S11 Express). Updating the host you're receiving on to something more recent may fix the performance problem you're seeing.

Fragmentation shouldn't be too great an issue if the pool you're writing to is relatively empty. There were changes made to zpool metaslab allocation post-b111 that might improve performance for pools between 70% and 96% full. This could also be why the full sends perform better than incremental sends.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 4, 2011 at 4:36 PM, Erik Trimble wrote:
> If so, I'm almost certain NetApp is doing post-write dedup. That way, the
> strictly controlled max FlexVol size helps with keeping the resource limits
> down, as it will be able to round-robin the post-write dedup to each FlexVol
> in turn.

They are; it's in their docs. A volume is dedup'd when 20% of non-deduped data is added to it, or something similar. 8 volumes can be processed at once though, I believe, and it could be that weaker systems are not able to do as many in parallel.

> block usage has a significant 4k presence. One way I reduced this initially
> was to have the VMdisk image stored on local disk, then copied the *entire*
> image to the ZFS server, so the server saw a single large file, which meant
> it tended to write full 128k blocks. Do note, that my 30 images only takes

Wouldn't you have been better off cloning datasets that contain an unconfigured install and customizing from there?

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Quick zfs send -i performance questions
On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni wrote:
> The problem we've started seeing is that a zfs send -i is taking hours to
> send a very small amount of data (eg. 20GB in 6 hours) while a zfs send full
> transfer everything faster than the incremental (40-70MB/s). Sometimes we
> just give up on sending the incremental and send a full altogether.

Does the send complete faster if you just pipe to /dev/null? I've observed that if recv stalls, it'll pause the send, and the two go back and forth stepping on each other's toes. Unfortunately, send and recv tend to pause with each individual snapshot they are working on.

Putting something like mbuffer (http://www.maier-komor.de/mbuffer.html) in the middle can help smooth it out and speed things up tremendously. It prevents the send from pausing when the recv stalls, and allows the recv to continue working when the send is stalled. You will have to fiddle with the buffer size and other options to tune it for your use.
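A sketch of using it over the network (hostname, port, and buffer sizes are illustrative; tune -s and -m for your data rate):

receiver# mbuffer -s 128k -m 1G -I 9090 | zfs recv tank/backup
sender# zfs send -i tank/fs@a tank/fs@b | mbuffer -s 128k -m 1G -O receiver:9090

-B

-- Brandon High : bh...@freaks.com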
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 4, 2011 at 12:29 PM, Erik Trimble wrote:
> I suspect that NetApp does the following to limit their resource
> usage: they presume the presence of some sort of cache that can be
> dedicated to the DDT (and, since they also control the hardware, they can
> make sure there is always one present). Thus, they can make their code

AFAIK, NetApp has more restrictive requirements about how much data can be dedup'd on each type of hardware. See page 29 of http://media.netapp.com/documents/tr-3505.pdf - Smaller pieces of hardware can only dedup 1TB volumes, and even the big-daddy filers will only dedup up to 16TB per volume, even if the volume size is 32TB (the largest volume available for dedup).

NetApp solves the problem by putting rigid constraints around the problem, whereas ZFS lets you enable dedup for any size dataset. Both approaches have limitations, and it sucks when you hit them.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Faster copy from UFS to ZFS
On Tue, May 3, 2011 at 12:36 PM, Erik Trimble wrote:
> rsync is indeed slower than star; so far as I can tell, this is due almost
> exclusively to the fact that rsync needs to build an in-memory table of all
> work being done *before* it starts to copy. After that, it copies at about

rsync 3.0+ will start copying almost immediately, so it's much better in that respect than previous versions. It continues to scan and update the list of files while sending data.

> network use pattern), which helps for ZFS copying. The one thing I'm not
> sure of is whether rsync uses a socket, pipe, or semaphore method when doing
> same-host copying. I presume socket (which would slightly slow it down vs

It creates a socketpair() before clone()ing itself and uses the socket for communications.

> That said, rsync is really the only solution if you have a partial or
> interrupted copy. It's also really the best method to do verification.

For verification you should specify -c (checksums), otherwise it will only look at the size, permissions, owner and date, and if they all match it will not look at the file contents. It can take as long (or longer) to complete than the original copy, since files on both sides need to be read and checksummed.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Faster copy from UFS to ZFS
On Tue, May 3, 2011 at 5:47 AM, Joerg Schilling wrote:
> But this is most likely slower than star and does rsync support sparse files?

'rsync -ASHXavP'

-A: ACLs
-S: Sparse files
-H: Hard links
-X: Xattrs
-a: archive mode; equals -rlptgoD (no -H,-A,-X)

You don't need to specify --whole-file, it's implied when copying on the same system. --inplace can play badly with hard links and shouldn't be used.

It probably will be slower than other options but it may be more accurate, especially with -H.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] ls reports incorrect file size
On Mon, May 2, 2011 at 1:56 PM, Eric D. Mudama wrote:
> that the application would have done the seek+write combination, since
> on NTFS (which doesn't support sparse) these would have been real
> 1.5GB files, and there would be hundreds or thousands of them in
> normal usage.

NTFS supports sparse files.
http://www.flexhex.com/docs/articles/sparse-files.phtml

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, Apr 28, 2011 at 6:48 PM, Edward Ned Harvey wrote:
> What does it mean / what should you do, if you run that command, and it
> starts spewing messages like this?
> leaked space: vdev 0, offset 0x3bd8096e00, size 7168

I'm not sure there's much you can do about it short of deleting datasets and/or snapshots.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Still no way to recover a "corrupted" pool
On Fri, Apr 29, 2011 at 1:23 PM, Freddie Cash wrote:
> Running ZFSv28 on 64-bit FreeBSD 8-STABLE.

I'd suggest trying to import the pool into snv_151a (Solaris 11 Express), which is the reference and development platform for ZFS.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Faster copy from UFS to ZFS
On Fri, Apr 29, 2011 at 10:53 AM, Dan Shelton wrote:
> Is anyone aware of any freeware program that can speed up copying tons of
> data (2 TB) from UFS to ZFS on same server?

Setting 'sync=disabled' for the initial copy will help, since it will make all writes asynchronous. You will probably want to set it back to default after you're done.
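A sketch (the dataset name is hypothetical; the 'sync' property requires a fairly recent build):

# zfs set sync=disabled tank/dest   # before the copy
# zfs inherit sync tank/dest        # afterwards, revert to the default

-B

-- Brandon High : bh...@freaks.com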
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Fri, Apr 29, 2011 at 7:10 AM, Roy Sigurd Karlsbakk wrote:
> This was fletcher4 earlier, and still is in opensolaris/openindiana. Given a
> combination with verify (which I would use anyway, since there are always
> tiny chances of collisions), why would sha256 be a better choice?

fletcher4 was only an option for snv_128, which was quickly pulled and replaced with snv_128b which removed fletcher4 as an option. The official post is here:
http://www.opensolaris.org/jive/thread.jspa?threadID=118519&tstart=0#437431

It looks like fletcher4 is still an option in snv_151a for non-dedup datasets, and is in fact the default.

As an aside: Erik, any idea when the 159 bits will make it to the public?

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] Finding where dedup'd files are
On Thu, Apr 28, 2011 at 4:06 PM, Erik Trimble wrote: > Which means, that while I can get a list of blocks which are deduped, it > may not be possible to generate a list of files from that list of > blocks. Is it possible to determine which datasets the blocks are referenced from? Since I have some datasets with dedup'd data, I'm a little paranoid about tanking the system if they are destroyed. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, Apr 28, 2011 at 3:50 PM, Edward Ned Harvey wrote: > When a block is scheduled to be written, system performs checksum, and looks > for a matching entry in DDT in ARC/L2ARC. In the event of an ARC/L2ARC ... which, if it's on L2ARC, is another read too. While most people will be using a fast SSD, it's slower than RAM and still worth mentioning. > cache miss for a DDT entry which actually exists, the system will need to > perform a number of small disk reads in order to fetch the DDT entry from > disk. Correct? I figure at least one, probably more than one, read to > locate the entry on disk, and then another read to actually read the entry. I think it's safe to assume it'll usually be multiple reads from the pool devices. These are random iops. > After this, the system knows there is a checksum match between the block > waiting to be written, and another block that's already on disk, and it > could possibly have to do yet another read for verification, before it is > able to finally do the write. Right? If verify is on, it'll read the on-disk block and compare it to the to-be-written block. If they match, it will increment the refcount for the on-disk block. If the zpool property dedupditto is set and the refcount for the on-disk block exceeds the threshold, it will write another copy of the block to disk. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
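A sketch of setting the threshold mentioned above, assuming a pool named tank; as far as I know, 100 is the smallest nonzero value the property accepts:
# zpool set dedupditto=100 tank
With this set, a dedup'd block whose refcount climbs past 100 gets a second physical copy written to the pool.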
Re: [zfs-discuss] Finding where dedup'd files are
On Thu, Apr 28, 2011 at 3:48 PM, Ian Collins wrote: > Dedup is at the block, not file level. Files are usually composed of blocks. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Thu, Apr 28, 2011 at 3:05 PM, Erik Trimble wrote: > A careful reading of the man page seems to imply that there's no way to > change the dedup checksum algorithm from sha256, as the dedup property > ignores the checksum property, and there's no provided way to explicitly > set a checksum algorithm specific to dedup (i.e. there's no way to > override the default for dedup). That's my understanding as well. The initial release used fletcher4 or sha256, but there was either a bug in the fletcher4 code or a hash collision that required removing it as an option. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Wed, Apr 27, 2011 at 9:26 PM, Edward Ned Harvey wrote: > Correct me if I'm wrong, but the dedup sha256 checksum happens in addition > to (not instead of) the fletcher2 integrity checksum. So after bootup, My understanding is that enabling dedup forces sha256. "The default checksum used for deduplication is sha256 (subject to change). When dedup is enabled, the dedup checksum algorithm overrides the checksum property." -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
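To illustrate (the dataset name is a placeholder), enabling dedup and then inspecting the properties:
# zfs set dedup=on tank/fs
# zfs get checksum,dedup tank/fs
The checksum property may still show its inherited value; per the man page text quoted above, the dedup checksum (sha256) overrides it for writes to the dedup-enabled dataset.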
[zfs-discuss] Finding where dedup'd files are
Is there an easy way to find out what datasets have dedup'd data in them? Even better would be to discover which files in a particular dataset are dedup'd. I ran # zdb - which gave output like:
index 1055c9f21af63 refcnt 2 single DVA[0]=<0:1e274ec3000:2ac00:STD:1> [L0 deduplicated block]
sha256 uncompressed LE contiguous unique unencrypted 1-copy size=2L/2P birth=236799L/236799P fill=1
cksum=55c9f21af6399be:11f9d4f5ff4cb109:2af8b798671e47ba:d19caf78da295df5
How can I translate this into datasets or files? -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Wed, Apr 27, 2011 at 12:51 PM, Lamp Zy wrote: > Any ideas how to identify which drive is the one that failed so I can > replace it? Try the following: # fmdump -eV # fmadm faulty -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive replacement speed
The last resilver finished after 50 hours. Ouch. I'm onto the next device now, which seems to be progressing much, much better. The current tunings that I'm using right now are:
echo zfs_resilver_delay/W0t0 | mdb -kw
echo zfs_resilver_min_time_ms/W0t2 | pfexec mdb -kw
Things could slow down, but at 13 hours in, the resilver has been managing ~ 100M/s and is 70% done. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
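As a side note, the current values of these tunables can be read back before changing them; a sketch using mdb's /D (decimal) output format:
# echo zfs_resilver_delay/D | mdb -k
# echo zfs_resilver_min_time_ms/D | mdb -k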
Re: [zfs-discuss] Drive replacement speed
On Mon, Apr 25, 2011 at 5:26 PM, Brandon High wrote: > Setting zfs_resilver_delay seems to have helped some, based on the > iostat output. Are there other tunables? I found zfs_resilver_min_time_ms while looking. I've tried bumping it up considerably, without much change. 'zpool status' is still showing:
scan: resilver in progress since Sat Apr 23 17:03:13 2011
    6.06T scanned out of 6.40T at 36.0M/s, 2h46m to go
    769G resilvered, 94.64% done
'iostat -xn' shows asvc_t under 10ms still. Increasing the per-device queue depth has increased the asvc_t but hasn't done much to affect the throughput. I'm using:
echo zfs_vdev_max_pending/W0t35 | pfexec mdb -kw
-B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu wrote: > So how can I set the quota size on a file system with dedup enabled? I believe the quota applies to the non-dedup'd data size. If a user stores 10G of data, it will use 10G of quota, regardless of whether it dedups at 100:1 or 1:1. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
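For instance (user and dataset names are hypothetical):
# zfs set quota=10G tank/home/fred
The user can then store 10G of logical data; even if dedup keeps the physical footprint far smaller, the quota accounting doesn't shrink.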
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy wrote: > I'd expect the spare drives to auto-replace the failed one but this is not > happening. > > What am I missing? Is the autoreplace property set to 'on'? # zpool get autoreplace fwgpool0 # zpool set autoreplace=on fwgpool0 > I really would like to get the pool back in a healthy state using the spare > drives before trying to identify which one is the failed drive in the > storage array and trying to replace it. How do I do this? Turning on autoreplace might start the replace. If not, the following will replace the failed drive with the first spare. (I'd suggest verifying the device names before running it.) # zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0 -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive replacement speed
On Mon, Apr 25, 2011 at 4:45 PM, Richard Elling wrote: > If there is other work going on, then you might be hitting the resilver > throttle. By default, it will delay 2 clock ticks, if needed. It can be turned off. There is some other access to the pool from nfs and cifs clients, but not much, and mostly reads. Setting zfs_resilver_delay seems to have helped some, based on the iostat output. Are there other tunables? > Probably won't work because it does not make the resilvering drive > any faster. It doesn't seem like the devices are the bottleneck, even with the delay turned off.
$ iostat -xn 60 3
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  369.2   11.5  5577.0    71.3   0.7   0.7    1.9    1.9  14  29 c2t0d0
  371.9   11.5  5570.3    71.3   0.7   0.7    1.7    1.8  13  29 c2t1d0
  369.9   11.5  5574.4    71.3   0.7   0.7    1.8    1.9  14  29 c2t2d0
  370.7   11.5  5573.9    71.3   0.7   0.7    1.8    1.9  14  29 c2t3d0
  368.0   11.5  5553.1    71.3   0.7   0.7    1.8    1.9  14  29 c2t4d0
  196.1  172.8  2825.5  2436.6   0.3   1.1    0.8    3.0   6  26 c2t5d0
  183.6  184.9  2717.6  2674.7   0.5   1.3    1.4    3.5  11  31 c2t6d0
  393.0   11.2  5540.7    71.3   0.5   0.6    1.3    1.5  12  26 c2t7d0
   95.8    1.2    95.6    16.2   0.0   0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2     3.6    16.2   0.0   0.0    7.5    1.9   0   0 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  891.2   11.8  2386.9    64.4   0.0   1.2    0.0    1.3   1  36 c2t0d0
  919.9   12.1  2351.8    64.6   0.0   1.1    0.0    1.2   0  35 c2t1d0
  906.9   12.1  2346.1    64.6   0.0   1.2    0.0    1.3   0  36 c2t2d0
  877.9   11.6  2351.0    64.5   0.7   0.5    0.8    0.6  23  35 c2t3d0
  883.4   12.0  2322.0    64.4   0.2   1.0    0.2    1.1   7  35 c2t4d0
    0.8  758.0     0.8  1910.4   0.2   5.0    0.2    6.6   3  72 c2t5d0
  882.7   11.4  2355.1    64.4   0.8   0.4    0.9    0.4  27  34 c2t6d0
  907.8   11.4  2373.1    64.5   0.7   0.3    0.8    0.4  23  30 c2t7d0
 1607.8    9.4  1568.2    83.0   0.1   0.2    0.1    0.1   3  18 c0t0d0
    7.3    9.1    23.5    83.0   0.1   0.0    6.0    1.4   2   2 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  960.3   12.7  2868.0    59.0   1.1   0.7    1.2    0.8  37  52 c2t0d0
  963.2   12.7  2877.5    59.1   1.1   0.8    1.1    0.8  36  51 c2t1d0
  960.3   12.6  2844.7    59.1   1.1   0.7    1.1    0.8  37  52 c2t2d0
 1000.1   12.8  2827.1    59.0   0.6   1.2    0.6    1.2  21  52 c2t3d0
  960.9   12.3  2811.1    59.0   1.3   0.6    1.3    0.6  42  51 c2t4d0
    0.5  962.2     0.4  2418.3   0.0   4.1    0.0    4.3   0  59 c2t5d0
 1014.2   12.3  2820.6    59.1   0.8   0.8    0.8    0.8  28  48 c2t6d0
 1031.2   12.5  2822.0    59.1   0.8   0.8    0.7    0.8  26  45 c2t7d0
 1836.4    0.0  1783.4     0.0   0.0   0.2    0.0    0.1   1  19 c0t0d0
    5.3    0.0     5.3     0.0   0.0   0.0    1.1    1.5   1   1 c0t1d0
-- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Mon, Apr 25, 2011 at 8:20 AM, Edward Ned Harvey wrote: > and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs > property, not a zpool property. So how can you know or configure the > blocksize for something like a zvol iscsi target?) zvols use the 'volblocksize' property, which defaults to 8k. A 1TB zvol is therefore 2^27 blocks and would require ~ 34 GB for the ddt (assuming that a ddt entry is 270 bytes). The zfs man page for the property reads: volblocksize=blocksize For volumes, specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time. The default blocksize for volumes is 8 Kbytes. Any power of 2 from 512 bytes to 128 Kbytes is valid. This property can also be referred to by its shortened column name, volblock. -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
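If dedup is planned for a zvol, a larger volblocksize shrinks the DDT proportionally. A hypothetical creation (the name and sizes are placeholders, and the property cannot be changed afterward):
# zfs create -V 1T -o volblocksize=128k tank/vol
At 128 Kbyte blocks the same 1TB zvol is 2^23 blocks, needing roughly 2GB of DDT instead of ~34GB, at the cost of read-modify-write for small I/O.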
[zfs-discuss] Drive replacement speed
[...]
                 capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
  ...0            -      -    771     10  1.99M  59.4K
  c2t2d0          -      -    743     10  2.02M  59.4K
  c2t3d0          -      -    771     11  2.01M  59.3K
  c2t4d0          -      -    767     10  1.94M  59.1K
  replacing       -      -      0  1.00K     17  1.48M
    c2t5d0/old    -      -      0      0      0      0
    c2t5d0        -      -      0    533     17  1.48M
  c2t6d0          -      -    791     10  1.98M  59.2K
  c2t7d0          -      -    796     10  1.99M  59.3K
----------     -----  -----  -----  -----  -----  -----
$ iostat -xn 60 3
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  362.4   11.5  5693.9    71.6   0.7   0.7    2.0    2.0  14  30 c2t0d0
  365.3   11.5  5689.0    71.6   0.7   0.7    1.8    1.9  14  29 c2t1d0
  363.2   11.5  5693.2    71.6   0.7   0.7    1.9    2.0  14  30 c2t2d0
  364.0   11.5  5692.7    71.6   0.7   0.7    1.9    1.9  14  30 c2t3d0
  361.2   11.5  5672.8    71.6   0.7   0.7    1.9    1.9  14  30 c2t4d0
  202.4  163.1  2915.2  2475.3   0.3   1.1    0.8    2.9   7  26 c2t5d0
  170.4  190.4  2747.3  2757.6   0.5   1.3    1.5    3.6  11  31 c2t6d0
  386.4   11.2  5659.0    71.6   0.5   0.6    1.3    1.5  12  27 c2t7d0
   95.0    1.2    94.5    16.1   0.0   0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2     3.3    16.1   0.0   0.0    7.5    1.9   0   0 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  514.1   13.0  1937.7    65.7   0.2   0.8    0.3    1.5   5  27 c2t0d0
  510.1   13.2  1943.1    65.7   0.2   0.8    0.5    1.6   6  29 c2t1d0
  513.3   13.2  1926.3    65.8   0.2   0.8    0.3    1.5   5  28 c2t2d0
  505.9   13.3  1936.7    65.8   0.2   0.9    0.3    1.8   5  30 c2t3d0
  513.8   12.8  1890.1    65.8   0.2   0.8    0.3    1.5   5  26 c2t4d0
    0.1  488.6     0.1  1216.5   0.0   2.2    0.0    4.6   0  33 c2t5d0
  533.3   12.7  1875.3    65.9   0.1   0.7    0.2    1.3   4  24 c2t6d0
  541.6   12.9  1923.2    65.8   0.1   0.7    0.2    1.2   3  23 c2t7d0
    0.0    2.0     0.0     9.4   0.0   0.0    1.0    0.2   0   0 c0t0d0
    0.0    2.0     0.0     9.4   0.0   0.0    1.0    0.2   0   0 c0t1d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  506.7    9.2  1906.9    50.2   0.6   0.2    1.2    0.5  20  23 c2t0d0
  509.8    9.3  1909.5    50.2   0.6   0.2    1.2    0.4  19  23 c2t1d0
  508.6    9.0  1900.4    50.2   0.7   0.3    1.4    0.5  21  25 c2t2d0
  506.8    9.4  1897.2    50.3   0.6   0.2    1.2    0.5  19  23 c2t3d0
  505.1    9.4  1852.4    50.4   0.6   0.2    1.2    0.5  19  23 c2t4d0
    0.0  487.6     0.0  1227.9   0.0   3.5    0.0    7.2   0  46 c2t5d0
  534.8    9.2  1855.6    50.2   0.6   0.2    1.0    0.4  18  22 c2t6d0
  540.5    9.3  1891.4    50.2   0.5   0.2    1.0    0.4  17  21 c2t7d0
    0.0    0.0     0.0     0.0   0.0   0.0    0.0    0.0   0   0 c0t0d0
    0.0    0.0     0.0     0.0   0.0   0.0    0.0    0.0   0   0 c0t1d0
-- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] just can't import
On Mon, Apr 11, 2011 at 10:55 AM, Matt Harrison wrote: > It did finish eventually, not sure how long it took in the end. Things are > looking good again :) If you want to continue using dedup, you should invest in (a lot) more memory. The amount of memory required depends on the size of your pool and the type of data that you're storing. Data stored in large blocks will use less memory. I suspect that the minimum memory for most moderately sized pools is over 16GB. There has been a lot of discussion regarding how much memory each dedup'd block requires, and I think it was about 250-270 bytes per block. 1TB of data (at max block size and no duplicate data) will require about 2GB of memory to run effectively. (This seems high to me, hopefully someone else can confirm.) This is memory that is available to the ARC, above and beyond what is being used by the system and applications. Of course, using all your ARC to hold dedup data won't help much either, as either cacheable data or dedup info will be evicted rather quickly. Forcing the system to read dedup tables from the pool is slow, since it's a lot of random reads. All I know is that I have 8GB in my home system, and it is not enough to work with the 8TB pool that I have. Adding a fast SSD as L2ARC can help reduce the memory requirements somewhat by keeping dedup data more easily accessible. (And make sure that your L2ARC device is large enough. I fried a 30GB OCZ Vertex in just a few months of use, I suspect from the constant writes.) -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
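To size this for a specific pool, zdb can report the DDT, or simulate one before dedup is ever enabled; a sketch, assuming a pool named tank:
# zdb -DD tank
# zdb -S tank
The first prints a histogram of the existing DDT on a dedup'd pool; the second simulates dedup on the existing data. Multiplying the total entry count by ~270 bytes gives a rough in-core DDT size.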
Re: [zfs-discuss] just can't import
On Sun, Apr 10, 2011 at 10:01 PM, Matt Harrison wrote: > The machine only has 4G RAM I believe. There's your problem. 4G is not enough memory for dedup, especially without a fast L2ARC device. > It's time I should be heading to bed so I'll let it sit overnight, and if > I'm still stuck with it I'll give Ian's recent suggestions a go and report > back. I'd suggest waiting for it to finish the destroy. It will, if you give it time. Trying to force the import is only going to put you back in the same situation - The system will attempt to complete the destroy and seem to hang until it's completed. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] just can't import
On Sun, Apr 10, 2011 at 9:01 PM, Matt Harrison wrote: > I had a de-dup dataset and tried to destroy it. The command hung and so did > anything else zfs related. I waited half an hour or so, the dataset was > only 15G, and rebooted. How much RAM does the system have? Dedup uses a LOT of memory, and it can take a long time to destroy dedup'd datasets. If you keep waiting, it'll eventually return. It could be a few hours or longer. > The machine refused to boot, stuck at Reading ZFS Config. Asking around on The system resumed the destroy that was in progress. If you let it sit, it'll eventually complete. > Well the livecd is also hanging on import, anything else zfs hangs. iostat > shows some reads but they drop off to almost nothing after 2 mins or so. Likewise, it's trying to complete the destroy. Be patient and it'll complete. Newer versions of OpenSolaris or Solaris 11 Express may complete it faster. > Any tips greatly appreciated, Just wait... -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Going forward after Oracle - Let's get organized, let's get started.
On Sat, Apr 9, 2011 at 10:41 AM, Chris Forgeron wrote: > I see your point, but you also have to understand that sometimes too many > helpers/opinions are a bad thing. There is a set "core" of ZFS developers > who make a lot of this move forward, and they are the key right now. The rest > of us will just muddy the waters with conflicting/divergent opinions on > direction and goals. It would be nice to have some communication from the devs about what they're working on. A moderated list that only a limited set of people normally post to would be excellent. I'd be excited to hear that there's a new feature being worked on, rather than the radio silence we've had. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to rename rpool. Is that recommended ?
On Fri, Apr 8, 2011 at 12:10 AM, Arjun YK wrote: > I have a situation where a host, which is booted off its 'rpool', need > to temporarily import the 'rpool' of another host, edit some files in > it, and export the pool back retaining its original name 'rpool'. Can > this be done ? Yes, you can do it; no, it is not recommended. I had a need to do something similar to what you're attempting and ended up using a Live CD (which doesn't have an rpool to conflict with) to do the manipulations. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
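A sketch of the Live CD approach (the altroot /a is an assumption):
# zpool import -f -R /a rpool
  ... edit the files under /a ...
# zpool export rpool
Because the Live CD has no rpool of its own, the pool keeps its original name throughout, and the -R altroot keeps its mountpoints from colliding with the live environment.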
Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?
On Thu, Apr 7, 2011 at 4:01 PM, Joe Auty wrote: > My source computer is running Solaris 10 ZFS version 15. Does this mean that > I'd be asking for trouble doing a zfs send back to this machine from any > other ZFS machine running a version > 15? I just want to make sure I > understand all of this info... There are two versions when it comes to ZFS - The zpool version and the zfs version.
bhigh@basestar:~$ zpool list -o name,version
NAME   VERSION
rpool       31
bhigh@basestar:~$ zfs list -o name,version
NAME                VERSION
rpool                     5
rpool/ROOT                5
rpool/ROOT/snv_151        5
rpool/dump                -
rpool/rsrv                5
rpool/swap                -
I think that the version that matters (for your purposes) is the ZFS version. It should be set when using 'send -R' and having 'zfs receive' create the destination datasets. I recommend testing however. > If this is the case, what are my strategies? Solaris 10 for my temporary > backup machine? Is it possible to run OpenIndiana or Nexenta or something and > somehow set up these machines with ZFS v15 or something? You can set the zpool version when you create the pool, and you can set the zfs version when you create the dataset. I'm not sure that you'll need to set the pool version to anything lower if the dataset version is correct though. You should test this, however. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
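A hypothetical example of pinning versions at creation time (device, pool, and dataset names are placeholders, and as the thread advises, this should be tested before relying on it):
# zpool create -o version=22 tank mirror c0t0d0 c0t1d0
# zfs create -o version=4 tank/backup
The results can be checked afterward with 'zpool get version tank' and 'zfs get version tank/backup'.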
Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?
On Wed, Apr 6, 2011 at 10:42 AM, Paul Kraus wrote: > I thought I saw that with zpool 10 (or was it 15) the zfs send > format had been committed and you *could* send/recv between different > versions of zpool/zfs. From the Solaris 10U9 (zpool 22) man page for zfs: There is still a problem if the dataset version is too high. I *believe* that a 'zfs send -R' should send the zfs version, and that zfs receive will create any new datasets using that version. (I have a received dataset here that's zfs v4, whereas everything else in the pool is v5.) As long as you don't do a zfs upgrade after that point, you should be fine. It's probably a good idea to check that the received versions are the same as the source before doing a destroy though. ;-) One other thing that I forgot to mention in my last mail too: If you're receiving into a VM, make sure that the VM can manage redundancy on its zfs storage, and not just multiple vdsk on the same host disk / lun. Either give it access to the raw devices, or use iSCSI, or create your vdsk on different luns and raidz them, etc. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/receive to Solaris/FBSD/OpenIndiana/Nexenta VM guest?
On Tue, Apr 5, 2011 at 12:38 PM, Joe Auty wrote: > How about getting a little more crazy... What if this entire server > temporarily hosting this data was a VM guest running ZFS? I don't foresee > this being a problem either, but with so The only thing to watch out for is to make sure that the receiving datasets aren't a higher version than the zfs version that you'll be using on the replacement server. Because you can't downgrade a dataset, using snv_151a and planning to send to Nexenta as a final step will trip you up unless you explicitly create them with a lower version. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NTFS on NFS and iSCSI always generates small IO's
On Thu, Mar 10, 2011 at 9:45 AM, Richard Elling wrote: > Default recordsize for NFS is 128K. For the VM case, you will want to match > the block size of > the clients. However, once the file (on the NFS server) is created with 128K > records, it will remain > at 128K forever. So you will need to create a new VM store after the > recordsize is tuned. You can change the recordsize and copy the vmdk files on the nfs server, which will re-write them with a smaller recordsize. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
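For example (dataset and file names are hypothetical), rewriting an existing vmdk at the smaller recordsize:
# zfs set recordsize=8k tank/vmstore
# cp /tank/vmstore/guest.vmdk /tank/vmstore/guest.vmdk.new
# mv /tank/vmstore/guest.vmdk.new /tank/vmstore/guest.vmdk
Only data written after the property change picks up the new recordsize, which is why the copy is needed.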
Re: [zfs-discuss] NTFS on NFS and iSCSI always generates small IO's
On Thu, Mar 10, 2011 at 12:15 AM, Matthew Anderson wrote: > I have a feeling it's to do with ZFS's recordsize property but haven't been > able to find any solid testing done with NTFS. I'm going to do some testing > using smaller record sizes tonight to see if that helps the issue. > At the moment I'm surviving on cache and am quickly running out of capacity. > > Can anyone suggest any further tests or have any idea about what's going on? The default blocksize for a zfs volume is 8k, so 4k writes will probably require a read as well. You can try creating a new volume with volblocksize set to 4k and see if that helps. The value can't be changed once set, so you'll have to make a new dataset. Make sure the "wcd" property is set to "false" for the volume in stmfadm in order to enable the write cache. It shouldn't make a huge difference with the zil disabled, but it certainly won't hurt. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
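A sketch of both steps, with a hypothetical volume name and LU GUID:
# zfs create -V 100G -o volblocksize=4k tank/ntfsvol
# stmfadm list-lu
# stmfadm modify-lu -p wcd=false 600144F0XXXXXXXXXXXXXXXXXXXXXXXX
'wcd' is write-cache disable, so setting it to false enables the write cache.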
Re: [zfs-discuss] Slices and reservations Was: Re: How long should an empty destroy take? snv_134
On Mon, Mar 7, 2011 at 1:50 PM, Yaverot wrote: > 1. While performance isn't my top priority, doesn't using slices make a > significant difference? Write caching will be disabled on devices that use slices. It can be turned back on by using 'format -e'. > 2. Doesn't snv_134 that I'm running already account for variances in these > nominally-same disks? It will allow some small differences. I'm not sure what the limit on the difference size is. > 3. The market refuses to sell disks under $50, therefore I won't be able to > buy drives of 'matching' capacity anyway. You can always use a larger drive. If you think you may want to go back to smaller drives, make sure that the autoexpand zpool property is disabled though. > 3. Assuming I want to do such an allocation, is this done with quota & > reservation? Or is it snapshots as you suggest? I think Edward misspoke when he said to use snapshots, and probably meant reservation. I've taken to creating a dataset called "reserved" and giving it a 10G reservation, as sketched below. (10G isn't a special value, feel free to use 5% of your pool size or whatever else you're comfortable with.) It's unmounted and doesn't contain anything, but it ensures that there is a chunk of space I can make available if needed. Because it doesn't contain anything, there shouldn't be any concern about de-allocation of blocks when it's destroyed. Alternately, the reservation can be reduced to make space available. > Would it make more sense to make another filesystem in the pool, fill it > enough and keep it handy to delete? Or is there some advantage to zfs destroy > (snapshot) over zfs destroy (filesystem)? While I am thinking about the > system and have extra drives, like now, is the time to make plans for the > next "system is full" event. If a dataset contains data, the blocks will have to be freed when it's destroyed. If it's an empty dataset with a reservation, the only change is to fiddle some accounting bits. I seem to remember seeing a fix for 100% full pools a while ago so this may not be as critical as it used to be, but it's a nice safety net to have. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
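The reserved-dataset trick, as a minimal sketch (the pool name tank is a placeholder):
# zfs create -o reservation=10G -o mountpoint=none tank/reserved
If the pool ever fills, free the space instantly by shrinking or dropping the reservation:
# zfs set reservation=none tank/reserved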
Re: [zfs-discuss] Format returning bogus controller info
On Mon, Feb 28, 2011 at 9:39 PM, Dave Pooser wrote: > Is the same true of controllers? That is, will c12 remain c12 or > /pci@0,0/pci8086,340c@5 remain /pci@0,0/pci8086,340c@5 even if other > controllers are active? You can rebuild the device tree if it bothers you. There are some (outdated) instructions here: http://spiralbound.net/blog/2005/12/21/rebuilding-the-solaris-device-tree . I think you can do this all with a new boot environment, rather than boot from a CD. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv horribly slow on system with 1800+ filesystems
On Mon, Feb 28, 2011 at 10:38 PM, Moazam Raja wrote: > We've noticed that on systems with just a handful of filesystems, ZFS > send (recursive) is quite quick, but on our 1800+ fs box, it's > horribly slow. When doing an incremental send, the system has to identify what blocks have changed, which can take some time. If not much data has changed, the delay can take longer than the actual send. I've noticed that there's a small delay when starting a send of a new snapshot and when starting the receive of one. Putting something like mbuffer in the path helps to smooth things out. It won't help in the example you've cited below, but it will help in real world use. > The other odd thing I've noticed is that during the 'zfs send' to > /dev/null, zpool iostat shows we're actually *writing* to the zpool at > the rate of 4MB-8MB/s, but reading almost nothing. How can this be the > case? The writing seems odd, but the lack of reads doesn't. You might have most or all of the data in the ARC or L2ARC, so your zpool doesn't need to be read from. > 1.) Does ZFS get immensely slow once we have thousands of filesystems? No. Incremental sends might take longer, as I mentioned above. > 2.) Why do we see 4MB-8MB/s of *writes* to the filesystem when we do a > 'zfs send' to /dev/null ? Is anything else using the filesystems in the pool? -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
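A sketch of an mbuffer pipeline between hosts (host, pool, and snapshot names are placeholders; -s and -m set mbuffer's block size and buffer memory):
# zfs send -R tank/fs@snap | mbuffer -s 128k -m 1G | ssh otherhost 'zfs receive -d tank2'
Running a second mbuffer on the receiving end of the ssh pipe smooths the bursts out further.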
Re: [zfs-discuss] ZFS Performance
On Sun, Feb 27, 2011 at 7:35 PM, Brandon High wrote: > It moves from "best fit" to "any fit" at a certain point, which is at > ~ 95% (I think). Best fit looks for a large contiguous space to avoid > fragmentation while any fit looks for any free space. I got the terminology wrong, it's first-fit when there is space, moving to best-fit at 96% full. See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/metaslab.c for details. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Performance
On Sun, Feb 27, 2011 at 6:59 AM, Edward Ned Harvey wrote: > But there is one specific thing, isn't there? Where ZFS will choose to use > a different algorithm for something, when pool usage exceeds some threshold. > Right? What is that? It moves from "best fit" to "any fit" at a certain point, which is at ~ 95% (I think). Best fit looks for a large contiguous space to avoid fragmentation while any fit looks for any free space. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On Sun, Feb 27, 2011 at 7:48 AM, taemun wrote: > eSATA has no need for any interposer chips between a modern SATA chipset on > the motherboard and a SATA hard drive. You can buy cables with appropriate eSATA has different electrical specifications, namely a higher minimum transmit power and a lower minimum receive power. An internal port might work with a SATA to eSATA cable or adapter, but it's not guaranteed to. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On Sun, Feb 27, 2011 at 4:15 PM, Rich Teer wrote: > So the question is, what eSATA non-RAID HBA do people recommend? Bear > in mind that I'm looking for something with driver support "out of the > box" with either the latest Solaris 10, or Solaris 11 Express. The SiI3124 (PCI / PCI-X) and SiI3132 (PCIe) based cards can be picked up for about $20-$30. They're supported, and support PMPs in Solaris. I don't know about support on Sparc though. http://www.newegg.com/Product/Product.aspx?Item=N82E16816132021 http://www.newegg.com/Product/Product.aspx?Item=N82E16816132027 > Assuming the use of eSATA, what enclosures do people recommend? I don't > need huge amounts of space; two drives should be enough and four will > be plenty and allow for expansion. Again, I'm looking for a JBOD coz > I want ZFS to do all the work. Something similar to the Sans Digital enclosures would probably work. They use a PMP to make all the drives available via one eSATA port, which may or may not work. It's supposed to, but there are hardware blacklists in the drivers that may cause you trouble. Another thought is to ditch the Sun boxes and use a HP ProLiant Microserver. It's about $320 and holds 4 drives, with an expansion slot for an additional controller. I think some people have reported success with these on the list. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What drives?
On Thu, Feb 24, 2011 at 10:45 PM, Markus Kovero wrote: > Hi! I'd go for WD RE edition. Blacks and Greens are for desktop use and > therefore lack proper TLER settings and have useless power saving features > that could induce errors and mysterious slowness. There has been a lot of discussion about TLER in the past, and I'm less convinced that it's a requirement for zfs than I used to think. I've been using WD Green (EADS) drives for two years without issue. They are older models whose sleep and TLER settings could still be changed, though. Many of the new WD Green drives (including some of the RE) use 4k sectors, which will wreak havoc on zpool performance. Other manufacturers are starting to use 4k sectors on their 5400 rpm drives as well, so shop carefully if you decide to go with a lower spindle speed. I have not seen a 7200 rpm drive with 4k sectors, but I'm sure they exist. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] External SATA drive enclosures + ZFS?
On Fri, Feb 25, 2011 at 4:34 PM, Rich Teer wrote: > Space is starting to get a bit tight here, so I'm looking at adding > a couple of TB to my home server. I'm considering external USB or > FireWire attached drive enclosures. Cost is a real issue, but I also I would avoid USB, since it can be less reliable than other connection methods. That's the impression I get from older posts made by Sun devs, at least. I'm not sure how well Firewire 400 is supported, let alone Firewire 800. You might want to consider eSATA. Port multipliers are supported in recent builds (128+ I think), and will give better performance than USB. I'm not sure if PMPs are supported on Sparc though, since it requires support in both the controller and the PMP. Consider enclosures from other manufacturers as well. I've heard good things about Sans Digital, but I've never used them. The 2-drive enclosure has the same components as the item you linked but 1/2 the cost via Newegg. > The intent would be to put two 1TB or 2TB drives in the enclosure and use > ZFS to create a mirrored pool out of them. Assuming this enclosure is > set to JBOD mode, would I be able to use this with ZFS? The enclosure Yes, but I think the enclosure has a SiI5744 inside it, so you'll still have one connection from the computer to the enclosure. If that goes, you'll lose both drives. If you're just using two drives, two separate enclosures on separate buses may be better. Look at http://www.sansdigital.com/towerstor/ts1ut.html for instance. There are also larger enclosures with up to 8 drives. > I can't think of a reason why it wouldn't work, but I also have exactly > zero experience with this kind of set up! Like I mentioned, USB is prone to some flakiness. > Assuming this would work, given that I can't seem to find a 4-drive > version of it, would I be correct in thinking that I could buy two of You might be better off using separate enclosures for reliability. Make sure to split the mirrors across the two devices. Use separate USB controllers if possible, so a bus reset doesn't affect both sides. > Assuming my proposed enclosure would work, and assuming the use of > reasonable quality 7200 RPM disks, how would you expect the performance > to compare with the differential UltraSCSI set up I'm currently using? > I think the DWIS is rated at either 20MB/sec or 40MB/sec, so on the > surface, the USB attached drives would seem to be MUCH faster... USB 2.0 is about 30-40MB/s under ideal conditions, but doesn't support any of the command queuing that SCSI does. I'd expect performance to be slightly lower, and to use slightly more CPU. Most USB controllers don't support DMA, so all I/O requires CPU time. What about an inexpensive SAS card (eg: Supermicro AOC-USAS-L4i) and external SAS enclosure (eg: Sans Digital TowerRAID TR4X)? It would cost about $350 for the setup. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
On Tue, Feb 8, 2011 at 12:53 PM, David Dyer-Bennet wrote: > Wait, are you saying that the handling of errors in RAIDZ and mirrors is > completely different? That it dumps the mirror disk immediately, but > keeps trying to get what it can from the RAIDZ disk? Because otherwise, > you assertion doesn't seem to hold up. I think he meant that if one drive in a mirror dies completely, then any single read error on the remaining drive is not recoverable. With raidz2 (or a 3-way mirror for that matter), if one drive dies completely, you still have redundancy. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 10:29 AM, Yi Zhang wrote: > I already set primarycache to metadata, and I'm not concerned about > caching reads, but caching writes. It appears writes are indeed cached > judging from the time of 2.a) compared to UFS+directio. More > specifically, 80MB/2s=40MB/s (UFS+directio) looks realistic while > 80MB/0.11s=800MB/s (ZFS+primarycache=metadata) doesn't. You're trying to force a solution that isn't relevant for the situation. ZFS is not UFS, and solutions that are required for UFS to work correctly are not needed with ZFS. Yes, writes are cached, but all the POSIX requirements for synchronous IO are met by the ZIL. As long as your storage devices, be they SAN, DAS or somewhere in between respect cache flushes, you're fine. If you need more performance, use a slog device that respects cache flushes. You don't need to worry about whether writes are being cached, because any data that is written synchronously will be committed to stable storage before the write returns. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On Mon, Feb 7, 2011 at 6:15 AM, Yi Zhang wrote: > On Mon, Feb 7, 2011 at 12:25 AM, Richard Elling > wrote: >> Solaris UFS directio has three functions: >> 1. improved async code path >> 2. multiple concurrent writers >> 3. no buffering >> > Thanks for the comments, Richard. All I wanted is to achieve 3 on ZFS. > But as I said, apprently 2.a) below didn't give me that. Do you have > any suggestion? Don't. Use a ZIL, which will meet the requirements for synchronous IO. Set primarycache to metadata to prevent caching reads. ZFS is a very different beast than UFS and doesn't require the same tuning. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
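A minimal sketch of both settings (pool, device, and dataset names are hypothetical):
# zpool add tank log c4t1d0
# zfs set primarycache=metadata tank/db
The slog absorbs the synchronous writes, and the primarycache setting keeps file data out of the ARC, which approximates directio's no-buffering behavior for reads.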
Re: [zfs-discuss] ZFS/Drobo (Newbie) Question
On Sat, Feb 5, 2011 at 9:54 AM, Gaikokujin Kyofusho wrote: > Just to make sure I understand your example, if I say had 4x2tb drives, > 2x750gb, 2x1.5tb drives etc then i could make 3 groups (perhaps 1 raidz1 + 1 > mirrored + 1 mirrored), in terms of accessing them would they just be mounted > like 3 partitions or could it all be accessed like one big partition? You could add them to one pool, and then create multiple filesystems inside the pool. Your total storage would be the sum of the drives' capacity after redundancy, or 3x2tb + 750gb + 1.5tb. It's not recommended to use different levels of redundancy in a pool, so you may want to consider using mirrors for everything. This also makes it easier to add or upgrade capacity later. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and spindle speed (7.2k / 10k / 15k)
On Sat, Feb 5, 2011 at 3:34 PM, Roy Sigurd Karlsbakk wrote: >> so as not to exceed the channel bandwidth. When they need to get higher disk >> capacity, they add more platters. > > May this mean those drives are more robust in terms of reliability, since the > leaks between sectors is less likely with the lower density? More platters leads to more heat and higher power consumption. Most drives are 3 or 4 platters, though Hitachi usually manufactures 5 platter drives as well. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss