Re: [zfs-discuss] ZFS performance benchmarks in various configurations
Richard Elling wrote:
... As you can see, so much has changed, hopefully for the better, that running performance benchmarks on old software just isn't very interesting. NB: Oracle's Sun OpenStorage systems do not use Solaris 10, and if they did, they would not be competitive in the market. The notion that OpenSolaris is worthless and Solaris 10 rules is simply bull*

OpenSolaris isn't worthless, but no way in hell would I run it in production, based on my experience running it at home from b111 to now. The mpt driver problems are just one of many show-stoppers (is that resolved yet, or do we still need magic /etc/system voodoo?). Of course, Solaris 10 couldn't properly drive the Marvell-attached disks in an X4500 prior to U6 either, unless you ran an IDR (pretty inexcusable in a storage-centric server release).

-- Carson
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Another user loses his pool (10TB in this case) and 40 days of work
Yes, if you value your data you should move from USB drives to normal drives. I've heard that USB can do some strange things; a direct connection such as SATA is more reliable.
-- This message posted from opensolaris.org
Re: [zfs-discuss] ZFS slowness under domU high load
zfs ml wrote:
sorry, scratch the above - I didn't see this: 9. domUs have ext3 mounted with: noatime,commit=120
Is the write traffic because you are backing up to the same disks that the domUs live on?

Yes it is.
Re: [zfs-discuss] ZFS slowness under domU high load
Kjetil and Richard, thanks for this.

Kjetil Torgrim Homme wrote:
Bogdan Ćulibrk b...@default.rs writes: What are my options from here? To move onto a zvol with a greater blocksize? 64k? 128k? Or will I get into other trouble going that way when I have small reads coming from the domU (ext3 with a default blocksize of 4k)?

yes, definitely. have you considered using NFS rather than zvols for the data filesystems? (keep zvol for the domU software.)

That makes sense. Would it be useful to simply add a new drive to the domU, backed by a greater-blocksize zvol, or maybe a vmdk file? Does it have to be an NFS backend?

it's strange that you see so much write activity during backup -- I'd expect that to do just reads... what's going on at the domU? generally, the best way to improve performance is to add RAM for ARC (512 MiB is *very* little IMHO) and SSD for your ZIL, but it does seem to be a poor match for your concept of many small low-cost dom0's.

The writes come from the backup being packed before it is transferred to the real backup location. Most likely this is the main reason for the whole problem. One more thing regarding SSD: would it be useful to throw in an additional SAS/SATA drive to serve as L2ARC? I know an SSD is the most logical thing to use as L2ARC, but will a conventional drive be of *any* help as L2ARC?
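A minimal sketch of the blocksize-matching idea discussed above (pool and volume names are hypothetical). volblocksize is fixed at creation time, so matching the guest's ext3 blocks means creating a new zvol and migrating the data:

```shell
# Hypothetical names; volblocksize cannot be changed after creation.
zfs create -V 20G -o volblocksize=4k tank/domu1-data
# Attach tank/domu1-data to the domU as a new virtual disk, then copy
# the data across inside the guest (e.g. with rsync) and switch over.
```

Note the trade-off either way: a 4k volblocksize avoids read-modify-write for small guest I/O but costs more metadata, while a large volblocksize helps the sequential backup workload.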
[zfs-discuss] zfs questions wrt unused blocks
Gents, we want to understand the mechanism of ZFS a bit better. Q: what is the design/algorithm of ZFS in terms of reclaiming unused blocks? Q: what criteria does ZFS use to start reclaiming blocks? The issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk on an NFS server, with a zpool inside that vdisk. This vdisk tends to grow in size even if the user just writes and then deletes a file. The question is whether this reclaiming of unused blocks can kick in earlier, so that the filesystem doesn't grow much beyond what is actually allocated. Thanks, heinz
-- Heinz Zerbes, Security Consultant and Auditor, Sun Microsystems Australia, 33 Berry St., North Sydney, NSW 2060 AU. Phone x59468/+61 2 9466 9468, Mobile +61 410 727 961, Fax +61 2 9466 9411, Email heinz.zer...@sun.com
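On the workaround side: until the backing store can see which blocks are free, the only way to keep the vdisk from growing is to hand the space back as zeros. A crude, hypothetical sketch (the mountpoint and size here are made up for illustration; in practice you would fill most of the free space, not 16 MB):

```shell
# Hypothetical workaround: zero-fill free space inside the guest filesystem
# so a compressing or thin-provisioned backing store can reclaim it.
FS_MNT=${FS_MNT:-$(mktemp -d)}              # stand-in for the guest's mountpoint
dd if=/dev/zero of="$FS_MNT/zerofill.tmp" bs=1M count=16 2>/dev/null
sync                                        # push the zeros to the backing store
rm -f "$FS_MNT/zerofill.tmp"                # the space is free again in the guest
```

This only helps if the NFS-side storage compresses or deduplicates zeros; it does not make ZFS itself release blocks any earlier.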
Re: [zfs-discuss] SSD and ZFS
For those following the saga: with the prefetch problem fixed, and data coming off the L2ARC instead of the disks, the system switched from IO-bound to CPU-bound. I opened up the throttles with some explicit PARALLEL hints in the Oracle commands, and we were finally able to max out the single SSD:

    r/s    w/s      kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w   %b  device
  826.0    3.2  104361.8  35.2   0.0   9.9     0.0    12.0   3  100  c0t0d4

So, when we maxed out the SSD cache, it was delivering 100+ MB/s and 830 IOPS, with 3.4 TB behind it in a 4-disk SATA RAIDZ1. I still have to remap it to 8k blocks to get more efficiency, but for raw numbers, it's right at what I was looking for. Now, to add the second SSD ZIL/L2ARC for a mirror. I may even splurge for one more to get a three-way mirror. That will completely saturate the SCSI channel. Now I need a bigger server. Did I mention it was $1000 for the whole setup? Bah-ha-ha-ha. Tracey

On Sat, Feb 13, 2010 at 11:51 PM, Tracey Bernath tbern...@ix.netcom.com wrote:
OK, that was the magic incantation I was looking for:
- changing the noprefetch option opened the floodgates to the L2ARC
- changing the max queue depth relieved the wait time on the drives, although I may undo this again in the benchmarking since these drives all have NCQ
I went from all four disks of the array at 100%, doing about 170 read IOPS/25 MB/s, to all four disks of the array at 0%, with nearly 500 IOPS/65 MB/s coming off the cache drive (@ only 50% load). This bodes well for adding a second mirrored cache drive to push for the 1K IOPS. Now I am ready to insert the mirror for the ZIL and the CACHE, and we will be ready for some production benchmarking.
BEFORE:
    device   r/s  w/s    kr/s   kw/s  wait  actv  svc_t  %w   %b   us sy wt id
    sd0    170.0  0.4  7684.7    0.0   0.0  35.0  205.3   0  100   11 80  0 82
    sd1    168.4  0.4  7680.2    0.0   0.0  34.6  205.1   0  100
    sd2    172.0  0.4  7761.7    0.0   0.0  35.0  202.9   0  100
    sd4    170.0  0.4  7727.1    0.0   0.0  35.0  205.3   0  100
    sd5      1.6  2.6   182.4  104.8   0.0   0.5  117.8   0   31

AFTER:
    extended device statistics
      r/s  w/s     kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
      0.0  0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d0
      0.0  0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d1
      0.0  0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d2
      0.0  0.0      0.0   0.0   0.0   0.0     0.0     0.0   0   0  c0t0d3
    285.2  0.8  36236.2  14.4   0.0   0.5     0.0     1.8   1  37  c0t0d4

And, keep in mind this was on less than $1000 of hardware. Thanks for the pointers guys, Tracey

On Sat, Feb 13, 2010 at 9:22 AM, Richard Elling richard.ell...@gmail.com wrote: comment below...

On Feb 12, 2010, at 2:25 PM, TMB wrote: I have a similar question. I put together a cheapo RAID with four 1TB WD Black (7200) SATAs in a 3TB RAIDZ1, and I added a 64GB OCZ Vertex SSD, with slice 0 (5GB) for ZIL and the rest of the SSD for cache:

# zpool status dpool
  pool: dpool
 state: ONLINE
 scrub: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        dpool         ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            c0t0d0    ONLINE       0     0     0
            c0t0d1    ONLINE       0     0     0
            c0t0d2    ONLINE       0     0     0
            c0t0d3    ONLINE       0     0     0
        logs
          c0t0d4s0    ONLINE       0     0     0
        cache
          c0t0d4s1    ONLINE       0     0     0
        spares
          c0t0d6      AVAIL
          c0t0d7      AVAIL

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
dpool       72.1G  3.55T    237     12  29.7M   597K
  raidz1    72.1G  3.55T    237      9  29.7M   469K
    c0t0d0      -      -    166      3  7.39M   157K
    c0t0d1      -      -    166      3  7.44M   157K
    c0t0d2      -      -    166      3  7.39M   157K
    c0t0d3      -      -    167      3  7.45M   157K
  c0t0d4s0      20K  4.97G     0      3      0   127K
cache           -      -      -      -      -      -
  c0t0d4s1  17.6G  36.4G      3      1   249K   119K
----------  -----  -----  -----  -----  -----  -----

I just don't seem to be getting any bang for the buck I should be. This was taken while rebuilding an Oracle index, all files stored in this pool. The WD disks are at 100%, and nothing is coming from the cache.
The cache does have the entire DB cached (17.6G used), but hardly reads anything from it. I'm also not seeing the spike of data flowing into the ZIL, although iostat shows there is just write traffic hitting the SSD:

    extended device statistics                                    cpu
    device  r/s  w/s  kr/s  kw/s  wait  actv  svc_t  %w  %b
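For reference, the mirroring step Tracey mentions could be sketched as follows (the second SSD's device names are hypothetical). An existing log device is mirrored with zpool attach, while cache devices cannot be mirrored; additional cache devices simply stripe:

```shell
# Hypothetical second SSD, sliced like the first (s0 for ZIL, s1 for L2ARC).
zpool attach dpool c0t0d4s0 c0t0d5s0   # mirror the existing log slice
zpool add dpool cache c0t0d5s1         # extra L2ARC capacity stripes with the first
```

Mirroring matters much more for the log than the cache: losing an unmirrored L2ARC device costs only cached reads, while losing the ZIL can cost recently committed synchronous writes.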
Re: [zfs-discuss] [networking-discuss] Help needed on big transfers failure with e1000g
Hi, sending to zfs-discuss too since this seems to be related to the zfs receive operation.

The following only holds true when the replication stream (i.e. the delta between snap1 and snap2) is more than about 800GB. If I proceed with this command, the transfer fails after some variable amount of time, usually at about 790GB:

pfexec /sbin/zfs send -DRp -I tank/t...@snap1 tank/t...@snap2 | ssh -c arcfour nc-tanktsm pfexec /sbin/zfs recv -F -d tank

If, however, I proceed with:

pfexec /sbin/zfs send -DRp -I tank/t...@snap1 tank/t...@snap2 | /usr/gnu/bin/dd of=/tank/stream.zfs bs=1M

manually copy the stream.zfs to the remote host through scp, and then run:

/usr/gnu/bin/dd if=/tank/stream.zfs bs=1M | pfexec /sbin/zfs recv -F -d tank

I have no problems at all.

An additional thing to note (perhaps) is that there were no snapshots between snap1 and snap2. I reproduced the behaviour with two different replication streams (one weighing 1.63TB and the other 1.48TB). I tried to add some buffering to the first procedure by piping (on both ends) through dd or mbuffer. While the global throughput slightly improves, the transfer still fails. I also suspected problems with ssh and used netcat, with and without dd, with or without mbuffer. I got even more throughput, but it still failed.

I guess asking to transfer a 1TB+ replication stream live might just be too much, and I might be the only crazy fool to try to do it. I've set up zfs-auto-snap and launch my replication stream more frequently, so as long as zfs-auto-snap doesn't remove intermediate snaps before the script can send them, this has kind of solved it for me. Still, there might be a problem somewhere that others might encounter. Just thought I should report it. - Arnaud

On 09/02/2010 17:41, Arnaud Brand wrote: Sorry for the double-post, I forgot to answer your questions. Arnaud

On 09/02/2010 17:31, Arnaud Brand wrote: Hi James, Sorry to bother you again, I think I found the problem but need confirmation/advice.
It appears that A wants to send through SSH more data to B than B can accept. In fact, B is in the process of committing the zfs recv of a big snapshot, but A still has some snaps to send (zfs send -R). After exactly 6 minutes, B sends a RST to A and both close their connections. Sadly, the receive operation goes away with the ssh session.

What looks strange is that B doesn't reduce its window; it just keeps ACKing the last byte SSH could eat and leaves the window at 64436. I guess it's related to SSH's channel multiplexing features: it can't reduce the window or other channels won't make progress either; am I right here?

I did some ndd lookups in /dev/tcp and /dev/ip to see if I could find a timeout to raise, but found nothing matching the 6 minutes. The docs related to TCP/IP tunables I found on Sun's website didn't bring me much further; I found nothing that seemed applicable.

I agree my case is perhaps a bit overstretched, and I'm going to generate the replication stream in a local file and send it over by hand. Later I shouldn't have snapshots that are that big, and I wish zfs recv wouldn't block for so long either, but still I'm asking myself whether the behavior I'm observing is correct, or whether it's the sign of some misconfiguration on my part (or perhaps I've once more forgotten how things work).

Sorry for my bad English, I hope you understood what I meant. Just in case, I've attached the tcpdump output of the ssh session starting at the very last packet that is accepted and acked by B. A is 192.0.2.2 and B is 192.0.2.1. If you could shed some light on this case I'd be very grateful, but I don't want to bother you. Thanks, Arnaud

On 09/02/2010 14:39, James Carlson wrote: Arnaud Brand wrote: On 08/02/10 23:18, James Carlson wrote: Causes for RST include: - peer application is intentionally setting the linger time to zero and issuing close(2), which results in TCP RST generation.

Might be possible, but I can't see why the receiving end would do that.
No idea, but a debugger on that side might be able to detect something.

- bugs in one or both peers (often related to TCP keepalive; the key signature of such a problem is an apparent two-hour time limit).

That could be it, but I doubt it, since disconnections appeared randomly anywhere in the range of 10 minutes to 13 hours. It should be noted that the node sending the RST keeps the connection open (netstat -a shows it's still established). To be honest, that puzzles me.

That sounds horrible. There's no way a node that still has state for the connection should be sending RST. Normal procedure is to generate RST when you do _not_ have state for the connection or (if you're intentionally aborting the connection) to discard the state at the same time you send RST. That points to either a bug in the peer's TCP/IP implementation or one of the causes that you've dismissed (particularly either a duplicate IP address or a
Re: [zfs-discuss] available space
Hi Charles, What kind of pool is this? The SIZE and AVAIL amounts will vary depending on the ZFS redundancy and whether the deflated or inflated amounts are displayed. I attempted to explain the differences in the zpool list/zfs list display here:

http://hub.opensolaris.org/bin/view/Community+Group+zfs/faq#HZFSAdministrationQuestions

"Why doesn't the space that is reported by the zpool list command and the zfs list command match?"

The zpool list command output in this FAQ is based on the OpenSolaris/Nevada builds and differs in the AVAIL column, which is now displayed as ALLOC and FREE. Thanks, Cindy

On 02/13/10 10:28, Charles Hedrick wrote: I have the following pool:

NAME   SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
OIRT  6.31T  3.72T  2.59T   58%  ONLINE  /

zfs list shows the following for a typical file system:

NAME                    USED  AVAIL  REFER  MOUNTPOINT
OIRT/sakai/production  1.40T  1.77T  1.40T  /OIRT/sakai/production

Why is available lower when shown by zfs than by zpool?
Re: [zfs-discuss] available space
Thanks. That makes sense. This is raidz2.
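The gap is mostly parity accounting: zpool list counts raw space across all disks, while zfs list reports usable space after raidz2's two parity disks' worth is deducted. A back-of-the-envelope sketch (the disk count and size below are hypothetical, not taken from Charles's pool):

```shell
# Hypothetical 6-disk raidz2 built from 1.05T disks.
awk -v n=6 -v s=1.05 'BEGIN {
    printf "raw (zpool-style):   %.2fT\n", n * s        # all six disks
    printf "usable (zfs-style): ~%.2fT\n", (n - 2) * s  # minus two parity disks
}'
```

With these made-up numbers the sketch prints 6.30T raw against roughly 4.20T usable: the same flavor of difference as the 6.31T/combined-AVAIL mismatch in the original post.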
[zfs-discuss] ZFS Volume Destroy Halts I/O
I've seen threads like this around this ZFS forum, so forgive me if I'm covering old ground.

I currently have a ZFS configuration where I have individual drives presented to my OpenSolaris machine, and I'm using ZFS to do a RAIDZ-1 on the drives. I have several filesystems and volumes on this storage pool. When I do a zfs destroy on a volume (and possibly a filesystem, though I haven't tried that yet), I run into two issues. The first is that the destroy command takes several hours to complete; for example, destroying a 10 GB volume on Friday took 5 hours. The second is that, while this command is running, all I/O on the storage pool appears to be halted, or at least paused. There are a few symptoms of this: first, NFS clients accessing volumes on this server just hang and do not respond to commands. Some clients hang indefinitely while others time out and mark the volume as stale. On iSCSI clients, the clients most often time out and disconnect from the iSCSI volume, which is bad for my clients that are booting over those iSCSI volumes.

I'm using the latest OpenSolaris dev build (132) and I have my storage pools and volumes upgraded to the latest available versions. I am using deduplication on my ZFS volumes, set at the highest volume level, so I'm not sure if this has an impact. Can anyone provide any hints as to whether this is a bug or expected behavior, what's causing it, and how I can solve or work around it? Thanks, Nick
Re: [zfs-discuss] ZFS Volume Destroy Halts I/O
On Mon, 15 Feb 2010, Nick wrote: I'm using the latest OpenSolaris dev build (132) and I have my storage pools and volumes upgraded to the latest available versions. I am using deduplication on my ZFS volumes, set at the highest volume level, so I'm not sure if this has an impact. Can anyone provide any hints as to whether this is a bug or expected behavior, what's causing it, and how I can solve or work around it?

There is no doubt that it is both a bug and expected behavior, and that it is related to deduplication being enabled. Others here have reported similar problems. The problem seems to be due to insufficient caching in the ZFS ARC, from not enough RAM or no L2ARC being installed. Some people achieved rapid success after installing an SSD as an L2ARC device. Other people have reported success after moving their pool to a system with a lot more RAM installed. Others have relied on patience. A few have given up and considered their pool totally lost. Bob

-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
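To put rough numbers on the not-enough-RAM theory: every unique block carries a dedup-table entry, commonly ballparked at a few hundred bytes. A hypothetical estimate (the ~320 bytes/entry figure is a frequently quoted rule of thumb, not an exact on-disk size, and the volume/blocksize numbers are illustrative):

```shell
# Hypothetical DDT-size estimate for a 10 GiB volume of 8 KiB blocks.
awk -v gib=10 -v blk_kib=8 -v entry=320 'BEGIN {
    blocks = gib * 1024 * 1024 / blk_kib   # number of 8 KiB blocks
    mib    = blocks * entry / 1048576      # DDT bytes converted to MiB
    printf "%d blocks, ~%d MiB of DDT entries\n", blocks, mib
}'
```

Scale that up across a whole pool and it is easy for DDT entries to fall out of ARC, at which point every free performed by the destroy turns into synchronous disk I/O, which fits the hours-long destroys reported here.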
[zfs-discuss] Plan for upgrading a ZFS based SAN
Good morning all. I am in the process of building my V1 SAN for media storage in house, and I am already thinking of the V2 build... Currently, there are 8 250GB HDDs and 3 500GB disks. The 8 250s are in a RAIDZ2 array, and the 3 500s will be in RAIDZ1... At the moment, the current case is quite full. I am looking at a 20-drive hot-swap case, which I plan to order soon. When the time comes and I start upgrading the drives to larger drives, say 1TB drives, would it be easy to migrate the contents of the RAIDZ2 array to the new array? I see mentions of zfs send and zfs receive, but I have no idea if they would do the job... Thanks in advance. -- Tiernan O'Toole blog.lotas-smartman.net www.tiernanotoolephotography.com www.the-hairy-one.com
Re: [zfs-discuss] Plan for upgrading a ZFS based SAN
Yes, zfs send and receive will do the job; see the zfs manpage for details. James Dickens http://uadmin.blogspot.com
Re: [zfs-discuss] ZFS Volume Destroy Halts I/O
There is no doubt that it is both a bug and expected behavior and is related to deduplication being enabled.

Is it expected because it's a bug, or is it a bug that is not going to be fixed and so I should expect it? Is there a bug/defect I can keep an eye on in one of the OpenSolaris bug/defect interfaces that will help me figure out what's going on with it and when a solution is expected?

Others here have reported similar problems. The problem seems to be due to insufficient caching in the zfs ARC due to not enough RAM or L2ARC not being installed. Some people achieved rapid success after installing a SSD as a L2ARC device. Other people have reported success after moving their pool to a system with a lot more RAM installed. Others have relied on patience. A few have given up and considered their pool totally lost.

I have 8 GB of RAM on my system, which I consider to be a fairly decent amount for a storage system - maybe I'm naive about that, though. 8GB should provide a pretty decent amount of RAM available for caching, so I would think that a 10 GB volume would be able to go through RAM pretty quickly. Also, there isn't much except ZFS and COMSTAR running on this box, so there isn't really anything else using the RAM. I've already considered going and buying some SSDs for the L2ARC stuff, so maybe I'll pursue this path.

I was certainly patient with it - I didn't reboot the box because I could see slow progress on the destroy. However, the other guys in my group, who had stuff hanging off of this ZFS storage and had to wait 5 hours for it to respond to their requests, were not quite so understanding or patient. This is a pretty big roadblock, IMHO, to this being a workable storage solution. I certainly do understand that I'm using the dev releases, so it is under development and I should expect bugs - this one just seems pretty significant, like I would need to schedule maintenance windows to do volume management. Thanks!
-Nick
[zfs-discuss] Shrink the slice used for zpool?
Hi, I recently installed OpenSolaris 2009.06 on a 10GB primary partition on my laptop. I noticed there wasn't any option for customizing the slices inside the Solaris partition. After installation, there was only a single slice (0) occupying the entire partition. Now the problem is that I need to set up a UFS slice for my development. Is there a way to shrink slice 0 (the backing storage for the zpool) and make room for a new slice to be used for UFS? I also tried to create UFS on another primary DOS partition, but apparently only one Solaris partition is allowed per disk. So that failed... Thanks! Yi Zhang
Re: [zfs-discuss] ZFS Volume Destroy Halts I/O
On 02/15/10 10:26, Nick wrote: Is it expected because it's a bug, or is it a bug that is not going to be fixed and so I should expect it? Is there a bug/defect I can keep an eye on in one of the OpenSolaris bug/defect interfaces that will help me figure out what's going on with it and when a solution is expected?

See:

6922161 zio_ddt_free is single threaded with performance impact
6924824 destroying a dedup-enabled dataset bricks system

Both issues stem from the fact that free operations used to be in-memory only, but with dedup enabled they can result in synchronous I/O to the disks in syncing context. - Eric

-- Eric Schrock, Fishworks  http://blogs.sun.com/eschrock
Re: [zfs-discuss] Shrink the slice used for zpool?
Can you create a zvol and use that for ufs? Slow, but ... Casper
Re: [zfs-discuss] ZFS Volume Destroy Halts I/O
Thanks!
Re: [zfs-discuss] Shrink the slice used for zpool?
On Mon, Feb 15, 2010 at 1:48 PM, casper@sun.com wrote: Can you create a zvol and use that for ufs? Slow, but ... Casper

Casper, thanks for the tip! Actually, I'm not sure if this would work for me. I wanted to use directio to bypass the file system cache when reading/writing files. That's why I chose UFS instead of ZFS. Now if I create UFS on top of a zvol, I'm not sure if a call to directio() would actually do its work... Yi
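For what it's worth, UFS directio doesn't have to come from a per-file directio(3C) call; it can also be enabled for a whole filesystem at mount time (the device and mountpoint below are hypothetical):

```shell
# Hypothetical device and mountpoint; forcedirectio applies UFS directio
# to every file on the filesystem, with no application changes needed.
mount -F ufs -o forcedirectio /dev/dsk/c0t0d0s3 /export/dev
```

That sidesteps the question of whether directio() is honored on UFS-over-zvol at the application level, though the zvol's own caching behavior underneath is a separate matter.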
Re: [zfs-discuss] Shrink the slice used for zpool?
On Feb 15, 2010, at 11:15 AM, Yi Zhang wrote: ... Now if I create UFS on top of a zvol, I'm not sure if a call to directio() would actually do its work...

zfs set primarycache=metadata filesystem

-- richard
Re: [zfs-discuss] Shrink the slice used for zpool?
On 15/02/2010 19:15, Yi Zhang wrote: ... I wanted to use directio to bypass the file system cache when reading/writing files. That's why I chose UFS instead of ZFS. Now if I create UFS on top of a zvol, I'm not sure if a call to directio() would actually do its work...

Why not just use ZFS and set the similar option on the ZFS dataset:

zfs set primarycache=metadata datasetname

That is a close approximation to the UFS directio() feature of bypassing storage of the data in the filesystem cache. -- Darren J Moffat
Re: [zfs-discuss] Shrink the slice used for zpool?
Thank you, Darren and Richard. I think this gives what I wanted. Yi
Re: [zfs-discuss] zfs promote
Hi-- From your pre-promotion output, both fs1-patch and snap1 are referencing the same 16.4 GB, which makes sense. I don't see how fs1 could be a clone of fs1-patch, because it should be REFER'ing 16.4 GB as well in your pre-promotion zfs list.

If you snapshot, clone, and promote, then the clone is promoted to be the origin file system and so is now charged the USED space, but the post-promotion snapshot space should remain in the REFER column. Try it yourself with a test file system: create a 100m file, snapshot, and clone. Then promote the clone. You will see that the 100MB in REFER space of the original snapshot becomes 100MB of USED space in the newly promoted clone. The original snapshot remains 100MB in REFER.

If the test snap/clone/promote works, then we need to figure out what's going on in your rgd3/fs* environment. From your output, it almost looks like the space accounting for snap1 and fs1 is reversed in your post-promotion zfs list. I don't know why. I can't reproduce this in the Solaris 10 10/09 release. Thanks, Cindy

# zfs create tank/fs1
# mkfile 100m /tank/fs1/file1
# zfs list -r tank
NAME      USED  AVAIL  REFER  MOUNTPOINT
tank      100M  66.8G    23K  /tank
tank/fs1  100M  66.8G   100M  /tank/fs1
# zfs snapshot tank/f...@snap1
# zfs clone tank/f...@snap1 tank/fs1-patch
# zfs list -r tank
NAME             USED  AVAIL  REFER  MOUNTPOINT
tank             100M  66.8G    23K  /tank
tank/fs1         100M  66.8G   100M  /tank/fs1
tank/f...@snap1     0      -   100M  -
tank/fs1-patch      0  66.8G   100M  /tank/fs1-patch
# zfs promote tank/fs1-patch
# zfs list -r tank
NAME                  USED  AVAIL  REFER  MOUNTPOINT
tank                  100M  66.8G    24K  /tank
tank/fs1                 0  66.8G   100M  /tank/fs1
tank/fs1-patch        100M  66.8G   100M  /tank/fs1-patch
tank/fs1-pa...@snap1     0      -   100M  -

On 02/12/10 19:37, tester wrote: Hello,

# /usr/sbin/zfs list -r rgd3
NAME                   USED  AVAIL  REFER  MOUNTPOINT
rgd3                  16.5G  23.4G    20K  /rgd3
rgd3/fs1                19K  23.4G    21K  /app/fs1
rgd3/fs1-patch        16.4G  23.4G  16.4G  /app/fs1-patch
rgd3/fs1-pa...@snap1  34.8M      -  16.4G  -

# /usr/sbin/zfs promote rgd3/fs1

snap is 16.4G in USED.
# /usr/sbin/zfs list -r rgd3
NAME             USED  AVAIL  REFER  MOUNTPOINT
rgd3            16.5G  23.4G    20K  /rgd3
rgd3/fs1        16.4G  23.4G    21K  /app/fs1
rgd3/f...@snap1 16.4G      -  16.4G  -
rgd3/fs1-patch  33.9M  23.4G  16.4G  /app/fs1-patch

5.10 Generic_141414-10

I tried to line up the numbers, but it does not work. Sorry for the format.
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff beimh...@hotmail.com wrote: I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance. ... From here, I discover the iscsi target on our Windows Server 2008 R2 file server, and see the disk is attached in Disk Management. I initialize the 10TB disk fine, and begin to quick format it. Here is where I begin to see the poor performance issue. The Quick Format took about 45 minutes. And once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.

Did you actually make any progress on this?

I've seen exactly the same thing. Basically, terrible transfer rates with Windows and the server sitting there completely idle. We had support cases open with both Sun and Microsoft, which got nowhere. This seems to me to be more a case of working out where the impedance mismatch is rather than a straightforward performance issue. In my case I could saturate the network from a Solaris client, but only maybe 2% from a Windows box. Yes, tweaking nagle got us to almost 3%. Still nowhere near enough to make replacing our FC SAN with X4540s an attractive proposition.

(And I see that most of the other replies simply asserted that your zfs configuration was bad, without either having experienced this scenario or worked out that the actual delivered performance was an order of magnitude or two short of what even an admittedly sub-optimal configuration ought to have delivered.)

-- -Peter Tribble http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
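For anyone wanting to reproduce the nagle tweak mentioned above on the Solaris side, it is usually done through ndd (a sketch only; verify the tunable on your release, and note it is global to the stack, not per-connection):

```shell
# Check the current Nagle coalescing limit, then set it to 1 byte,
# which effectively disables Nagle's algorithm system-wide.
ndd -get /dev/tcp tcp_naglim_def
ndd -set /dev/tcp tcp_naglim_def 1
```

As the thread notes, this only moved throughput from ~2% to ~3% of line rate, so it treats a symptom rather than the underlying iSCSI/Windows mismatch.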
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff beimh...@hotmail.com wrote:

 I've seen exactly the same thing. Basically, terrible transfer rates with Windows and the server sitting there completely idle.

I am also seeing this behaviour. It started somewhere around snv_111, but I am not sure exactly when. I used to get 30-40 MB/s transfers over CIFS, but at some point that dropped to roughly 7.5 MB/s.
Re: [zfs-discuss] ZFS slowness under domU high load
On Mon, Feb 15, 2010 at 01:45:57PM +0100, Bogdan Ćulibrk wrote:

 One more thing regarding SSD: will it be useful to throw in an additional SAS/SATA drive to serve as L2ARC? I know SSD is the most logical thing to put as L2ARC, but will a conventional drive be of *any* help in L2ARC?

Only in very particular circumstances. L2ARC is a latency play; for it to win, you need the L2ARC device(s) to be lower latency than the primary storage, at least for reads. This usually translates to SSD for lower latency than disk, but it can also work if your data pool has unusually high latency: remote iSCSI, USB, or some other odd, mostly channel-related configurations.

If the reason your disks have high latency is simply high load, L2ARC on another disk might, maybe, work to redistribute some of that load, but it will be a precarious balance, and will probably need several additional disks, perhaps roughly as many as are currently in the pool. By that stage, you're better off just reshaping the pool to use the extra disks to best effect: mirrors vs raidz, more vdevs, etc. Managing all that L2ARC will take memory, too.

In your case, though, a couple of extra disks dedicated to staging whatever transform you're doing to the backup files might be worthwhile, if it will fit. Even if they make the backup transform itself slower (unlikely if it's predominantly sequential), removing the contention impact from the primary service could be a net win.

-- Dan.
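The "latency play" argument can be sketched with a toy model (the latency figures below are hypothetical illustrations, not measurements): a cache tier only wins when its hit latency is below that of the pool it fronts, and a spare conventional disk leaves far less headroom than an SSD.

```python
def effective_read_latency(hit_ratio, cache_latency_ms, pool_latency_ms):
    """Average read latency with a cache device in front of the pool.

    Misses still pay the pool's full latency, so a cache device that is
    not meaningfully faster than the pool barely helps, and one that is
    slower actively hurts.
    """
    return hit_ratio * cache_latency_ms + (1 - hit_ratio) * pool_latency_ms

# Hypothetical numbers: a heavily loaded disk pool at ~20 ms per read,
# an SSD at ~0.2 ms, a single spare conventional disk at ~8 ms.
pool, ssd, spare_disk = 20.0, 0.2, 8.0

ssd_l2arc = effective_read_latency(0.5, ssd, pool)
disk_l2arc = effective_read_latency(0.5, spare_disk, pool)

print(ssd_l2arc, disk_l2arc)   # 10.1 14.0
```

The spare disk only helps while the pool's per-read latency stays well above the spare's own, which is exactly the precarious balance Dan describes: as soon as load shifts onto the cache disk, its latency climbs and the advantage evaporates.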
Re: [zfs-discuss] SSD and ZFS
On Sun, Feb 14, 2010 at 11:08:52PM -0600, Tracey Bernath wrote:

 Now, to add the second SSD ZIL/L2ARC for a mirror.

Just to be clear: mirror the ZIL by all means, but don't mirror L2ARC; just add more devices and let them load-balance. This is especially true if you're sharing SSD writes with the ZIL, as slices on the same devices.

 I may even splurge for one more to get a three way mirror.

With more devices, questions about selecting different devices appropriate for each purpose come into play.

 Now I need a bigger server

See? :)

-- Dan.
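That advice maps onto zpool syntax roughly as follows (a sketch only; the pool name and cNtNdN device names are hypothetical placeholders for your SSD slices):

```shell
# Mirror the separate log (slog/ZIL) devices: losing an unmirrored
# slog can cost the most recent synchronous writes.
zpool add tank log mirror c2t0d0 c2t1d0

# Do NOT mirror cache (L2ARC) devices. Listing them individually adds
# them as independent cache devices, and reads load-balance across them.
zpool add tank cache c2t2d0 c2t3d0
```

A failed cache device is harmless to data: L2ARC holds checksummed copies of blocks that still live in the pool, so reads simply fall back to the primary storage, which is why mirroring it buys nothing.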
Re: [zfs-discuss] Reading ZFS config for an extended period
Just thought I'd chime in for anyone who had read this - the import operation completed this time, after 60 hours of disk grinding. :)
Re: [zfs-discuss] Reading ZFS config for an extended period
The DDT is stored within the pool, IIRC, but there is an RFE open to allow you to store it on a separate top-level vdev, like a slog.

The other thing I've noticed with all of the "destroyed a large dataset with dedup enabled and it's taking forever to import/destroy/<insert function here>" questions is that the process runs so much faster with 8+ GiB of RAM. Almost to a man, everyone who reports these 3-, 4-, or more-day destroys has less than 8 GiB of RAM on the storage server. Just some observations/thoughts.

On Mon, Feb 15, 2010 at 23:14, taemun tae...@gmail.com wrote:

 Just thought I'd chime in for anyone who had read this - the import operation completed this time, after 60 hours of disk grinding. :)

-- 
You can choose your friends, you can choose the deals. - Equity Private
If Linux is faster, it's a Solaris bug. - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
[zfs-discuss] zfs questions wrt unused blocks
Gents,

We want to understand the mechanism of ZFS a bit better.

Q: What is the design/algorithm of ZFS in terms of reclaiming unused blocks?
Q: What criteria does ZFS use to start reclaiming blocks?

The issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk on an NFS server, with a zpool inside that vdisk. This vdisk tends to grow in size even if the user just writes and then deletes a file. The question is whether this reclaiming of unused blocks can kick in earlier, so that the filesystem doesn't grow much beyond what is actually allocated.

Thanks, heinz
Re: [zfs-discuss] Reading ZFS config for an extended period
 RFE open to allow you to store [DDT] on a separate top level VDEV

Hmm, add to this spare, log and cache vdevs, and it's to the point of making another pool and thinly provisioning volumes to maintain partitioning flexibility.

taemun: hey, thanks for closing the loop!

Rob
Re: [zfs-discuss] Reading ZFS config for an extended period
The system in question has 8GB of RAM. It never paged during the import (unless I was asleep at that point, but anyway). It ran for 52 hours, then started doing 47% kernel CPU usage. At this stage, dtrace stopped responding, and so iopattern died, as did iostat. It was also increasing RAM usage rapidly (15 MB/minute). After an hour of that, the CPU went up to 76%. An hour later, CPU usage stopped. Hard drives were churning throughout all of this (albeit at a rate that looks like each vdev is being controlled by a single-threaded operation).

I'm guessing that if you don't have enough RAM, it gets stuck in the use-lots-of-cpu phase, and just dies from too much paging. Of course, I have absolutely nothing to back that up.

Personally, I think that if L2ARC devices were persistent, we would already have the mechanism in place for storing the DDT as a separate vdev. The problem is, there is nothing you can run at boot time to populate the L2ARC, so dedup writes are ridiculously slow until the cache is warm. If the cache stayed warm, or there were an option to forcibly warm up the cache, this could be somewhat alleviated.

Cheers
Re: [zfs-discuss] zfs questions wrt unused blocks
On Feb 15, 2010, at 8:43 PM, heinz zerbes wrote:

 Gents, We want to understand the mechanism of zfs a bit better. Q: what is the design/algorithm of zfs in terms of reclaiming unused blocks? Q: what criteria is there for zfs to start reclaiming blocks

The answer to these questions is too big for an email. Think of ZFS as a very dynamic system with many different factors influencing block allocation.

 Issue at hand is an LDOM or zone running in a virtual (thin-provisioned) disk on a NFS server and a zpool inside that vdisk. This vdisk tends to grow in size even if the user writes and deletes a file again. Question is, whether this reclaiming of unused blocks can kick in earlier, so that the filesystem doesn't grow much more than what is actually allocated?

ZFS is a COW file system, which partly explains what you are seeing. Snapshots, deduplication, and the ZIL complicate the picture.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
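Why copy-on-write makes a thin-provisioned backing file grow even under pure write-then-delete cycles can be sketched with a toy allocator (an illustration of the general COW mechanism only; ZFS's real metaslab allocator is far more sophisticated):

```python
TOTAL_BLOCKS = 1000  # size of the toy virtual disk, in blocks


class CowDisk:
    """Toy copy-on-write block store on a thin-provisioned image.

    New writes go to never-before-touched blocks; freed blocks are only
    reused once fresh space runs out. The backing image's high-water
    mark therefore ratchets upward even though the live data does not.
    """

    def __init__(self):
        self.next_fresh = 0   # next never-touched block
        self.free = []        # blocks freed by deletes
        self.high_water = 0   # blocks the backing image has materialized

    def write_block(self):
        if self.free and self.next_fresh >= TOTAL_BLOCKS:
            return self.free.pop()   # reuse freed space only when forced to
        block = self.next_fresh
        self.next_fresh += 1
        self.high_water = max(self.high_water, self.next_fresh)
        return block


disk = CowDisk()
# Repeatedly write a 10-block file, then delete it.
for _ in range(20):
    live = [disk.write_block() for _ in range(10)]
    disk.free.extend(live)   # "delete": blocks merely become free
    live = []

print(disk.high_water)   # 200 blocks materialized, for 0 live blocks
```

The effect visible from the NFS server is the same as in the toy: freed blocks are never zeroed or handed back to the backing store, so the vdisk's allocated size only grows toward its provisioned maximum.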
Re: [zfs-discuss] Reading ZFS config for an extended period
 The other thing I've noticed with all of the "destroyed a large dataset with dedup enabled and it's taking forever to import/destroy/<insert function here>" questions is that the process runs so much faster with 8+ GiB of RAM. Almost to a man, everyone who reports these 3, 4, or more day destroys has less than 8 GiB of RAM on the storage server.

I've witnessed destroys that take several days on 24GB+ systems (datasets over 30TB). I guess it's just a matter of how large the datasets are vs. how much RAM.

Yours
Markus Kovero
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On 15 feb 2010, at 23.33, Bob Beverage wrote:

 On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff beimh...@hotmail.com wrote: I've seen exactly the same thing. Basically, terrible transfer rates with Windows and the server sitting there completely idle.

 I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.

Wasn't zvol changed a while ago from asynchronous to synchronous? Could that be it?

I don't understand that change at all. Of course a zvol, with or without iSCSI in front of it, should behave exactly like a (not broken) disk, strictly obeying the protocol for write cache, cache flush, etc. Having it entirely synchronous is in many cases almost as useless as having it entirely asynchronous. Just as ZFS itself demands this behaviour from its disks, it should provide the same behaviour itself when used as storage for others.

To me it seems that the zvol+iscsi functionality is not ready for production and needs more work. If anyone has a better explanation, please share it with me!

I guess a good slog could help a bit, especially if you have a bursty write load.

/ragge