Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On Tue, Jul 20, 2010 at 10:29 AM, Chad Cantwell c...@iomail.org wrote:

No, this wasn't it. A non-debug build with the same NIGHTLY_OPTIONS as Rich Lowe's 142 build is still very slow...

On Tue, Jul 20, 2010 at 09:52:10AM -0700, Chad Cantwell wrote:

Yes, I think this might have been it. I missed the NIGHTLY_OPTIONS variable in OpenSolaris and I think it was compiling a debug build. I'm not sure what the ramifications of this are, or how much slower a debug build should be, but I'm recompiling a release build now, so hopefully all will be well. Thanks, Chad

On Tue, Jul 20, 2010 at 08:39:42AM +0100, Robert Milkowski wrote:

On 20/07/2010 07:59, Chad Cantwell wrote:

I've just compiled and booted into snv_142, and I experienced the same slow dd and scrubbing as I did with my 142 and 143 compilations and with the Nexenta 3 RC2 CD. So this would seem to indicate a build environment/process flaw rather than a regression.

Are you sure it is not a debug vs. non-debug issue? -- Robert Milkowski http://milek.blogspot.com

Could it somehow not be compiling 64-bit support? -- Brent Jones br...@servuhome.net

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
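To rule out both suspicions (DEBUG kernel, missing 64-bit support) from the running system, a couple of quick checks; this is a sketch, and the exact output wording varies by build:

```shell
# Confirm the kernel is running 64-bit modules (x86 shown)
isainfo -kv       # expect something like: "64-bit amd64 kernel modules"

# A DEBUG kernel typically identifies itself in mdb's status summary
# (assumption: the exact banner differs across ON builds)
echo "::status" | mdb -k
```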
Re: [zfs-discuss] zfs send to remote any ideas for a faster way than ssh?
On Mon, Jul 19, 2010 at 11:14 AM, Bruno Sousa bso...@epinfante.com wrote:

Hi, if you can share those scripts that make use of mbuffer, please feel free to do so ;) Bruno

On 19-7-2010 20:02, Brent Jones wrote:

On Mon, Jul 19, 2010 at 9:06 AM, Richard Jahnel rich...@ellipseinc.com wrote:

I've tried ssh with blowfish and scp with arcfour; both are CPU-limited long before the 10g link is. I've also tried mbuffer, but I get broken pipe errors partway through the transfer. I'm open to ideas for faster ways to do either zfs send directly or through a compressed file of the zfs send output. For the moment I: zfs send | pigz, scp (arcfour) the .gz file to the remote host, gunzip | zfs receive. This takes a very long time for 3 TB of data, and barely makes use of the 10g connection between the machines due to the CPU limiting on the scp and gunzip processes. Thank you for your thoughts. Richard J. -- This message posted from opensolaris.org

I found builds around 130 had issues with TCP. I could reproduce TCP timeouts/socket errors up until I got on 132. I have stayed on 132 so far since I haven't found any other showstoppers. Mbuffer is probably your best bet; I rolled mbuffer into my replication scripts, which I could share if anyone's interested. Older versions of my script are on www.brentrjones.com, but I have a new one which uses mbuffer. I can't seem to upload files to my WordPress site any longer, so I put it up on Pastebin for now: http://pastebin.com/2feviTCy Hope it helps others -- Brent Jones br...@servuhome.net
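A sketch of the mbuffer-based transfer being described (host names, port, and buffer sizes here are placeholders, not taken from the script):

```shell
# On the receiving host, start the listener first:
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/backup

# On the sending host, stream into mbuffer over plain TCP instead of ssh:
zfs send tank/data@snap | mbuffer -s 128k -m 1G -O receiver:9090
```

Unlike ssh, nothing here is encrypted, but there is also no cipher burning CPU, and the memory buffer keeps the pipe full across the bursty output of zfs send.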
[zfs-discuss] ZFS list snapshots incurs large delay
I have been running a pair of X4540s for almost 2 years now, the usual spec (quad core, 64GB RAM, 48x 1TB). I have a pair of mirrored drives for rpool, and a raidz set with 5-6 disks in each vdev for the rest of the disks. I am running snv_132 on both systems.

I noticed an oddity on one particular system: when running a scrub, or a zfs list -t snapshot, the results take forever. Mind you, these are identical systems in hardware and software. The primary system replicates all data sets to the secondary nightly, so there isn't much of a discrepancy in space used.

Primary system:
# time zfs list -t snapshot | wc -l
979
real 1m23.995s
user 0m0.360s
sys 0m4.911s

Secondary system:
# time zfs list -t snapshot | wc -l
979
real 0m1.534s
user 0m0.223s
sys 0m0.663s

At the time of running both of those, no other activity was happening, with a load average of .05 or so. Subsequent runs take just as long on the primary: no matter how many times I run it, it takes about 1 minute and 25 seconds each time, with very little drift (+- 1 second, if that).

Both systems are at about 77% used space on the storage pool, with no other distinguishing factors that I can discern. Upon a reboot, performance is respectable for a little while, but within days it sinks back to those levels. I suspect a memory leak, but both systems run the same software versions and packages, so I can't envision that. Would anyone have any ideas what may cause this? -- Brent Jones br...@servuhome.net
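One way to test the memory-leak theory is to sample kernel memory on the slow system periodically and compare across the days it takes to degrade (a sketch; both commands need root on the Solaris box):

```shell
# ARC size in bytes; log it hourly and watch the trend
kstat -p zfs:0:arcstats:size

# Breakdown of kernel vs. ZFS vs. free pages
echo "::memstat" | mdb -k
```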
Re: [zfs-discuss] [osol-help] ZFS list snapshots incurs large delay
It could be a disk failing and dragging I/O down with it. Try to check for high asvc_t with `iostat -xCn 1` and errors in `iostat -En`. Any timeouts or retries in /var/adm/messages? -- Giovanni Tirloni gtirl...@sysdroid.com

I checked for high service times during a scrub, and all disks are pretty equal. During a scrub, each disk peaks at about 350 reads/sec, with an asvc time of up to 30 during those read spikes (I assume it means 30ms, which isn't terrible for a highly loaded SATA disk). No errors reported by smartctl, iostat, or adm/messages.

I opened a case on Sunsolve, but I fear that since I am running a dev build I will be out of luck. I cannot run 2009.06 due to CIFS segfaults and problems with zfs send/recv hanging pools (well-documented issues). I'd run Solaris proper, but not having in-kernel CIFS or COMSTAR would be a major setback for me. -- Brent Jones br...@servuhome.net
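Scanning for one outlier disk can be automated. This is a minimal sketch using an invented `iostat -xCn` sample (device names and numbers are made up for illustration) and an awk filter that flags any device whose asvc_t exceeds 30 ms; in practice you would pipe live iostat output through the same filter:

```shell
# Fake iostat sample; field 8 is asvc_t, field 11 is the device name
cat > /tmp/iostat_sample.txt <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  350.0    1.0 2800.0    8.0  0.0  2.1    0.0   29.5   0  85 c7t0d0
  348.0    0.0 2790.0    0.0  0.0  9.8    0.0  212.4   0  99 c7t1d0
  351.0    2.0 2805.0   16.0  0.0  2.0    0.0   28.9   0  84 c7t2d0
EOF

# Print any device whose average service time is above 30 ms
awk 'NR > 1 && $8 > 30 { print $11, $8 }' /tmp/iostat_sample.txt
```

In this made-up sample, only c7t1d0 would be flagged.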
Re: [zfs-discuss] COMSTAR iSCSI and two Windows computers
On Thu, Jun 17, 2010 at 10:44 PM, Giovanni giof...@gmail.com wrote:

Hi guys, I wanted to ask how I could set up an iSCSI device to be shared by 2 computers concurrently; by that I mean sharing files like it was an NFS share, but using iSCSI instead. I tried to set up iSCSI on both computers and was able to see my files (I had formatted it NTFS before). From my laptop I uploaded a 400MB video file to the root directory, and from my desktop I browsed the same directory and the file was not there?? Thanks

iSCSI is not a clustered file system; in fact, it isn't a file system at all. For iSCSI, you need to configure data fencing, typically handled by clustering suites from various operating systems, to control which host has access to the iSCSI volumes at a time. You should stick to CIFS or NFS, or investigate a real clustered file system. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] iScsi slow
On Wed, May 26, 2010 at 5:08 AM, Matt Connolly matt.connolly...@gmail.com wrote:

I've set up an iSCSI volume on OpenSolaris (snv_134) with these commands:

sh-4.0# zfs create rpool/iscsi
sh-4.0# zfs set shareiscsi=on rpool/iscsi
sh-4.0# zfs create -s -V 10g rpool/iscsi/test

The underlying zpool is a mirror of two SATA drives. I'm connecting from a Mac client with globalSAN initiator software, connected via gigabit LAN. It connects fine, and I've initialised a Mac-format volume on that iSCSI volume. Performance, however, is terribly slow: about 10 times slower than an SMB share on the same pool. I expected it would be very similar to, if not faster than, SMB. Here are my test results copying 3GB of data:

iSCSI: 44m01s (1.185MB/s)
SMB share: 4m27s (11.73MB/s)

Reading (the same 3GB) is also worse than SMB, but only by a factor of about 3:

iSCSI: 4m36s (11.34MB/s)
SMB share: 1m45s (29.81MB/s)

Is there something obvious I've missed here? -- This message posted from opensolaris.org

Try jumbo frames, and make sure flow control is enabled on your iSCSI switches and all network cards. -- Brent Jones br...@servuhome.net
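The jumbo-frames suggestion, as a sketch in Solaris terms (the link name is a placeholder, and the switch and the Mac initiator must be set to the same MTU for this to help):

```shell
dladm set-linkprop -p mtu=9000 e1000g0   # enable jumbo frames on the NIC
dladm show-linkprop -p mtu e1000g0       # verify the new MTU took effect
```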
Re: [zfs-discuss] send/recv over ssh
On Thu, May 20, 2010 at 3:42 PM, Brandon High bh...@freaks.com wrote:

On Thu, May 20, 2010 at 1:23 PM, Thomas Burgess wonsl...@gmail.com wrote: I know I'm probably doing something REALLY stupid... but for some reason I can't get send/recv to work over ssh. I just built a new media server and

Unless you need the send to be encrypted, ssh is going to slow you down a lot. I've used mbuffer when doing sends on the same network; it's worked well. -B -- Brandon High : bh...@freaks.com

The problem with mbuffer: if you do scripted send/receives, you'd have to pre-start an mbuffer session on the receiving end somehow. SSH is always running on the receiving end, so no issues there. -- Brent Jones br...@servuhome.net
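The pre-start problem can itself be scripted: launch the receiving mbuffer over ssh in the background, then stream. This is a sketch (host, port, datasets, and the fixed sleep are placeholders; a robust script would poll until the port is actually open):

```shell
# Start the receiver remotely, in the background
ssh recvhost 'mbuffer -q -s 128k -m 512M -I 9090 | zfs receive -F tank/backup' &

sleep 5   # crude: give the remote listener a moment to open its port

# Now stream the snapshot over the buffered TCP connection
zfs send tank/data@snap | mbuffer -q -s 128k -m 512M -O recvhost:9090
wait      # reap the background ssh once the receive completes
```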
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss k.we...@science-computing.de wrote:

Hi Adam,

"Very interesting data. Your test is inherently single-threaded, so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than single-threaded latency."

Thanks for your reply. I'll probably test the multiple-writer case, too. But frankly, at the moment I care most about the single-threaded case, because if we put e.g. user homes on this server, I think the users would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without an SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the performance down to acceptable runtimes. But IMHO 2m42s is still too slow, and disabling the ZIL seems to be the only option. Knowing that hundreds of users could do this in parallel with good performance is nice, but it does not improve the situation for the single user who only cares about his own tar run. If there's anything else we can do/try to improve the single-threaded case, I'm all ears. -- This message posted from opensolaris.org

Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. You'd be better off getting a NetApp. -- Brent Jones br...@servuhome.net
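On builds of this era, disabling the ZIL meant a global /etc/system tunable and a reboot (there was no per-dataset sync property yet). It trades the crash safety of synchronous writes for latency, so it is a test-only setting:

```
* /etc/system fragment: disable the ZIL globally (requires reboot)
set zfs:zil_disable = 1
```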
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
On Fri, Mar 5, 2010 at 10:49 AM, Tonmaus sequoiamo...@gmx.net wrote:

Hi, I have tried what dedup does on a test dataset that I filled with 372 GB of partly redundant data. I used snv_133. All in all, it was successful. The net data volume was only 120 GB. Destruction of the dataset finally took a while, but without compromising anything else. After this successful test I am planning to use dedup productively soon. Regards, Tonmaus -- This message posted from opensolaris.org

120GB isn't a large enough test. Do what you will, but there have now been at least a dozen reports of people locking up their 7000 series and X4500/X4540s by enabling dedup on large datasets, myself included. Check CR 6924390 for updates (if any). -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
On Thu, Mar 4, 2010 at 8:08 AM, Henrik Johansson henr...@henkis.net wrote:

Hi all, now that the Fishworks 2010.Q1 release seems to get deduplication, does anyone know if bug ID 6924824 (destroying a dedup-enabled dataset bricks system) is still valid? It has not been fixed in ON, and it is not mentioned in the release notes. This is one of the bugs I've been keeping my eyes on before using dedup for any serious work, so I was a bit surprised to see that it was in the 2010.Q1 release but not fixed in ON. It might not be an issue; I'm just curious, both from a Fishworks perspective and from an OpenSolaris perspective. Regards, Henrik http://sparcv9.blogspot.com

My rep says "Use dedup at your own risk at this time." Guess they've been seeing a lot of issues, and regardless of whether it's 'supported' or not, he said not to use it. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 17, 2010 at 11:03 PM, Matt registrat...@flash.shanje.com wrote:

No SSD log device yet. I also tried disabling the ZIL, with no effect on performance. Also, what's the best way to test local performance? I'm _somewhat_ dumb as far as OpenSolaris goes, so if you could provide me with an exact command line for testing my current setup (exactly as it appears above), I'd love to report the local I/O readings. -- This message posted from opensolaris.org

No one has said whether they're using dsk, rdsk, or file-backed COMSTAR LUNs yet. I'm using file-backed COMSTAR LUNs, with the ZIL currently disabled. I can get between 100-200MB/sec, depending on random/sequential workload and block sizes. Using dsk/rdsk, I was not able to see that level of performance at all. -- Brent Jones br...@servuhome.net
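For reference, a file-backed COMSTAR LU is built roughly like this (paths and size are placeholders; the GUID comes from sbdadm's output and is truncated here):

```shell
mkfile 100g /tank/luns/lun0        # preallocate a backing file on the pool
sbdadm create-lu /tank/luns/lun0   # register the file as a logical unit
stmfadm list-lu -v                 # note the GUID assigned to the new LU
stmfadm add-view 600144F0...       # export the LU to initiators
```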
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 17, 2010 at 10:42 PM, Matt registrat...@flash.shanje.com wrote:

I've got a very similar rig to the OP's showing up next week (plus an InfiniBand card). I'd love to get this performing up to gigabit Ethernet speeds; otherwise I may have to abandon the iSCSI project if I can't get it to perform.

Do you have an SSD log device? If not, try disabling the ZIL temporarily to see if that helps. Your workload will likely benefit from a log device. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 10, 2010 at 3:12 PM, Marc Nicholas geekyth...@gmail.com wrote:

How does lowering the flush interval help? If he can't ingress data fast enough, faster flushing is a Bad Thing(tm). -marc

On 2/10/10, Kjetil Torgrim Homme kjeti...@linpro.no wrote:

Bob Friesenhahn bfrie...@simple.dallas.tx.us writes:

On Wed, 10 Feb 2010, Frank Cusack wrote: The other three commonly mentioned issues are:
- Disable the Nagle algorithm on the Windows clients.
(For iSCSI? Shouldn't be necessary.)
- Set the volume block size so that it matches the client filesystem block size (default is 128K!).
(The default for a zvol is 8 KiB.)
- Check for an abnormally slow disk drive using 'iostat -xe'.

His problem is lazy ZFS: notice how it gathers up data for 15 seconds before flushing the data to disk. Tweaking the flush interval down might help.

"An iostat -xndz 1 readout of the %b column during a file copy to the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2 seconds at 100, and repeats."

What are the other values? I.e., the number of ops and the actual amount of data read/written. -- Kjetil T. Homme, Redpill Linpro AS - Changing the game

ZIL performance issues? Is write cache enabled on the LUNs? -- Brent Jones br...@servuhome.net
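The "flush interval" under debate is the txg sync interval. On builds of this vintage it was an /etc/system tunable; the name and default below are as I recall them for ~snv_130, so treat this as an assumption to verify against your build before using it:

```
* /etc/system fragment: shorten the txg sync interval (seconds)
set zfs:zfs_txg_timeout = 5
```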
Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance
On Wed, Feb 10, 2010 at 4:05 PM, Brent Jones br...@servuhome.net wrote:

[...]

ZIL performance issues? Is write cache enabled on the LUNs? -- Brent Jones br...@servuhome.net

Also, are you using rdsk-based iSCSI LUNs, or file-based LUNs? -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Help needed with zfs send/receive
On Tue, Feb 2, 2010 at 12:05 PM, Arnaud Brand t...@tib.cc wrote:

Hi folks, I'm having (as the title suggests) a problem with zfs send/receive. The command line is like this:

pfexec zfs send -Rp tank/t...@snapshot | ssh remotehost pfexec zfs recv -v -F -d tank

This works like a charm as long as the snapshot is small enough. When it gets too big (meaning somewhere between 17G and 900G), I get ssh errors (can't read from remote host). I tried various encryption options (the fastest being, in my case, arcfour) with no better results. I tried to set up a script to insert dd on the sending and receiving sides to buffer the flow; still read errors. I tried with mbuffer (which gives better performance); it didn't get better. Today I tried with netcat (and mbuffer) and I got better throughput, but it failed at 269GB transferred. The two machines are connected to the switch with 2x1GbE (Intel) joined together with LACP. The switch logs show no errors on the ports. kstat -p | grep e1000g shows one recv error on the sending side. I can't find anything in the logs which could give me a clue about what's happening. I'm running build 131. If anyone has the slightest clue of where I could look or what I could do to pinpoint/solve the problem, I'd be very grateful if (s)he could share it with me. Thanks, and have a nice evening. Arnaud

This issue seems to have started after snv_129 for me. I get "connection reset by peer", or transfers (of any kind) simply time out. Smaller transfers succeed most of the time, while larger ones usually fail. Rolling back to snv_127 (my last one) does not exhibit this issue. I have not had time to narrow down any causes, but I did find one bug report that found some TCP test scenarios failed during one of the builds; I am unable to find that CR at this time.
-- Brent Jones br...@servuhome.net
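When large transfers die mid-stream like this, the TCP counters on both ends can help distinguish network-level loss from an application problem. A sketch (Solaris commands; the exact counter names vary by release and NIC driver):

```shell
netstat -s -P tcp | egrep -i 'retrans|drop'   # TCP retransmits and drops
kstat -p e1000g | egrep -i 'err|drop|reset'   # per-NIC error counters
```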
Re: [zfs-discuss] Help needed with zfs send/receive
On Tue, Feb 2, 2010 at 7:41 PM, Brent Jones br...@servuhome.net wrote:

[...]
Ah, I found the CR that seems to describe the situation (broken pipe/connection reset by peer): http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6905510 -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] ZFS filesystem lock after running auto-replicate.ksh - how to clear?
On Sat, Jan 23, 2010 at 8:44 AM, Fletcher Cocquyt fcocq...@stanford.edu wrote:

Fletcher Cocquyt fcocquyt at stanford.edu writes: I found this script for replicating zfs data: http://www.infrageeks.com/groups/infrageeks/wiki/8fb35/zfs_autoreplicate_script.html - I am testing it out in the lab with b129. It errored out the first run with some syntax error about the send component (recursive needed?) ..snip.. How do I clear the lock? I have not been able to find documentation on this... thanks!

Hi, as one helpful user pointed out, the lock is not from ZFS, but an attribute set by the script to prevent contention (multiple replications, etc.). I used zfs get/set to clear the attribute, and I was able to replicate the initial dataset. Still working on the incrementals! Thanks!

As the person who put in the original code for the ZFS lock/depend checks: the script is relatively simple. It seems Infrageeks added some better documentation, which is very helpful. You'll want to make sure your remote side doesn't differ, i.e. that it has the same current snapshots as the sender side. If the replication fails for some reason, unlock both sides with 'zfs set'. What problems are you experiencing with incrementals? -- Brent Jones br...@servuhome.net
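The lock in question is just a ZFS user property, so clearing it by hand is a get/inherit pair. The property name below is a made-up placeholder; check the script itself for the real one:

```shell
zfs get all tank/data | grep -i lock       # find the script's lock property
zfs inherit com.example:locked tank/data   # remove the user property
```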
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Wed, Jan 6, 2010 at 2:40 PM, Saso Kiselkov skisel...@gmail.com wrote:

Buffering the writes in the OS would work for me as well; I've got RAM to spare. Slowing down rm is perhaps one way to go, but definitely not a real solution. On rare occasions I could still get lockups, leading to screwed-up recordings, and if there's one thing people don't like about IPTV, it's packet loss. Eliminating even the possibility of packet loss completely would be the best way to go, I think. Regards, -- Saso

I shouldn't dare suggest this, but what about disabling the ZIL? Since this sounds like transient data to begin with, any risks would be pretty low, I'd imagine. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] zvol (slow) vs file (fast) performance snv_130
On Wed, Dec 30, 2009 at 9:35 PM, Ross Walker rswwal...@gmail.com wrote:

On Dec 30, 2009, at 11:55 PM, Steffen Plotner swplot...@amherst.edu wrote:

Hello, I was doing performance testing, validating zvol performance in particular, and found zvol write performance to be slow: ~35-44MB/s at 1MB block-size writes. I then tested the underlying zfs file system with the same test and got 121MB/s. Is there any way to fix this? I really would like to have comparable performance between the zfs filesystem and the zfs zvols.

Been there. Zvols were changed a while ago to make each operation synchronous so as to provide data consistency in the event of a system crash or power outage, particularly when used as backing stores for iscsitgt or COMSTAR. While I think that change is necessary, I think they should have made the cooked 'dsk' device node run with caching enabled to provide an alternative for those willing to take the risk, or modified iscsitgt/COMSTAR to issue a sync after every write if write caching is enabled on the backing device and the user doesn't want to write-cache, or advertised WCE on the mode page to the initiators and let them sync. I also believe performance can be better. When using zvols with iscsitgt and COMSTAR, I was unable to break 30MB/s with a 4k sequential read workload to a zvol with a 128k recordsize (recommended for sequential I/O), which is not very good. To the same hardware running Linux and iSCSI Enterprise Target, I was able to drive over 50MB/s with the same workload. This isn't writes, just reads. I was able to do somewhat better going to the physical device with iscsitgt and COMSTAR, but not as good as Linux, so I kept on using Linux for iSCSI and Solaris for NFS, which performed better. -Ross

I also noticed that when using zvols instead of files, for 20MB/sec of read I/O, I saw as many as 900 iops to the disks themselves.
When using file-based LUNs with COMSTAR, doing 20MB/sec of read I/O issues just a couple hundred iops. To get decent performance, I was required to either throw away my X4540s and switch to 7000s with expensive SSDs, or switch to file-based COMSTAR LUNs and disable the ZIL :( Sad when a $50k piece of equipment requires such a sacrifice. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] zvol (slow) vs file (fast) performance snv_130
On Wed, Dec 30, 2009 at 8:55 PM, Steffen Plotner swplot...@amherst.edu wrote:

Hello, I was doing performance testing, validating zvol performance in particular, and found zvol write performance to be slow: ~35-44MB/s at 1MB block-size writes. I then tested the underlying zfs file system with the same test and got 121MB/s. Is there any way to fix this? I really would like to have comparable performance between the zfs filesystem and the zfs zvols.

# first test is a file test at the root of the zpool vg_satabeast8_vol0
dd if=/dev/zero of=/vg_satabeast8_vol0/testing bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 285.037 s, 121 MB/s

# create zvol
zfs create -V 100G -b 4k vg_satabeast8_vol0/lv_test

# test zvol with 'dsk' device
dd if=/dev/zero of=/dev/zvol/dsk/vg_satabeast8_vol0/lv_test bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 981.219 s, 35.0 MB/s

# test zvol with 'rdsk' device (better than 'dsk', but not as good as a regular file)
dd if=/dev/zero of=/dev/zvol/rdsk/vg_satabeast8_vol0/lv_test bs=1M count=32768
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB) copied, 766.247 s, 44.8 MB/s

uname -a
SunOS zfs-debug-node 5.11 snv_130 i86pc i386 i86pc Solaris

I believe this problem is affecting performance tests others are doing with COMSTAR and exported zvol logical units. Steffen

___ Steffen Plotner, Systems/Network Administrator/Programmer, Systems Networking, Amherst College, PO BOX 5000, Amherst, MA 01002-5000, Tel (413) 542-2348, Fax (413) 542-2626, swplot...@amherst.edu

Why did you make the zvol have 4k blocks? I'd let ZFS manage that for you, which by default I believe is 128K. -- Brent Jones br...@servuhome.net
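The same style of dd comparison can be rehearsed locally against a plain file to see how much of a difference block size alone makes in the write path (paths and sizes are arbitrary; this writes two 32 MB files to /tmp):

```shell
# Same total bytes, two block sizes; dd prints throughput on stderr
dd if=/dev/zero of=/tmp/ddtest.4k bs=4k count=8192 2>&1 | tail -1
dd if=/dev/zero of=/tmp/ddtest.1m bs=1M count=32 2>&1 | tail -1
```

Against a cached file this mostly measures the syscall overhead of the smaller block size; the zvol numbers above differ so much because every zvol write additionally goes through the synchronous path.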
Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot
On Sun, Dec 27, 2009 at 1:35 PM, Brent Jones br...@servuhome.net wrote:

[...]

I did some more testing, and it seems this is 100% reproducible ONLY if the file system and/or entire pool had compression or de-dupe enabled at one point.
It doesn't seem to matter whether de-dupe/compression was enabled for 5 minutes or for the entire life of the pool: as soon as either is turned on in snv_130, doing any type of mass change (like deleting a big file system) will hang ALL I/O for a significant amount of time. If I create a filesystem with neither enabled, fill it with a few TB of data, and do a 'zfs destroy' on it, it goes pretty quickly, just a couple of minutes, with no noticeable impact to system I/O.

I'm curious about the 7000 series appliances, since those supposedly ship now with de-dupe as a fully supported option. Is the core ZFS code significantly different on the 7000 appliances than in a recent build of OpenSolaris? My sales rep assures me there's very little overhead from enabling de-dupe on the 7000 series (which he's trying to sell us, obviously), but I can't see how that could be, when I have the same hardware the 7000s run on (a fully loaded X4540). Any thoughts from anyone? -- Brent Jones br...@servuhome.net
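Before enabling dedup on a big pool, the size of the dedup table itself is worth measuring, since destroy time scales with it; zdb can print the DDT histogram and its in-core footprint (pool name is a placeholder):

```shell
zdb -DD tank   # DDT entry counts and estimated in-core size per class
```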
Re: [zfs-discuss] Troubleshooting dedup performance
the previously high performance. A bit of a let down, so I will wait on the sidelines for this feature to mature. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [osol-help] zfs destroy stalls, need to hard reboot
On Sun, Dec 27, 2009 at 12:55 AM, Stephan Budach stephan.bud...@jvm.de wrote: Brent, I had known about that bug a couple of weeks ago, but that bug was filed against v111 and we're at v130. I have also searched the ZFS part of this forum and really couldn't find much about this issue. The other issue I noticed is that, contrary to the statements I read, which said that once ZFS is underway destroying a big dataset other operations would continue to work, that doesn't seem to be the case. When destroying the 3 TB dataset, the other zvol that had been exported via iSCSI stalled as well, and that's really bad. Cheers, budy -- This message posted from opensolaris.org ___ opensolaris-help mailing list opensolaris-h...@opensolaris.org I just tested your claim, and you appear to be correct. I created a couple of dummy ZFS filesystems, loaded them with about 2TB, exported them via CIFS, and destroyed one of them. The destroy took the usual amount of time (about 2 hours), and actually, quite to my surprise, all I/O on the ENTIRE zpool stalled. I don't recall seeing this prior to 130; in fact, I know I would have noticed this, as we create and destroy large ZFS filesystems very frequently. So it seems the original issue I reported many months back has actually gained some new negative impacts :( I'll try to escalate this with my Sun support contract, but Sun support still isn't very familiar/clued in about OpenSolaris, so I doubt I will get very far. Cross-posting to ZFS-discuss also, as others may have seen this and know of a solution/workaround. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Fri, Dec 25, 2009 at 9:56 PM, Tim Cook t...@cook.ms wrote: On Fri, Dec 25, 2009 at 11:43 PM, Brent Jones br...@servuhome.net wrote: Hang on... if you've got 77 concurrent threads going, I don't see how that's a sequential I/O load. To the backend storage it's going to look like the equivalent of random I/O. I'd also be surprised to see 12 1TB disks supporting 600MB/sec throughput and would be interested in hearing where you got those numbers from. Is your video capture doing 430MB or 430Mbit? -- --Tim Think he said 430Mbit/sec, which, if these are security cameras, would be a good-sized installation (30+ cameras). We have a similar system, albeit running on Windows. Writing about 400Mbit/sec using just six 1TB SATA drives is entirely possible, and working quite well on our system without any frame loss or much latency. Once again, Mb or MB? They're two completely different numbers. As for getting 400Mbit out of 6 SATA drives, that's not really impressive at all. If you're saying you got 400MB, that's a different story entirely, and while possible with sequential I/O and a proper RAID setup, it isn't happening with random. Mb, megabit. 400 megabit is not terribly high; a single SATA drive could write that 24/7 without a sweat. Which is why he is reporting his issue. Sequential or random, any modern system should be able to perform that task without causing disruption to other processes running on the system (if Windows can, Solaris/ZFS most definitely should be able to). I have a similar workload on my X4540's, streaming backups from multiple systems at a time. These are very high-end machines: dual quad-core Opterons, 64GB RAM, and 48x 1TB drives in 5-6 disk RAIDZ vdevs. The write stalls have been a significant problem since ZFS came out, and haven't really been addressed in an acceptable fashion yet, though work has been done to improve it. 
I'm still trying to find the case number I have open with Sunsolve or whatever, it was for exactly this issue, and I believe the fix was to add dozens more classes to the scheduler, to allow more fair disk I/O and overall niceness on the system when ZFS commits a transaction group. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write bursts cause short app stalls
On Fri, Dec 25, 2009 at 7:47 PM, Tim Cook t...@cook.ms wrote: On Fri, Dec 25, 2009 at 11:57 AM, Saso Kiselkov skisel...@gmail.com wrote: I've started porting a video streaming application to OpenSolaris on ZFS, and am hitting some pretty weird performance issues. The thing I'm trying to do is run 77 concurrent video capture processes (roughly 430Mbit/s in total) all writing into separate files on a 12TB J4200 storage array. The disks in the array are arranged into a single RAID-0 ZFS volume (though I've tried different RAID levels, none helped). CPU performance is not an issue (barely hitting 35% utilization on a single quad-core X2250 CPU). I/O bottlenecks can also be ruled out, since the storage array's sequential write performance is around 600MB/s. The problem is the bursty behavior of ZFS writes. All the capture processes do, in essence, is poll() on a socket and then read() and write() any available data from it to a file. The poll() call is done with a timeout of 250ms, expecting that if no data arrives within 0.25 seconds, the input is dead and recording stops (I tried increasing this value, but the problem still arises, although not as frequently). When ZFS decides that it wants to commit a transaction group to disk (every 30 seconds), the system stalls for a short amount of time, and depending on the number of capture processes currently running, the poll() call (which usually blocks for 1-2ms) takes on the order of hundreds of ms, sometimes even longer. I figured that I might be able to resolve this by lowering the txg timeout to something like 1-2 seconds (I need ZFS to write as soon as data arrives, since it will likely never be overwritten), but I couldn't find any tunable parameter for it anywhere on the net. On FreeBSD, I think this can be done via the vfs.zfs.txg_timeout sysctl. 
A glimpse into the source at http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/txg.c on line 40 made me worry that somebody may have hard-coded this value into the kernel, in which case I'd be pretty much screwed in OpenSolaris. Any help would be greatly appreciated. Regards, -- Saso Hang on... if you've got 77 concurrent threads going, I don't see how that's a sequential I/O load. To the backend storage it's going to look like the equivalent of random I/O. I'd also be surprised to see 12 1TB disks supporting 600MB/sec throughput and would be interested in hearing where you got those numbers from. Is your video capture doing 430MB or 430Mbit? -- --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Think he said 430Mbit/sec, which, if these are security cameras, would be a good-sized installation (30+ cameras). We have a similar system, albeit running on Windows. Writing about 400Mbit/sec using just six 1TB SATA drives is entirely possible, and working quite well on our system without any frame loss or much latency. The write lag is noticeable with ZFS, however, as is the behavior of the transaction group writes. If you have a big write that needs to land on disk, it seems all other I/O, CPU time, and niceness are thrown out the window in favor of getting all that data on disk. I was on a watch list for a ZFS I/O scheduler bug with my paid Solaris support; I'll try to find that bug number, but I believe some improvements were made in 129 and 130. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
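For what it's worth, on OpenSolaris builds of that era the value worried about above was a kernel variable rather than a documented tunable, but it could be poked at runtime with mdb. A sketch under that assumption (unsupported tuning; 5 seconds is an illustrative value, the variable name is as it appears in the txg.c source linked above):

```shell
# Lower the ZFS transaction group timeout from the default 30s to 5s
# ('0t' marks the value as decimal in mdb syntax; -kw = live kernel, writable)
echo 'zfs_txg_timeout/W 0t5' | mdb -kw

# Read back the current value to confirm
echo 'zfs_txg_timeout/D' | mdb -k
```

The change does not survive a reboot, which for experimentation like this is arguably a feature.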
Re: [zfs-discuss] zfs zend is very slow
On Wed, Dec 16, 2009 at 12:19 PM, Michael Herf mbh...@gmail.com wrote: Mine is similar (4-disk RAIDZ1) - send/recv with dedup on: 4MB/sec - send/recv with dedup off: ~80MB/sec - send to /dev/null: ~200MB/sec. I know dedup can save some disk bandwidth on write, but it shouldn't save much read bandwidth (so I think these numbers are right). There's a warning in a Jeff Bonwick post that if the DDT (de-dupe tables) don't fit in RAM, things will be slower. Wonder what that threshold is? A second try of the same recv appears to go randomly faster (5-12MB bursting to 100MB/sec briefly) - the DDT in core should make the second try quite a bit faster, but it's not as fast as I'd expect. My zdb -D output: DDT-sha256-zap-duplicate: 633396 entries, size 361 on disk, 179 in core DDT-sha256-zap-unique: 5054608 entries, size 350 on disk, 185 in core 6M entries doesn't sound like that much for a box with 6GB of RAM. CPU load is also low. mike On Wed, Dec 16, 2009 at 8:19 AM, Brandon High bh...@freaks.com wrote: On Wed, Dec 16, 2009 at 8:05 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: In his case 'zfs send' to /dev/null was still quite fast and the network was also quite fast (when tested with benchmark software). The implication is that ssh network transfer performance may have dropped with the update. zfs send appears to be fast still, but receive is slow. I tried a pipe from the send to the receive, as well as using mbuffer with a 100MB buffer; both wrote at ~12 MB/s. -B -- Brandon High : bh...@freaks.com Indecision is the key to flexibility. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I'm seeing similar results, though my file systems currently have de-dupe disabled and only compression enabled, both systems being snv_129. 
An old 111 build is also sending to the 129 main file server slowly; the 111 box used to send about 25MB/sec over SSH to the main file server when it ran 127. Since 128, however, the main file server has been receiving ZFS snapshots at a fraction of the previous speed. 129 fixed it a bit; I was literally getting just a couple hundred -BYTES- a second on 128, but on 129 I can get about 9-10MB/sec if I'm lucky, usually 4-5MB/sec. No other configuration changes on the network occurred, except for my X4540's being upgraded to snv_129. It does appear to be the zfs receive part, because I can send to /dev/null at close to 800MB/sec (42 drives in 5-6 disk vdevs, RAID-Z). Something must've changed in either SSH or the ZFS receive bits to cause this, but sadly, since I upgraded my pool, I cannot roll back these hosts :( -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
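To rule the ssh transport in or out of receive-side slowness like this, the mbuffer approach mentioned in the other thread streams the snapshot over a raw TCP socket instead; a sketch with made-up host, dataset, snapshot, and port names (unencrypted, so only appropriate on trusted links):

```shell
# Receiving host: listen on TCP port 9090 with a 1GB buffer, feed zfs receive
mbuffer -s 128k -m 1G -I 9090 | zfs receive -F tank/backup

# Sending host: pipe the snapshot into mbuffer pointed at the receiver
zfs send tank/vault@20091216 | mbuffer -s 128k -m 1G -O filer02:9090
```

The buffer also smooths out the bursty producer/consumer pattern of send/receive, which by itself often helps throughput even on the same builds.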
Re: [zfs-discuss] zfs zend is very slow
On Wed, Dec 16, 2009 at 7:43 PM, Edward Ned Harvey sola...@nedharvey.com wrote: I'm seeing similar results, though my file systems currently have de-dupe disabled, and only compression enabled, both systems being I can't say this is your issue, but you can count on slower writes with compression on. How slow is slow? Don't know. Irrelevant in this case? Possibly. I'm willing to accept slower writes with compression enabled; par for the course. Local writes, even with compression enabled, can still exceed 500MB/sec, with moderate to high CPU usage. These problems seem to have manifested after snv_128, and seemingly only affect ZFS receive speeds. Local pool performance is still very fast. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool fragmentation issues?
On Tue, Dec 15, 2009 at 5:28 PM, Bill Sprouse bill.spro...@sun.com wrote: Hi Everyone, I hope this is the right forum for this question. A customer is using a Thumper as an NFS file server to provide the mail store for multiple email servers (Dovecot). They find that when a zpool is freshly created and populated with mailboxes, even to the extent of 80-90% capacity, performance is OK for the users, and backups and scrubs take a few hours (4TB of data). There are around 100 file systems. After running for a while (a couple of months) the zpool seems to get fragmented: backups take 72 hours and a scrub takes about 180 hours. They are running mirrors with about 5TB usable per pool (500GB disks). Being a mail store, the writes and reads are small and random. Record size has been set to 8k (improved performance dramatically). The backup application is Amanda. Once backups become too tedious, the remedy is to replicate the pool and start over. Things get fast again for a while. Is this expected behavior given the application (email - small, random writes/reads)? Are there recommendations for system/ZFS/NFS configurations to improve this sort of thing? Are there best practices for structuring backups to avoid a directory walk? Thanks, bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Any reason in particular they chose to use Dovecot with the old Mbox format? Mbox has been proven many times over to be painfully slow when the files get larger, and in this day and age, I can't imagine anyone having smaller than a 50MB mailbox. We have about 30,000 e-mail users on various systems, and it seems the average size these days is approaching close to a GB. Though Dovecot has done a lot to improve the performance of Mbox mailboxes, Maildir might be better suited to your system. 
I wonder if the soon-to-be-released block/parity rewrite tool will freshen up a pool that's heavily fragmented, without having to redo the pools. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
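The 8k recordsize tuning the original post says helped dramatically is a one-line property change; a sketch with a hypothetical dataset name. It only affects files written after the change, so existing mailboxes would have to be rewritten (or the pool replicated, as they already do) to benefit.

```shell
# Match the dataset record size to the mail store's small random I/O pattern
zfs set recordsize=8k tank/mailstore

# Confirm the new setting
zfs get recordsize tank/mailstore
```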
Re: [zfs-discuss] ZFS send/recv extreme performance penalty in snv_128
On Sat, Dec 12, 2009 at 8:14 PM, Brent Jones br...@servuhome.net wrote: On Sat, Dec 12, 2009 at 11:39 AM, Brent Jones br...@servuhome.net wrote: On Sat, Dec 12, 2009 at 7:55 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Sat, 12 Dec 2009, Brent Jones wrote: I've noticed some extreme performance penalties simply by using snv_128 Does the 'zpool scrub' rate seem similar to before? Do you notice any read performance problems? What happens if you send to /dev/null rather than via ssh? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Scrubs on both systems seem to take about the same amount of time (16 hours, on a 48TB pool, with about 20TB used). I'll test to /dev/null tonight -- Brent Jones br...@servuhome.net I tested send performance to /dev/null, and I sent a 500GB filesystem in just a few minutes. The two servers are linked over GigE fiber (between two cities). Iperf output: [ ID] Interval Transfer Bandwidth [ 5] 0.0-60.0 sec 2.06 GBytes 295 Mbits/sec [ ID] Interval Transfer Bandwidth [ 4] 0.0-60.0 sec 2.38 GBytes 341 Mbits/sec Usually a bit faster, but some other stuff goes over that pipe. Looking at network traffic between these two hosts during the send, I see a lot of traffic (about 100-150Mbit usually). So there's traffic, but a 100MB send has taken over 10 minutes and still isn't complete. Given 100Mbit/sec, it should take roughly 10 seconds, not 10 minutes. There is a little bit of disk activity, maybe a MB/sec on average, and about 30 IOPS. So it seems the hosts are exchanging a lot of data about the snapshot, but not actually replicating any data for a very long time. 
SSH CPU usage is minimal, just a few percent (arcfour, but I tried others, no difference). Odd behavior to be sure, and it looks very similar to what snapshot replication did back in build 101, before they made significant speed improvements to snapshot replication. Wonder if this is a major regression, due to changes in newer ZFS versions, maybe to accommodate de-dupe? Sadly, I can't roll back, since I already upgraded my pool, but I may try upgrading to 129, though my IPS doesn't seem to recognize the newer version yet. -- Brent Jones br...@servuhome.net I found some time to dig into my troubles updating to 129 (my dev repository can no longer be called Dev, must use the opensolaris.org name, bleh). But at least build 129 seems to fix this. Not sure what the issue is, but bouncing between 128 and 129, I can reproduce the terrible ZFS send/recv times 100% of the time. Though 129 still isn't as fast as 127, with the same datasets and configuration, it's good enough for now. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS send/recv extreme performance penalty in snv_128
I've noticed some extreme performance penalties simply by using snv_128 I take snapshots, and send them over SSH to another server over Gigabit ethernet. Before, I would get 20-30MBps, prior to snv_128 (127, and nearly all previous builds). However, simply image-updating to snv_128 has caused a majority of my snapshots to do this: receiving incremental stream of pdxfilu01/vault/0...@20091212-01:15:00 into pdxfilu02/vault/0...@20091212-01:15:00 received 13.8KB stream in 491 seconds (28B/sec) De-dupe is NOT enabled on any pool, but I have upgraded to the newest ZFS pool version, which prevents me from rolling back to snv_127, which would send at many tens of megabytes a second. This is on an X4540, dual quad cores, and 64GB RAM. Anyone else seeing similar issues? -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
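A useful way to localize a regression like the 28B/sec receive above is to time each leg of the pipeline separately; a sketch reusing the host and dataset names from the report (snapshot name hypothetical):

```shell
# 1. Source side only: raw send speed, no network involved.
#    dd prints the throughput when the stream completes.
zfs send pdxfilu01/vault@test | dd of=/dev/null bs=1M

# 2. Full path: the send-over-ssh-into-receive combination that regressed
zfs send pdxfilu01/vault@test | \
    ssh -c arcfour pdxfilu02 "zfs receive -F pdxfilu02/vault"
```

If leg 1 is fast and a plain network benchmark between the hosts is fast, the slowdown is isolated to ssh or the receive code, which is the conclusion the thread eventually reaches.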
Re: [zfs-discuss] ZFS send/recv extreme performance penalty in snv_128
On Sat, Dec 12, 2009 at 7:55 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Sat, 12 Dec 2009, Brent Jones wrote: I've noticed some extreme performance penalties simply by using snv_128 Does the 'zpool scrub' rate seem similar to before? Do you notice any read performance problems? What happens if you send to /dev/null rather than via ssh? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Scrubs on both systems seem to take about the same amount of time (16 hours, on a 48TB pool, with about 20TB used). I'll test to /dev/null tonight -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS send/recv extreme performance penalty in snv_128
On Sat, Dec 12, 2009 at 11:39 AM, Brent Jones br...@servuhome.net wrote: On Sat, Dec 12, 2009 at 7:55 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Sat, 12 Dec 2009, Brent Jones wrote: I've noticed some extreme performance penalties simply by using snv_128 Does the 'zpool scrub' rate seem similar to before? Do you notice any read performance problems? What happens if you send to /dev/null rather than via ssh? Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Scrubs on both systems seem to take about the same amount of time (16 hours, on a 48TB pool, with about 20TB used). I'll test to /dev/null tonight -- Brent Jones br...@servuhome.net I tested send performance to /dev/null, and I sent a 500GB filesystem in just a few minutes. The two servers are linked over GigE fiber (between two cities). Iperf output: [ ID] Interval Transfer Bandwidth [ 5] 0.0-60.0 sec 2.06 GBytes 295 Mbits/sec [ ID] Interval Transfer Bandwidth [ 4] 0.0-60.0 sec 2.38 GBytes 341 Mbits/sec Usually a bit faster, but some other stuff goes over that pipe. Looking at network traffic between these two hosts during the send, I see a lot of traffic (about 100-150Mbit usually). So there's traffic, but a 100MB send has taken over 10 minutes and still isn't complete. Given 100Mbit/sec, it should take roughly 10 seconds, not 10 minutes. There is a little bit of disk activity, maybe a MB/sec on average, and about 30 IOPS. So it seems the hosts are exchanging a lot of data about the snapshot, but not actually replicating any data for a very long time. SSH CPU usage is minimal, just a few percent (arcfour, but I tried others, no difference). Odd behavior to be sure, and it looks very similar to what snapshot replication did back in build 101, before they made significant speed improvements to snapshot replication. 
Wonder if this is a major regression, due to changes in newer ZFS versions, maybe to accommodate de-dupe? Sadly, I can't roll back, since I already upgraded my pool, but I may try upgrading to 129, though my IPS doesn't seem to recognize the newer version yet. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS pool unusable after attempting to destroy a dataset with dedup enabled
On Tue, Dec 8, 2009 at 6:36 PM, Jack Kielsmeier jac...@netins.net wrote: Ah, good to know! I'm learning all kinds of stuff here :) The command (zpool import) is still running and I'm still seeing disk activity. Any rough idea as to how long this command should last? Looks like each disk is being read at a rate of 1.5-2 megabytes per second. Going worst case, assuming each disk is 1572864 megs (the 1.5TB disks are actually smaller than this due to the 'rounding' drive manufacturers do) and 2 megs/sec read rate per disk, that means hopefully at most I should have to wait: 1572864(megs) / 2(megs/second) / 60 (seconds / minute) / 60 (minutes / hour) / 24 (hour / day): 9.1 days Again, I don't know if the zpool import is looking at the entire contents of the disks, or what exactly it's doing, but I'm hoping that would be the 'maximum' I'd have to wait for this command to finish :) -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I submitted a bug a while ago about this: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6855208 I'll escalate since I have a support contract. But yes, I see this as a serious bug, I thought my machine had locked up entirely as well, it took about 2 days to finish a destroy on a volume about 12TB in size. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
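The worst-case figure quoted above follows directly from the numbers given; a quick sanity check of the arithmetic in shell (integer division truncates the .1):

```shell
disk_mb=1572864   # nominal size of a 1.5TB disk, in MB
rate=2            # observed worst-case per-disk read rate, MB/s
secs=$((disk_mb / rate))
days=$((secs / 86400))
echo "$days days"   # prints: 9 days
```

The real bound is looser still, since as noted the import may not need to read every block on every disk.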
Re: [zfs-discuss] flar and tar the best way to backup S10 ZFS only?
On Mon, Nov 23, 2009 at 8:24 PM, Trevor Pretty trevor_pre...@eagle.co.nz wrote: I'm persuading a customer that when he goes to S10 he should use ZFS for everything. We only have one M3000 and a J4200 connected to it. We are not talking about a massive site here with a SAN etc. The M3000 is their mainframe. His RTO and RPO are both about 12 hours; his business gets difficult without the server but does not die horribly. He currently uses ufsdump to tape each night, which is sent off-site. However, ufsrestore -i has saved his bacon in the past and he does not want to lose this functionality. A couple of questions. flar seems to work with ZFS quite well and will back up the whole root pool flar(1M). This seems to be the best way to get the equivalent of ufsrestore -r and a great way to recover in a DR event:- http://www.sun.com/bigadmin/content/submitted/flash_archive.jsp My Questions... Q: Is there the equivalent of ufsrestore -i with flar? (which seems to be an ugly shell script around cpio or pax) Q: Therefore should I have a tar of the root pool as well? Q: There is no reason I cannot use flar on the other non-root pools? Q: Or is tar better for the non-root pools? We will have LOTS of disk space; his whole working dataset will easily fit onto an LTO4, so can anybody think of a good reason why you would not flar the root pool into another pool and then just tar off this pool each night to tape? In fact we will have so much disk space (compared to now) I expect we will be able to keep most backups on-line for quite some time. Discuss :-) -- Trevor Pretty | Technical Account Manager | T: +64 9 639 0652 | M: +64 21 666 161 Eagle Technology Group Ltd. Gate D, Alexandra Park, Greenlane West, Epsom Private Bag 93211, Parnell, Auckland www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss With the cost of tapes, drives, and an off-site storage service (unless it's stored at the owner's home), you could probably co-locate a server with fast internet connectivity and a bundle of local storage, and just ZFS snapshot your relevant pools to that server. I second Richard's recommendation of Amanda as well, though; it's a pretty flexible solution, and it can back up much more than just local ZFS snapshots if that would be a benefit to you. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
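The flar-then-tape scheme discussed above boils down to a couple of commands; a sketch with hypothetical archive and tape paths (check flarcreate(1M) for the options your Solaris 10 update supports):

```shell
# Create a flash archive of the running system into a scratch pool/dataset
flarcreate -n nightly-root /backup/root.flar

# Nightly off-site copy: tar the archive pool's contents to the tape drive
tar cf /dev/rmt/0 /backup
```

For the non-root pools, a snapshot plus `zfs send` into a file in the same scratch area serves the same role, at the cost of losing the file-level `ufsrestore -i` style browsing the customer wants.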
Re: [zfs-discuss] Comstar thin provisioning space reclamation
On Wed, Nov 18, 2009 at 4:09 PM, Brent Jones br...@servuhome.net wrote: On Tue, Nov 17, 2009 at 10:32 AM, Ed Plese e...@edplese.com wrote: You can reclaim this space with the SDelete utility from Microsoft. With the -c option it will zero any free space on the volume. For example: C:\sdelete -c C: I've tested this with xVM and with compression enabled for the zvol, and it worked very well. Ed Plese It seems the compression setting on the zvol is key here. Tried without compression turned on, and the thin-provisioned file grew to its maximum size. I'm re-running it on the same volume, this time with compression turned on, to see how it behaves :) -- Brent Jones br...@servuhome.net Turning compression on was the key. Reclaimed about 5TB of space running sdelete (though it takes a very long time). -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
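Putting the thread's conclusion together: the zeroes SDelete writes only shrink the backing store if compression is enabled *before* the zeroing pass, so the order of operations matters. A sketch with a hypothetical zvol name:

```shell
# Solaris side: enable compression on the backing zvol first, so the
# zeroed blocks compress away to (almost) nothing
zfs set compression=on tank/iscsivol

# Then, on the Windows initiator, zero the free space:
#   C:\> sdelete -c C:

# Afterwards, check how much space the zvol actually references
zfs get compression,used,referenced tank/iscsivol
```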
Re: [zfs-discuss] Comstar thin provisioning space reclamation
On Tue, Nov 17, 2009 at 10:32 AM, Ed Plese e...@edplese.com wrote: You can reclaim this space with the SDelete utility from Microsoft. With the -c option it will zero any free space on the volume. For example: C:\sdelete -c C: I've tested this with xVM and with compression enabled for the zvol, but it worked very well. Ed Plese It seems the compression setting on the zvol is key here. Tried without compression turned on, and the thin provisioned file grew to its maximum size. I'm re-running it on the same volume, this time with compression turned on to see how it behaves next :) -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Comstar thin provisioning space reclamation
I use several file-backed thin-provisioned iSCSI volumes presented over Comstar. The initiators are Windows 2003/2008 systems with the MS MPIO initiator. The Windows systems only claim to be using about 4TB of space, but the ZFS volume says 7.12TB is used. Granted, I imagine ZFS allocates the blocks as soon as Windows needs space, and Windows will eventually not need that space again. Is there a way to reclaim unused space on a thin-provisioned iSCSI target? -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ..and now ZFS send dedupe
On Mon, Nov 9, 2009 at 12:45 PM, Nigel Smith nwsm...@wilusa.freeserve.co.uk wrote: More ZFS goodness putback before close of play for snv_128. http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010768.html http://hg.genunix.org/onnv-gate.hg/rev/216d8396182e Regards Nigel Smith -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Are these recent developments due to help/support from Oracle? Or is it business as usual for ZFS developments? -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs mount error
On Mon, Nov 2, 2009 at 1:34 PM, Ramin Moazeni ramin.moaz...@sun.com wrote: Hello A customer recently had a power outage. Prior to the outage, they did a graceful shutdown of their system. On power-up, the system is not coming up due to zfs errors as follows: cannot mount 'rpool/export': Number of symbolic links encountered during path name traversal exceeds MAXSYMLINKS mount '/export/home': failed to create mountpoint. The possible cause of this might be that a symlink was created pointing to itself, since the customer stated that they created lots of symlinks to get their env ready. However, since /export is not getting mounted, they can not go back and delete/fix the symlinks. Can someone suggest a way to fix this issue? Thanks Ramin Moazeni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I see these very frequently on my systems; regardless of a clean shutdown or not, about 1/3 of the time filesystems cannot mount. What I do is boot into single-user mode, make sure the filesystem in question is NOT mounted, and just delete the directory that it's trying to mount into. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
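The workaround described amounts to a few commands from single-user mode; a sketch using the dataset names from the quoted error (verify the mounted property first, since removing a live mountpoint's contents would destroy data):

```shell
# Make sure the affected datasets really are NOT mounted
zfs get mounted rpool/export rpool/export/home

# Remove the stale directory blocking the mountpoint, then retry all mounts
rm -rf /export/home
zfs mount -a
```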
Re: [zfs-discuss] million files in single directory
On Sat, Oct 3, 2009 at 6:50 PM, Jeff Haferman j...@haferman.com wrote: A user has 5 directories, each has tens of thousands of files, and the largest directory has over a million files. The files themselves are not very large; here is an ls -lh on the directories: [these are all ZFS-based] [r...@cluster]# ls -lh total 341M drwxr-xr-x+ 2 someone cluster 13K Sep 14 19:09 0/ drwxr-xr-x+ 2 someone cluster 50K Sep 14 19:09 1/ drwxr-xr-x+ 2 someone cluster 197K Sep 14 19:09 2/ drwxr-xr-x+ 2 someone cluster 785K Sep 14 19:09 3/ drwxr-xr-x+ 2 someone cluster 3.1M Sep 14 19:09 4/ When I go into directory 0, it takes about a minute for an ls -1 | wc -l to return (it has about 12,000 files). Directory 1 takes between 5-10 minutes for the same command to return (it has about 50,000 files). I did an rsync of this directory structure to another filesystem [lustre-based, FWIW] and it took about 24 hours to complete. We have done rsyncs on other directories that are much larger in terms of file sizes, but have thousands of files rather than tens of thousands, hundreds of thousands, and millions of files. Is there some way to speed up simple things like determining the contents of these directories? And why does an rsync take so much longer on these directories, when directories that contain hundreds of gigabytes transfer much faster? Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Be happy you don't have Windows + NTFS with hundreds of thousands, or millions, of files. Explorer will crash, run your system out of memory and slow it down, or plain hard-lock Windows for hours on end. This is on brand new hardware: 64-bit, 32GB RAM, and 15k SAS disks. Regardless of filesystem, I'd suggest splitting your directory structure into a hierarchy. It makes sense even just for cleanliness. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
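Independent of filesystem, the hierarchy split suggested above can be scripted in a few lines; a sketch (bucketize is a made-up helper, bucketing on the first character of each filename, which divides any one directory's entry count by the number of distinct leading characters):

```shell
# Sketch: rebucket a flat directory into one subdirectory per leading
# character, so no single directory holds millions of entries.
bucketize() {
  src=$1
  for f in "$src"/*; do
    [ -f "$f" ] || continue                  # skip subdirectories
    name=$(basename "$f")
    bucket="$src/$(printf '%.1s' "$name")"   # first character of the name
    mkdir -p "$bucket"
    mv "$f" "$bucket/"
  done
}
# usage: bucketize /export/data/0
```

Run once per oversized directory; a two-level scheme (first two characters) scales further if one bucket is still too large.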
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
On Wed, Sep 2, 2009 at 6:27 AM, Frank Middleton f.middle...@apogeect.com wrote: On 09/02/09 05:40 AM, Henrik Johansson wrote: For those of us who have already upgraded and written data to our raidz pools, are there any risks of inconsistency or wrong checksums in the pool? Is there a bug id? This may not be a new problem, insofar as it may also affect mirrors. As part of the ancient "mirrored drives should not have checksum errors" thread, I used Richard Elling's amazing zcksummon script http://www.richardelling.com/Home/scripts-and-programs-1/zcksummon to help diagnose this (thanks, Richard, for all your help). The bottom line is that hardware glitches (as found on cheap PCs without ECC on buses and memory) can put ZFS into a mode where it detects bogus checksum errors. If you set copies=2, it seems to always be able to repair them, but they are never actually repaired. Every time you scrub, it finds a checksum error on the affected file(s) and it pretends to repair it (or may fail if you have copies=1 set). Note: I have not tried this on raidz, only mirrors, where it is highly reproducible. It would be really interesting to see if raidz gets results similar to the mirror case when running zcksummon. Note I have NEVER had this problem on SPARC, only on certain bargain-basement PCs (used as X-Terminals) which, as it turns out, have mobos notorious for not detecting bus parity errors. If this is the same problem, you can certainly mitigate it by setting copies=2 and actually copying the files (e.g., by promoting a snapshot, which I believe will do this - can someone confirm?). My guess is that snv121 has done something to make the problem more likely to occur, but the problem itself is quite old (predates snv100). Could you share with us some details of your hardware, especially how much memory and whether it has ECC or bus parity?
Cheers -- Frank On 09/02/09 05:40 AM, Henrik Johansson wrote: Hi Adam, On Sep 2, 2009, at 1:54 AM, Adam Leventhal wrote: Hi James, After investigating this problem a bit, I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124. Regards Henrik http://sparcv9.blogspot.com/
I see this issue on each of my X4540's, 64GB of ECC memory, 1TB drives. Rolling back to snv_118 does not reveal any checksum errors, only snv_121. So the commodity hardware here doesn't hold up, unless Sun isn't validating their equipment (not likely, as these servers have had no hardware issues prior to this build). -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Petabytes on a budget - blog
On Wed, Sep 2, 2009 at 12:12 PM, Roland Rambauroland.ram...@sun.com wrote: Jacob, Jacob Ritorto schrieb: Torrey McMahon wrote: 3) Performance isn't going to be that great with their design but...they might not need it. Would you be able to qualify this assertion? Thinking through it a bit, even if the disks are better than average and can achieve 1000Mb/s each, each uplink from the multiplier to the controller will still have 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks per multiplier * (2) multipliers * 1000GB/s each, that's 1Gb/s at the PCI-e interface, which approximately coincides with a meager 4x PCI-e slot. they use a 85$ PC motherboard - that does not have meager 4x PCI-e slots, it has one 16x and 3 *1x* PCIe slots, plus 3 PCI slots ( remember, long time ago: 32-bit wide 33 MHz, probably shared bus ). Also it seems that all external traffic uses the single GbE motherboard port. -- Roland -- ** Roland Rambau Platform Technology Team Principal Field Technologist Global Systems Engineering Phone: +49-89-46008-2520 Mobile:+49-172-84 58 129 Fax: +49-89-46008- mailto:roland.ram...@sun.com ** Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten Amtsgericht München: HRB 161028; Geschäftsführer: Thomas Schröder, Wolfgang Engels, Wolf Frenkel Vorsitzender des Aufsichtsrates: Martin Häring *** UNIX * /bin/sh FORTRAN ** ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Probably for their usage patterns, these boxes make sense. But I concur that the reliability and performance would be very suspect to any organization which values their data in any fashion. Personally, I have some old Dual P3 systems still running fine at home, on what were cheap motherboards. But would I advocate such a system to protect business data? Not a chance. 
I'm sure that, at the price they offer storage, this was the only way they could be profitable, and it's a pretty creative solution. For my personal data backups, I'm sure their service would meet all my needs, but that's about as far as I would trust these systems: MP3s and backups of photos for which I already maintain a couple of copies. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] x4540 dead HDD replacement, remains configured.
On Wed, Aug 5, 2009 at 11:48 PM, Jorgen Lundman lund...@gmo.jp wrote: I suspect this is what it is all about:
# devfsadm -v
devfsadm[16283]: verbose: no devfs node or mismatched dev_t for /devices/p...@0,0/pci10de,3...@b/pci1000,1...@0/s...@5,0:a
[snip] and indeed:
brw-r----- 1 root sys 30, 2311 Aug 6 15:34 s...@4,0:wd
crw-r----- 1 root sys 30, 2311 Aug 6 15:24 s...@4,0:wd,raw
drwxr-xr-x 2 root sys 2 Aug 6 14:31 s...@5,0
drwxr-xr-x 2 root sys 2 Apr 17 17:52 s...@6,0
brw-r----- 1 root sys 30, 2432 Jul 6 09:50 s...@6,0:a
crw-r----- 1 root sys 30, 2432 Jul 6 09:48 s...@6,0:a,raw
Perhaps because it was booted with the dead disk in place, it never configured the entire sd5 mpt driver. Why the other hard disks work I don't know. I suspect the only way to fix this is to reboot again. Lund
I have a pair of X4540's also, and getting any kind of drive status or failure alert is a lost cause. I've opened several cases with Sun on the following issues:
- ILOM/BMC can't see any drives (status, FRU, firmware, etc.)
- FMA cannot see a drive failure (you can pull a drive, and it could be hours before 'zpool status' shows a failed drive, even during a scrub)
- Hot-swapping drives rarely works; the system will not see the new drive until a reboot
Things I've tried that Sun has suggested:
- New BIOS
- New controller firmware
- New ILOM firmware
- Upgrading to new releases of OSol (currently on 118, no luck)
- Replacing the ILOM card
- Custom FMA configs
Nothing works, and my cases with Sun have been open for about 6 months now, with no resolution in sight. Given that Sun now makes the 7000, I can only assume their support of the more whitebox version, AKA the X4540, is either near an end, or they don't intend to support any advanced monitoring whatsoever. Sad, really, as my $900 Dell and HP servers can send SMS, Jabber messages, SNMP traps, etc., on ANY IPMI event or hardware issue, without any tinkering or excuses.
-- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Managing ZFS Replication
On Fri, Jul 31, 2009 at 10:25 AM, Joseph L. Casale jcas...@activenetwerx.com wrote: I came up with a somewhat custom script, using some pre-existing scripts I found about the land. http://www.brentrjones.com/?p=45 Brent, That was super helpful. I had to make some simple changes to the ssh syntax, as I use a specific user and identity file going from Solaris 10 to OpenSolaris 0906, but I am getting this message:
The Source snapshot does exist on the Destination, clear to send a new one!
Taking snapshot: /sbin/zfs snapshot mypool2/back...@2009-07-31t16:34:54Z
receiving incremental stream of mypool2/back...@2009-07-31t16:34:54Z into mypool/back...@2009-07-31t16:34:54Z
received 39.7GB stream in 2244 seconds (18.1MB/sec)
cannot set property for 'mypool2/back...@2009-07-31t16:34:54Z': snapshot properties cannot be modified
cannot set property for 'mypool2/back...@2009-58-30t21:58:15Z': snapshot properties cannot be modified
cannot set property for 'mypool2/back...@2009-07-31t16:34:54Z': snapshot properties cannot be modified
Is that intended to modify the properties of a snapshot? Does that work in some version of Solaris other than 10u7? Thanks so much for that pointer! jlc
If I recall correctly, modifiable snapshot properties aren't supported in older versions of ZFS :( I wrote the script on OpenSolaris 2008.11, which did have modifiable snapshot properties. Can you upgrade your pool versions, possibly? -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Managing ZFS Replication
On Thu, Jul 30, 2009 at 3:54 PM, Joseph L. Casale jcas...@activenetwerx.com wrote: Has anyone come up with a solution to manage the replication of ZFS snapshots? The send/recv criteria get tricky with all but the first, unless you purge the destination of snapshots and then force a full stream into it. I was hoping to script a daily update, but I see that I would have to keep track of what's been done on both sides when using the -i|-I syntax, so it would not be reliable in a hands-off script. Would AVS be a possible solution in a mixed S10/OSol/SXCE environment? I presume that would make it fairly trivial, but right now I am duplicating data from an S10 box to an OSol snv_118 box, based on hardware/application needs forcing the two platforms. Thanks for any ideas! jlc
I came up with a somewhat custom script, using some pre-existing scripts I found about the land. http://www.brentrjones.com/?p=45 I schedule some file systems every 5 minutes, hourly, and nightly depending on requirements. It has worked quite well for me, and has proved quite useful in restoring as well (I've already had to use it). It e-mails status reports, handles conflicts in a simple but effective way, and replication can be reversed by just starting to run it from the other system. I expanded on it to handle A-B and B-A replication (mirror half of A to B, and half of B to A, for paired redundancy). I'll post that version up in a few weeks when I clean it up a little. Credits go to Constantin Gonzalez for inspiration and source for parts of my script. http://blogs.sun.com/constantin/ -- Brent Jones br...@servuhome.net
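For anyone who just wants the shape of such a script, the core incremental send/recv idea can be sketched as below. This is my own minimal reconstruction, not the actual script from brentrjones.com; the dataset, host, pool, and property names are all placeholders:

```shell
#!/bin/sh
# Minimal replication sketch: snapshot, incremental send to a remote
# pool, with a user property acting as a lock so cleanup jobs won't
# destroy snapshots mid-send. Every name below is a placeholder.
FS=tank/data
REMOTE=backuphost
DESTPOOL=backup

# Most recent existing snapshot becomes the incremental base.
PREV=$(zfs list -H -t snapshot -o name -s creation -r "$FS" | tail -1)
NOW="$FS@$(date -u +%Y-%m-%dT%H:%M:%SZ)"

zfs set org.example:replicating=on "$FS"   # "lock" checked by cleanup scripts
zfs snapshot "$NOW"
if [ -n "$PREV" ]; then
  zfs send -i "$PREV" "$NOW" | ssh "$REMOTE" zfs recv -vFd "$DESTPOOL"
else
  zfs send "$NOW" | ssh "$REMOTE" zfs recv -vFd "$DESTPOOL"   # first run: full stream
fi
zfs inherit org.example:replicating "$FS"  # release the lock
```

Because recv -F rolls the destination back to the last common snapshot, reversing replication really is just running the same logic from the other host, as described above. This needs a live pool on both ends, so it is shown untested.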
Re: [zfs-discuss] zfs destroy slow?
On Mon, Jul 27, 2009 at 3:58 AM, Markus Kovero markus.kov...@nebula.fi wrote: Oh well, the whole system seems to be deadlocked. Nice. A little too keen on keeping data safe :-P Yours, Markus Kovero
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Markus Kovero
Sent: 27 July 2009 13:39
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] zfs destroy slow?
Hi, how come zfs destroy is so slow? E.g., destroying a 6TB dataset renders zfs admin commands useless for the time being, in this case for hours. (Running OSol 111b with latest patches.) Yours, Markus Kovero
I submitted a bug, but I don't think it's been assigned a case number yet. I see this exact same behavior on my X4540's. I create a lot of snapshots, and when I tidy up, zfs destroy can 'stall' any and all ZFS-related commands for hours, or even days (in the case of nested snapshots). The only resolution is to not use zfs destroy, or to simply wait it out. It will eventually finish, just not in any reasonable timeframe. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] zfs destroy slow?
I submitted a bug, but I don't think it's been assigned a case number yet. I see this exact same behavior on my X4540's. I create a lot of snapshots, and when I tidy up, zfs destroy can 'stall' any and all ZFS-related commands for hours, or even days (in the case of nested snapshots). The only resolution is to not use zfs destroy, or to simply wait it out. It will eventually finish, just not in any reasonable timeframe. -- Brent Jones br...@servuhome.net Correction: it looks like my bug is 6855208 -- Brent Jones br...@servuhome.net
[zfs-discuss] Opensolaris attached to 70 disk HP array
Looking at this external array by HP: http://h18006.www1.hp.com/products/storageworks/600mds/index.html 70 disks in 5U, which could probably be configured in JBOD. Has anyone attempted to connect this to a box running opensolaris to create a 70 disk pool? -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
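Assuming the enclosure really can present all 70 disks as JBOD, building the pool would presumably look something like the sketch below. The device names and the 7-disk raidz2 grouping are hypothetical, not anything tested on this hardware:

```shell
# Hypothetical layout: 70 JBOD disks arranged as 10 raidz2 vdevs of 7
# disks each. The cXtYd0 device names are invented placeholders.
zpool create bigpool \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
  raidz2 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0
  # ...and eight more 7-disk raidz2 groups for the remaining 56 disks
```

Smaller vdevs keep resilver times and the blast radius of a double failure down; the exact grouping is a judgment call, not something the array dictates.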
Re: [zfs-discuss] Another user loses his pool (10TB in this case) and 40 days' work
On Sat, Jul 18, 2009 at 7:39 PM, Russel no-re...@opensolaris.org wrote: Yes, you'll find my name all over VB at the moment, but I have found it to be stable (don't install the addons disk for Solaris!!, use 3.0.2; for me, WinXP 32-bit and OpenSolaris 2009.06 have been rock solid). It was (or seems to be) that OpenSolaris failed with extract_boot_list doesn't belong to 101, but no one on opensol seems interested in it, though others have reported it too; probably a rare issue. But yeah, I hope Victor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, ZFS may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away. But I thought that since the Jan 2009 patches things were meant to be a lot better, especially with Sun using it in their storage servers now. -- This message posted from opensolaris.org
No offense, but you trusted 10TB of important data to OpenSolaris running inside VirtualBox (not stable), on top of Windows XP (arguably not stable, especially for production), on probably consumer-grade hardware, with unknown support for any of the above products? I'd like to say this was an unfortunate circumstance, but there are many levels of fail here; blaming ZFS seems misplaced, and the subject of this thread especially inflammatory. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Hanging receive
On Fri, Jul 3, 2009 at 8:31 PM, Ian Collinsi...@ianshome.com wrote: Ian Collins wrote: I was doing an incremental send between pools, the receive side is locked up and no zfs/zpool commands work on that pool. The stacks look different from those reported in the earlier ZFS snapshot send/recv hangs X4540 servers thread. Here is the process information from scat (other commands hanging on the pool are also in cv_wait): Has anyone else seen anything like this? The box wouldn't even reboot, it had to be power cycled. It locks up on receive regularly now. SolarisCAT(live/10X) proc -L 18500 addr PID PPID RUID/UID size RSS swresv time command == == == == == == = 0xffc8d1990398 18500 14729 0 5369856 2813952 1064960 32 zfs receive -v -d backup user (LWP_SYS) thread: 0xfe84e0d5bc20 PID: 18500 cmd: zfs receive -v -d backup t_wchan: 0xa0ed62a2 sobj: condition var (from zfs:txg_wait_synced+0x83) t_procp: 0xffc8d1990398 p_as: 0xfee19d29c810 size: 5369856 RSS: 2813952 hat: 0xfedb762d2818 cpuset: zone: global t_stk: 0xfe8000143f10 sp: 0xfe8000143b10 t_stkbase: 0xfe800013f000 t_pri: 59(TS) pctcpu: 0.00 t_lwp: 0xfe84e92d6ec0 lwp_regs: 0xfe8000143f10 mstate: LMS_SLEEP ms_prev: LMS_SYSTEM ms_state_start: 15 minutes 4.476756638 seconds earlier ms_start: 15 minutes 8.447715668 seconds earlier psrset: 0 last CPU: 2 idle: 102425 ticks (17 minutes 4.25 seconds) start: Thu Jul 2 22:23:06 2009 age: 1029 seconds (17 minutes 9 seconds) syscall: #54 ioctl(, 0x0) (sysent: genunix:ioctl+0x0) tstate: TS_SLEEP - awaiting an event tflg: T_DFLTSTK - stack is default size tpflg: TP_TWAIT - wait to be freed by lwp_wait TP_MSACCT - collect micro-state accounting information tsched: TS_LOAD - thread is in memory TS_DONT_SWAP - thread/LWP should not be swapped pflag: SKILLED - SIGKILL has been posted to the process SMSACCT - process is keeping micro-state accounting SMSFORK - child inherits micro-state accounting pc: unix:_resume_from_idle+0xf8 resume_return: addq $0x8,%rsp unix:_resume_from_idle+0xf8 
resume_return() unix:swtch+0x12a() genunix:cv_wait+0x68() zfs:txg_wait_synced+0x83() zfs:dsl_sync_task_group_wait+0xed() zfs:dsl_sync_task_do+0x54() zfs:dmu_objset_create+0xc5() zfs:zfs_ioc_create+0xee() zfs:zfsdev_ioctl+0x14c() genunix:cdev_ioctl+0x1d() specfs:spec_ioctl+0x50() genunix:fop_ioctl+0x25() genunix:ioctl+0xac() unix:_syscall32_save+0xbf() -- switch to user thread's user stack -- The box is an x4500, Solaris 10u7. -- Ian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I hit this too: 6826836 Fixed in 117 http://opensolaris.org/jive/thread.jspa?threadID=104852tstart=120 -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS write I/O stalls
On Tue, Jun 30, 2009 at 12:25 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Mon, 29 Jun 2009, Lejun Zhu wrote: With the ZFS write throttle, the number 2.5GB is tunable. From what I've read in the code, it is possible to e.g. set zfs:zfs_write_limit_override = 0x8000000 (bytes) to make it write 128M instead.
This works, and the difference in behavior is profound. Now it is a matter of finding the best value which optimizes both usability and performance. A tuning for 384 MB:
# echo zfs_write_limit_override/W0t402653184 | mdb -kw
zfs_write_limit_override: 0x30000000 = 0x18000000
CPU is smoothed out quite a lot, and write latencies (as reported by a zio_rw.d dtrace script) are radically different than before.
Perfmeter display for 256 MB: http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-256mb.png
Perfmeter display for 384 MB: http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-384mb.png
Perfmeter display for 768 MB: http://www.simplesystems.org/users/bfriesen/zfs-discuss/perfmeter-768mb.png
Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Maybe there could be a supported ZFS tunable (per file system, even?) that is optimized for 'background' tasks, or 'foreground'. Beyond that, I will give this tunable a shot and see how it impacts my own workload. Thanks! -- Brent Jones br...@servuhome.net
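For reference, the same override can be made persistent across reboots via /etc/system instead of a live mdb poke; the value below is the 384 MB setting from the example above (0x18000000 = 402653184 bytes), and like any write-throttle tuning it should be treated as experimental:

```
* /etc/system fragment (not a shell script): persists the mdb tuning
* shown above. 0x18000000 bytes = 402653184 = 384 MB.
set zfs:zfs_write_limit_override = 0x18000000
```

A reboot is required for /etc/system changes to take effect, which is why the mdb route is handier while hunting for the best value.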
Re: [zfs-discuss] ZFS write I/O stalls
On Mon, Jun 29, 2009 at 2:48 PM, Bob Friesenhahnbfrie...@simple.dallas.tx.us wrote: On Wed, 24 Jun 2009, Lejun Zhu wrote: There is a bug in the database about reads blocked by writes which may be related: http://bugs.opensolaris.org/view_bug.do?bug_id=6471212 The symptom is sometimes reducing queue depth makes read perform better. I have been banging away at this issue without resolution. Based on Roch Bourbonnais's blog description of the ZFS write throttle code, it seems that I am facing a perfect storm. Both the storage write bandwidth (800+ MB/second) and the memory size of my system (20 GB) result in the algorithm batching up 2.5 GB of user data to write. Since I am using mirrors, this results in 5 GB of data being written at full speed to the array on a very precise schedule since my application is processing fixed-sized files with a fixed algorithm. The huge writes lead to at least 3 seconds of read starvation, resulting in a stalled application and a square-wave of system CPU utilization. I could attempt to modify my application to read ahead by 3 seconds but that would require gigabytes of memory, lots of complexity, and would not be efficient. Richard Elling thinks that my array is pokey, but based on write speed and memory size, ZFS is always going to be batching up data to fill the write channel for 5 seconds so it does not really matter how fast that write channel is. If I had 32GB of RAM and 2X the write speed, the situation would be identical. Hopefully someone at Sun is indeed working this read starvation issue and it will be resolved soon. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I see similar square-wave performance. 
However, my load is primarily write-based; when those commits happen, I see all network activity pause while the buffer is committed to disk. I write about 750Mbit/sec over the network to the X4540's during backup windows, primarily using iSCSI. When those writes occur to my RAID-Z volume, all activity pauses until the writes are fully flushed. One thing to note: on 117, the effects are seemingly reduced and performance is a bit more even, but it is still there. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] [storage-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 26, 2009 at 10:14 AM, Brent Jones br...@servuhome.net wrote: On Thu, Jun 25, 2009 at 12:00 AM, James Lever j...@jamver.id.au wrote: On 25/06/2009, at 4:38 PM, John Ryan wrote: Can I ask the same question - does anyone know when the 113 build will show up on pkg.opensolaris.org/dev ? On 24/06/2009, at 9:49 PM, Dave Miner wrote to indiana-discuss: There were problems with 116 that caused us to not release it. 117 is under construction, available in the next few days. cheers, James ___ storage-discuss mailing list storage-disc...@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/storage-discuss I checked this morning, and 117 is available now -- Brent Jones br...@servuhome.net
Confirming this issue is fixed in build 117. Snapshots are significantly faster as well; my average transfer speed went from about 15MB/sec to over 40MB/sec. I imagine 40MB/sec is now a limitation of the CPU, as I can see SSH maxing out a single core of the quad cores. Maybe SSH can be made multi-threaded next? :) -- Brent Jones br...@servuhome.net
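If SSH's single-threaded cipher is the ceiling, one commonly used alternative (also discussed elsewhere in this archive) is mbuffer's built-in TCP transport, which takes encryption out of the path entirely. A sketch, with host, port, pool, and snapshot names as placeholders, and suitable only on a trusted network since the stream travels in plaintext:

```shell
# On the receiver, run first: listen on TCP 9090 with a 1 GB buffer.
mbuffer -s 128k -m 1G -I 9090 | zfs recv -vFd backuppool

# On the sender: stream the snapshot into the receiver's buffer.
zfs send tank/fs@today | mbuffer -s 128k -m 1G -O backuphost:9090
```

The large buffer on each end also smooths out the bursty writes of zfs recv, which is why mbuffer often beats a raw netcat pipe as well. This needs live pools and mbuffer on both hosts, so it is shown untested.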
Re: [zfs-discuss] ZFS for iSCSI based SAN
On Fri, Jun 26, 2009 at 6:04 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Fri, 26 Jun 2009, Scott Meilicke wrote: I ran the RealLife iometer profile on NFS-based storage (vs. SW iSCSI), and got nearly identical results to having the disks on iSCSI. Both of them are using TCP to access the server. So it appears NFS is doing syncs, while iSCSI is not (see my earlier zpool iostat data for iSCSI). Isn't this what we expect, because NFS does syncs, while iSCSI does not (assumed)?
If iSCSI does not do syncs (presumably it should when a cache flush is requested), then NFS is safer in case the server crashes and reboots. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
I'll chime in here, as I've had experience with this subject as well (ZFS NFS/iSCSI). It depends on your NFS client! I was using the FreeBSD NFSv3 client, which by default does an fsync() for every NFS block (8KB, afaik). However, I changed the source and recompiled it so it would only fsync() on file close, or I believe after 5MB. I went from 3MB/sec to over 100MB/sec after my change. I detailed my struggle here: http://www.brentrjones.com/?p=29 As for iSCSI, I am currently benchmarking the COMSTAR iSCSI target. I previously used the old iscsitgtd framework with ZFS and would get about 35-40MB/sec. My initial testing with the new COMSTAR iSCSI target is not revealing any substantial performance increase at all. I've tried zvol-based LUs and file-based LUs with no perceived performance difference. The iSCSI target is an X4540, 64GB RAM, and 48x 1TB disks configured as 8 vdevs of 5-6 disks each. No SSD; ZIL enabled. My NFS performance is now over 100MB/sec, and I can get over 100MB/sec with CIFS as well.
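A quick way to see whether per-write sync behavior is what separates two transports is to compare an asynchronous and a synchronous write from the client side. A sketch; the target path is a placeholder for a file on the mount under test, and oflag=dsync is a GNU dd flag (stock Solaris dd lacks it):

```shell
#!/bin/sh
# Rough probe of async vs. per-write-sync throughput to one file.
# TARGET is a placeholder -- point it at the NFS or iSCSI mount being
# tested; it defaults to a local path only so the sketch is runnable.
TARGET=${1:-/tmp/syncprobe.dat}

# Async: the client/page cache absorbs the 8k writes.
dd if=/dev/zero of="$TARGET" bs=8k count=1000

# Sync: every 8k write must be committed before the next one starts,
# which is what a per-write fsync() NFS client effectively imposes.
dd if=/dev/zero of="$TARGET" bs=8k count=1000 oflag=dsync
```

A large gap between the two numbers on NFS but not on iSCSI would support the "NFS syncs, iSCSI doesn't" reading of the iometer results above.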
However, my iSCSI performance is still rather low for the hardware. It is a standard GigE network, currently jumbo frames are disabled, when I get some time I may make a VLAN with jumbo frames enabled and see if that changes anything at all (not likely). I am CC'ing the storage-discuss group as well for coverage as this covers ZFS, and storage. If anyone has some thoughts, code, or tests, I can run them on my X4540's and see how it goes. Thanks -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
After examining the dump we got from you (thanks again), we're relatively sure you are hitting 6826836 Deadlock possible in dmu_object_reclaim() This was introduced in nv_111 and fixed in nv_113. Sorry for the trouble. -tim Do you know when new builds will show up on pkg.opensolaris.org/dev ? -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Sun, Jun 7, 2009 at 3:50 AM, Ian Collinsi...@ianshome.com wrote: Ian Collins wrote: Tim Haley wrote: Brent Jones wrote: On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system, at init 6, the system will just hang trying to unmount the file systems. I have to physically cut power to the server, but a couple days later, this issue will occur again. A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L. I tried a reboot -d (I even had kmem-flags=0xf set), but it did hang. I didn't try savecore. One thing I didn't try was scat on the running system. What should I look for (with scat) if this happens again? I now have a system with a hanging zfs receive, any hints on debugging it? -- Ian. I haven't figured out a way to identify the problem, still trying to find a 100% way to reproduce this problem. Seemingly the more snapshots I send at a given time, the likelihood of this happening goes up, but, correlation is not causation :) I might try to open a support case with Sun (have a support contract), but Opensolaris doesn't seem to be well understood by the support folks yet, so not sure how far it will get. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
I haven't figured out a way to identify the problem, still trying to find a 100% way to reproduce this problem. Seemingly the more snapshots I send at a given time, the likelihood of this happening goes up, but, correlation is not causation :) I might try to open a support case with Sun (have a support contract), but Opensolaris doesn't seem to be well understood by the support folks yet, so not sure how far it will get. -- Brent Jones br...@servuhome.net I can reproduce this 100% by sending about 6 or more snapshots at once. Here is some output that JBK helped me put together: Here is a pastebin 'mdb' findstack output: http://pastebin.com/m4751b08c Not sure what I'm looking at, but maybe someone at Sun can see whats going on? -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Mon, Jun 8, 2009 at 9:38 PM, Richard Lowerichl...@richlowe.net wrote: Brent Jones br...@servuhome.net writes: I've had similar issues with similar traces. I think you're waiting on a transaction that's never going to come. I thought at the time that I was hitting: CR 6367701 hang because tx_state_t is inconsistent But given the rash of reports here, it seems perhaps this is something different. I, like you, hit it when sending snapshots, it seems (in my case) to be specific to incremental streams, rather than full streams, I can send seemingly any number of full streams, but incremental sends via send -i, or send -R of datasets with multiple snapshots, will get into a state like that above. -- Rich For now, back to snv_106 (the most stable build that I've seen, like it a lot) I'll open a case in the morning, and see what they suggest. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
Hello all, I had been running snv_106 for about 3 or 4 months on a pair of X4540's. I would ship snapshots from the primary server to the secondary server nightly, which was working really well. However, I have upgraded to 2009.06, and my replication scripts appear to hang when performing zfs send/recv. When one zfs send/recv process hangs, you cannot send any other snapshots from any other filesystem to the remote host. I have about 20 file systems that I snapshot and replicate nightly. The script I use to perform the snapshots is here: http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh On the remote side, I end up with many hung processes, like this:
bjones 11676 11661 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
bjones 11673 11660 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
bjones 11664 11653 0 01:30:03 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
bjones 13727 13722 0 14:21:20 ? 0:00 /sbin/zfs recv -vFd pdxfilu02
And so on, one for each file system. On the receiving end, 'zfs list' shows one filesystem attempting to receive a snapshot, but I cannot stop it:
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
pdxfilu02/data/fs01/%20090605-00:30:00 1.74G 27.2T 208G /pdxfilu02/data/fs01/%20090605-00:30:00
On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system; at init 6, the system will just hang trying to unmount the file systems. I have to physically cut power to the server, and a couple of days later this issue will occur again. If I boot to my snv_106 BE, everything works fine; this issue has never occurred on that version. Any thoughts? -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] [storage-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 2:28 PM, Mike La Spina mike.lasp...@laspina.ca wrote: Hi, I have replications between hosts and they are working fine with zfs send/recv's after upgrading to Indiana snv_111b (2009.06). Have you run the commands manually to see if any messages/prompts are occurring? It sounds like it's waiting for some input. Regards, Mike http://blog.laspina.ca/ -- This message posted from opensolaris.org ___ storage-discuss mailing list storage-disc...@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/storage-discuss If I power cycle the server, I can run the replication script manually. The script will go automatically again for another night or two before hanging up. I've piped all output to a file, and there isn't any prompt for user input, and the zfs receive on the remote side is un-killable (and hangs the server when trying to restart). It appears to be the receiving end choking on a snapshot, and not allowing any more to run. Once one snapshot freezes, running another (for a different file system) zfs send/recv will just stall, with another un-killable zfs receive. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] [storage-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 2:49 PM, Rick Romero r...@havokmon.com wrote: On Fri, 2009-06-05 at 14:45 -0700, Brent Jones wrote: [snip] Is it the version of ZFS? I think it was upgraded. I noticed something similar after upgrading ZFS on FreeBSD 7 STABLE. I was trying to zfs send my @Tuesday, and an automatic script ran (which deletes @Tuesday and takes a new snap) - and rather than failing as I expected, the destroy and snapshot commands hung around until the send was done (hosed up my incrementals - doh :) Rick Running the latest version of ZFS on all my file systems. My replication script adds a user property to the file system, to effectively lock it. My cleanup scripts check for that lock flag, and will die if they see it set. It's the send/receive that is hung up; I see the pending receive still sitting there, more than 24 hours later.
Sad -- Brent Jones br...@servuhome.net
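The lock flag Brent describes can be sketched with ZFS user properties. This is a hypothetical reconstruction, not his actual script: the property name com.example:replock and the helper names are assumptions.

```shell
# Hypothetical sketch of the user-property lock described above.
# The property name "com.example:replock" is an assumption, not from the real script.

lock_fs() {
    # mark the filesystem as being replicated
    zfs set com.example:replock=locked "$1"
}

unlock_fs() {
    # "zfs inherit" clears a locally-set user property
    zfs inherit com.example:replock "$1"
}

is_locked() {
    # -H -o value prints just the property value, "-" if unset
    [ "$(zfs get -H -o value com.example:replock "$1" 2>/dev/null)" = "locked" ]
}
```

A cleanup job would call is_locked on its target filesystem and bail out while the flag is set, which is exactly the dependency check the script relies on.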
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 3:25 PM, Ian Collins i...@ianshome.com wrote: Brent Jones wrote: On the sending side, I CAN kill the ZFS send process, but the remote side leaves its processes going, and I CANNOT kill -9 them. I also cannot reboot the receiving system; at init 6, the system will just hang trying to unmount the file systems. I have to physically cut power to the server, but a couple days later, this issue will occur again. I have seen this on Solaris 10. Something appears to break with a pool or filesystem causing zfs receive to hang in the kernel. Once this happens, any zfs command that changes the state of the pool/filesystem will hang, including a zpool detach or an init 6. Can you get truss -p or mdb -p to work on the stuck process? -- Ian. I cannot. # truss -p 11308 truss: unanticipated system error: 11308 (r...@pdxfilu02)-(06:29 PM Fri Jun 05)-(log) # mdb -p 11308 mdb: cannot debug 11308: unanticipated system error mdb: failed to initialize target: No such file or directory All the hung zfs receive PIDs have '1' as their PPID. Is it safe to truss PID 1? :) When you saw this, how did you escape it? I've found only pulling the plug will fix it. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley tim.ha...@sun.com wrote: Brent Jones wrote: [snip] A crash dump from the receiving server with the stuck receives would be highly useful, if you can get it. Reboot -d would be best, but it might just hang. You can try savecore -L. -tim
I'm doing a savecore -L, but I have 64GB of ram, which makes the dumps a pita to work with. Is there any additional information I can provide? -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] ZFS snapshot send/recv hangs X4540 servers
On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley tim.ha...@sun.com wrote: [snip]
Well, I think I found a specific file system that is causing this. I kicked off a zpool scrub to see if there might be corruption on either end, but that takes well over 40 hours on these servers. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] zfs scheduled replication script?
On Sat, Mar 28, 2009 at 5:40 PM, Fajar A. Nugraha fa...@fajar.net wrote: On Sun, Mar 29, 2009 at 3:40 AM, Brent Jones br...@servuhome.net wrote: I have since modified some scripts out there, and rolled them into my own, you can see it here at pastebin.com: http://pastebin.com/m3871e478 Thanks Brent. Your script seems to handle failed replication and locking pretty well. It doesn't seem to log WHY the replication failed though, so I think there should be something that captures stderr on line 91. One more question, is there anything on that script that requires ksh? A quick glance seems to indicate that it will work with bash as well. Regards, Fajar I'll see about capturing from stderr, something I should've added anyway. It would probably work under bash too, but some of the case checking came from the original Sun scripts, which were in ksh. I looked it up, and bash has -z string checking, so everything 'should' work under bash. I'll test on Monday. I'd love to see others improve on the original Sun ones, or parts of mine... I only say that because the dependency checking and the locking were something that I didn't see in anyone else's scripts. I did it a pretty lame way I'm sure; hopefully someone can find a better way :) CC:ing opensolaris-discuss, as others are probably asking similar questions about snapshot replication. -- Brent Jones br...@servuhome.net
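One simple way to capture the "WHY" on failure is to funnel each command's stderr into the script's log. The run_logged helper and the LOGFILE variable below are illustrative, not part of the posted script:

```shell
# Illustrative helper: run a command, appending its stderr to a log file.
# LOGFILE is an assumed variable; the real script would set its own path.

run_logged() {
    "$@" 2>>"${LOGFILE:-/tmp/replicate.log}"
}
```

For example, wrapping the receive as `run_logged /sbin/zfs recv -vFd pool` means a failed receive leaves its error text in the log instead of vanishing.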
Re: [zfs-discuss] Notations in zpool status
On Sun, Mar 29, 2009 at 8:54 AM, Harry Putnam rea...@newsguy.com wrote: Is there some handy way to make notations about zpools? Something that would show up in the output of `zpool status' (or some other command). I mean descriptive notes, maybe outlining the zpool's purpose? Browsing around in `man zpool' I don't see that, but may be overlooking it. The man page is near 1000 lines. You can add user properties to file systems, but afaik they would not show up in zpool status. For example: zfs set note:purpose=This file system is important zfs get note:purpose somefilesystem Maybe that helps... -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Can this be done?
On Sun, Mar 29, 2009 at 1:37 PM, Michael Shadle mike...@gmail.com wrote: On Sun, Mar 29, 2009 at 10:35 AM, David Magda dma...@ee.ryerson.ca wrote: Create new pool, move data to it (zfs send/recv), destroy old RAID-Z1 pool. Would send/recv be more efficient than just a massive rsync or related? Also I'd have to reduce the data on my existing raidz1 as it is almost full, and the raidz2 it would be sending to would technically be 1.5TB smaller. I'd personally say send/recv would be more efficient; rsync is awfully slow on large data sets. But it depends on what build you are using! BugID 6418042 (slow zfs send/recv) was fixed in build 105. It impacted send/recv operations local to remote; not sure if it happens local to local, but I experienced it doing local-remote send/recv. Not sure the best way to handle moving data around when space is tight, though... -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] zfs scheduled replication script?
On Sat, Mar 28, 2009 at 11:20 AM, Fajar A. Nugraha fa...@fajar.net wrote: I have a backup system using zfs send/receive (I know there are pros and cons to that, but it's suitable for what I need). What I have now is a script which runs daily, does a zfs send, compresses and writes it to a file, then transfers it with ftp to a remote host. It does a full backup every 1st, and an incremental (with the 1st as reference) after that. It works, but is not quite resource-effective (for example, the full backup every month, and the big size of the incremental backup on the 30th). I'm thinking of changing it to a script which can automate replication of a zfs pool or filesystem via zfs send/receive to a remote host (via ssh or whatever). It should be smart enough to choose between full and incremental, and choose which snapshot to base the incremental stream from (in case a scheduled incremental is missed), preferably able to use snapshots created by the zfs/auto-snapshot smf service. To prevent re-inventing the wheel, does such a script exist already? I prefer not to use AVS as I can't use it on an existing zfs pool. Regards, Fajar The ZFS automatic snapshot tools have the ability to execute any command after a snapshot has taken place, such as a ZFS send. However, I ran into some issues, as it wasn't strictly designed for that and didn't handle errors very well. I have since modified some scripts out there, and rolled them into my own; you can see it here at pastebin.com: http://pastebin.com/m3871e478 Inspiration: http://blogs.sun.com/constantin/entry/zfs_replicator_script_new_edition http://blogs.sun.com/timf/en_IE/entry/zfs_automatic_snapshots_in_nv Those are some good resources; from that, you can make something work that is tailored to your environment.
-- Brent Jones br...@servuhome.net
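The "smart enough to choose between full and incremental" logic Fajar asks for boils down to finding the newest snapshot both sides have in common. A sketch of that decision, working on newline-separated snapshot lists as `zfs list -t snapshot -o name -s creation` would produce (function names here are made up for illustration):

```shell
# Sketch: decide what to send, given snapshot name lists for source and
# destination, each sorted oldest-to-newest, one name per line.

latest_common() {
    # newest snapshot present in both lists ($1 = source, $2 = destination)
    printf '%s\n' "$1" | while IFS= read -r snap; do
        printf '%s\n' "$2" | grep -Fqx "$snap" && printf '%s\n' "$snap"
    done | tail -n 1
}

plan_send() {
    base=$(latest_common "$1" "$2")
    latest=$(printf '%s\n' "$1" | tail -n 1)
    if [ -z "$base" ]; then
        echo "full $latest"              # no common ancestor: full stream
    elif [ "$base" = "$latest" ]; then
        echo "up-to-date"                # nothing new to send
    else
        echo "incremental $base $latest" # i.e. zfs send -i $base $latest
    fi
}
```

Choosing the newest common snapshot as the base is what makes a missed schedule harmless: the next run simply sends a larger incremental from wherever the destination actually is.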
[zfs-discuss] Mount ZFS hangs on boot
Hello, I have an X4540 running 2008.11 snv_106. I rebooted it tonight since I had a hung iSCSI connection to the Sun box that wouldn't go away (couldn't delete that particular ZFS filesystem till the initiator drops the connection). Upon reboot, the system will hang after printing the license header. I then rebooted again with '-m milestone=none' and then went to single user mode after that. It appears system/filesystem/usr:default is the service that is hanging during boot. I've looked at 'zpool iostat' and there is a substantial amount of IO happening after you initiate system/filesystem/usr:default, but it will sit there for quite some time and not start. I am unsure what it's doing; zpool status shows 0 errors, all devices normal (46 drives, in 5-6 disk RAIDZ groups). I also loaded arcstat.pl (saw it floating around here) and it's showing ~230 ops/sec, with 100% arc misses. Whatever it's doing, the load is very random I/O, and heavy, but little progress appears to be happening. I only have about 50 filesystems, and just a handful of snapshots for each filesystem. Thanks! -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Mount ZFS hangs on boot
On Wed, Mar 18, 2009 at 11:28 AM, Miles Nordin car...@ivy.net wrote: bj == Brent Jones br...@servuhome.net writes: bj I only have about 50 filesystems, and just a handful of bj snapshots for each filesystem. there were earlier stories of people who had imports taking hours to complete with no feedback because ZFS was rolling forward some partly-completed operation interrupted by the crash, like destroying a snapshot or something. maybe you should just wait. Wait I did, and it did finally come up. A partially completed operation may make sense, as when the iSCSI target was blocked due to a Windows box hanging and the connection not letting go, a ZFS destroy on that pool never did complete. So maybe it tried to finish that action. A mystery for sure, but it's up and working now. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] zfs streams data corruption
On Tue, Feb 24, 2009 at 10:41 AM, Christopher Mera cm...@reliantsec.net wrote: Either way - it would be ideal to quiesce the system before a snapshot anyway, no? My next question now is what particular steps would be recommended to quiesce a system for the clone/zfs stream that I'm looking to achieve... All your help is appreciated. Regards, Christopher Mera -Original Message- From: Mattias Pantzare [mailto:pantz...@gmail.com] Sent: Tuesday, February 24, 2009 1:38 PM To: Nicolas Williams Cc: Christopher Mera; zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] zfs streams data corruption On Tue, Feb 24, 2009 at 19:18, Nicolas Williams nicolas.willi...@sun.com wrote: On Mon, Feb 23, 2009 at 10:05:31AM -0800, Christopher Mera wrote: I recently read up on Scott Dickson's blog with his solution for jumpstart/flashless cloning of ZFS root filesystem boxes. I have to say that it initially looks to work out cleanly, but of course there are kinks to be worked out that deal with auto mounting filesystems mostly. The issue that I'm having is that a few days after these cloned systems are brought up and reconfigured they are crashing and svc.configd refuses to start. When you snapshot a ZFS filesystem you get just that -- a snapshot at the filesystem level. That does not mean you get a snapshot at the _application_ level. Now, svc.configd is a daemon that keeps a SQLite2 database. If you snapshot the filesystem in the middle of a SQLite2 transaction you won't get the behavior that you want. In other words: quiesce your system before you snapshot its root filesystem for the purpose of replicating that root on other systems. That would be a bug in ZFS or SQLite2. A snapshot should be an atomic operation. The effect should be the same as a power failure in the middle of a transaction, and decent databases can cope with that.
If you are writing a script to handle ZFS snapshots/backups, you could issue an SMF command to stop the service before taking the snapshot. Or at the very minimum, perform an SQL dump of the DB so you at least have a consistent full copy of the DB as a flat file in case you can't stop the DB service. -- Brent Jones br...@servuhome.net
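The stop-service-then-snapshot idea reads naturally as a small wrapper. This is a sketch only; the SMF FMRI and dataset names you would pass in are placeholders:

```shell
# Sketch: quiesce an SMF-managed service around a snapshot.
# The FMRI and dataset passed to this function are placeholders.

quiesced_snapshot() {
    svc="$1"     # SMF FMRI of the service to pause, e.g. svc:/hypothetical/db:default
    snap="$2"    # dataset@snapname to create
    svcadm disable -st "$svc"   # -s waits until the service is actually down, -t is temporary
    zfs snapshot "$snap"
    svcadm enable "$svc"
}
```

Because a ZFS snapshot is effectively instantaneous, the service is only down for the moment between the disable completing and the snapshot returning, which keeps the quiesce window small.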
Re: [zfs-discuss] zfs streams data corruption
On Tue, Feb 24, 2009 at 11:32 AM, Christopher Mera cm...@reliantsec.net wrote: Thanks for your responses.. Brent: And I'd have to do that for every system that I'd want to clone? There must be a simpler way.. perhaps I'm missing something. Regards, Chris Well, unless the database software itself can notice a snapshot taking place, flush all data to disk, pause transactions until the snapshot is finished, then properly resume, I don't know what to tell you. It's an issue for all databases, Oracle, MSSQL, MySQL... how to do an atomic backup without stopping transactions while maintaining consistency. Replication is one possible solution, dumping to a file periodically is another, or just tolerating that your database will not be consistent after a snapshot and having to replay logs / consistency check it after bringing it up from a snapshot. Once you figure that out in a filesystem-agnostic way, you'll be a wealthy person indeed. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] 'zfs recv' is very slow
On Mon, Feb 2, 2009 at 6:55 AM, Robert Milkowski mi...@task.gda.pl wrote: It definitely does. I made some tests today comparing b101 with b105 while doing 'zfs send -R -I A B /dev/null' with several dozen snapshots between A and B. Well, b105 is almost 5x faster in my case - that's pretty good. -- Robert Milkowski http://milek.blogspot.com Sad to report that I am seeing the slow zfs recv issue cropping up again while running b105 :( Not sure what has triggered the change, but I am seeing the same behavior again: massive amounts of reads on the receiving side, while only receiving just tiny bursts of data amounting to a mere megabyte a second. It doesn't seem to happen every single time, though, which is odd, but I can provoke it by destroying a snapshot from the pool I am sending, then taking another snapshot and re-sending it. It seems to cause the receiving side to go into this read storm before any data is transferred. I'm going to open a case in the morning, and see if I can't get an engineer to look at this. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Write caches on X4540
On Wed, Feb 11, 2009 at 2:13 PM, Greg Mason gma...@msu.edu wrote: We're using some X4540s, with OpenSolaris 2008.11. According to my testing, to optimize our systems for our specific workload, I've determined that we get the best performance with the write cache disabled on every disk, and with zfs:zfs_nocacheflush=1 set in /etc/system. The only issue is setting the write cache permanently, or at least quickly. Right now, as it is, I've scripted up format to run on boot, disabling the write cache of all disks. This takes around two minutes. I'd like to avoid needing to take this time on every bootup (which is more often than you'd think, we've got quite a bit of construction happening, which necessitates bringing everything down periodically). This would also be painful in the event of unplanned downtime for one of our Thors. So, basically, my question is: Is there a way to quickly or permanently disable the write cache on every disk in an X4540? Thanks, -Greg We use several X4540's over here as well; what type of workload do you have, and how much performance increase did you see by disabling the write caches? -- Brent Jones br...@servuhome.net
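For reference, Greg's boot-time script presumably drives format's expert mode non-interactively, something along these lines. The exact sequence of menu responses below is an assumption and should be verified by hand on one disk before looping over all 48, since a wrong answer inside format can be destructive:

```shell
# Sketch only: feed "format -e" the cache-menu answers to turn off one disk's
# write cache. The response sequence is an assumption; confirm it interactively
# on a single disk before trusting it in a boot script.

disable_wcache() {
    printf 'cache\nwrite_cache\ndisable\ny\nquit\nquit\nquit\n' \
        | format -e -d "$1" >/dev/null 2>&1
}
```

Running disable_wcache for each disk, backgrounded with `&` and followed by a `wait`, is one way to shrink the two-minute serial boot cost Greg describes.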
Re: [zfs-discuss] Data loss bug - sidelined??
Could this be related to the ZFS TXG/transaction group buffers? i.e. it'll buffer writes for a bit before committing to disk. Then, when it's time to commit to disk, it realizes the disk is failed, and from then enters those failmode conditions (wait, continue, panic, ?). Could this be the case? http://blogs.sun.com/roch/date/20080514 -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Unusual CIFS write bursts
On Tue, Jan 27, 2009 at 5:47 PM, Richard Elling richard.ell...@gmail.com wrote: comment far below... Brent Jones wrote: On Mon, Jan 26, 2009 at 10:40 PM, Brent Jones br...@servuhome.net wrote: -- Brent Jones br...@servuhome.net I found some insight into the behavior at this Sun blog by Roch Bourbonnais: http://blogs.sun.com/roch/date/20080514 Excerpt from the section that I seem to have encountered: The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory. So, when I fill up that transaction group buffer, that is when I see that 4-5 second I/O burst of several hundred megabytes per second. He also documents that the buffer flush can, and does, issue delays to the writing threads, which is why I'm seeing those momentary drops in throughput and sluggish system performance while that write buffer is flushed to disk. Yes, this tends to be more efficient. You can tune it by setting zfs_txg_synctime, which is 5 by default. It is rare that we've seen this be a win, which is why we don't mention it in the Evil Tuning Guide. Wish there was a better way to handle that, but at the speed I'm writing (and I'll be getting a 10GigE link soon), I don't see any other graceful methods of handling that much data in a buffer. I think your workload might change dramatically when you get a faster pipe. So unless you really feel compelled to change it, I wouldn't suggest changing it. -- richard Loving these X4540's so far though... Are there any additional tuneables, such as opening a new txg buffer before the previous one is flushed? Or otherwise allowing writes to continue without the tick delay?
My workload will be pretty consistent; it is going to serve a few roles, which I hope to accomplish in the same units: - large scale backups - cifs share for Windows app servers - nfs server for unix app servers GigE quickly became the bottleneck, and I imagine 10GigE will add further stress to those write buffers. -- Brent Jones br...@servuhome.net
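For anyone wanting to experiment with the knob Richard mentions, it is an /etc/system tunable (taking effect at the next reboot). The value below is only an example to benchmark against, not a recommendation; as Richard notes, it is rarely a win:

```
* /etc/system fragment (example value only -- benchmark before keeping it).
* zfs_txg_synctime defaults to 5 seconds per the discussion above; a smaller
* value trades peak throughput for smaller, more frequent txg flushes.
set zfs:zfs_txg_synctime = 2
```

Smaller txg windows mean the write-throttle delays hit writers more often but for shorter bursts, which may or may not suit a given workload.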
[zfs-discuss] Unusual CIFS write bursts
[...]
c4t7d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0
c7t7d0 ONLINE 0 0 0
c8t7d0 ONLINE 0 0 0
c9t7d0 ONLINE 0 0 0
spares
  c6t2d0 AVAIL
  c7t3d0 AVAIL
  c8t4d0 AVAIL
  c9t5d0 AVAIL
-- Brent Jones br...@servuhome.net
Re: [zfs-discuss] 'zfs recv' is very slow
On Fri, Jan 9, 2009 at 11:41 PM, Brent Jones br...@servuhome.net wrote: [snip] Just to update this, hope no one is tired of hearing about it. I just image-updated to snv_105 to obtain the patch for CR 6418042 at the recommendation of a Sun support technician. My results are much improved, on the order of 5-100 times faster (either over Mbuffer or SSH). Not only do snapshots begin sending right away (no longer requiring several minutes of reads before sending any data), the actual send will sustain about 35-50MB/sec over SSH, and up to 100MB/s via Mbuffer (on a single Gbit link, I am network limited now, something I never thought I would say I love to see!).
Previously, I was lucky if the snapshot would begin sending any data after about 10 minutes, and once it did begin sending, it would usually peak at about 1MB/sec via SSH, and up to 20MB/sec over Mbuffer. Mbuffer seems to play a much larger role now, as SSH appears to only be single threaded for compression/encryption, pegging a single CPU's worth of power. Mbuffer's raw network performance saturates my Gigabit link, making me consider link bonding or something to see how fast it -really- can go, now that the taps are open! So, my issue appears pretty much resolved; although snv_105 is in the /dev branch, things appear stable for the most part. Please let me know if you have any questions, or want additional info on my setup and testing. Regards, -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] ZFS + OpenSolaris for home NAS?
On Sat, Jan 17, 2009 at 2:46 PM, JZ j...@excelsioritsolutions.com wrote: I don't know if this email is even relevant to the list discussion. I will leave that conclusion to the smart mail server policy here. *cough* -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] 'zfs recv' is very slow
On Fri, Jan 9, 2009 at 7:53 PM, Ian Collins i...@ianshome.com wrote: Ian Collins wrote: Send/receive speeds appear to be very data dependent. I have several different filesystems containing differing data types. The slowest to replicate is mail, and my guess is it's the changes to the index files that takes the time. Similar sized filesystems with similar deltas, where files are mainly added or deleted, appear to replicate faster. Has anyone investigated this? I have been replicating a server today and the differences between incremental processing is huge, for example: filesystem A: received 1.19Gb stream in 52 seconds (23.4Mb/sec) filesystem B: received 729Mb stream in 4564 seconds (164Kb/sec) I can delve further into the content if anyone is interested. -- Ian. What hardware, to/from is this? How are those filesystems laid out; what is their total size, used space, and guessable file count / file size distribution? I'm also trying to put together the puzzle to provide more detail to a case I opened with Sun regarding this. -- Brent Jones br...@servuhome.net
Re: [zfs-discuss] 'zfs recv' is very slow
On Wed, Jan 7, 2009 at 12:36 AM, Andrew Gabriel andrew.gabr...@sun.com wrote: Brent Jones wrote: Reviving an old discussion, but has the core issue been addressed in regards to zfs send/recv performance issues? I'm not able to find any new bug reports on bugs.opensolaris.org related to this, but my search kung-fu may be weak. I raised: CR 6729347 Poor zfs receive performance across networks (Seems to still be in the Dispatched state nearly half a year later.) This relates mainly to full archives, and is most obvious when the disk throughput is the same order of magnitude as the network throughput. (It becomes less obvious if one is significantly different from the other, either way around.) There appears to be an additional problem for incrementals, which spend long periods sending almost no data at all (I presume this is when zfs send is searching for changed blocks to send). I don't know off-hand of a bugid for this. Using mbuffer can speed it up dramatically, but this seems like a hack without addressing a real problem with zfs send/recv. I don't think it's a hack, but something along these lines should be more properly integrated into the zfs receive command or documented. Trying to send any meaningful sized snapshots from, say, an X4540 takes up to 24 hours, for as little as 300GB of change rate. Are those incrementals from a much larger filesystem? If so, that's probably mainly the other problem. Yah, the incrementals are from a 30TB volume, with about 1TB used. Watching iostat on each side during the incremental sends, the sender side is hardly doing anything, maybe 50 iops of reads, and that could be from other machines accessing it; really light load. The receiving side, however, peaks around 1500 read iops with no writes. It will do that for 3-5 minutes, then it will calm down and only read sporadically, and write about 1MB/sec. Using Mbuffer can get the writes to spike to 20-30MB/sec, but the initial massive reads still remain.
I have yet to devise a script that starts an mbuffer + zfs recv on the receiving side with the proper parameters, then starts an mbuffer + zfs send on the sending side, but I may work on one later this week. I'd like the snapshots to be sent every 15 minutes, just to keep the amount of change that needs to be sent as low as possible. Not sure if it's worth opening a case with Sun since we have a support contract... -- Andrew -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
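For anyone attempting the same kind of wrapper, here is a minimal dry-run sketch. All pool, dataset, host names, and the port number are hypothetical, and the script only prints the two command lines (receiver first, then sender) rather than executing them, since mbuffer buffer sizes and block sizes will need per-site tuning:

```shell
#!/bin/sh
# Hypothetical names -- adjust for your site.
SNAP_OLD='pool/fs@auto-0330'
SNAP_NEW='pool/fs@auto-0345'
DST_HOST='pdxfilu02.example.com'
DST_POOL='backup'
PORT=9090

# Receiver (start on $DST_HOST first): mbuffer listens on the port
# and feeds the stream into zfs recv.
RECV_CMD="mbuffer -I ${PORT} -s 128k -m 256M | zfs recv -F -d ${DST_POOL}"

# Sender: incremental zfs send piped through mbuffer to the receiver.
SEND_CMD="zfs send -i ${SNAP_OLD} ${SNAP_NEW} | mbuffer -O ${DST_HOST}:${PORT} -s 128k -m 256M"

# Dry run: print the commands instead of running them.
echo "on receiver: ${RECV_CMD}"
echo "on sender:   ${SEND_CMD}"
```

The ordering matters: the listening mbuffer must be up before the sender connects, which is exactly the sequencing problem a wrapper script has to solve (e.g. by starting the receiver over ssh and polling the port).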
Re: [zfs-discuss] ZFS send fails incremental snapshot
On Mon, Jan 5, 2009 at 4:29 PM, Brent Jones br...@servuhome.net wrote: On Mon, Jan 5, 2009 at 2:50 PM, Richard Elling richard.ell...@sun.com wrote: Correlation question below... Brent Jones wrote: On Sun, Jan 4, 2009 at 11:33 PM, Carsten Aulbert carsten.aulb...@aei.mpg.de wrote: Hi Brent, Brent Jones wrote: I am using 2008.11 with the Timeslider automatic snapshots, and using it to automatically send snapshots to a remote host every 15 minutes. Both sides are X4540's, with the remote filesystem mounted read-only as I read earlier that would cause problems. The snapshots send fine for several days, I accumulate many snapshots at regular intervals, and they are sent without any problems. Then I will get the dreaded: cannot receive incremental stream: most recent snapshot of pdxfilu02 does not match incremental source Which command line are you using? Maybe you need to do a rollback first (zfs receive -F)? Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I am using a command similar to this: zfs send -i pdxfilu01/arch...@zfs-auto-snap:frequent-2009-01-04-03:30 pdxfilu01/arch...@zfs-auto-snap:frequent-2009-01-04-03:45 | ssh -c blowfish u...@host.com /sbin/zfs recv -d pdxfilu02 It normally works, then after some time it will stop. It is still doing a full snapshot replication at this time (very slowly it seems; I'm bitten by the bug of slow zfs send/recv). Once I get back on my regular snapshotting, if it comes out of sync again, I'll try doing a -F rollback and see if that helps. When this gets slow, are the other snapshot-related commands also slow? For example, normally I see zfs list -t snapshot completing in a few seconds, but sometimes it takes minutes? -- richard I'm not seeing zfs related commands any slower. On the remote side, it builds up thousands of snapshots, and aside from SSH scrolling as fast as it can over the network, no other slowness. 
But the actual send and receive is getting very, very slow, almost to the point of needing to scrap the project and find some other way to ship data around! -- Brent Jones br...@servuhome.net Got a small update on the ZFS send: I am in fact seeing zfs list taking several minutes to complete. I must have timed it correctly during the send, and both sides are not completing the zfs list, and it's been about 5 minutes already. There is a small amount of network traffic between the two hosts, so maybe it's comparing what needs to be sent, not sure. I'll update when/if it completes. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'zfs recv' is very slow
On Sat, Dec 6, 2008 at 11:40 AM, Ian Collins i...@ianshome.com wrote: Richard Elling wrote: Ian Collins wrote: Ian Collins wrote: Andrew Gabriel wrote: Ian Collins wrote: I've just finished a small application to couple zfs_send and zfs_receive through a socket to remove ssh from the equation, and the speed-up is better than 2x. I have a small (140K) buffer on the sending side to ensure the minimum number of sent packets. The times I get for 3.1GB of data (b101 ISO and some smaller files) to a modest mirror at the receive end are: 1m36s for cp over NFS, 2m48s for zfs send through ssh and 1m14s through a socket. So the best speed is equivalent to 42MB/s. It would be interesting to try putting a buffer (5 x 42MB = 210MB initial stab) at the recv side and see if you get any improvement. It took a while... I was able to get about 47MB/s with a 256MB circular input buffer. I think that's about as fast as it can go; the buffer fills, so receive processing is the bottleneck. Bonnie++ shows the pool (a mirror) block write speed is 58MB/s. When I reverse the transfer to the faster box, the rate drops to 35MB/s with neither the send nor receive buffer filling. So send processing appears to be the limit in this case. Those rates are what I would expect writing to a single disk. How is the pool configured? The slow system has a single mirror pool of two SATA drives, the faster one a stripe of 4 mirrors and an IDE SD boot drive. ZFS send through ssh from the slow to the fast box takes 189 seconds, the direct socket connection send takes 82 seconds. -- Ian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Reviving an old discussion, but has the core issue been addressed with regard to zfs send/recv performance issues? I'm not able to find any new bug reports on bugs.opensolaris.org related to this, but my search kung-fu may be weak. 
Using mbuffer can speed it up dramatically, but this seems like a hack without addressing a real problem with zfs send/recv. Trying to send any meaningfully sized snapshots from, say, an X4540 takes up to 24 hours, for as little as 300GB of change. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
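Ian's socket experiment above can be approximated with stock tools: a plain netcat listener feeding zfs recv takes the ssh cipher out of the path entirely. A dry-run sketch follows (all names hypothetical; the script only prints the commands). Note that such a link has no encryption or authentication, so it is only appropriate on a trusted network, and netcat flag syntax varies between implementations:

```shell
#!/bin/sh
# Hypothetical names -- no encryption on this path, trusted LANs only.
SNAP='tank/data@nightly'
DST_HOST='recv-host.example.com'
DST_POOL='backup'
PORT=3333

# Receiver (start first): nc listens and feeds zfs recv.
# (Traditional netcat syntax shown; BSD nc omits -p.)
RECV_CMD="nc -l -p ${PORT} | zfs recv -d ${DST_POOL}"
# Sender: zfs send streams straight into the socket, no ssh in the path.
SEND_CMD="zfs send ${SNAP} | nc ${DST_HOST} ${PORT}"

echo "on receiver: ${RECV_CMD}"
echo "on sender:   ${SEND_CMD}"
```

Compared with the mbuffer approach discussed elsewhere in the thread, this removes the cipher overhead but adds no buffering, so it mainly helps when CPU, not bursty disk I/O, is the bottleneck.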
Re: [zfs-discuss] ZFS send fails incremental snapshot
On Sun, Jan 4, 2009 at 11:33 PM, Carsten Aulbert carsten.aulb...@aei.mpg.de wrote: Hi Brent, Brent Jones wrote: I am using 2008.11 with the Timeslider automatic snapshots, and using it to automatically send snapshots to a remote host every 15 minutes. Both sides are X4540's, with the remote filesystem mounted read-only as I read earlier that would cause problems. The snapshots send fine for several days, I accumulate many snapshots at regular intervals, and they are sent without any problems. Then I will get the dreaded: cannot receive incremental stream: most recent snapshot of pdxfilu02 does not match incremental source Which command line are you using? Maybe you need to do a rollback first (zfs receive -F)? Cheers Carsten ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I am using a command similar to this: zfs send -i pdxfilu01/arch...@zfs-auto-snap:frequent-2009-01-04-03:30 pdxfilu01/arch...@zfs-auto-snap:frequent-2009-01-04-03:45 | ssh -c blowfish u...@host.com /sbin/zfs recv -d pdxfilu02 It normally works, then after some time it will stop. It is still doing a full snapshot replication at this time (very slowly it seems; I'm bitten by the bug of slow zfs send/recv). Once I get back on my regular snapshotting, if it comes out of sync again, I'll try doing a -F rollback and see if that helps. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X4500, snv_101a, hd and zfs [SEC=UNCLASSIFIED]
On Mon, Jan 5, 2009 at 6:55 PM, Elaine Ashton elaine.ash...@sun.com wrote: On Jan 5, 2009, at 9:33 PM, LEES, Cooper wrote: Elaine, Very bizarre problem you're having. I have no problems on either of my x4500s. One on 10u6 and one on indiana snv_101b_rc2. I agree, which is why I was hoping someone might know what the deal is. Just a straight 'hd' takes over a minute and a half. The real killer is /opt/SUNWhd/hd/bin/hdadm write_cache display all which displays all the write_cache states for each drive. This takes hours. How long does that take on your 101b system? I swear, something must be terribly amiss with this box, but I'm just not sure where to start looking. e. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I'd suggest opening a case with Sun but you ARE Sun ;p The 'hd' tools don't even work on the X4540's, and even the ILOM webgui doesn't show the drives as being installed (yet I have 48 x 1TB drives all working fine). So, at least you're able to see your drives... sorta. I -wish- I could see my drives' cache status, state, FRU, etc... :( -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS send fails incremental snapshot
Hello all, I am using 2008.11 with the Timeslider automatic snapshots, and using it to automatically send snapshots to a remote host every 15 minutes. Both sides are X4540's, with the remote filesystem mounted read-only as I read earlier that would cause problems. The snapshots send fine for several days, I accumulate many snapshots at regular intervals, and they are sent without any problems. Then I will get the dreaded: cannot receive incremental stream: most recent snapshot of pdxfilu02 does not match incremental source Manually sending does not work, nor does destroying snapshots on the remote side and resending the batch again from the earliest point in time. The only way I have found that works is to destroy the entire zfs filesystem on the remote side and begin anew. Is there a way to force a ZFS receive, or to get more information about what changed on the remote system to cause it not to accept any more snapshots? Thank you in advance -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
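The mitigation that surfaces later in this thread is twofold: keep the destination filesystem read-only so nothing (atime updates, a stray cd into the tree) can modify it between receives, and pass -F so the receive rolls the destination back to its newest snapshot before applying the incremental stream. A dry-run sketch with hypothetical dataset names (the commands are printed, not executed):

```shell
#!/bin/sh
# Hypothetical dataset names on the receive side.
DST_FS='backup/shares'
DST_POOL='backup'

# Make the receive side immutable between receives:
SETRO_CMD="zfs set readonly=on ${DST_FS}"
# -F rolls the destination back to its most recent snapshot before
# applying the stream, clearing "has been modified" failures:
RECV_CMD="zfs recv -F -d ${DST_POOL}"

echo "${SETRO_CMD}"
echo "${RECV_CMD}"
```

readonly=on does not prevent zfs recv itself from updating the dataset; it only blocks ordinary writes through the filesystem, which is exactly the class of accidental modification that breaks incremental receives.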
Re: [zfs-discuss] zfs mount hangs
, but I would really like to be able to recover it to save some time. Anything special to look for in zdb output? Any other diagnostics that would be useful? Thanks in advance! Best Regards //Magnus ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I had a similar problem, but did not run truss to find the cause as it was not a live filesystem yet. Recreating the filesystem with the same name resulted in it not mounting and just hanging, but if I created it with a different name it would mount and run perfectly fine. I settled on the new name, continued on, and have not noticed the problem again. But seeing this post, I'll capture as much data as I can if it happens again. -- Brent Jones br...@servuhome.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool cannot replace a replacing device
On Tue, Dec 9, 2008 at 8:37 AM, Courtney Malone [EMAIL PROTECTED] wrote: I have another drive on the way, which will be handy in the future, but it doesn't solve the problem that zfs won't let me manipulate that pool in a manner that will return it to a non-degraded state (even with a replacement drive or hot spare; I have already tried adding a spare), and I don't have somewhere to dump ~6TB of data and do a restore. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Did you file a bug report? If so, can you link it so we can see the resolution (if one ever comes)? -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs iscsi sustained write performance
On Mon, Dec 8, 2008 at 3:09 PM, milosz [EMAIL PROTECTED] wrote: hi all, currently having trouble with sustained write performance with my setup... ms server 2003/ms iscsi initiator 2.08 w/intel e1000g nic directly connected to snv_101 w/ intel e1000g nic. basically, given enough time, the sustained write behavior is perfectly periodic. if i copy a large file to the iscsi target, iostat reports 10 seconds or so of -no- writes to disk, just small reads... then 2-3 seconds of disk-maxed writes, during which time windows reports the write performance dropping to zero (disk queues maxed). so iostat will report something like this for each of my zpool disks (with iostat -xtc 1) 1s: %b 0 2s: %b 0 3s: %b 0 4s: %b 0 5s: %b 0 6s: %b 0 7s: %b 0 8s: %b 0 9s: %b 0 10s: %b 0 11s: %b 100 12s: %b 100 13s: %b 100 14s: %b 0 15s: %b 0 it looks like solaris hangs out caching the writes and not actually committing them to disk... when the cache gets flushed, the iscsitgt (or whatever) just stops accepting writes. this is happening across controllers and zpools. also, a test copy of a 10gb file from one zpool to another (not iscsi) yielded similar iostat results: 10 seconds of big reads from the source zpool, 2-3 seconds of big writes to the target zpool (target zpool is 5x bigger than source zpool). anyone got any ideas? point me in the right direction? thanks, milosz -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Are you running compression? I see this behavior with heavy loads and GZIP compression enabled. What does 'zfs get compression' say? -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'zfs recv' is very slow
On Fri, Nov 14, 2008 at 10:04 AM, Joerg Schilling [EMAIL PROTECTED] wrote: Andrew Gabriel [EMAIL PROTECTED] wrote: Andrew Gabriel wrote: Interesting idea, but for 7200 RPM disks (and a 1Gb ethernet link), I need a 250GB buffer (enough to buffer 4-5 seconds worth of data). That's many orders of magnitude bigger than SO_RCVBUF can go. No -- that's wrong -- should read 250MB buffer! Still some orders of magnitude bigger than SO_RCVBUF can go. It's affordable e.g. on a X4540 with 64 GB of RAM. ZFS started with constraints that could not be made true in 2001. On my first Sun at home (a Sun 2/50 with 1 MB of RAM) in 1986, I could set the socket buffer size to 63 kB. 63kB : 1 MB is the same ratio as 256 MB : 4 GB. BTW: a lot of numbers in Solaris have not grown in a long time and thus create problems now. Just think about maxphys: its 63 kB value on x86 does not even allow writing a single BluRay disk sector in a single transfer. Jörg -- EMail: [EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin [EMAIL PROTECTED] (uni) [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I'd like to see Sun's position on the speed at which large file systems perform ZFS send/receive. I expect my X4540's to nearly fill 48TB (or more considering compression), and taking 24 hours to transfer 100GB is, well, I could do better on an ISDN line from 1995. -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
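Andrew's corrected 250MB figure can be sanity-checked with simple arithmetic: seconds absorbed = buffer size / throughput. A quick sketch of the back-of-envelope calculation (the ~60MB/s disk rate is an assumption in line with the streaming rates quoted earlier in the thread):

```shell
#!/bin/sh
# Back-of-envelope: how many seconds does a 250 MB buffer absorb?
BUF_MB=250
LINK_MBPS=125    # 1 GbE line rate is roughly 125 MB/s
DISK_MBPS=60     # assumed 7200 RPM streaming rate (rough)

# Integer division is fine for an order-of-magnitude answer.
SECS_AT_LINK=$(( BUF_MB / LINK_MBPS ))
SECS_AT_DISK=$(( BUF_MB / DISK_MBPS ))

echo "250 MB buffer covers ~${SECS_AT_LINK}s at GbE line rate"
echo "250 MB buffer covers ~${SECS_AT_DISK}s at disk streaming rate"
```

At disk rate this lands on roughly 4 seconds, matching Andrew's "4-5 seconds worth of data", and both figures dwarf what SO_RCVBUF can be set to, which is the point of the thread.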
Re: [zfs-discuss] continuous replication
On Wed, Nov 12, 2008 at 5:58 PM, River Tarnell [EMAIL PROTECTED] wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Daryl Doami: As an aside, replication has been implemented as part of the new Storage 7000 family. Here's a link to a blog discussing using the 7000 Simulator running in two separate VMs and replicating w/ each other: that's interesting, although 'less than a minute later' makes me suspect they might just be using snapshots and send/recv? presumably, if fishworks is based on (Open)Solaris, any new ZFS features they created will make it back into Solaris proper eventually... - river. -BEGIN PGP SIGNATURE- iD8DBQFJG4m5IXd7fCuc5vIRAvY3AJ0dRRblJhwfA7X/s8CUU775hd3HNgCffARy x8Vryc/+Fl+a4pjJWN/KsDM= =ImHD -END PGP SIGNATURE- Yah, from what I can tell, it's just using an already-there-but-easier-to-look-at approach. Not belittling the accomplishment; rolling all the system tools into a coherent package is great, and the analytics are just awesome. I am doing a similar project, and weighed several options for replication. AVS was coveted for its near-real-time replication and ability to switch directions to replicate to the primary if you had a fail-over. But some AVS limitations[1] are probably going to make us use zfs send/receive, and it should keep up (delta per day is ~100GB). We will be testing both methods here in the next few weeks, and will keep the list posted on our findings. [1] sending drive rebuilds over the link sucks -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
On Tue, Nov 11, 2008 at 9:52 AM, Adam Leventhal [EMAIL PROTECTED] wrote: On Nov 11, 2008, at 9:38 AM, Bryan Cantrill wrote: Just to throw some ice-cold water on this: 1. It's highly unlikely that we will ever support the x4500 -- only the x4540 is a real possibility. And to warm things up a bit: there's already an upgrade path from the x4500 to the x4540 so that would be required before any upgrade to the equivalent of the Sun Storage 7210. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss We just ordered several X4540's, excited to get them in place soon. Having the Openstorage GUI as an option down the road is very appealing for our VM/hosted side, after we install this bulk storage environment. Wish I could get my hands on a beta of this GUI... -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VERY URGENT Compliance for ZFS
On Mon, Nov 10, 2008 at 12:42 PM, Keith Bierman [EMAIL PROTECTED] wrote: On Nov 10, 2008, at 4:47 AM, Vikash Gupta wrote: Hi Parmesh, Looks like this tender specification meant for Veritas. How do you handle this particular clause ? Shall provide Centralized, Cross platform, Single console management GUI Does it really make sense to have a discussion like this on an external open list? Contracts are customarily private, and company confidential. -- Keith H. Bierman [EMAIL PROTECTED] | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 speaking for myself* Copyright 2008 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Not sure disclosing confidential information online will help dispel any concerns about the contract... -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to diagnose zfs - iscsi - nfs hang
On Fri, Nov 7, 2008 at 9:11 AM, Jacob Ritorto [EMAIL PROTECTED] wrote: I have a PC server running Solaris 10 5/08 which seems to frequently become unable to share zfs filesystems via the shareiscsi and sharenfs options. It appears, from the outside, to be hung -- all clients just freeze, and while they're able to ping the host, they're not able to transfer nfs or iSCSI data. They're in the same subnet and I've found no network problems thus far. After hearing so much about the Marvell problems I'm beginning to wonder it they're the culprit, though they're supposed to be fixed in 127128-11, which is the kernel I'm running. I have an exact hardware duplicate of this machine running Nevada b91 (iirc) that doesn't exhibit this problem. There's nothing in /var/adm/messages and I'm not sure where else to begin. Would someone please help me in diagnosing this failure? thx jake -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I saw this in Nev b87, where for whatever reason, CIFS and NFS would completely hang and no longer serve requests (I don't use iscsi, unable to confirm if that had hung too). The server was responsive, SSH was fine and could execute commands, clients could ping it and reach it, but CIFS and NFS were essentially hung. Intermittently, the system would recover and resume offering shares, no triggering events could be correlated. Since upgrading to newer builds, I haven't seen similar issues. -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 'zfs recv' is very slow
On Thu, Nov 6, 2008 at 4:19 PM, River Tarnell [EMAIL PROTECTED] wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Ian Collins: That's very slow. What's the nature of your data? mainly two sets of mid-sized files; one of 200KB-2MB in size and other under 50KB. they are organised into subdirectories, A/B/C/file. each directory has 18,000-25,000 files. total data size is around 2.5TB. hm, something changed while i was writing this mail: now the transfer is running at 2MB/sec, and the read i/o has disappeared. that's still slower than i'd expect, but an improvement. Time each phase (send to a file, copy the file to B and receive from the file). When I tried this on a filesystem with a range of file sizes, I had about 30% of the total transfer time in send, 50% in copy and 20% in receive. i'd rather not interrupt the current send, as it's quite large. once it's finished, i'll test with smaller changes... - river. -BEGIN PGP SIGNATURE- iD8DBQFJE4mXIXd7fCuc5vIRAv0/AJoCRtMBN1/WD7zVVRzV2n4xeqBvyACeLNL/ rLB1iHlu4xZdUPSiNj/iWl4= =+F7d -END PGP SIGNATURE- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss There have been a couple of threads about this now, which tracked some bug IDs/tickets: 6333409, 6418042, 66104157, if you want to see the status. -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Windows XP nfs client poor performance
On Mon, Oct 20, 2008 at 9:29 AM, Bob Bencze [EMAIL PROTECTED] wrote: Greetings. I have a X4500 with an 8TB RAIDZ datapool, currently 75% full. I have it carved up into several filesystems. I share out two of the filesystems, /datapool/data4 (approx 1.5TB) and /datapool/data5 (approx 3.5TB). The data is imagery, and the primary application on the PCs is Socetset. The clients are Windows XP Pro, and I use Services for UNIX (SFU) to mount the nfs shares from the thumper. When a client PC accesses files from data4, they come across quickly. When the same client accesses files from data5, the transfer rate comes to a crawl, and sometimes the application times out. The only difference I can see is the size of the volume; the data is all of the same type. I could find no references for any limitations on the volume size of nfs shares or mounts. It seems inconsistent and difficult to duplicate. I plan to begin a more in-depth troubleshooting of the problem with dtrace. Has anyone seen anything like this before? Thanks. -Bob Bencze -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss SFU NFS is often slow, but tunable; here is something you might find handy to squeeze some speed out of it: http://technet.microsoft.com/en-us/library/bb463205.aspx HTH -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Improving zfs send performance
On Wed, Oct 15, 2008 at 2:17 PM, Scott Williamson [EMAIL PROTECTED] wrote: Hi All, Just want to note that I had the same issue with zfs send + vdevs that had 11 drives in them on a X4500. Reducing the count of drives per vdev cleared this up. One vdev is IOPS limited to the speed of one drive in that vdev, according to this post (see comment from ptribble). Scott, Can you tell us the configuration that you're using that is working for you? Were you using RaidZ, or RaidZ2? I'm wondering what the sweet spot is to get a good compromise between vdev count and usable space/performance. Thanks! -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
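The trade-off being asked about can be put in rough numbers: each raidz2 vdev delivers roughly the random IOPS of a single member disk, while usable capacity of an N-disk raidz2 vdev is (N-2)/N, so narrower vdevs buy IOPS at the cost of parity overhead. A quick sketch of the arithmetic (the disk counts are just an example, not the X4500 layout from the thread):

```shell
#!/bin/sh
# Usable fraction of an N-disk raidz2 vdev is (N-2)/N; random IOPS
# scale with vdev count, not disk count. Example: 12 disks, wide vs narrow.
WIDE=12     # one 12-disk raidz2 vdev
NARROW=6    # two 6-disk raidz2 vdevs from the same 12 disks

PCT_WIDE=$(( (WIDE - 2) * 100 / WIDE ))
PCT_NARROW=$(( (NARROW - 2) * 100 / NARROW ))

echo "1 x ${WIDE}-disk raidz2:  ${PCT_WIDE}% usable, ~1 disk of random IOPS"
echo "2 x ${NARROW}-disk raidz2: ${PCT_NARROW}% usable, ~2 disks of random IOPS"
```

So splitting 12 disks into two 6-disk raidz2 vdevs roughly doubles random IOPS while giving up about a sixth of the usable space, which is the shape of the compromise Brent is asking Scott about.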
Re: [zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper [EMAIL PROTECTED] wrote: Hey, all! We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets over ip-multipathed 10Gb Ethernet, to build a ~150TB ZFS pool on an x4200 head node. In trying to discover optimal ZFS pool construction settings, we've run a number of iozone tests, so I thought I'd share them with you and see if you have any comments, suggestions, etc. First, on a single Thumper, we ran baseline tests on the direct-attached storage (which is collected into a single ZFS pool comprised of four raidz2 groups)... [1GB file size, 1KB record size] Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest Write: 123919 Rewrite: 146277 Read: 383226 Reread: 383567 Random Read: 84369 Random Write: 121617 [8GB file size, 512KB record size] Command: Write: 373345 Rewrite: 665847 Read: 2261103 Reread: 2175696 Random Read: 2239877 Random Write: 666769 [64GB file size, 1MB record size] Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest Write: 517092 Rewrite: 541768 Read: 682713 Reread: 697875 Random Read: 89362 Random Write: 488944 These results look very nice, though you'll notice that the random read numbers tend to be pretty low on the 1GB and 64GB tests (relative to their sequential counterparts), but the 8GB random (and sequential) read is unbelievably good. Now we move to the head node's iSCSI aggregate ZFS pool... 
[1GB file size, 1KB record size] Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest Write: 127108 Rewrite: 120704 Read: 394073 Reread: 396607 Random Read: 63820 Random Write: 5907 [8GB file size, 512KB record size] Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest Write: 235348 Rewrite: 179740 Read: 577315 Reread: 662253 Random Read: 249853 Random Write: 274589 [64GB file size, 1MB record size] Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest Write: 190535 Rewrite: 194738 Read: 297605 Reread: 314829 Random Read: 93102 Random Write: 175688 Generally speaking, the results look good, but you'll notice that random writes are atrocious on the 1GB tests and random reads are not so great on the 1GB and 64GB tests, but the 8GB test looks great across the board. Voodoo! ; Incidentally, I ran all these tests against the ZFS pool in disk, raidz1, and raidz2 modes - there were no significant changes in the results. So, how concerned should we be about the low scores here and there? Any suggestions on how to improve our configuration? And how excited should we be about the 8GB tests? ; Thanks so much for any input you have! -Gray --- University of Michigan Medical School Information Services -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Your setup sounds very interesting, with how you export iSCSI to another head unit. Can you give me some more details on your filesystem layout and how you mount it on the head unit? Sounds like a pretty clever way to export awesomely large volumes! Regards, -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Segmentation fault / core dump with recursive send/recv
On Wed, Oct 8, 2008 at 10:49 PM, BJ Quinn [EMAIL PROTECTED] wrote: Oh, and I had been doing this remotely, so I didn't notice the following error before - receiving incremental stream of datapool/[EMAIL PROTECTED] into backup/[EMAIL PROTECTED] cannot receive incremental stream: destination backup/shares has been modified since most recent snapshot This is reported after the first snapshot, BACKUP081007, gets copied, and then it quits. I don't see why it would have been modified. I guess it's possible I cd'ed into the backup directory at some point during the send/recv, but I don't think so. Should I set the readonly property on the backup FS or something? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Correct, the other side should be set read-only; that way nothing at all is modified when the other host tries to zfs send. -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Quantifying ZFS reliability
On Mon, Sep 29, 2008 at 9:28 PM, Richard Elling [EMAIL PROTECTED] wrote: Ahmed Kamal wrote: Hi everyone, We're a small Linux shop (20 users). I am currently using a Linux server to host our 2TBs of data. I am considering better options for our data storage needs. I mostly need instant snapshots and better data protection. I have been considering EMC NS20 filers and Zfs based solutions. For the Zfs solutions, I am considering NexentaStor product installed on a pogoLinux StorageDirector box. The box will be mostly sharing 2TB over NFS, nothing fancy. Now, my question is I need to assess the zfs reliability today Q4-2008 in comparison to an EMC solution. Something like EMC is pretty mature and used at the most demanding sites. Zfs is fairly new, and from time to time I have heard it had some pretty bad bugs. However, the EMC solution is like 4X more expensive. I need to somehow quantify the relative quality level, in order to judge whether or not I should be paying all that much to EMC. The only really important reliability measure to me, is not having data loss! Is there any real measure like percentage of total corruption of a pool that can assess such a quality, so you'd tell me zfs has pool failure rate of 1 in a 10^6, while EMC has a rate of 1 in a 10^7. If not, would you guys rate such a zfs solution as ??% the reliability of an EMC solution ? EMC does not, and cannot, provide end-to-end data validation. So how would you measure its data reliability? If you search the ZFS-discuss archives, you will find instances where people using high-end storage also had data errors detected by ZFS. So, you should consider them complementary rather than adversaries. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Key word, detected :) I recall a few of those, I'll cite later when I'm not tired, where people used insert SAN vendor here as the iSCSI target, with ZFS on top of. 
ZFS was quite capable of detecting errors, but since they did not let ZFS handle the RAID, and instead relied on another level, ZFS was not able to correct the errors. -- Brent Jones [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss