Re: [zfs-discuss] Sun X4200 Question...
On Mar 14, 2013, at 5:55 PM, Jim Klimov jimkli...@cos.ru wrote: However, recently the VM's virtual hardware clocks became way too slow.

Does NTP help correct the guest's clock?
Re: [zfs-discuss] ZFS Distro Advice
On Feb 26, 2013, at 12:44 AM, Sašo Kiselkov wrote: I'd also recommend that you go and subscribe to z...@lists.illumos.org, since this list is going to get shut down by Oracle next month.

Its description still reads: everything ZFS running on illumos-based distributions. -Gary
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: From: zfs-discuss-boun...@opensolaris.org On Behalf Of Nico Williams

As for swap... really, you don't want to swap. If you're swapping you have problems.

In Solaris, I've never seen it swap out idle processes; I've only seen it use swap for the bad bad bad situation. I assume that's all it can do with swap.

You would be wrong. Solaris uses swap space for paging. Paging out unused portions of an executing process from real memory to the swap device is certainly beneficial. Swapping out complete processes is a desperation move, but paging out most of an idle process is a good thing.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] Sonnet Tempo SSD supported?
On Dec 4, 2012, Eugen Leitl wrote: Either way I'll know the hardware support situation soon enough. Have you tried contacting Sonnet? -Gary
Re: [zfs-discuss] LUN sizes
On Mon, Oct 29, 2012 at 09:30:47AM -0500, Brian Wilson wrote: First I'd like to note that, contrary to the nomenclature, there isn't any one SAN product that all operates the same. There are a number of different vendor-provided solutions that use an FC SAN to deliver LUNs to hosts, and they each have their own limitations. Forgive my pedantry, please.

On Sun, Oct 28, 2012 at 04:43:34PM +0700, Fajar A. Nugraha wrote: On Sat, Oct 27, 2012 at 9:16 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: zfs-discuss-boun...@opensolaris.org On Behalf Of Fajar A. Nugraha

So my suggestion is actually just present one huge 25TB LUN to zfs and let the SAN handle redundancy.

You are entering the uncharted waters of ``multi-level disk management'' here. Both ZFS and the SAN use redundancy and error-checking to ensure data integrity. Both of them also do automatic replacement of failing disks. A good SAN will present LUNs that behave as perfectly reliable virtual disks, guaranteed to be error free. Almost all of the time, ZFS will find no errors. If ZFS does find an error, there's no nice way to recover. Most commonly, this happens when the SAN is powered down or rebooted while the ZFS host is still running.

On your host side, there's also the consideration of ssd/scsi queuing. If you're running on only one LUN, you're limiting your IOPS to only one I/O queue over your FC paths, and if you have that throttled (per many storage vendors' recommendations about ssd:ssd_max_throttle and zfs:zfs_vdev_max_pending), then one LUN will throttle your IOPS back on your host. That might also motivate you to split into multiple LUNs so your OS doesn't end up bottlenecking your I/O before it even gets to your SAN HBA. That's a performance issue rather than a reliability issue.

The other performance issue to consider is block size. At the last place I worked, we used an iSCSI LUN from a NetApp filer. This LUN reported a block size of 512 bytes, even though the NetApp itself used a 4K block size. This means that the filer was doing the block-size conversion, resulting in much more I/O than the ZFS layer intended. The fact that NetApp does COW made this situation even worse. My impression was that very few of their customers encountered this performance problem because almost all of them used their NetApp only for NFS or CIFS. Our NetApp was extremely reliable but did not have the iSCSI LUN performance that we needed.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
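[For readers who haven't touched those tunables: they normally go in /etc/system and take effect at the next boot. A minimal sketch, with illustrative values only; check your storage vendor's recommendation before setting anything:

* /etc/system -- illustrative values, not recommendations
* limit outstanding commands queued to each sd/ssd target
set sd:sd_max_throttle = 20
set ssd:ssd_max_throttle = 20
* limit ZFS's queue of pending I/Os per vdev
set zfs:zfs_vdev_max_pending = 10
]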
Re: [zfs-discuss] Zpool LUN Sizes
On Sun, Oct 28, 2012 at 04:43:34PM +0700, Fajar A. Nugraha wrote: On Sat, Oct 27, 2012 at 9:16 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: zfs-discuss-boun...@opensolaris.org On Behalf Of Fajar A. Nugraha

So my suggestion is actually just present one huge 25TB LUN to zfs and let the SAN handle redundancy.

... create a bunch of 1-disk volumes and let ZFS handle them as if they're JBOD. The last time I used IBM's enterprise storage (which was, admittedly, a long time ago) you couldn't even do that. And looking at Morris' mail address, it should be relevant :) ... or probably it's just me who hasn't found how to do that. Which is why I suggested just using whatever the SAN can present :)

You are entering the uncharted waters of ``multi-level disk management'' here. Both ZFS and the SAN use redundancy and error-checking to ensure data integrity. Both of them also do automatic replacement of failing disks. A good SAN will present LUNs that behave as perfectly reliable virtual disks, guaranteed to be error free. Almost all of the time, ZFS will find no errors. If ZFS does find an error, there's no nice way to recover. Most commonly, this happens when the SAN is powered down or rebooted while the ZFS host is still running.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] What happens when you rm zpool.cache?
On Sun, Oct 21, 2012 at 11:40:31AM +0200, Bogdan Ćulibrk wrote: Follow-up question regarding this: is there any way to disable automatic import of any non-rpool on boot, without any hacks of removing zpool.cache?

Certainly. Import it with an alternate cache file. You do this by specifying the `cachefile' property on the command line. The `zpool' man page describes how to do this.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
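[For example, something like this, with the pool name and path invented:

# import with a cache file other than the default /etc/zfs/zpool.cache,
# so the pool is not imported automatically at boot
zpool import -o cachefile=/etc/zfs/datapool.cache datapool
# or keep the pool out of any cache file entirely
zpool import -o cachefile=none datapool
]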
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, Jul 11, 2012 at 7:48 AM, Casper Dik wrote: Dan Brown seems to think so in Digital Fortress, but it just means he has no grasp of big numbers.

Or of much else, for that matter. I seem to recall one character in the book who would routinely slide under a mainframe on his back as if on a mechanic's dolly, solder CPUs to the motherboard above his face, and perform all manner of bullshit on-the-fly repairs that never existed even back in the earliest days of mid-20th-century computing. I don't recall anything else of a technical nature that made a lick of sense, and the story was only made more insulting by the mass of alleged super-geniuses who could barely tie their own shoelaces, etc. etc. Reading the one-star reviews of this book on Amazon is far more enlightening and entertaining than reading the actual book. I found it so insulting that I couldn't finish the last 70 pages of the paperback. -Gary
Re: [zfs-discuss] [developer] Re: History of EPERM for unlink() of directories on ZFS?
On Tue, Jun 26, 2012 at 10:41:14AM -0500, Nico Williams wrote: On Tue, Jun 26, 2012 at 9:44 AM, Alan Coopersmith alan.coopersm...@oracle.com wrote: On 06/26/12 05:46 AM, Lionel Cons wrote: On 25 June 2012 11:33, casper@oracle.com wrote:

To be honest, I think we should also remove this from all other filesystems, and I think ZFS was created this way because all modern filesystems do it that way.

This may be the wrong way to go if it breaks existing applications which rely on this feature. It does break applications in our case.

Existing applications rely on the ability to corrupt UFS filesystems? Sounds horrible.

My guess is that the OP just wants unlink() of an empty directory to be the same as rmdir() of the same. Or perhaps they want unlink() of a non-empty directory to result in a recursive rm... But if they really want hardlinks to directories, then yeah, that's horrible.

This all sounds like a good use for LD_PRELOAD and a tiny library that intercepts and modernizes system calls.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] IOzone benchmarking
On Thu, May 3, 2012 at 7:47 AM, Edward Ned Harvey wrote: Given the amount of ram you have, I really don't think you'll be able to get any useful metric out of iozone in this lifetime. I still think it would be apropos if dedup and compression were being used. In that case, does filebench have an option for testing either of those? -Gary
Re: [zfs-discuss] IOzone benchmarking
On May 1, 2012, at 1:41 AM, Ray Van Dolson wrote: Throughput: iozone -m -t 8 -T -r 128k -o -s 36G -R -b bigfile.xls IOPS: iozone -O -i 0 -i 1 -i 2 -e -+n -r 128K -s 288G > iops.txt

Do you expect to be reading or writing 36 GB or 288 GB files very often on this array? The largest file size I've used in my own (still lengthy) benchmarks was 16 GB. If you use the sizes you've proposed, it could take several days or weeks to complete. Try a web search for iozone examples if you want more details on the command switches. -Gary
Re: [zfs-discuss] IOzone benchmarking
On 5/1/12, Ray Van Dolson wrote: The problem is this box has 144GB of memory. If I go with a 16GB file size (which I did), then memory and caching influence the results pretty severely (I get around 3GB/sec for writes!).

The idea of benchmarking -- IMHO -- is to vaguely attempt to reproduce real-world loads. Obviously, this is an imperfect science, but if you're going to be writing a lot of small files (e.g. NNTP or email servers used to be a good real-world example) then you're going to want to benchmark for that. If you're going to be writing a bunch of huge files (are you writing a lot of 16GB files?) then you'll want to test for that. Caching anywhere in the pipeline is important for benchmarks because you aren't going to turn off a cache or remove RAM in production, are you? -Gary
Re: [zfs-discuss] Seagate Constellation vs. Hitachi Ultrastar
I've seen a couple of sources that suggest prices should be dropping by the end of April -- apparently not as low as pre-flood prices, due in part to a rise in manufacturing costs, but about 10% lower than they're priced today. -Gary
[zfs-discuss] ZFS Comes to OS X Courtesy of Apple's Former Chief ZFS Architect
It looks like the first iteration has finally launched... http://tenscomplement.com/our-products/zevo-silver-edition http://www.macrumors.com/2012/01/31/zfs-comes-to-os-x-courtesy-of-apples-former-chief-zfs-architect
Re: [zfs-discuss] zfs and iscsi performance help
On Fri, Jan 27, 2012 at 03:25:39PM +1100, Ivan Rodriguez wrote: We have a backup server with a zpool size of 20 TB. We transfer information using zfs snapshots every day (we have around 300 filesystems on that pool). The storage is a Dell MD3000i connected by iSCSI, and the pool is currently version 10. The same storage is connected to another server with a smaller pool of 3 TB (zpool version 10); that server is working fine and speed is good between the storage and the server. However, on the server with the 20 TB pool, performance is an issue: after we restart the server performance is good, but over time, say a week, the performance keeps dropping until we have to bounce the server again (same behavior with a newer version of Solaris, in which case performance drops in 2 days). There are no errors in the logs, on the storage, or in `zpool status -v'.

This sounds like a ZFS cache problem on the server. You might check on how cache statistics change over time. Some tuning may eliminate this degradation. More memory may also help. Does a scrub show any errors? Does the performance drop affect reads or writes or both?

We suspect that the pool has some issues; probably there is corruption somewhere. We tested Solaris 10 8/11 with zpool version 29, although we haven't upgraded the pool itself. With the new Solaris the performance is even worse, and every time we restart the server we get stuff like this:

SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 0168621d-3f61-c1fc-bc73-c50efaa836f4
DESC: All faults associated with an event id have been addressed. Refer to http://sun.com/msg/FMD-8000-4M for more information.
AUTO-RESPONSE: Some system components offlined because of the original fault may have been brought back online.
IMPACT: Performance degradation of the system due to the original fault may have been recovered.
REC-ACTION: Use fmdump -v -u EVENT-ID to identify the repaired components.
[ID 377184 daemon.notice] SUNW-MSG-ID: FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor

And we need to export and import the pool in order to be able to access it.

This is a separate problem, introduced with an upgrade to the iSCSI service. The new one has a dependency on the name service (typically DNS), which means that it isn't available when the zpool import is done during the boot. Check with Oracle support to see if they have found a solution.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
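[Checking the cache statistics mentioned above is straightforward with kstat; a sketch, using the standard arcstats counter names:

# sample ARC size, target size, and hit/miss counters;
# re-run periodically to watch the trend as performance degrades
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:hits zfs:0:arcstats:misses
]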
Re: [zfs-discuss] unable to access the zpool after issue a reboot
On Thu, Jan 26, 2012 at 04:36:58PM +0100, Christian Meier wrote: Hi Sudheer

3) bash-3.2# zpool status
  pool: pool name
 state: UNAVAIL
status: One or more devices could not be opened. There are insufficient replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME       STATE     READ WRITE CKSUM
        pool name  UNAVAIL      0     0     0  insufficient replicas
          c5t1d1   UNAVAIL      0     0     0  cannot open

This means that, at the time of that import, device c5t1d1 was not available. What does `ls -l /dev/rdsk/c5t1d1s0' show for the physical path?

And the important thing is, when I export and import the zpool, then I am able to access it.

Yes, later the device became available. After the boot, `svcs' will show you the services listed in order of their completion times. The ZFS mount is done by this service: svc:/system/filesystem/local:default The zpool import (without the mount) is done earlier. Check to see if any of the FC services run too late during the boot.

As Gary and Bob mentioned, I saw this issue with iSCSI devices. Instead of export/import, would a zpool clear also work?

mpathadm list LU
mpathadm show LU /dev/rdsk/c5t1d1s2

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] unable to access the zpool after issue a reboot
On Tue, Jan 24, 2012 at 05:33:39PM +0530, sureshkumar wrote: I am new to Solaris. I am facing an issue with Dynapath [multipath software] for Solaris 10u10 x86. My problem is that I am unable to access the zpool after issuing a reboot.

I've seen this happen when the zpool was built on an iSCSI LUN. At reboot time, the ZFS import was done before the iSCSI driver was able to connect to its target. After the system was up, an export and import was successful. The solution was to add a new service that imported the zpool later during the reboot.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] zfs defragmentation via resilvering?
On Mon, Jan 16, 2012 at 09:13:03AM -0600, Bob Friesenhahn wrote: On Mon, 16 Jan 2012, Jim Klimov wrote: I think that in order to create a truly fragmented ZFS layout, Edward needs to do sync writes (without a ZIL?) so that every block and its metadata go to disk (coalesced as they may be) and no two blocks of the file would be sequenced on disk together. Although creating snapshots should give that effect...

In my experience, most files on Unix systems are re-written from scratch. For example, when one edits a file in an editor, the editor loads the file into memory, performs the edit, and then writes out the whole file. Given sufficient free disk space, these files are unlikely to be fragmented. Slowly-written log files and random-access databases are the worst cases for causing fragmentation.

The case I've seen was with an IMAP server with many users. E-mail folders were represented as ZFS directories, and e-mail messages as files within those directories. New messages arrived randomly in the INBOX folder, so those files were written all over the place on the storage. Users also deleted many messages from their INBOX folder, but the files were retained in snapshots for two weeks. On IMAP session startup, the server typically had to read all of the messages in the INBOX folder, making this portion slow. The server also had to refresh the folder whenever new messages arrived, making that portion slow as well. Performance degraded when the storage became 50% full. It would improve markedly when the oldest snapshot was deleted.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Sun, Jan 15, 2012 at 04:06:33PM +0000, Peter Tribble wrote: On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov jimkli...@cos.ru wrote: Does raidzN actually protect against bitrot? That's a kind of radical, possibly offensive, question that I have been asking lately.

Yup, it does. That's why many of us use it.

There's actually no such thing as bitrot on a disk. Each sector on the disk is accompanied by a CRC that's verified by the disk controller on each read. It will either return correct data or report an unreadable sector. There's nothing in between. Of course, if something outside of ZFS writes to the disk, then data belonging to ZFS will be modified. I've heard of RAID controllers or SAN devices doing this when they modify the disk geometry or reserved areas on the disk.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] Any HP Servers recommendation for Openindiana (Capacity Server) ?
I can't comment on their 4U servers, but the SAS controllers included in HP's 12U storage units rarely allow JBOD discovery of drives. So I'd recommend an LSI card and an external storage chassis like those available from Promise and others. -Gary
Re: [zfs-discuss] Any HP Servers recommendation for Openindiana (Capacity Server) ?
On Jan 3, 2012, at 10:36 PM, Eric D. Mudama wrote: Supposedly the H200/H700 cards are just their name for the 6 Gbit LSI SAS cards, but I haven't tested them personally.

They might use the same chipset, but their firmware usually doesn't support JBOD. Unless they've changed in the last couple of years... The best you can do is try, but if you don't see each drive individually you'll know it's by design and not lack of skill on your part. -Gary
Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!
On Mon, Dec 19, 2011 at 11:58:57AM +0000, Jan-Aage Frydenbø-Bruvoll wrote: 2011/12/19 Hung-Sheng Tsao (laoTsao) laot...@gmail.com: did you run a scrub?

Yes, as part of the previous drive failure. Nothing reported there. Now, interestingly - I deleted two of the oldest snapshots yesterday, and guess what - the performance went back to normal for a while. Now it is severely dropping again - after a good while at 1.5-2GB/s I am again seeing write performance in the 1-10MB/s range.

That behavior is a symptom of fragmentation. Writes slow down dramatically when there are no contiguous blocks available. Deleting a snapshot frees some of these, but only temporarily.

-- -Gary Mills--refurb--Winnipeg, Manitoba, Canada-
Re: [zfs-discuss] CPU sizing for ZFS/iSCSI/NFS server
On Dec 12, 2011, at 11:42 AM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote: please check out the ZFS appliance 7120 spec: 2.4GHz / 24GB memory and ZIL (SSD)

Do those appliances also use the F20 PCIe flash cards? I know the Exadata storage cells use them, but they aren't utilizing ZFS in the Linux version of the X2-2. Has that changed with the Solaris x86 versions of the appliance? Also, does OCZ or someone make an equivalent to the F20 now? -Gary
Re: [zfs-discuss] Improving L1ARC cache efficiency with dedup
What kind of drives are we talking about? Even SATA drives are available according to application type (desktop, enterprise server, home PVR, surveillance PVR, etc). Then there are drives with SAS or Fibre Channel interfaces. Then you've got Winchester platters vs SSDs vs hybrids. But even before considering that and all the other system factors, throughput for direct-attached storage can vary greatly, not only with interface type and storage tech: even small on-drive controller firmware differences can introduce variances. That's why server manufacturers like HP, Dell, et al. prefer that you replace failed drives with one of theirs instead of something off the shelf: they usually have firmware that's been fine-tuned in house or in conjunction with the manufacturer.

On Dec 11, 2011, at 8:25 AM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote: From: zfs-discuss-boun...@opensolaris.org On Behalf Of Nathan Kroenert

That reminds me of something I have been wondering about... Why only 12x faster? If we are effectively reading from memory - as compared to a disk reading at approximately 100MB/s (which is about an average PC HDD reading sequentially), I'd have thought it should be a lot faster than 12x. Can we really only pull stuff from cache at a little over one gigabyte per second if it's dedup data?

Actually, CPUs and memory aren't as fast as you might think. In a system with 12 disks, I've had to write my own dd replacement, because dd if=/dev/zero bs=1024k wasn't fast enough to keep the disks busy. Later, I wanted to do something similar using unique data, and it was simply impossible to generate random data fast enough. I had to tweak my dd replacement to write serial numbers, which still wasn't fast enough, so I had to tweak it again to write a big block of static data, followed by a serial number, followed by another big block (always smaller than the disk block, so it would be treated as unique when hitting the pool...).

One typical disk sustains 1 Gbit/sec. In theory, 12 should be able to sustain 12 Gbit/sec. According to Nathan's email, the memory bandwidth might be 25 Gbit, of which you probably need both read and write, thus making it effectively 12.5 Gbit... I'm sure the actual bandwidth available varies by system and memory type.
Re: [zfs-discuss] HP JBOD D2700 - ok?
I'd be wary of purchasing HP HBAs without getting a firsthand report from someone that they're compatible. I've seen several HP controllers that use LSI chipsets but are crippled in that they won't present drives as JBOD. That said, I've used a few of the HBAs sourced from LSI resellers and they work wonderfully with ZFS. -Gary
[zfs-discuss] ZFS forensics
Is zdb still the only way to dive into the file system? I've seen the extensive work by Max Bruning on this but wonder if there are any tools that make this easier...? -Gary
Re: [zfs-discuss] ZFS forensics
On Nov 23, 2011, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote: did you see this link

Thank you for this. Some of the other refs it lists will come in handy as well. kind regards, Gary
[zfs-discuss] sd_max_throttle
Hi folks, I'm reading through some I/O performance tuning documents and am finding some older references to sd_max_throttle kernel/project settings. Have there been any recent books or documentation written that talk about this more in depth? It seems to be more appropriate for FC or DAS, but I'm wondering if anyone has had to touch this or other settings with ZFS appliances they've built...? -Gary
[zfs-discuss] Does the zpool cache file affect import?
I have a system with ZFS root that imports another zpool from a start method. It uses a separate cache file for this zpool, like this:

if [ -f $CCACHE ]
then
    echo Importing $CPOOL with cache $CCACHE
    zpool import -o cachefile=$CCACHE -c $CCACHE $CPOOL
else
    echo Importing $CPOOL with device scan
    zpool import -o cachefile=$CCACHE $CPOOL
fi

It also exports that zpool from the stop method, which has the side effect of deleting the cache. This all works nicely when the server is rebooted. What will happen when the server is halted without running the stop method, so that that zpool is not exported? I know that there is a flag in the zpool that indicates when it's been exported cleanly. The cache file will exist when the server reboots. Will the import fail with the `The pool was last accessed by another system.' error, or will the import succeed? Does the cache change the import behavior? Does it recognize that the server is the same system? I don't want to include the `-f' flag in the commands above when it's not needed.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] How create a FAT filesystem on a zvol?
On Sun, Jul 10, 2011 at 11:16:02PM +0700, Fajar A. Nugraha wrote: On Sun, Jul 10, 2011 at 10:10 PM, Gary Mills mi...@cc.umanitoba.ca wrote: The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it?

Seems not. [...] Some Solaris tools (like fdisk, or mkfs -F pcfs) need disk geometry to function properly; zvols don't provide that. If you want to use zvols with such tools, the easiest way would be using lofi, or exporting the zvol as an iSCSI share and importing it again. For example, if you have a 10MB zvol and use lofi, fdisk would show this geometry:

Total disk size is 34 cylinders
Cylinder size is 602 (512 byte) blocks

... which will then be used if you run mkfs -F pcfs -o nofdisk,size=20480. Without lofi, the same command would fail with: Drive geometry lookup (need tracks/cylinder and/or sectors/track): Operation not supported

So, why can I do it with UFS?

# zfs create -V 10m rpool/vol1
# newfs /dev/zvol/rdsk/rpool/vol1
newfs: construct a new file system /dev/zvol/rdsk/rpool/vol1: (y/n)? y
Warning: 4130 sector(s) in last cylinder unallocated
/dev/zvol/rdsk/rpool/vol1: 20446 sectors in 4 cylinders of 48 tracks, 128 sectors
        10.0MB in 1 cyl groups (14 c/g, 42.00MB/g, 20160 i/g)
super-block backups (for fsck -F ufs -o b=#) at: 32,

Why is this different from PCFS?

-- -Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] How create a FAT filesystem on a zvol?
The `lofiadm' man page describes how to export a file as a block device and then use `mkfs -F pcfs' to create a FAT filesystem on it. Can't I do the same thing by first creating a zvol and then creating a FAT filesystem on it? Nothing I've tried seems to work. Isn't the zvol just another block device? -- -Gary Mills--Unix Group--Computer and Network Services-
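[For reference, the lofi route from the man page looks roughly like this; the file name and size are invented:

# create a backing file and attach it as a block device
mkfile 10m /export/fat.img
lofiadm -a /export/fat.img        # prints the device name, e.g. /dev/lofi/1
# build the FAT filesystem on the raw lofi device
# (size is in 512-byte sectors; 20480 sectors = 10 MB)
mkfs -F pcfs -o nofdisk,size=20480 /dev/rlofi/1
]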
Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)
On Sun, Jun 19, 2011 at 08:03:25AM -0700, Richard Elling wrote: On Jun 19, 2011, at 6:28 AM, Edward Ned Harvey wrote: From: Richard Elling [mailto:richard.ell...@gmail.com] Sent: Saturday, June 18, 2011 7:47 PM

Actually, all of the data I've gathered recently shows that the number of IOPS does not significantly increase for HDDs running random workloads. However, the response time does :-(

Could you clarify what you mean by that?

Yes. I've been looking at what the value of zfs_vdev_max_pending should be. The old value was 35 (a guess, but a really bad guess) and the new value is 10 (another guess, but a better guess). I observe that for a fast, modern HDD with 1-10 threads (outstanding I/Os), the IOPS range from 309 to 333. But as we add threads, the average response time increases from 2.3ms to 137ms. Since the whole idea is to get lower response time, and we know disks are not simple queues so there is no direct IOPS-to-response-time relationship, maybe it is simply better to limit the number of outstanding I/Os.

How would this work for a storage device with an intelligent controller that provides only a few LUNs to the host, even though it contains a much larger number of disks? I would expect the controller to be more efficient with a large number of outstanding I/Os, because it could distribute those I/Os across the disks. It would, of course, require a non-volatile cache to provide fast turnaround for writes.

-- -Gary Mills--Unix Group--Computer and Network Services-
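[For anyone wanting to experiment with that tunable, it can also be changed on a live system with mdb; a sketch, where 0t10 is decimal 10:

# lower the per-vdev queue depth on a running system
echo zfs_vdev_max_pending/W0t10 | mdb -kw
]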
Re: [zfs-discuss] JBOD recommendation for ZFS usage
On Mon, May 30, 2011 at 08:06:31AM +0200, Thomas Nau wrote: We are looking for JBOD systems which (1) hold 20+ 3.5" SATA drives, (2) are rack mountable, (3) have all the nice hot-swap stuff, and (4) allow 2 hosts to connect via SAS (4+ lanes per host) and see all available drives as disks, not a RAID volume. In a perfect world, both hosts would connect using two independent SAS connectors each. The box will be used in a ZFS Solaris-based fileserver in a fail-over cluster setup. Only one host will access a drive at any given time.

I'm using a J4200 array as shared storage for a cluster. It needs a SAS HBA in each cluster node. The disks in the array are visible to both nodes in the cluster. Here's the feature list; I don't know if it's still available:

Sun Storage J4200 Array:
* Scales up to 48 SAS/SATA disk drives
* Provides up to 72 Gb/sec of total bandwidth
* Four x4-wide 3 Gb/sec SAS host/uplink ports (48 Gb/sec bandwidth)
* Two x4-wide 3 Gb/sec SAS expansion ports (24 Gb/sec bandwidth)

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Best practice for boot partition layout in ZFS
On Wed, Apr 06, 2011 at 08:08:06AM -0700, Erik Trimble wrote: On 4/6/2011 7:50 AM, Lori Alt wrote: On 04/ 6/11 07:59 AM, Arjun YK wrote:

I'm not sure there's a defined best practice. Maybe someone else can answer that question. My guess is that in environments where, before, a separate UFS /var slice was used, a separate ZFS /var dataset with a quota might now be appropriate. Lori

Traditionally, the reason for a separate /var was one of two major items: (a) /var was writable and / wasn't - this was typical of diskless or minimal local-disk configurations; modern packaging systems are making this kind of configuration increasingly difficult. (b) /var held a substantial amount of data which needed to be handled separately from / - mail and news servers are a classic example. For typical machines nowadays, with large root disks, there is very little chance of /var suddenly exploding and filling / (the classic example of being screwed... wink). Outside of the above two cases, about the only other place I can see that having a separate /var is a good idea is for certain test machines, where you expect frequent memory dumps (in /var/crash) - if you have a large amount of RAM, you'll need a lot of disk space, so it might be good to limit /var in this case by making it a separate dataset.

People forget (c), the ability to set different filesystem options on /var. You might want to have `setuid=off' for improved security, for example.

-- -Gary Mills--Unix Group--Computer and Network Services-
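[Point (c) is a one-liner in practice; a sketch, with the boot-environment name invented:

# a separate /var dataset with tighter security options and a quota
zfs create -o setuid=off -o devices=off -o quota=10g rpool/ROOT/s10be/var
]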
[zfs-discuss] One LUN per RAID group
With ZFS on a Solaris server using storage on a SAN device, is it reasonable to configure the storage device to present one LUN for each RAID group? I'm assuming that the SAN and storage device are sufficiently reliable that no additional redundancy is necessary on the Solaris ZFS server. I'm also assuming that all disk management is done on the storage device. I realize that it is possible to configure more than one LUN per RAID group on the storage device, but doesn't ZFS assume that each LUN represents an independent disk, and schedule I/O accordingly? In that case, wouldn't ZFS I/O scheduling interfere with the I/O scheduling already done by the storage device? Is there any reason not to use one LUN per RAID group?

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] One LUN per RAID group
On Mon, Feb 14, 2011 at 03:04:18PM -0500, Paul Kraus wrote: On Mon, Feb 14, 2011 at 2:38 PM, Gary Mills mi...@cc.umanitoba.ca wrote: Is there any reason not to use one LUN per RAID group? [...] In other words, if you build a zpool with one vdev of 10GB and another with two vdevs, each of 5GB (both coming from the same array and RAID set), you get almost exactly twice the random read performance from the 2x5 zpool vs. the 1x10 zpool.

This finding is surprising to me. How do you explain it? Is it simply that you get twice as many outstanding I/O requests with two LUNs? Is it limited by the default I/O queue depth in ZFS? After all, all of the I/O requests must be handled by the same RAID group once they reach the storage device.

Also, using a 2540 disk array set up as a 10-disk RAID6 (with 2 hot spares), you get substantially better random read performance using 10 LUNs vs. 1 LUN.

While inconvenient, this just reflects the scaling of ZFS with the number of vdevs, not the number of spindles.

-- -Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] zpool-poolname has 99 threads
After an upgrade of a busy server to Oracle Solaris 10 9/10, I notice a process called zpool-poolname that has 99 threads. This seems to be a limit, as it never goes above that. It is lower on workstations. The `zpool' man page says only:

Processes
    Each imported pool has an associated process, named zpool-poolname. The threads in this process are the pool's I/O processing threads, which handle the compression, checksumming, and other tasks for all I/O associated with the pool. This process exists to provide visibility into the CPU utilization of the system's storage pools. The existence of this process is an unstable interface.

There are several thousand processes doing ZFS I/O on the busy server. Could this new process be a limitation in any way? I'd just like to rule it out before looking further at I/O performance.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Sliced iSCSI device for doing RAIDZ?
On Fri, Sep 24, 2010 at 12:01:35AM +0200, Alexander Skwar wrote:

Suppose they gave you two huge lumps of storage from the SAN, and you mirrored them with ZFS. What would you do if ZFS reported that one of its two disks had failed and needed to be replaced? You can't do disk management with ZFS in this situation anyway because those aren't real disks. Disk management all has to be done on the SAN storage device.

Yes. I was rather thinking about RAIDZ instead of mirroring. I was just using a simpler example. Anyway. Without redundancy, ZFS cannot do recovery, can it? As far as I understand, it could detect block-level corruption even if there's no redundancy, but it could not correct such corruption. Or is that a wrong understanding?

That's correct, but it also should never happen.

If I got the gist of what you wrote, it boils down to how reliable the SAN is? But SANs could also have block-level corruption, no? I'm a bit confused, because of the (perceived?) contradiction to the Best Practices Guide? :)

The real problem is that ZFS was not designed to run in a SAN environment, that is, one where all of the disk management and sufficient redundancy reside in the storage device on the SAN. ZFS certainly can't do any disk management in this situation. Error detection and correction is still a debatable issue, one that quickly becomes exceedingly complex. The decision rests on probabilities rather than certainties.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Sliced iSCSI device for doing RAIDZ?
On Tue, Sep 21, 2010 at 05:48:09PM +0200, Alexander Skwar wrote: We're using ZFS via iSCSI on an S10U8 system. As the ZFS Best Practices Guide http://j.mp/zfs-bp states, it's advisable to use redundancy (i.e. RAIDZ, mirroring or whatnot), even if the underlying storage does its own RAID thing. Now, our storage does RAID, and the storage people say it is impossible to have it export iSCSI devices which have no redundancy/RAID.

If you have a reliable iSCSI SAN and a reliable storage device, you don't need the additional redundancy provided by ZFS.

Actually, where would there be a difference? I mean, those iSCSI devices don't represent real disks/spindles anyway; they're just some sort of abstraction. So, if they'd give me 3x400 GB compared to 1200 GB in one huge lump like they do now, it could be that those would use the same spots on the real hard drives.

Suppose they gave you two huge lumps of storage from the SAN, and you mirrored them with ZFS. What would you do if ZFS reported that one of its two disks had failed and needed to be replaced? You can't do disk management with ZFS in this situation anyway because those aren't real disks. Disk management all has to be done on the SAN storage device.

-- -Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] Howto reclaim space under legacy mountpoint?
I moved my home directories to a new disk and then mounted the disk using a legacy mount point over /export/home. Here is the output of zfs list:

NAME                 USED  AVAIL  REFER  MOUNTPOINT
rpool               55.8G  11.1G    83K  /rpool
rpool/ROOT          21.1G  11.1G    19K  legacy
rpool/ROOT/snv-134  21.1G  11.1G  14.3G  /
rpool/dump          1.97G  11.1G  1.97G  -
rpool/export        30.8G  11.1G    23K  /export
rpool/export/home   30.8G  11.1G  29.3G  legacy
rpool/swap          1.97G  12.9G   144M  -
users               32.8G   881G  31.1G  /export/home

The question is how to remove the files from the original rpool/export/home (no longer a mount point) in rpool. I'm a bit nervous about doing a: zfs destroy rpool/export/home Is this the correct and safe methodology? Thanks, Gary
[zfs-discuss] can ufs zones and zfs zones coexist on a single global zone
Looking at migrating zones built on an M8000 and M5000 to a new M9000. On the M9000 we started building new deployments using ZFS. The environments on the M8/M5 are UFS. These are whole-root zones; they will use global zone resources. Can this be done? Or would a ZFS migration be needed? Thank you,
[zfs-discuss] Newbie question
I would like to migrate my home directories to a new mirror. Currently, I have them in rpool: rpool/export rpool/export/home I've created a mirrored pool, users. I figure the steps are: 1) snapshot rpool/export/home 2) send the snapshot to users 3) unmount rpool/export/home 4) mount pool users at /export/home So, what are the appropriate commands for these steps? Thanks, Gary
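[A sketch of what those four steps usually look like; the snapshot name is invented, and you should verify the copy before touching the original:

# 1) snapshot the source
zfs snapshot rpool/export/home@migrate
# 2) send it into the new pool (-F lets the stream overwrite the
#    empty root dataset of the users pool)
zfs send rpool/export/home@migrate | zfs receive -F users
# 3) stop mounting the old copy
zfs set mountpoint=none rpool/export/home
# 4) mount the new pool over /export/home
zfs set mountpoint=/export/home users
]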
Re: [zfs-discuss] Newbie question
Norm, Thank you. I just wanted to double-check to make sure I didn't mess anything up. There were steps that left me head-scratching after reading the man page. I'll spend a bit more time re-reading it using the steps outlined, so I understand these fully. Gary
Re: [zfs-discuss] ZFS with Equallogic storage
On Sat, Aug 21, 2010 at 06:36:37PM -0400, Toby Thain wrote: On 21-Aug-10, at 3:06 PM, Ross Walker wrote: On Aug 21, 2010, at 2:14 PM, Bill Sommerfeld bill.sommerf...@oracle.com wrote: On 08/21/10 10:14, Ross Walker wrote: ... Would I be better off forgoing resiliency for simplicity, putting all my faith into the Equallogic to handle data resiliency?

IMHO, no; the resulting system will be significantly more brittle. Exactly how brittle I guess depends on the Equallogic system.

If you don't let zfs manage redundancy, Bill is correct: it's a more fragile system that *cannot* self-heal data errors in the (deep) stack. Quantifying the increased risk is a question that Richard Elling could probably answer :)

That's because ZFS does not have a way to handle a large class of storage designs, specifically the ones with raw storage and disk management being provided by reliable SAN devices.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Solaris startup script location
On Wed, Aug 18, 2010 at 12:16:04AM -0700, Alxen4 wrote: Is there any way to run a start-up script before a non-root pool is mounted? For example, I'm trying to use a ramdisk as a ZIL device (ramdiskadm), so I need to create the ramdisk before the actual pool is mounted; otherwise it complains that the log device is missing :)

Yes, it's actually quite easy. You need to create an SMF manifest and method. The manifest should make the ZFS mount dependent on it with the `dependent' and `/dependent' tag pair. It also needs to be dependent on resources it needs, with the `dependency' and `/dependency' pairs. It should also specify a `single_instance/' and `transient' service. The method script can do whatever the mount requires, such as creating the ramdisk.

-- -Gary Mills--Unix Group--Computer and Network Services-
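[A minimal sketch of such a manifest; the service name, method path, and dependencies are invented, and the method script would run ramdiskadm and zpool import:

<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type='manifest' name='ramdisk-zil'>
  <service name='site/ramdisk-zil' type='service' version='1'>
    <create_default_instance enabled='true'/>
    <single_instance/>
    <!-- wait until local devices are available -->
    <dependency name='devices' grouping='require_all' restart_on='none' type='service'>
      <service_fmri value='svc:/milestone/devices:default'/>
    </dependency>
    <!-- make the local filesystem mount wait for us -->
    <dependent name='ramdisk-zil_fs' grouping='require_all' restart_on='none'>
      <service_fmri value='svc:/system/filesystem/local'/>
    </dependent>
    <exec_method type='method' name='start' exec='/lib/svc/method/ramdisk-zil start' timeout_seconds='60'/>
    <exec_method type='method' name='stop' exec=':true' timeout_seconds='60'/>
    <property_group name='startd' type='framework'>
      <!-- transient: the method runs once at boot -->
      <propval name='duration' type='astring' value='transient'/>
    </property_group>
  </service>
</service_bundle>
]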
Re: [zfs-discuss] Opensolaris is apparently dead
On Fri, Aug 13, 2010 at 01:54:13PM -0700, Erast wrote: On 08/13/2010 01:39 PM, Tim Cook wrote: http://www.theregister.co.uk/2010/08/13/opensolaris_is_dead/ I'm a bit surprised at this development... Oracle really just doesn't get it. The part that's most disturbing to me is the fact they won't be releasing nightly snapshots. It appears they've stopped Illumos in its tracks before it really even got started (perhaps that explains the timing of this press release) Wrong. Be patient, with the pace of current Illumos development it soon will have all the closed binaries liberated and ready to sync up with promised ON code drops as dictated by GPL and CDDL licenses. Is this what you mean, from: http://hub.opensolaris.org/bin/view/Main/opensolaris_license Any Covered Software that You distribute or otherwise make available in Executable form must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License. You must include a copy of this License with every copy of the Source Code form of the Covered Software You distribute or otherwise make available. You must inform recipients of any such Covered Software in Executable form as to how they can obtain such Covered Software in Source Code form in a reasonable manner on or through a medium customarily used for software exchange. -- -Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] ZFS development moving behind closed doors
If this information is correct, http://opensolaris.org/jive/thread.jspa?threadID=133043 further development of ZFS will take place behind closed doors. Opensolaris will become the internal development version of Solaris with no public distributions. The community has been abandoned. -- -Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] zfs upgrade unmounts filesystems
Zpool upgrade on this system went fine, but zfs upgrade failed:

# zfs upgrade -a
cannot unmount '/space/direct': Device busy
cannot unmount '/space/dcc': Device busy
cannot unmount '/space/direct': Device busy
cannot unmount '/space/imap': Device busy
cannot unmount '/space/log': Device busy
cannot unmount '/space/mysql': Device busy
2 filesystems upgraded

Do I have to shut down all the applications before upgrading the filesystems? This is on a Solaris 10 5/09 system.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] zfs upgrade unmounts filesystems
On Thu, Jul 29, 2010 at 10:26:14PM +0200, Pawel Jakub Dawidek wrote: On Thu, Jul 29, 2010 at 12:00:08PM -0600, Cindy Swearingen wrote: I found a similar zfs upgrade failure with the device busy error, which I believe was caused by a file system mounted under another file system. If this is the cause, I will file a bug or find an existing one.

No, it was caused by processes active on those filesystems.

The workaround is to unmount the nested file systems and upgrade them individually, like this: # zfs upgrade space/direct # zfs upgrade space/dcc

Except that I couldn't unmount them because the filesystems were busy.

'zfs upgrade' unmounts the file system first, which makes it hard to upgrade, for example, the root file system. The only workaround I found is to clone the root file system (the clone is created with the most recent version), change the root file system to the newly created clone, reboot, upgrade the original root file system, change the root file system back, reboot, and destroy the clone.

In this case it wasn't the root filesystem, but I still had to disable twelve services before doing the upgrade and enable them afterwards. `fuser -c' is useful to identify the processes. Mapping them to services can be difficult. The server is essentially down during the upgrade. For a root filesystem, you might have to boot off the failsafe archive or a DVD and import the filesystem in order to upgrade it.

-- -Gary Mills--Unix Group--Computer and Network Services-
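[Pawel's root-filesystem sequence, spelled out as a sketch; the BE and snapshot names are invented, and you should check that the clone's mountpoint and the pool's bootfs are handled correctly on your release before trying this:

zfs snapshot rpool/ROOT/s10be@pre-upgrade
zfs clone rpool/ROOT/s10be@pre-upgrade rpool/ROOT/s10be-tmp
zpool set bootfs=rpool/ROOT/s10be-tmp rpool
init 6                                # reboot onto the clone
zfs upgrade rpool/ROOT/s10be          # the original is no longer busy
zpool set bootfs=rpool/ROOT/s10be rpool
init 6                                # reboot back
zfs destroy -R rpool/ROOT/s10be@pre-upgrade   # removes the clone too
]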
[zfs-discuss] root pool expansion
Right now I have a machine with a mirrored boot setup. The SAS drives are 43GB and the root pool is getting full. I do a backup of the pool nightly, so I feel confident that I don't need to mirror the drive and can break the mirror and expand the pool with the detached drive. I understand how to do this on a normal pool, but are there any restrictions on doing this with the root pool? Are there any grub issues? Thanks, Gary
Re: [zfs-discuss] ZFS snapshot zvols/iscsi send backup
Thanks for the quick response. I appreciate it much.
[zfs-discuss] ZFS snapshot zvols/iscsi send backup
I'm looking to use ZFS to export iSCSI volumes to Windows/Linux clients. Essentially, I'm looking to create two storage ZFS machines that I will export iSCSI targets from. Then, from the client side, I will enable mirroring. The two ZFS machines will be independent of each other.

I had a question about snapshotting of iSCSI zvols. If I do a snapshot of an iSCSI volume, it snapshots the blocks. I know that sending the blocks will allow for some form of replication. However, if I send the snapshot to a file, will I be able to recover the iSCSI volume from the file(s)? e.g.

zfs send tank/test@1 | gzip -c > zfs.tank.test.gz

Can I recover this iSCSI volume from zfs.tank.test.gz by sending it directly to another ZFS machine? Will I then be able to mount the ZFS volume created from this file and have my filesystem be the way it was? If I reassemble the blocks like they were before, I assume it reassembles everything the way it was before, including the filesystem and such. Or am I incorrect about this? Gary
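[For what it's worth, restoring from such a file is just the reverse pipeline; the pool and volume names follow the example above:

# recreate the zvol, contents and all, from the saved stream
gzcat zfs.tank.test.gz | zfs receive tank2/test
# the restored volume then appears under /dev/zvol/rdsk/tank2/test
]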
Re: [zfs-discuss] are these errors dangerous
I have seen this too. I'm guessing you have SATA disks which are exported as an iSCSI target. I'm also guessing you have used something like: iscsitadm create target --type raw -b /dev/dsk/c4t0d00 c4t0d0 i.e., you are not using a zfs shareiscsi property on a zfs volume, but creating the target from the device cNtNdN (dsk or rdsk, it doesn't seem to matter). You see these errors (always block 0) when the iSCSI initiator accesses the disks. Annoying... but the iSCSI transactions seem to be OK.
[zfs-discuss] ZFS disks hitting 100% busy
Our e-mail server started to slow down today. One of the disk devices is frequently at 100% usage. The heavy writes seem to cause reads to run quite slowly. In the statistics below, `c0t0d0' is UFS, containing the / and /var slices. `c0t1d0' is ZFS, containing /var/log/syslog, a couple of databases, and the GNU mailman files. It's this latter disk that's been hitting 100% usage.

$ iostat -xn 5 3
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    8.2   57.8  142.6   538.2  0.0  1.7    0.1   25.2   0  48 c0t0d0
    5.8  273.0  303.4 24115.9  0.0 18.6    0.0   66.7   0  73 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   57.2    0.0   294.6  0.0  1.3    0.0   22.1   0  64 c0t0d0
    0.2  370.2    1.1 33968.5  0.0 31.4    0.0   84.9   1 100 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.8   61.0    6.4   503.0  0.0  2.5    0.0   40.0   0  70 c0t0d0
    0.0  295.8    0.0 35273.3  0.0 35.0    0.0  118.3   0 100 c0t1d0

This system is running Solaris 10 5/09 on a Sun 4450 server. Both disk devices are actually hardware-mirrored pairs of SAS disks, behind an Adaptec RAID controller. Can anything be done to either reduce the amount of I/O or improve the write bandwidth? I assume that adding another disk device to the zpool would double the bandwidth. /var/log/syslog is quite large, reaching about 600 megabytes before it's rotated. This takes place each night, with compression bringing it down to about 70 megabytes. The server handles about 500,000 messages a day.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Is the J4200 SAS array suitable for Sun Cluster?
On Sun, May 16, 2010 at 01:14:24PM -0700, Charles Hedrick wrote: We use this configuration. It works fine. However, I don't know enough about the details to answer all of your questions. The disks are accessible from both systems at the same time. Of course, with ZFS you had better not actually use them from both systems.

That's what I wanted to know. I'm not familiar with SAS fabrics, so it's good to know that they operate similarly to multi-initiator SCSI in a cluster.

Actually, let me be clear about what we do. We have two J4200s and one J4400. One J4200 uses SAS disks, the others SATA. The two with SATA disks are used in Sun Cluster configurations as NFS servers. They fail over just fine, losing no state. The one with SAS is not used with Sun Cluster. Rather, it's a MySQL server with two systems, one of them a hot spare. (It also acts as a MySQL slave server, but it uses different storage for that.) That means that our actual failover experience is with the SATA configuration. I will say from experience that in the SAS configuration both systems see the disks at the same time. I even managed to get ZFS to mount the same pool from both systems, which shouldn't be possible. Behavior was very strange until we realized what was going on.

Our situation is that we only need a small amount of shared storage in the cluster. It's intended for high availability of core services, such as DNS and NIS, rather than as a NAS server.

I get the impression that they have special hardware in the SATA version that simulates SAS dual-interface drives. That's what lets you use SATA drives in a two-node configuration. There's also some additional software setup for that configuration.

That would be the SATA interposer that does that.

-- -Gary Mills--Unix Group--Computer and Network Services-
Re: [zfs-discuss] Does ZFS use large memory pages?
On Thu, May 06, 2010 at 07:46:49PM -0700, Rob wrote: Hi Gary, I would not remove this line in /etc/system. We have been combatting this bug for a while now on our ZFS file system running JES Commsuite 7. I would be interested in finding out how you were able to pinpoint the problem.

Our problem was a year ago. Careful reading of Sun bug reports helped. Opening a support case with Sun helped even more. Large memory pages were likely not involved.

We seem to have no worries with the system currently, but when the file system gets above 80% full we seem to have quite a number of issues, much the same as what you've had in the past: ps and prstat hanging. Are you able to tell me the IDR number that you applied?

The IDR was only needed last year. Upgrading to Solaris 10 10/09 and applying the latest patches resolved the problem.

-- -Gary Mills--Unix Group--Computer and Network Services-
[zfs-discuss] Is the J4200 SAS array suitable for Sun Cluster?
I'm setting up a two-node cluster with 1U x86 servers. It needs a small amount of shared storage, with two or four disks. I understand that the J4200 with SAS disks is approved for this use, although I haven't seen this information in writing. Does anyone have experience with this sort of configuration? I have a few questions. I understand that the J4200 with SATA disks will not do SCSI reservations. Will it with SAS disks? The X4140 seems to require two SAS HBAs, one for the internal disks and one for the external disks. Is this correct? Will the disks in the J4200 be accessible from both nodes, so that the cluster can fail over the storage? I know this works with a multi-initiator SCSI bus, but I don't know about SAS behavior. Is there a smaller, and cheaper, SAS array that can be used in this configuration? It would still need to have redundant power and redundant SAS paths. I plan to use ZFS everywhere, for the root filesystem and the shared storage. The only exception will be UFS for /globaldevices. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS vs SATA: Same size, same speed, why SAS?
On Mon, Apr 26, 2010 at 01:32:33PM -0500, Dave Pooser wrote: On 4/26/10 10:10 AM, Richard Elling richard.ell...@gmail.com wrote: SAS shines with multiple connections to one or more hosts. Hence, SAS is quite popular when implementing HA clusters. So that would be how one builds something like the active/active controller failover in standalone RAID boxes. Is there a good resource on doing something like that with an OpenSolaris storage server? I could see that as a project I might want to attempt. This is interesting. I have a two-node SPARC cluster that uses a multi-initiator SCSI array for shared storage. As an application server, it needs only two disks in the array. They are a ZFS mirror. This all works quite nicely under Sun Cluster. I'd like to duplicate this configuration with two small x86 servers and a small SAS array, also with only two disks. It should be easy to find a pair of 1U servers, but what's the smallest SAS array that's available? Does it need an array controller? What's needed on the servers to connect to it? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposition of a new zpool property.
I'm not sure I like this at all. Some of my pools take hours to scrub. I have a cron job that runs scrubs in sequence... Start one pool's scrub and then poll until it's finished, start the next and wait, and so on, so I don't create too much load and bring all I/O to a crawl. The job is launched once a week, so the scrubs have plenty of time to finish. :) Scrubs every hour? Some of my pools would be in continuous scrub. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
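A minimal sketch of such a sequential-scrub job, assuming hypothetical pool names tank1 through tank3, and that `zpool status' reports "scrub in progress" while one is running (the exact wording varies by release):

    #!/bin/sh
    # Scrub each pool in turn, waiting for one scrub to finish
    # before starting the next, so only one pool is busy at a time.
    for pool in tank1 tank2 tank3; do
        zpool scrub "$pool"
        while zpool status "$pool" | grep -q "scrub in progress"; do
            sleep 300
        done
    done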
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 04, 2010 at 04:20:10PM -0600, Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. I'm pleased to report that I found the culprit, and the culprit was me! Well, ZFS peculiarities may be involved as well. Let me explain: We had a single second-level filesystem and five third-level filesystems, all with 14 daily snapshots. The snapshots were maintained by a cron command that did a `zfs list -rH -t snapshot -o name' to get the names of all of the snapshots, extracted the part after the `@', and then sorted them uniquely to get a list of suffixes that were older than 14 days. The suffixes were Julian dates, so they sorted correctly. It then did a `zfs destroy -r' to delete them. The recursion was always done from the second-level filesystem. The top-level filesystem was empty and had no snapshots. Here's a portion of the script:

    zfs list -rH -t snapshot -o name $FS | \
        cut -d@ -f2 | \
        sort -ur | \
        sed 1,${NR}d | \
        xargs -I '{}' zfs destroy -r $FS@'{}'
    zfs snapshot -r $FS@$jd

Just over two weeks ago, I rearranged the filesystems so that the second-level filesystem was newly-created and initially had no snapshots. It did have a snapshot taken every day thereafter, so that eventually it also had 14 of them. It was during that interval that the complaints started. My statistics clearly showed the performance stall and subsequent recovery. Once that filesystem reached 14 snapshots, the complaints stopped and the statistics showed only a modest increase in CPU activity, but no stall. During this interval, the script was doing a recursive destroy for a snapshot that didn't exist at the specified level, but only existed in the descendent filesystems. I'm assuming that that unusual situation was the cause of the stall, although I don't have good evidence. By the time the complaints reached my ears, and I was able to refine my statistics gathering sufficiently, the problem had gone away. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Mon, Mar 08, 2010 at 03:18:34PM -0500, Miles Nordin wrote: gm == Gary Mills mi...@cc.umanitoba.ca writes: gm destroys the oldest snapshots and creates new ones, both gm recursively. I'd be curious if you try taking the same snapshots non-recursively instead, does the pause go away? I'm still collecting statistics, but that is one of the things I'd like to try. Because recursive snapshots are special: they're supposed to atomically synchronize the cut-point across all the filesystems involved, AIUI. I don't see that recursive destroys should be anything special though. gm Is it destroying old snapshots or creating new ones that gm causes this dead time? sortof seems like you should tell us this, not the other way around. :) Seriously though, isn't that easy to test? And I'm curious myself too. Yes, that's another thing I'd like to try. I'll just put a `sleep' in the script between the two actions to see if the dead time moves later in the day. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
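A minimal sketch of that experiment, with hypothetical $OLD and $NEW suffix variables standing in for the existing script's logic:

    # Separate the two actions by an hour; if the dead time moves
    # later in the day, snapshot creation is the culprit,
    # otherwise the destroy is.
    zfs destroy -r $FS@$OLD
    sleep 3600
    zfs snapshot -r $FS@$NEW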
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Mon, Mar 08, 2010 at 01:23:10PM -0800, Bill Sommerfeld wrote: On 03/08/10 12:43, Tomas Ögren wrote: So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini Slim) as metadata L2ARC and that seems to have pushed the snapshot times down to about 30 seconds. Out of curiosity, how much physical memory does this system have? Mine has 64 GB of memory with the ARC limited to 32 GB. The Cyrus IMAP processes, thousands of them, use memory mapping extensively. I don't know if this design affects the snapshot recycle behavior. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 04, 2010 at 04:20:10PM -0600, Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. I should mention that this seems to be a new problem. We've been using the same scheme to cycle snapshots for several years. The complaints of an unresponsive interval have only happened recently. I'm still waiting for our help desk to report on when the complaints started. It may be the result of some recent change we made, but so far I can't tell what that might have been. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Snapshot recycle freezes system activity
We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. Is it destroying old snapshots or creating new ones that causes this dead time? What does each of these procedures do that could affect the system? What can I do to make this less visible to users? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshot recycle freezes system activity
On Thu, Mar 04, 2010 at 07:51:13PM -0300, Giovanni Tirloni wrote: On Thu, Mar 4, 2010 at 7:28 PM, Ian Collins [1]...@ianshome.com wrote: Gary Mills wrote: We have an IMAP e-mail server running on a Solaris 10 10/09 system. It uses six ZFS filesystems built on a single zpool with 14 daily snapshots. Every day at 11:56, a cron command destroys the oldest snapshots and creates new ones, both recursively. For about four minutes thereafter, the load average drops and I/O to the disk devices drops to almost zero. Then, the load average shoots up to about ten times normal and then declines to normal over about four minutes, as disk activity resumes. The statistics return to their normal state about ten minutes after the cron command runs. Is it destroying old snapshots or creating new ones that causes this dead time? What does each of these procedures do that could affect the system? What can I do to make this less visible to users? I have a couple of Solaris 10 boxes that do something similar (hourly snaps) and I've never seen any lag in creating and destroying snapshots. One system with 16 filesystems takes 5 seconds to destroy the 16 oldest snaps and create 5 recursive new ones. I logged load average on these boxes and there is a small spike on the hour, but this is down to sending the snaps, not creating them. We've seen the behaviour that Gary describes while destroying datasets recursively (600GB and with 7 snapshots). It seems that close to the end the server stalls for 10-15 minutes and NFS activity stops. For small datasets/snapshots that doesn't happen or is harder to notice. Does ZFS have to do something special when it's done releasing the data blocks at the end of the destroy operation? That does sound similar to the problem here. The zpool is 3 TB in size with about 1.4 TB used. It does sound as if the stall happens during the `zfs destroy -r' rather than during the `zfs snapshot -r'. What can zfs be doing when the CPU load average drops and disk I/O is close to zero? I also had a peculiar problem here recently when I was upgrading the ZFS filesystems on our test server from 3 to 4. When I tried `zfs upgrade -a', the command hung for a long time and could not be interrupted, killed, or traced. Eventually it terminated on its own. Only the two upper-level filesystems had been upgraded. I upgraded the lower-level ones individually with `zfs upgrade' with no further problems. I had previously upgraded the zpool with no problems. I don't know if this behavior is related to the stall on the production server. I haven't attempted the upgrades there yet. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What Happend to my OpenSolaris X86 Install?
My guess is that the grub bootloader wasn't upgraded on the actual boot disk. Search for directions on how to mirror ZFS boot drives and you'll see how to copy the correct grub loader onto the boot disk. If you want a simpler approach, swap the disks. I did this when I was moving from SXCE to OSOL so I could make sure that things worked before making one of the drives a mirror. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
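A minimal sketch of the usual fix, assuming the boot disk is c0t0d0 (a hypothetical device name; substitute your own):

    # Install the current grub stages onto slice 0 of the boot disk,
    # so the loader on disk matches the upgraded boot environment.
    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t0d0s0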
Re: [zfs-discuss] How do separate ZFS filesystems affect performance?
On Thu, Jan 14, 2010 at 10:58:48AM +1100, Daniel Carosone wrote: On Wed, Jan 13, 2010 at 08:21:13AM -0600, Gary Mills wrote: Yes, I understand that, but do filesystems have separate queues of any sort within the ZIL? I'm not sure. If you can experiment and measure a benefit, understanding the reasons is helpful but secondary. If you can't experiment so easily, you're stuck asking questions, as now, to see whether the effort of experimenting is potentially worthwhile. Yes, we're stuck asking questions. I appreciate your responses. Some other things to note (not necessarily arguments for or against): * you can have multiple slog devices, in case you're creating so much ZIL traffic that ZIL queueing is a real problem, however shared or structured between filesystems. For the time being, I'd like to stay with the ZIL that's internal to the zpool. * separate filesystems can have different properties which might help tuning and experiments (logbias, copies, compress, *cache), as well as the recordsize. Maybe you will find that compress on mailboxes helps, as long as you're not also compressing the db's? Yes, that's a good point in favour of a separate filesystem. * separate filesystems may have different recovery requirements (snapshot cycles). Note that taking snapshots is ~free, but keeping them and deleting them have costs over time. Perhaps you can save some of these costs if the db's are throwaway/rebuildable. Also a good point. If not, would it help to put the database filesystems into a separate zpool? Maybe, if you have the extra devices - but you need to compare with the potential benefit of adding those devices (and their IOPS) to benefit all users of the existing pool. For example, if the databases are a distinctly different enough load, you could compare putting them on a dedicated pool on ssd, vs using those ssd's as additional slog/l2arc. Unless you can make quite categorical separations between the workloads, such that an unbalanced configuration matches an unbalanced workload, you may still be better with consolidated IO capacity in the one pool. As well, I'd like to keep all of the ZFS pools on the same external storage device. This makes migrating to a different server quite easy. Note, also, you can only take recursive atomic snapshots within the one pool - this might be important if the db's have to match the mailbox state exactly, for recovery. That's another good point. It's certainly better to have synchronized snapshots. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
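For example, a minimal sketch of carving out such a database filesystem with its own properties, assuming a pool named space and a hypothetical filesystem name, with a small recordsize to match the databases' update granularity:

    # Create a filesystem tuned for the IMAP databases; leave the
    # mailbox filesystems at the 128K default.
    zfs create -o recordsize=16k -o atime=off space/imap-db
    zfs get recordsize,atime,compression space/imap-db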
Re: [zfs-discuss] Does ZFS use large memory pages?
On Mon, Jan 11, 2010 at 01:43:27PM -0600, Gary Mills wrote: This line was a workaround for bug 6642475 that had to do with searching for large contiguous pages. The result was high system time and slow response. I can't find any public information on this bug, although I assume it's been fixed by now. It may have only affected the Oracle database. I eventually found it. The bug is not visible from Sunsolve even with a contract, but it is in bugs.opensolaris.org without one. This is extremely confusing. I'd like to remove this line from /etc/system now, but I don't know if it will have any adverse effect on ZFS or the Cyrus IMAP server that runs on this machine. Does anyone know if ZFS uses large memory pages? Bug 6642475 is still outstanding, although related bugs have been fixed. I'm going to leave `set pg_contig_disable=1' in place. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Repeating scrub does random fixes
Thanks for all the suggestions. Now for a strange tale... I tried upgrading to dev 130 and, as expected, things did not go well. All sorts of permission errors flew by during the upgrade stage and it would not start X-windows. I've heard that things installed from the contrib and extras repositories might cause issues but I didn't want to spend the time with my server offline while I tried to figure this out. So, I booted back to 111b and scrubs still showed errors. Late in the evening, the pool faulted, preventing any backups from the other servers to this pool. Being greeted this morning with the "recover files from backup" status message sent shivers up my spine. This IS my backup. I exported the pool and then imported it, which it did successfully. Now the scrubs run cleanly (at least for a few repeated scrubs spanning several hours). So, was it hardware? What the heck could have fixed it by just exporting and importing the pool? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] How do separate ZFS filesystems affect performance?
I'm working with a Cyrus IMAP server running on a T2000 box under Solaris 10 10/09 with current patches. Mailboxes reside on six ZFS filesystems, each containing about 200 gigabytes of data. These are part of a single zpool built on four Iscsi devices from our Netapp filer. One of these ZFS filesystems contains a number of global and per-user databases in addition to one sixth of the mailboxes. I'm thinking of moving these databases to a separate ZFS filesystem. Access to these databases must be quick to ensure responsiveness of the server. We are currently experiencing a slowdown in performance when the number of simultaneous IMAP sessions rises above 3000. These databases are opened and memory-mapped by all processes. They have the usual requirement for locking and synchronous writes whenever they are updated. Is moving the databases (IMAP metadata) to a separate ZFS filesystem likely to improve performance? I've heard that this is important, but I'm not clear why this is. Does each filesystem have its own queue in the ARC or ZIL? Here are some statistics taken while the server was busy and access was slow:

# /usr/local/sbin/zilstat 5 5
  N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate  ops  <=4kB  4-32kB  >=32kB
  1126664     225332      515872   11485184    2297036     3469312  292    163      51      79
   740536     148107      250896    9535488    1907097     4005888  198    106      24      68
   758344     151668      179104   12546048    2509209     2682880  227     93      45      89
   603304     120660      204344    9179136    1835827     2084864  179     89      23      67
   948896     189779      346520   15880192    3176038     4173824  262    108      32     123

# /usr/local/sbin/arcstat 5 5
    Time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
10:50:16  191M   31M     16   14M    8   17M   48   18M   12    30G   32G
10:50:21    1K   148     10    76    5    72   58    78   15    30G   32G
10:50:26    1K   154     12    88    7    65   72    96   18    30G   32G
10:50:31   796    61      7    54    7     6   35    25    8    30G   32G
10:50:36    1K   117      9   105    8    12   53    44   10    30G   32G

-- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How do separate ZFS filesystems affect performance?
On Tue, Jan 12, 2010 at 11:11:36AM -0600, Bob Friesenhahn wrote: On Tue, 12 Jan 2010, Gary Mills wrote: Is moving the databases (IMAP metadata) to a separate ZFS filesystem likely to improve performance? I've heard that this is important, but I'm not clear why this is. There is an obvious potential benefit in that you are then able to tune filesystem parameters to best fit the needs of the application which updates the data. For example, if the database uses a small block size, then you can set the filesystem blocksize to match. If the database uses memory mapped files, then using a filesystem blocksize which is closest to the MMU page size may improve performance. I found a couple of references that suggest just putting the databases on their own ZFS filesystem has a great benefit. One is an e-mail message to a mailing list from Vincent Fox at UC Davis. They run a similar system to ours at that site. He says: Particularly the database is important to get its own filesystem so that its queue/cache are separated. The second one is from: http://blogs.sun.com/roch/entry/the_dynamics_of_zfs He says: For file modification that come with some immediate data integrity constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent log or ZIL. This sounds like the ZIL queue mentioned above. Is I/O for each of those handled separately? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Repeating scrub does random fixes
I've just made a couple of consecutive scrubs; each time it found a couple of checksum errors, but on different drives. No indication of any other errors. That a disk scrubs cleanly on a quiescent pool in one run but fails in the next is puzzling. It reminds me of the snv_120 odd-number-of-disks raidz bug I reported. Looks like I've got to bite the bullet and upgrade to the dev tree and hope for the best. Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Does ZFS use large memory pages?
Last April we put this in /etc/system on a T2000 server with large ZFS filesystems: set pg_contig_disable=1 This was while we were attempting to solve a couple of ZFS problems that were eventually fixed with an IDR. Since then, we've removed the IDR and brought the system up to Solaris 10 10/09 with current patches. It's stable now, but seems slower. This line was a workaround for bug 6642475 that had to do with searching for large contiguous pages. The result was high system time and slow response. I can't find any public information on this bug, although I assume it's been fixed by now. It may have only affected the Oracle database. I'd like to remove this line from /etc/system now, but I don't know if it will have any adverse effect on ZFS or the Cyrus IMAP server that runs on this machine. Does anyone know if ZFS uses large memory pages? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
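One way to see what page sizes a process is actually getting; a sketch, assuming a Cyrus imapd with a hypothetical pid of 12345:

    # Show the HAT page size of each mapping; the Pgsz column
    # reveals whether any large pages are in use.
    pmap -s 12345 | head -20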
[zfs-discuss] Repeating scrub does random fixes
I've been using a 5-disk raidZ for years on an SXCE machine which I converted to OSOL. The only time I ever had zfs problems in SXCE was with snv_120, which was fixed. So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it will fix errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related. fmdump only reports three types of errors: ereport.fs.zfs.checksum ereport.io.scsi.cmd.disk.tran ereport.io.scsi.cmd.disk.recovered The middle one seems to be the issue; I'd like to track down its source. Any docs on how to do this? Thanks, Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
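A starting point for digging into those, using the error classes shown above:

    # Summarize recent error telemetry, then dump full detail
    # (including the device path) for just the transport errors.
    fmdump -e
    fmdump -eV -c ereport.io.scsi.cmd.disk.tran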
Re: [zfs-discuss] Repeating scrub does random fixes
Mattias Pantzare wrote: On Sun, Jan 10, 2010 at 16:40, Gary Gendel g...@genashor.com wrote: I've been using a 5-disk raidZ for years on an SXCE machine which I converted to OSOL. The only time I ever had zfs problems in SXCE was with snv_120, which was fixed. So, now I'm at OSOL snv_111b and I'm finding that scrub repairs errors on random disks. If I repeat the scrub, it will fix errors on other disks. Occasionally it runs cleanly. That it doesn't happen in a consistent manner makes me believe it's not hardware related. That is a good indication for hardware-related errors. Software will do the same thing every time but hardware errors are often random. But you are running an older version now, I would recommend an upgrade. I would have thought that too if it didn't start right after the switch from SXCE to OSOL. As for an upgrade, I use the dev repository on my laptop and I find that OSOL updates aren't nearly as stable as SXCE was. I tried for a bit, but always had to go back to 111b because something crucial broke. I was hoping to wait until the official release in March in order to let things stabilize. This is my main web/mail/file/etc. server and I don't really want to muck with it too much. That said, I may take a gamble on upgrading as we're getting closer to the 2010.x release. Gary ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS filesystems not mounted on reboot with Solaris 10 10/09
I have a system that was recently upgraded to Solaris 10 10/09. It has a UFS root on local disk and a separate zpool on Iscsi disk. After a reboot, the ZFS filesystems were not mounted, although the zpool had been imported. `zfs mount' showed nothing. `zfs mount -a' mounted them nicely. The `canmount' property is `on'. Why would they not be mounted at boot? This used to work with earlier releases of Solaris 10. The `zfs mount -a' at boot is run by the /system/filesystem/local:default service. It didn't record any errors on the console or in the log: [ Dec 19 08:09:11 Executing start method (/lib/svc/method/fs-local) ] [ Dec 19 08:09:12 Method start exited with status 0 ] Is a dependency missing? -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
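A quick way to inspect the suspect dependency graph with standard SMF commands; a sketch (the iSCSI initiator service name varies by release):

    # List everything filesystem/local waits for; if the iSCSI
    # initiator isn't among them, the pool's devices may arrive
    # after the mount attempt.
    svcs -d svc:/system/filesystem/local:default
    svcs -l iscsi_initiator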
Re: [zfs-discuss] Permanent errors on two files
On Fri, Dec 04, 2009 at 02:52:47PM -0700, Cindy Swearingen wrote: If space/dcc is a dataset, is it mounted? ZFS might not be able to print the filenames if the dataset is not mounted, but I'm not sure if this is why only object numbers are displayed. Yes, it's mounted and is quite an active filesystem. I would also check fmdump -eV to see how frequent the hardware has had problems. That shows ZFS checksum errors in July, but nothing since that time. There were also DIMM errors before that, starting in June. We replaced the failed DIMMs, also in July. This is an X4450 with ECC memory. There were no disk errors reported. I suppose we can blame the memory. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] freeNAS moves to Linux from FreeBSD
The only reason I thought this news would be of interest is that the discussions had some interesting comments. Basically, there is a significant outcry because zfs was going away. I saw NexentaOS and EON mentioned several times as the path to go. Seems that there is some opportunity for OpenSolaris advocacy in this arena while the topic is hot. Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Permanent errors on two files
On Sat, Dec 05, 2009 at 01:52:12AM +0300, Victor Latushkin wrote: On Dec 5, 2009, at 0:52, Cindy Swearingen cindy.swearin...@sun.com wrote: The zpool status -v command will generally print out filenames, dnode object numbers, or identify metadata corruption problems. These look like object numbers, because they are large, rather than metadata objects, but an expert will have to comment. Yes, these are object numbers, and the most likely reason they are not turned into filenames is that the corresponding files no longer exist. That seems to be the case: # zdb -d space/dcc 0x11e887 0xba25aa Dataset space/dcc [ZPL], ID 21, cr_txg 19, 20.5G, 3672408 objects So I'd run scrub another time; if the files are gone and there are no other corruptions, scrub will reset the error log and zpool status should become clean. That worked. After the scrub, there are no errors reported. You might be able to identify these object numbers with zdb, but I'm not sure how to do that. You can try to use zdb this way to check if these objects still exist: zdb -d space/dcc 0x11e887 0xba25aa -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
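For anyone following along, the sequence that cleared the stale entries here, using the pool name from the thread:

    # Re-run a scrub; with the corrupt files already deleted, the
    # error log is reset and status should come back clean.
    zpool scrub space
    zpool status -v space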
Re: [zfs-discuss] virsh troubling zfs!?
On Tue, Nov 03, 2009 at 11:39:28AM -0800, Ralf Teckelmann wrote: Hi and hello, I have a problem confusing me. I hope someone can help me with it. I followed a best practise - I think - using dedicated zfs filesystems for my virtual machines. Commands (for completion):

    zfs create rpool/vms
    zfs create rpool/vms/vm1
    zfs create -V 10G rpool/vms/vm1/vm1-dsk

This command creates the file system /rpool/vms/vm1/vm1-dsk and the according /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk. (Clarification) Your commands create two filesystems:

    rpool/vms
    rpool/vms/vm1

You then create a ZFS Volume:

    rpool/vms/vm1/vm1-dsk

which results in associated dsk and rdsk devices being created as:

    /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk
    /dev/zvol/rdsk/rpool/vms/vm1/vm1-dsk

These two nodes are artifacts of the zfs volume implementation and are required to allow zfs volumes to emulate traditional disk devices. They will appear and disappear accordingly as zfs volumes are created and destroyed. If I delete a VM I set up using this filesystem via virsh undefine vm1, the /rpool/vms/vm1/vm1-dsk gets also deleted, but the /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk is left. virsh undefine does not delete filesystems, disks or any other kind of backing storage. In order to delete the three things you created, you need to issue:

    zfs destroy rpool/vms/vm1/vm1-dsk
    zfs destroy rpool/vms/vm1
    zfs destroy rpool/vms

or (more simply) you can do it recursively, if there's nothing else to be affected:

    zfs destroy -r rpool/vms

Obviously you need to be careful with recursive destruction that no other filesystems/volumes are affected. Without /rpool/vms/vm1/vm1-dsk I am not able to do zfs destroy rpool/vms/vm1/vm1-dsk, so the /dev/zvol/dsk/rpool/vms/vm1/vm1-dsk could not be destroyed and will be left forever!? How can I get rid of this problem? You don't have a problem. When the zfs volume is destroyed (as I describe above), then the associated devices are also removed. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Hope that helps. Gary -- Gary Pennington Solaris Core OS Sun Microsystems gary.penning...@sun.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Apple shuts down open source ZFS project
Apple is known to strong-arm in licensing negotiations. I'd really like to hear the straight talk about what transpired. That's ok, it just means that I won't be using a Mac as a server. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Slow reads with ZFS+NFS
Heya all, I'm working on testing ZFS with NFS, and I could use some guidance - read speeds are a bit less than I expected. Over a gig-e line, we're seeing ~30 MB/s reads on average - doesn't seem to matter if we're doing large numbers of small files or small numbers of large files, the speed seems to top out there. We've disabled pre-fetching, which may be having some effect on read speeds, but proved necessary due to severe performance issues on database reads with it enabled. (Reading from the DB with pre-fetching enabled was taking 4-5 times as long as with it disabled.) Write speed seems to be fine. Testing is showing ~95 MB/s, which seems pretty decent considering there's been no real network tuning done. The NFS server we're testing is a Sun x4500, configured with a storage pool consisting of 20x 2-disk mirrors, using separate SSD for logging. It's running the latest version of Nexenta Core. (We've also got a second x4500 in with a raidZ2 config, running OpenSolaris proper, showing the same issues with reads.) We're using NFS v4 via TCP, serving various Linux clients (the majority are CentOS 5.3). Connectivity is presently provided by a single gigabit ethernet link; entirely conventional configuration (no jumbo frames/etc). Our workload is pretty read heavy; we're serving both website assets and databases via NFS. The majority of files being served are small (< 1MB). The databases are MySQL/InnoDB, with the data in separate zfs filesystems with a record size of 16k. The website assets/etc. are in zfs filesystems with the default record size. On the database server side of things, we've disabled InnoDB's double write buffer. I'm wondering if there's any other tuning that'd be a good idea for ZFS in this situation, or if there's some NFS tuning that should be done when dealing specifically with ZFS. Any advice would be greatly appreciated. Thanks, -- -- Gary Gogick senior systems administrator | workhabit,inc. // email: g...@workhabit.com | web: http://www.workhabit.com // office: 866-workhabit | fax: 919-552-9690 -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Slow reads with ZFS+NFS
Trevor/all, We've been timing the copying of actual data (1GB of assorted files, generally < 1MB with numerous larger files thrown in) in an attempt to simulate real world use. We've been copying different sets of data around to try and avoid anything being cached anywhere. I don't recall the specific numbers, but local reading/writing on the x4500 was definitely well over what can be theoretically pushed through a gig-e line; so I'm pretty convinced the problem is either with the ZFS+NFS combo or NFS, rather than with ZFS alone. I'll do some OpenSolaris - OpenSolaris testing tonight and see what happens. Thanks for the replies, appreciate the help! On Tue, Oct 20, 2009 at 1:43 PM, Trevor Pretty trevor_pre...@eagle.co.nz wrote: Gary Were you measuring the Linux NFS write performance? It's well known that Linux can use NFS in a very unsafe mode and report the write complete when it is not all the way to safe storage. This is often reported as "Solaris has slow NFS write performance." This link does not mention NFS v4 but you might want to check. http://nfs.sourceforge.net/ What's the write performance like between the two OpenSolaris systems? Richard Elling wrote: cross-posting to nfs-discuss On Oct 20, 2009, at 10:35 AM, Gary Gogick wrote: Heya all, I'm working on testing ZFS with NFS, and I could use some guidance - read speeds are a bit less than I expected. Over a gig-e line, we're seeing ~30 MB/s reads on average - doesn't seem to matter if we're doing large numbers of small files or small numbers of large files, the speed seems to top out there. We've disabled pre-fetching, which may be having some effect on read speeds, but proved necessary due to severe performance issues on database reads with it enabled. (Reading from the DB with pre-fetching enabled was taking 4-5 times as long as with it disabled.) What is the performance when reading locally (eliminate NFS from the equation)? -- richard Write speed seems to be fine. Testing is showing ~95 MB/s, which seems pretty decent considering there's been no real network tuning done. The NFS server we're testing is a Sun x4500, configured with a storage pool consisting of 20x 2-disk mirrors, using separate SSD for logging. It's running the latest version of Nexenta Core. (We've also got a second x4500 in with a raidZ2 config, running OpenSolaris proper, showing the same issues with reads.) We're using NFS v4 via TCP, serving various Linux clients (the majority are CentOS 5.3). Connectivity is presently provided by a single gigabit ethernet link; entirely conventional configuration (no jumbo frames/etc). Our workload is pretty read heavy; we're serving both website assets and databases via NFS. The majority of files being served are small (< 1MB). The databases are MySQL/InnoDB, with the data in separate zfs filesystems with a record size of 16k. The website assets/etc. are in zfs filesystems with the default record size. On the database server side of things, we've disabled InnoDB's double write buffer. I'm wondering if there's any other tuning that'd be a good idea for ZFS in this situation, or if there's some NFS tuning that should be done when dealing specifically with ZFS. Any advice would be greatly appreciated. Thanks, -- -- Gary Gogick senior systems administrator | workhabit,inc. 
// email: g...@workhabit.com | web: http://www.workhabit.com // office: 866-workhabit | fax: 919-552-9690 -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. -- -- Gary Gogick senior systems administrator | workhabit,inc. // email: g...@workhabit.com | web: http://www.workhabit.com // office: 866-workhabit | fax: 919-552-9690 -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] If you have ZFS in production, willing to share some details (with me)?
On Fri, Sep 18, 2009 at 01:51:52PM -0400, Steffen Weiberle wrote: I am trying to compile some deployment scenarios of ZFS. # of systems One, our e-mail server for the entire campus. amount of storage 2 TB that's 58% used. application profile(s) This is our Cyrus IMAP spool. In addition to user's e-mail folders (directories) and messages (files), it contains global, per-folder, and per-user databases. The latter two types are quite small. type of workload (low, high; random, sequential; read-only, read-write, write-only) It's quite active. Message files arrive randomly and are deleted randomly. As a result, files in a directory are not located in proximity on the storage. Individual users often read all of their folders and messages in one IMAP session. Databases are quite active. Each incoming message adds a file to a directory and reads or updates several databases. Most IMAP I/O is done with mmap() rather than with read()/write(). So far, IMAP performance is adequate. The backup, done by EMC Networker, is very slow because it must read thousands of small files in directory order. storage type(s) We are using an Iscsi SAN with storage on a Netapp filer. It exports four 500-GB LUNs that are striped into one ZFS pool. All disk management is done on the Netapp. We have had several disk failures and replacements on the Netapp, with no effect on the e-mail server. industry A University with 35,000 enabled e-mail accounts. whether it is private or I can share in a summary anything else that might be of interest You are welcome to share this information. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS commands hang after several zfs receives
On Tue, Sep 15, 2009 at 08:48:20PM +1200, Ian Collins wrote: Ian Collins wrote: I have a case open for this problem on Solaris 10u7. The case has been identified and I've just received an IDR,which I will test next week. I've been told the issue is fixed in update 8, but I'm not sure if there is an nv fix target. I'll post back once I've abused a test system for a while. The IDR I was sent appears to have fixed the problem. I have been abusing the box for a couple of weeks without any lockups. Roll on update 8! Was that IDR140221-17? That one fixed a deadlock bug for us back in May. -- -Gary Mills--Unix Group--Computer and Network Services- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem with snv_122 Zpool issue
You shouldn't hit the Raid-Z issue because it only happens with an odd number of disks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Alan, Thanks for the detailed explanation. The rollback successfully fixed my 5-disk RAID-Z errors. I'll hold off another upgrade attempt until 124 rolls out. Fortunately, I didn't do a zfs upgrade right away after installing 121. For those that did, this could be very painful. Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Alan, Super find. Thanks, I thought I was just going crazy until I rolled back to 110 and the errors disappeared. When you do work out a fix, please ping me to let me know when I can try an upgrade again. Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
It looks like it's definitely related to the snv_121 upgrade. I decided to roll back to snv_110 and the checksum errors have disappeared. I'd like to issue a bug report, but I don't have any information that might help track this down, just lots of checksum errors. Looks like I'm stuck at snv_110 until someone figures out what is broken. If it helps, here is my property list for this pool.

g...@phoenix[~]101 zfs get all archive
NAME     PROPERTY         VALUE                  SOURCE
archive  type             filesystem             -
archive  creation         Mon Jun 18 20:40 2007  -
archive  used             787G                   -
archive  available        1.01T                  -
archive  referenced       125G                   -
archive  compressratio    1.13x                  -
archive  mounted          yes                    -
archive  quota            none                   default
archive  reservation      none                   default
archive  recordsize       128K                   default
archive  mountpoint       /archive               default
archive  sharenfs         off                    default
archive  checksum         on                     default
archive  compression      on                     local
archive  atime            off                    local
archive  devices          on                     default
archive  exec             on                     default
archive  setuid           on                     default
archive  readonly         off                    default
archive  zoned            off                    default
archive  snapdir          hidden                 default
archive  aclmode          groupmask              default
archive  aclinherit       restricted             default
archive  canmount         on                     default
archive  shareiscsi       off                    default
archive  xattr            on                     default
archive  copies           1                      default
archive  version          3                      -
archive  utf8only         off                    -
archive  normalization    none                   -
archive  casesensitivity  sensitive              -
archive  vscan            off                    default
archive  nbmand           off                    default
archive  sharesmb         off                    local
archive  refquota         none                   default
archive  refreservation   none                   default
archive  primarycache     all                    default
archive  secondarycache   all                    default

And each of the sub-pools looks like this:

g...@phoenix[~]101 zfs get all archive/gary
archive/gary  type             filesystem             -
archive/gary  creation         Mon Jun 18 20:56 2007  -
archive/gary  used             141G                   -
archive/gary  available        1.01T                  -
archive/gary  referenced       141G                   -
archive/gary  compressratio    1.22x                  -
archive/gary  mounted          yes                    -
archive/gary  quota            none                   default
archive/gary  reservation      none                   default
archive/gary  recordsize       128K                   default
archive/gary  mountpoint       /archive/gary          default
archive/gary  sharenfs         off                    default
archive/gary  checksum         on                     default
archive/gary  compression      on                     inherited from archive
archive/gary  atime            off                    inherited from archive
archive/gary  devices          on                     default
archive/gary  exec             on                     default
archive/gary  setuid           on                     default
archive/gary  readonly         off                    default
archive/gary  zoned            off                    default
archive/gary  snapdir          hidden                 default
archive/gary  aclmode          groupmask              default
archive/gary  aclinherit       passthrough            local
archive/gary  canmount         on                     default
archive/gary  shareiscsi       off                    default
archive/gary  xattr            on                     default
archive/gary  copies           1                      default
archive/gary  version          3                      -
archive/gary  utf8only         off                    -
archive/gary  normalization    none                   -
archive/gary  casesensitivity  sensitive              -
archive/gary  vscan            off                    default
archive/gary
[zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
I have a Raid-Z pool of five 500GB disks that has been producing checksum errors right after upgrading SXCE to build 121. They seem to be randomly occurring on all 5 disks, so it doesn't look like a disk failure situation. Repeatedly running a scrub on the pool randomly repairs between 20 and a few hundred checksum errors. Since I hadn't physically touched the machine, it seems a very strong coincidence that it started right after I upgraded to 121. This machine is a SunFire v20z with a Marvell SATA 8-port controller (the same one as in the original thumper). I've seen this kind of problem way back around build 40-50 ish, but haven't seen it after that until now. Anyone else experiencing this problem or know how to isolate the problem definitively? Thanks, Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Mon, Jul 06, 2009 at 04:54:16PM +0100, Andrew Gabriel wrote: Andre van Eyssen wrote: On Mon, 6 Jul 2009, Gary Mills wrote: As for a business case, we just had an extended and catastrophic performance degradation that was the result of two ZFS bugs. If we have another one like that, our director is likely to instruct us to throw away all our Solaris toys and convert to Microsoft products. If you change platform every time you get two bugs in a product, you must cycle platforms on a pretty regular basis! You often find the change is towards Windows. That very rarely has the same rules applied, so things then stick there. There's a more general principle in operation here. Organizations do sometimes change platforms for peculiar reasons, but once they do that they're not going to do it again for a long time. That's why they disregard problems with the new platform. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 07:18:45PM +0100, Phil Harman wrote: Gary Mills wrote: On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I understand that Solaris has an excellent implementation of mmap(2). ZFS has many advantages, snapshots for example, for mailbox storage. Is there anything that we can be do to optimize the two caches in this environment? Will mmap(2) one day play nicely with ZFS? [..] Software engineering is always about prioritising resource. Nothing prioritises performance tuning attention quite like compelling competitive data. When Bart Smaalders and I wrote libMicro we generated a lot of very compelling data. I also coined the phrase If Linux is faster, it's a Solaris bug. You will find quite a few (mostly fixed) bugs with the synopsis linux is faster than solaris at So, if mmap(2) playing nicely with ZFS is important to you, probably the best thing you can do to help that along is to provide data that will help build the business case for spending engineering resource on the issue. First of all, how significant is the double caching in terms of performance? If the effect is small, I won't worry about it anymore. What sort of data do you need? Would a list of software products that utilize mmap(2) extensively and could benefit from ZFS be suitable? As for a business case, we just had an extended and catastrophic performance degradation that was the result of two ZFS bugs. If we have another one like that, our director is likely to instruct us to throw away all our Solaris toys and convert to Microsoft products. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Sat, Jul 04, 2009 at 08:48:33AM +0100, Phil Harman wrote: ZFS doesn't mix well with mmap(2). This is because ZFS uses the ARC instead of the Solaris page cache. But mmap() uses the latter. So if anyone maps a file, ZFS has to keep the two caches in sync. That's the first I've heard of this issue. Our e-mail server runs Cyrus IMAP with mailboxes on ZFS filesystems. Cyrus uses mmap(2) extensively. I understand that Solaris has an excellent implementation of mmap(2). ZFS has many advantages, snapshots for example, for mailbox storage. Is there anything that we can do to optimize the two caches in this environment? Will mmap(2) one day play nicely with ZFS? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
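One knob worth an experiment on releases that have the cache-control properties (they appeared in later ZFS versions, so check first); a sketch, with the pool/filesystem name space/mail being hypothetical:

    # Keep only metadata in the ARC for the mailbox filesystem, so
    # mmap'ed file data lives once in the page cache, not twice.
    zfs set primarycache=metadata space/mail
    zfs get primarycache space/mail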
Re: [zfs-discuss] Lots of metadata overhead on filesystems with 100M files
On Thu, Jun 18, 2009 at 12:12:16PM +0200, Cor Beumer - Storage Solution Architect wrote: What they noticed on the X4500 systems was that when the zpool became filled up to about 50-60%, the performance of the system did drop enormously. They do claim this has to do with the fragmentation of the ZFS filesystem. So we did try over there putting an S7410 system in with about the same config on disks, 44x 1TB SATA BUT 4x 18GB WriteZilla (in a stripe); we were able to get much, much more I/O from the system than the comparable X4500. However, they did put it in production for a couple of weeks, and as soon as the ZFS filesystem did come in the range of about 50-60% filling, they did see the same problem. We had a similar problem with a T2000 and 2 TB of ZFS storage. Once the usage reached 1 TB, the write performance dropped considerably and the CPU consumption increased. Our problem was indirectly a result of fragmentation, but it was solved by a ZFS patch. I understand that this patch, which fixes a whole bunch of ZFS bugs, should be released soon. I wonder if this was your problem. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Mon, Apr 27, 2009 at 04:47:27PM -0500, Gary Mills wrote: On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote: We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. The cause turned out to be this ZFS bug: 6596237: Stop looking and start ganging Apparently, the ZFS code was searching the free list looking for the perfect fit for each write. With a fragmented pool, this search took a very long time, delaying the write. Eventually, the requests arrived faster than writes could be sent to the devices, causing the server to be unresponsive. We also had another problem, due to this ZFS bug: 6591646: Hang while trying to enter a txg while holding a txg open This was a deadlock, with one thread blocking hundreds of other threads. Our symptom was that all zpool I/O would stop and the `ps' command would hang. A reboot was the only way out. If you have a support contract, Sun will supply an IDR that fixes both problems. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What causes slow performance under load?
On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote: We have an IMAP server with ZFS for mailbox storage that has recently become extremely slow on most weekday mornings and afternoons. When one of these incidents happens, the number of processes increases, the load average increases, but ZFS I/O bandwidth decreases. Users notice very slow response to IMAP requests. On the server, even `ps' becomes slow. The cause turned out to be this ZFS bug: 6596237: Stop looking and start ganging Apparently, the ZFS code was searching the free list looking for the perfect fit for each write. With a fragmented pool, this search took a very long time, delaying the write. Eventually, the requests arrived faster than writes could be sent to the devices, causing the server to be unresponsive. There isn't a patch for this one yet, but Sun will supply an IDR if you open a support case. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Peculiarities of COW over COW?
On Sun, Apr 26, 2009 at 05:19:18PM -0400, Ellis, Mike wrote: As soon as you put those zfs blocks ontop of iscsi, the netapp won't have a clue as far as how to defrag those iscsi files from the filer's perspective. (It might do some fancy stuff based on read/write patterns, but that's unlikely) Since the LUN is just a large file on the Netapp, I assume that all it can do is to put the blocks back into sequential order. That might have some benefit overall. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Peculiarities of COW over COW?
On Sun, Apr 26, 2009 at 05:02:38PM -0500, Tim wrote: On Sun, Apr 26, 2009 at 3:52 PM, Gary Mills [1]mi...@cc.umanitoba.ca wrote: We run our IMAP spool on ZFS that's derived from LUNs on a Netapp filer. There's a great deal of churn in e-mail folders, with messages appearing and being deleted frequently. Should ZFS and the Netapp be using the same blocksize, so that they cooperate to some extent? Just make sure ZFS is using a block size that is a multiple of 4k, which I believe it does by default. Okay, that's good. I have to ask though... why not just serve NFS off the filer to the Solaris box? ZFS on a LUN served off a filer seems to make about as much sense as sticking a ZFS-based LUN behind a v-filer (although the latter might actually make sense in a world where it were supported *cough*neverhappen*cough* since you could buy the cheap newegg disk). I prefer NFS too, but the IMAP server requires POSIX semantics. I believe that NFS doesn't support that, at least NFS version 3. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What is the 32 GB 2.5-Inch SATA Solid State Drive?
On Fri, Apr 24, 2009 at 09:08:52PM -0700, Richard Elling wrote: Gary Mills wrote: Does anyone know about this device? SESX3Y11Z 32 GB 2.5-Inch SATA Solid State Drive with Marlin Bracket for Sun SPARC Enterprise T5120, T5220, T5140 and T5240 Servers, RoHS-6 Compliant This is from Sun's catalog for the T5120 server. Would this work well as a separate ZIL device for ZFS? Is there any way I could use this in a T2000 server? The brackets appear to be different. The brackets are different. T2000 uses nemo bracket and T5120 uses marlin. For the part-number details, SunSolve is your friend. http://sunsolve.sun.com/handbook_pub/validateUser.do?target=Systems/SE_T5120/components http://sunsolve.sun.com/handbook_pub/validateUser.do?target=Systems/SunFireT2000_R/components I see also that no SSD is listed for the T2000. Has anyone gotten one to work as a separate ZIL device for ZFS? -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] What is the 32 GB 2.5-Inch SATA Solid State Drive?
Does anyone know about this device? SESX3Y11Z 32 GB 2.5-Inch SATA Solid State Drive with Marlin Bracket for Sun SPARC Enterprise T5120, T5220, T5140 and T5240 Servers, RoHS-6 Compliant This is from Sun's catalog for the T5120 server. Would this work well as a separate ZIL device for ZFS? Is there any way I could use this in a T2000 server? The brackets appear to be different. -- -Gary Mills--Unix Support--U of M Academic Computing and Networking- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss