Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.

Hmmm, I thought that in that case ZFS reverts to the regular on-disk ZIL?

With kind regards, Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> The write cache is _not_ being disabled. The write cache is being marked as non-volatile.

Of course you're right :) Please filter my postings with a sed 's/write cache/write cache flush/g' ;)

> BTW, why is a Sun/Oracle branded product not properly respecting the NV bit in the cache flush command? This seems remarkably broken, and leads to the amazingly bad advice given on the wiki referenced above.

I suspect it has something to do with emulating disk semantics over PCIe. Anyway, this did get us stumped in the beginning; performance wasn't better than when using an OCZ Vertex Turbo ;) By the way, the URL referenced above is part of the official F20 product documentation (that's how we found it in the first place)...

With kind regards, Jeroen
Re: [zfs-discuss] How to destroy iscsi dataset?
Hi,

even if you didn't say so explicitly below (both the Comstar and the legacy target services are inactive), I assume that you have been using Comstar, right? In that case, the questions are:

- is there still a view on the targets? (check stmfadm)
- is there still a LU mapped? (check sbdadm)

cheers, Tonmaus
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA, data corruption. But not pool corruption, and not filesystem corruption.

Yeah, that's a big difference! :) Of course we could not live with pool or fs corruption. However, we can live with the fact that NFS-written data is not all on disk in case of a server crash, even though the NFS client could rely on the write being guaranteed by the NFS protocol. I.e. we do not use it for db transactions or anything like that.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Adam,

> Very interesting data. Your test is inherently single-threaded so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than single-threaded latency.

Thanks for your reply. I'll probably test the multiple-writer case, too. But frankly, at the moment I care most about the single-threaded case, because if we put e.g. user homes on this server I think the users would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without an SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the performance down to acceptable runtimes. But IMHO 2m42s is still too slow, and disabling the ZIL seems to be the only option. Knowing that 100s of users could do this in parallel with good performance is nice, but it does not improve the situation for the single user who only cares about his own tar run. If there's anything else we can do/try to improve the single-threaded case, I'm all ears.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss k.we...@science-computing.de wrote:

> Hi Adam,
>
>> Very interesting data. Your test is inherently single-threaded so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than single-threaded latency.
>
> Thanks for your reply. I'll probably test the multiple-writer case, too. But frankly, at the moment I care most about the single-threaded case, because if we put e.g. user homes on this server I think the users would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without an SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the performance down to acceptable runtimes. But IMHO 2m42s is still too slow, and disabling the ZIL seems to be the only option. Knowing that 100s of users could do this in parallel with good performance is nice, but it does not improve the situation for the single user who only cares about his own tar run. If there's anything else we can do/try to improve the single-threaded case, I'm all ears.

Use something other than Open/Solaris with ZFS as an NFS server? :)

I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. You'd be better off getting NetApp.

-- Brent Jones br...@servuhome.net
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Brent Jones wrote:

> I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. You'd be better off getting NetApp.

A few days ago I posted to nfs-discuss with a proposal to add some mount/share options to change the semantics of an nfs-mounted filesystem so that they parallel those of a local filesystem. The main point is that data gets flushed to stable storage only if the client explicitly requests so via fsync or O_DSYNC, not implicitly with every close(). That would give you the performance you are seeking without sacrificing data integrity for applications that need it. I get the impression that I'm not the only one who could be interested in that ;)

-Arne
[zfs-discuss] Need advice on handling 192 TB of Storage on hardware raid storage
Dear all,

I have a hardware-based array storage with a capacity of 192 TB, sliced into 64 LUNs of 3 TB each. What will be the best way to configure ZFS on this? We are not requiring the self-healing capability of ZFS; we just want the ability to handle big filesystems, and performance. Currently we are running Solaris 10 May 2009 (Update 7), and configure ZFS so that:

a. 1 hardware LUN (3 TB) becomes 1 zpool
b. 1 zpool becomes 1 ZFS filesystem
c. 1 ZFS filesystem becomes 1 mountpoint (obviously)

The problem we have is that when the customer runs I/O in parallel to the 64 filesystems, the kernel usage (%sys) shoots up very high, to the 90% region, and the IOPS level degrades. It can be seen also that during that time the storage's own front-end CPU does not change much, which suggests the bottleneck is not at the hardware storage level but somewhere inside the Solaris box.

Is there any experience of having a similar setup to the one I have? Or can anybody point me to information on the best way to deal with hardware storage of this size?

Please advise, and thanks in advance,
Dedhi
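As a sketch of the layout described above (pool and device names are illustrative, not from the original post), the configuration amounts to the following, repeated 64 times:

    # one pool per 3 TB hardware LUN, one filesystem per pool
    zpool create pool01 c2t0d0
    zfs set mountpoint=/data/pool01 pool01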
[zfs-discuss] bit-flipping in RAM...
Orvar's post over in opensol-discuss has me thinking: after reading the paper and looking at design docs, I'm wondering if there is some facility to allow for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)?

I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and do a validation every time that block is used or written from the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right?

(The researchers used non-ECC RAM, so honestly, I think it's a bit unrealistic to expect that your car will win the Indy 500 if you put a Yugo engine in it.) Normally, this problem is exactly what you have hardware ECC and memory scrubbing for at the hardware level. I'm not saying that ZFS should consider doing this - doing a validation for in-memory data is non-trivially expensive in performance terms, and there's only so much you can do and still expect your machine to survive. I mean, I've used the old NonStop stuff, and yes, you can shoot them with a .45 and it likely will still run, but whacking them with a bazooka is still guaranteed to make them, well, Non-NonStop.

-Erik

-------- Original Message --------
Subject: Re: [osol-discuss] Any news about 2010.3?
Date: Wed, 31 Mar 2010 01:06:45 PDT
From: Orvar Korvar knatte_fnatte_tja...@yahoo.com
To: opensolaris-disc...@opensolaris.org

If you value your data, you should reconsider. But if your data is not important, then skip ZFS.

File system data corruption test by researcher: http://blogs.zdnet.com/storage/?p=169
ZFS data corruption test by researchers: http://www.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf

-- Erik Trimble, Java System Support
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until

Since snv_125 you can remove log devices. See http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

I've used this all the time during my testing and was able to remove both mirrored and unmirrored log devices without any problems (and without a reboot). I'm using snv_134.
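For example, the round trip looks like this (pool and device names are illustrative):

    # add an unmirrored log device to a pool ...
    zpool add tank log c4t0d0
    # ... and remove it again -- supported since snv_125
    zpool remove tank c4t0d0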
Re: [zfs-discuss] bit-flipping in RAM...
> I'm not saying that ZFS should consider doing this - doing a validation for in-memory data is non-trivially expensive in performance terms, and there's only so much you can do and still expect your machine to survive. I mean, I've used the old NonStop stuff, and yes, you can shoot them with a .45 and it likely will still run, but whacking them with a bazooka is still guaranteed to make them, well, Non-NonStop.

If we scrub the memory anyway, why not include a check of the ZFS checksums which are already in memory? OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the limitations are when you don't use ECC; the industry must start to require that all chipsets support ECC.

Casper
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
>
> Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

Well, for lots of environments disabling the ZIL is perfectly acceptable. And frankly, the reason you get better performance out of the box on Linux as an NFS server is that it actually behaves as if the ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here, or any other OS which behaves in the same manner. Actually it makes things better: even with the ZIL disabled, a ZFS filesystem is always consistent on disk, and you still get all the other benefits of ZFS.

What would be useful, though, is to be able to easily disable the ZIL per dataset instead of via an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later.

> You'd be better off getting NetApp

Well, spend some extra money on a really fast NVRAM solution for the ZIL and you will get a much faster ZFS environment than NetApp, and still you will spend much less money. Not to mention all the extra flexibility compared to NetApp.

-- Robert Milkowski http://milek.blogspot.com
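For reference, the OS-wide switch mentioned above is the zil_disable tunable; a sketch, assuming a build that still exposes it (the per-dataset property came later):

    # /etc/system -- disables the ZIL for ALL datasets on the host; needs a reboot
    set zfs:zil_disable = 1

    # or on a live system (takes effect for filesystems mounted after the change):
    echo zil_disable/W0t1 | mdb -kw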
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>>> Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed cpu, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)
>>
>> I believe that the above is wrong information as long as the devices involved do flush their caches when requested to. Zfs still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their cache. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost. If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.
>
> I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA, data corruption. But not pool corruption, and not filesystem corruption.

Which is the expected behavior when you break NFS requirements, as Linux does out of the box. Disabling the ZIL on an NFS server makes it no worse than the standard Linux behaviour - you get decent performance at the cost of some data possibly getting corrupted from an NFS client's point of view. But there are environments where that is perfectly acceptable, where you are not running critical databases but rather user home directories; and since ZFS currently flushes a transaction group after at most 30 s, a user cannot lose more than the last 30 s of writes if the NFS server were to suddenly lose power.

To clarify - disabling the ZIL makes no difference at all to pool/filesystem-level consistency.

-- Robert Milkowski http://milek.blogspot.com
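As a side note, the transaction group interval referred to above can be inspected on a live system; a sketch, assuming a build that exposes the zfs_txg_timeout tunable:

    # current txg timeout in seconds (30 on the builds discussed here)
    echo zfs_txg_timeout/D | mdb -k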
Re: [zfs-discuss] bit-flipping in RAM...
casper@sun.com wrote:

>> I'm not saying that ZFS should consider doing this - doing a validation for in-memory data is non-trivially expensive in performance terms, and there's only so much you can do and still expect your machine to survive. I mean, I've used the old NonStop stuff, and yes, you can shoot them with a .45 and it likely will still run, but whacking them with a bazooka is still guaranteed to make them, well, Non-NonStop.
>
> If we scrub the memory anyway, why not include a check of the ZFS checksums which are already in memory? OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the limitations are when you don't use ECC; the industry must start to require that all chipsets support ECC.
>
> Casper

Reading the paper was interesting, as it highlighted all the places where ZFS skips validation. There are a lot of places. In many ways, fixing this would likely make ZFS similar to AppleTalk, whose notorious performance (relative to Ethernet) was caused by what many called the "Are You Sure?" design. Double- and triple-checking absolutely everything has its costs.

And, yes, we really should just force computer manufacturers to use ECC in more places (not just RAM) - as densities and data volumes increase, we are more likely to see errors, and without proper hardware checking we're really going out on a limb to be able to trust what the hardware says. And, let's face it - hardware error correction is /so/ much faster than doing it in software.

-- Erik Trimble, Java System Support
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>> standard ZIL:          7m40s  (ZFS default)
>> 1x SSD ZIL:            4m07s  (Flash Accelerator F20)
>> 2x SSD ZIL:            2m42s  (Flash Accelerator F20)
>> 2x SSD mirrored ZIL:   3m59s  (Flash Accelerator F20)
>> 3x SSD ZIL:            2m47s  (Flash Accelerator F20)
>> 4x SSD ZIL:            2m57s  (Flash Accelerator F20)
>> disabled ZIL:          0m15s  (local extraction 0m0.269s)
>>
>> I was not so much interested in the absolute numbers but rather in the relative performance differences between the standard ZIL, the SSD ZIL and the disabled ZIL cases.
>
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.

This is not true. If the ZIL device dies while the pool is imported, ZFS starts using a ZIL within the pool and continues to operate. On the other hand, if your server suddenly loses power, and when you power it up later ZFS detects that the ZIL is broken/gone, it will require a sysadmin intervention to force the pool import - and yes, you may possibly lose some data. But how is that different from any other solution where your log is put on a separate device? Well, it is actually different: with ZFS you can still guarantee that the pool is consistent on disk, while others generally can't, and you will often have to run fsck just to mount the fs read/write...

-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] zpool split problem?
Why do we still need the /etc/zfs/zpool.cache file??? (I could understand it when zpool import was slow.)

zpool import is now multi-threaded (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a lot faster, and each disk contains the hostname (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725), so if a pool contains the same hostname as the server, then import it. I.e., with a multi-threaded zpool import, this bug should not be a problem any more: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296

HA Storage should be changed to just do a "zpool import -h mypool" instead of using a private zpool.cache file (-h meaning "ignore that the pool was last imported by a different host"; and maybe a "noautoimport" property is needed on a zpool so clustering software can decide to import it by hand, as it does now). And then this zpool split problem would be fixed.
Re: [zfs-discuss] Simultaneous failure recovery
> I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.

It seems like it is event-driven. Hmmm... perhaps it shouldn't be. Anyway, you can do zpool replace, and it is the same thing - why wouldn't it be?

-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Simultaneous failure recovery
On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock eric.schr...@oracle.com wrote:

> On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:
>
>> I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.
>
> Hot spares are only activated in response to a fault received by the zfs-retire FMA agent. There is no notion that the spares should be re-evaluated when they become available at a later point in time. Certainly a reasonable RFE, but not something ZFS does today.

Definitely an RFE I would like.

> You can 'zpool attach' the spare like a normal device - that's all that the retire agent is doing under the hood.

So, given:

        NAME        STATE     READ WRITE CKSUM
        images      DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c2t0d0  FAULTED      4     0     0  too many errors
            c3t0d0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c5t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
        spares
          c5t7d0    AVAIL

then it would be this?

    zpool attach images c2t0d0 c5t7d0

which I had considered, but the man page for attach says "The existing device cannot be part of a raidz configuration." If I try it, it fails, saying:

    invalid vdev specification
    use '-f' to override the following errors:
    /dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images. Please see zpool(1M).

Thanks!

-- -Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Re: [zfs-discuss] Simultaneous failure recovery
> On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock eric.schr...@oracle.com wrote:
>
>> On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:
>>
>>> I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.
>>
>> Hot spares are only activated in response to a fault received by the zfs-retire FMA agent. There is no notion that the spares should be re-evaluated when they become available at a later point in time. Certainly a reasonable RFE, but not something ZFS does today.
>
> Definitely an RFE I would like.
>
>> You can 'zpool attach' the spare like a normal device - that's all that the retire agent is doing under the hood.
>
> So, given:
>
>         NAME        STATE     READ WRITE CKSUM
>         images      DEGRADED     0     0     0
>           raidz1    DEGRADED     0     0     0
>             c2t0d0  FAULTED      4     0     0  too many errors
>             c3t0d0  ONLINE       0     0     0
>             c4t0d0  ONLINE       0     0     0
>             c5t0d0  ONLINE       0     0     0
>             c0t1d0  ONLINE       0     0     0
>             c1t1d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>             c3t1d0  ONLINE       0     0     0
>             c4t1d0  ONLINE       0     0     0
>         spares
>           c5t7d0    AVAIL
>
> then it would be this?
>
>     zpool attach images c2t0d0 c5t7d0
>
> which I had considered, but the man page for attach says "The existing device cannot be part of a raidz configuration." If I try it, it fails, saying:
>
>     invalid vdev specification
>     use '-f' to override the following errors:
>     /dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images. Please see zpool(1M).
>
> Thanks!

You need to use zpool replace. Once you fix the failed drive and it re-synchronizes, the hot spare will detach automatically (regardless of whether you forced it to kick in via zpool replace or it did so due to FMA). For more details see http://blogs.sun.com/eschrock/entry/zfs_hot_spares

-- Robert Milkowski http://milek.blogspot.com
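For the configuration shown above, that would be (a sketch using the names from the status output):

    # kick the available spare in for the faulted disk
    zpool replace images c2t0d0 c5t7d0
    # once the failed c2t0d0 is fixed and has re-synchronized,
    # the spare detaches automatically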
Re: [zfs-discuss] Simultaneous failure recovery
On 03/31/10 10:54 PM, Peter Tribble wrote:

> On Tue, Mar 30, 2010 at 10:42 PM, Eric Schrock eric.schr...@oracle.com wrote:
>
>> On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:
>>
>>> I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.
>>
>> Hot spares are only activated in response to a fault received by the zfs-retire FMA agent. There is no notion that the spares should be re-evaluated when they become available at a later point in time. Certainly a reasonable RFE, but not something ZFS does today.
>
> Definitely an RFE I would like.
>
>> You can 'zpool attach' the spare like a normal device - that's all that the retire agent is doing under the hood.
>
> So, given:
>
>         NAME        STATE     READ WRITE CKSUM
>         images      DEGRADED     0     0     0
>           raidz1    DEGRADED     0     0     0
>             c2t0d0  FAULTED      4     0     0  too many errors
>             c3t0d0  ONLINE       0     0     0
>             c4t0d0  ONLINE       0     0     0
>             c5t0d0  ONLINE       0     0     0
>             c0t1d0  ONLINE       0     0     0
>             c1t1d0  ONLINE       0     0     0
>             c2t1d0  ONLINE       0     0     0
>             c3t1d0  ONLINE       0     0     0
>             c4t1d0  ONLINE       0     0     0
>         spares
>           c5t7d0    AVAIL
>
> then it would be this?
>
>     zpool attach images c2t0d0 c5t7d0
>
> which I had considered, but the man page for attach says "The existing device cannot be part of a raidz configuration." If I try it, it fails, saying:
>
>     invalid vdev specification
>     use '-f' to override the following errors:
>     /dev/dsk/c5t7d0s0 is reserved as a hot spare for ZFS pool images. Please see zpool(1M).

What happens if you remove it as a spare first?

-- Ian.
Re: [zfs-discuss] bit-flipping in RAM...
ECC-enabled RAM should become very cheap quickly if the industry embraces it in every computer. :-)

best regards, hanzhu

On Wed, Mar 31, 2010 at 5:46 PM, Erik Trimble erik.trim...@oracle.com wrote:

> casper@sun.com wrote:
>
>>> I'm not saying that ZFS should consider doing this - doing a validation for in-memory data is non-trivially expensive in performance terms, and there's only so much you can do and still expect your machine to survive. I mean, I've used the old NonStop stuff, and yes, you can shoot them with a .45 and it likely will still run, but whacking them with a bazooka is still guaranteed to make them, well, Non-NonStop.
>>
>> If we scrub the memory anyway, why not include a check of the ZFS checksums which are already in memory? OTOH, zfs gets a lot of mileage out of cheap hardware and we know what the limitations are when you don't use ECC; the industry must start to require that all chipsets support ECC.
>>
>> Casper
>
> Reading the paper was interesting, as it highlighted all the places where ZFS skips validation. There are a lot of places. In many ways, fixing this would likely make ZFS similar to AppleTalk, whose notorious performance (relative to Ethernet) was caused by what many called the "Are You Sure?" design. Double- and triple-checking absolutely everything has its costs.
>
> And, yes, we really should just force computer manufacturers to use ECC in more places (not just RAM) - as densities and data volumes increase, we are more likely to see errors, and without proper hardware checking we're really going out on a limb to be able to trust what the hardware says. And, let's face it - hardware error correction is /so/ much faster than doing it in software.
>
> -- Erik Trimble, Java System Support
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Jeroen, Adam!

> ... link. Switched write caching off with the following addition to the /kernel/drv/sd.conf file (Karsten: if you didn't do this already, you _really_ want to :)

Okay, I bite! :) format's inquiry on the F20 FMod disks returns:

    # Vendor:  ATA
    # Product: MARVELL SD88SA02

So I put this in /kernel/drv/sd.conf and rebooted:

    # KAW, 2010-03-31
    # Set F20 FMod devices to non-volatile mode
    # See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
    sd-config-list = "ATA     MARVELL SD88SA02", "nvcache1";
    nvcache1 = 1, 0x40000, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1;

Now the tarball extraction test with an active ZIL finishes in ~0m32s! I've tested with a mirrored SSD log and with two separate SSD log devices; the runtime is nearly the same. Compared to the 2m42s before the /kernel/drv/sd.conf modification this is a huge improvement. The performance with an active ZIL would be acceptable now. But is this mode of operation *really* safe?

FWIW, zilstat during the test shows this:

    N-Bytes  N-Bytes/s  N-Max-Rate   B-Bytes  B-Bytes/s  B-Max-Rate   ops  <=4kB  4-32kB  >=32kB
          0          0           0         0          0           0     0      0       0       0
    1039072    1039072     1039072   3772416    3772416     3772416   610    299     311       0
    1522496    1522496     1522496   5402624    5402624     5402624   874    429     445       0
    2292952    2292952     2292952   6746112    6746112     6746112   931    215     716       0
    2321272    2321272     2321272   6774784    6774784     6774784   931    208     723       0
    2303472    2303472     2303472   6549504    6549504     6549504   897    195     702       0
        632        632         632   6733824    6733824     6733824   935    226     709       0
    2198328    2198328     2198328   6668288    6668288     6668288   926    224     702       0
        217        217         217   6373376    6373376     6373376   878    200     678       0
    2185416    2185416     2185416   6352896    6352896     6352896   874    197     677       0
    2218040    2218040     2218040   6516736    6516736     6516736   897    203     694       0
    2436984    2436984     2436984   6549504    6549504     6549504   885    171     714       0

I.e. ~900 ops/s.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL. You'd be better off getting NetApp

Hah hah. I have a Sun X4275 server exporting NFS. We have clients on all 4 of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or the filesystem. I suggest you either enable the WriteBack cache on your HBA, or add SSDs for the ZIL. Performance is 5-10x higher this way than using naked disks. But of course, not as high as it is with a disabled ZIL.
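The SSD variant of that suggestion would look like this (a sketch; pool and device names are illustrative):

    # add a mirrored pair of SSDs as the pool's log device
    zpool add tank log mirror c3t0d0 c3t1d0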
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Karsten,

> But is this mode of operation *really* safe?

As far as I can tell it is.

- The F20 uses some form of power backup that should provide power to the interface card long enough to get the cache onto solid state in case of a power failure.
- Recollecting from earlier threads here: in case the card fails (but not the host), there should be enough data residing in memory for ZFS to safely switch to the regular on-disk ZIL.
- According to my contacts at Sun, the F20 is a viable replacement solution for the X25-E.
- Switching write caching off seems to be officially recommended on the Sun performance wiki (translated to more sane defaults).

If I'm wrong here I'd like to know too, 'cause this is probably the way we're taking it into production. :)

With kind regards, Jeroen
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
>> Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until
>
> Since snv_125 you can remove log devices. See http://bugs.opensolaris.org/view_bug.do?bug_id=6574286 I've used this all the time during my testing and was able to remove both mirrored and unmirrored log devices without any problems (and without a reboot). I'm using snv_134.

I'm aware. OpenSolaris can remove log devices. Solaris cannot. Yet. But if you want your server in production, you can get a support contract for Solaris; for OpenSolaris you cannot.
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Hi Richard,

> For this case, what is the average latency to the F20?

I'm not giving the average since I only performed a single run here (still need to get autopilot set up :)). However, here is a graph of iostat IOPS/svc_t sampled in 10-second intervals during a run of untarring an eclipse tarball 40 times from two hosts. I'm using one vmod here.

http://www.science.uva.nl/~jeroen/zil_1slog_e1000_iostat_iops_svc_t_10sec_interval.pdf

Maximum svc_t is around 2.7 ms averaged over 10 s. Still wondering why this won't scale out, though. We don't seem to be CPU-bound, unless ZFS limits itself to max 30% cputime?

With kind regards, Jeroen
Re: [zfs-discuss] Simultaneous failure recovery
On Mar 30, 2010, at 5:39 PM, Peter Tribble wrote:

> I have a pool (on an X4540 running S10U8) in which a disk failed, and the hot spare kicked in. That's perfect. I'm happy. Then a second disk fails. Now, I've replaced the first failed disk, and it's resilvered and I have my hot spare back. But: why hasn't it used the spare to cover the other failed drive? And can I hotspare it manually? I could do a straight replace, but that isn't quite the same thing.

Hot spares are only activated in response to a fault received by the zfs-retire FMA agent. There is no notion that the spares should be re-evaluated when they become available at a later point in time. Certainly a reasonable RFE, but not something ZFS does today.

You can 'zpool attach' the spare like a normal device - that's all that the retire agent is doing under the hood.

Hope that helps,
- Eric

-- Eric Schrock, Fishworks  http://blogs.sun.com/eschrock
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 03/30/10 20:00, Bob Friesenhahn wrote:

> On Tue, 30 Mar 2010, Edward Ned Harvey wrote:
>
>>> But the speedup of disabling the ZIL altogether is appealing (and would probably be acceptable in this environment).
>>
>> Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed cpu, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)
>
> I believe that the above is wrong information as long as the devices involved do flush their caches when requested to. Zfs still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their cache. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost. If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.
>
> Bob

Yes, Bob is correct - that is exactly how it works.

Neil.
Re: [zfs-discuss] bit-flipping in RAM...
On 31/03/2010 10:27, Erik Trimble wrote:

> Orvar's post over in opensol-discuss has me thinking: after reading the paper and looking at design docs, I'm wondering if there is some facility to allow for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)? I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and do a validation every time that block is used or written from the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right?

A subset of this is already done. The ARC keeps its own in-memory checksum (because some buffers in the ARC are not yet on stable storage, so they don't have a block pointer checksum yet). See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

    arc_buf_freeze()
    arc_buf_thaw()
    arc_cksum_verify()
    arc_cksum_compute()

It isn't done on every access, but it can detect in-memory corruption - I've seen it happen on several occasions, though all due to errors in my code, not bad physical memory. Doing it more frequently could cause a significant performance problem.

-- Darren J Moffat
[zfs-discuss] can't destroy snapshot
We're getting the notorious "cannot destroy ... dataset already exists". I've seen a number of reports of this, but none of the reports seem to get any response. Fortunately this is a backup system, so I can recreate the pool, but it's going to take me several days to get all the data back. Is there any known workaround?
Re: [zfs-discuss] can't destroy snapshot
Incidentally, this is on Solaris 10, but I've seen identical reports from Opensolaris.
Re: [zfs-discuss] can't destroy snapshot
On 31-3-2010 14:52, Charles Hedrick wrote:

> Incidentally, this is on Solaris 10, but I've seen identical reports from Opensolaris.

Probably you need to delete any existing view over the LUN you want to destroy. Example:

    # stmfadm list-lu
    LU Name: 600144F0B67340004BB31F060001

    # stmfadm list-view -l 600144F0B67340004BB323FF0003
    View Entry: 0
        Host group   : TEST
        Target group : All
        LUN          : 1

    # stmfadm remove-view -l 600144F0B67340004BB323FF0003

After this, I think you can zfs destroy the volume.

Bruno
[zfs-discuss] Cannot replace a failed device
I had a drive fail and replaced it with a new drive. During the resilvering process it showed "too many errors" and the process failed. Now the pool is online, but it does not accept any zfs commands that change the pool's state: I can list files and directories, but mv, cp, and rm -f all fail. What can I do? I need those data files.

    r...@opensolaris2:~# cat /etc/release
        OpenSolaris 2008.11 snv_101b_rc2 X86
        Copyright 2008 Sun Microsystems, Inc. All Rights Reserved.
        Use is subject to license terms.
        Assembled 19 November 2008

    r...@opensolaris2:~# zpool list
    NAME     SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
    bfpool  4.98T  3.80T  1.18T   76%  ONLINE  -
    rpool     74G  5.76G  68.2G    7%  ONLINE  -

    r...@opensolaris2:~# zpool status -v bfpool
      pool: bfpool
     state: ONLINE
    status: One or more devices are faulted in response to IO failures.
    action: Make sure the affected devices are connected, then run 'zpool clear'.
       see: http://www.sun.com/msg/ZFS-8000-HC
     scrub: resilver in progress for 0h3m, 0.00% done, 1610h45m to go
    config:

            NAME            STATE     READ WRITE CKSUM
            bfpool          ONLINE       60     0     0
              c7d0p0        ONLINE        0     0     0
              c6d0p0        ONLINE        0     0     0
              replacing     ONLINE      125 7.19K     0
                c5d1p0/old  UNAVAIL       0     7     0  corrupted data
                c5d1p0      UNAVAIL       0 7.43K     0  corrupted data
              c4d1p0        ONLINE        0     0     0
              c4d0p0        ONLINE        0     0     0

    errors: Permanent errors have been detected in the following files:
            /bfpool/storm/TDExplorerPlugIn.exe

    r...@opensolaris2:~# zpool history
    History for 'bfpool':
    2009-01-05.15:55:59 zpool create bfpool c7d0p0 c6d0p0 c5d1p0 c4d1p0 c4d0p0
    2009-01-05.15:56:32 zfs create bfpool/bofang
    2009-01-05.15:56:59 zfs set compression=on bfpool/bofang
    2009-01-06.10:24:37 zfs destroy bfpool/bofang
    2009-01-06.10:25:41 zfs create bfpool/storm
    2009-01-06.10:25:49 zfs create bfpool/temp
    2009-01-06.10:27:16 zfs set compression=on bfpool/storm
    2009-01-06.10:27:22 zfs set compression=on bfpool/temp
    2009-06-26.15:06:06 zfs create bfpool/vc
    2009-06-26.15:06:21 zfs create bfpool/hdmedia
    2009-06-26.15:06:30 zfs create bfpool/library
    2009-06-26.15:06:39 zfs create bfpool/codec
    2009-06-26.15:06:46 zfs create bfpool/tools
    2009-06-26.15:06:54 zfs create bfpool/software
    2009-06-26.15:07:02 zfs create bfpool/opensource
    2009-06-26.15:07:11 zfs create bfpool/bbs
    2009-06-26.15:07:18 zfs create bfpool/user
    2009-06-26.15:07:27 zfs set compression=on bfpool/vc
    2009-06-26.15:07:35 zfs set compression=on bfpool/hdmedia
    2009-06-26.15:07:43 zfs set compression=on bfpool/library
    2009-06-26.15:07:52 zfs set compression=on bfpool/codec
    2009-06-26.15:08:01 zfs set compression=on bfpool/tools
    2009-06-26.15:08:10 zfs set compression=on bfpool/software
    2009-06-26.15:08:18 zfs set compression=on bfpool/opensource
    2009-06-26.15:08:26 zfs set compression=on bfpool/bbs
    2009-06-26.15:08:33 zfs set compression=on bfpool/user
    2010-03-02.15:11:33 zpool replace bfpool c5d1p0

    History for 'rpool':
    2009-01-05.14:02:50 zpool create -f rpool c5d0s0
    2009-01-05.14:02:50 zfs set org.opensolaris.caiman:install=busy rpool
    2009-01-05.14:02:50 zfs create -b 4096 -V 1023m rpool/swap
    2009-01-05.14:02:50 zfs create -b 131072 -V 1023m rpool/dump
    2009-01-05.14:03:19 zfs set mountpoint=/a/export rpool/export
    2009-01-05.14:03:19 zfs set mountpoint=/a/export/home rpool/export/home
    2009-01-05.14:03:19 zfs set mountpoint=/a/export/home/mike rpool/export/home/mike
    2009-01-05.14:17:32 zpool set bootfs=rpool/ROOT/opensolaris rpool
    2009-01-05.14:18:54 zfs set org.opensolaris.caiman:install=ready rpool
    2009-01-05.14:18:55 zfs set mountpoint=/export/home/mike rpool/export/home/mike
    2009-01-05.14:18:55 zfs set mountpoint=/export/home rpool/export/home
    2009-01-05.14:18:55 zfs set mountpoint=/export rpool/export

    r...@opensolaris2:~# zpool get all bfpool
    NAME    PROPERTY       VALUE                SOURCE
    bfpool  size           4.98T                -
    bfpool  used           3.80T                -
    bfpool  available      1.18T                -
    bfpool  capacity       76%                  -
    bfpool  altroot        -                    default
    bfpool  health         ONLINE               -
    bfpool  guid           8117798173515948167  -
    bfpool  version        13                   default
    bfpool  bootfs         -                    default
    bfpool  delegation     on                   default
    bfpool  autoreplace    off                  default
    bfpool  cachefile      -                    default
    bfpool  failmode       wait                 default
    bfpool  listsnapshots  off                  default

    r...@opensolaris2:~# prtdiag -v
    System Configuration: MICRO-STAR INTERNATIONAL CO.,LTD MS-7519
    BIOS Configuration: American Megatrends Inc. V1.6 09/17/2008

    ==== Processor Sockets ====================================
    Version                                       Location Tag
    --------------------------------------------- ------------
    Intel(R) Pentium(R) Dual CPU E2160 @ 1.80GHz  CPU 1
Re: [zfs-discuss] bit-flipping in RAM...
> On 31/03/2010 10:27, Erik Trimble wrote:
>
>> Orvar's post over in opensol-discuss has me thinking: after reading the paper and looking at design docs, I'm wondering if there is some facility to allow for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)? I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and do a validation every time that block is used or written from the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right?
>
> A subset of this is already done. The ARC keeps its own in-memory checksum (because some buffers in the ARC are not yet on stable storage, so they don't have a block pointer checksum yet). See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c - arc_buf_freeze(), arc_buf_thaw(), arc_cksum_verify(), arc_cksum_compute(). It isn't done on every access, but it can detect in-memory corruption - I've seen it happen on several occasions, though all due to errors in my code, not bad physical memory. Doing it more frequently could cause a significant performance problem.

Or there might be an extra zpool-level (or system-wide) property to enable checking checksums on every access from the ARC - there would be a significant performance impact, but it might be acceptable for really paranoid folks, especially on modern hardware.

-- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 6:31 AM, Edward Ned Harvey solar...@nedharvey.com wrote:

>>> Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until
>>
>> Since snv_125 you can remove log devices. See http://bugs.opensolaris.org/view_bug.do?bug_id=6574286 I've used this all the time during my testing and was able to remove both mirrored and unmirrored log devices without any problems (and without a reboot). I'm using snv_134.
>
> I'm aware. OpenSolaris can remove log devices. Solaris cannot. Yet. But if you want your server in production, you can get a support contract for Solaris; for OpenSolaris you cannot.

According to who? http://www.opensolaris.com/learn/features/availability/

"Full production level support: Both Standard and Premium support offerings are available for deployment of Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following configurations:"

--Tim
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 31 Mar 2010, Tim Cook wrote:

> http://www.opensolaris.com/learn/features/availability/
>
> "Full production level support: Both Standard and Premium support offerings are available for deployment of Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following configurations:"

This formal OpenSolaris release is too ancient to do him any good. In fact, zfs-wise, it lags the Solaris 10 releases. If there is ever another OpenSolaris formal release, then the situation will be different.

Bob

-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 31 Mar 2010, Karsten Weiss wrote:

> But frankly, at the moment I care most about the single-threaded case, because if we put e.g. user homes on this server I think the users would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without an SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the performance down to acceptable runtimes. But IMHO 2m42s is still too slow, and disabling the ZIL seems to be the only option.

Is extracting 50 MB tarballs something that your users do quite a lot of? Would your users be concerned if there was a possibility that, after extracting a 50 MB tarball, files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

The Sun Flash Accelerator F20 was not strictly designed as a zfs log device. It was originally designed to be a database accelerator. It was repurposed for zfs slog use because it works. It is a bit wimpy for bulk data. If you need fast support for bulk writes, perhaps you need something like STEC's very expensive ZEUS SSD drive.

Bob

-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Tue, March 30, 2010 22:40, Edward Ned Harvey wrote:

> Here's a snippet from man zpool. (Latest version available today in solaris)
>
>     zpool remove pool device ...
>         Removes the specified device from the pool. This command currently
>         only supports removing hot spares and cache devices. Devices that
>         are part of a mirrored configuration can be removed using the zpool
>         detach command. Non-redundant and raidz devices cannot be removed
>         from a pool.
>
> So you think it would be ok to shutdown, physically remove the log device, and then power back on again, and force import the pool? So although

A cache device is for the L2ARC; a log device is for the ZIL. Log devices are removable as of snv_125 (mentioned in another e-mail). If you want log removal in Solaris proper, and you have a support account, call up and ask that CR 6574286 be fixed:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 9:47 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote:

> This formal OpenSolaris release is too ancient to do him any good. In fact, zfs-wise, it lags the Solaris 10 releases. If there is ever another OpenSolaris formal release, then the situation will be different.

C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course, it'd be way more fun if someone from Sun threw caution to the wind and told us what the hold-up is *cough*oracle*cough*.

--Tim
Re: [zfs-discuss] bit-flipping in RAM...
On Wed, 31 Mar 2010, Robert Milkowski wrote:

> Or there might be an extra zpool-level (or system-wide) property to enable checking checksums on every access from the ARC - there would be a significant performance impact, but it might be acceptable for really paranoid folks, especially on modern hardware.

How would this checking take place for memory-mapped files?

Bob

-- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Allow me to clarify a little further, why I care about this so much. I have a solaris file server, with all the company jewels on it. I had a pair of Intel X25 SSD mirrored log devices. One of them failed. The replacement device came with a newer version of firmware on it. Now, instead of appearing as 29.802 GB, it appears as 29.801 GB. I cannot zpool attach. New device is too small. So apparently I'm the first guy this happened to. Oracle is caught totally off guard. They're pulling their inventory of X25s from dispatch warehouses, and inventorying all the firmware versions, and trying to figure it all out. Meanwhile, I'm still degraded. Or at least, I think I am.

This isn't the only problem that SnOracle has had with the X25s. We managed to reproduce a problem with the SSDs as ZIL on an x4250. An I/O error of some sort caused a retryable write error ... which brought throughput to 0, as if a PCI bus reset had occurred. Here's a sample of our output... you might want to check and see if you're getting similar errors.

    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@4/pci111d,8...@0/pci111d,8...@4/pci1000,3...@0 (mpt1):
    Jan 10 21:36:52 tips-fs1.tamu.edu     Log info 31126000 received for target 15.
    Jan 10 21:36:52 tips-fs1.tamu.edu     scsi_status=0, ioc_status=804b, scsi_state=c
    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] /p...@0,0/pci8086,2...@4/pci111d,8...@0/pci111d,8...@4/pci1000,3...@0 (mpt1):
    Jan 10 21:36:52 tips-fs1.tamu.edu     Log info 31126000 received for target 15.
    Jan 10 21:36:52 tips-fs1.tamu.edu     scsi_status=0, ioc_status=804b, scsi_state=c
    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,2...@4/pci111d,8...@0/pci111d,8...@4/pci1000,3...@0/s...@f,0 (sd28):
    Jan 10 21:36:52 tips-fs1.tamu.edu     Error for Command: write    Error Level: Retryable
    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Requested Block: 8448    Error Block: 8448
    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Vendor: ATA    Serial Number: CVEM902401BA
    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Sense Key: Unit Attention
    Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

We were lucky to catch the problem before we went live; there were an exceptionally large number of I/O errors. Sun has not gotten back to me with a resolution for this problem yet, but they were able to reproduce the issue.

-K

Karl Katzke
Systems Analyst II
TAMU / DRGS
Re: [zfs-discuss] zpool split problem?
On 03/31/10 03:50 AM, Damon Atkins wrote:

> Why do we still need the /etc/zfs/zpool.cache file??? (I could understand it when zpool import was slow.) zpool import is now multi-threaded (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a lot faster, and each disk contains the hostname (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725), so if a pool contains the same hostname as the server, then import it. I.e., with a multi-threaded zpool import, this bug should not be a problem any more: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296 HA Storage should be changed to just do a "zpool import -h mypool" instead of using a private zpool.cache file (-h being ignore if the pool was imported by a different host, and maybe a noautoimport property is needed on a zpool so clustering software can decide to import it by hand). And then this zpool split problem would be fixed.

The problem with splitting a root pool goes beyond the issue of the zpool.cache file. If you look at the comments for 6939334 (http://monaco.sfbay.sun.com/detail.jsf?cr=6939334), you will see other files whose content is not correct when a root pool is renamed or split.

I'm not questioning your logic about whether zpool.cache is still needed. I'm only pointing out that eliminating the zpool.cache file would not enable root pools to be split. More work is required for that.

Lori
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
> Would your users be concerned if there was a possibility that, after extracting a 50 MB tarball, files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

Correction: would your users be concerned if there was a possibility that after extracting a 50 MB tarball *and having a server crash*, files could be corrupted as described above? If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power down or reboot.

The advice I would give is: do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, with no risk of the aforementioned data corruption. (See the sketch below.)

Obviously, if you cannot accept 5-10 minutes of data loss, such as credit card transactions, this would not be acceptable. You'd need to keep your ZIL enabled. Also, if you have an svn server on the ZFS server, and you have svn clients on other systems, you should never allow your clients to advance beyond the current rev of the server. So again, you'd have to keep the ZIL enabled on the server.

It all depends on your workload. For some, the disabled ZIL is worth the risk.
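A minimal sketch of such a snapshot rotation, run from cron every 5 minutes (the dataset name and retention are illustrative; the OpenSolaris auto-snapshot service can do the same job):

    #!/bin/ksh
    # keep roughly 2 hours of 5-minute snapshots of one dataset
    FS=tank/home
    KEEP=24
    zfs snapshot "$FS@auto-$(date '+%Y%m%d-%H%M')"
    # list our auto snapshots oldest-first, destroy all but the newest $KEEP
    zfs list -H -t snapshot -o name -s creation |
        nawk -v fs="$FS" -v keep="$KEEP" '
            index($0, fs "@auto-") == 1 { snaps[n++] = $0 }
            END { for (i = 0; i < n - keep; i++) print snaps[i] }' |
        while read snap; do
            zfs destroy "$snap"
        done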
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 31 Mar 2010, Tim Cook wrote: If there is ever another OpenSolaris formal release, then the situation will be different. C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course it'd be way more fun if someone from Sun threw caution to the wind and told us what the hold-up is *cough*oracle*cough*. Oracle is a total cold boot for me. Everything they have put on their web site seems carefully designed to cast fear and panic into the former Sun customer base and cause substantial doubt, dismay, and even terror. I don't know what I can and can't trust. Every bit of trust that Sun earned with me over the past 19 years is clean-slated. Regardless, it seems likely that Oracle is taking time to change all of the copyrights, documentation, and logos to reflect the new ownership. They are probably re-evaluating which parts should be included for free in OpenSolaris. The name Sun is deeply embedded in Solaris. All of the Solaris 10 packages include SUN in their name. Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, March 31, 2010 12:23, Bob Friesenhahn wrote: Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. I feel a great disturbance in the force. It is as if a great multitude of developers screamed and then went out and downloaded GCC. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 11:23 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Wed, 31 Mar 2010, Tim Cook wrote: If there is ever another OpenSolaris formal release, then the situation will be different. C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course it'd be way more fun if someone from Sun threw caution to the wind and told us what the hold-up is *cough*oracle*cough*. Oracle is a total cold boot for me. Everything they have put on their web site seems carefully designed to cast fear and panic into the former Sun customer base and cause substantial doubt, dismay, and even terror. I don't know what I can and can't trust. Every bit of trust that Sun earned with me over the past 19 years is clean-slated. Regardless, it seems likely that Oracle is taking time to change all of the copyrights, documentation, and logos to reflect the new ownership. They are probably re-evaluating which parts should be included for free in OpenSolaris. The name Sun is deeply embedded in Solaris. All of the Solaris 10 packages include SUN in their name. Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ Where did you see that? It looks to be free to me: Sun Studio 12 Update 1 - FREE for SDN members. SDN members can download a free, full-license copy of Sun Studio 12 Update 1. --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote: Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. The download still seems to be a free, full-license copy for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10? Cheers, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
On Mar 31, 2010, at 2:50 AM, Damon Atkins wrote: Why do we still need the /etc/zfs/zpool.cache file??? (I could understand it was useful when zfs import was slow) Yes. Imagine the case where your server has access to hundreds of LUs. If you must probe each one, then booting can take a long time. If you go back in history you will find many cases where probing all LUs at boot was determined to be a bad thing. zpool import is now multi-threaded (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6844191), hence a lot faster; each disk contains the hostname (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725), so if a pool contains the same hostname as the server then import it. I.e. this bug should not be a problem any more http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6737296 with a multi-threaded zpool import. HA Storage should be changed to just do a zpool import -h mypool instead of using a private zpool.cache file (-h being ignored if the pool was imported by a different host; and maybe a noautoimport property is needed on a zpool so clustering software can decide to import it by hand as it was) And therefore this zpool split problem would be fixed. There is also a use case where the storage array makes a block-level copy of a LU. It would be a bad thing to discover that on a probe and attempt import. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
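As an aside, the cachefile pool property already lets clustering software keep a pool out of the default cache so it is not auto-imported at boot; a sketch with a hypothetical pool name:

# import without recording the pool in /etc/zfs/zpool.cache
zpool import -o cachefile=none hapool
# or record it in a cluster-private cache file instead
zpool set cachefile=/var/cluster/zpool.cache hapool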
Re: [zfs-discuss] zpool split problem?
On 03/31/10 12:21 PM, lori.alt wrote: The problem with splitting a root pool goes beyond the issue of the zpool.cache file. If you look at the comments for 6939334 http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other files whose content is not correct when a root pool is renamed or split. 6939334 seems to be inaccessible outside of Sun. Could you list the comments here? Thanks ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, Mar 31, 2010 at 11:39 AM, Chris Ridd chrisr...@mac.com wrote: On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote: Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. The download still seems to be a free, full-license copy for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10? Cheers, Chris It's still available in the opensolaris repo, and I see no license reference stating you have to have a support contract, so I'm guessing no... "Several releases of Sun Studio Software are available in the OpenSolaris repositories. The following list shows you how to download and install each release, and where you can find the documentation for the release: - Sun Studio 12 Update 1: The Sun Studio 12 Update 1 release is the latest full production release of Sun Studio software. It has recently been added to the OpenSolaris IPS repository. To install this release in your OpenSolaris 2009.06 environment using the Package Manager:" --Tim ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] VMware client solaris 10, RAW physical disk and zfs snapshots problem - all created snapshots are equal to zero.
I did those tests and here are the results:

r...@sl-node01:~# zfs list
NAME                           USED  AVAIL  REFER  MOUNTPOINT
mypool01                      91.9G   136G    23K  /mypool01
mypool01/storage01            91.9G   136G  91.7G  /mypool01/storage01
mypool01/storag...@30032010-1     0      -  91.9G  -
mypool01/storag...@30032010-2     0      -  91.9G  -
mypool01/storag...@30032010-3 2.15M      -  91.7G  -
mypool01/storag...@30032010-4   41K      -  91.7G  -
mypool01/storag...@30032010-5 1.17M      -  91.7G  -
mypool01/storag...@30032010-6     0      -  91.7G  -
mypool02                      91.9G   137G    24K  /mypool02
mypool02/copies                 23K   137G    23K  /mypool02/copies
mypool02/storage01            91.9G   137G  91.9G  /mypool02/storage01
mypool02/storag...@30032010-1     0      -  91.9G  -
mypool02/storag...@30032010-2     0      -  91.9G  -

As you can see I have differences for snapshots 4, 5 and 6, as you suggested to make a test. But I can also see changes on snapshot no. 3 - I complain about this snapshot because I could not see differences on it last night! Now it shows. Well, the first thing you should know is this: Suppose you take a snapshot, and create some files. Then the snapshot still occupies no disk space. Everything is in the current filesystem. The only time a snapshot occupies disk space is when the snapshot contains data that is missing from the current filesystem. That is - If you rm or overwrite some files in the current filesystem, then you will see the size of the snapshot growing. Make sense? That brings up a question though. If you did the commands as I wrote them, it would mean you created a 1G file, took a snapshot, and rm'd the file. Therefore your snapshot should contain at least 1G. I am confused by the fact that you only have 1-2M in your snapshot. Maybe I messed up the command I told you, or you messed up entering it on the system, and you only created a 1M file, instead of a 1G file? What is still strange: snapshots 1 and 2 are the oldest but they are still equal to zero! After changes and snapshots 3, 4, 5 and 6 I would expect that snapshots 1 and 2 are also recording changes on the storage01 file system, but no... could it be possible that snapshots 1 and 2 are somehow broken? If some file existed during all of the old snapshots, and you destroy your later snapshots, then the data occupied by the later snapshots will start to fall onto the older snapshots. Until you destroy the oldest snapshot that contained that data. At which time, the data is truly gone from all of the snapshots. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
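A quick way to see the accounting described above, with hypothetical dataset names; the snapshot's USED stays near zero until the live copy of the data goes away:

# mkfile 1g /mypool01/storage01/bigfile
# zfs snapshot mypool01/storage01@before
# zfs list mypool01/storage01@before     (USED is ~0: the filesystem still holds the data)
# rm /mypool01/storage01/bigfile
# zfs list mypool01/storage01@before     (USED is now ~1G: only the snapshot holds it)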
Re: [zfs-discuss] Need advice on handling 192 TB of Storage on hardware raid storage
On Mar 31, 2010, at 2:05 AM, Dedhi Sujatmiko wrote: Dear all, I have a hardware-based storage array with a capacity of 192TB, sliced into 64 LUNs of 3TB. What will be the best way to configure the ZFS on this? Of course we are not requiring the self-healing capability of ZFS. We just want the capability of handling big file systems and performance. Answers below based on the assumption that you value performance over space over dependability. Currently we are running using Solaris 10 May 2009 (Update 7), and configure the ZFS where : First, upgrade or patch to the latest Solaris 10 kernel/zfs bits. a. 1 hardware LUN (3TB) will become 1 zpool The RAID configuration of the LUs will be critical. ZFS can be easily configured to overrun most RAID arrays using modest server hardware. b. 1 zpool will become 1 ZFS file system c. 1 ZFS file system will become 1 mountpoint (obviously). I see no reason to do this. For best performance, put multiple LUs into the pool. The problem we have is that when the customer runs the I/O in parallel to the 64 file systems, the kernel usage (%sys) shot up very high to the 90% region and the IOPS level is degrading. It can be seen also that during that time the storage's own front end CPU does not change much, which means the bottleneck is not on the hardware storage level, but somewhere inside the Solaris box. The cause of the high system time should be investigated. I have seen huge amounts of I/O to RAID arrays consume relatively little system time. Is there any experience of having a similar setup to the one I have? Or can anybody point me to information on what will be the best way to deal with hardware storage of this size? In general, spread the I/O across all resources to get the best overall response time. Please advise, and thanks in advance HTH, -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
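A sketch of the suggested direction: fewer pools, each striped over several LUs, with ZFS filesystems rather than pools providing the 64 mountpoints (device names hypothetical):

# zpool create bigtank c2t0d0 c2t1d0 c2t2d0 c2t3d0
# zfs create bigtank/fs01
# zfs create bigtank/fs02     (and so on for the remaining mountpoints)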
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Wed, 31 Mar 2010, Chris Ridd wrote: Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. The download still seems to be a free, full-license copy for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10? There is no telling. Everything is subject to evaluation by Oracle and it is not clear which parts of the web site are confirmed and which parts are still subject to change. In the past it was free to join SDN but if one was to put an 'M' in front of that SDN, then there would be a substantial yearly charge for membership (up to $10,939 USD per year according to Wikipedia). This is a world that Oracle has been commonly exposed to in the past. Not everyone who uses a compiler qualifies as a developer. Bob -- Bob Friesenhahn bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31 Mar 2010, at 17:50, Bob Friesenhahn wrote: On Wed, 31 Mar 2010, Chris Ridd wrote: Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The Premium service plan costs $200 more. The download still seems to be a free, full-license copy for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10? There is no telling. Everything is subject to evaluation by Oracle and it is not clear which parts of the web site are confirmed and which parts are still subject to change. In the past it was free to join SDN but if one was to put an 'M' in front of that SDN, then there would be a substantial yearly charge for membership (up to $10,939 USD per year according to Wikipedia). This is a world that Oracle has been commonly exposed to in the past. Not everyone who uses a compiler qualifies as a developer. Indeed, but Microsoft still give out free express versions of their tools. If memory serves, you're not allowed to distribute binaries built with them but otherwise they're not broken in any significant way. Maybe this will also be the difference between Sun Studio and Sun Studio Express. Perhaps we should take this to tools-compilers. Cheers, Chris ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
On 03/31/10 10:42 AM, Frank Middleton wrote: On 03/31/10 12:21 PM, lori.alt wrote: The problem with splitting a root pool goes beyond the issue of the zpool.cache file. If you look at the comments for 6939334 http://monaco.sfbay.sun.com/detail.jsf?cr=6939334, you will see other files whose content is not correct when a root pool is renamed or split. 6939334 seems to be inaccessible outside of Sun. Could you list the comments here? Thanks Here they are:

Other issues:
* Swap is still pointing to rpool because /etc/vfstab is never updated.
* Likewise, dumpadm still has dump zvols configured with the original pool.
* The /{pool}/boot/menu.lst (on sparc), and /{pool}/boot/grub/menu.lst (on x86) still reference the original pool's bootfs. Note that the 'bootfs' property in the pool itself is actually correct, because we store the object number and not the name.

While each one of these issues is individually fixable, there's no way to prevent new issues coming up in the future, thus breaking zpool split. It might be more advisable to prevent splitting of root pools. *** (#2 of 3): 2010-03-30 18:48:54 GMT+00:00 mark.musa...@sun.com

Yes, these look like the kind of issues that flash archive install had to solve: all the tweaks that need to be made to a root file system to get it to adjust to living on different hardware. In addition to the ones listed above, there are all the device specific files in /etc/path_to_inst, /devices, and so on. This is not a trivial problem. Cloning root pools by the split mechanism is more of a project in its own right. Is zpool split good for anything related to root disks? I can't think of a use. If there is a need for a disaster recovery disk, it's probably best to just remove one of the mirrors (without doing a split operation) and stash it for later use. *** (#3 of 3): 2010-03-30 20:21:57 GMT+00:00 lori@sun.com

Lori ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] *SPAM* Re: zfs send/receive - actual performance
On 3/27/2010 3:14 AM, Svein Skogen wrote: On 26.03.2010 23:55, Ian Collins wrote: On 03/27/10 09:39 AM, Richard Elling wrote: On Mar 26, 2010, at 2:34 AM, Bruno Sousa wrote: Hi, The jumbo-frames in my case give me a boost of around 2 mb/s, so it's not that much. That is about right. IIRC, the theoretical max is about 4% improvement, for MTU of 8KB. Now i will play with link aggregation and see how it goes, and of course i'm counting that incremental replication will be slower...but since the amount of data would be much less probably it will still deliver a good performance. Probably won't help at all because of the brain dead way link aggregation has to work. See Ordering of frames at http://en.wikipedia.org/wiki/Link_Aggregation_Control_Protocol#Link_Aggregation_Control_Protocol Arse, thanks for reminding me Richard! A single stream will only use one path in a LAG. Doesn't (Open)Solaris have the option of setting the aggregate up as a FEC or in roundrobin mode? Solaris does offer what the Wiki describes as L4 or port number based hashing. I'm not sure what FEC is, but when I asked, round-robin isn't available as preserving packet ordering wouldn't be easy (possible?) that way. -Kyle //Svein ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
On 04/ 1/10 01:51 AM, Charles Hedrick wrote: We're getting the notorious cannot destroy ... dataset already exists. I've seen a number of reports of this, but none of the reports seem to get any response. Fortunately this is a backup system, so I can recreate the pool, but it's going to take me several days to get all the data back. Is there any known workaround? Exactly what commands are you running and what errors do you see? -- Ian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
rm == Robert Milkowski mi...@task.gda.pl writes: rm This is not true. If ZIL device would die *while pool is rm imported* then ZFS would start using a ZIL within a pool and rm continue to operate. what you do not say, is that a pool with dead zil cannot be 'import -f'd. So, for example, if your rpool and slog are on the same SSD, and it dies, you have just lost your whole pool. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
rm == Robert Milkowski mi...@task.gda.pl writes: rm the reason you get better performance out of the box on Linux rm as NFS server is that it actually behaves like with disabled rm ZIL careful. Solaris people have been slinging mud at linux for things unfsd did in spite of the fact knfsd has been around for a decade. and ``has options to behave like the ZIL is disabled (sync/async in /etc/exports)'' != ``always behaves like the ZIL is disabled''. If you are certain about Linux NFS servers not preserving data for hard mounts when the server reboots even with the 'sync' option which is the default, please confirm, but otherwise I do not believe you. rm Which is an expected behavior when you break NFS requirements rm as Linux does out of the box. wrong. The default is 'sync' in /etc/exports. The default has changed, but the default is 'sync', and the whole thing is well-documented. rm What would be useful though is to be able to easily disable rm ZIL per dataset instead of OS wide switch. yeah, Linux NFS servers have that granularity for their equivalent option. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
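The granularity being referred to is per export line on the Linux side; illustrative /etc/exports entries:

/export/home     *.example.com(rw,sync)     # default: commit before replying to the client
/export/scratch  *.example.com(rw,async)    # roughly comparable to running ZFS with the ZIL disabled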
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Karsten Weiss wrote: Knowing that 100s of users could do this in parallel with good performance is nice but it does not improve the situation for the single user which only cares for his own tar run. If there's anything else we can do/try to improve the single-threaded case I'm all ears. A MegaRAID card with write-back cache? It should also be cheaper than the F20. Wes Felter ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
# zfs destroy -r OIRT_BAK/backup_bad
cannot destroy 'OIRT_BAK/backup_...@annex-2010-03-23-07:04:04-bad': dataset already exists

No, there are no clones. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ2 configuration
Hi Ned, If you look at the examples on the page that you cite, they start with single-parity RAIDZ examples and then move to a double-parity RAIDZ example with supporting text, here: http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view Can you restate the problem with this page? Thanks, Cindy On 03/26/10 05:42, Edward Ned Harvey wrote: Just because most people are probably too lazy to click the link, I’ll paste a phrase from that sun.com webpage below: “Creating a single-parity RAID-Z pool is identical to creating a mirrored pool, except that the ‘raidz’ or ‘raidz1’ keyword is used instead of ‘mirror’.” And “zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0” So … Shame on you, Sun, for doing this to your poor unfortunate readers. It would be nice if the page were a wiki, or somehow able to have feedback submitted… From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bruno Sousa Sent: Thursday, March 25, 2010 3:28 PM To: Freddie Cash Cc: ZFS filesystem discussion list Subject: Re: [zfs-discuss] RAIDZ2 configuration Hmm... it might be completely wrong, but the idea of a raidz2 vdev with 3 disks came from the reading of http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view . This particular page has the following example:

# zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0
# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0

So... what am I missing here? Just a bad example in the sun documentation regarding zfs? Bruno On 25-3-2010 20:10, Freddie Cash wrote: On Thu, Mar 25, 2010 at 11:47 AM, Bruno Sousa bso...@epinfante.com mailto:bso...@epinfante.com wrote: What do you mean by Using fewer than 4 disks in a raidz2 defeats the purpose of raidz2, as you will always be in a degraded mode ? Does it mean that having 2 vdevs with 3 disks won't be redundant in the advent of a drive failure? raidz1 is similar to raid5 in that it is single-parity, and requires a minimum of 3 drives (2 data + 1 parity) raidz2 is similar to raid6 in that it is double-parity, and requires a minimum of 4 drives (2 data + 2 parity) IOW, a raidz2 vdev made up of 3 drives will always be running in degraded mode (it's missing a drive). -- Freddie Cash fjwc...@gmail.com mailto:fjwc...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Edward Ned Harvey solaris2 at nedharvey.com writes: Allow me to clarify a little further, why I care about this so much. I have a solaris file server, with all the company jewels on it. I had a pair of intel X.25 SSD mirrored log devices. One of them failed. The replacement device came with a newer version of firmware on it. Now, instead of appearing as 29.802 Gb, it appears at 29.801 Gb. I cannot zpool attach. New device is too small. So apparently I'm the first guy this happened to. Oracle is caught totally off guard. They're pulling their inventory of X25's from dispatch warehouses, and inventorying all the firmware versions, and trying to figure it all out. Meanwhile, I'm still degraded. Or at least, I think I am. Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until they can locate a drive with the correct firmware.) All the support people I have on the phone are just as scared as I am. Well we could upgrade the firmware of your existing drive, but that'll reduce it by 0.001 Gb, and that might just create a time bomb to destroy your pool at a later date. So we don't do it. Nobody has suggested that I simply shutdown and remove my unmirrored SSD, and power back on. We ran into something similar with these drives in an X4170 that turned out to be an issue of the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs where running the exact same version of firmware and recreated the volumes on new drives arriving from Sun we got back into sync on the X25-E devices sizes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RAIDZ2 configuration
Hi Cindy, This whole issue started when I asked opinion on this list about how I should create zpools. It seems that one of my initial ideas, creating a vdev with 3 disks in a raidz configuration, is a nonsense configuration. Somewhere along the way I defended my initial idea with the fact that the documentation from Sun has such a configuration as an example, as seen here: zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0 at http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view So if by concept the idea of having a vdev with 3 disks within a raidz configuration is a bad one, the official Sun documentation should not have such an example. However, if people put such an example in the Sun documentation, perhaps this whole idea is not that bad at all... Can you provide anything on this subject? Thanks, Bruno On 31-3-2010 23:49, Cindy Swearingen wrote: Hi Ned, If you look at the examples on the page that you cite, they start with single-parity RAIDZ examples and then move to a double-parity RAIDZ example with supporting text, here: http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view Can you restate the problem with this page? Thanks, Cindy On 03/26/10 05:42, Edward Ned Harvey wrote: Just because most people are probably too lazy to click the link, I’ll paste a phrase from that sun.com webpage below: “Creating a single-parity RAID-Z pool is identical to creating a mirrored pool, except that the ‘raidz’ or ‘raidz1’ keyword is used instead of ‘mirror’.” And “zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0” So … Shame on you, Sun, for doing this to your poor unfortunate readers. It would be nice if the page were a wiki, or somehow able to have feedback submitted… From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Bruno Sousa Sent: Thursday, March 25, 2010 3:28 PM To: Freddie Cash Cc: ZFS filesystem discussion list Subject: Re: [zfs-discuss] RAIDZ2 configuration Hmm... it might be completely wrong, but the idea of a raidz2 vdev with 3 disks came from the reading of http://docs.sun.com/app/docs/doc/819-5461/gcvjg?a=view . This particular page has the following example:

# zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0
# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0

So... what am I missing here? Just a bad example in the sun documentation regarding zfs? Bruno On 25-3-2010 20:10, Freddie Cash wrote: On Thu, Mar 25, 2010 at 11:47 AM, Bruno Sousa bso...@epinfante.com mailto:bso...@epinfante.com wrote: What do you mean by Using fewer than 4 disks in a raidz2 defeats the purpose of raidz2, as you will always be in a degraded mode ? Does it mean that having 2 vdevs with 3 disks won't be redundant in the advent of a drive failure? raidz1 is similar to raid5 in that it is single-parity, and requires a minimum of 3 drives (2 data + 1 parity) raidz2 is similar to raid6 in that it is double-parity, and requires a minimum of 4 drives (2 data + 2 parity) IOW, a raidz2 vdev made up of 3 drives will always be running in degraded mode (it's missing a drive). -- Freddie Cash fjwc...@gmail.com mailto:fjwc...@gmail.com
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
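For comparison with the documentation example being debated, the conventional minimum for raidz2 is four disks; a 3-disk raidz2 is accepted by zpool but yields only one disk's worth of usable space per stripe, the same space efficiency as a 3-way mirror:

# zpool create tank raidz2 c1t0d0 c2t0d0 c3t0d0 c4t0d0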
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 17:31, Bob Friesenhahn wrote: On Wed, 31 Mar 2010, Edward Ned Harvey wrote: Would your users be concerned if there was a possibility that after extracting a 50 MB tarball that files are incomplete, whole subdirectories are missing, or file permissions are incorrect? Correction: Would your users be concerned if there was a possibility that after extracting a 50MB tarball *and having a server crash* the files could be corrupted as described above? If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power down or reboot. Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue? Reliable NFS synchronous writes are good for the system administrators. Well, it really depends on your environment. There is a place for an Oracle database and there is a place for MySQL; you don't really need to cluster everything, and there are environments where disabling the ZIL is perfectly acceptable. One such case is when you need to re-import a database or recover lots of files over NFS - your service is down, and disabling the ZIL makes the recovery MUCH faster. Then there are cases when leaving the ZIL disabled is acceptable as well. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
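For reference, on builds of this era the system-wide switch Robert mentions is the zil_disable tunable; a sketch, to be verified against your particular release (the per-dataset control he anticipates later surfaced as the zfs sync property):

* /etc/system fragment; takes effect at the next boot
set zfs:zil_disable = 1

Or, on a live system, the same caveat applying:

# echo zil_disable/W0t1 | mdb -kw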
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 17:22, Edward Ned Harvey wrote: The advice I would give is: Do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption. I don't really get it - rolling back to the last snapshot doesn't really improve things here; it actually makes it worse, as now you are going to lose even more data. Keep in mind that currently the maximum time after which ZFS commits a transaction is 30s - ZIL or not. So with a disabled ZIL, in the worst-case scenario you should lose no more than the last 30-60s. You can tune it down if you want. Rolling back to a snapshot will only make it worse. Then also keep in mind that it is a worst-case scenario here - it may well be that there were no outstanding transactions at all - it all comes down basically to a risk assessment, an impact assessment and a cost. Unless you are talking about doing regular snapshots and making sure that the application is consistent while doing so - for example putting all Oracle tablespaces in hot backup mode and taking a snapshot... otherwise it doesn't really make sense. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
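The 30s figure cited here is the transaction group commit interval. On builds of this era it is governed by the zfs_txg_timeout tunable; verify the name on your release before relying on it:

* /etc/system: commit a transaction group at least every 5 seconds
set zfs:zfs_txg_timeout = 5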
Re: [zfs-discuss] bit-flipping in RAM...
On 31/03/2010 16:44, Bob Friesenhahn wrote: On Wed, 31 Mar 2010, Robert Milkowski wrote: or there might be an extra zpool level (or system wide) property to enable checking checksums on every access from ARC - there will be a significant performance impact but then it might be acceptable for really paranoid folks especially with modern hardware. How would this checking take place for memory mapped files? Well, and it wouldn't help if data were corrupted in an application internal buffer after read() succeeded, or just before an application does a write(). So I wasn't saying that it would work in all circumstances, but rather I was trying to say that it probably shouldn't be dismissed on a performance argument alone, as for some use cases with modern HW it might well be that the performance will still be acceptable while providing better protection and a stronger data correctness guarantee. But even then, while the mmap() issue is probably solvable, the read() and write() cases are probably not. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 31/03/2010 21:38, Miles Nordin wrote: rm Which is an expected behavior when you break NFS requirements rm as Linux does out of the box. wrong. The default is 'sync' in /etc/exports. The default has changed, but the default is 'sync', and the whole thing is well-documented. I double checked the documentation and you're right - the default has changed to sync. I haven't found in which RH version it happened, but it doesn't really matter. So yes, I was wrong - the current default seems to be sync on Linux as well. -- Robert Milkowski http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] benefits of zfs root over ufs root
Hi Folks, I'm in a shop that's very resistant to change. The management here are looking for major justification of a move away from ufs to zfs for root file systems. Does anyone know if there are any whitepapers/blogs/discussions extolling the benefits of zfsroot over ufsroot? Regards in advance Rep -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bit-flipping in RAM...
On 2010/03/31 05:13, Darren J Moffat wrote: On 31/03/2010 10:27, Erik Trimble wrote: Orvar's post over in opensol-discuss has me thinking: After reading the paper and looking at design docs, I'm wondering if there is some facility to allow for comparing data in the ARC to its corresponding checksum. That is, if I've got the data I want in the ARC, how can I be sure it's correct (and free of hardware memory errors)? I'd assume the way is to also store absolutely all the checksums for all blocks/metadata being read/written in the ARC (which, of course, means that only so much RAM corruption can be compensated for), and do a validation every time that block is used/written from the ARC. You'd likely have to do constant metadata consistency checking, and likely have to hold multiple copies of metadata in-ARC to compensate for possible corruption. I'm assuming that this has at least been explored, right? A subset of this is already done. The ARC keeps its own in memory checksum (because some buffers in the ARC are not yet on stable storage so don't have a block pointer checksum yet). http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c arc_buf_freeze() arc_buf_thaw() arc_cksum_verify() arc_cksum_compute() It isn't done on every access but it can detect in memory corruption - I've seen it happen on several occasions but all due to errors in my code not bad physical memory. Doing it more frequently could cause a significant performance problem. Agreed. I think it's probably not a very good idea to check it everywhere. It would be great if we can do some checks occasionally, especially for critical data structures, but, if it's the memory we cannot trust, how can we trust the checksum checker to behave correctly? I had some questions about the FAST paper mentioned by Erik, which were not answered during the conference, which makes me feel that the paper, while it pointed out some interesting issues, failed to prove that this is a real-world problem:

- How probable is a bit flip on a non-ECC system? Say, how many bits would be flipped per terabyte processed, or per transaction, or something similar?
- Among these flipped bits, how many would land in a file system buffer? What happens when, say, the application's memory hits a flipped bit, while the file system itself has no problem with its buffers?
- How much of a performance penalty would there be if we checked the checksums every time the data is accessed? How good would the check be compared to ECC in terms of correctness?

Cheers, -- Xin LI delp...@delphij.net http://www.delphij.net/ FreeBSD - The Power to Serve! Live free or die ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool split problem?
I assume the swap, dumpadm, grub issues are because the pool has a different name now, but is it still a problem if you take it to a *different system*, boot off a CD, and change it back to rpool? (which is most likely unsupported, i.e. no help to get it working) Over 10 years ago (way before flash archive existed) I developed a script, used after splitting a mirror, which would remove most of the device tree, clean up path_to_inst, etc. so it looked like the OS was just installed and about to do the reboot without the install CD. (everything was still in there except for hardware-specific stuff; I no longer have the script and most likely would not do it again because it's not a supported install method) I still had to boot from CD on the new system and create the dev tree before booting off the disk for the first time, and then fix vfstab (but the vfstab fix should be gone with a zfs rpool) It would be nice for Oracle/Sun to produce a separate script which resets system/devices back to an install-like beginning, so you can move an OS disk with current password file and software from one system to another and have it rebuild the device tree on the new system. From memory (updated for zfs), something like:

zpool split rpool newrpool
mount newrpool
remove newrpool/dev and newrpool/devices of all non-packaged content (i.e. dynamically created content)
clean up newrpool/etc/path_to_inst
create /newrpool/reconfigure
remove all previous snapshots in newrpool
update beadm info inside newrpool
ensure grub is installed on the disk

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
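To make those steps concrete, a hypothetical and untested sketch of such a script; every path and device name here is illustrative only:

#!/bin/ksh
# assumes: zpool split rpool newrpool has already been done
zpool import -R /a newrpool
# drop dynamically created device nodes so they get rebuilt on first boot
rm -rf /a/dev/dsk/* /a/dev/rdsk/* /a/devices/*
# reset path_to_inst (a real script would preserve the bootstrap line)
cp /dev/null /a/etc/path_to_inst
# force a reconfiguration boot on the target system
touch /a/reconfigure
# remove snapshots carried over by the split
zfs list -H -t snapshot -o name -r newrpool | xargs -n 1 zfs destroy
# reinstall the boot loader (x86; sparc would use installboot instead)
installgrub /a/boot/grub/stage1 /a/boot/grub/stage2 /dev/rdsk/c1t1d0s0
zpool export newrpool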
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 19:41, Robert Milkowski wrote: I double checked the documentation and you're right - the default has changed to sync. I haven't found in which RH version it happened but it doesn't really matter. From the SourceForge site: Since version 1.0.1 of the NFS utilities tarball has changed the server export default to sync, then, if no behavior is specified in the export list (thus assuming the default behavior), a warning will be generated at export time. http://nfs.sourceforge.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] benefits of zfs root over ufs root
Brett wrote: Hi Folks, I'm in a shop that's very resistant to change. The management here are looking for major justification of a move away from ufs to zfs for root file systems. Does anyone know if there are any whitepapers/blogs/discussions extolling the benefits of zfsroot over ufsroot? Regards in advance Rep I can't give you any links, but here's a short list of advantages:
(1) all the standard ZFS advantages over UFS
(2) LiveUpgrade/beadm related improvements
    (a) much faster on ZFS
    (b) don't need a dedicated slice per OS instance, so it's far simpler to have N different OS installs
    (c) very easy to keep track of which OS instance is installed where WITHOUT having to mount each one
    (d) huge space savings (snapshots save lots of space on upgrades)
(3) much more flexible swap space allocation (no hard-boundary slices)
(4) simpler layout of filesystem partitions, and more flexible in changing directory size limits (e.g. /var)
(5) mirroring a boot disk is simple under ZFS - much more complex under SVM/UFS
(6) root-pool snapshots make backups trivially easy
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
So we tried recreating the pool and sending the data again. 1) Compression wasn't set on the copy, even though I did send -R, which is supposed to send all properties. 2) I tried killing the send | receive pipe. Receive couldn't be killed. It hung. 3) This is Solaris Cluster. We tried forcing a failover. The pool mounted on the other server without dismounting on the first. zpool list showed it mounted on both machines. zpool iostat showed I/O actually occurring on both systems. Altogether this does not give me a good feeling about ZFS. I'm hoping the problem is just with receive and Cluster, and that it works properly on a single system, because I'm running a critical database on ZFS on another system. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bit-flipping in RAM...
On Thu, Apr 01, 2010 at 12:38:29AM +0100, Robert Milkowski wrote: So I wasn't saying that it would work in all circumstances, but rather I was trying to say that it probably shouldn't be dismissed on a performance argument alone, as for some use cases It would be of great utility even if considered only as a diagnostic measure - i.e., for qualifying tests or when something else raises suspicion and you want to eliminate/confirm sources of problems. With a suitable pointer in a FAQ/troubleshooting guide, it could reduce the number / improve the quality of problem reports related to bad h/w. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl wrote: On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds of threads. Getting half-ways decent performance from NFS and ZFS is impossible unless you disable the ZIL. Well, for lots of environments disabling ZIL is perfectly acceptable. And frankly the reason you get better performance out of the box on Linux as NFS server is that it actually behaves like with disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using Linux here or any other OS which behaves in the same manner. Actually it makes it better as even if ZIL is disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits from ZFS. What would be useful though is to be able to easily disable the ZIL per dataset instead of an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later. Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. For many years before that they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault. If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally? -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 7:11 PM, Ross Walker wrote: On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl wrote: On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds of threads. Getting half-ways decent performance from NFS and ZFS is impossible unless you disable the ZIL. Well, for lots of environments disabling ZIL is perfectly acceptable. And frankly the reason you get better performance out of the box on Linux as NFS server is that it actually behaves like with disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using Linux here or any other OS which behaves in the same manner. Actually it makes it better as even if ZIL is disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits from ZFS. What would be useful though is to be able to easily disable the ZIL per dataset instead of an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later. Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. For many years before that they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault. If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally? This is not true for sync data written locally, unless you disable the ZIL locally. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
On 04/ 1/10 02:01 PM, Charles Hedrick wrote: So we tried recreating the pool and sending the data again. 1) compression wasn't set on the copy, even though I did sent -R, which is supposed to send all properties 2) I tried killing to send | receive pipe. Receive couldn't be killed. It hung. How long did you wait and how much data had been sent? Killing a receive can take a (long!) while if it has to free all data already written. -- Ian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 10:25 PM, Richard Elling richard.ell...@gmail.com wrote: On Mar 31, 2010, at 7:11 PM, Ross Walker wrote: On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl wrote: On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss Use something other than Open/Solaris with ZFS as an NFS server? :) I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds of threads. Getting half-ways decent performance from NFS and ZFS is impossible unless you disable the ZIL. Well, for lots of environments disabling ZIL is perfectly acceptable. And frankly the reason you get better performance out of the box on Linux as NFS server is that it actually behaves like with disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using Linux here or any other OS which behaves in the same manner. Actually it makes it better as even if ZIL is disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits from ZFS. What would be useful though is to be able to easily disable the ZIL per dataset instead of an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later. Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. For many years before that they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault. If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally? This is not true for sync data written locally, unless you disable the ZIL locally. No, of course, if it's written sync with the ZIL it is safe; it just seems that over Solaris NFS all writes are delayed, not just sync writes. -Ross ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
Ah, I hadn't thought about that. That may be what was happening. Thanks. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] can't destroy snapshot
So that eliminates one of my concerns. However the other one is still an issue. Presumably Solaris Cluster shouldn't import a pool that's still active on the other system. We'll be looking more carefully into that. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] benefits of zfs root over ufs root
On Wed, Mar 31, 2010 at 7:53 PM, Erik Trimble erik.trim...@oracle.com wrote: Brett wrote: Hi Folks, I'm in a shop that's very resistant to change. The management here are looking for major justification of a move away from ufs to zfs for root file systems. Does anyone know if there are any whitepapers/blogs/discussions extolling the benefits of zfsroot over ufsroot? Regards in advance Rep I can't give you any links, but here's a short list of advantages:
(1) all the standard ZFS advantages over UFS
(2) LiveUpgrade/beadm related improvements
    (a) much faster on ZFS
    (b) don't need a dedicated slice per OS instance, so it's far simpler to have N different OS installs
    (c) very easy to keep track of which OS instance is installed where WITHOUT having to mount each one
    (d) huge space savings (snapshots save lots of space on upgrades)
(3) much more flexible swap space allocation (no hard-boundary slices)
(4) simpler layout of filesystem partitions, and more flexible in changing directory size limits (e.g. /var)
(5) mirroring a boot disk is simple under ZFS - much more complex under SVM/UFS
(6) root-pool snapshots make backups trivially easy
-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) I don't think 2b is given enough emphasis. The ability to quickly clone your root filesystem, apply whatever change you need to (patch, config change), reboot into the new environment, and be able to provably back out to the prior state with ease is a life saver (yes you could do this with ufs, but it assumes you have enough free slices on your direct attached disks, and it takes _far_ longer simply because you must copy the entire boot environment first -- adding probably a few hours, versus the ~1s to snapshot + clone). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
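On OpenSolaris the cycle described here is only a few commands (BE names hypothetical; on Solaris 10 the lucreate/luactivate pair plays the same role, just slower on UFS):

# beadm create patch-test        (clones the active BE almost instantly)
# beadm activate patch-test
# init 6

If the change misbehaves, point the system back at the old BE and reboot:

# beadm activate opensolaris-1
# init 6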
Re: [zfs-discuss] can't destroy snapshot
On 04/ 1/10 02:01 PM, Charles Hedrick wrote: So we tried recreating the pool and sending the data again. 1) compression wasn't set on the copy, even though I did sent -R, which is supposed to send all properties Was compression explicitly set on the root filesystem of your set? I don't think compression will be on if the root of a sent filesystem tree inherits the property from its parent. I normally set compression on the the pool, then explicitly off on an any filesystems where it isn't appropriate. -- Ian. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
I see the source for some confusion. On the ZFS Best Practices page: http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide It says: Failure of the log device may cause the storage pool to be inaccessible if you are running the Solaris Nevada release prior to build 96 and a release prior to the Solaris 10 10/09 release. It also says: If a separate log device is not mirrored and the device that contains the log fails, storing log blocks reverts to the storage pool. I have some more concrete data on this now. Running Solaris 10u8 (which is 10/09), fully updated last weekend. We want to explore the consequences of adding or failing a non-mirrored log device. We created a pool with a non-mirrored ZIL log device. And experimented with it:

(a) Simply yank out the non-mirrored log device while the system is live. The result was: Any zfs or zpool command would hang permanently. Even zfs list hangs permanently. The system cannot shutdown, cannot reboot, cannot zfs send or zfs snapshot or anything ... It's a bad state. You're basically hosed. Power cycle is the only option.

(b) After power cycling, the system won't boot. It gets part way through the boot process, and eventually just hangs there, infinitely cycling error messages about services that couldn't start. Random services, such as inetd, which seem unrelated to some random data pool that failed. So we power cycle again, and go into failsafe mode, to clean up and destroy the old messed up pool ... Boot up totally clean again, and create a new totally clean pool with a non-mirrored log device. Just to ensure we really are clean, we simply zpool export and zpool import with no trouble, and reboot once for good measure. zfs list and everything are all working great...

(c) Do a zpool export. Obviously, the ZIL log device is clean and flushed at this point, not being used. We simply yank out the log device, and do zpool import. Well ... Without that log device, I forget the terminology, it said something like missing disk. Plain and simple, you *can* *not* import the pool without the log device. It does not say "use -f to force," and even if you specify -f, it still just throws the same error message, missing disk or whatever. Won't import. Period.

... So, to anybody who said the failed log device will simply fail over to blocks within the main pool: Sorry. That may be true in some later version, but it is not the slightest bit true in the absolute latest Solaris (proper) available today. I'm going to venture a guess this is no longer a problem after zpool version 19. This is when ZFS log device removal was introduced. Unfortunately, the latest version of Solaris only goes up to zpool version 15. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
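For the record, the relief anticipated at zpool version 19 corresponds to two operations; a sketch only, since it requires bits newer than Solaris 10 10/09 (pool and device names hypothetical):

# remove a slog (failed or healthy) from a pool at version >= 19
zpool remove tank c3t0d0
# import a pool whose separate log device is missing, discarding uncommitted log entries
zpool import -m tank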
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
A MegaRAID card with write-back cache? It should also be cheaper than the F20.

I haven't posted results yet, but I just finished a few weeks of extensive benchmarking of various configurations. I can say this: WriteBack cache is much faster than naked disks, but if you can buy an SSD or two for a ZIL log device, the dedicated ZIL is again much faster than WriteBack. It doesn't have to be the F20; you could use the Intel X25, for example. If you're running Solaris proper, you had better mirror your ZIL log device. If you're running OpenSolaris ... I don't know whether that matters. I'll probably test it just to be sure, but I might never get around to it, because I don't have a justifiable business reason to build an OpenSolaris machine just for this one little test.

Seriously: all disks configured WriteThrough (spindle and SSD alike) with the dedicated ZIL SSD device was very noticeably faster than enabling WriteBack. Adding a mirrored log is sketched below.
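If anyone wants to try this at home, a minimal sketch of attaching a mirrored SSD log to an existing pool (device names hypothetical):

    zpool add tank log mirror c2t0d0 c2t1d0   # mirrored log device
    zpool status tank                         # "logs" section shows the mirror
    # a single, non-mirrored log would be:  zpool add tank log c2t0d0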
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
We ran into something similar with these drives in an X4170; it turned out to be an issue with the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs were running the exact same version of firmware and recreated the volumes on new drives arriving from Sun, we got back in sync on the X25-E device sizes.

Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive and created a simple volume in the StorageTek RAID utility, the new drive was 0.001 GB smaller than the old drive. I'm still hosed. Are you saying I might benefit from sticking the SSD into some laptop and zeroing the disk, and then attaching it to the Sun server? Are you saying I might benefit from finding some other way to make the drive available, instead of using the StorageTek RAID utility? Thanks for the suggestions...
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 8:58 PM, Edward Ned Harvey wrote:

We ran into something similar with these drives in an X4170; it turned out to be an issue with the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs were running the exact same version of firmware and recreated the volumes on new drives arriving from Sun, we got back in sync on the X25-E device sizes.

Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive and created a simple volume in the StorageTek RAID utility, the new drive was 0.001 GB smaller than the old drive. I'm still hosed. Are you saying I might benefit from sticking the SSD into some laptop and zeroing the disk, and then attaching it to the Sun server? Are you saying I might benefit from finding some other way to make the drive available, instead of using the StorageTek RAID utility?

Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility called /opt/StorMan/arcconf and reports itself as the amazingly informative model number "Sun STK RAID INT", what worked for me was to run:

arcconf delete (to delete the preconfigured volume shipped on the drive)
arcconf create (to create a new volume)

What I observed was that arcconf getconfig 1 would show the same physical device size for our existing drives and the new ones from Sun, but they reported slightly different logical volume sizes. I am fairly sure that was due to the Sun factory creating the initial volume with a different version of the HBA controller firmware than we were using to create our own volumes. If I remember the direction correctly, the newer firmware creates larger logical volumes, and you really want to upgrade the firmware if you are going to be running multiple X25-E drives from the same controller.

I hope that helps.

-- Stuart Anderson ander...@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
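For anyone following along, the sequence might look roughly like this -- the controller number, logical drive number, and channel/device arguments are hypothetical, and exact arcconf argument forms can vary by firmware version:

    /opt/StorMan/arcconf getconfig 1 ld          # inspect logical devices on controller 1
    /opt/StorMan/arcconf delete 1 logicaldrive 0 # drop the factory-shipped volume
    /opt/StorMan/arcconf create 1 logicaldrive max volume 0 0
                                                 # recreate a simple volume spanning the
                                                 # whole disk (channel 0, device 0)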
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On Mar 31, 2010, at 9:22 AM, Edward Ned Harvey wrote:

Would your users be concerned if there was a possibility that, after extracting a 50 MB tarball, files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

Correction: would your users be concerned if there was a possibility that, after extracting a 50 MB tarball *and having a server crash*, files could be corrupted as described above? If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is an ungraceful power-down or reboot.

The advice I would give is: take ZFS auto-snapshots frequently (say, every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot -- and roll back once more for good measure. As long as you can afford to risk the most recent 5-10 minutes of work after a crash, you can get a 10x performance boost most of the time, with no risk of the aforementioned data corruption.

This approach does not solve the problem. When you take a snapshot, the txg is committed, so rolling back to a snapshot does not recover sync writes that were lost in the crash. If you wish to reduce the exposure to loss of sync data while running with the ZIL disabled, you can change the txg commit interval -- however, changing the txg commit interval will not eliminate the possibility of data loss. An example of the tunable is below.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
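For the record, the txg commit interval is a kernel tunable; on Solaris of that era it was typically zfs_txg_timeout (the name and default vary by release, and the 5-second value below is only an illustration, not a recommendation):

    * in /etc/system (takes effect at next boot):
    set zfs:zfs_txg_timeout = 5

    # or live, via mdb (takes effect immediately, lost at reboot):
    echo zfs_txg_timeout/W 0t5 | mdb -kw

A shorter interval narrows the window of asynchronous data that can be lost in a crash, but as noted above it does nothing to restore the guarantee that acknowledged synchronous writes are on stable storage.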