Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: Matthew Ahrens wrote: The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.)

Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that's the case I'm still confused as to what failure cases would still allow you to retrieve your data if there are more than one copy in the fs or pool... but I'll gladly take some enlightenment. :)

(My apologies for the length of this response, I'll try to address most of the issues brought up recently...)

When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure).

One case where this feature would be useful is if you have a pool with no redundancy (ie. no mirroring or raid-z), because most of the data in the pool is not very important. However, the pool may have a bunch of disks in it (say, four). The administrator/user may realize (perhaps later on) that some of their data really *is* important and they would like some protection against losing it if a disk fails. They may not have the option of adding more disks to mirror all of their data (cost or physical space constraints may apply here). Their problem is solved by creating a new filesystem with copies=2 and putting the important data there. Now, if a disk fails, then the data in the copies=2 filesystem will not be lost. Approximately 1/4 of the data in other filesystems will be lost. (There is a small chance that some tiny fraction of the data in the copies=2 filesystem will still be lost if we were forced to put both copies on the disk that failed.)

Another plausible use case would be where you have some level of redundancy. Say you have a Thumper (X4500) with its 48 disks configured into 9 5-wide single-parity raid-z groups (with 3 spares). If a single disk fails, there will be no data loss. However, if two disks within the same raid-z group fail, data will be lost. In this scenario, imagine that this data loss probability is acceptable for most of the data stored here, but there is some extremely important data for which this is unacceptable. Rather than reconfiguring the entire pool for higher redundancy (say, double-parity raid-z) and less usable storage, you can simply create a filesystem with copies=2 within the raid-z storage pool. Data within that filesystem will not be lost even if any three disks fail.

I believe that these use cases, while not being extremely common, do occur. The extremely low amount of engineering effort required to implement the feature (modulo the space accounting issues) seems justified.

The fact that this feature does not solve all problems (eg, it is not intended to be a replacement for mirroring) is not a downside; not all features need to be used in all situations :-)

The real problem with this proposal is the confusion surrounding disk space accounting with copies>1. While the same issues are present when using compression, people are understandably less upset when files take up less space than expected. Given the current lack of interest in this feature, the effort required to address the space accounting issue does not seem justified at this time.

--matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
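For concreteness, the administrative model described in the proposal would look roughly like the following, assuming the proposed 'copies' property lands as an ordinary per-filesystem ZFS property. This is a sketch only; the pool layout, device names, and filesystem name are hypothetical:

# zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0   (single-parity raid-z, as in the example above)
# zfs create tank/important
# zfs set copies=2 tank/important    (extra ditto copies only for the data that matters)
# zfs get copies tank/important      (verify; everything else in the pool stays at copies=1)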
Re: [zfs-discuss] Proposal: multiple copies of user data
David Dyer-Bennet wrote: On 9/12/06, eric kustarz [EMAIL PROTECTED] wrote: So it seems to me that having this feature per-file is really useful. Say i have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When i get back from the presentation i can turn off the extra copies.

Yes, you could do that. *I* would make a copy on a CD, which I would carry in a separate case from the laptop.

Do you backup the presentation to CD every time you make an edit? I think my presentation is a lot safer than your presentation. I'm sure both of our presentations would be equally safe as we would know not to have the only copy(ies) on our personage.

Similarly for your digital images example; I don't consider it safe until I have two or more *independent* copies. Two copies on a single hard drive doesn't come even close to passing the test for me; as many people have pointed out, those tend to fail all at once. And I will also point out that laptops get stolen a lot. And of course all the accidents involving fumble-fingers, OS bugs, and driver bugs won't be helped by the data duplication either. (Those will mostly be helped by sensible use of snapshots, though, which is another argument for ZFS on *any* disk you work on a lot.)

Well of course you would have a separate, independent copy if it really mattered.

The more I look at it the more I think that a second copy on the same disk doesn't protect against very much real-world risk. Am I wrong here? Are partial (small) disk corruptions more common than I think? I don't have a good statistical view of disk failures.

Well let's see - my friend accompanied me on a trip and saved her photos daily onto her laptop. Near the end of the trip her hard drive started having problems. The hard drive was not dead, as it was bootable and you could access certain data. Upon returning home she was able to retrieve some of her photos but not all. She would have been much happier having ZFS + copies. And yes, you could backup to CD/DVD every night, but it's a pain and people don't do it (as much as they should). Side note: it would have cost hundreds of dollars for data recovery to have just the *possibility* to get the other photos.

eric ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: eric kustarz wrote: Matthew Ahrens wrote: Matthew Ahrens wrote: Here is a proposal for a new 'copies' property which would allow different levels of replication for different filesystems. Thanks everyone for your input. The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.) Given the overwhelming criticism of this feature, I'm going to shelve it for now.

So it seems to me that having this feature per-file is really useful. Say i have a presentation to give in Pleasanton, and the presentation lives on my single-disk laptop - I want all the meta-data and the actual presentation to be replicated. We already use ditto blocks for the meta-data. Now we could have an extra copy of the actual data. When i get back from the presentation i can turn off the extra copies.

Under what failure modes would your data still be accessible? What things can go wrong that still allow you to access the data because some event has removed one copy but left the others?

Silent data corruption of one of the copies.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 13/09/06, Matthew Ahrens [EMAIL PROTECTED] wrote: Dick Davies wrote: But they raise a lot of administrative issues Sure, especially if you choose to change the copies property on an existing filesystem. However, if you only set it at filesystem creation time (which is the recommended way), then it's pretty easy to address your issues: You're right, that would prevent getting into some nasty messes (I see this as closer to encryption than compression in that respect). I still feel we'd be doing the same job in several places. But I'm sure anyone who cares has a pretty good idea of my opinion, so I'll shut up now :) Thanks for taking the time to feedback on the feedback. -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS imported simultaneously on 2 systems...
I think this is user error: the man page explicitly says: -f Forces import, even if the pool appears to be potentially active. and that's exactly what you did. If the behaviour had been the same without the -f option, I guess this would be a bug. HTH

Mathias F wrote: Hi, we are testing ZFS atm as a possible replacement for Veritas VM. While testing, we encountered a serious problem, which corrupted the whole filesystem. First we created a standard Raid10 with 4 disks.

NODE2:../# zpool create -f swimmingpool mirror c0t3d0 c0t11d0 mirror c0t4d0 c0t12d0
NODE2:../# zpool list
NAME          SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
swimmingpool  33.5G  81K   33.5G  0%   ONLINE  -
NODE2:../# zpool status
  pool: swimmingpool
 state: ONLINE
 scrub: none requested
config:
        NAME          STATE  READ WRITE CKSUM
        swimmingpool  ONLINE    0     0     0
          mirror      ONLINE    0     0     0
            c0t3d0    ONLINE    0     0     0
            c0t11d0   ONLINE    0     0     0
          mirror      ONLINE    0     0     0
            c0t4d0    ONLINE    0     0     0
            c0t12d0   ONLINE    0     0     0
errors: No known data errors

After that we made a new ZFS and copied a testing file onto it.

NODE2:../# zfs create swimmingpool/babe
NODE2:../# zfs list
NAME               USED   AVAIL  REFER  MOUNTPOINT
swimmingpool       108K   33.0G  25.5K  /swimmingpool
swimmingpool/babe  24.5K  33.0G  24.5K  /swimmingpool/babe
NODE2:../# cp /etc/hosts /swimmingpool/babe/

Now we test the behaviour of importing the ZFS on another system while it is still imported on the first one. The expected behaviour would be that ZFS couldn't be imported due to possible corruption, but instead it is imported just fine! We now were able to write simultaneously from both systems on the same ZFS:

NODE1:../# zpool import -f swimmingpool
NODE1:../# man man > /swimmingpool/babe/man
NODE2:../# cat /dev/urandom > /swimmingpool/babe/testfile
NODE1:../# cat /dev/urandom > /swimmingpool/babe/testfile2
NODE1:../# ls -l /swimmingpool/babe/
-r--r--r--  1 root  root        2194 Sep  8 14:31 hosts
-rw-r--r--  1 root  root       17531 Sep  8 14:52 man
-rw-r--r--  1 root  root  3830447920 Sep  8 16:20 testfile2
NODE2:../# ls -l /swimmingpool/babe/
-r--r--r--  1 root  root        2194 Sep  8 14:31 hosts
-rw-r--r--  1 root  root  3534355760 Sep  8 16:19 testfile

This can't be supposed to be the normal behaviour. Did we encounter a bug or is this still under development?

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Michael Schuster +49 89 46008-2974 / x62974 visit the online support center: http://www.sun.com/osc/ Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Memory Usage
On 9/13/06, Thomas Burns [EMAIL PROTECTED] wrote: BTW -- did I guess right wrt where I need to set arc.c_max (/etc/system)?

I think you need to use mdb. As Mark and Johansen mentioned, only do this as your last resort.

# mdb -kw
> arc::print -a c_max
d3b0f874 c_max = 0x1d0fe800
> d3b0f874 /W 0x1000
arc+0x34: 0x1d0fe800 = 0x1000
> arc::print -a c_max
d3b0f874 c_max = 0x1000
> $q

-- Just me, Wire ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Well, we are using the -f parameter to test failover functionality. If one system with mounted ZFS is down, we have to use the force to mount it on the failover system. But when the failed system comes online again, it remounts the ZFS without errors, so it is mounted simultaneously on both nodes. That's the real problem we have :[

Regards, Mathias

I think this is user error: the man page explicitly says: -f Forces import, even if the pool appears to be potentially active. and that's exactly what you did. If the behaviour had been the same without the -f option, I guess this would be a bug. HTH

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote: Well, we are using the -f parameter to test failover functionality. If one system with mounted ZFS is down, we have to use the force to mount it on the failover system. But when the failed system comes online again, it remounts the ZFS without errors, so it is mounted simultaneously on both nodes.

ZFS currently doesn't support this, I'm sorry to say. *You* have to make sure that a zpool is not imported on more than one node at a time.

regards -- Michael Schuster +49 89 46008-2974 / x62974 visit the online support center: http://www.sun.com/osc/ Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
On Wed, Sep 13, 2006 at 12:28:23PM +0200, Michael Schuster wrote: Mathias F wrote: Well, we are using the -f parameter to test failover functionality. If one system with mounted ZFS is down, we have to use the force to mount it on the failover system. But when the failed system comes online again, it remounts the ZFS without errors, so it is mounted simultaneously on both nodes.

This is used on a regular basis within cluster frameworks...

ZFS currently doesn't support this, I'm sorry to say. *You* have to make sure that a zpool is not imported on more than one node at a time.

Why not use real cluster software to be that *You*, taking care of using resources like a filesystem (ufs, zfs, others...) in a consistent way? I think ZFS does enough to make sure you don't accidentally use filesystems/pools from more than one host at a time. If you want more, please consider using a cluster framework with heartbeats and all that great stuff ...

Regards, Thomas ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] 'zfs mirror as backup' status?
Since we were just talking about resilience on laptops, I wondered if there had been any progress in sorting out some of the glitches that were involved in: http://www.opensolaris.org/jive/thread.jspa?messageID=25144#25144 ? -- Rasputin :: Jack of All Trades - Master of Nuns http://number9.hellooperator.net/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Without the -f option, the ZFS pool can't be imported while it is reserved for the other host, even if that host is down. As I said, we are testing ZFS as a replacement for VxVM, which we are using atm. So as a result our tests have failed and we have to keep on using Veritas. Thanks for all your answers. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/13/06, Richard Elling [EMAIL PROTECTED] wrote: * Mirroring offers slightly better redundancy, because one disk from each mirror can fail without data loss.

Is this use of slightly based upon disk failure modes? That is, when disks fail do they tend to get isolated areas of badness compared to complete loss? I would suggest that complete loss should include someone tripping over the power cord to the external array that houses the disk.

The field data I have says that complete disk failures are the exception. I hate to leave this as a teaser, I'll expand my comments later. BTW, this feature will be very welcome on my laptop! I can't wait :-)

On servers and stationary desktops, I just don't care whether it is a whole disk failure or a few bad blocks. In that case I have the resources to mirror, RAID5, perform daily backups, etc. The laptop disk failures that I have seen have typically been limited to a few bad blocks. As Torrey McMahon mentioned, they tend to start out with some warning signs followed by a full failure. I would *really* like to have that window between warning signs and full failure as my opportunity to back up my data and replace my non-redundant hard drive with no data loss.

The only part of the proposal I don't like is space accounting. Double or triple charging for data will only confuse those apps and users that check for free space or block usage. If this is worked out, it would be a great feature for those times when mirroring just isn't an option.

Mike -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
On 9/13/06, Mike Gerdts [EMAIL PROTECTED] wrote: The only part of the proposal I don't like is space accounting. Double or triple charging for data will only confuse those apps and users that check for free space or block usage. Why exactly isn't reporting the free space divided by the copies value on that particular file system an easy solution for this? Did I miss something? Tobias ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote: Without -f option, the ZFS can't be imported while reserved for the other host, even if that host is down. As I said, we are testing ZFS as a replacement for VxVM, which we are using atm. So as a result our tests have failed and we have to keep on using Veritas. Thanks for all your answers.

I think I get the whole picture, let me summarise:
- you create a pool P and an FS on host A
- Host A crashes
- you import P on host B; this only works with -f, as zpool import otherwise refuses to do so.
- now P is imported on B
- host A comes back up and re-accesses P, thereby leading to (potential) corruption.
- your hope was that when host A comes back, there exists a mechanism for telling it you need to re-import.
- Vxvm, as you currently use it, has this functionality

Is that correct? regards -- Michael Schuster +49 89 46008-2974 / x62974 visit the online support center: http://www.sun.com/osc/ Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote: Without -f option, the ZFS can't be imported while reserved for the other host, even if that host is down. This is the correct behaviour. What do you want to cause? data corruption? As I said, we are testing ZFS as a [b]replacement for VxVM[/b], which we are using atm. So as a result our tests have failed and we have to keep on using Veritas. As I understand things, SunCluster 3.2 is expected to have support for HA-ZFS and until that version is released you will not be running in a supported configuration and so any errors you encounter are *your fault alone*. Didn't we have the PMC (poor man's cluster) talk last week as well? James C. McPherson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS API (again!), need quotactl(7I)
On 13/09/2006, at 2:29 AM, Eric Schrock wrote: On Tue, Sep 12, 2006 at 07:23:00AM -0400, Jeff A. Earickson wrote: Modify the dovecot IMAP server so that it can get zfs quota information to be able to implement the QUOTA feature of the IMAP protocol (RFC 2087). In this case pull the zfs quota numbers for the quota'd home directory/zfs filesystem. Just like what quotactl() would do with UFS. I am really surprised that there is no zfslib API to query/set zfs filesystem properties. Doing a fork/exec just to execute a zfs get or zfs set is expensive and inelegant.

The libzfs API will be made public at some point. However, we need to finish implementing the bulk of our planned features before we can feel comfortable with the interfaces. It will take a non-trivial amount of work to clean up all the interfaces as well as document them. It will be done eventually, but I wouldn't expect it any time soon - there are simply too many important things to get done first. If you don't care about unstable interfaces, you're welcome to use them as-is. If you want a stable interface, you are correct that the only way is through invoking 'zfs get' and 'zfs set'.

I'm sure I'm missing something, but is there some reason that statvfs() is not good enough?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
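For the simple free/used numbers at least, statvfs() already reflects ZFS quotas: df run against a quota'd dataset reports the quota as the filesystem size. A quick illustration - the dataset name, quota value, and output figures below are made up:

# zfs set quota=1g tank/home/jeff
# df -k /tank/home/jeff
Filesystem            kbytes    used   avail capacity  Mounted on
tank/home/jeff       1048576  204800  843776      20%  /tank/home/jeff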
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Hi Mathias, Mathias F wrote: Without -f option, the ZFS can't be imported while reserved for the other host, even if that host is down. As I said, we are testing ZFS as a [b]replacement for VxVM[/b], which we are using atm. So as a result our tests have failed and we have to keep on using Veritas. Sun Cluster 3.2, which is in beta at the moment, will allow you to do this automatically. I don't think what you are trying to do here will be supportable unless it's managed by SC3.2. Let me know if you'd like to try out the SC3.2 beta. Thanks, Zoram -- Zoram Thanga::Sun Cluster Development::http://blogs.sun.com/zoram ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
On Tue, 12 Sep 2006, Matthew Ahrens wrote: Torrey McMahon wrote: Matthew Ahrens wrote: The problem that this feature attempts to address is when you have some data that is more important (and thus needs a higher level of redundancy) than other data. Of course in some situations you can use multiple pools, but that is antithetical to ZFS's pooled storage model. (You have to divide up your storage, you'll end up with stranded storage and bandwidth, etc.)

Can you expand? I can think of some examples where using multiple pools - even on the same host - is quite useful given the current feature set of the product. Or are you only discussing the specific case where a host would want more reliability for a certain set of data than another? If that's the case I'm still confused as to what failure cases would still allow you to retrieve your data if there are more than one copy in the fs or pool... but I'll gladly take some enlightenment. :)

(My apologies for the length of this response, I'll try to address most of the issues brought up recently...) When I wrote this proposal, I was only seriously thinking about the case where you want different amounts of redundancy for different data. Perhaps because I failed to make this clear, discussion has concentrated on laptop reliability issues. It is true that there would be some benefit to using multiple copies on a single-disk (eg. laptop) pool, but of course it would not protect against the most common failure mode (whole disk failure).

... lots of Good Stuff elided

Soon Samsung will release a 100% flash memory based drive (32Gb) in a laptop form factor. But flash memory chips have a limited number of write cycles available, and when exceeded, this usually results in data corruption. Some people have already encountered this issue with USB thumb drives. It's especially annoying if you were using the thumb drive as what you thought was a 100% _reliable_ backup mechanism. This is a perfect application for ZFS copies=2. Also, consider that there is no time penalty for positioning the heads on a flash drive.

So now you would have 2 options in a laptop type application with a single flash based drive (a command sketch follows this message):
a) create a mirrored pool using 2 slices - expensive in terms of storage utilization
b) create a pool with no redundancy, then create a filesystem called importantPresentationData within that pool with copies=2 (or more).

Matthew - build it and they will come!

Regards, Al Hopper Logical Approach Inc, Plano, TX. [EMAIL PROTECTED] Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005 OpenSolaris Governing Board (OGB) Member - Feb 2006 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
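For option (a) in Al's list, a minimal sketch of a single-device mirror built from two slices follows. Device and slice names are hypothetical, and the usual caveat applies: this guards against localized corruption and worn-out flash blocks, not loss of the whole device.

# format                                   (carve the flash device into two equal slices, say s3 and s4)
# zpool create flashpool mirror c1t0d0s3 c1t0d0s4
# zfs create flashpool/importantPresentationData

Option (b) would instead build an unreplicated pool on the whole device and, if the proposal ships, set copies=2 only on the one filesystem that needs it.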
Re: [zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote: I think I get the whole picture, let me summarise:
- you create a pool P and an FS on host A
- Host A crashes
- you import P on host B; this only works with -f, as zpool import otherwise refuses to do so.
- now P is imported on B
- host A comes back up and re-accesses P, thereby leading to (potential) corruption.
- your hope was that when host A comes back, there exists a mechanism for telling it you need to re-import.
- Vxvm, as you currently use it, has this functionality
Is that correct?

Yes it is, you got it ;) VxVM just notices that its previously imported DiskGroup(s) (for ZFS this is the Pool) were failed over and doesn't try to re-acquire them. It waits for an admin action. The topic of clustering ZFS is not the problem atm, we just test the failover behaviour manually.

Well, I think nevertheless you'll have to wait for SunCluster 3.2 for this to work. As others have said, ZFS, as it currently is, is not made to work as you expect it to.

regards -- Michael Schuster +49 89 46008-2974 / x62974 visit the online support center: http://www.sun.com/osc/ Recursion, n.: see 'Recursion' ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote: ... Yes it is, you got it ;) VxVM just notices that it's previously imported DiskGroup(s) (for ZFS this is the Pool) were failed over and doesn't try to re-acquire them. It waits for an admin action. The topic of clustering ZFS is not the problem atm, we just test the failover behaviour manually. Actually, this is the entirety of the problem: you are expecting a product which is *not* currently multi-host-aware to behave in the same safe manner as one which is. *AND* you're doing so knowing that you are outside of the protection of a clustering framework. WHY? What valid tests do you think you are going to be able to run? Wait for the SunCluster 3.2 release (or the beta). Don't faff around with a data-killing test suite in an unsupported configuration. James C. McPherson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
James C. McPherson wrote: As I understand things, SunCluster 3.2 is expected to have support for HA-ZFS and until that version is released you will not be running in a supported configuration and so any errors you encounter are *your fault alone*.

Still, after reading Mathias's description, it seems that the former node is doing an implicit forced import when it boots back up. This seems wrong to me. zpools should be imported only if the zpool itself says it's not already taken, which of course would be overridden by a manual -f import.

zpool: sorry, i already have a boyfriend, host b
host a: darn, ok, maybe next time

rather than the current scenario:

zpool: host a, I'm over you now. host b is now the man in my life!
host a: I don't care! you're coming with me anyways. you'll always be mine!
* host a stuffs zpool into the car and drives off

...and we know those situations never turn out particularly well.

/dale ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zfs receive kernel panics the machine
Hi, I'm running some experiments with zfs send and receive on Solaris 10u2 between two different machines. On server 1 I have the following data/zones/app1838M 26.5G 836M /zones/app1 data/zones/[EMAIL PROTECTED] 2.35M - 832M - I have a script that creates a new snapshot and sends the diff to the other machine. When I do a zfs receive on the other side the machine kernel panics (see below for the panic). I've done a zpool scrub to make sure the pool is ok (no errors found) and I now wonder what steps I can take to stop this from happening. cheers, Nickus panic[cpu0]/thread=30002033020: BAD TRAP: type=31 rp=2a101067030 addr=0 mmu_fsr=0 occurred in module SUNW,UltraSPARC-IIe due to a NULL pointer dereference zfs: trap type = 0x31 pid=615, pc=0x11efa24, sp=0x2a1010668d1, tstate=0x4480001602, context=0x4cd g1-g7: 7ba9a3a4, 0, 1864400, 0, , 10, 30002033020 02a101066d50 unix:die+78 (31, 2a101067030, 0, 0, 2a101066e10, 1075000) %l0-3: c080 0031 0100 2000 %l4-7: 0181a010 0181a000 004480001602 02a101066e30 unix:trap+8fc (2a101067030, 5, 1fff, 1c00, 0, 1) %l0-3: 030004664780 0031 %l4-7: e000 0200 0001 0005 02a101066f80 unix:ktl0+48 (7, 0, 18a4800, 30007998a00, 30007998a00, 180c000) %l0-3: 0003 1400 004480001602 01019840 %l4-7: 0300020f4200 0003 02a101067030 02a1010670d0 SUNW,UltraSPARC-IIe:bcopy+1554 (fcfff8667600, 30007998a00, 0, 140, 1, 72bb1) %l0-3: 0001 03000799c648 0008 0300020faab0 %l4-7: 0002 01f8 02a1010672d0 zfs:zfsctl_ops_root+b75c8d0 (30007996f40, 30003e82860, , 3000799c5d8, 3000799c590, 2) %l0-3: 03000799c538 434b 030001a25500 %l4-7: 0001 0020 0002 030007996ff0 02a101067380 zfs:dnode_reallocate+150 (10e, 13, 3000799c538, 10e, 0, 30003e82860) %l0-3: 7bada800 0011 03000799c590 0200 %l4-7: 0020 030007996f40 030007996f40 0013 02a101067430 zfs:dmu_object_reclaim+80 (0, 0, 13, 200, 11, 7bada400) %l0-3: 0008 0007 0001 1af0 %l4-7: 03072b00 1aef 030003e82860 02a1010674f0 zfs:restore_object+1b8 (2a101067710, 300038da6c8, 2a1010676c8, 11, 30003e82860, 200) %l0-3: 0002 010e 0010 %l4-7: 4a004000 0004 010e 02a1010675b0 zfs:dmu_recvbackup+608 (300036b7a00, 300036b7cd8, 300036b7b30, 300075159c0, 1, 0) %l0-3: 0040 02a101067710 0138 030004664780 %l4-7: 0002f5bacbac 0200 0001 02a101067770 zfs:zfs_ioc_recvbackup+38 (300036b7000, 0, 0, 0, 9, 0) %l0-3: 0004 0064 %l4-7: 0300036b700f 0031 02a101067820 zfs:zfsdev_ioctl+160 (70336c00, 5d, ffbfee40, 1f, 7c, e68) %l0-3: 0300036b7000 007c %l4-7: 7bacd668 703371e0 02e8 70336ef8 02a1010678d0 genunix:___const_seg_90212+1c60c (30006705600, 5a1f, ffbfee40, 13, 300046d9148, 11f86c8) %l0-3: 030004be2200 030004be2200 0004 030004664780 %l4-7: 0003 0001 018a5c00 02a101067990 genunix:ioctl+184 (4, 3000438c9a0, ffbfee40, ff38db68, 40350, 5a1f) %l0-3: 0004 14da %l4-7: 0001 syncing file systems... 2 1 done skipping system dump - no dump device configured rebooting... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Re: ZFS imported simultaneously on 2 systems...
[...] a product which is *not* currently multi-host-aware to behave in the same safe manner as one which is.

That's the point we figured out while testing it ;) I just wanted to have our thoughts reviewed by other ZFS users. Our next step, IF the failover had succeeded, would have been to create a little ZFS agent for a VCS testing cluster. We haven't used Sun Cluster and won't use it in the future.

regards Mathias This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
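To illustrate what such a VCS-style agent's entry points might boil down to, here is a minimal sketch. The pool name and script layout are hypothetical; the critical assumption is that the cluster framework, not ZFS, guarantees the pool is no longer active on the peer before "online" is ever called.

#!/bin/sh
# Hypothetical failover agent sketch for a ZFS pool resource.
POOL=swimmingpool
case "$1" in
  online)  zpool import -f "$POOL" ;;              # peer must already be fenced/down
  offline) zpool export "$POOL" ;;                 # hand the pool back cleanly
  monitor) zpool list "$POOL" > /dev/null 2>&1 ;;  # exit 0 if the pool is imported here
esac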
Re[2]: [zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Hello Frank, Tuesday, September 12, 2006, 9:41:05 PM, you wrote:

FC It would be interesting to have a zfs enabled HBA to offload the checksum
FC and parity calculations. How much of zfs would such an HBA have to
FC understand?

That won't be end-to-end checksumming anymore, right? That way you might as well disable ZFS checksumming entirely and rely only on HW RAID.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re[2]: [zfs-discuss] Re: Re: ZFS forces system to paging to the point it is
Hello Philippe, It was recommended to lower ncsize and I did (to the default of ~128K). So far it has worked ok for the last few days, staying at about 1GB free ram (fluctuating between 900MB and 1.4GB). Do you think it's a long-term solution, or could the problem surface again with more load and more data even with the current ncsize value? -- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Memory Usage
Hello Thomas, Tuesday, September 12, 2006, 7:40:25 PM, you wrote:

TB Hi,
TB We have been using zfs for a couple of months now, and, overall, really
TB like it. However, we have run into a major problem -- zfs's memory
TB requirements
TB crowd out our primary application. Ultimately, we have to reboot the
TB machine
TB so there is enough free memory to start the application.

What bad behavior exactly did you notice? In general, if an app needs memory, ZFS should free it - however, that doesn't always work well right now.

TB What I would like is:
TB 1) A way to limit the size of the cache (a gig or two would be fine
TB for us)

You can't.

TB 2) A way to clear the caches -- hopefully, something faster than
TB rebooting
TB the machine.

Export/import the pool. Alternatively, export the pool and unload the zfs module.

-- Best regards, Robert mailto:[EMAIL PROTECTED] http://milek.blogspot.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
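Spelled out, the workaround Robert describes might look like the following. The pool name is hypothetical; unloading the module can only succeed when no pools or ZFS filesystems remain in use, and the cache gets repopulated once you reimport.

# zpool export tank
# modinfo | grep zfs       (note the module id in the first column)
# modunload -i <id>
# zpool import tank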
[zfs-discuss] Re: Re[2]: System hang caused by a bad snapshot
Hello Matthew, Tuesday, September 12, 2006, 7:57:45 PM, you wrote:

MA Ben Miller wrote: I had a strange ZFS problem this morning. The entire system would hang when mounting the ZFS filesystems. After trial and error I determined that the problem was with one of the 2500 ZFS filesystems. When mounting that user's home the system would hang and need to be rebooted. After I removed the snapshots (9 of them) for that filesystem everything was fine. I don't know how to reproduce this and didn't get a crash dump. I don't remember seeing anything about this before so I wanted to report it and see if anyone has any ideas.

MA Hmm, that sounds pretty bizarre, since I don't think that mounting a
MA filesystem really interacts with snapshots at all.

MA Unfortunately, I don't think we'll be able to diagnose this without a
MA crash dump or reproducibility. If it happens again, force a crash dump
MA while the system is hung and we can take a look at it.

Maybe it wasn't hung after all. I've seen similar behavior here sometimes. Were the disks used in the pool actually working?

There was lots of activity on the disks (iostat and status LEDs) until it got to this one filesystem and everything stopped. 'zpool iostat 5' stopped running, the shell wouldn't respond and activity on the disks stopped. This fs is relatively small (175M used of a 512M quota).

Sometimes it takes a lot of time (30-50 minutes) to mount a file system - it's rare, but it happens. And during this ZFS reads from those disks in the pool. I did report it here some time ago.

In my case the system crashed during the evening and it was left hung up when I came in during the morning, so it was hung for a good 9-10 hours.

Ben This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
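For reference, capturing that dump on a hung SPARC box generally means breaking to the OBP from the console and forcing a panic. A sketch, assuming console access and a dump device already configured:

# dumpadm                    (beforehand: confirm the dump device and savecore directory)
  ... send a break (Stop-A, or ~# on a serial console) when the hang occurs ...
ok sync                      (forces a panic, writes the crash dump, then reboots)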
[zfs-discuss] when zfs enabled java
My customer is running java on a ZFS file system. His platform is Solaris 10 x86 on an SF X4200. When he enabled ZFS, his memory of 18 gigs drops to 2 gigs rather quickly. I had him do a # ps -e -o pid,vsz,comm | sort -n +1 and it came back (the culprit application you see is java):

507 89464 /usr/bin/postmaster
515 89944 /usr/bin/postmaster
517 91136 /usr/bin/postmaster
508 96444 /usr/bin/postmaster
516 98088 /usr/bin/postmaster
503 3449580 /usr/jre1.5.0_07/bin/amd64/java
512 3732468 /usr/jre1.5.0_07/bin/amd64/java

Here is what the customer responded: Well, Java is a memory hog, but it's not the leak -- it's the application. Even after it fails due to lack of memory, the memory is not reclaimed and we can no longer restart it.

Is there a bug on zfs? I did not find one in sunsolve but then again I might have been searching for the wrong thing. We have done some sleuth work and are starting to think our problem might be ZFS -- the new file system Sun supports. The documentation for ZFS states that it tries to cache as much as it can, and it uses kernel memory for the cache. That would explain memory gradually disappearing. ZFS can give memory back, but it does not do so quickly. So, is there any way to check that? If it turns out to be the problem...

1) Is there a way to limit the size of ZFS's caches? If not, then
2) Is there a way to clear ZFS's cache? If not, then
3) Is there a way to force the Java VM to take a certain amount of memory on startup and never give it back? Xms does not appear to work.

Thanks, Jill === S U N M I C R O S Y S T E M S I N C. Jill Manfield - TSE-Alternate Platform Team email: [EMAIL PROTECTED] phone: (800)USA-4SUN (Reference your case #) address: 1617 Southwood Drive Nashua, NH 03063 mailstop: NSH-01-B287 Mgr: Dave O'Connor: [EMAIL PROTECTED] Submit, View and Update tickets at http://www.sun.com/service/online = ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Snapshots and backing store
Hi, There's something really bizarre in the ZFS snapshot specs: "Uses no separate backing store."

Hum... if I want to share one physical volume somewhere in my SAN as THE snapshot backing store... it becomes impossible to do! Really bad. Is there any chance of having a backing-store-file option in a future release?

In the same vein, it would be great to have some sort of property to add a disk/LUN/physical space to a pool, reserved only for the backing store. Right now, the only thing I can see to prevent users from using my backing-store space for their own usage is to set quotas.

Nico This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshots and backing store
On Wed, Sep 13, 2006 at 07:38:22AM -0700, Nicolas Dorfsman wrote: There's something really bizarre in the ZFS snapshot specs: "Uses no separate backing store."

It's not at all bizarre once you understand how ZFS works. I'd suggest reading through some of the documentation available at http://www.opensolaris.org/os/community/zfs/docs/ , in particular the slides available there.

Hum... if I want to share one physical volume somewhere in my SAN as THE snapshot backing store... it becomes impossible to do! Really bad. Is there any chance of having a backing-store-file option in a future release?

Doing this would have a significant hit on performance if nothing else. Currently when you do a write to a volume which is snapshotted the system has to:
1) Write the new data
(Yes, that's it - one step. OK, so I'm ignoring metadata, but...)
If there was a dedicated backing store, this would change to:
1) Read the old data
2) Write the old data to the backing store
3) Write the new data
4) Free the old data (ok, so that's metadata only, but hey)

ZFS isn't copy-on-write in the same way that things like ufssnap are. ufssnap is copy-on-write in that when you write something, it copies out the old data and writes it somewhere else (the backing store). ZFS doesn't need to do this - it simply writes the new data to a new location, and leaves the old data where it is. If that old data is needed for a snapshot then it's left unchanged, if it's not then it's freed.

Scott ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
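A quick way to see that no separate backing store is involved: the snapshot appears instantly and accounts for no space of its own until live data diverges from it. The dataset names and figures below are illustrative only:

# zfs snapshot tank/home@before
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
tank/home         1.00G  32.0G  1.00G  /tank/home
tank/home@before      0      -  1.00G  -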
Re: [zfs-discuss] Re: Re: ZFS forces system to paging to the point it is
Robert Milkowski wrote: Hello Philippe, It was recommended to lower ncsize and I did (to the default of ~128K). So far it works ok for the last few days, staying at about 1GB free ram (fluctuating between 900MB and 1.4GB). Do you think it's a long term solution or could the problem surface again with more load and more data even with the current ncsize value?

Robert, I don't think this should be impacted too much by load/data; as long as the DNLC is able to evict, you should be in good shape. We are still working on a fix for the root cause of this issue, however. -Mark ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Snapshots and backing store
Nicolas Dorfsman wrote: Hi, There's something really bizarre in ZFS snaphot specs : Uses no separate backing store. . Hum...if I want to mutualize one physical volume somewhere in my SAN as THE snaphots backing-store...it becomes impossible to do ! Really bad. Is there any chance to have a backing-store-file option in a future release ? In the same idea, it would be great to have some sort of propertie to add a disk/LUN/physical_space to a pool, only reserved to backing-store. At now, the only thing I see to disallow users to use my backing-store space for their usage is to put quota. If you want to copy your filesystems (or snapshots) to other disks, you can use 'zfs send' to send them to a different pool (which may even be on a different machine!). --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
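A minimal sketch of that approach; the pool, dataset, snapshot, and host names are all hypothetical:

# zfs snapshot tank/data@backup1
# zfs send tank/data@backup1 | zfs receive backuppool/data                            (to another local pool)
# zfs send tank/data@backup1 | ssh otherhost zfs receive backuppool/data              (to another machine)
# zfs send -i backup1 tank/data@backup2 | ssh otherhost zfs receive backuppool/data   (later, incremental)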
[zfs-discuss] Re: Re: Bizarre problem with ZFS filesystem
I ran the DTrace script and the resulting output is rather large (1 million lines and 65MB), so I won't burden this forum with that much data. Here are the top 100 lines from the DTrace output. Let me know if you need the full output and I'll figure out a way for the group to get it. dtrace: description 'fbt:zfs::' matched 2404 probes CPU FUNCTION 520 - zfs_lookup 2929705866442880 520- zfs_zaccess 2929705866448160 520 - zfs_zaccess_common 2929705866451840 520- zfs_acl_node_read2929705866455040 520 - zfs_acl_node_read_internal 2929705866458400 520- zfs_acl_alloc2929705866461040 520- zfs_acl_alloc2929705866462880 520 - zfs_acl_node_read_internal 2929705866464080 520- zfs_acl_node_read2929705866465600 520- zfs_ace_access 2929705866467760 520- zfs_ace_access 2929705866468880 520- zfs_ace_access 2929705866469520 520- zfs_ace_access 2929705866470320 520- zfs_acl_free 2929705866471920 520- zfs_acl_free 2929705866472960 520 - zfs_zaccess_common 2929705866474720 520- zfs_zaccess 2929705866476320 520- zfs_dirlook 2929705866478320 520 - zfs_dirent_lock2929705866480880 520 - zfs_dirent_lock2929705866486560 520 - zfs_dirent_unlock 2929705866489840 520 - zfs_dirent_unlock 2929705866491600 520- zfs_dirlook 2929705866492560 520 - zfs_lookup 2929705866494080 520 - zfs_getattr2929705866499360 520- dmu_object_size_from_db 2929705866503520 520- dmu_object_size_from_db 2929705866507920 520 - zfs_getattr2929705866509280 520 - zfs_lookup 2929705866520400 520- zfs_zaccess 2929705866521200 520 - zfs_zaccess_common 2929705866521920 520- zfs_acl_node_read2929705866523280 520 - zfs_acl_node_read_internal 2929705866524800 520- zfs_acl_alloc2929705866526000 520- zfs_acl_alloc2929705866526800 520 - zfs_acl_node_read_internal 2929705866527280 520- zfs_acl_node_read2929705866528160 520- zfs_ace_access 2929705866528720 520- zfs_ace_access 2929705866529280 520- zfs_ace_access 2929705866529920 520- zfs_ace_access 2929705866530800 520- zfs_acl_free 2929705866531360 520- zfs_acl_free 2929705866531920 520 - zfs_zaccess_common 2929705866532560 520- zfs_zaccess 2929705866533440 520- zfs_dirlook 2929705866534000 520 - zfs_dirent_lock2929705866534640 520 - zfs_dirent_lock2929705866535600 520 - zfs_dirent_unlock 2929705866536480 520 - zfs_dirent_unlock 2929705866537120 520- zfs_dirlook 2929705866537760 520 - zfs_lookup 2929705866538400 520 - zfs_getsecattr 2929705866543600 520- zfs_getacl 2929705866546240 520 - zfs_zaccess2929705866546960 520- zfs_zaccess_common 2929705866547680 520 - zfs_acl_node_read 2929705866548720 520- zfs_acl_node_read_internal 2929705866549440 520 - zfs_acl_alloc 2929705866550080 520 - zfs_acl_alloc 2929705866550720 520- zfs_acl_node_read_internal 2929705866551600 520 - zfs_acl_node_read 2929705866552160 520 - zfs_ace_access 2929705866552720 520 - zfs_ace_access 2929705866553280 520 - zfs_ace_access 2929705866554160 520 - zfs_ace_access 2929705866554720 520 - zfs_ace_access 2929705866555600 520 - zfs_ace_access 2929705866556160 520 - zfs_ace_access 2929705866557040 520 - zfs_ace_access 2929705866557600 520 - zfs_ace_access 2929705866558160 520 - zfs_ace_access 2929705866558720 520 - zfs_ace_access 2929705866559760 520 - zfs_ace_access
Re: [zfs-discuss] Proposal: multiple copies of user data
On Wed, 2006-09-13 at 02:30, Richard Elling wrote: The field data I have says that complete disk failures are the exception. I hate to leave this as a teaser, I'll expand my comments later.

That matches my anecdotal experience with laptop drives; maybe I'm just lucky, or maybe I'm just paying more attention than most to the sounds they start to make when they're having a bad hair day, but so far they've always given *me* significant advance warning of impending doom, generally by failing to read a bunch of disk sectors.

That said, I think the best use case for the copies > 1 config would be in systems with exactly two disks -- which covers most of the 1U boxes out there.

One question for Matt: when ditto blocks are used with raidz1, how well does this handle the case where you encounter one or more single-sector read errors on other drive(s) while reconstructing a failed drive? For a concrete example:

A0 B0 C0 D0 P0
A1 B1 C1 D1 P1

(A0==A1, B0==B1, ...; A^B^C^D==P)

Does the current implementation of raidz + ditto blocks cope with the case where all of A, C0, and D1 are unavailable?

- Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
It would be interesting to have a zfs enabled HBA to offload the checksum and parity calculations. How much of zfs would such an HBA have to understand? That's an interesting question. For parity, it's actually pretty easy. One can envision an HBA which took a group of related write commands and computed the parity on the fly, using it for a final write command. This would, however, probably limit the size of a block that could be written to whatever amount of memory was available for buffering on the HBA. (Of course, memory is relatively cheap these days, but it's still not free, so the HBA might have only a few megabytes.) The checksum is more difficult. If you're willing to delay writing an indirect block until all of its children have been written [*], then we can just compute the checksum for each block as it goes out, and that's easy [**] -- easier than the parity, in fact, since there's no buffering required beyond the checksum itself. ZFS in fact does delay this write at present. However, I've argued in the past that ZFS shouldn't delay it, but should write indirect blocks in parallel with the data blocks. It would be interesting to determine whether the performance improvement of doing checksums on the HBA would outweigh the potential benefit of writing indirect blocks in parallel. Maybe it would for larger writes. Anyone got an FPGA programmer and an open-source SATA implementation? :-) (Unfortunately storage protocols have a complex analog side, and except for 1394, I'm not aware of any implementations that separate the digital/analog, which makes prototyping a lot harder, at least without much more detailed documentation on the controllers than you're likely to find.) -- Anton [*] Actually, you don't need to delay until the writes have made it to disk, but since you want to compute the checksum as the data goes out to the disk rather than making a second pass over it, you'd need to wait until the data has at least been sent to the drive cache. [**] For SCSI and FC, there's added complexity in that the drives can request data out-of-order. You can disable this but at the cost of some performance on high-end drives. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: when zfs enabled java
Jill Manfield writes: My customer is running java on a ZFS file system. His platform is Soalris 10 x86 SF X4200. When he enabled ZFS his memory of 18 gigs drops to 2 gigs rather quickly. I had him do a # ps -e -o pid,vsz,comm | sort -n +1 and it came back: The culprit application you see is java: 507 89464 /usr/bin/postmaster 515 89944 /usr/bin/postmaster 517 91136 /usr/bin/postmaster 508 96444 /usr/bin/postmaster 516 98088 /usr/bin/postmaster 503 3449580 /usr/jre1.5.0_07/bin/amd64/java 512 3732468 /usr/jre1.5.0_07/bin/amd64/java Here is what the customer responded: Well, Java's is a memory hog, but it's not the leak -- it's the application. Even after it fails due to lack of memory, the memory is not reclaimed and we can no longer restart it. Is there a bug on zfs? I did not find one in sunsolve but then again I might have been searching the wrong thing. Assuming you run S10U2, you may be hit by this one: 4034947 anon_swap_adjust(), anon_resvmem() should call kmem_reap() if availrmem is low. Fixed in snv_42. It would show up as bad return code from either of the above function when java fails to startup. -r ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Snapshots and backing store
Well.

ZFS isn't copy-on-write in the same way that things like ufssnap are. ufssnap is copy-on-write in that when you write something, it copies out the old data and writes it somewhere else (the backing store). ZFS doesn't need to do this - it simply writes the new data to a new location, and leaves the old data where it is. If that old data is needed for a snapshot then it's left unchanged, if it's not then it's freed.

We need to think of ZFS as ZFS, and not just as a new filesystem! I mean, the whole concept is different. So, what could be the best architecture?

With UFS, I used to have separate metadevices/LUNs for each application. With ZFS, I thought it would be nice to use a separate pool for each application. But that means either multiplying the snapshot backing store, OR dynamically removing/adding that space/LUN to whichever pool we need to back up. Knowing that I can't serialize backups, my only option is to multiply the reservations for backing stores. Ugh!

Another option would be to create a single pool and put all applications in it... I don't think of this as a solution.

Any suggestion? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Snapshots and backing store
If you want to copy your filesystems (or snapshots) to other disks, you can use 'zfs send' to send them to a different pool (which may even be on a different machine!).

Oh no! That means copying the whole filesystem. The goal here is definitely to snapshot the filesystem and then back up the snapshot.

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
On September 13, 2006 6:09:50 AM -0700 Mathias F [EMAIL PROTECTED] wrote: [...] a product which is *not* currently multi-host-aware to behave in the same safe manner as one which is. That`s the point we figured out while testing it ;) I just wanted to have our thoughts reviewed by other ZFS users. Our next steps IF the failover would have succeeded would be to create a little ZFS-agent for a VCS testing cluster. We haven't used Sun Cluster and won't use it in future. /etc/zfs/zpool.cache is used at boot time to find what pools to import. Remove it when the system boots and after it goes down and comes back up it won't import any pools. Not quite the same as not importing if they are imported elsewhere, but perhaps close enough for you. On September 13, 2006 10:15:28 PM +1000 James C. McPherson [EMAIL PROTECTED] wrote: As I understand things, SunCluster 3.2 is expected to have support for HA-ZFS and until that version is released you will not be running in a supported configuration and so any errors you encounter are *your fault alone*. Didn't we have the PMC (poor man's cluster) talk last week as well? I understand the objection to mickey mouse configurations, but I don't understand the objection to (what I consider) simply improving safety. Why again shouldn't zfs have a hostid written into the pool, to prevent import if the hostid doesn't match? And why should failover be limited to SC? Why shouldn't VCS be able to play? Why should SC have secrets on how to do failover? After all, this is OPENsolaris. And anyway many homegrown solutions (the kind I'm familiar with anyway) are of high quality compared to commercial ones. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
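As a concrete (and decidedly unsupported) sketch of Frank's suggestion, an early boot script could move the cache file aside so the node does not auto-import pools it held before going down, leaving imports to the failover tooling. Whether this runs early enough depends on when the zfs module loads and the filesystems are mounted, so treat it purely as an illustration:

#!/bin/sh
# Hypothetical rc script: prevent automatic re-import of previously held pools.
CACHE=/etc/zfs/zpool.cache
if [ -f "$CACHE" ]; then
        mv "$CACHE" "$CACHE.boot.$$"    # keep a copy for inspection rather than deleting it
fi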
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
just measured quickly that a 1.2GHz SPARC can do [400-500]MB/sec of encoding (time spent in the misnamed function vdev_raidz_reconstruct) for a 3 disk raid-z group.

Strange, that seems very low. Ah, I see. The current code loops through each buffer, either copying or XORing it into the parity. This likely would perform quite a bit better if it were reworked to go through more than one buffer at a time, doing the XOR. (Reading the partial parity is expensive.)

Actually, this would be an instance where using assembly language or even processor-dependent code would be useful. Since the prefetch buffers on UltraSPARC are only applicable to floating-point loads, we should probably use prefetch and the VIS xor instructions. (Even calling bcopy instead of using the existing copy loop would help.)

FWIW, on large systems we ought to be aiming to sustain 8 GB/s or so of writes, and using 16 CPUs for just parity computation seems inordinately painful. :-)

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
With ZFS however the in-between cache is obsolete, as individual disk caches can be used directly. I also openly question whether even the dedicated RAID HW is faster than the newest CPUs in modern servers.

Individual disk caches are typically in the 8-16 MB range; for 15 disks, that gives you about 256 MB. A RAID with 15 drives behind it might have 2-4 GB of cache. That's a big improvement.

The dedicated RAID hardware may not be faster than the newest CPUs, but as a friend of mine has pointed out, even though delegating a job to somebody else often means it's done more slowly, it frees him up to do his other work. (It's also worth pondering the difference between latency and bandwidth. When parity is computed inline with the data path, as is often the case for hardware controllers, the bandwidth is relatively low since it's happening at the speed of data transfer to an individual disk, but the latency is effectively zero, since it's not adding any time to the transfer.)

This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On Wed, Sep 13, 2006 at 09:14:36AM -0700, Frank Cusack wrote: Why again shouldn't zfs have a hostid written into the pool, to prevent import if the hostid doesn't match? See: 6282725 hostname/hostid should be stored in the label Keep in mind that this is not a complete clustering solution - only a mechanism to prevent administrator misconfiguration. In particular, it's possible for one host to be doing a failover, and the other host open the pool before the hostid has been written to the disk. And why should failover be limited to SC? Why shouldn't VCS be able to play? Why should SC have secrets on how to do failover? After all, this is OPENsolaris. And anyway many homegrown solutions (the kind I'm familiar with anyway) are of high quality compared to commercial ones. I'm not sure I understand this. There is no built-in clustering support for UFS - simultaneously mounting the same UFS filesystem on different hosts will corrupt your data as well. You need some sort of higher level logic to correctly implement clustering. This is not a SC secret - it's how you manage non-clustered filesystems in a failover situation. Storing the hostid as a last-ditch check for administrative error is a reasonable RFE - just one that we haven't yet gotten around to. Claiming that it will solve the clustering problem oversimplifies the problem and will lead to people who think they have a 'safe' homegrown failover when in reality the right sequence of actions will irrevocably corrupt their data. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per-file copies, but it's more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more than one copy in case of a problem down the road.
Re: [zfs-discuss] Re: Re: Proposal: multiple copies of user data
On Sep 12, 2006, at 2:55 PM, Celso wrote: On 12/09/06, Celso [EMAIL PROTECTED] wrote: One of the great things about zfs is that it protects not just against mechanical failure, but against silent data corruption. Having this available to laptop owners seems to me to be important to making zfs even more attractive. I'm not arguing against that. I was just saying that *if* this was useful to you (and you were happy with the dubious resilience/performance benefits) you can already create mirrors/raidz on a single disk by using partitions as building blocks. There's no need to implement the proposal to gain that. It's not as granular though, is it? In the situation you describe: ...you split one disk in two. You then have effectively two partitions with which you can then create a new mirrored zpool. Then everything is mirrored. Correct? With ditto blocks, you can selectively add copies (seeing as how filesystems are so easy to create on zfs). If you are only concerned with copies of your important documents and email, why should /usr/bin be mirrored? That's my opinion anyway. I always enjoy choice, and I really believe this is a useful and flexible one. Celso One item missed in the discussion is the idea that individual ZFS filesystems can be created in a pool that will have the duplicate-block behavior, the idea being that only a small subset of your data may be critical. This allows additional flexibility in a single-disk configuration. Rather than sacrificing 1/2 of the pool storage, I can say that my critical documents will reside in a filesystem that keeps two copies on disk. I think it's a great idea. It may not be for everybody, but I think the ability to treat some of my files as critical is an excellent feature. -Gregory Shaw, IT Architect, ITCTO Group, Sun Microsystems Inc.
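To make Gregory's scenario concrete, here is roughly what the administration might look like if the proposed copies property is delivered as described (a sketch only; the pool layout, dataset names and the -o shorthand are assumptions, not confirmed syntax):

# zpool create tank c0t0d0
# zfs create -o copies=2 tank/docs
# zfs create tank/scratch

Only data written to tank/docs would get the extra ditto copy; tank/scratch and everything else in the pool would stay at the default of one copy.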
[zfs-discuss] Comments on a ZFS multiple use of a pool, RFE.
I filed this RFE earlier; since there is no way for non-Sun personnel to see this RFE for a while, I am posting it here and asking for feedback from the community. [Fwd: CR 6470231 Created P5 opensolaris/triage-queue Add an inuse check that is inforced even if import -f is used.] *Synopsis*: Add an inuse check that is inforced even if import -f is used. http://bt2ws.central.sun.com/CrPrint?id=6470231 *Change Request ID*: 6470231 Product: solaris Category: opensolaris Subcategory: triage-queue Type: RFE Status: 1-Dispatched Priority: 5-Very Low Responsible Manager: [EMAIL PROTECTED] Initial Evaluator: [EMAIL PROTECTED] Keywords: opensolaris === *Description* Category kernel Sub-Category zfs Description Currently many people have been trying to import ZFS pools on multiple systems at once. Currently this is unsupported, and causes massive data corruption to the pool. ZFS should refuse to import any pool that was used in the last 5 minutes and was not cleanly exported; this prevents the filesystem from being mounted on multiple systems at once. Frequency Always Regression No Steps to Reproduce Import the same storage pool on more than one machine or domain. Expected Result
# zpool import -f datapool1
Error: ZFS pool datapool1 is currently imported on another system and was accessed less than 5 minutes ago; ZFS does not currently support concurrent access. If this pool is no longer in use on the other system, please export it from the other system or try again in 5 minutes.
Actual Result
# zpool import -f datapool1
# a few minutes later the system crashes because of concurrent use.
Error Message(s)
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On September 13, 2006 9:32:50 AM -0700 Eric Schrock [EMAIL PROTECTED] wrote: On Wed, Sep 13, 2006 at 09:14:36AM -0700, Frank Cusack wrote: Why again shouldn't zfs have a hostid written into the pool, to prevent import if the hostid doesn't match? See: 6282725 hostname/hostid should be stored in the label Keep in mind that this is not a complete clustering solution - only a mechanism to prevent administrator misconfiguration. In particular, it's possible for one host to be doing a failover, and the other host open the pool before the hostid has been written to the disk. And why should failover be limited to SC? Why shouldn't VCS be able to play? Why should SC have secrets on how to do failover? After all, this is OPENsolaris. And anyway many homegrown solutions (the kind I'm familiar with anyway) are of high quality compared to commercial ones. I'm not sure I understand this. There is no built-in clustering support for UFS - simultaneously mounting the same UFS filesystem on different hosts will corrupt your data as well. You need some sort of higher level logic to correctly implement clustering. This is not a SC secret - it's how you manage non-clustered filesystems in a failover situation. But UFS filesystems don't automatically get mounted (well, we know how to not automatically mount them in /etc/vfstab). The SC secret is in how importing of pools is prevented at boot time. Of course you need more than that, but my complaint was against the idea that you cannot build a reliable solution yourself, instead of just sharing info about zpool.cache albeit with a warning. Storing the hostid as a last-ditch check for administrative error is a reasonable RFE - just one that we haven't yet gotten around to. Claiming that it will solve the clustering problem oversimplifies the problem and will lead to people who think they have a 'safe' homegrown failover when in reality the right sequence of actions will irrevocably corrupt their data. Thanks for that clarification, very important info. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per file copies, but its more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more then one copy in case of a problem down the road. Actually, this is a perfect use case for setting the copies=2 property after installation. The original binaries are quite replaceable; the customizations and personal files created later on are not. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
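If the copies property behaves like other ZFS dataset properties, Bart's after-the-fact approach would presumably be a one-liner per dataset (the dataset name here is hypothetical, and, much like compression, the setting would only affect blocks written after it is changed):

# zfs set copies=2 tank/export/home

Existing files would need to be rewritten or restored to pick up their second copy.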
Re: [zfs-discuss] Snapshots and backing store
Matthew Ahrens wrote: Nicolas Dorfsman wrote: Hi, There's something really bizarre in the ZFS snapshot specs: "Uses no separate backing store." Hum... if I want to share one physical volume somewhere in my SAN as THE snapshot backing store... it becomes impossible to do! Really bad. Is there any chance of a backing-store-file option in a future release? Along the same lines, it would be great to have some sort of property to add a disk/LUN/physical space to a pool, reserved only for backing store. For now, the only way I see to prevent users from using my backing-store space for their own data is to set quotas. If you want to copy your filesystems (or snapshots) to other disks, you can use 'zfs send' to send them to a different pool (which may even be on a different machine!). The confusion is probably around the word snapshot and all its various usages over the years. The one particular case where people will probably slam their head into a wall is exporting snapshots to other hosts. If you can get the customer or tech to think in terms of where they want the data and how, instead of snapshots, or lun copies, or whatever, it makes for an easier conversation.
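A minimal sketch of what Matt is suggesting, with hypothetical pool, dataset and host names: take a snapshot, then send it to a pool that lives wherever you want your "backing store" to be.

# zfs snapshot tank/proj@tuesday
# zfs send tank/proj@tuesday | ssh backuphost zfs receive sanpool/proj

The snapshot itself still consumes space in the original pool, but the copy in the other pool (possibly on another machine) can be sized and quota'd independently.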
Re: [zfs-discuss] Proposal: multiple copies of user data
Bart Smaalders wrote: Torrey McMahon wrote: eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per file copies, but its more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more then one copy in case of a problem down the road. Actually, this is a perfect use case for setting the copies=2 property after installation. The original binaries are quite replaceable; the customizations and personal files created later on are not. We've been talking about user data but the chance of corrupting something on disk and then detecting a bad checksum on something in /kernel is also possible. (Disk drives do weird things from time to time.) If I was sufficiently paranoid I would want everything required to get into single-user mode, some other stuff, and then my user data, duplicated to avoid any issues. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On Sep 13, 2006, at 12:32 PM, Eric Schrock wrote: Storing the hostid as a last-ditch check for administrative error is a reasonable RFE - just one that we haven't yet gotten around to. Claiming that it will solve the clustering problem oversimplifies the problem and will lead to people who think they have a 'safe' homegrown failover when in reality the right sequence of actions will irrevocably corrupt their data. HostID is handy, but it'll only tell you who MIGHT or MIGHT NOT have control of the pool. Such an RFE would be even more worthwhile if it included something such as a time stamp. This time stamp (or similar time-oriented signature) would be updated regularly (based on some internal ZFS event). If this stamp goes for an arbitrary length of time without being updated, another host in the cluster could force-import the pool on the assumption that the original host is no longer able to communicate with the zpool. This is a simple idea description, but perhaps worthwhile if you're already going to change the label structure to add the hostid. /dale
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On September 13, 2006 1:28:47 PM -0400 Dale Ghent [EMAIL PROTECTED] wrote: On Sep 13, 2006, at 12:32 PM, Eric Schrock wrote: Storing the hostid as a last-ditch check for administrative error is a reasonable RFE - just one that we haven't yet gotten around to. Claiming that it will solve the clustering problem oversimplifies the problem and will lead to people who think they have a 'safe' homegrown failover when in reality the right sequence of actions will irrevocably corrupt their data. HostID is handy, but it'll only tell you who MIGHT or MIGHT NOT have control of the pool. Such an RFE would even more worthwhile if it included something such as a time stamp. This time stamp (or similar time-oriented signature) would be updated regularly (bases on some internal ZFS event). If this stamp goes for an arbitrary length of time without being updated, another host in the cluster could force import it on the assumption that the original host is no longer able to communicate to the zpool. This is a simple idea description, but perhaps worthwhile if you're already going to change the label structure for adding the hostid. Sounds cool! Better than depending on an out-of-band heartbeat. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool always thinks it's mounted on another system
Hi zfs-discuss, I was running Solaris 11, b42 on x86, and I tried upgrading to b44. I didn't have space on the root for live_upgrade, so I booted from disc to upgrade, but it failed on every attempt, so I ended up blowing away / and doing a clean b44 install. Now the zpool that was attached to that system won't stop thinking that it's mounted on another system, regardless of what I try. On boot, the system thinks the pool is mounted elsewhere, and won't mount it unless I log in and zpool import -f. I tried zpool export followed by import, and that required no -f, but on reboot, lo, the problem returned. I even tried destroying and reimporting the pool, which led to this hilarious sequence:
# zpool import
no pools available to import
# zpool import -D
  pool: moonside
    id: 8290331144559232496
 state: ONLINE (DESTROYED)
action: The pool can be imported using its name or numeric identifier. The pool was destroyed, but can be imported using the '-Df' flags.
config:
        moonside  ONLINE
          raidz1  ONLINE
            c2t0d0 ONLINE
            c2t1d0 ONLINE
            c2t2d0 ONLINE
            c2t3d0 ONLINE
            c2t4d0 ONLINE
            c2t5d0 ONLINE
            c2t6d0 ONLINE
# zpool import -D moonside
cannot import 'moonside': pool may be in use from other system
use '-f' to import anyway
#
This is either a bug or a missing feature (the ability to make a filesystem stop thinking it's mounted somewhere else) - anybody have any ideas? Thanks, - Rich
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
Frank Cusack wrote: Sounds cool! Better than depending on an out-of-band heartbeat. I disagree; it sounds really, really bad. If you want a high-availability cluster you really need a faster interconnect than spinning rust, which is probably the slowest interface we have now! -- Darren J Moffat
Re: [zfs-discuss] zpool always thinks it's mounted on another system
Can you send the output of 'zdb -l /dev/dsk/c2t0d0s0' ? So you do the 'zpool import -f' and all is well, but then when you reboot, it doesn't show up, and you must import it again? Can you send the output of 'zdb -C' both before and after you do the import? Thanks, - Eric On Wed, Sep 13, 2006 at 01:40:13PM -0400, Rich wrote: Hi zfs-discuss, I was running Solaris 11, b42 on x86, and I tried upgrading to b44. I didn't have space on the root for live_upgrade, so I booted from disc to upgrade, but it failed on every attempt, so I ended up blowing away / and doing a clean b44 install. Now the zpool that was attached to that system won't stop thinking that it's mounted on another system, regardless of what I try. On boot, the system thinks the pool is mounted elsewhere, and won't mount it unless I log in and zpool import -f. I tried zpool export followed by import, and that required no -f, but on reboot, lo, the problem returned. I even tried destroying and reimporting the pool, which led to this hilarious sequence: # zpool import no pools available to import # zpool import -D pool: moonside id: 8290331144559232496 state: ONLINE (DESTROYED) action: The pool can be imported using its name or numeric identifier. The pool was destroyed, but can be imported using the '-Df' flags. config: moonsideONLINE raidz1ONLINE c2t0d0 ONLINE c2t1d0 ONLINE c2t2d0 ONLINE c2t3d0 ONLINE c2t4d0 ONLINE c2t5d0 ONLINE c2t6d0 ONLINE # zpool import -D moonside cannot import 'moonside': pool may be in use from other system use '-f' to import anyway # This is either a bug or a missing feature (the ability to make a filesystem stop thinking it's mounted somewhere else) - anybody have any ideas? Thanks, - Rich ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: Re: [zfs-discuss] marvel cards.. as recommended
On 9/12/06, James C. McPherson [EMAIL PROTECTED] wrote: Joe Little wrote: So, people here recommended the Marvell cards, and one even provided a link to acquire them for SATA jbod support. Well, this is what the latest bits (B47) say: Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx0: Could not attach, unsupported chip stepping or unable to get the chip stepping Sep 12 13:51:54 vram marvell88sx: [ID 679681 kern.warning] WARNING: marvell88sx1: Could not attach, unsupported chip stepping or unable to get the chip stepping Any takers on how to get around this one? You could start by providing the output from prtpicl -v and prtconf -v as well as /usr/X11/bin/scanpci -v -V 1 so we know which device you're actually having a problem with. Is the pci vendor+deviceid for that card listed in your /etc/driver_aliases file against the marvell88sx driver? James I don't know if you really want all those large files, but /etc/driver_aliases lists: marvell88sx pci11ab,6081.9 [EMAIL PROTECTED]:~# lspci | grep Marv 03:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 07) 05:01.0 SCSI storage controller: Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller (rev 07) [EMAIL PROTECTED]:~# lspci -n | grep 11ab 03:01.0 0100: 11ab:6081 (rev 07) 05:01.0 0100: 11ab:6081 (rev 07) And it sees the module: 198 f571 9f10 62 1 marvell88sx (marvell88sx HBA Driver v1.8) Is this a support revision of the card? Is there something stupid like enabling the jumpers or some such that's required? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Loss of compression with send/receive
You want: 6421959 want zfs send to preserve properties ('zfs send -p') Which Matt is currently working on. - Eric On Thu, Sep 14, 2006 at 02:04:32AM +0800, Darren Reed wrote: Using Solaris 10, Update 2 (b9a) I've just used zfs send | zfs receive to move some filesystems from one disk to another (I'm sure this is the quickest move I've ever done!) but in doing so, I lost zfs set compression=on on those filesystems. If I create the filesystems first and enable compression, I can't receive to them (results in an error.) Is there some way around this? Patch? RFE? Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
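Until 6421959 integrates, one hedged workaround (dataset names are placeholders) is to receive first and then turn the property back on by hand; note that only blocks written after the property is set will be compressed, while the received data stays as it was sent:

# zfs send tank/fs@move | zfs receive newpool/fs
# zfs set compression=on newpool/fs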
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On Sep 13, 2006, at 1:37 PM, Darren J Moffat wrote: That might be acceptable in some environments but that is going to cause disks to spin up. That will be very unacceptable in a laptop and maybe even in some energy-conscious data centres. Introduce an option to 'zpool create'? Come to think of it, the ability to describe attributes for a pool seems to be lacking (unlike zfs volumes). What you are proposing sounds a lot like a cluster heartbeat which IMO really should not be implemented by writing to disks. That would be an extreme example of the use for this. While it *could* be used as a heartbeat mechanism, it would be useful administratively:
# zpool status foopool
Pool foopool is currently imported by host.blah.com
Import time: 4 April 2007 16:20:00
Last activity: 23 June 2007 18:42:53
...
/dale
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On Wed, Sep 13, 2006 at 06:37:25PM +0100, Darren J Moffat wrote: Dale Ghent wrote: On Sep 13, 2006, at 12:32 PM, Eric Schrock wrote: Storing the hostid as a last-ditch check for administrative error is a reasonable RFE - just one that we haven't yet gotten around to. Claiming that it will solve the clustering problem oversimplifies the problem and will lead to people who think they have a 'safe' homegrown failover when in reality the right sequence of actions will irrevocably corrupt their data. HostID is handy, but it'll only tell you who MIGHT or MIGHT NOT have control of the pool. Such an RFE would even more worthwhile if it included something such as a time stamp. This time stamp (or similar time-oriented signature) would be updated regularly (bases on some internal ZFS event). If this stamp goes for an arbitrary length of time without being updated, another host in the cluster could force import it on the assumption that the original host is no longer able to communicate to the zpool. That might be acceptable in some environments but that is going to cause disks to spin up. That will be very unacceptable in a laptop and maybe even in some energy conscious data centres. What you are proposing sounds a lot like a cluster hear beat which IMO really should not be implemented by writing to disks. Wouldn't it be possible to implement this via SCSI reservations (where available) a la quorum devices? Ceri -- That must be wonderful! I don't understand it at all. -- Moliere pgpbrlHYCwiGr.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool always thinks it's mounted on another system
I do the 'zpool import -f moonside', and all is well until I reboot, at which point I must zpool import -f again.Below is zdb -l /dev/dsk/c2t0d0s0's output:LABEL 0 version=3 name='moonside' state=0 txg=1644418 pool_guid=8290331144559232496 top_guid=12835093579979239393 guid=7480231448190751824 vdev_tree type='raidz' id=0 guid=12835093579979239393 nparity=1 metaslab_array=13 metaslab_shift=30 ashift=9 asize=127371575296 children[0] type='disk' id=0 guid=7480231448190751824 path='/dev/dsk/c2t0d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=23 children[1] type='disk' id=1 guid=2626377814825345466 path='/dev/dsk/c2t1d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=22 children[2] type='disk' id=2 guid=16932309055791750053 path='/dev/dsk/c2t2d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=21 children[3] type='disk' id=3 guid=18145699204085538208 path='/dev/dsk/c2t3d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=20 children[4] type='disk' id=4 guid=2046828747707454119 path='/dev/dsk/c2t4d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=19 children[5] type='disk' id=5 guid=5851407888580937378 path='/dev/dsk/c2t5d0s0' devid='id1, [EMAIL PROTECTED]/a' whole_disk=1 DTL=18 children[6] type='disk' id=6 guid=10476478316210434659 path='/dev/dsk/c2t6d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=17LABEL 1 version=3 name='moonside' state=0 txg=1644418 pool_guid=8290331144559232496 top_guid=12835093579979239393 guid=7480231448190751824 vdev_tree type='raidz' id=0 guid=12835093579979239393 nparity=1 metaslab_array=13 metaslab_shift=30 ashift=9 asize=127371575296 children[0] type='disk' id=0 guid=7480231448190751824 path='/dev/dsk/c2t0d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=23 children[1] type='disk' id=1 guid=2626377814825345466 path='/dev/dsk/c2t1d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=22 children[2] type='disk' id=2 guid=16932309055791750053 path='/dev/dsk/c2t2d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=21 children[3] type='disk' id=3 guid=18145699204085538208 path='/dev/dsk/c2t3d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=20 children[4] type='disk' id=4 guid=2046828747707454119 path='/dev/dsk/c2t4d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=19 children[5] type='disk' id=5 guid=5851407888580937378 path='/dev/dsk/c2t5d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=18 children[6] type='disk' id=6 guid=10476478316210434659 path='/dev/dsk/c2t6d0s0' devid='id1, [EMAIL PROTECTED]/a' whole_disk=1 DTL=17LABEL 2 version=3 name='moonside' state=0 txg=1644418 pool_guid=8290331144559232496 top_guid=12835093579979239393 guid=7480231448190751824 vdev_tree type='raidz' id=0 guid=12835093579979239393 nparity=1 metaslab_array=13 metaslab_shift=30 ashift=9 asize=127371575296 children[0] type='disk' id=0 guid=7480231448190751824 path='/dev/dsk/c2t0d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=23 children[1] type='disk' id=1 guid=2626377814825345466 path='/dev/dsk/c2t1d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=22 children[2] type='disk' id=2 guid=16932309055791750053 path='/dev/dsk/c2t2d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=21 children[3] type='disk' id=3 guid=18145699204085538208 path='/dev/dsk/c2t3d0s0' devid='id1, [EMAIL PROTECTED]/a' whole_disk=1 DTL=20 children[4] type='disk' id=4 guid=2046828747707454119 path='/dev/dsk/c2t4d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=19 children[5] type='disk' id=5 guid=5851407888580937378 path='/dev/dsk/c2t5d0s0' devid='id1,[EMAIL 
PROTECTED]/a' whole_disk=1 DTL=18 children[6] type='disk' id=6 guid=10476478316210434659 path='/dev/dsk/c2t6d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=17 LABEL 3 version=3 name='moonside' state=0 txg=1644418 pool_guid=8290331144559232496 top_guid=12835093579979239393 guid=7480231448190751824 vdev_tree type='raidz' id=0 guid=12835093579979239393 nparity=1 metaslab_array=13 metaslab_shift=30 ashift=9 asize=127371575296 children[0] type='disk' id=0 guid=7480231448190751824 path='/dev/dsk/c2t0d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=23 children[1] type='disk' id=1 guid=2626377814825345466 path='/dev/dsk/c2t1d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=22 children[2] type='disk' id=2 guid=16932309055791750053 path='/dev/dsk/c2t2d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=21 children[3] type='disk' id=3 guid=18145699204085538208 path='/dev/dsk/c2t3d0s0' devid='id1,[EMAIL PROTECTED]/a' whole_disk=1 DTL=20 children[4] type='disk' id=4 guid=2046828747707454119 path='/dev/dsk/c2t4d0s0' devid='id1,[EMAIL
Re: [zfs-discuss] Comments on a ZFS multiple use of a pool, RFE.
On 9/13/06, Eric Schrock [EMAIL PROTECTED] wrote: There are several problems I can see: - This is what the original '-f' flag is for. I think a better approach is to expand the default message of 'zpool import' with more information, such as which was the last host to access the pool and when. The point of '-f' is that you have recognized that the pool is potentially in use, but as an administrator you've made a higher level determination that it is in fact safe to import. This would not be the first time that Solaris overrode an administrative command because it's just not safe or sane to carry out. For example: rm -rf /. - You are going to need a flag to override this behavior for clustering situations. Forcing the user to always wait 5 minutes is unacceptable. Wouldn't it be more likely for a clustering solution to use libzfs? Or we could add another method to zfs for failing over in clustering solutions; this method would then check whether the OS and the pool supported clustering at the time of import. Since clustering support is not yet released for ZFS, we have a clean slate on how ZFS deals with it. - By creating a new flag (let's say '-F'), you are just going to introduce more complexity, and customers will get equally used to issuing 'zpool import -fF', and now you're back to the same problem all over again. If 5 minutes is too long, perhaps it could be reduced to 2 minutes, with ZFS updating a value stored on the pool once a minute to record that it is in use. We could update the pool's in-use flag more often, but that seems excessive since this is only a corner case anyway. Another possible method to handle this case - more work, but with no impact on existing fast paths - would be for zpool, if the pool appears to have been accessed in the last X minutes and not exported, to watch the devices and see whether any other disk commands arrive from the old host; if any do, it's obvious that the administrator is putting the system into a state where it will crash, so isn't it better to fail the import than to crash? Any extra delay this check imposes would not break existing specifications, because importing a pool has no guaranteed fixed performance anyway - ZFS needs to find the devices and verify that all components of the pool are intact and not failed. James - A pool which is in use on another host but inactive for more than 5 minutes will fail this check (since no transactions will have been pushed), but could potentially write data after the pool has been imported. - This breaks existing behavior. The CLI utilities are documented as committed (a.k.a. stable), and breaking existing customer scripts isn't acceptable. The existing behaviour is broken; ZFS should return EBUSY if it can determine that the pool is in active use without extreme measures. This is the same behaviour that happens should a user try to mount a filesystem twice on one system. James This seems to take the wrong approach to the root problem. Depending on how you look at it, the real root problem is either: a) ZFS is not a clustered filesystem, and actively using the same pool on multiple systems (even opening said pool) will corrupt data. b) 'zpool import' doesn't present enough information for an administrator to reliably determine if a pool is actually in use on multiple systems. The former is obviously a ton of work and something we're thinking about but won't address any time soon. The latter can be addressed by presenting more useful information when 'zpool import' is run without the '-f' flag.
- Eric On Wed, Sep 13, 2006 at 12:14:06PM -0500, James Dickens wrote: I filed this RFE earlier, since there is no way for non sun personel to see this RFE for a while I am posting it here, and asking for feedback from the community. [Fwd: CR 6470231 Created P5 opensolaris/triage-queue Add an inuse check that is inforced even if import -f is used.] Inbox Assign a GTD Label to this Conversation: [Show] Statuses: Next Action, Action, Waiting On, SomeDay, Finished Contexts: Car, Desk, Email, Home, Office, Phone, Waiting References: ProjectHome, Reference Misc.: *Synopsis*: Add an inuse check that is inforced even if import -f is used. http://bt2ws.central.sun.com/CrPrint?id=6470231 *Change Request ID*: 6470231 *Synopsis*: Add an inuse check that is inforced even if import -f is used. Product: solaris Category: opensolaris Subcategory: triage-queue Type: RFE Subtype: Status: 1-Dispatched Substatus: Priority: 5-Very Low Introduced In Release: Introduced In Build: Responsible Manager: [EMAIL PROTECTED] Responsible Engineer: Initial Evaluator: [EMAIL PROTECTED] Keywords: opensolaris === *Description* Category kernel Sub-Category zfs Description Currently many people have been trying to import ZFS pools on
[zfs-discuss] Re: Comments on a ZFS multiple use of a pool, RFE.
I think there are at least two separate issues here. The first is that ZFS doesn't support multiple hosts accessing the same pool. That's simply a matter of telling people. UFS doesn't support multiple hosts, but it doesn't have any special features to prevent administrators from *trying* it. They'll just corrupt their filesystem. The second is that ZFS remembers pools and automatically imports them at boot time. This is a bigger problem, because it means that if you create a pool on host A, shut down host A, import the pool to host B, and then boot host A, your pool is automatically destroyed. The hostid solution that VxVM uses would catch this second problem, because when A came up after its reboot, it would find that -- even though it had created the pool -- it was not the last machine to access it, and could refuse to automatically mount it. If the administrator really wanted it mounted, they could force the issue. Relying on the administrator to know that they have to remove a file (the 'zpool cache') before they let the machine come up out of single-user mode seems the wrong approach to me. (By default, we'll shoot you in the foot, but we'll give you a way to unload the gun if you're fast enough and if you remember.) The hostid approach seems better to me than modifying the semantics of force. I honestly don't think the problem is administrators who don't know what they're doing; I think the problem is that our defaults are wrong in the case of shared storage. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: marvel cards.. as recommended
If I'm reading the source correctly, for the $60xx boards, the only supported revision is $09. Yours is $07, which presumably has some errata with no workaround, and which the Solaris driver refuses to support. Hope you can return it ... ? This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Re: marvel cards.. as recommended
A quick peek at the Linux source shows a small workaround in place for the 07 revision...maybe if you file a bug against Solaris to support this revision it might be possible to get it added, at least if that's the only issue. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Proposal: multiple copies of user data
Is this true for single-sector, vs. single-ZFS-block, errors? (Yes, it's pathological and probably nobody really cares.) I didn't see anything in the code which falls back on single-sector reads. (It's slightly annoying that the interface to the block device drivers loses the SCSI error status, which tells you the first sector which was bad.) This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Comments on a ZFS multiple use of a pool, RFE.
On Wed, Sep 13, 2006 at 02:29:55PM -0500, James Dickens wrote: This would not be the first time that Solaris overrode an administrative command because it's just not safe or sane to carry out. For example: rm -rf /. As I've repeated before, and will continue to repeat, it's not actually possible for ZFS to determine whether a pool is in active use (short of making ZFS a cluster-aware filesystem). Adding arbitrary delays doesn't change this fact; it only makes the failure less likely. I've given you examples of where this behavior is safe and sane and useful, so the above simplification (upon which most of the other arguments are based) isn't really valid. I'm curious why you didn't comment on my other suggestion (displaying the last accessed host and time as part of 'zpool import'), which seems to solve your problem by giving the administrator the data they need to make an appropriate decision. As Anton and others have mentioned in previous discussions, there seem to be several clear RFEs that everyone can agree with:
1. Store the hostid, hostname, and last time written as part of the label.
2. During auto-import (aka open), if the hostid is different from our own, fault the pool and generate an appropriate FMA event.
3. During manual import, display the last hostname and time accessed if the hostid is not our own and the pool is still marked ACTIVE.
This prevents administrators from shooting themselves in the foot, while still allowing explicit cluster failover to operate with more information than was available before. - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
Re: [zfs-discuss] Comments on a ZFS multiple use of a pool, RFE.
On 9/13/06, Eric Schrock [EMAIL PROTECTED] wrote: On Wed, Sep 13, 2006 at 02:29:55PM -0500, James Dickens wrote: This would not be the first time that Solaris overrode an administrative command because it's just not safe or sane to carry out. For example: rm -rf /. As I've repeated before, and will continue to repeat, it's not actually possible for ZFS to determine whether a pool is in active use (short of making ZFS a cluster-aware filesystem). Adding arbitrary delays doesn't change this fact; it only makes the failure less likely. I've given you examples of where this behavior is safe and sane and useful, so the above simplification (upon which most of the other arguments are based) isn't really valid. I disagree with this. Isn't there a way to track when the last read was? Even with the last-write time you are already recommending, you could sleep for 30 seconds, read the value from disk again and compare; even a read of the pool will likely cause a write if atime tracking is enabled on the filesystem. If someone is accessing the pool we are importing underneath us, it is a dead giveaway that we are about to explode if we continue down this path. I'm curious why you didn't comment on my other suggestion (displaying the last accessed host and time as part of 'zpool import'), which seems to solve your problem by giving the administrator the data they need to make an appropriate decision. It's a good suggestion; it just doesn't go far enough in my opinion. As Anton and others have mentioned in previous discussions, there seem to be several clear RFEs that everyone can agree with:
1. Store the hostid, hostname, and last time written as part of the label.
2. During auto-import (aka open), if the hostid is different from our own, fault the pool and generate an appropriate FMA event.
3. During manual import, display the last hostname and time accessed if the hostid is not our own and the pool is still marked ACTIVE.
This prevents administrators from shooting themselves in the foot, while still allowing explicit cluster failover to operate with more information than was available before. If this is what the community decides, I can live with it. I may even provide a patch for OpenSolaris distros that does the more intensive check; it seems to be an easy fix once #1 is complete. James - Eric -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
[zfs-discuss] Re: when zfs enabled java
Jill Manfield wrote: My customer is running java on a ZFS file system. His platform is Solaris 10 x86 on an SF X4200. When he enabled ZFS his free memory drops from 18 GB to 2 GB rather quickly. I had him do a # ps -e -o pid,vsz,comm | sort -n +1 and it came back: The culprit application you see is java:
507 89464 /usr/bin/postmaster
515 89944 /usr/bin/postmaster
517 91136 /usr/bin/postmaster
508 96444 /usr/bin/postmaster
516 98088 /usr/bin/postmaster
503 3449580 /usr/jre1.5.0_07/bin/amd64/java
512 3732468 /usr/jre1.5.0_07/bin/amd64/java
Here is what the customer responded: Well, Java is a memory hog, but it's not the leak -- it's the application. Even after it fails due to lack of memory, the memory is not reclaimed and we can no longer restart it. Is there a bug on zfs? I did not find one in sunsolve but then again I might have been searching for the wrong thing. We have done some sleuth work and are starting to think our problem might be ZFS -- the new file system Sun supports. The documentation for ZFS states that it tries to cache as much as it can, and it uses kernel memory for the cache. That would explain memory gradually disappearing. ZFS can give memory back, but it does not do so quickly. Yup, this is likely your problem. ZFS takes a little time to give back memory, and the app may fail with ENOMEM before this happens. So, is there any way to check that? If it turns out to be the problem... 1) Is there a way to limit the size of ZFS's caches? Well... sort of. You can set the size of arc.c_max and this will put an upper bound on the cache. But this is a bit of a hack. If not, then 2) Is there a way to clear ZFS's cache? Try unmounting/mounting the file system; if that does not work, try export/import of the pool. -Mark
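For reference, one hedged way to cap the ARC is via /etc/system; whether the zfs_arc_max tunable exists on this particular build is an assumption (on builds where it doesn't, the equivalent is poking arc.c_max with mdb -kw, as Mark mentions), so treat this as a sketch rather than a recipe:

set zfs:zfs_arc_max = 0x80000000

That would limit the cache to 2 GB after a reboot, leaving the rest of the 18 GB for the Java heap and everything else.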
Re: [zfs-discuss] Re: Snapshots and backing store
Nicolas Dorfsman wrote: We need to think ZFS as ZFS, and not as a new filesystem ! I mean, the whole concept is different. Agreed. So. What could be the best architecture ? What is the problem? With UFS, I used to have separate metadevices/LUNs for each application. With ZFS, I thought it would be nice to use a separate pool for each application. Ick. It would be much better to have one pool, and a separate filesystem for each application. But, it means multiply snapshot backing-store OR dynamically remove/add this space/LUN to pool where we need to do backups. I don't understand this statement. What problem are you trying to solve? If you want to do backups, simply take a snapshot, then point your backup program at it. If you want faster incremental backups, use 'zfs send -i' to generate the file to backup. --matt ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Bizzare problem with ZFS filesystem
One more piece of information. I was able to ascertain the slowdown happens only when ZFS is used heavily; meaning lots of inflight I/O. This morning when the system was quiet my writes to the /u099 filesystem was excellent and it has gone south like I reported earlier. I am currently awaiting the completion of a write to /u099, well over 60 seconds. At the same time I was able create/save files in /u001 without any problems. The only difference between the /u001 and /u099 is the size of the filesystem (256GB vs 768GB). Per your suggestion I ran a 'zfs set' command and it completed after a wait of around 20 seconds while my file save from vi against /u099 is still pending!!! This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Snapshots and backing store
Including performance considerations? For instance, if I have two Oracle databases with two I/O profiles (TP versus batch)... what would be best:
1) Two pools, each one on two LUNs. Each LUN distributed on n trays.
2) One pool on one LUN. This LUN distributed on 2 x n trays.
3) One pool striped on two LUNs. Each LUN distributed on n trays.
Good question. I'll bet there's no way to determine that without testing. It may be that the extra performance from having the additional lun(s) within a single pool outweighs any performance issues from having both workloads use the same storage. With one pool, no problem. With n pools, my problem is the space used by the snapshot. With the COW method of UFS snapshot I can put all backing-stores on one single volume. With ZFS snapshot, it's conceptually impossible. Yup. That's due to the differences in how those snapshots are implemented. In the future you may be able to add and remove storage from pools dynamically. In such a case, it could be possible to bring a disk into a pool, increase disk usage during a snapshot, delete the snapshot, then remove the disk. Disk removal would require copying data and be a performance hit. Then you go and do the same thing with the other pools. Today this isn't possible because you cannot migrate data off of a VDEV to reclaim the storage. -- Darren Dunham [EMAIL PROTECTED] Senior Technical Consultant, TAOS http://www.taos.com/ San Francisco, CA bay area
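To make the last point concrete: growing a pool is already possible today; it is only the reverse operation (evacuating a vdev to shrink the pool) that doesn't exist yet. Pool and device names below are placeholders:

# zpool add tank c3t0d0

Once added, the device cannot currently be removed from the pool short of destroying and recreating it, which is exactly the limitation Darren describes.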
Re: [zfs-discuss] Re: Snapshots and backing store
Matthew Ahrens wrote: Nicolas Dorfsman wrote: We need to think ZFS as ZFS, and not as a new filesystem ! I mean, the whole concept is different. Agreed. So. What could be the best architecture ? What is the problem? With UFS, I used to have separate metadevices/LUNs for each application. With ZFS, I thought it would be nice to use a separate pool for each application. Ick. It would be much better to have one pool, and a separate filesystem for each application. I agree but can you set performance boundaries based on the filesystem? The pool level seems to be the place to do such things. For example making sure an application has a set level of iops at its disposal. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Importing ZFS filesystems across architectures...
OK, this may seem like a stupid question (and we all know that there are such things...) I'm considering sharing a disk array (something like a 3510FC) between two different systems, a SPARC and an Opteron. Will ZFS transparently work to import/export pools between the two systems? That is, can I export a pool created on the SPARC box, then import that on the Opteron box and have all the data there (and the pool work normally)? Normally, I'd run into problems with Fdisk vs EFI vs VTOC labeling/partitioning, but I was hoping that ZFS would magically make my life simpler here... :-) -- Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Importing ZFS filesystems across architectures...
If you're using EFI labels, yes (VTOC labels are not endian-neutral). ZFS will automatically convert endianness from the on-disk format, and new data will be written using the native endianness, so data will gradually be rewritten to avoid the byteswap overhead. - Eric On Wed, Sep 13, 2006 at 03:55:27PM -0700, Erik Trimble wrote: OK, this may seem like a stupid question (and we all know that there are such things...) I'm considering sharing a disk array (something like a 3510FC) between two different systems, a SPARC and an Opteron. Will ZFS transparently work to import/export pools between the two systems? That is, can I export a pool created on the SPARC box, then import that on the Opteron box and have all the data there (and the pool work normally)? Normally, I'd run into problems with Fdisk vs EFI vs VTOC labeling/partitioning, but I was hoping that ZFS would magically make my life simpler here... :-) -- Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) -- Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock
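In practice the move Eric describes is just an export on one box and an import on the other (the pool name is hypothetical; the pool should be built on whole disks so it carries EFI labels):

sparc# zpool export shared
x86# zpool import shared

Blocks written by the SPARC host are byteswapped on read by the Opteron host, and get rewritten in the new native byte order as they are modified.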
Re: [zfs-discuss] Importing ZFS filesystems across architectures...
Erik Trimble wrote: OK, this may seem like a stupid question (and we all know that there are such things...) I'm considering sharing a disk array (something like a 3510FC) between two different systems, a SPARC and an Opteron. Will ZFS transparently work to import/export pools between the two systems? That is, can I export a pool created on the SPARC box, then import that on the Opteron box and have all the data there (and the pool work normally)? Normally, I'd run into problems with Fdisk vs EFI vs VTOC labeling/partitioning, but I was hoping that ZFS would magically make my life simpler here... Use EFI and you should be fine. as long as you don't try to import any pools on both hosts at the same time. James ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Importing ZFS filesystems across architectures...
Erik Trimble wrote: OK, this may seem like a stupid question (and we all know that there are such things...) I'm considering sharing a disk array (something like a 3510FC) between two different systems, a SPARC and an Opteron. Will ZFS transparently work to import/export pools between the two systems? That is, can I export a pool created on the SPARC box, then import that on the Opteron box and have all the data there (and the pool work normally)? Normally, I'd run into problems with Fdisk vs EFI vs VTOC labeling/partitioning, but I was hoping that ZFS would magically make my life simpler here... As long as you don't try to mount the pool from both systems at the same time, do funky auto-takeover stuff, etc., you should be good to go. At least, you're supposed to be able to do this. I haven't tested it.
Re: [zfs-discuss] Importing ZFS filesystems across architectures...
On 9/13/06, Erik Trimble [EMAIL PROTECTED] wrote: OK, this may seem like a stupid question (and we all know that there are such things...) I'm considering sharing a disk array (something like a 3510FC) between two different systems, a SPARC and an Opteron. Will ZFS transparently work to import/export pools between the two systems? That is, can I export a pool created on the SPARC box, then import that on the Opteron box and have all the data there (and the pool work normally)? Yes, this is a design feature. When you first move the pool and import it, the system will read the blocks of data, see that they are of a different endianness than the current host, and convert as necessary. As data is written, it is written in the native format, so the pool's new host will suffer no endian penalty reading data it wrote itself, and only a small penalty accessing data of the old endianness. James Dickens uadmin.blogspot.com Normally, I'd run into problems with Fdisk vs EFI vs VTOC labeling/partitioning, but I was hoping that ZFS would magically make my life simpler here... :-) -- Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
Frank Cusack wrote: ...[snip James McPherson's objections to PMC] I understand the objection to mickey mouse configurations, but I don't understand the objection to (what I consider) simply improving safety. ... And why should failover be limited to SC? Why shouldn't VCS be able to play? Why should SC have secrets on how to do failover? After all, this is OPENsolaris. And anyway many homegrown solutions (the kind I'm familiar with anyway) are of high quality compared to commercial ones. Frank, this isn't a SunCluster vs VCS argument. It's an argument about * doing cluster-y stuff with the protection that a cluster framework provides versus * doing cluster-y stuff without the protection that a cluster framework provides If you want to use VCS be my guest, and let us know how it goes. If you want to use a homegrown solution, then please let us know what you did to get it working, how well it copes and how you are addressing any data corruption that might occur. I tend to refer to SunCluster more than VCS simply because I've got more in depth experience with Sun's offering. James C. McPherson ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Comments on a ZFS multiple use of a pool, RFE.
Anton B. Rang schrieb: The hostid solution that VxVM uses would catch this second problem, because when A came up after its reboot, it would find that -- even though it had created the pool -- it was not the last machine to access it, and could refuse to automatically mount it. If the administrator really wanted it mounted, they could force the issue. Relying on the administrator to know that they have to remove a file (the 'zpool cache') before they let the machine come up out of single-user mode seems the wrong approach to me. (By default, we'll shoot you in the foot, but we'll give you a way to unload the gun if you're fast enough and if you remember.) I haven't tried: Does ZFS try to -f (force) import the zpools in /etc/zfs/zpool.cache, or does it just do a normal import and fail if the disks seem to be in use elsewhere, e.g. after a reboot of a probably failed and later repaired machine? Just to clear some things up: the OP who started the whole discussion would have had the same problems with VxVM as he has now with ZFS. Forcing an import of a disk group on one host while it is still active on another host won't make the DG magically disappear on the other one. The corresponding flag to zpool import -f is vxdg import -C. If you issue this command you could also end up with the same DG imported on more than one host. Because in VxVM there is usually another level of indirection (volumes on top of the DG, which may contain filesystems you also have to mount manually), just importing a DG is normally harmless. But with VxVM, too, you can shoot yourself in the foot:
On host B:
B# vxdg -C import DG
B# vxvol -g DG startall
B# mount /dev/vx/dsk/DG/filesys /some/where
B# do_something on /some/where
while still on host A:
A# do_something on /some/where
Instead of a zpool.cache file, VxVM uses the hostid (not to be confused with the numeric host id; it is normally just the ordinary hostname, `uname -n`, of the machine) to know which DGs it should mount automatically. Additionally each DG (or more precisely: each disk) has an autoimport flag which also has to be turned on to make the DG auto-imported during bootup. So to mimic VxVM in ZFS the solution would simply be: add an autoimport flag to the zpool. Daniel
Re: [zfs-discuss] Re: Comments on a ZFS multiple use of a pool, RFE.
On September 14, 2006 1:25:01 AM +0200 Daniel Rock [EMAIL PROTECTED] wrote: Just to clear some things up: the OP who started the whole discussion would have had the same problems with VxVM as he has now with ZFS. Forcing an import of a disk group on one host while it is still active on another host won't make the DG magically disappear on the other one. The OP was just showing a test case. On a real system your HA software would exchange a heartbeat and not do a double import. The problem with ZFS is that after the original system fails and the second system imports the pool, the original system also tries to import on [re]boot, and the OP didn't know how to disable this. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On September 13, 2006 4:33:31 PM -0700 Frank Cusack [EMAIL PROTECTED] wrote: You'd typically have a dedicated link for heartbeat; what if that cable gets yanked or that NIC port dies? The backup system could avoid mounting the pool if ZFS had its own heartbeat. What if the cluster software has a bug and tells the other system to take over? ZFS could protect itself. Hmm, actually probably not, considering heartbeat intervals and failover time vs. probable zpool update frequency. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: zfs and Oracle ASM
I did a non-scientific benchmark comparing ASM and ZFS. Just look for my posts and you'll see it. To summarize, it was a statistical tie for simple loads of around 2GB of data, and we've chosen to stick with ASM for a variety of reasons, not the least of which is its ability to rebalance when disks are added/removed. Better integration comes to mind too. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
Dale Ghent wrote: James C. McPherson wrote: As I understand things, SunCluster 3.2 is expected to have support for HA-ZFS, and until that version is released you will not be running in a supported configuration, and so any errors you encounter are *your fault alone*. Still, after reading Mathias's description, it seems that the former node is doing an implicit forced import when it boots back up. This seems wrong to me. Repeat the experiment with UFS, or most other file systems, on a raw device and you would get the same behaviour as ZFS: corruption. The question on the table is why doesn't ZFS behave like a cluster-aware volume manager, not why does ZFS behave like UFS when 2 nodes mount the same file system simultaneously? -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: ZFS imported simultanously on 2 systems...
On September 13, 2006 7:07:40 PM -0700 Richard Elling [EMAIL PROTECTED] wrote: Dale Ghent wrote: James C. McPherson wrote: As I understand things, SunCluster 3.2 is expected to have support for HA-ZFS, and until that version is released you will not be running in a supported configuration, and so any errors you encounter are *your fault alone*. Still, after reading Mathias's description, it seems that the former node is doing an implicit forced import when it boots back up. This seems wrong to me. Repeat the experiment with UFS, or most other file systems, on a raw device and you would get the same behaviour as ZFS: corruption. Again, the difference is that with UFS your filesystems won't auto-mount at boot. If you repeated the experiment with UFS, you wouldn't try to mount until you had decided you should own the disk. -frank ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
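To illustrate Frank's point (device and mount-point names here are made up): a shared UFS filesystem would normally be listed in /etc/vfstab with the mount-at-boot field set to "no", so nothing is mounted until the failover software decides this node owns the disk, whereas ZFS imports and mounts every pool recorded in /etc/zfs/zpool.cache during boot.
#device to mount    device to fsck       mount point  FS type  fsck pass  mount at boot  mount options
/dev/dsk/c2t0d0s6   /dev/rdsk/c2t0d0s6   /shared      ufs      2          no             -
With an entry like this, the cluster or homegrown failover script issues 'mount /shared' only after it has claimed ownership of the device.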
Re: [zfs-discuss] Re: zfs and Oracle ASM
Anantha N. Srirama wrote: I did a non-scientific benchmark comparing ASM and ZFS. Just look for my posts and you'll see it. To summarize, it was a statistical tie for simple loads of around 2GB of data, and we've chosen to stick with ASM for a variety of reasons, not the least of which is its ability to rebalance when disks are added/removed. Better integration comes to mind too. Yes. I think I commented on this last year, too. ASM is Oracle's solution to replace all other file systems for their database. You can expect that Oracle will ensure that its features are tightly coupled to the systems management interfaces available from Oracle. As such, there will always be better integration between Oracle Database and ASM than with any other generic file system. In other words, Oracle gains a lot by developing ASM to be consistent with their systems management infrastructure and running on heterogeneous, legacy systems -- a good thing. (I don't think ZFS is going to lose any revenue stream from ASM ;-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: Re: Proposal: multiple copies of user data
On 9/13/06, Matthew Ahrens [EMAIL PROTECTED] wrote: Sure, if you want *everything* in your pool to be mirrored, there is no real need for this feature (you could argue that setting up the pool would be easier if you didn't have to slice up the disk though). Not necessarily. Implementing this at the FS level will still allow the administrator to turn on copies for the entire pool, since the pool is technically also a FS and the property is inherited by child FSes. Of course, this also allows the admin to turn off copies for the FS containing junk. It could be recommended in some situations. If you want to protect against disk firmware errors, bit flips, part of the disk getting scrogged, then mirroring on a single disk (whether via a mirror vdev or copies=2) solves your problem. Admittedly, these problems are probably less common than whole-disk failure, which mirroring on a single disk does not address. I beg to differ: from experience, the above errors are more common than whole-disk failures. It's just that we do not notice the disks are developing problems, but panic when they finally fail completely. That's what happens to most of my disks anyway. Disks are much smarter nowadays about hiding bad sectors, but that doesn't mean there are none. If your precious data happens to sit on one, you'll be crying for copies. -- Just me, Wire ... ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
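A sketch of how that inheritance would look once the proposed copies property exists (dataset names are made up; at the time of this thread the feature is still only a proposal):
# zfs set copies=2 tank              (set on the pool's root filesystem)
# zfs create tank/important          (child filesystems inherit copies=2)
# zfs create -o copies=1 tank/junk   (override for the filesystem holding junk)
# zfs get -r copies tank             (show which value each dataset ends up with)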
Re: [zfs-discuss] Snapshots and backing store
On Sep 13, 2006, at 10:52, Scott Howard wrote: It's not at all bizarre once you understand how ZFS works. I'd suggest reading through some of the documentation available at http://www.opensolaris.org/os/community/zfs/docs/ , in particular the slides available there. The presentation that 'goes' with those slides is available online: http://www.sun.com/software/solaris/zfs_learning_center.jsp ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Re: marvel cards.. as recommended
Yeah. I got the message from a few others, and we are hoping to return/buy the newer one. I'm sort of surprised by the limited set of SATA RAID or JBOD cards that one can actually use. Even the ones linked to on this list sometimes aren't supported :). I need to get up and running like yesterday, so we are just ordering the cards posthaste. On 9/13/06, Anton B. Rang [EMAIL PROTECTED] wrote: A quick peek at the Linux source shows a small workaround in place for the 07 revision... maybe if you file a bug against Solaris to support this revision it might be possible to get it added, at least if that's the only issue. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] any update on zfs root/boot ?
Hi folks, I'm in the annoying position of having to replace my rootdisk (since it's a [EMAIL PROTECTED]@$! Maxtor and dying). I'm currently running with zfsroot after following Tabriz' and TimF's procedure to enable that. However, I'd like to know whether there's a better way to get zfs root/boot happening. The mini-ufs partition kludge is getting a bit tired :) My plan for the moment (with build 45, the most recent ISO that I have) is to:
* install the new disk
* boot to single-user off the media
* create a swap slice and an everything-else slice on the new disk
* zpool create rootpool on the everything-else slice
* reboot and start the installer
* convince the installer that I don't need to partition anything
* install to the new rootpool.
The second-to-last step is where I imagine the most difficulty will arise. Is there anything that springs to mind which I could do to ensure it works? thanks in advance, James C. McPherson -- Solaris kernel software engineer, system admin and troubleshooter http://www.jmcp.homeunix.com/blog Find me on LinkedIn @ http://www.linkedin.com/pub/2/1ab/967 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
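Roughly what the middle steps of that plan might look like from the single-user shell -- a sketch only: the disk name c1t0d0 and the slice layout are assumptions, and persuading the installer to reuse the existing pool is the part that is hand-waved here:
# format                             (label the new disk; carve out s1 for swap, s0 for everything else)
# swap -a /dev/dsk/c1t0d0s1          (activate the new swap slice)
# zpool create rootpool c1t0d0s0     (the everything-else slice becomes the root pool)
# zpool status rootpool              (sanity check before rebooting into the installer)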