Re: [zfs-discuss] pool metadata has duplicate children
On 2013-Jan-08 21:30:57 -0800, John Giannandrea j...@meer.net wrote:
> Notice that in the absence of the faulted da2 the OS has assigned da3
> to da2 etc.  I suspect this was part of the original problem in
> creating a label with two da2s

The primary vdev identifier is the guid.  The path is of secondary
importance (ZFS should automatically recover from juggled disks without
an issue - and has for me).

Try running zdb -l on each of your pool disks and verify that each has
4 identical labels, and that the 5 guids (one on each disk) are unique
and match the vdev_tree you got from zdb.  My suspicion is that you've
somehow lost the disk with the guid 3419704811362497180.

> twa0: 3ware 9000 series Storage Controller
> twa0: INFO: (0x15: 0x1300): Controller details:: Model 9500S-8, 8 ports, Firmware FE9X 2.08.00.006
> da0 at twa0 bus 0 scbus0 target 0 lun 0
> da1 at twa0 bus 0 scbus0 target 1 lun 0
> da2 at twa0 bus 0 scbus0 target 2 lun 0
> da3 at twa0 bus 0 scbus0 target 3 lun 0
> da4 at twa0 bus 0 scbus0 target 4 lun 0

Are these all JBOD devices?

-- Peter Jeremy
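A loop like the following is one way to do that check (a rough sketch -
the da0..da4 device names are only illustrative, adjust for your
controller):

  for d in da0 da1 da2 da3 da4; do
          echo "== ${d} =="
          # each disk should carry 4 identical labels; compare the guids by eye
          zdb -l /dev/${d} | grep -E 'guid|path'
  done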
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 11:02:06 -0500, Ray Arachelian r...@arachelian.com wrote:
> Is the pool importing properly at least?  Maybe you can create another
> volume and transfer the data over for that volume, then destroy it?

The pool is imported and passes all tests except zfs diff.  Creating
another pool _is_ an option but I'm not sure how to transfer the data
across - using zfs send | zfs recv replicates the corruption and
tar -c | tar -x loses all the snapshots.

> There are special things you can do with import where you can roll back
> to a certain txg on the import if you know the damage is recent.

The damage exists in the oldest snapshot for that filesystem.

-- Peter Jeremy
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 13:47:01 -0500, Ray Arachelian r...@arachelian.com wrote:
> On 11/19/2012 12:03 PM, Peter Jeremy wrote:
>> The damage exists in the oldest snapshot for that filesystem.
> Are you able to delete that snapshot?

Yes, but it makes no difference - the corrupt object exists in the
current pool, so deleting an old snapshot has no effect.  What I was
hoping was that someone would have a suggestion on removing the
corruption in-place - using zdb, zhack or similar.

-- Peter Jeremy
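For anyone wanting to look at this sort of damage before touching it,
zdb can at least dump the suspect object.  Something like the following
(the dataset name and object number are only placeholders):

  # take the object number reported by zfs diff or a scrub, then:
  zdb -dddd tank/fs 12345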
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 21:10:56 +0100, Jim Klimov jimkli...@cos.ru wrote:
> On 2012-11-19 20:28, Peter Jeremy wrote:
>> Yep - that's the fallback solution.  With 1874 snapshots spread over
>> 54 filesystems (including a couple of clones), that's a major
>> undertaking.  (And it loses timestamp information).
> Well, as long as you have and know the base snapshots for the clones,
> you can recreate them at the same branching point on the new copy too.

Yes, it's just painful.

> Also, while you are at it, you can use different settings on the new
> pool, based on your achieved knowledge of your data

This pool has a rebuild in its future anyway so I have this planned.

> - perhaps using better compression (IMHO stale old data that became
> mostly read-only is a good candidate for gzip-9), setting proper block
> sizes for files of databases and disk images, maybe setting better
> checksums, and if your RAM vastness and data similarity permit -
> perhaps employing dedup

After reading the horror stories and reading up on how dedupe works,
this is definitely not on the list.

> (run zdb -S on source pool to simulate dedup and see if you get any
> better than 3x savings - then it may become worthwhile).

Not without lots more RAM - and that would mean a whole new box.

> Perhaps, if the zfs diff does perform reasonably for you, you can feed
> its output as the list of objects to replicate in rsync's input and
> save many cycles this way.

The starting point of this saga was that zfs diff failed, so that isn't
an option.

On 2012-Nov-19 21:24:19 +0100, Jim Klimov jimkli...@cos.ru wrote:
> fatally difficult scripting (I don't know if it is possible to fetch
> the older attribute values from snapshots - which were in force at that
> past moment of time; if somebody knows anything on this - plz write).

The best way to identify past attributes is probably to parse zfs
history, though that won't help for received attributes.

-- Peter Jeremy
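For reference, the dedup simulation Jim mentions is just (pool name
illustrative):

  zdb -S tank

It prints a simulated DDT histogram and an overall dedup ratio without
actually enabling dedup, so it's a cheap way to decide whether the RAM
cost could ever pay off.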
Re: [zfs-discuss] Repairing corrupted ZFS pool
On 2012-Nov-19 14:38:30 -0700, Mark Shellenbaum mark.shellenb...@oracle.com wrote:
> On 11/19/12 1:14 PM, Jim Klimov wrote:
>> On 2012-11-19 20:58, Mark Shellenbaum wrote:
>>> There is probably nothing wrong with the snapshots.  This is a bug in
>>> ZFS diff.  The ZPL parent pointer is only guaranteed to be correct
>>> for directory objects.  What you probably have is a file that was
>>> hard linked multiple times and the parent pointer (i.e. directory)
>>> was recycled and is now a file
>> Ah.  Thank you for that.  I knew about the parent pointer, I wasn't
>> aware that ZFS didn't manage it correctly.
>
> The parent pointer for hard linked files is always set to the last
> link to be created.
>
> $ mkdir dir.1
> $ mkdir dir.2
> $ touch dir.1/a
> $ ln dir.1/a dir.2/a.linked
> $ rm -rf dir.2
>
> Now the parent pointer for a will reference a removed directory.

I've done some experimenting and confirmed this behaviour.  I gather
zdb bypasses ARC because the change of parent pointer after the ln(1)
only becomes visible after a sync.

> The ZPL never uses the parent pointer internally.  It is only used by
> zfs diff and other utility code to translate object numbers to full
> pathnames.  The ZPL has always set the parent pointer, but it is more
> for debugging purposes.

I didn't realise that.  I agree that the above scenario can't be
tracked with a single parent pointer, but I assumed that ZFS reset the
parent to unknown rather than leaving it as a pointer to a random
no-longer-valid object.

This probably needs to be documented as a caveat on zfs diff -
especially since it can cause hangs and panics with older kernel code.

-- Peter Jeremy
[zfs-discuss] Repairing corrupted ZFS pool
send/recv (which happily and quietly replicates the corruption).

Note that I have never (intentionally) used extended attributes within
the pool but it has been exported to Windows XP via Samba and possibly
to OS-X via NFSv3.

Does anyone have any suggestions for fixing the corruption?  One
suggestion was tar c | tar x but that is a last resort (since there are
54 filesystems and ~1900 snapshots in the pool).

-- Peter Jeremy
Re: [zfs-discuss] ZFS best practice for FreeBSD?
On 2012-Oct-12 08:11:13 +0100, andy thomas a...@time-domain.co.uk wrote:
> This is apparently what had been done in this case:
>
> gpart add -b 34 -s 600 -t freebsd-swap da0
> gpart add -b 634 -s 1947525101 -t freebsd-zfs da1
> gpart show

Assuming that you can be sure that you'll keep 512B sector disks,
that's OK but I'd recommend that you align both the swap and ZFS
partitions on at least 4KiB boundaries for future-proofing (ie you can
safely stick the same partition table onto a 4KiB disk in future).

> Is this a good scheme?  The server has 12 G of memory (upped from 4 GB
> last year after it kept crashing with out of memory reports on the
> console screen) so I doubt the swap would actually be used very often.

Having enough swap to hold a crashdump is useful.  You might consider
using gmirror for swap redundancy (though 3-way is overkill).  (And I'd
strongly recommend against swapping to a zvol or ZFS - FreeBSD has
issues with that combination).

> The other issue with this server is it needs to be rebooted every 8-10
> weeks as disk I/O slows to a crawl over time and the server becomes
> unusable.  After a reboot, it's fine again.  I'm told ZFS 13 on FreeBSD
> 8.0 has a lot of problems

Yes, it does - and your symptoms match one of the problems.  Does
top(1) report lots of inactive and cache memory and very little free
memory, and a high kstat.zfs.misc.arcstats.memory_throttle_count, once
I/O starts slowing down?

> so I was planning to rebuild the server with FreeBSD 9.0 and ZFS 28 but
> I didn't want to make any basic design mistakes in doing this.

I'd suggest you test 9.1-RC2 (just released) with a view to using 9.1,
rather than installing 9.0.  Since your questions are FreeBSD specific,
you might prefer to ask on the freebsd-fs list.

-- Peter Jeremy
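As a rough sketch of 4KiB-aligned partitioning (sizes and device name
are only examples; recent gpart accepts -a for alignment, otherwise
pick -b start values that are multiples of 8 sectors):

  gpart add -a 4k -s 8g -t freebsd-swap da0
  gpart add -a 4k -t freebsd-zfs da0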
Re: [zfs-discuss] FreeBSD ZFS
On 2012-Aug-09 16:05:00 +0530, Jim Klimov jimkli...@cos.ru wrote:
> 2012-08-09 13:57, Karl Wagner wrote:
>> Firstly, I believe it currently stands at zpool v28.  Is this correct?

For FreeBSD 8.x and 9.x, yes.  FreeBSD-head includes feature flags and
com.delphix:async_destroy.

>> Will this be updated any time soon?

I expect 8-stable and 9-stable will be updated to match -head once
FreeBSD 9.1 is released (ie 9.1 won't support feature flags but 9.2 and
a potential 8.4 will).  In general, FreeBSD imports ZFS fixes and
enhancements, mostly from Illumos, as they become available.  The
Oracle v29 and later updates won't be available in FreeBSD unless they
are open-sourced by Oracle.

> New features in the works include modernized compression and checksum
> algorithms, among others.  Nominal zpool version is 5000 for pools
> which enabled feature flags, and that is currently supported by
> oi_151a5 prebuilt distro (I don't know of other builds with that -
> feature integrated into code this summer).

FreeBSD-head does.

-- Peter Jeremy
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On 2012-Aug-02 18:30:01 +0530, opensolarisisdeadlongliveopensolaris
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> Ok, so the point is, in some cases, somebody might want redundancy on a
> device that has no redundancy.  They're willing to pay for it by
> halving their performance.

This isn't quite true - write performance will be at least halved
(possibly worse due to additional seeking) but read performance could
potentially improve (more copies means, on average, there should be
less seeking to get a copy than if there was only one copy).  And
non-I/O performance is unaffected.

> The only situation I'll acknowledge is the laptop situation, and I'll
> say, present day very few people would be willing to pay *that* much
> for this limited use-case redundancy.

My guess is that, for most people, the overall performance impact would
be minimal because disk write performance isn't the limiting factor for
most laptop usage scenarios.

> The solution that I as an IT person would recommend and deploy would be
> to run without copies and instead cover you bum by doing backups.

You need backups in any case, but backups won't help you if you can't
conveniently access them.  Before giving a blanket recommendation, you
need to consider how the person uses their laptop.  Consider the
following scenario: You're in the middle of a week-long business trip
and your laptop develops a bad sector in an inconvenient spot.  Do you:
a) Let ZFS automagically repair the sector thanks to copies=2.
b) Attempt to rebuild your laptop and restore from backups (left
   securely at home) via the dodgy hotel wifi.

-- Peter Jeremy
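For anyone wanting to try it, enabling extra copies is a one-liner
(dataset name is only an example); note that it only applies to blocks
written after the property is set:

  zfs set copies=2 rpool/home
  zfs get copies rpool/home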
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On 2012-Aug-01 21:00:46 +0530, Nigel W nige...@nosun.ca wrote:
> I think a fantastic idea for dealing with the DDT (and all other
> metadata for that matter) would be an option to put (a copy of)
> metadata exclusively on a SSD.

This is on my wishlist as well.  I believe ZEVO supports it so possibly
it'll be available in ZFS in the near future.

-- Peter Jeremy
Re: [zfs-discuss] Interaction between ZFS intent log and mmap'd files
On 2012-Jul-05 06:47:36 +1000, Nico Williams n...@cryptonector.com wrote:
> On Wed, Jul 4, 2012 at 11:14 AM, Bob Friesenhahn
> bfrie...@simple.dallas.tx.us wrote:
>> On Tue, 3 Jul 2012, James Litchfield wrote:
>>> Agreed - msync/munmap is the only guarantee.
>> I don't see that the munmap definition assures that anything is
>> written to disk.  The system is free to buffer the data in RAM as long
>> as it likes without writing anything at all.
> Oddly enough the manpages at the Open Group don't make this clear.

They don't specify the behaviour on write(2) or close(2) either.  All
this means is that there is no guarantee that munmap(2) (or write(2) or
close(2)) will immediately flush the data to stable storage.

> So I think it may well be advisable to use msync(3C) before munmap()
> on MAP_SHARED mappings.

If you want to be certain that your changes will be flushed to stable
storage by a particular point in your program execution then you must
call msync(MS_SYNC) before munmap(2).

> However, I think all implementors should, and probably all do (Linux
> even documents that it does) have an implied msync(2) when doing a
> munmap(2).

There's nothing in the standard requiring this behaviour and it would
adversely impact performance in the general case, so I would expect
that implementors _wouldn't_ force msync(2) on munmap(2).  FreeBSD
definitely doesn't.  As for Linux, I keep finding cases where, if a
standard doesn't mandate specific behaviour, Linux will implement (and
document) different behaviour to the way other OSs behave in the same
situation.

> It really makes no sense at all to have munmap(2) not imply msync(3C).

Actually, it makes no more sense for munmap(2) to imply msync(2) than
it does for close(2) [which is functionally equivalent] to imply
fsync(2) - ie none at all.

> (That's another thing, I don't see where the standard requires that
> munmap(2) be synchronous.

http://pubs.opengroup.org/onlinepubs/009695399/functions/munmap.html
states "Further references to these pages shall result in the
generation of a SIGSEGV signal to the process."  It's difficult to see
how to implement this behaviour unless munmap(2) is synchronous.

> Async munmap(2) - no need to mount cross-calls, instead allowing to
> mapping to be torn down over time.  Doing a synchronous msync(3C), then
> a munmap(2) is a recipe for going real slow, but if munmap(2) does not
> portably guarantee an implied msync(3C), then would it be safe to do an
> async msync(2) then munmap(2)??)

I don't understand what you are trying to achieve here.  munmap(2)
should be a relatively cheap operation so there is very little to be
gained by making it asynchronous.  Can you please explain a scenario
where munmap(2) would be slow (other than cases where implementors have
deliberately and unnecessarily made it slow)?  I agree that
msync(MS_SYNC) is slow, but if you want a guarantee that your data is
securely written to stable storage then you need to wait for that
stable storage.  msync(MS_ASYNC) should have no impact on a later
munmap(2) and it should always be safe to call msync(MS_ASYNC) before
munmap(2) (in fact, it's a good idea to maximise portability).

-- Peter Jeremy
Re: [zfs-discuss] Spare drive inherited cksum errors?
On 2012-May-29 22:04:39 +1000, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> If you have a drive (or two drives) with bad sectors, they will only be
> detected as long as the bad sectors get used.  Given that your pool is
> less than 100% full, it means you might still have bad hardware going
> undetected, if you pass your scrub.

One way around this is to 'dd' each drive to /dev/null (or do a long
test using smartmontools).  This ensures that the drive thinks all
sectors are readable.

> You might consider creating a big file (dd if=/dev/zero of=bigfile.junk
> bs=1024k) and then when you're out of disk space, scrub again.
> (Obviously, you would be unable to make new writes to pool as long as
> it's filled...)  I'm not sure how ZFS handles no large free blocks, so
> you might need to repeat this more than once to fill the disk.

This could leave your drive seriously fragmented.  If you do try this,
I'd recommend creating a snapshot first and then rolling back to it,
rather than just deleting the junk file.  Also, this (obviously) won't
work at all on a filesystem with compression enabled.

-- Peter Jeremy
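For example (device name is only illustrative - repeat for each member
of the pool):

  dd if=/dev/ada0 of=/dev/null bs=1m      # force every sector to be read
  smartctl -t long /dev/ada0              # or queue a SMART extended self-test
  smartctl -a /dev/ada0                   # and check the self-test log afterwards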
Re: [zfs-discuss] Drive upgrades
On 2012-Apr-17 17:25:36 +1000, Jim Klimov jimkli...@cos.ru wrote:
> For the sake of archives, can you please post a common troubleshooting
> techinque which users can try at home to see if their disks honour the
> request or not? ;)  I guess it would involve random-write bandwidths in
> two cases?

1) Issue disable write cache command to drive
2) Write several MB of data to drive
3) As soon as drive acknowledges completion, remove power to drive
   (this will require an electronic switch in the drive's power lead)
4) Wait until drive spins down.
5) Power up drive and wait until ready
6) Verify data written in (2) can be read.
7) Argue with drive vendor that drive doesn't meet specifications :-)

A similar approach can also be used to verify that NCQ cache flush
commands actually work.

-- Peter Jeremy
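Steps 1, 2 and 6 could be scripted along these lines on a Linux test
box (a sketch only - the device name is a placeholder, the target
disk's contents are destroyed, and the power-cut in step 3 still has to
be done by hand):

  hdparm -W 0 /dev/sdX                                                 # step 1: disable the write cache
  dd if=testdata of=/dev/sdX bs=1M count=64 oflag=direct conv=fsync    # step 2: write the test data
  # ...cut power here, then after power-up:
  dd if=/dev/sdX bs=1M count=64 | cmp - testdata                       # step 6: compare against the original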
Re: [zfs-discuss] Drive upgrades
On 2012-Apr-14 02:30:54 +1000, Tim Cook t...@cook.ms wrote:
> You will however have an issue replacing them if one should fail.  You
> need to have the same block count to replace a device, which is why I
> asked for a right-sizing years ago.

The traditional approach to this is to slice the disk yourself so you
have a slice with a known size and a dummy slice of a couple of GB in
case a replacement is a bit smaller.  Unfortunately, ZFS on Solaris
disables the drive cache if you don't give it a complete disk, so this
approach incurs a significant performance overhead there.  FreeBSD
leaves the drive cache enabled in either situation.  I'm not sure how
OI or Linux behave.

-- Peter Jeremy
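On FreeBSD, that under-sizing might look something like this (a sketch
only - device name and sector count are illustrative; the point is
simply to stop a few GB short of the disk's capacity):

  gpart create -s gpt da0
  gpart add -a 4k -s 1949000000 -t freebsd-zfs -l data0 da0   # ~1TB disk, leaves ~2GB unused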
Re: [zfs-discuss] Improving snapshot write performance
On 2012-Apr-11 18:34:42 +1000, Ian Collins i...@ianshome.com wrote:
> I use an application with a fairly large receive data buffer (256MB) to
> replicate data between sites.  I have noticed the buffer becoming
> completely full when receiving snapshots for some filesystems, even
> over a slow (~2MB/sec) WAN connection.  I assume this is due to the
> changes being widely scattered.

As Richard pointed out, the write side should be mostly contiguous.

> Is there any way to improve this situation?

Is the target pool nearly full (so ZFS is spending lots of time
searching for free space)?

Do you have dedupe enabled on the target pool?  This would force ZFS to
search the DDT to write blocks - this will be expensive, especially if
you don't have enough RAM.

Do you have a high compression level (gzip or gzip-N) on the target
filesystems, without enough CPU horsepower?

Do you have a dying (or dead) disk in the target pool?

-- Peter Jeremy
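A few quick checks along those lines (pool name is only an example; the
iostat form is the Solaris one):

  zpool list tank                 # how full is the target pool?
  zfs get -r dedup,compression tank
  zpool status -x                 # any sick devices?
  iostat -xn 5                    # look for one disk with much higher service times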
Re: [zfs-discuss] about btrfs and zfs
On 2011-Oct-18 23:18:02 +1100, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> I recently put my first btrfs system into production.  Here are the
> similarities/differences I noticed different between btrfs and zfs:

Thanks for that.

> * zfs has storage tiering.  (cache log devices, such as SSD's to
> accelerate performance.)  btrfs doesn't have this yet.

I'd call that multi-level caching and journalling.  To me, storage
tiering means something like HSM - something that lets me push rarely
used data to near-line storage (eg big green SATA drives that are spun
down most of the time) whilst retaining the ability to transparently
access it.

On 2011-Oct-19 03:46:30 +1100, Mark Sandrock mark.sandr...@oracle.com wrote:
> Doesn't a scrub do more than what 'fsck' does?

It does different things.  I'm not sure about "more".

fsck verifies the logical consistency of a filesystem.  For UFS, this
includes: used data blocks are allocated to exactly one file, directory
entries point to valid inodes, allocated inodes have at least one link,
the number of links in an inode exactly matches the number of directory
entries pointing to that inode, directories form a single tree without
loops, file sizes are consistent with the number of allocated blocks,
unallocated data/inode blocks are in the relevant free bitmaps, and
redundant superblock data is consistent.  It can't verify data.

scrub uses checksums to verify the contents of all blocks and attempts
to correct errors using redundant copies of blocks.  This implicitly
detects some types of logical errors.  I don't know if scrub includes
explicit logic to detect things like directory loops, missing free
blocks, unreachable allocated blocks, multiply allocated blocks, etc.

> IIRC, fsck was seldom needed at my former site once UFS journalling
> became available.  Sweet update.

Whilst Solaris very rarely insists we run fsck, we have had a number of
cases where we have found files corrupted following a crash - even with
UFS journalling enabled.  Unfortunately, this isn't the sort of thing
that fsck could detect.

-- Peter Jeremy
Re: [zfs-discuss] Large scale performance query
On 2011-Aug-08 17:12:15 +0800, Andrew Gabriel andrew.gabr...@oracle.com wrote:
> periodic scrubs to cater for this case.  I do a scrub via cron once a
> week on my home system.  Having almost completely filled the pool, this
> was taking about 24 hours.  However, now that I've replaced the disks
> and done a send/recv of the data across to a new larger pool which is
> only 1/3rd full, that's dropped down to 2 hours.

FWIW, scrub time is more related to how fragmented a pool is, rather
than how full it is.  My main pool is only at 61% (of 5.4TiB) and has
never been much above that but has lots of snapshots and a fair amount
of activity.  A scrub takes around 17 hours.

This is another area where the mythical block rewrite would help a lot.

-- Peter Jeremy
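A weekly cron-driven scrub like the one Andrew describes is just a
one-line crontab entry, e.g. (pool name and zpool path are only
examples):

  # run a scrub early every Sunday morning
  0 3 * * 0 /usr/sbin/zpool scrub tank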
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
On 2011-Jul-26 17:24:05 +0800, Fajar A. Nugraha w...@fajar.net wrote:
> Shouldn't modern SSD controllers be smart enough already that they know:
> - if there's a request to overwrite a sector, then the old data on that
>   sector is no longer needed

ZFS never does update-in-place and UFS only does update-in-place for
metadata and where the application forces update-in-place.  This means
there will generally (always for ZFS) be a delay between when a
filesystem frees (is no longer interested in the contents of) a sector
and when it overwrites that sector.  Without TRIM support, an SSD can
only use overwrite to indicate that the contents of a sector are not
needed.  Which, in turn, means there is a pool of sectors that the FS
knows are unused but the SSD doesn't - and is therefore forced to
preserve.

Since an overwrite almost never matches the erase page, this increases
wear on the SSD because it is forced to rewrite unwanted data in order
to free up pages for erasure to support external write requests.  It
also reduces performance for several reasons:
- The SSD has to unnecessarily copy data - which takes time.
- The space recovered by each erasure is effectively reduced by the
  amount of rewritten data, so more time-consuming erasures are needed
  for a given external write load.
- The pools of unused-but-not-erased and erased (available) sectors are
  smaller, increasing the probability that an external write will
  require a synchronous erase cycle to complete.

> - allocate a clean sector from pool of available sectors (part of
>   wear-leveling mechanism)

As above, in the absence of TRIM, the pool will be smaller (and more
likely to be empty).

> - clear the old sector, and add it to the pool (possibly done in
>   background operation)

Otherwise a sector could never be rewritten.

> It seems to be the case with sandforce-based SSDs.  That would pretty
> much let the SSD work just fine even without TRIM (like when used under
> HW raid).

Better SSDs mitigate the problem by having more hidden space (keeping
the available pool larger to reduce the probability of a synchronous
erase being needed) and higher performance (masking the impact of the
additional internal writes and erasures).  If TRIM support was
available then the performance would still improve.  This means you
either get better system performance from the same SSD, or you can get
the same system performance from a lower-performance (cheaper) SSD.

-- Peter Jeremy
Re: [zfs-discuss] Changed to AHCI, can not access disk???
On 2011-Jul-05 21:03:50 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Orvar Korvar
>> ...
>> I suspect the problem is because I changed to AHCI.
> This is normal, no matter what OS you have.  It's the hardware.

Switching to AHCI changes the device interface presented to the kernel
and you need a different device driver to access the data.  As long as
your OS supports AHCI (and that is true of any OS that supports ZFS)
then you will still be able to access the disks - though the actual
path to the disk or disk device name will change.

> If you start using a disk in non-AHCI mode, you must always continue to
> use it in non-AHCI mode.  If you switch, it will make the old data
> inaccessible.

Only if your OS is broken.  The data is equally accessible in either
mode.  ZFS makes it easier to switch modes because it doesn't care
about the actual device name - at worst, you will need an export and
import.

-- Peter Jeremy
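In other words, the worst case is something like (pool name is only an
example):

  zpool export tank
  # shut down, switch the controller to AHCI in the BIOS, boot again
  zpool import tank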
Re: [zfs-discuss] ZFS working group and feature flags proposal
On 2011-May-26 03:02:04 +0800, Matthew Ahrens mahr...@delphix.com wrote:
> The first product of the working group is the design for a ZFS on-disk
> versioning method that will allow for distributed development of ZFS
> on-disk format changes without further explicit coordination.  This
> method eliminates the problem of two developers both allocating version
> number 31 to mean their own feature.

Looks good.

> pool open (zpool import and implicit import from zpool.cache)
> If pool is at SPA_VERSION_FEATURES, we must check for feature
> compatibility.  First we will look through entries in the label
> nvlist's features_for_read.  If there is a feature listed there which
> we don't understand, and it has a nonzero value, then we can not open
> the pool.

Is it worth splitting the feature "used" value into "optional" and
"mandatory"?  (Possibly with the ability to have an optional read
feature be linked to a mandatory write feature.)  To use an existing
example: dedupe (AFAIK) does not affect read code and so could show up
as an optional read feature but a mandatory write feature (though I
suspect this could equally be handled by just listing it in
features_for_write).

As a more theoretical example, consider OS-X resource forks.  The
presence of a resource fork matters for both read and write on OS-X but
nowhere else.  A (hypothetical) ZFS port to OS-X would want to know
whether the pool contained resource forks even if opened R/O but this
should not stop a different ZFS port from reading (and maybe even
writing to) the pool.

-- Peter Jeremy
Re: [zfs-discuss] ZFS, Oracle and Nexenta
On 2011-May-25 03:49:43 +0800, Brandon High bh...@freaks.com wrote:
> ... unless Oracle's zpool v30 is different than Nexenta's v30.

This would be unfortunate but no worse than the current situation with
UFS - Solaris, *BSD and HP Tru64 all have native UFS filesystems, all
of which are incompatible.

I believe the various OSS projects that use ZFS have formed a working
group to co-ordinate ZFS amongst themselves.  I don't know if Oracle
was invited to join (though given the way Oracle has behaved in all the
other OSS working groups it was a member of, having Oracle onboard
might be a disadvantage).

-- Peter Jeremy
Re: [zfs-discuss] Backup complete rpool structure and data to tape
On 2011-May-12 00:20:28 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> Backup/restore of bootable rpool to tape with a 3rd party application
> like legato etc is kind of difficult.  Because if you need to do a bare
> metal restore, how are you going to do it?

This is a generic problem, not limited to ZFS.  The generic solutions
are either:
a) Customised boot disk that includes the 3rd party restore client
b) Separate backup of root+client in a format that's restorable using
   tools only on the generic boot disk (eg tar or ufsdump).
(Where "boot disk" could be network boot instead of a physical CD/DVD).

> I might suggest:  If you use zfs send to backup rpool to a file in the
> data pool...  And then use legato etc to backup the data pool...

As Edward pointed out later, this might be OK as a disaster-recovery
approach but isn't suitable for the situation where you want to restore
a subset of the files (eg you need to recover a file someone
accidentally deleted) and a zfs send stream isn't intended for storage.

Another potential downside is that the only way to read the stream is
using zfs recv into ZFS - this could present a problem if you wanted to
migrate the data into a different filesystem.  (All other restore
utilities I'm aware of use normal open/write/chmod/... interfaces so
you can restore your backup into any filesystem).

Finally, the send/recv protocol is not guaranteed to be compatible
between ZFS versions.  I'm not aware of any specific issues (though
someone reports that a zfs.v15 send | zfs.v22 recv caused pool
corruption in another recent thread) and would hope that zfs recv would
always maintain full compatibility with older zfs send.

> But I hope you can completely abandon the whole 3rd party backup
> software and tapes.  Some people can, and others cannot.  By far, the
> fastest best way to backup ZFS is to use zfs send | zfs receive on
> another system or a set of removable disks.

Unfortunately, this doesn't fit cleanly into the traditional enterprise
backup solution where Legato/NetBackup/TSM/... backs up into a SILO
with automatic tape replication and off-site rotation.

> Incidentally, when you do incremental zfs send, you have to specify the
> from and to snapshots.  So there must be at least one identical
> snapshot in the sending and receiving system (or else your only option
> is to do a complete full send.)

And (at least on v15) if you are using an incremental replication
stream and you create (or clone) a new descendent filesystem, you will
need to manually manage the initial replication of that filesystem.

BTW, if you do elect to build a bootable, removable drive for backups,
you should be aware that gzip compression isn't supported - at least in
v15, trying to make a gzip compressed filesystem bootable or trying to
set compression=gzip on a bootable filesystem gives a very
uninformative error message and it took a fair amount of trawling
through the source code to find the real cause.

-- Peter Jeremy
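For reference, a send/recv cycle to another pool of the kind discussed
above might look roughly like this (pool and snapshot names are only
placeholders):

  zfs snapshot -r rpool@backup-1
  zfs send -R rpool@backup-1 | zfs recv -Fdu backuppool        # initial full copy
  # ...later, send only the changes since the previous snapshot:
  zfs snapshot -r rpool@backup-2
  zfs send -R -i @backup-1 rpool@backup-2 | zfs recv -Fdu backuppool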
Re: [zfs-discuss] Quick zfs send -i performance questions
On 2011-May-04 08:39:39 +0800, Rich Teer rich.t...@rite-group.com wrote:
> Also related to this is a performance question.  My initial test
> involved copying a 50 MB zfs file system to a new disk, which took 2.5
> minutes to complete.  The strikes me as being a bit high for a mere
> 50 MB; are my expectation realistic or is it just because of my very
> budget concious set up?  If so, where's the bottleneck?

Possibilities I can think of:
- Do you have lots of snapshots?  There's an overhead of a second or so
  for each snapshot to be sent.
- Is the source pool heavily fragmented with lots of small files?

> The source pool is on a pair of 146 GB 10K RPM disks on separate busses
> in a D1000 (split bus arrangement) and the destination pool is on a
> IOMega 1 GB USB attached disk.  The machine to which both pools are
> connected is a Sun Blade 1000 with a pair of 900 MHz US-III CPUs and
> 2 GB of RAM.

Hopefully a silly question but does the SB1000 support USB2?  All of
the Sun hardware I've dealt with only has USB1 ports.  And, BTW, 2GB
RAM is very light on for ZFS (though I note you only have a very small
amount of data).

-- Peter Jeremy
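Counting the snapshots involved is a quick first check - at a second or
so each, a hundred snapshots would account for most of that 2.5 minutes
(pool name is only an example):

  zfs list -H -t snapshot -r tank | wc -l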
Re: [zfs-discuss] zfs incremental send?
On 2011-Mar-29 02:19:30 +0800, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote:
> Is it (or will it) be possible to do a partial/resumable zfs
> send/receive?  If having 30TB of data and only a gigabit link, such
> transfers takes a while, and if interrupted, will require a re-transmit
> of all the data.

zfs send/receive works on snapshots: The smallest chunk of data that
can be sent/received is the delta between two snapshots.  There's no
way to do a partial delta - defining the endpoint of a partial transfer
or the starting point for resumption is effectively a snapshot.

For an initial replication of a large amount of data, the most feasible
approach is probably to temporarily co-locate the destination disk
array with the server to copy the data across.  You can reduce the size
of each incremental chunk by taking frequent snapshots (these can be
deleted once they have been replicated to the backup host).

-- Peter Jeremy
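The frequent-snapshot approach boils down to repeating something like
the following (a sketch - dataset, snapshot names and the ssh target
are placeholders); each iteration only re-sends the delta since the
last successfully received snapshot:

  # assumes tank/data@2011-03-28 already exists on both sides
  zfs snapshot tank/data@2011-03-29
  zfs send -i tank/data@2011-03-28 tank/data@2011-03-29 | \
          ssh backuphost zfs recv -d backup \
      && zfs destroy tank/data@2011-03-28      # old snapshot no longer needed locally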
Re: [zfs-discuss] Invisible snapshot/clone
On 2011-Mar-17 10:23:01 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> To find it, run zdb -d, and search for something with a %
> Something like:  zdb -d tank | grep %
> And then you can zfs destroy the thing.

Thanks, that worked.

> P.S.  Every time I did this, the zfs destroy would complete with some
> sort of error message, but then if you searched for the thing again,
> you would see that it actually completed successfully.

Likewise, I had 'zfs destroy' whinge but the offending clone was gone.

> P.S.  If your primary goal is to use ZFS, you would probably be better
> switching to nexenta or openindiana or solaris 11 express, because they
> all support ZFS much better than freebsd.

I'm primarily interested in running FreeBSD and will be upgrading to
ZFSv28 once it's been shaken out a bit longer.

-- Peter Jeremy
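For the archives, the sequence looks roughly like this (the '%recv'
name is just an example - an interrupted receive typically leaves a
hidden clone with a '%' in its name; destroy whatever name the grep
actually turns up):

  zdb -d zroot | grep %
  zfs destroy zroot/home/%recv     # may print an error but still removes the clone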
[zfs-discuss] Invisible snapshot/clone
I am in the process of upgrading from FreeBSD-8.1 with ZFSv14 to
FreeBSD-8.2 with ZFSv15 and, following a crash, have run into a problem
with ZFS claiming a snapshot or clone exists that I can't find.

I was transferring a set of snapshots from my primary desktop to a
backup host (both ZFSv14) using:

zfs send -I zroot/home@20110210bu -R zroot/home@20110317bu | \
    ssh backup_host zfs recv -vd zroot

and whilst that was in progress, I did a 'df -k' on backup_host.  At
this point, both the df and the zfs recv wedged unkillably.  zfs list
showed that the last snapshot on the destination system was
zroot/home@20110309, so I did a rollback to it (which reported no
error) and ran:

zfs send -I zroot/home@20110309 -R zroot/home@20110317bu | \
    ssh backup_host zfs recv -vd zroot

which reported:

receiving incremental stream of zroot/home@20110310 into zroot/home@20110310
cannot restore to zroot/home@20110310: destination already exists
warning: cannot send 'zroot/home@20110310': Broken pipe

I cannot find anything by that name (or any snapshots later than
zroot/home@20110309 or any clones) and cannot destroy
zroot/home@20110309:

# zfs rollback zroot/home@20110309
# zfs destroy zroot/home@20110309
cannot destroy 'zroot/home@20110309': dataset already exists
# zfs destroy -r zroot/home@20110309
cannot destroy 'zroot/home@20110309': snapshot is cloned
no snapshots destroyed
# zfs destroy -R zroot/home@20110309
cannot destroy 'zroot/home@20110309': snapshot is cloned
no snapshots destroyed
# zfs destroy -frR zroot/home@20110309
cannot destroy 'zroot/home@20110309': snapshot is cloned
no snapshots destroyed
# zfs list -t all | grep home@20110310
# zfs get all | grep origin
# zfs get all | grep home@20110310
#

I have tried rebooting, upgrading the pool from v14 to v15 and
export/import without success.  Does anyone have any other suggestions?

zpool history -i looks like:

2011-03-17.08:02:57 zfs rollback zroot/home@20110210bu
2011-03-17.08:02:59 zfs recv -vd zroot
2011-03-17.08:02:59 [internal replay_inc_sync txg:872817696] dataset = 973
2011-03-17.08:02:59 [internal reservation set txg:872817697] 0 dataset = 469
...
2011-03-17.08:09:41 [internal snapshot txg:872817974] dataset = 1203
2011-03-17.08:09:42 [internal replay_inc_sync txg:872817975] dataset = 1208
2011-03-17.08:09:42 [internal reservation set txg:872817976] 0 dataset = 469
2011-03-17.08:09:42 [internal property set txg:872817977] compression=10 dataset = 469
2011-03-17.08:09:42 [internal property set txg:872817977] mountpoint=/home dataset = 469
2011-03-17.08:09:50 [internal destroy_begin_sync txg:872817980] dataset = 1208
2011-03-17.08:09:51 [internal destroy txg:872817983] dataset = 1208
2011-03-17.08:09:51 [internal reservation set txg:872817983] 0 dataset = 0
2011-03-17.08:09:51 [internal snapshot txg:872817984] dataset = 1212
2011-03-17.08:09:52 [internal replay_inc_sync txg:872817985] dataset = 1217
2011-03-17.08:09:52 [internal reservation set txg:872817986] 0 dataset = 469
2011-03-17.08:09:52 [internal property set txg:872817987] compression=10 dataset = 469
2011-03-17.08:09:52 [internal property set txg:872817987] mountpoint=/home dataset = 469
 system wedged here 
2011-03-17.08:35:01 [internal rollback txg:872818038] dataset = 469
2011-03-17.08:35:01 zfs rollback zroot/home@20110309
2011-03-17.08:35:14 zfs recv -vd zroot
2011-03-17.08:36:37 [internal pool scrub txg:872818059] func=1 mintxg=0 maxtxg=872818059
2011-03-17.08:36:41 zpool scrub zroot
2011-03-17.09:17:27 [internal pool scrub done txg:872818513] complete=1
2011-03-17.09:19:44 [internal rollback txg:872818542] dataset = 469
2011-03-17.09:19:45 zfs rollback zroot/home@20110309
2011-03-17.10:51:38 [internal rollback txg:872819603] dataset = 469
2011-03-17.10:51:39 zfs rollback zroot/home@20110309
2011-03-17.10:54:11 zpool upgrade zroot
2011-03-17.10:59:12 [internal rollback txg:872819688] dataset = 469
2011-03-17.10:59:12 zfs rollback zroot/home@20110309
2011-03-17.11:16:38 [internal rollback txg:872819895] dataset = 469
2011-03-17.11:16:39 zfs rollback zroot/home@20110309
2011-03-17.11:16:54 zpool export zroot
2011-03-17.11:17:31 zpool import zroot
2011-03-17.11:30:13 [internal rollback txg:872819992] dataset = 469
2011-03-17.11:30:13 zfs rollback zroot/home@20110309
2011-03-17.12:01:02 zfs recv -vd zroot
2011-03-17.12:03:57 [internal rollback txg:872820399] dataset = 469
2011-03-17.12:03:57 zfs rollback zroot/home@20110309

-- Peter Jeremy
Re: [zfs-discuss] Free space on ZFS file system unexpectedly missing
On 2011-Mar-10 05:50:53 +0800, Tom Fanning m...@tomfanning.eu wrote:
> I have a FreeNAS 0.7.2 box, based on FreeBSD 7.3-RELEASE-p1, running
> ZFS with 4x1TB SATA drives in RAIDz1.  I appear to have lost 1TB of
> usable space after creating and deleting a 1TB sparse file.  This
> happened months ago.

AFAIR, ZFS on FreeBSD 7.x was always described as experimental.  This
is a known problem (OpenSolaris bug id 6792701) that was fixed in
OpenSolaris onnv revision 9950:78fc41aa9bc5, which was committed to
FreeBSD as r208775 in head and r208869 in 8-stable.  The fix was never
back-ported to 7.x and I am unable to locate any workaround.

> - Exported the pool from FreeBSD, imported it on OpenIndiana 148 - but
>   not upgraded - same problem, much newer ZFS implementation.  Can't
>   upgrade the pool to see if the issue goes away since for now I need a
>   route back to FreeBSD and I don't have spare storage.

I thought that just importing a pool on a system with the bugfix would
free the space.  If that doesn't work, your only options are to either
upgrade to FreeBSD 8.1-RELEASE or later (preferably 8.2, since there
are a number of other fairly important ZFS fixes since 8.1) and upgrade
your pool to v15, or rebuild your pool (via send/recv or similar).

-- Peter Jeremy
Re: [zfs-discuss] RAID Failure Calculator (for 8x 2TB RAIDZ)
On 2011-Feb-07 14:22:51 +0800, Matthew Angelo bang...@gmail.com wrote:
> I'm actually more leaning towards running a simple 7+1 RAIDZ1.  Running
> this with 1TB is not a problem but I just wanted to investigate at what
> TB size the scales would tip.

It's not that simple.  Whilst resilver time is proportional to device
size, it's far more impacted by the degree of fragmentation of the
pool.  And there's no 'tipping point' - it's a gradual slope so it's
really up to you to decide where you want to sit on the probability
curve.

> I understand RAIDZ2 protects against failures during a rebuild process.
> This would be its current primary purpose.  Currently, my RAIDZ1 takes
> 24 hours to rebuild a failed disk, so with 2TB disks and worse case
> assuming this is 2 days this is my 'exposure' time.

Unless this is a write-once pool, you can probably also assume that
your pool will get more fragmented over time, so by the time your pool
gets to twice its current capacity, it might well take 3 days to
rebuild due to the additional fragmentation.

One point I haven't seen mentioned elsewhere in this thread is that all
the calculations so far have assumed that drive failures are
independent.  In practice, this probably isn't true.  All HDD
manufacturers have their off days - where whole batches or models of
disks are cr*p and fail unexpectedly early.  The WD EARS is simply a
demonstration that it's WD's turn to turn out junk.  Your best
protection against this is to have disks from enough different batches
that a batch failure won't take out your pool.  PSU, fan and SATA
controller failures are likely to take out multiple disks but it's far
harder to include enough redundancy to handle this and your best
approach is probably to have good backups.

> I will be running hot (or maybe cold) spare.  So I don't need to factor
> in the time it takes for a manufacturer to replace the drive.

In which case, the question is more whether 8-way RAIDZ1 with a hot
spare (7+1+1) is better than 9-way RAIDZ2 (7+2).  In the latter case,
your hot spare is already part of the pool so you don't lose the
time-to-notice plus time-to-resilver before regaining redundancy.  The
downside is that actively using the hot spare may increase the
probability of it failing.

-- Peter Jeremy
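Expressed as pool layouts, the two alternatives are (device names are
only illustrative):

  # 8-way raidz1 plus a hot spare (7+1+1)
  zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 spare c0t8d0
  # 9-way raidz2 (7+2) - the "spare" is already resilvered into the pool
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0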
Re: [zfs-discuss] Best choice - file system for system
On 2011-Jan-28 21:37:50 +0800, Edward Ned Harvey
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
>> 2- When you want to restore, it's all or nothing.  If a single bit is
>> corrupt in the data stream, the whole stream is lost.
> Regarding point #2, I contend that zfs send is better than ufsdump.  I
> would prefer to discover corruption in the backup, rather than blindly
> restoring it undetected.

OTOH, it renders ZFS send useless for backup or archival purposes.
With ufsdump, I can probably recover most of the data off a backup even
if it has some errors.  Since I'm aware of that problem, I can
separately store a file of expected checksums etc to verify what I
restore.  If I lose a file from one backup, I can hopefully retrieve it
from another backup.

With ZFS send, a 1-bit error renders my multi-GB backup useless.  I
can't get ZFS to restore the rest of the backup and tell me what is
missing - which might let me recover it in other ways.

-- Peter Jeremy
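The "separate file of expected checksums" can be as simple as a
manifest generated at backup time, e.g. (paths are illustrative; use
sha256/sha256sum/digest as appropriate for your platform):

  find /home -type f -exec sha256 {} + | sort > /backup/home.sha256
  # after a restore, rerun the same command and diff against the saved manifest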
Re: [zfs-discuss] multiple disk failure
On 2011-Jan-30 13:39:22 +0800, Richard Elling richard.ell...@gmail.com wrote:
> I'm not sure of the way BSD enumerates devices.  Some clever person
> thought that hiding the partition or slice would be useful.

No, there's no hiding.  /dev/ada0 always refers to the entire physical
disk.  If it had PC-style fdisk slices, there would be a sN suffix.  If
it had GPT partitions, there would be a pN suffix.  If it had BSD
partitions, there would be an alpha suffix [a-h].

> On a Solaris system, ZFS can show a disk something like c0t1d0, but
> that doesn't exist.

If we're discussing brokenness in OS device names, I've always thought
that reporting device names that don't exist and not having any way to
access the complete physical disk in Solaris was silly.  Having a fake
's2' meaning "the whole disk if there's no label" is a bad kludge.

Mike might like to try "gpart list" - which will display FreeBSD's view
of the physical disks.  It might also be worthwhile looking at a
hexdump of the first and last few MB of the faulty disks - it's
possible that the controller has decided to just shift things by a few
sectors so the labels aren't where ZFS expects to find them.

-- Peter Jeremy
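ZFS keeps two labels at the front and two at the back of each device,
so the interesting areas can be eyeballed with something like (device
name and the skip count - the disk size in MB minus a few - are only
illustrative):

  dd if=/dev/ada2 bs=1m count=4 | hexdump -C | less
  dd if=/dev/ada2 bs=1m skip=953866 | hexdump -C | less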
Re: [zfs-discuss] stupid ZFS question - floating point operations
On 2010-Dec-23 04:48:19 +0800, Deano de...@rattie.demon.co.uk wrote:
> modern CPU are float monsters indeed its likely some things would be
> faster if converted to use the float ALU

_Some_ modern CPUs are good at FP, a lot aren't.  The SPARC T-1 was
particularly poor as it only had a single FPU.  Likewise, performance
in the x86 world is highly variable, depending on the vendor and core
you pick.  AFAIK, IA64 and PPC are consistently good - but neither is
commonly found in conjunction with ZFS.

You may also need to allow for software assist: Very few CPUs implement
all of the IEEE FP standard in hardware and most (including SPARC)
require software to implement parts of the standard.  If your algorithm
happens to make significant use of things other than normalised numbers
and zero, your performance may be severely affected by the resultant
traps and software assistance.

Any use of floating point within the kernel also means changes to when
FPU context is saved - and, unless this can be implemented lazily, it
will adversely impact the cost of all context switches and potentially
system calls.

-- Peter Jeremy
Re: [zfs-discuss] Growing the swap vol?
On 2010-Nov-14 07:53:05 +0800, Ian Collins i...@ianshome.com wrote:
>> -BEGIN PGP SIGNATURE-
> PGP signatures are a PITA on mail lists!

Only when the mailing list software is broken.  Signatures are probably
more relevant on mailing lists than elsewhere and this is the only
mailing list I'm subscribed to where signatures get mangled.

-- Peter Jeremy
Re: [zfs-discuss] hardware going bad
On 2010-Oct-28 04:45:16 +0800, Harry Putnam rea...@newsguy.com wrote:
> Short of doing such a test, I have evidence already that machine will
> predictably shutdown after 15 to 20 minutes of uptime.

My initial guess is thermal issues.  Check that the fans are running
correctly and there's no dust/fluff buildup on the CPU heatsink.  The
BIOS might be able to report actual fan speeds.  It's also possible
that you have RAM or PSU problems and I'd also recommend running some
sort of offline stress test (eg memtest86 or the mersenne prime
tester).

> It seems there ought to be something, some kind of evidence and clues
> if I only knew how to look for them, in the logs.

Serious hardware problems are unlikely to be in the logs because the
system will die before it can write the error to disk and sync the
disks.  You are more likely to see a problem on the console.

-- Peter Jeremy
Re: [zfs-discuss] Jumping ship.. what of the data
On 2010-Oct-28 04:54:00 +0800, Harry Putnam rea...@newsguy.com wrote:
> If I were to decide my current setup is too problem beset to continue
> using it, is there a guide or some good advice I might employ to scrap
> it out and build something newer and better in the old roomy midtower?

I'd scrap the existing PSU as well unless you are sure it is OK -
consumer grade PSUs don't have especially long lives.

> I'm a bit worried about whether with modern hardware the IDE drives
> will even have a hookup.  If it does, can I just hook the two rpool
> discs up to two of them and expect it to boot OK?

Most current motherboards still have one IDE channel, though they may
not be able to boot off it.  It's also still very easy to find PCIe
cards with IDE ports (some have SATA as well).  Again, you will need to
check the fine print to make sure that they support booting off IDE.

Assuming that you aren't currently using any hardware RAID, then there
should be no problems accessing any of your existing pools from a new
motherboard.  Booting off your IDE rpool just relies on BIOS support
for IDE booting (which you will need to verify).

> I expect to make sure I have a goodly number of sata connections even
> if it means extra cards, but again, can just hook the other mirrored
> discs up and expect them to just work.

Finding PCIe x1 cards with more than 2 SATA ports is difficult so you
might want to make sure that either your chosen motherboard has lots of
PCIe slots or has some wider slots.  If you plan on using on-board
video and re-using the x16 slot for something else, you should verify
that the BIOS will let you do that - I've got several (admittedly old)
systems where the x16 slot must either be empty or have a video card to
work.

If you are concerned about reliability, you might like to look at
motherboard and CPU combinations that support ECC RAM.  I believe all
Asus AMD boards now support ECC and some Gigabyte boards do (though
identifying them can be tricky).  See the archives for lots more
discussion on suggested systems for ZFS.

> Would I expect to need to reinstall for starters?

With care, nothing.

-- Peter Jeremy
Re: [zfs-discuss] Balancing LVOL fill?
On 2010-Oct-21 01:28:46 +0800, David Dyer-Bennet d...@dd-b.net wrote:
> On Wed, October 20, 2010 04:24, Tuomas Leikola wrote:
>> I wished for a more aggressive write balancer but that may be too much
>> to ask for.
> I don't think it can be too much to ask for.  Storage servers have long
> enough lives that adding disks to them is a routine operation; to the
> extent that that's a problem, that really needs to be fixed.

It will (should) arrive as part of the mythical block pointer rewrite
project.

-- Peter Jeremy
Re: [zfs-discuss] Newbie ZFS Question: RAM for Dedup
On 2010-Oct-20 08:36:30 +0800, Never Best qui...@hotmail.com wrote:
> Sorry I couldn't find this anywhere yet.  For deduping it is best to
> have the lookup table in RAM, but I wasn't too sure how much RAM is
> suggested?

*Lots*

> ::Assuming 128KB Block Sizes, and 100% unique data:
> 1TB*1024*1024*1024/128 = 8388608 Blocks
> ::Each Block needs 8 byte pointer?
> 8388608*8 = 67108864 bytes
> ::Ram suggest per TB
> 67108864/1024/1024 = 64MB
> So if I understand correctly we should have a min of 64MB RAM per TB
> for deduping?  *hopes my math wasn't way off*, or is there significant
> extra overhead stored per block for the lookup table?

The rule-of-thumb is 270 bytes per DDT entry - that means a minimum of
2.2GB of RAM (or fast L2ARC) per TB.  And note that 128KB is the
maximum blocksize - it's quite likely that you will have smaller blocks
(which implies more RAM).  I know my average blocksize is only a few
KB.

-- Peter Jeremy
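Redoing the arithmetic with that per-entry overhead (still assuming the
best case of 128KB blocks and 100% unique data):

  1TiB / 128KiB              = 8388608 blocks
  8388608 blocks * 270 bytes ~= 2.1GiB of DDT per TiB stored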
Re: [zfs-discuss] How to avoid striping ?
On 2010-Oct-18 17:45:34 +0800, casper@sun.com casper@sun.com wrote:
> Write-lock (wlock) the specified file-system.  wlock suspends writes
> that would modify the file system.  Access times are not kept while a
> file system is write-locked.
>
> All the applications trying to write will suspend.  What would be the
> risk of that?

At least some versions of Oracle rdbms have timeouts around I/O and
will abort if I/O operations don't complete within a short period.

-- Peter Jeremy
Re: [zfs-discuss] [RFC] Backup solution
On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey sh...@nedharvey.com wrote:
> If you're going raidz3, with 7 disks, then you might as well just make
> mirrors instead, and eliminate the slow resilver.

There is a difference in reliability: raidzN means _any_ N disks can
fail, whereas mirror means one disk in each mirror pair can fail.  With
a mirror, Murphy's Law says that the second disk to fail will be the
pair of the first disk :-).

-- Peter Jeremy
Re: [zfs-discuss] TLER and ZFS
On 2010-Oct-06 05:59:06 +0800, Michael DeMan sola...@deman.com wrote:
> Another annoying thing with the whole 4K sector size, is what happens
> when you need to replace drives next year, or the year after?

About the only mitigation needed is to ensure that any partitioning is
based on multiples of 4KB.

> Does anybody know if there any vendors that are shipping 4K sector
> drives that have a jumper option to make them 512 size?

This would require a low-level re-format and would significantly reduce
the available space if it was possible at all.

> WD has a jumper, but is there explicitly to work with WindowsXP, and is
> not a real way to dumb down the drive to 512.

All it does is offset the sector numbers by 1 so that sector 63 becomes
physical sector 64 (a multiple of 4KB).

> I would presume that any vendor that is shipping 4K sector size drives
> now, with a jumper to make it 'real' 512, would be supporting that over
> the long run?

I would be very surprised if any vendor shipped a drive that could be
jumpered to real 512 bytes.  The best you are going to get is jumpered
to logical 512 bytes and maybe a 1-sector offset (needed for WindozeXP
only).  These jumpers will probably last as long as the 8GB jumpers
that were needed by old BIOS code.  (Eg BIOS boots using simulated
512-byte sectors and then the OS tells the drive to switch to native
mode).

It's unfortunate that Sun didn't bite the bullet several decades ago
and provide support for block sizes other than 512 bytes instead of
getting custom firmware for their CD drives to make them provide
512-byte logical blocks for 2KB CD-ROMs.  It's even more idiotic of WD
to sell a drive with 4KB sectors but not provide any way for an OS to
identify those drives and perform 4KB-aligned I/O.

-- Peter Jeremy
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
On 2010-Sep-24 00:58:47 +0800, R.G. Keen k...@geofex.com wrote: That may not be the best of all possible things to do on a number of levels. But for me, the likelihood of making a setup or operating mistake in a virtual machine setup server far outweighs the hardware cost to put another physical machine on the ground. The downsides are generally that it'll be slower and less power-efficient than a current-generation server and the I/O interfaces will also be last-generation (so you are more likely to be stuck with parallel SCSI and PCI or PCIx rather than SAS/SATA and PCIe). And when something fails (fan, PSU, ...), it's more likely to be customised in some way that makes it more difficult/expensive to repair/replace. In fact, the issue goes further. Processor chipsets from both Intel and AMD used to support ECC on an ad-hoc basis. It may have been there, but may or may not have been supported by the motherboard. Intel's recent chipsets emphatically do not support ECC. Not quite. When Intel moved the memory controllers from the northbridge into the CPU, they made a conscious decision to separate server and desktop CPUs and chipsets. The desktop CPUs do not support ECC whereas the server ones do - this way they can continue to charge a premium for server-grade parts and prevent the server manufacturers from using lower-margin desktop parts. This means that if you want an Intel-based solution, you need to look at a Xeon CPU. That said, the low-end Xeons aren't outrageously expensive and you generally wind up with support for registered RAM - and registered ECC RAM is often easier to find than unregistered ECC RAM. AMDs do, in general. AMD chose to leave ECC support in almost all their higher-end memory controllers, rather than use it as a market differentiator. AFAIK, all non-mobile Athlon, Phenom and Opteron CPUs support ECC, whereas the lower-end Sempron, Neo, Turion and Geode CPUs don't. Note that Athlon and Phenom CPUs normally need unbuffered RAM whereas Opteron CPUs normally want buffered/registered RAM. However, the motherboard must still support the ECC reporting in hardware and BIOS for ECC to actually work, and you have to buy the ECC memory. In the case of AMD motherboards, it's really just laziness on the manufacturer's part to not bother routing the additional tracks. The newer the Intel motherboard, the less likely and more expensive ECC is. Older Intel motherboards sometimes did support ECC, as a side note. On older Intel motherboards, it was a chipset issue rather than a CPU issue (and even if the chipset supported ECC, the motherboard manufacturer might have decided to not bother running the ECC tracks). There's about sixteen more pages of typing to cover the issue even modestly correctly. The bottom line is this: for current-generation hardware, buy an AMD AM3 socket CPU, ASUS motherboard, and ECC memory. DDR2 and DDR3 ECC memory is only moderately more expensive than non-ECC. Asus appears to have made a conscious decision to support ECC on all AMD motherboards whereas other vendors support it sporadically, and determining whether a particular motherboard supports ECC can be quite difficult since it's never one of the options in their motherboard selection tools. And when picking the RAM, make sure it's compatible with your motherboard - motherboards are virtually never compatible with both unbuffered and buffered RAM.
I also bought new, high-quality power supplies for $40-$60 per machine because the power supply is a single point of failure, and wears out - that's a fact that many people ignore until the machine doesn't come up one day. "Doesn't come up one day" is at least a clear failure. With a cheap (or under-dimensioned) PSU, things are more likely to go out of tolerance under heavy load, so you wind up with unrepeatable strange glitches. Think about what happens if you find a silent bit corruption in a file system that includes encrypted files. Or compressed files. -- Peter Jeremy pgp2gl67ZdR99.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Unwanted filesystem mounting when using send/recv
I am looking at backing up my fileserver by replicating the filesystems onto an external disk using send/recv with something similar to: zfs send ... myp...@snapshot | zfs recv -d backup but have run into a bit of a gotcha with the mountpoint property: - If I use zfs send -R ... then the mountpoint gets replicated and the backup gets mounted over the top of my real filesystems. - If I skip the '-R' then none of the properties get backed up. Is there some way to have zfs recv not automatically mount filesystems when it creates them? -- Peter Jeremy pgpOliK2tC1Vs.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
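One hedged answer to the above question (verify against the man page for your build; mypool and backup are placeholder names): newer zfs receive implementations accept -u, which skips mounting the filesystems it creates, and importing or creating the backup pool with an altroot keeps replicated mountpoints from landing on top of the live ones:

  # Option 1: ask zfs recv not to mount what it creates (requires a zfs
  # recv that supports -u).
  zfs send -R mypool@snapshot | zfs recv -d -u backup

  # Option 2: give the backup pool an altroot so received mountpoints are
  # re-rooted under /backup instead of shadowing the live filesystems.
  zpool import -R /backup backup
  zfs send -R mypool@snapshot | zfs recv -d backup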
Re: [zfs-discuss] 64-bit vs 32-bit applications
On 2010-Aug-18 04:40:21 +0800, Joerg Schilling joerg.schill...@fokus.fraunhofer.de wrote: Ian Collins i...@ianshome.com wrote: Some applications benefit from the extended register set and function call ABI, others suffer due to increased sizes impacting the cache. Well, please verify your claims as they do not meet my experience. I would agree with Ian that it varies. I have recently been evaluating a number of different SHA256 implementations and have just compared the 32-bit vs 64-bit performance on both x86 (P4 nocona using gcc 4.2.1) and SPARC (US-IVa using Studio12). Comparing the different implementations on each platform, the differences between best and worst varied from 10% to 27% depending on the platform (and the slowest algorithm on x86/64 was equal fastest on the other 3 platforms). Comparing the 32-bit vs 64-bit version of each implementation on each platform, the difference between 32-bit and 64-bit varied from -11% to +13% on SPARC and from no change to +68% on x86. My interpretation of those results is that you can't generalise: The only way to determine whether your application is faster in 32-bit or 64-bit mode is to test it. And your choice of algorithm is at least as important as whether it's 32-bit or 64-bit. -- Peter Jeremy pgpSec5hUa4mU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
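In practice, "test it" can be as simple as building the same source both ways and timing it on representative input; sha256.c and testfile below are placeholders for whatever you are measuring, and recent gcc and Studio releases both accept -m32/-m64:

  # Build 32-bit and 64-bit versions of the same code and compare.
  cc -O2 -m32 -o sha256_32 sha256.c
  cc -O2 -m64 -o sha256_64 sha256.c
  time ./sha256_32 testfile
  time ./sha256_64 testfile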
Re: [zfs-discuss] Opensolaris is apparently dead
On 2010-Aug-16 08:17:10 +0800, Garrett D'Amore garr...@nexenta.com wrote: For either ZFS or BTRFS (or any other filesystem) to survive, there have to be sufficiently skilled developers with an interest in developing and maintaining it (whether the interest is commercial or recreational). Agreed. And this applies to OpenSolaris (or Illumos or any other fork) as well. Honestly, I think both ZFS and btrfs will continue to be invested in by Oracle. Given that both provide similar features, it's difficult to see why Oracle would continue to invest in both. Given that ZFS is the more mature product, it would seem more logical to transfer all the effort to ZFS and leave btrfs to die. Irrespective of the above, there is nothing requiring Oracle to release any future btrfs or ZFS improvements (or even bugfixes). They can't retrospectively change the license on already released code but they can put a different (non-OSS) license on any new code. -- Peter Jeremy pgpuCWzXnMlHq.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FreeBSD 8.1 out, has zfs version 14 and can boot from zfs
On 2010-Jul-27 19:43:50 +0800, Andrey V. Elsukov bu7c...@yandex.ru wrote: On 27.07.2010 1:57, Peter Jeremy wrote: Note that ZFS v15 has been integrated into the development branches (-current and 8-stable) and will be in FreeBSD 8.2 (or you can run it ZFS v15 is not yet in 8-stable. Only in HEAD. Perhaps it will be merged into stable after 2 months. Oops, sorry. There are patches available for 8-stable (which I'm running). I misremembered the commit message. -- Peter Jeremy pgpHQlZ2UoRAA.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FreeBSD 8.1 out, has zfs version 14 and can boot from zfs
On 2010-Jul-26 20:32:41 +0800, Eugen Leitl eu...@leitl.org wrote: FreeBSD 8.1 features version 14 of the ZFS subsystem, the addition of the ZFS Loader (zfsloader), allowing users to boot from ZFS, Only on i386 or amd64 systems at present, but you can boot RAIDZ1 and RAIDZ2 as well as mirrored roots. Note that ZFS v15 has been integrated into the development branches (-current and 8-stable) and will be in FreeBSD 8.2 (or you can run it now by compiling FreeBSD yourself - unlike OpenSolaris, the full build process is documented and everything necessary is on the release DVDs or can be downloaded). See http://www.freebsd.org/releases/8.1R/announce.html -- Peter Jeremy pgppFbh5U0Jj5.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
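For anyone who hasn't done it, the documented FreeBSD source build is roughly the following (the supfile path and GENERIC kernel config are the stock ones; adjust to taste):

  # Track the 8-stable source branch and rebuild world + kernel.
  csup -h cvsup.FreeBSD.org /usr/share/examples/cvsup/stable-supfile
  cd /usr/src
  make buildworld
  make buildkernel KERNCONF=GENERIC
  make installkernel KERNCONF=GENERIC
  # reboot (to single-user if possible), then:
  cd /usr/src && make installworld && mergemaster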
Re: [zfs-discuss] ZFS compression
On 2010-Jul-25 21:12:08 +0800, Ben ben.lav...@gmail.com wrote: I've read a small amount about compression, enough to find that it'll affect performance (not a problem for me) and that once you enable compression it only affects new files written to the file system. Is this still true of b134? And if it is, how can I compress all of the current data on the file system? Do I have to move it off then back on? Yes, changing things like compression, dedup etc only affects data written after the change. The only way to re-compress everything is to copy it off and back on again. Good news: There is an easy way to do this and preserve (whilst compressing) all your snapshots. All you need to do is set compression=gzip (or whatever you want) and then do a send/recv of that filesystem. The destination fileset will be completely created according to the source fileset parameters at the time of the send. If you have sufficient free space, you can even do a send|recv on the same system - but if the original fileset was mounted then this will result in the new fileset being mounted over the top of it, so you shouldn't do this on an active system. -- Peter Jeremy pgpBFqeTZn2jS.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
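Spelled out, the recompress-via-send/recv recipe looks something like the sketch below; tank/data, the snapshot name and the temporary _gz name are all placeholders, and -u (skip mounting on receive) may not exist on older builds:

  # Recompress an existing filesystem by replicating it within the same pool.
  zfs set compression=gzip tank/data
  zfs snapshot -r tank/data@recompress
  zfs send -R tank/data@recompress | zfs recv -u tank/data_gz
  # verify the copy, then swap the filesystems over:
  zfs rename tank/data tank/data_old
  zfs rename tank/data_gz tank/data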
Re: [zfs-discuss] Hashing files rapidly on ZFS
On 2010-Jul-09 06:46:54 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: md5 is significantly slower (but surprisingly not much slower) and it's a cryptographic hash. Probably not necessary for your needs. As someone else has pointed out, MD5 is no longer considered secure (neither is SHA-1). If you want cryptographic hashing, you should probably use SHA-256 for now and be prepared to migrate to SHA-3 once it is announced. Unfortunately, SHA-256 is significantly slower than MD5 (about 4 times on a P-4, about 3 times on a SPARC-IV) and no cryptographic hash is amenable to multi-threading . The new crypto instructions on some of Intel's recent offerings may help performance (and it's likely that they will help more with SHA-3). And one more thing. No matter how strong your hash is, unless your hash is just as big as your file, collisions happen. Don't assume data is the same just because hash is the same, if you care about your data. Always byte-level verify every block or file whose hash matches some other hash. In theory, collisions happen. In practice, given a cryptographic hash, if you can find two different blocks or files that produce the same output, please publicise it widely as you have broken that hash function. -- Peter Jeremy pgpiebzGoklvU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
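For rough relative throughput numbers on your own hardware, openssl's built-in benchmark is a quick way to compare digests (assuming the openssl utility is installed):

  # Higher numbers are better; compare md5 vs sha1 vs sha256 on this box.
  openssl speed md5 sha1 sha256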
Re: [zfs-discuss] Remove non-redundant disk
On 2010-Jul-08 02:39:05 +0800, Garrett D'Amore garr...@nexenta.com wrote: I believe that long term folks are working on solving this problem. I believe bp_rewrite is needed for this work. Accepted. Mid/short term, the solution to me at least seems to be to migrate your data to a new zpool on the newly configured array, etc. IMHO, this isn't an acceptable solution. Note that (eg) DEC/Compaq/HP AdvFS has supported vdev removal from day 1 and (until a couple of years ago), I had an AdvFS pool that had, over a decade, grown from a mirrored pair of 4.3GB disks to six pairs of mirrored 36GB disks - without needing any downtime for disk expansion. [Adding disks was done with mirror pairs because AdvFS didn't support any RAID5/6 style redundancy, the big win was being able to remove older vdevs so those disk slots could be reused]. Most enterprises don't incrementally upgrade an array (except perhaps to add more drives, etc.) This isn't true for me. It is not uncommon for me to replace an xGB disk with a (2x)GB disk to expand an existing filesystem - in many cases, it is not possible to add more drives because there are no physical slots available. And, one of the problems with ZFS is that, unless you don't bother with any data redundancy, it's not possible to add single drives - you can only add vdevs that are pre-configured with the desired level of redundancy. Disks are cheap enough that its usually not that hard to justify a full upgrade every few years. (Frankly, spinning rust MTBFs are still low enough that I think most sites wind up assuming that they are going to have to replace their storage on a 3-5 year cycle anyway. We've not yet seen what SSDs do that trend, I think.) Maybe in some environments. We tend to run equipment into the ground and I know other companies with similar policies. And getting approval for a couple of thousand dollars of new disks is very much easier than getting approval for a complete new SAN with (eg) twice the capacity of the existing one. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Native ZFS for Linux
On 2010-Jun-11 17:41:38 +0800, Joerg Schilling joerg.schill...@fokus.fraunhofer.de wrote: PP.S.: Did you know that FreeBSD has _included_ the GPLd Reiserfs in the FreeBSD kernel for a while and that nobody has complained about this, see e.g.: http://svn.freebsd.org/base/stable/8/sys/gnu/fs/reiserfs/ That is completely irrelevant and somewhat misleading. FreeBSD has never prohibited non-BSD-licensed code in their kernel or userland; however, it has always been optional and, AFAIR, the GENERIC kernel has always defaulted to containing only BSD code. Non-BSD code (whether GPL or CDDL) is carefully segregated (note the 'gnu' in the above URI). -- Peter Jeremy pgpvmgKqx7nJf.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Pool revcovery from replaced disks.
On 2010-May-18 19:06:11 +0800, Demian Phillips demianphill...@gmail.com wrote: Is it possible to recover a pool (as it was) from a set of disks that were replaced during a capacity upgrade? If no other writes occurred during the capacity upgrade then I'd suspect it would be possible. The transaction numbers would still vary across the drives and the pool information would be inconsistent but I suspect a recent version of ZFS could manage to recover. It might be possible to test this by creating a small, file-backed RAIDZn zpool, simulating a capacity upgrade, exporting that pool and trying to import the original zpool from the detached files. -- Peter Jeremy pgp5OU8Gba0CI.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
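A hedged sketch of that experiment with a small file-backed pool (all names and sizes below are arbitrary): build a raidz on one set of files, "upgrade" it onto a second set, then see whether the pool will still import from the original files:

  mkdir /var/tmp/old /var/tmp/new
  mkfile 128m /var/tmp/old/d0 /var/tmp/old/d1 /var/tmp/old/d2
  zpool create testpool raidz /var/tmp/old/d0 /var/tmp/old/d1 /var/tmp/old/d2
  mkfile 256m /var/tmp/new/d0 /var/tmp/new/d1 /var/tmp/new/d2
  # "capacity upgrade": replace each backing file in turn, waiting for each
  # resilver to complete before starting the next.
  zpool replace testpool /var/tmp/old/d0 /var/tmp/new/d0
  zpool replace testpool /var/tmp/old/d1 /var/tmp/new/d1
  zpool replace testpool /var/tmp/old/d2 /var/tmp/new/d2
  zpool export testpool
  # now see whether the pool can be recovered from the *old* files only:
  zpool import -d /var/tmp/old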
Re: [zfs-discuss] Reverse lookup: inode to name lookup
On 2010-May-02 01:44:51 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: Obviously, the kernel has the facility to open an inode by number. However, for security reasons (enforcing permissions of parent directories before the parent directories have been identified), the ability to open an arbitrary inode by number is not normally made available to user level applications, except perhaps when run by root. There is no provision in normal Unix to open a file by inode from userland. Some filesystems (eg HP Tru64) may provide a special pseudo-directory that exposes all the inodes. Note that opening a file by inode number is a completely different issue to mapping an inode number to a pathname. because: (a) every directory contains an entry .. which refers to its parent by number, and (b) every directory has precisely one parent, and no more. There is no such thing as a hardlink copy of a directory. Therefore, there is exactly one absolute path to any directory in any ZFS filesystem. s/is/should be/ - I haven't checked with ZFS but it may be possible to trick/corrupt the filesystem into allowing a second real name (though the filesystem is then inconsistent). If the kernel (or root) can open an arbitrary directory by inode number, then the kernel (or root) can find the inode number of its parent by looking at the '..' entry, which the kernel (or root) can then open, and identify both: (a) the name of the child subdir whose inode number is already known, and (b) yet another '..' entry. The kernel (or root) can repeat this process recursively, up to the root of the filesystem tree. At that time, the kernel (or root) has completely identified the absolute path of the inode that it started with. Any user can do this (subject to permissions) and this is how 'pwd' was traditionally implemented. Note that you need to check device and inode, not just inode, to correctly handle mountpoints. -- Peter Jeremy pgpsc9geRSx95.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
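The traditional pwd algorithm is easy to sketch in shell for anyone who hasn't seen it; this is a toy version that walks '..' upwards matching inode numbers, breaks on filenames containing whitespace, and ignores the device/mountpoint check mentioned above:

  # Naive pwd: at each level, find the current directory's name in '..' by
  # matching inode numbers, then move up until '.' and '..' are the same.
  walkup() (
      path=
      while [ "$(ls -di . | awk '{print $1}')" != "$(ls -di .. | awk '{print $1}')" ]
      do
          ino=$(ls -di . | awk '{print $1}')
          name=$(ls -ia .. | awk -v i="$ino" '$1 == i && $2 != "." && $2 != ".." {print $2}')
          path="/$name$path"
          cd .. || exit 1
      done
      printf '%s\n' "${path:-/}"
  )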
Re: [zfs-discuss] ZFS snapshot versus Netapp - Security and convenience
On 2010-Apr-30 21:56:46 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: How many bytes long is an inode number? I couldn't find that easily by googling, so for the moment, I'll guess it's a fixed size, and I'll guess 64bits (8 bytes). Based on a rummage in some header files, it looks like it's 8 bytes. How many bytes is that? Would it be exceptionally difficult to extend and/or make variable? Extending inodes increases the amount of metadata associated with a file, which increases overheads for small files. It looks like a ZFS inode is currently 264 bytes, but is always stored with a dnode and currently has some free space. ZFS code assumes that the physical dnode (dnode+znode+some free space) is a fixed size and making it variable is likely to be quite difficult. One important consideration in that hypothetical scenario would be fragmentation. If every inode were fragmented in two, that would be a real drag for performance. Perhaps every inode could be extended (for example) 32 bytes to accommodate a list of up to 4 parent inodes, but whenever the number of parents exceeds 4, the inode itself gets fragmented to store a variable list of parents. ACLs already do something like this. And having parent information stored away from the rest of the inode would not impact the normal inode access time since the parent information is not normally needed. On 2010-Apr-30 23:08:58 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: Therefore, it should be very easy to implement proof of concept, by writing a setuid root C program, similar to sudo which could then become root, identify the absolute path of a directory by its inode number, and then print that absolute path, only if the real UID has permission to ls that path. It doesn't need to be setuid. Check out http://minnie.tuhs.org/cgi-bin/utree.pl?file=V6/usr/source/s2/pwd.c http://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/pwd.c (The latter is somewhat more readable) While not trivial, it's certainly possible to extend inodes of files, to include parent pointers. This is a far more significant change and the utility is not clear. Also not trivial, it's certainly possible to make all this information available under proposed directories, .zfs/inodes or something similar. HP Tru64 already does something like this. -- Peter Jeremy pgp2nCFDIdxia.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 2010-May-03 23:59:17 +0800, Diogo Franco diogomfra...@gmail.com wrote: I managed to get a livefs cd that had zfs14, but it was unable to import the zpool (internal error: Illegal byte sequence). The zpool does appear if I try to run `zpool import` though, as tank FAULTED corrupted data, and ad6s1d is ONLINE. That's not promising. There is no -F option on bsd's zpool import. It was introduced around zfs20. I feared it might be needed. This is almost certainly the problem. ad6s1 may be the same as c5d0p1 but OpenSolaris isn't going to understand the FreeBSD partition label on that slice. All I can suggest is to (temporarily) change the disk slicing so that there is an fdisk slice that matches ad6s1d. How could I do just that? I know that my label has a 1G UFS, 1G swap, and the rest is ZFS; but I don't know how to calculate the correct offset to give to 'format'. I can just regenerate the UFS later after the ZFS is fixed since it was only used for its /boot. In FreeBSD, bsdlabel ad0s1 will report the size and offset of the 'd' partition in sectors. The offset is relative to the start of that slice - which would normally be absolute block 63 (fdisk ad0 will confirm that). Adding the offset of 's1' to the offset of 'd' will give you a sector offset for your ZFS data. I haven't tried using OpenSolaris on x86 so I'm not sure if format allows sector offsets (I know format on Solaris/SPARC insists on cylinder offsets). Since cylinders are a fiction anyway, you might be able to kludge a cylinder size to suit your offset if necessary. The FreeBSD fdisk(8) man page implies that slices start at a track boundary and end at a cylinder boundary but I'm not sure if this is a restriction on LBA disks. Note that if you keep a record of your existing c5d0 format and restore it later, this will recover your existing boot and swap so you shouldn't need to restore them. -- Peter Jeremy pgpHLIUCADaBM.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
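Putting example numbers on that calculation (these figures are only illustrative - use the ones your own bsdlabel/fdisk report): a 63-sector slice offset plus a 1G UFS 'a' partition and 1G swap 'b' partition would put 'd' at 4194304 sectors into the slice, so:

  # absolute starting sector of the ZFS data = slice offset + partition offset
  slice_start=63        # from: fdisk ad6
  d_offset=4194304      # from: bsdlabel ad6s1 ('d' partition offset)
  echo $(( slice_start + d_offset ))    # -> 4194367, the offset the Solaris slice must match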
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 2010-May-02 04:06:41 +0800, Diogo Franco diogomfra...@gmail.com wrote: regular data corruption and then the box locked up. I had also converted the pool to v14 a few days before, so the freebsd v13 tools couldn't do anything to help. Note that ZFS v14 was imported to FreeBSD 8-stable in mid-January. I can't comment whether it would be able to recover your data. On 2010-May-02 05:07:17 +0800, Bill Sommerfeld bill.sommerf...@oracle.com wrote: 2) the labels are not at the start of what solaris sees as p1, and thus are somewhere else on the disk. I'd look more closely at how freebsd computes the start of the partition or slice '/dev/ad6s1d' that contains the pool. I think #2 is somewhat more likely. This is almost certainly the problem. ad6s1 may be the same as c5d0p1 but OpenSolaris isn't going to understand the FreeBSD partition label on that slice. All I can suggest is to (temporarily) change the disk slicing so that there is a fdisk slice that matches ad6s1d. -- Peter Jeremy pgpuiR7yDRv37.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot versus Netapp - Security and convenience
On 2010-Apr-30 10:24:14 +0800, Edward Ned Harvey solar...@nedharvey.com wrote: Each inode contains a link count. In most cases, each inode has a link count of 1, but of course that can't be assumed. It seems trivially simple to me, that along with the link count in each inode, the filesystem could also store a list of which inodes link to it. If link count is 2, then there's a list of 2 inodes, which are the parents of this inode. I'm not sure exactly what you are trying to say here but I don't think it will work. In a Unix FS (UFS or ZFS), a directory entry contains a filename and a pointer to an inode. The inode itself contains a count of the number of directory entries that point to it and pointers to the actual data. There is currently no provision for a reverse link back to the directory. I gather you are suggesting that the inode be extended to contain a list of the inode numbers of all directories that contain a filename referring to that inode. Whilst I agree that this would simplify inode to filename mapping and provide an alternate mechanism for checking file permissions, I think you are glossing over the issue of how/where to store these links. Whilst files can have a link count of 1 (I'm not sure if this is true in most cases), they can have up to 32767 links. Where is this list of (up to) 32767 parent inodes going to be stored? In which case, it would be trivially easy to walk back up the whole tree, almost instantly identifying every combination of paths that could possibly lead to this inode, while simultaneously correctly handling security concerns about bypassing security of parent directories and everything. Whilst it's trivially easy to get from the file to the list of directories containing that file, actually getting from one directory to its parent is less so: A directory containing N sub-directories has N+2 links. Whilst the '.' link is easy to identify (it points to its own inode), distinguishing between the name of this directory in its parent and the '..' entries in its subdirectories is rather messy (requiring directory scans) unless you mandate that the reference to the parent directory is in a fixed location (ie 1st or 2nd entry in the parent inode list). It seems too perfect and too simple. Instead of a one-directional directed graph, simply make it bidirectional. There's no significant additional overhead as far as I can tell. It seems like it would even be easy. Well, you need to find somewhere to store up to 32K inode numbers, whilst having minimal space overhead for small numbers of links. Then you will need to patch the vnode operations underlying creat(), link(), unlink(), rename(), mkdir() and rmdir() to manage the backlinks (taking into account transactional consistency). -- Peter Jeremy pgpLmGCkPtpSv.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
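The N+2 link-count rule is easy to see from the shell (the /tmp/demo names are arbitrary):

  # A directory has 2 links of its own ('.' plus its name in the parent)
  # and one more for each subdirectory's '..'.
  mkdir -p /tmp/demo/sub1 /tmp/demo/sub2 /tmp/demo/sub3
  ls -ld /tmp/demo      # the link count column shows 5 = 3 subdirs + 2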
Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced drives
On 2010-Feb-03 00:12:43 +0800, Bob Friesenhahn bfrie...@simple.dallas.tx.us wrote: On Tue, 2 Feb 2010, David Dyer-Bennet wrote: Now, I'm sure not ALL drives offered at Newegg could qualify; but the question is, how much do I give up by buying an enterprise-grade drive from a major manufacturer, compared to the Sun-certified drive? If you have a Sun service contract, you give up quite a lot. If a Sun drive fails every other day, then Sun will replace that Sun drive every other day, even if the system warranty has expired. But if it is a non-Sun drive, then you have to deal with a disinterested drive manufacturer, which could take weeks or months. OTOH, if I'm paying 10x the street drive price upfront, plus roughly the street price annually in support, I can save a fair amount of money by just buying a pile of spare drives - when one fails, just swap it for a spare and it doesn't matter if it takes weeks for the vendor to swap it. Hopefully Oracle will do better than Sun at explaining the benefits and services provided by a service contract. I know that trying to get Sun to renew a service contract is like pulling teeth but Oracle is far worse - as far as I can tell, Oracle contracts are deliberately designed so you can't be certain whether you are compliant or not. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is ZFS internal reservation excessive?
On 2010-Jan-19 00:26:27 +0800, Jesus Cea j...@jcea.es wrote: On 01/18/2010 05:11 PM, David Magda wrote: Ext2/3 uses 5% by default for root's usage; 8% under FreeBSD for FFS. Solaris (10) uses a bit more nuance for its UFS: That reservation is to preclude users from exhausting disk space in such a way that even root cannot log in and solve the problem. At least for UFS-derived filesystems (ie FreeBSD and Solaris), the primary reason for the 8-10% reserved space is to minimise FS fragmentation and improve space allocation performance: More total free space means it's quicker and easier to find the required contiguous (or any) free space whilst searching a free space bitmap. Allowing root to eat into that reserved space provided a neat solution to resource starvation issues but was not the justification. I agree that is a lot of space but only 2% of a modern disk. My point is that 32GB is a lot of space to reserve to be able, for instance, to delete a file when the pool is full (thanks to COW). And more when the minimum reserved is 32MB and ZFS can get away with it. I think it could be a good thing to put a cap on the maximum implicit reservation. AFAIK, it's also necessary to ensure reasonable ZFS performance - the 'find some free space' issue becomes much more time critical with a COW filesystem. I recently had a 2.7TB RAIDZ1 pool get to the point where zpool was reporting ~2% free space - and performance was absolutely abysmal (fsync() was taking over 16 seconds). When I freed up a few percent more space, the performance recovered. Maybe it would be useful if ZFS allowed the reserved space to be tuned lower but, at least for ZFS v13, the reserved space seems to actually be a bit less than is needed for ZFS to function reasonably. -- Peter Jeremy pgpaYK13eLyWU.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
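A simple way to keep an eye on this is the CAP column of zpool list; once it creeps much past ~90% used (or down to a few percent free, as above), expect allocation - and fsync - times to degrade badly. 'tank' below is a placeholder pool name:

  zpool list                                    # CAP = percentage of pool used
  zfs list -o name,used,avail,refer -r tank     # where the space is going, per dataset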
Re: [zfs-discuss] raid-z as a boot pool
On 2009-Dec-16 00:26:28 +0800, Luca Morettoni l...@morettoni.net wrote: As reported here: http://hub.opensolaris.org/bin/view/Community+Group+zfs/zfsbootFAQ we can't boot from a pool with raidz, any plan to have this feature? Note that FreeBSD currently supports booting from RAIDZ (at least on i386). It may be possible to reuse some of that code. -- Peter Jeremy pgp0WiQELKoEj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering FAULTED zpool
On 2009-Nov-18 11:50:44 +1100, I wrote: I have a zpool on a JBOD SE3320 that I was using for data with Solaris 10 (the root/usr/var filesystems were all UFS). Unfortunately, we had a bit of a mixup with SCSI cabling and I believe that we created a SCSI target clash. The system was unloaded and nothing happened until I ran zpool status at which point things broke. After correcting all the cabling, Solaris panic'd before reaching single user. I wound up installing OpenSolaris snv_128a on some spare disks and this enabled me to recover the data. Thanks to Tim Haley and Victor Latushkin for their assistance. As a first attempt, 'zpool import -F data' said Destroy and re-create the pool from a backup source.. 'zpool import -nFX data' initially ran the system out of swap (I hadn't attached any swap and it only has 8GB RAM): WARNING: /etc/svc/volatile: File system full, swap space limit exceeded INIT: Couldn't write persistent state file `/etc/svc/volatile/init.state'. After rebooting and adding some swap (which didn't seem to ever get used), it did work (though it took several hours - unfortunately, I didn't record exactly how long): # zpool import -nFX data Would be able to return data to its state as of Thu Jan 01 10:00:00 1970. Would discard approximately 369 minutes of transactions. # zpool import -FX data Pool data returned to its state as of Thu Jan 01 10:00:00 1970. Discarded approximately 369 minutes of transactions. cannot share 'data/backup': share(1M) failed cannot share 'data/JumpStart': share(1M) failed cannot share 'data/OS_images': share(1M) failed # I notice that the two times aren't consistent but the data appears to be present and a 'zpool scrub' reported no errors. I have reverted back to Solaris 10 and successfully copied all the data off. -- Peter Jeremy pgpC0sjEufK37.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
On 2009-Nov-24 14:07:06 -0600, Mike Gerdts mger...@gmail.com wrote: On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling richard.ell...@gmail.com wrote: Also, the performance of /dev/*random is not very good. So prestaging lots of random data will be particularly challenging. This depends on the random number generation algorithm used in the kernel. I get 50MB/sec out of FreeBSD on 3.2GHz P4 (using Yarrow). In any case, you don't need crypto-grade random numbers, just data that is different and uncompressible - there are lots of relatively simple RNGs that can deliver this with far greater speed. I was thinking that a bignum library such as libgmp could be handy to allow easy bit shifting of large amounts of data. That is, fill a 128 KB buffer with random data then do bitwise rotations for each successive use of the buffer. Unless my math is wrong, it should allow 128 KB of random data to be write 128 GB of data with very little deduplication or compression. A much larger data set could be generated with the use of a 128 KB linear feedback shift register... This strikes me as much harder to use than just filling the buffer with 8/32/64-bit random numbers from a linear congruential generator, lagged fibonacci generator, mersenne twister or even random(3) http://en.wikipedia.org/wiki/List_of_random_number_generators -- Peter Jeremy pgpO9mAWzbb7x.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
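One hedged trick for generating unique, incompressible test data faster than /dev/random: encrypt /dev/zero with a throwaway key, which expands a tiny seed into a pseudorandom stream at cipher speed. The cipher, seed and output path below are all placeholders, and the invocation assumes a reasonably recent openssl:

  # ~1GB of incompressible, dedup-resistant test data; vary the seed per file.
  openssl enc -aes-128-cbc -nosalt -pass pass:change-this-seed < /dev/zero 2>/dev/null |
      dd of=/testpool/fs/random.dat bs=128k count=8192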
Re: [zfs-discuss] Recovering FAULTED zpool
On 2009-Nov-18 08:40:41 -0800, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote: There is a new PSARC in b126(?) that allows to rollback to latest functioning uber block. Maybe it can help you? It's in b128 and the feedback I've received suggests it will work. I've been trying to get the relevant ZFS bits for my b127 system but haven't managed to get them to work so far. -- Peter Jeremy ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Recovering FAULTED zpool
On 2009-Nov-19 02:57:31 +0300, Victor Latushkin victor.latush...@sun.com wrote: all the cabling, Solaris panic'd before reaching single user. Do you have a crash dump of this panic saved? Yes. It was provided to Sun Support. Option -F is a new one added with pool recovery support, so it'll be available in build 128 only. OK, thanks. I knew it was new but I wasn't certain exactly which build it had been imported into. I think it should be possible at least in readonly mode. I cannot tell if full recovery will be possible, but at least there's a good chance to get some data back. That's what I was hoping. You can try build 128 as soon as it becomes available, or you can try to build BFU archives from source and apply to your build 125 BE. I'm currently discussing this off-line with Tim Haley. Metadata replication helps to protect against failures localized in space, but as all copies of metadata are written at the same time, it cannot protect against failures localized in time. Thanks for that. I suspected it might be something like this. -- Peter Jeremy pgpTbho8x8cyp.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Recovering FAULTED zpool
I have a zpool on a JBOD SE3320 that I was using for data with Solaris 10 (the root/usr/var filesystems were all UFS). Unfortunately, we had a bit of a mixup with SCSI cabling and I believe that we created a SCSI target clash. The system was unloaded and nothing happened until I ran zpool status, at which point things broke. After correcting all the cabling, Solaris panic'd before reaching single user. Sun Support could only suggest restoring from backups - but unfortunately, we do not have backups of some of the data that we would like to recover. Since OpenSolaris has a much newer version of ZFS, I thought I would give OpenSolaris a try and it looks slightly more promising, though I still can't access the pool. The following is using snv125 on a T2000.

r...@als253:~# zpool import -F data
Nov 17 15:26:46 opensolaris zfs: WARNING: can't open objset for data/backup
r...@als253:~# zpool status -v data
  pool: data
 state: FAULTED
status: An intent log record could not be read. Waiting for adminstrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online', or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         FAULTED      0     0     3  bad intent log
          raidz2-0   DEGRADED     0     0    18
            c2t8d0   FAULTED      0     0     0  too many errors
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  DEGRADED     0     0     0  too many errors
            c3t13d0  ONLINE       0     0     0

r...@als253:~# zpool online data c2t8d0
Nov 17 15:28:42 opensolaris zfs: WARNING: can't open objset for data/backup
cannot open 'data': pool is unavailable
r...@als253:~# zpool clear data
cannot clear errors for data: one or more devices is currently unavailable
r...@als253:~# zpool clear -F data
cannot open '-F': name must begin with a letter
r...@als253:~# zpool status data
  pool: data
 state: FAULTED
status: One or more devices are faulted in response to persistent errors. There are insufficient replicas for the pool to continue functioning.
action: Destroy and re-create the pool from a backup source. Manually marking the device repaired using 'zpool clear' may allow some data to be recovered.
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        data         FAULTED      0     0     1  corrupted data
          raidz2-0   FAULTED      0     0     6  corrupted data
            c2t8d0   FAULTED      0     0     0  too many errors
            c2t9d0   ONLINE       0     0     0
            c2t10d0  ONLINE       0     0     0
            c2t11d0  ONLINE       0     0     0
            c2t12d0  ONLINE       0     0     0
            c2t13d0  ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c3t12d0  DEGRADED     0     0     0  too many errors
            c3t13d0  ONLINE       0     0     0
r...@als253:~#

Annoyingly, data/backup is not a filesystem I'm especially worried about - I'd just like to get access to the other filesystems on it. Is it possible to hack the pool to make data/backup just disappear? For that matter: 1) Why is the whole pool faulted when n-2 vdevs are online? 2) Given that metadata is triplicated, where did the objset go? -- Peter Jeremy pgpcSxvFaLwUM.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss