Re: [zfs-discuss] Freeing unused space in thin provisioned zvols
No tools; ZFS does it automatically when freeing blocks, provided the underlying device advertises the functionality. ZFS ZVOLs shared over COMSTAR advertise SCSI UNMAP as well. If a system was running something older, e.g., Solaris 11, the free blocks will not be marked as such on the server even after the system upgrades to Solaris 11.1. There might be a way to force that by disabling compression, creating a large file full of NULs and then removing it. But you need to check first that this actually has an effect before you even try. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
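For illustration, a rough sketch of the NUL-fill trick mentioned above (mount point and file name are made up; try it on something unimportant first, since it may have no effect):

    # on the initiator, in the filesystem that lives on the zvol-backed LUN
    dd if=/dev/zero of=/mnt/lun/zerofile bs=1M   # overwrite the free space with NULs
    rm /mnt/lun/zerofile
    sync

If that client filesystem is itself ZFS with compression enabled, disable compression first; otherwise the NUL blocks are compressed away and never reach the device.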
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
IIRC dump is special. As for swap... really, you don't want to swap. If you're swapping you have problems. Any swap space you have is there to help you detect those problems and correct them before apps start getting ENOMEM. There *are* exceptions to this, such as Varnish. For Varnish and any other apps like it I'd dedicate an entire flash drive to it, no ZFS, no nothing. Yes and no: the system reserves a lot of additional memory (Solaris doesn't over-commit swap) and swap is needed to support those reservations. Also, some pages are dirtied early on and never touched again; those pages should not be kept in memory. But continuously swapping is clearly a sign of a system too small for its job. Of course, compressing and/or encrypting swap has interesting issues: freeing memory by swapping pages out then requires even more memory. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 01/22/2013 10:50 PM, Gary Mills wrote: On Tue, Jan 22, 2013 at 11:54:53PM +, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote: Paging out unused portions of an executing process from real memory to the swap device is certainly beneficial. Swapping out complete processes is a desperation move, but paging out most of an idle process is a good thing. It gets even better. Executables become part of the swap space via mmap, so that if you have a lot of copies of the same process running in memory, the executable bits don't waste any more space (well, unless you use the sticky bit, although that might be deprecated, or if you copy the binary elsewhere.) There's lots of awesome fun optimizations in UNIX. :) The sticky bit has never been used in that way in SunOS for as long as I can remember (SunOS 3.x), and probably before that. It no longer makes sense for demand-paged executables. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
On 01/22/2013 02:39 PM, Darren J Moffat wrote: On 01/22/13 13:29, Darren J Moffat wrote: Since I'm replying here are a few others that have been introduced in Solaris 11 or 11.1. and another one I can't believe I missed since I was one of the people that helped design it and I did code review... Per-file sensitivity labels for TX configurations. Can you give some details on that? Google searches are turning up pretty dry. Start here: http://docs.oracle.com/cd/E26502_01/html/E29017/managefiles-1.html#scrolltoc Look for multilevel datasets. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
Some vendors call this (and things like it) Thin Provisioning; I'd say it is more accurately communication between 'disk' and filesystem about in-use blocks. In some cases, users of disks are charged by bytes in use; when not using SCSI UNMAP, a set of disks used for a zpool will in the end be charged for the whole reservation; this becomes costly when your standard usage is much less than your peak usage. Thin provisioning can now be used for zpools as long as the underlying LUNs have support for SCSI UNMAP. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] deleting a link in ZFS
On 12-08-29 12:29 AM, Gregg Wonderly wrote: On Aug 28, 2012, at 6:01 AM, Murray Cullen themurma...@gmail.com wrote: I've copied an old home directory from an install of OS 134 to the data pool on my OI install. Opensolaris apparently had wine installed as I now have a link to / in my data pool. I've tried everything I can think of to remove this link with one exception. I have not tried mounting the pool on a different OS yet, I'm trying to avoid that. Does anyone have any advice or suggestions? Unlink and rm error out as root. What is the error? Is it permission denied, I/O error, or what? Gregg The error is unlink, not owner although I am the owner. What exactly is the file? In zfs you cannot create a hard link to a directory; so what does the link look like? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
You do realize that the age of the universe is only on the order of around 10^18 seconds, don't you? Even if you had a trillion CPUs each chugging along at 3.0 GHz for all this time, the number of processor cycles you will have executed cumulatively is only on the order of 10^40, still 37 orders of magnitude lower than what a random hash collision would require. Suppose you find a weakness in a specific hash algorithm; you use this to create hash collisions and now imagine you store those colliding blocks in a zfs dataset with dedup enabled using the same hash algorithm. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
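To put rough numbers on the quoted arithmetic (order-of-magnitude only, ignoring the birthday effect):

    10^12 CPUs x 3x10^9 cycles/s x 10^18 s ~= 3x10^39 ~= 10^40 cycles
    2^256 ~= 1.2x10^77 possible values of a 256-bit hash
    10^40 / 10^77 = 10^-37

So even one hash comparison per cycle for the age of the universe leaves you roughly 37 orders of magnitude short of a decent chance at a random collision.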
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Sorry, but isn't this what dedup=verify solves? I don't see the problem here. Maybe all that's needed is a comment in the manpage saying hash algorithms aren't perfect. The point is that hash functions are many-to-one, and I think the argument was that verify isn't really needed if the hash function is good enough. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/2012 12:24 PM, Justin Stringfellow wrote: Suppose you find a weakness in a specific hash algorithm; you use this to create hash collisions and now imagine you store the hash collisions in a zfs dataset with dedup enabled using the same hash algorithm. Sorry, but isn't this what dedup=verify solves? I don't see the problem here. Maybe all that's needed is a comment in the manpage saying hash algorithms aren't perfect. It does solve it, but at a cost to normal operation. Every write gets turned into a read. Assuming a big enough and reasonably busy dataset, this leads to tremendous write amplification. If and only if the block is being dedup'ed. (In that case, you're just changing the write of a whole block into one read of the block and an update of the dedup table; the whole block isn't written.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
This assumes you have low volumes of deduplicated data. As your dedup ratio grows, so does the performance hit from dedup=verify. At, say, dedupratio=10.0x, on average, every write results in 10 reads. I don't follow. If dedupratio == 10, it means that each item is *referenced* 10 times but it is only stored *once*. Only when you have hash collisions then multiple reads would be needed. Only one read is needed except in the case of hash collisions. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Tue, 10 Jul 2012, Edward Ned Harvey wrote: CPU's are not getting much faster. But IO is definitely getting faster. It's best to keep ahead of that curve. It seems that per-socket CPU performance is doubling every year. That seems like faster to me. I think that I/O isn't getting as fast as CPU is; memory capacity and bandwidth and CPUs are getting faster. I/O, not so much. (Apart from the one single step from hard disk to SSD; but note that I/O is limited to standard interfaces and as such it is likely to be held back by the need for a new standard.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Unfortunately, the government imagines that people are using their home computers to compute hashes and try and decrypt stuff. Look at what is happening with GPUs these days. People are hooking up 4 GPUs in their computers and getting huge performance gains. 5-6 char password space covered in a few days. 12 or so chars would take one machine a couple of years if I recall. So, if we had 20 people with that class of machine, we'd be down to a few months. I'm just suggesting that while the compute space is still huge, it's not actually undoable, it just requires some thought into how to approach the problem, and then some time to do the computations. Huge space, but still finite... Dan Brown seems to think so in Digital Fortress, but it just means he has no grasp of big numbers. 2^128 is a huge space, finite *but* beyond brute force *forever*. Considering that we have nearly 10 billion people, if you give every one of them 1 billion computers, each able to do 1 billion checks per second, how many years does it take before we get the solution? Did you realize that that number is *twice* the number of years needed for a *single* computer with the same specification to solve this problem for 64 bits? There are two reasons for finding a new hash algorithm: a faster one on current hardware, or a better one with a larger output. But brute force is not what we are defending against: we're trying to defend against bugs in the hash algorithm. In the case of md5 and the related hash algorithms, a new attack method was discovered and it made many hash algorithms obsolete/broken. When an algorithm is broken, the work factor needed for a successful attack depends in part on the hash; e.g., you may be left with 64 bits of effective hash and that would be brute-forceable. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
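Working out the numbers behind that thought experiment (rough, assuming one check per key):

    2^128 ~= 3.4x10^38 keys
    10^10 people x 10^9 computers x 10^9 checks/s = 10^28 checks/s
    3.4x10^38 / 10^28 = 3.4x10^10 s ~= 1,100 years   (everybody, in parallel, 128 bits)
    2^64 ~= 1.8x10^19 keys; 1.8x10^19 / 10^9 checks/s = 1.8x10^10 s ~= 580 years   (one computer, 64 bits)

which is where the roughly *twice* comes from.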
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Do you need assurances that in the next 5 seconds a meteorite won't fall to Earth and crush you? No. And yet, the Earth puts on thousands of tons of weight each year from meteoric bombardment and people have been hit and killed by them (not to speak of mass extinction events). Nobody has ever demonstrated being able to produce a hash collision in any suitably long hash (128 bits plus) using a random search. All hash collisions have been found by attacking the weaknesses in the mathematical definition of these functions (i.e. some part of the input didn't get obfuscated well in the hash function machinery and spilled over into the result, resulting in a slight, but usable non-randomness). The reason why we don't protect against such an event is that it would be extremely expensive with a very small chance of it being needed. verify doesn't cost much, so even if the risk is as infinitesimal as a direct meteorite hit, it may still be cost effective. (Just as we'd be better off preparing for the climate changing (rather cheap) rather than trying to keep the climate from changing (impossible and still extremely expensive).) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Wed, Jul 11, 2012 at 9:48 AM, casper@oracle.com wrote: Huge space, but still finite... Dan Brown seems to think so in Digital Fortress but it just means he has no grasp of big numbers. I couldn't get past that. I had to put the book down. I'm guessing it was as awful as it threatened to be. It is *fiction*. So just read it as if it is magical, like Harry Potter. It's just that well researched in fiction means exactly the same as well researched in journalism: the facts aren't actually facts but could pass as facts to those who haven't had a proper science education (which unfortunately includes a large part of the population and 99.5% of all politicians). Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?
To be honest, I think we should also remove this from all other filesystems and I think ZFS was created this way because all modern filesystems do it that way. This may be the wrong way to go if it breaks existing applications which rely on this feature. It does break applications in our case. I don't think this is supported on most Linux filesystems either. What do you use it for? Anyway, we've added this to the list of mandatory features and see what we can procure with that. Not much? I'd suggest looking at whether you can restructure your code to work without this. For one, it requires the application to run as root (in older versions) or with specific privileges which aren't, e.g., available in non-global zones. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?
We've already asked our Netapp representative. She said it's not hard to add that. And symlinks don't work for this? I'm amazed because we're talking about the same file system. Or is it that the code you have does the hardlinking? If you want this from Oracle, you would need to talk to an Oracle representative and not a mailing list (for illumos, email will work, I suppose). I'd suggest looking at whether you can restructure your code to work without this. It would require touching code for which we don't have sources anymore (people gone, too). It would also require creating hard links to the results files directly, which means linking 15000+ files per directory with a minimum of 3 directories. Each day (this is CERN after all). I'm assuming then that it is the code for which you don't have the source which does the hardlinking? I'm still not sure why symlinks won't work, or for that matter loopback mounts. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] History of EPERM for unlink() of directories on ZFS?
Does someone know the history which led to the EPERM for unlink() of directories on ZFS? Why was this done this way, and not something like allowing the unlink and executing it on the next scrub or remount? It's not about the unlink(), it's about the link() and unlink(). By not allowing link()/unlink() of directories, you force the filesystem to contain only trees and not arbitrary graphs. Allowing it would also let you create directories where .. points to a directory whose inode cannot be found, simply because it was just removed. The support for link() on directories in ufs has always given issues and would create problems fsck couldn't fix. To be honest, I think we should also remove this from all other filesystems and I think ZFS was created this way because all modern filesystems do it that way. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [developer] History of EPERM for unlink() of directories on ZFS?
The decision to not support link(2) of directories was very deliberate - it is an abomination that never should have been allowed in the first place. My guess is that the behavior of unlink(2) on directories is a direct side-effect of that (if link isn't supported, then why support unlink?). Also worth noting that ZFS also doesn't let you open(2) directories and read(2) from them, something (I believe) UFS does allow. In the very beginning, mkdir(1) was a set-uid application; it used mknod to make a directory and then created a link from newdir to newdir/. and from . to newdir/.. Traditionally, this was only allowed for the superuser, and when we added privileges a special privilege was added for it. I think we should remove it for the other filesystems. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
The drives were the seagate green barracuda IIRC, and performance for just about everything was 20MB/s per spindle or worse, when it should have been closer to 100MB/s when streaming. Things were worse still when doing random... It is possible that your partitions weren't aligned at 4K and that will give serious issues with those drives. (Solaris now tries to make sure that all partitions are on 4K boundaries, or makes sure that the zpool dev_t is aligned to 4K.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Migration of a Thumper to bigger HDDs
Hello all, I'd like some practical advice on migration of a Sun Fire X4500 (Thumper) from aging data disks to a set of newer disks. Some questions below are my own, others are passed from the customer and I may consider not all of them sane - but must ask anyway ;) 1) They hope to use 3Tb disks, and hotplugged an Ultrastar 3Tb for testing. However, the system only sees it as a 802Gb device, via Solaris format/fdisk as well as via parted [1]. Is that a limitation of the Marvell controller, disk, the current OS (snv_117)? Would it be cleared by a reboot and proper disk detection on POST (I'll test tonight) or these big disks won't work in X4500, period? Your old release of Solaris (nearly three years old) doesn't support disks over 2TB, I would think. (A 3TB is 3E12, the 2TB limit is 2^41 and the difference is around 800Gb) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Accessing Data from a detached device.
Hi, As an addendum to this, I'm curious about how to grow the split pool in size. Scenario, mirrored pool comprising of two disks, one 200GB and the other 300GB, naturally the size of the mirrored pool is 200GB e.g. the smaller of the two devices. I ran some tests within vbox env and I'm curious why after a zpool split one of the pools does not increase in size to 300gb, yet for some reason both pools remain at 200gb even if I export/import them. Sizes are reported via zpool list. I checked the label, both disks have a single EFI partition consuming 100% of each disk. and format/partition shows slice 0 on both disks also consuming the entire disk respectively. So how does one force the pool with the larger disk to increase in size ? What is the autoexpand setting (I think it is off by default)? zpool get autoexpand splitted-pool Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
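A sketch of the usual sequence, using the pool name from above (the device name is an example only):

    zpool get autoexpand splitted-pool       # off by default
    zpool set autoexpand=on splitted-pool
    zpool online -e splitted-pool c2t1d0     # ask ZFS to expand onto the larger device

With autoexpand=on, an export/import should also pick up the larger size.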
Re: [zfs-discuss] Accessing Data from a detached device.
Is it possible to access the data from a detached device from an mirrored pool. If it is detached, I don't think there is a way to get access to the mirror. Had you used split, you should be able to reimport it. (You can try aiming zpool import at the disk but I'm not hopeful) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] test for holes in a file?
On Mon, 26 Mar 2012, Andrew Gabriel wrote: I just played and knocked this up (note the stunning lack of comments, missing optarg processing, etc)... Give it a list of files to check... This is a cool program, but programmers were asking (and answering) this same question 20+ years ago before there was anything like SEEK_HOLE. If file space usage is less than file directory size then it must contain a hole. Even for compressed files, I am pretty sure that Solaris reports the uncompressed space usage. Unfortunately not true with filesystems which compress data. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
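For reference, a minimal sketch of the SEEK_HOLE approach (error handling trimmed; an empty file reports an error rather than "no hole"):

    #include <fcntl.h>
    #include <unistd.h>

    /* 1 = file has at least one hole, 0 = none, -1 = error */
    static int
    has_hole(const char *path)
    {
            int fd = open(path, O_RDONLY);
            off_t end, hole;

            if (fd < 0)
                    return (-1);
            end = lseek(fd, 0, SEEK_END);
            hole = lseek(fd, 0, SEEK_HOLE);  /* first hole, or EOF when there is none */
            (void) close(fd);
            if (end < 0 || hole < 0)
                    return (-1);
            return (hole < end);
    }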
Re: [zfs-discuss] ZIL on a dedicated HDD slice (1-2 disk systems)
If the performance of the outer tracks is better than the performance of the inner tracks due to limitations of magnetic density or rotation speed (not being limited by the head speed or bus speed), then the sequential performance of the drive should increase as a square function, going toward the outer tracks. c = pi * r^2 It decreases, because the outer tracks are the lower-numbered tracks; they have the same density but they are larger. So, small variations of sequential performance are possible, jumping from track to track, but based on what I've seen, the maximum performance difference from the absolute slowest track to the absolute fastest track (which may or may not have any relation to inner vs outer) ... maximum variation on-par with 10% performance difference. Not a square function. I've noticed a change of 50% in speed or more between the lower and the higher numbers (60MB/s down to 30MB/s). In benchmark land, they short-stroke disks for better performance; I believe the Pillar boxes do similar tricks under the covers (if you want more performance, it gives you the faster tracks). Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bug moving files between two zfs filesystems (too many open files)
I think the too many open files is a generic error message about running out of file descriptors. You should check your shell ulimit information. Yeah, but mv shouldn't run out of file descriptors, or it should be able to handle that. Are we moving a tree of files? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of Cindy Swearingen In the steps below, you're missing a zpool import step. I would like to see the error message when the zpool import step fails. I see him doing this... # truss -t open zpool import foo The following lines are informative, sort of. /8: openat64(6, c1t0d0s0, O_RDONLY) = 7 /4: openat64(6, c1t0d0s2, O_RDONLY) Err#5 EIO And the output result is: cannot import 'foo': no such pool available What is the partition table? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] weird bug with Seagate 3TB USB3 drive
From: casper@oracle.com [mailto:casper@oracle.com] What is the partition table? He also said this... -Original Message- From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- boun...@opensolaris.org] On Behalf Of John D Groenveld # zpool create foo c1t0d0 Which, to me, suggests no partition table. An EFI partition table (there needs to be some form of label so there is always a partition table). Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
ZFS never does update-in-place and UFS only does update-in-place for metadata and where the application forces update-in-place. ufs always updates in place (it will rewrite earlier allocated blocks). The only time it doesn't is when the file is growing and it may move stuff around (when the end of the file is a fragment and it needs to grow). Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
Bullshit. I just got an OCZ Vertex 3, and the first fill was 450-500MB/s. Second and subsequent fills are at half that speed. I'm quite confident that it's due to the flash erase cycle that's needed, and if stuff can be TRIM:ed (and thus flash erased as well), speed would be regained. Overwriting a previously used block requires a flash erase, and if that can be done in the background when the timing is not critical instead of just before you can actually write the block you want, performance will increase. I think TRIM is needed both for flash (for speed) and for thin provisioning; ZFS will dirty all of the volume even though only a small part of the volume is used at any particular time. That makes ZFS more or less unusable with thin provisioning; support for TRIM would fix that if the underlying volume management supports TRIM. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD vs hybrid drive - any advice?
Shouldn't modern SSD controllers be smart enough already that they know: - if there's a request to overwrite a sector, then the old data on that sector is no longer needed - allocate a clean sector from the pool of available sectors (part of the wear-leveling mechanism) - clear the old sector, and add it to the pool (possibly done as a background operation) It seems to be the case with sandforce-based SSDs. That would pretty much let the SSD work just fine even without TRIM (like when used under HW raid). That is possibly not sufficient. If ZFS writes bytes to every sector, even though the pool is not full, the controller cannot know which sectors it can reclaim. If it uses spare sectors then it can map them to the new data and add the overwritten sectors to the free pool. With TRIM, it gets more blocks to reuse and it gives more time to erase them, making the SSD faster. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How about 4KB disk sectors?
So, what is the story about 4KB disk sectors? Should such disks be avoided with ZFS? Or, no problem? Or, need to modify some config file before usage? The issue is mostly with disks that are 4K under water (4K native but presenting 512-byte sectors); for those you need to make sure that all the partitions are on a 4K boundary. If it advertises as a 4K sector size disk, then there is no issue. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How about 4KB disk sectors?
On Wed, Jul 13, 2011 at 7:14 AM, casper@oracle.com wrote: The issue is most with 4K underwater disks; unless you make sure that all the partitions are on a 4K boundary. If it advertises as a 4K sector size disk, then there is no issue. So if you hand the entire drive to ZFS you should be OK? [Not applicable to the root zpool; will the OS installation utility do the right thing?] I think that depends on the version of ZFS/Solaris. I remember there were some issues even when you handed the whole disk. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Encryption accelerator card recommendations.
On Jun 27, 2011, at 17:16, Erik Trimble wrote: Think about how things were done with the i386 and i387. That's what I'm after. With modern CPU buses like AMD and Intel support, plopping a co-processor into another CPU socket would really, really help. Given the amount of transistors that are available nowadays I think it'd be simpler to just create a series of SIMD instructions right in/on general CPUs, and skip the whole co-processor angle. One of the VIA processors was one of the first with specific random-number and AES instructions. AMD and Intel have followed suit and you can find some information here: http://en.wikipedia.org/wiki/AES_instruction_set (Similar instructions have been added for SHA, MD5 (older CPUs), RSA, though typically as building blocks rather than a single long-running instruction.) A number of the crypto accelerators were much slower than a direct implementation using such opcodes; one issue, though, is which register set will be used and where it will be saved when the thread is preempted. (I'm assuming that the reason why AMD and Intel use different instructions from VIA is possibly because of such details.) The current implementation in the T3 uses a co-processor (one per core, I think). Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] question about COW and snapshots
On 15-06-11 05:56, Richard Elling wrote: You can even have applications like databases make snapshots when they want. Makes me think of a backup utility called mylvmbackup, which is written with Linux in mind - basically it locks mysql tables, takes an LVM snapshot and releases the lock (and then you backup the database files from the snapshot). Should work at least as well with ZFS. If a database engine or another application keeps both the data and the log in the same filesystem, a snapshot wouldn't create inconsistent data. (I think this would be true for vim and a large number of database engines; vim will detect the swap file, and the database should be able to detect the inconsistency, roll back and re-apply the log file.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, Oracle and Nexenta
However, do remember that you might not be able to import a pool from another system, simply because your system can't support the featureset. Ideally, it would be nice if you could just import the pool and use the features your current OS supports, but that's pretty darned dicey, and I'd be very happy if importing worked when both systems supported the same featureset. You can use zpool create to set a specific version; this should allow you to create a pool usable in a number of different systems. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
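For example (pool name, version number and device are only placeholders):

    zpool upgrade -v                              # list the versions this system understands
    zpool create -o version=22 portable c2t0d0    # create the pool at an older on-disk version

so the pool can still be imported on a system that only supports that older version.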
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06-05-11 05:44, Richard Elling wrote: As the size of the data grows, the need to have the whole DDT in RAM or L2ARC decreases. With one notable exception, destroying a dataset or snapshot requires the DDT entries for the destroyed blocks to be updated. This is why people can go for months or years and not see a problem, until they try to destroy a dataset. So what you are saying is you with your ram-starved system, don't even try to start using snapshots on that system. Right? I think it's more like don't use dedup when you don't have enough RAM. (It is not possible to not use snapshots in Solaris; they are used for everything.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ls reports incorrect file size
On Mon, May 2 at 14:01, Bob Friesenhahn wrote: On Mon, 2 May 2011, Eric D. Mudama wrote: Hi. While doing a scan of disk usage, I noticed the following oddity. I have a directory of files (named file.dat for this example) that all appear as ~1.5GB when using 'ls -l', but that (correctly) appear as ~250KB files when using 'ls -s' or du commands: These are probably just sparse files. Nothing to be alarmed about. They were created via CIFS. I thought sparse files were an iSCSI concept, no? sparse files are a concept of the underlying filesystem. E.g., if you lseek() after the end of the file and you write, your filesystem may not need to allocate empty blocks. Most Unix filesystems allow sparse files; FAT/FAT32 filesystems do not. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
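A small illustration of how such a file comes into existence (hypothetical file name; works on any filesystem that supports holes):

    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd = open("sparse.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0)
                    return (1);
            /* seek ~1.5GB past the start and write one byte; nothing before it is allocated */
            (void) lseek(fd, 1500LL * 1024 * 1024, SEEK_SET);
            (void) write(fd, "x", 1);
            (void) close(fd);
            return (0);
    }

ls -l will report ~1.5GB for this file, while du and ls -s report only a block or so.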
Re: [zfs-discuss] How to rename rpool. Is that recommended ?
On Fri, Apr 8, 2011 at 2:24 PM, Arjun YK arju...@gmail.com wrote: Hi, Let me add another query. I would assume it would be perfectly ok to choose any name for the root pool, instead of 'rpool', during the OS install. Please suggest otherwise. Have you tried it? Last time I tried, the pool name was predetermined; you can't change it. I tried cloning an existing installation manually, changing the pool name in the process. IIRC it works (sorry for the somewhat vague detail, it was several years ago). You can only rename it by exporting it and importing under a different name. NOTE: when you modify your root pool on a different system, throw away the zpool.cache file. (Earlier implementations had a bug where, if you renamed your root pool and then re-imported it under a different name, zfs claimed to have found two pools and then started to corrupt them.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to get rid of phantom pool ?
I had a pool on an external drive. Recently the drive failed, but the pool still shows up when running 'zpool status'. Any attempt to remove/delete/export the pool ends up with unresponsiveness (the system is still up/running perfectly, it's just that this specific command kind of hangs so I have to open a new ssh session). zpool status shows state: UNAVAIL When I try zpool clear I get cannot clear errors for backup: I/O error Please help me out to get rid of this phantom pool. Remove the zfs cache file: /etc/zfs/zpool.cache. Then reboot. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Fri, January 7, 2011 01:42, Michael DeMan wrote: Then - there is the other side of things. The 'black swan' event. At some point, given percentages on a scenario like the example case above, one simply has to make the business justification case internally at their own company about whether to go SHA-256 only or Fletcher+Verification? Add Murphy's Law to the 'black swan event' and of course the only data that is lost is that .01% of your data that is the most critical? The other thing to note is that by default (with de-dupe disabled), ZFS uses Fletcher checksums to prevent data corruption. Add also the fact that all other file systems don't have any checksums, and simply rely on the fact that disks have a bit error rate of (at best) 10^-16. Given the above: most people are content enough to trust Fletcher to not have data corruption, but are worried about SHA-256 giving 'data corruption' when it comes to de-dupe? The entire rest of the computing world is content to live with 10^-15 (for SAS disks), and yet one wouldn't be prepared to have 10^-30 (or better) for dedupe? I would; we're not talking about flipping bits but about the OS comparing data using just the checksums and replacing one set with another. You might want to create a file to show how weak fletcher really is, but two such files wouldn't be properly stored on a de-dup zpool unless you verify. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ... open source moving forward?
On 11/12/2010 00:07, Erik Trimble wrote: The last update I see to the ZFS public tree is 29 Oct 2010. Which, I *think*, is about the time that the fork for the Solaris 11 Express snapshot was taken. I don't think this is the case. Although all the files show modification date of 29 Oct 2010 at src.opensolaris.org they are still old versions from August, at least the ones I checked. See http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/ the mercurial gate doesn't have any updates either. Correct; the last public push was on 2010/8/18. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS snapshot limit?
In my situation it is the first option: I send a snapshot to another server using zfs send | zfs recv, and I have a problem when the data send is completed; after reboot the zpool has errors or has state: faulted. First server is physical, second is a virtual machine running under xenserver 5.6 What is the underlying data storage? Typically what can happen here is that while zfs itself is safe, it needs to trust the hardware not to lie to the kernel. If you write data and you reboot/restart the VM, the data should still be there. If that is not the case, then it has lied to you and you may need to change something in the host. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Space not freed from large deleted sparse file
I removed the file using a simple rm /mnt/tank0/temp/mytempfile.bin. It's definitely gone. But the space hasn't been freed. I have been pointed in the direction of this bug http://bugs.opensolaris.org/view_bug.do?bug_id=6792701 It was apparently introduced in build 94 and at that time we had a zpool version of 11. Can you confirm that Fixed In: snv_118 means the issue was fixed in ZFS pool version 17? http://hub.opensolaris.org/bin/view/Community+Group+zfs/17 The version of zpool changes only when there's a change to the on-disk format. It was fixed in build 118 but no change was made to the zfs version. zpool version was bumped in build 120. which appears to be in ZFS version 14, and my FreeNAS distro is at version 13. Could this be the issue? If so, what is the correct course of action? Ditch FreeNAS and move to a distro with a more recent ZFS version? If I do this, is it safe to upgrade my ZFS version knowing that there is something up with the filesystem? I believe it will just work. Sorry, that what will just work? Moving to a distro with more recent ZFS support and upgrading my pool? Even without the upgrade but I'm not sure. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Space not freed from large deleted sparse file
I removed the file using a simple rm /mnt/tank0/temp/mytempfile.bin. It's definitely gone. But the space hasn't been freed. I have been pointed in the direction of this bug http://bugs.opensolaris.org/view_bug.do?bug_id=6792701 It was apparently introduced in build 94 and at that time we had a zpool version of 11. which appears to be in ZFS version 14, and my FreeNAS distro is at version 13. Could this be the issue? If so, what is the correct course of action? Ditch FreeNAS and move to a distro with a more recent ZFS version? If I do this, is it safe to upgrade my ZFS version knowing that there is something up with the filesystem? I believe it will just work. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Unknown Space Gain
tank com.sun:auto-snapshot true local I don't utilize snapshots (this machine just stores media)...so what could be up? You've also disabled the time-slider functionality? (automatic snapshots) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Unknown Space Gain
Huh, I don't actually ever recall enabling that. Perhaps that is connected to the message I started getting every minute recently in the kernel buffer, It's on by default. You can see if it was ever enabled by using: zfs list -t snapshot |grep @zfs-auto Oct 20 12:20:49 megatron pcplusmp: [ID 805372 kern.info] pcplusmp: id= e (ata) instance 3 irq 0xf vector 0x45 ioapic 0x2 intin 0xf is bound to cpu 0 Oct 20 12:21:49 megatron pcplusmp: [ID 805372 kern.info] pcplusmp: id= e (ata) instance 3 irq 0xf vector 0x45 ioapic 0x2 intin 0xf is bound to cpu 1 This sounds more like a device driver unloaded and later it is reloaded because of some other service. I just disabled it (zfs set com.sun\:auto-snapshot=false tank, correct?), will see if the log messages disappear. Did the filesystem kill off some snapshots or something in an effort to free up space? Yes, but typically it will log that. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to avoid striping ?
You have an application filesystem from one LUN. (vxfs is expensive, ufs/svm is not really able to handle online filesystem increase. Thus we plan to use zfs for application filesystems.) What do you mean by not really? ... Use growfs to grow UFS on the grown device. I know it's off-topic, but the statement growfs will ``write-lock'' (see lockfs(1M)) a mounted filesystem when expanding always made me uncomfortable with this online expansion. I cannot guarantee how a specific application will behave during the expansion. -w Write-lock (wlock) the specified file-system. wlock suspends writes that would modify the file system. Access times are not kept while a file system is write-locked. All the applications trying to write will suspend. What would be the risk of that? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS equivalent of inotify
Is there a ZFS equivalent (or alternative) of inotify? You have some thing, which wants to be notified whenever a specific file or directory changes. For example, a live sync application of some kind... Have you looked at port_associate and ilk? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
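A minimal sketch of watching a single file with Solaris event ports (the path is hypothetical; real code should re-associate after each event and also handle deletes and renames):

    #include <port.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    int
    main(void)
    {
            const char *path = "/tank/data/file.txt";
            struct file_obj fobj;
            struct stat sb;
            port_event_t pe;
            int port = port_create();

            if (port < 0 || stat(path, &sb) != 0)
                    return (1);
            (void) memset(&fobj, 0, sizeof (fobj));
            fobj.fo_atime = sb.st_atim;      /* current times establish the baseline */
            fobj.fo_mtime = sb.st_mtim;
            fobj.fo_ctime = sb.st_ctim;
            fobj.fo_name = (char *)path;
            if (port_associate(port, PORT_SOURCE_FILE, (uintptr_t)&fobj,
                FILE_MODIFIED, NULL) != 0)
                    return (1);
            if (port_get(port, &pe, NULL) == 0)      /* blocks until the file changes */
                    (void) printf("%s changed, events 0x%x\n", path, pe.portev_events);
            return (0);
    }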
Re: [zfs-discuss] TLER and ZFS
Changing the sector size (if it's possible at all) would require a reformat of the drive. The WD drives only support a 4K sector size but they pretend to have 512-byte sectors. I don't think they need to format the drive when changing to 4K sectors. A non-aligned write requires a read-modify-write operation and that makes the write slower. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
This would require a low-level re-format and would significantly reduce the available space if it was possible at all. I don't think it is possible. WD has a jumper, but it is there explicitly to work with WindowsXP, and is not a real way to dumb down the drive to 512. All it does is offset the sector numbers by 1 so that sector 63 becomes physical sector 64 (a multiple of 4KB). Is that all? And this forces 4K alignment? I would presume that any vendor that is shipping 4K sector size drives now, with a jumper to make it 'real' 512, would be supporting that over the long run? I would be very surprised if any vendor shipped a drive that could be jumpered to real 512 bytes. The best you are going to get is jumpered to logical 512 bytes and maybe a 1-sector offset (needed for WindozeXP only). These jumpers will probably last as long as the 8GB jumpers that were needed by old BIOS code. (E.g. the BIOS boots using simulated 512-byte sectors and then the OS tells the drive to switch to native mode.) I would assume that such a jumper would change the drive from 4K native to pretending to have 512-byte sectors. It's unfortunate that Sun didn't bite the bullet several decades ago and provide support for block sizes other than 512 bytes instead of getting custom firmware for their CD drives to make them provide 512-byte logical blocks for 2KB CD-ROMs. Since Solaris x86 works fine with standard CD/DVD drives, that is no longer an issue. Solaris does support larger sectors. It's even more idiotic of WD to sell a drive with 4KB sectors but not provide any way for an OS to identify those drives and perform 4KB aligned I/O. I'm not sure that that is correct; the drive works for naive clients but I believe it can reveal its true colors. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
On Tue, Oct 5, 2010 at 11:49 PM, casper@sun.com wrote: I'm not sure that that is correct; the drive works on naive clients but I believe it can reveal its true colors. The drive reports 512 byte sectors to all hosts. AFAIK there's no way to make it report 4k sectors. Too bad because it makes it less useful (specifically because the label mentions sectors and if you can use bigger sectors, you can address a larger drive). They still have all sizes w/o Advanced Format (non EARS/AARS models) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
My immediate reaction to this is time to avoid WD drives for a while; until things shake out and we know what's what reliably. But, um, what do we know about say the Seagate Barracuda 7200.12 ($70), the SAMSUNG Spinpoint F3 1TB ($75), or the HITACHI Deskstar 1TB 3.5 ($70)? I see several important features when selecting a drive for a mirror: TLER (the ability of the drive to time out a command), sector size (native vs virtual), power use (specifically at home), performance (mostly for work), price. I've heard scary stories about a mismatch of the native sector size and unaligned Solaris partitions (4K sectors, unaligned cylinder). I was pretty happy with the WD drives (except for the one with a seriously broken cache) but I see the reasons not to pick WD drives above the 1TB range. Are people now using 4K native sectors and formatting them with 4K sectors in (Open)Solaris? Performance sucks when you use unaligned accesses, but is performance good when the accesses are aligned? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [osol-discuss] [illumos-Developer] zpool upgrade and zfs upgrade behavior on b145
Additionally, even though zpool and zfs get version display the true and updated versions, I'm not convinced that the problem is zdb, as the label config is almost certainly set by the zpool and/or zfs commands. Somewhere, something is not happening that is supposed to when initiating a zpool upgrade, but since I know virtually nothing of the internals of zfs, I do The problem is likely in the boot block or in grub. The development version did not update the boot block; newer versions of beadm do fix boot blocks. For now, I'd recommend you upgrade the boot block on all halves of a bootable mirror before you upgrade the zpool version or the zfs version. export/import won't help. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
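A hedged sketch of what updating the boot block looks like (device names are examples; do this for every disk in the root mirror before the zpool/zfs upgrade):

    # x86:
    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t0d0s0
    # SPARC:
    installboot -F zfs /usr/platform/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c0t0d0s0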
Re: [zfs-discuss] drive speeds etc
I have both EVDS and EARS 2TB green drive. And I have to say they are not good to build storage servers. I think both have native 4K sectors; as such, they balk or perform slowly when a smaller I/O or an unaligned IOP hits them. How are they formatted? Specifically, solaris slices must be aligned on a 4K boundary or performance will stink. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [osol-discuss] zfs send/receive?
hi all I'm using a custom snapshot scheme which snapshots every hour, day, week and month, rotating 24h, 7d, 4w and so on. What would be the best way to zfs send/receive these things? I'm a little confused about how this works for delta updates... Vennlige hilsener / Best regards The initial backup should look like this:

    zfs snapshot -r export@backup-2010-07-12
    zfs send -R export@backup-2010-07-12 | zfs receive -F -u -d portable/export

(portable is a portable pool; the export filesystem needs to exist; I use one zpool to receive different zpools, each in their own directory) An incremental backup:

    zfs snapshot -r export@backup-2010-07-13
    zfs send -R -I export@backup-2010-07-12 export@backup-2010-07-13 | zfs receive -v -u -d -F portable/export

You need to make sure you keep the last backup snapshot; when receiving the incremental backup, destroyed filesystems and snapshots are also destroyed in the backup. Typically, I remove some of the snapshots *after* the backup; they are only destroyed during the next backup. I did notice that send/receive gets confused when older snapshots are destroyed by time-slider during the backup. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] fs root inode number?
Typically on most filesystems, the inode number of the root directory of the filesystem is 2, 0 being unused and 1 historically once invisible and used for bad blocks (no longer done, but kept reserved so as not to invalidate assumptions implicit in ufsdump tapes). However, my observation seems to be (at least back at snv_97), the inode number of ZFS filesystem root directories (including at the top level of a zpool) is 3, not 2. Buggy programs may make all kinds of bad assumptions; this problem isn't new: with ufs, the root filesystem of a zone is typically just a plain directory inside another filesystem. I seem to remember that flexlm wanted the root to be an actual root directory (so you could run only one copy). They didn't realize that faking the hostid is just too simple. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
I'm using ZFS on a system w/o ECC; it works (it's an Atom 230). Note that this is not different from using another OS; the difference is that ZFS will complain when memory corruption leads to disk corruption; without ZFS you would still have the memory corruption but you wouldn't know about it. Is it helpful not knowing that you have memory corruption? I don't think so. I'd love to have a small (40W) system with ECC but it is difficult to find one. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] non-ECC Systems and ZFS for home users (was: Please warn a home user against OpenSolaris under VirtualBox under WinXP ; ))
On 23-9-2010 10:25, casper@sun.com wrote: I'm using ZFS on a system w/o ECC; it works (it's an Atom 230). I'm using ZFS on a non-ECC machine for years now without any issues. Never had errors. Plus, like others said, other OSes have the same problems and also run quite well. If not, you don't know it. With ZFS you will know. I would say - just go for it. You will never want to go back. Indeed. While I mirror stuff on the same system, I'm now also making backups using a USB connected disk (eSATA would be better but the box only has USB). My backup consists of:

    for pool in $pools
    do
        zfs snapshot -r $pool@$newsnapshot
        zfs send -R -I $pool@$lastsnapshot $pool@$newsnapshot | zfs receive -v -u -d -F portable/$pool
    done

then I export and store the portable pool somewhere else. I do run a scrub once per two weeks for all the pools, just in case. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Growing a root ZFS mirror on b134?
Ok, that doesn't seem to have worked so well ... I took one of the drives offline, rebooted and it just hangs at the splash screen after prompting for which BE to boot into. It gets to hostname: blah and just sits there. When you say offline, did you: - remove the drive physically? - or did you zfs detach it? - or both? In order to remove half of the mirror I suggest that you either split the mirror (if your ZFS is recent enough; it seems to be supported since build 131) [make sure you remove /etc/zfs/zpool.cache from the split half of the mirror], or detach it and only then remove the disk. Depending on the hardware it may try to find the missing disk and this may take some time. You can boot with the debugger and/or -v to find out what is going on. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
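A rough sketch of the split route (pool names are examples; check zpool(1M) on your build):

    zpool split rpool rpool_backup    # detaches the second half of each mirror into a new, exported pool

Before ever booting from that split half, import it somewhere else and remove its /etc/zfs/zpool.cache, as noted above.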
Re: [zfs-discuss] How to migrate to 4KB sector drives?
On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote: No replies. Does this mean that you should avoid large drives with 4KB sectors, that is, new drives? ZFS does not handle new drives? Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of osol. Build 118 adds support for 4K sectors with the following putback: PSARC 2008/769 Multiple disk sector size support. 6710930 Solaris needs to support large sector size hard drive disk But already in build 38 there is some support for large-sector disks in ZFS: 6407365 large-sector disk support in ZFS When new features are added to the current release, they are typically developed for the next release first and then backported to the current release. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ufs root to zfs root liveupgrade?
hi all Try to learn how UFS root to ZFS root liveUG work. I download the vbox image of s10u8, it come up as UFS root. add a new disks (16GB) create zpool rpool run lucreate -n zfsroot -p rpool run luactivate zfsroot run lustatus it do show zfsroot will be active in next boot init 6 but it come up with UFS root, lustatus show ufsroot active zpool rpool is mounted but not used by boot You'll need to boot from a different disk; I don't think that the OS can change the boot disk (it can on SPARC but it can't on x86) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Directory tree renaming -- disk usage
If I have a directory with a bazillion files in it (or, let's say, a directory subtree full of raw camera images, about 15MB each, totalling say 50GB) on a ZFS filesystem, and take daily snapshots of it (without altering it), the snapshots use almost no extra space, I know. If I now rename that directory, and take another snapshot, what happens? Do I get two copies of the unchanged data now, or does everything still reference the same original data (file content)? Seems like the new directory tree contains the same old files, same inodes and so forth, so it shouldn't be duplicating the data as I understand it; is that correct? This would, obviously, be fairly easy to test; and, if I removed the snapshots afterward, wouldn't take space permanently (have to make sure that the scheduler doesn't do one of my permanent snapshots during the test). But I'm interested in the theoretical answer in any case. Snapshots never take additional space until they start to reference deleted data. If the directory is renamed then the parent directory and the directory's inode are changed, but the rest of the data is not modified, so the rename has no effect on the amount of data stored. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] swap - where is it coming from?
Swap is perhaps the wrong name; it is really virtual memory; virtual memory consists of real memory and swap on disk. In Solaris, a page either exists on the physical swap device or in memory. Of course, not all memory is available, as the kernel and other caches use a large part of the memory. When no disk-based swap is in use, there is sufficient free memory; reserved counts pages that are reserved, e.g., by fork() (pages to copy when copy-on-write happens), or memory that is allocated but not yet written to. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] swap - where is it coming from?
On Thu, 10 Jun 2010, casper@sun.com wrote: Swap is perhaps the wrong name; it is really virtual memory; virtual memory consists of real memory and swap on disk. In Solaris, a page either exists on the physical swap device or in memory. Of course, not all memory is available as the kernel and other caches use a large part of the memory. Don't forget that virtual memory pages may also come from memory mapped files from the filesystem. However, it seems that zfs is effectively diminishing this. I should have said anonymous virtual memory. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Using WD Green drives?
On Thu, May 13, 2010 at 06:09:55PM +0200, Roy Sigurd Karlsbakk wrote: 1. even though they're 5900, not 7200, benchmarks I've seen show they are quite good Minor correction, they are 5400rpm. Seagate makes some 5900rpm drives. The green drives have reasonable raw throughput rate, due to the extremely high platter density nowadays. However, due to their low spin speed, their average-access time is significantly slower than 7200rpm drives. For bulk archive data containing large files, this is less of a concern. Regarding slow resilvering times, in the absence of other disk activity, I think that should really be limited by the throughput rate, not the relatively slow random I/O performance...again assuming large files (and low fragmentation, which if the archive is write-and-never-delete is what I'd expect). My experience is that they resilver fairly quickly and scrubs aren't slow either. (300GB in 2hrs) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] why both dedup and compression?
On 06/05/2010 21:07, Erik Trimble wrote: VM images contain large quantities of executable files, most of which compress poorly, if at all. What data are you basing that generalisation on? Look at these simple examples for libc on my OpenSolaris machine: 1.6M /usr/lib/libc.so.1* 636K /tmp/libc.gz I did the same thing for vim and got pretty much the same result. It will be different (probably not quite as good) when it is at the ZFS block level rather than whole file, but those two samples, chosen at random by me, say otherwise to your generalisation. Easy to test when compression is enabled for your rpool: 2191 -rwxr-xr-x 1 root bin 1794552 May 6 14:46 /usr/lib/libc.so.1* (The actual file size is about 3500 blocks, so we're saving quite a bit) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
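Anyone can repeat the experiment; the file and dataset names below are simply what a typical install has at hand, so adjust to taste:
    cp /usr/lib/libc.so.1 /tmp/libc
    gzip -9 /tmp/libc && ls -l /tmp/libc.gz      # whole-file gzip ratio
    ls -ls /usr/lib/libc.so.1                    # first column = blocks actually allocated
    zfs get compression,compressratio rpool      # the per-dataset ratio ZFS reports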
Re: [zfs-discuss] Reverse lookup: inode to name lookup
You can do it in the kernel by calling vnodetopath(). I don't know if it is exposed to user space. Yes, in /proc/*/path (kinda). But that could be slow if you have large directories, so you have to think about where you would use it. The kernel caches file names; however, the cache cannot be used for files that aren't in use. It is certainly possible to create a .zfs/snapshot_byinode, but it is not clear when it would help; it could be used for finding the earlier copy of a directory (like netapp/.snapshot). Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reverse lookup: inode to name lookup
I understand you cannot look up names by inode number in general, because that would present a security violation. Joe User should not be able to find the name of an item that's in a directory where he does not have permission. But, even if it can only be run by root, is there some way to look up the name of an object based on inode number? Sure, that's typically how NFS works. The inode itself is not sufficient; an inode number might be recycled, and an old snapshot with the same inode number may refer to a different file. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
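From userland the only generic answer I know of is brute force; a sketch, with the mount point and inode number invented:
    find /export/home -xdev -inum 12345 -print
    # walks the whole filesystem comparing inode numbers; nothing like a
    # reverse index is exported, so this is as slow as the tree is large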
Re: [zfs-discuss] Reverse lookup: inode to name lookup
No, an NFS client will not ask the NFS server for a name by sending the inode or NFS handle. There is no need for an NFS client to do that. NFS clients (certainly versions 2 and 3) use only the file handle; the file handle can be decoded by the server. The file handle does not contain the name, only the FSid, the inode number and the generation. There is no way to get a name from an inode number. The NFS server knows how, so it is clearly possible. It is not exported to userland, but the kernel can find a file by its inumber. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
On Mon, 19 Apr 2010, Edward Ned Harvey wrote: Improbability assessment aside, suppose you use something like the DDRDrive X1 ... Which might be more like 4G instead of 32G ... Is it even physically possible to write 4G to any device in less than 10 seconds? Remember, to achieve worst case, highest demand on ZIL log device, these would all have to be 32kbyte writes (default configuration), because larger writes will go directly to primary storage, with only the intent landing on the ZIL. Note that ZFS always writes data in order, so I believe that the statement "larger writes will go directly to primary storage" really should be "larger writes will go directly to the ZIL implemented in primary storage" (which always exists). Otherwise, ZFS would need to write a new TXG whenever a new large block of data appeared (which may be puny as far as the underlying store is concerned) in order to assure proper ordering. This would result in a very high TXG issue rate. Pool fragmentation would be increased. I am sure that someone will correct me if this is wrong. There's a difference between data being written and data being referenced by the uberblock. There is no need to start a new TXG when a large datablock is written. (If the system resets, the data will be on disk but not referenced, and it is lost unless the TXG it belongs to is committed.) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] brtfs on Solaris? (Re: [osol-discuss] [indiana-discuss] So when are we gonna fork this sucker?)
btrfs could be supported on OpenSolaris, too. IMO it could even complement ZFS and spawn some concurrent development between both. ZFS is too high end and works very poorly with less than 2GB, while btrfs reportedly works well with 128MB on ARM. Both have license issues; Oracle can now re-license either, I believe, unless btrfs has escaped. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. If the OS does give priority for sync writes going into TXG's before async writes (even with ZIL disabled), then after spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent. This is what Jeff Bonwick says in the zil synchronicity arc case: What I mean is that the barrier semantic is implicit even with no ZIL at all. In ZFS, if event A happens before event B, and you lose power, then what you'll see on disk is either nothing, A, or both A and B. Never just B. It is impossible for us not to have at least barrier semantics. So there's no chance that a *later* async write will overtake an earlier sync *or* async write. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 20:58, Jeroen Roodhart wrote: I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Interesting thing is that apparently defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export... Which is to be expected, as it is not the NFS client which requests the behaviour but rather the NFS server. Currently on Linux you can export a share as sync (default) or async, while on Solaris you can't really currently force an NFS server to start working in an async mode. The other part of the issue is that the Solaris clients have been developed against a sync server. The client does more write-behind and keeps caching the not-yet-acknowledged data. The Linux client has been developed against an async server and has some catching up to do. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
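On the Linux side this is an export option; for example (path and client range invented):
    # /etc/exports on a Linux NFS server
    /export/data    192.168.1.0/24(rw,sync)     # honour stable/COMMIT semantics
    /export/scratch 192.168.1.0/24(rw,async)    # acknowledge before data is on disk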
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
http://nfs.sourceforge.net/ I think B4 is the answer to Casper's question: We were talking about ZFS, and under what circumstances data is flushed to disk, in what way sync and async writes are handled by the OS, and what happens if you disable ZIL and lose power to your system. We were talking about C/C++ sync and async. Not NFS sync and async. I don't think so. http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg36783.html (This discussion was started, I think, in the context of NFS performance) Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
So you're saying that while the OS is building txg's to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txg's. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way? The question is not how the writes are ordered but whether an earlier write can be in a later txg. A transaction group is committed atomically. In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I asked a similar question to make sure I understood it correctly, and the answer was: Casper, the answer is from Neil Perrin: Is there a partial order defined for all filesystem operations? File system operations will be written in order for all settings of the sync flag. Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a file (I assume by O_DATA you meant O_DSYNC), that later transactions will not be in an earlier transaction group? (Or is this already the case?) This is already the case. So what I assumed to be true, but what you made me doubt, is apparently still true: later transactions cannot be committed in an earlier txg. If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system? For an application running on the file server, there is no difference. When the system panics you know that data might be lost; the application also dies. (The snapshot and the last valid uberblock are equally valid.) But for an application on an NFS client, without ZIL data will be lost while the NFS client believes the data is written, and it will not try again. With the ZIL, when the NFS server says that data is written then it is actually on stable storage. The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters: if you cannot allow the next command to begin until after the previous one has completed. Such is the situation with databases and sometimes virtual machines. So the question is: when will your data be invalid? What happens with the data when the system dies before the fsync() call? What happens with the data when the system dies after the fsync() call? What happens with the data when the system dies after more I/O operations? With the ZIL disabled, you call fsync() but you may encounter data from before the call to fsync(). That could already happen if the system died before the fsync(), so I assume you can actually recover from that situation. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Questions to answer would be: Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes? There are quite a few sync writes, specifically when you mix in the NFS server. Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync. From what I quoted from the other discussion, it seems that later writes cannot be committed in an earlier TXG than your sync write or other earlier writes. I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct? The uberblock is the root of all the data. All the data in a ZFS pool is referenced by it; after the txg is in stable storage, the uberblock is updated. At boot time, or zpool import time, what is taken to be the current filesystem? The latest uberblock? Something else? The current zpool and the filesystems as referenced by the last uberblock. My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of OS halting or ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data. sync() is actually *async* and returning from sync() says nothing about stable storage. After fsync() returns it signals that all the data is in stable storage (except if you disable ZIL), or, apparently, in Linux when the write caches for your disks are enabled (the default for PC drives). ZFS doesn't care about the write cache; it makes sure it is flushed. (There's fsync() and open(..., O_DSYNC|O_SYNC).) Based on this understanding, if you disable ZIL, then there is no guarantee about order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes. Somebody (Casper?) said it before, and now I'm starting to realize ... This is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything. The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes. 
If the OS does give priority for sync writes going into TXG's before async writes (even with ZIL disabled), then after spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent. I believe that the writes are still ordered so the consistency you want is actually delivered even without the ZIL enabled. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described, is to have an ungraceful power down or reboot. The advice I would give is: Do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest snapshot ... and rollback once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption. Why do you need the rollback? The current filesystems have correct and consistent data; not different from the last two snapshots. (Snapshots can happen in the middle of untarring) The difference between running with or without ZIL is whether the client has lost data when the server reboots; not different from using Linux as an NFS server. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
If you have an ungraceful shutdown in the middle of writing stuff, while the ZIL is disabled, then you have corrupt data. Could be files that are partially written. Could be wrong permissions or attributes on files. Could be missing files or directories. Or some other problem. Some changes from the last 1 second of operation before the crash might be written, while some changes from the last 4 seconds might be still unwritten. This is data corruption, which could be worse than losing a few minutes of changes. At least, if you roll back, you know the data is consistent, and you know what you lost. You won't continue having more losses afterward caused by inconsistent data on disk. How exactly is this different from rolling back to some other point in time? I think you don't quite understand how ZFS works; all operations are grouped in transaction groups, and all the transactions in a particular group are committed in one operation. I don't know what partial ordering ZFS uses when creating transaction groups, but a snapshot just picks one transaction group as the last group included in the snapshot. When the system reboots, ZFS picks the most recent valid uberblock, so the data available is correct up to transaction group N1. If you roll back to a snapshot, you get data correct up to transaction group N2. But N2 < N1, so you lose more data. Why do you think that a Snapshot has a better quality than the last snapshot available? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Dude, don't be so arrogant. Acting like you know what I'm talking about better than I do. Face it that you have something to learn here. You may say that, but then you post this: Why do you think that a Snapshot has a better quality than the last snapshot available? If you rollback to a snapshot from several minutes ago, you can rest assured all the transaction groups that belonged to that snapshot have been committed. So although you're losing the most recent few minutes of data, you can rest assured you haven't got file corruption in any of the existing files. But the actual fact is that there is *NO* difference between the last uberblock and an uberblock named as snapshot-such-and-so. All changes made after the uberblock was written are discarded by rolling back. All the transaction groups referenced by last uberblock *are* written to disk. Disabling the ZIL makes sure that fsync() and sync() no longer work; whether you take a named snapshot or the uberblock is immaterial; your strategy will cause more data to be lost. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. We're talking about the sync for NFS exports in Linux; what do they mean with sync NFS exports? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss. The default commit interval is what, 30 seconds? Doesn't that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk? When a system boots and it finds the snapshot, then all the data referred to by the snapshot is on disk. But the snapshot doesn't guarantee more than the last valid uberblock. Therefore, any snapshot more than 30 seconds old is guaranteed to be consistent on disk, while anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before. If I'm wrong about this, please explain. When a pointer to data is committed to disk by ZFS, then the data is also on disk. (If the pointer is reachable from the uberblock, then the data is also on disk and reachable from the uberblock.) You don't need to wait 30 seconds. If it's there, it's there. I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate it into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. If you roll back to a snapshot that's at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You're acknowledging the loss of some known amount of time's worth of data. But you're gaining a guarantee of internal file consistency. I don't know exactly what ZFS guarantees when you disable the ZIL; the one broken promise is that the data may not have been committed to stable storage when fsync() returns. I'm not sure whether there is a barrier when there is a sync()/fsync(); if there is, then ZFS is still safe for your application. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
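For reference, the commit interval is a kernel tunable; the name below is from memory and has changed between releases, so treat this purely as a sketch:
    * /etc/system -- txg commit interval in seconds
    set zfs:zfs_txg_timeout = 5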
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
On 01/04/2010 13:01, Edward Ned Harvey wrote: Is that what sync means in Linux? A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk. ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) :-) So what I *really* wanted to know what sync meant for the NFS server in the case of Linux. Apparently it means implement the NFS protocol to the letter. I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sun Flash Accelerator F20 numbers
It does seem like rollback to a snapshot does help here (to assure that sync/async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times. But doesn't that snapshot possibly have the same issues? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] bit-flipping in RAM...
I'm not saying that ZFS should consider doing this - doing a validation for in-memory data is non-trivially expensive in performance terms, and there's only so much you can do and still expect your machine to survive. I mean, I've used the old NonStop stuff, and yes, you can shoot them with a .45 and it likely will still run, but whacking them with a bazooka still is guaranteed to make them, well, Non-NonStop. If we scrub the memory anyway, why not include the check of the ZFS checksums which are already in memory? OTOH, ZFS gets a lot of mileage out of cheap hardware and we know what the limitations are when you don't use ECC; the industry must start to require that all chipsets support ECC. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposition of a new zpool property.
That would add unnecessary code to the ZFS layer for something that cron can handle in one line. Actually ... Why should there be a ZFS property to share NFS, when you can already do that with share and dfstab? And still the zfs property exists. Probably because it is easy to create new filesystems and clone them; as NFS sharing works per filesystem, you would need to edit dfstab every time you add a filesystem. With the sharenfs property, zfs creates the NFS export for you, etc. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
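The convenience is easy to see; the pool and filesystem names here are invented:
    zfs set sharenfs=on tank/home     # or e.g. sharenfs=rw,root=adminhost
    zfs create tank/home/alice        # the new filesystem inherits sharenfs
    share                             # both exports appear, no dfstab editing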
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
Carson Gaspar wrote: Not quite. 11 x 10^12 =~ 10.004 x (1024^4). So, the 'zpool list' is right on, at 10T available. Duh, I was doing GiB math (y = x * 10^9 / 2^20), not TiB math (y = x * 10^12 / 2^40). Thanks for the correction. You're welcome. :-) On a not-completely-on-topic note: Has anyone considered a class-action lawsuit for false advertising on this? I know they now have to include the 1GB = 1,000,000,000 bytes thing in their specs and somewhere on the box, but just because I say 1 L = 0.9 metric liters somewhere on the box, it shouldn't mean that I should be able to advertise in huge letters 2 L bottle of Coke on the outside of the package... I think such attempts have been made, and I think one was settled by Western Digital. https://www.wdc.com/settlement/docs/document20.htm This was in 2006. I was apparently part of the 'class' as I had a disk registered; I think they gave some software. See also: http://en.wikipedia.org/wiki/Binary_prefix Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Posible newbie question about space between zpool and zfs file systems
IMHO, what matters is that pretty much everything from the disk controller to the CPU and network interface is advertised in power-of-2 terms and disks sit alone using power-of-10. And students are taught that computers work with bits and so everything is a power of 2. That is simply not true: Memory: power of 2 (bytes); Network: power of 10 (bits/s); Disk: power of 10 (bytes); CPU frequency: power of 10 (cycles/s); SD/Flash/...: power of 10 (bytes); Bus speed: power of 10. Main memory is the odd one out. Just last week I had to remind people that a 24-disk JBOD with 1TB disks wouldn't provide 24TB of storage since disks show up as 931GB. Well some will say it's 24T :-) It *is* an anomaly and I don't expect it to be fixed. Perhaps some disk vendor could add more bits to its drives and advertise a real 1TB disk using power-of-2 and show how people are being misled by other vendors that use power-of-10. Highly unlikely but would sure get some respect from the storage community. You've not been misled unless you have had your head in the sand for the last five to ten years. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
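The arithmetic behind that 931GB figure, for anyone who wants to check it:
    echo 'scale=1; 10^12 / 2^30' | bc    # a marketing 1TB expressed in GiB: 931.3
    echo 'scale=2; 10^12 / 2^40' | bc    # or in TiB: about 0.90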
Re: [zfs-discuss] swap across multiple pools
The default install for OpenSolaris creates a single root pool, and creates a swap and dump dataset within this pool. In a multipool environment, would it make sense to add swap to a pool outside of the root pool, either as the sole swap dataset to be used or as extra swap? Would this have any performance implications? My own experience is that the zvol swap devices are much slower than swap directly to disk. Perhaps because I had compression on in the rpool, but any form of data copying/compressing or caching for swap is a no-no: you use more memory and you need to evict more pages. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
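For what it's worth, adding swap from a second pool is straightforward; the pool name and size are only an example:
    zfs create -V 4G -b 4k datapool/swap2     # block size near the page size; leave compression off
    swap -a /dev/zvol/dsk/datapool/swap2
    swap -l                                   # both swap devices should now be listed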
Re: [zfs-discuss] How to verify ecc for ram is active and enabled?
Is there a method to view the status of the RAM's ECC single- or double-bit errors? I would like to confirm that ECC on my Xeon E5520 and ECC RAM are performing their role, since memtest is ambiguous. I am running the memory test on a P6T6 WS, E5520 Xeon, and 2GB Samsung ECC modules, and this is what is on the screen: Chipset: Core IMC (ECC : Detect / Correct) However, further down ECC is identified as being off. Yet there is a column for ECC Errs. I don't know how to interpret this. Is ECC active or not? Off, but only disabled by memtest, I believe. You can enable it in the memtest menu. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Installing Solaris 10 with ZFS Root FS
Hi Romain, The option to select a ZFS root file system or a UFS root file system is available starting in the Solaris 10 10/08 release. (aka update 6, right?) I wish to install a Solaris 10 on a ZFS mirror but in the installer (in interactive text mode) I don't have a choice of filesystem: I only have 'SOLARIS' fs type (which is UFS if I'm right). That sounds like the fdisk screen; use Solaris and then go on to the next screen. This screen will create an fdisk partition. After doing that, you need to create slices and then create a UFS filesystem or a ZFS rpool. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Installing Solaris 10 with ZFS Root FS
Hi Cindy, thanks for your quick response! I'm trying to install Solaris 10 11/06. I don't know how the version numbering works so I don't know if my version is newer than 10/08. It's month/year; 11/06 is three years and a bit old, so older than 10/08. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS replace - many to one
I'm looking to migrate a pool from using multiple smaller LUNs to one larger LUN. I don't see a way to do a zpool replace for multiple to one. Anybody know how to do this? It needs to be non-disruptive. Depends on the zpool's layout and the source of the old and the new files; you can only replace or attach vdevs one by one, and you could theoretically do that by making different slices on the new device. I don't think you want that. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] chmod behavior with symbolic links
I know it's documented in the manual, but I find it a bit strange behaviour that chmod -R changes the permissions of the target of a symbolic link. Is there any reason for this behaviour? Symbolic links do not have a mode; so you can't chmod them; chmod(2) follows symbolic links (it was created before symbolic links existed). Unfortunately, when symbolic links were created, they had an owner but no relevant mode: so there's a readlink, symlink, lchown but no lchmod. I think a lchmod() would be nice, if only to avoid following them. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
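Until an lchmod() exists, the usual workaround is to let find skip the links; the path and mode bits below are just an example:
    find /export/proj ! -type l -exec chmod go-w {} \;
    # find does not follow symlinks while descending, and ! -type l keeps
    # chmod from ever being handed a link, so the link targets are untouched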
Re: [zfs-discuss] Is there something like udev in OpenSolaris
Hello list, being a Linux guy I'm actually quite new to OpenSolaris. One thing I miss is udev. I found that when using SATA disks with ZFS - it always required manual intervention (cfgadm) to do SATA hot plug. I would like to automate the disk replacement, so that it is a fully automatic process without manual intervention if: a) the new disk contains no ZFS labels b) the new disk does not contain a partition table .. thus it is a real replacement part On Linux I would write a udev hot plug script to automate this. Is there something like udev on OpenSolaris? (A place / hook that is executed every time new hardware is added / detected) Sysevent, perhaps? Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Growing ZFS Volume with SMI/VTOC label
So in a ZFS boot disk configuration (rpool) in a running environment, it's not possible? The example I gave does grow the rpool while running from the rpool. But you need a recent version of zfs to grow the pool while it is in use. On Fri, Feb 19, 2010 at 9:25 AM, casper@sun.com wrote: Is it possible to grow a ZFS volume on a SPARC system with an SMI/VTOC label without losing data as the OS is built on this volume? Sure, as long as the new partition starts on the same block and is longer. It was a bit more difficult with UFS but for zfs it is very simple. I had a few systems with two ufs root slices using live upgrade: <slice 1><slice 2><swap> First I booted from <slice 2>: ludelete slice1; zpool create rpool slice1; lucreate -p rpool; luactivate slice1; init 6. From the zfs root: ludelete slice2; format: remove slice2, grow slice1 to incorporate slice2, label. At that time I needed to reboot to get the new device size reflected in zpool list; today that is no longer needed. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
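With reasonably current bits the final expansion step is just this (device name invented):
    zpool set autoexpand=on rpool
    zpool online -e rpool c0t0d0s0     # expand into the enlarged slice
    zpool list rpool                   # the new size appears without a reboot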
Re: [zfs-discuss] Proposed idea for enhancement - damage control
If there were a real-world device that tended to randomly flip bits, or randomly replace swaths of LBA's with zeroes, but otherwise behave normally (not return any errors, not slow down retrying reads, not fail to attach), then copies=2 would be really valuable, but so far it seems no such device exists. If you actually explore the errors that really happen I venture there are few to no cases copies=2 would save you. I had a device which had 256 bytes of the 32MB broken (some were 1, some were always 0). But I never put it online because it was so broken. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposed idea for enhancement - damage control
If there were a real-world device that tended to randomly flip bits, or randomly replace swaths of LBA's with zeroes, but otherwise behave normally (not return any errors, not slow down retrying reads, not fail to attach), then copies=2 would be really valuable, but so far it seems no such device exists. If you actually explore the errors that really happen I venture there are few to no cases copies=2 would save you. I had a device which had 256 bytes of the 32MB broken (some were 1, some were always 0). But I never put it online because it was so broken. Of the 32MB cache, sorry. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Shrink the slice used for zpool?
Hi, I recently installed OpenSolaris 2009.06 on a 10GB primary partition on my laptop. I noticed there wasn't any option for customizing the slices inside the Solaris partition. After installation, there was only a single slice (0) occupying the entire partition. Now the problem is that I need to set up a UFS slice for my development. Is there a way to shrink slice 0 (backing storage for the zpool) and make room for a new slice to be used for UFS? I also tried to create UFS on another primary DOS partition, but apparently only one Solaris partition is allowed on one disk. So that failed... Can you create a zvol and use that for UFS? Slow, but ... Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
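Roughly like this; the volume name, size and mount point are made up:
    zfs create -V 8g rpool/ufsvol
    newfs /dev/zvol/rdsk/rpool/ufsvol
    mkdir /devwork
    mount /dev/zvol/dsk/rpool/ufsvol /devwork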
Re: [zfs-discuss] why checksum data?
I find that when people take this argument, they are assuming that each component has a perfect implementation and 100% fault coverage. The real world isn't so lucky. Recently I bought a disk with a broken 32MB buffer (256 bits were stuck at 1 or 0). It was corrupting data by the bucket. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Clearing a directory with more than 60 million files
On Tue, January 5, 2010 05:34, Mikko Lammi wrote: As a result of one badly designed application running loose for some time, we now seem to have over 60 million files in one directory. The good thing about ZFS is that it allows this without any issues. Unfortunately, now that we need to get rid of them (because they eat 80% of disk space) it seems to be quite challenging. How about creating a new data set, moving the directory into it, and then destroying it? Assuming the directory in question is /opt/MYapp/data: 1. zfs create rpool/junk 2. mv /opt/MYapp/data /rpool/junk/ 3. zfs destroy rpool/junk The move will copy and then remove the files; the removal done by mv is just as inefficient, removing them one by one. rm -rf would be at least as quick. Casper ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss