[zfs-discuss] Building big cheap storage system. What hardware to use?
Hello. We need big, cheap storage and are looking at Supermicro systems, something based on the SC846E1-R900 case http://www.supermicro.com/products/chassis/4U/846/SC846E1-R900.cfm with 24 disk bays. This case comes with a 3 Gbit LSI SASX36 expander, but the problem of LSI-based HBA timeouts really concerns me. Should I get a newer motherboard with a 6 Gbit LSI SAS 2008 HBA, like http://www.supermicro.com/products/motherboard/QPI/5500/X8DT6-F.cfm?IPMI=Y&SAS=Y, or an older motherboard with the LSI 1068 HBA? Can anyone post good working configurations based on Supermicro hardware? Planning to use 2 TB Hitachi SATA drives (any thoughts on HDD choice?). -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On 2010-01-25 at 08:31 -0600 Mike Gerdts sent off: You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. A filesystem that is able to do that fast would have to implement something like unwritten extents. Some days ago I experimented with creating and allocating huge files on ZFS on top of OpenSolaris using fcntl and F_ALLOCSP, which is basically the same thing that you want to do when you zero out space. It takes ages because it actually writes zeroes to the disk. A filesystem that knows the concept of unwritten extents finishes the job immediately. There are no real zeroes on the disk, but the extent is tagged as unwritten (you get zeroes when you read it). Are there any plans to add unwritten extent support to ZFS, or any reason why not? Björn
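For comparison, POSIX exposes this kind of preallocation as posix_fallocate(). A minimal sketch of the two behaviours described above, zero-writing versus extent tagging (hypothetical demo paths, not ZFS-specific; on filesystems without unwritten-extent support the libc call itself falls back to writing zeroes):

```python
import os

def allocate_with_zeroes(path, size, chunk=1 << 20):
    # What F_ALLOCSP on ZFS effectively does today: write real zeroes.
    with open(path, "wb") as f:
        buf = b"\0" * chunk
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(buf[:n])
            remaining -= n
    return os.path.getsize(path)

def allocate_unwritten(path, size):
    # On a filesystem with unwritten extents (e.g. ext4, XFS) this returns
    # almost immediately: the extent is merely tagged unwritten, and any
    # read of it returns zeroes.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
    return os.path.getsize(path)
```

Both calls leave a file that reads back as all zeroes; the difference is only in how long the allocation takes and whether zeroes physically hit the disk.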
[zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
Hi, I was suffering for weeks from the following problem: a zfs dataset contained an automatic snapshot (monthly) that used 2.8 TB of data. The dataset was deprecated, so I chose to destroy it after I had deleted some files; eventually it was completely blank apart from the snapshot, which still locked 2.8 TB on the pool. 'zfs destroy -r pool/dataset' hung the machine within seconds, leaving it completely unresponsive. No relevant messages could be found in the logs. The issue was reproducible. The same happened for 'zfs destroy pool/data...@snapshot'. Thus, the conclusion was that the snapshot was indeed the problem. Solution: After trying several things, including updating the system to snv_130 and snv_131, I had the idea to roll the dataset back to the snapshot before making another zfs destroy attempt. 'zfs rollback pool/data...@snapshot' 'zfs unmount -f pool/dataset' 'zfs destroy -r pool/dataset' Et voilà! It worked. Conclusion: I guess there is something wrong in how zfs handles snapshots during a recursive dataset destruction. As it seems, the destruction is only successful if the dataset is consistent with the snapshot. Even if the workaround seems viable, a fix for the issue would be appreciated. Regards, Tonmaus
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
On 27 janv. 2010, at 12:10, Georg S. Duck wrote: Hi, I was suffering for weeks from the following problem: a zfs dataset contained an automatic snapshot (monthly) that used 2.8 TB of data. The dataset was deprecated, so I chose to destroy it after I had deleted some files; eventually it was completely blank besides the snapshot that still locked 2.8 TB on the pool. 'zfs destroy -r pool/dataset' hung the machine within seconds to be completely unresponsive. No respective messages could be found in logs. The issue was reproducible. The same happened for 'zfs destroy pool/data...@snapshot' Thus, the conclusion was that the snapshot was indeed the problem. For info, I have exactly the same situation here with a snapshot that cannot be deleted that results in the same symptoms. Total freeze, even on the console. Server responds to pings, but that's it. All iSCSI, NFS and ssh connections are cut. Currently running b130. I'll try the workaround once I get some spare space to migrate the contents. Erik
[zfs-discuss] Strange random errors getting automatically repaired
Hello, Has anyone ever seen vdevs getting removed and added back to the pool very quickly? That seems to be what's happening here. This has started to happen on dozens of machines at different locations since a few days ago. They are running OpenSolaris b111 and a few b126. Could this be bit rot and/or silent corruption getting detected and fixed?

Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID: FMD-8000-4M, TYPE: Repair, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-4M for more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: Some system components offlined because of the original fault may have been brought back online.
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system due to the original fault may have been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u EVENT-ID to identify the repaired components.
Jan 27 01:18:01 hostname fmd: [ID 441519 daemon.notice] SUNW-MSG-ID: FMD-8000-6U, TYPE: Resolved, VER: 1, SEVERITY: Minor
Jan 27 01:18:01 hostname EVENT-TIME: Thu Dec 24 08:50:34 BRST 2009
Jan 27 01:18:01 hostname PLATFORM: X7DB8, CSN: 0123456789, HOSTNAME: hostname
Jan 27 01:18:01 hostname SOURCE: fmd, REV: 1.2
Jan 27 01:18:01 hostname EVENT-ID: 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
Jan 27 01:18:01 hostname DESC: All faults associated with an event id have been addressed.
Jan 27 01:18:01 hostname Refer to http://sun.com/msg/FMD-8000-6U for more information.
Jan 27 01:18:01 hostname AUTO-RESPONSE: All system components offlined because of the original fault have been brought back online. 
Jan 27 01:18:01 hostname IMPACT: Performance degradation of the system due to the original fault has been recovered.
Jan 27 01:18:01 hostname REC-ACTION: Use fmdump -v -u EVENT-ID to identify the repaired components.

# fmdump -e -t 23Jan2010
TIME CLASS
#
# fmdump
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
# fmdump -V
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2372 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-4M Repaired
TIME CLASS ENA
Dec 24 08:50:34.4470 ereport.fs.zfs.vdev.corrupt_data 0x533bf0e964a01801
Dec 23 16:08:42.0738 ereport.fs.zfs.probe_failure 0xe87b448c8ba00c01
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b446b04f1
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b44664b300401
Dec 23 16:08:42.0738 ereport.fs.zfs.io 0xe87b445710a01001
Dec 23 16:08:42.0739 ereport.fs.zfs.io 0xe87b4461a4d00c01
nvlist version: 0
  version = 0x0
  class = list.repaired
  uuid = 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd
  code = FMD-8000-4M
  diag-time = 1261651834 766268
  de = (embedded nvlist)
    nvlist version: 0
    version = 0x0
    scheme = fmd
    authority = (embedded nvlist)
      nvlist version: 0
      version = 0x0
      product-id = X7DB8
      chassis-id = 0123456789
      server-id = hostname
    (end authority)
    mod-name = fmd
    mod-version = 1.2
  (end de)
  fault-list-sz = 0x1
  fault-list = (array of embedded nvlists)
  (start fault-list[0])
  nvlist version: 0
    version = 0x0
    class = fault.fs.zfs.device
    certainty = 0x64
    asru = (embedded nvlist)
      nvlist version: 0
      version = 0x0
      scheme = zfs
      pool = 0x9f4842f183c4c7cc
      vdev = 0xd207014426714df9
    (end asru)
    resource = (embedded nvlist)
      nvlist version: 0
      version = 0x0
      scheme = zfs
      pool = 0x9f4842f183c4c7cc
      vdev = 0xd207014426714df9
    (end resource)
  (end fault-list[0])
  fault-status = 0x6
  __ttl = 0x1
  __tod = 0x4b5fb069 0xe23eb38
TIME UUID SUNW-MSG-ID
Jan 27 01:18:01.2391 0cb73c5a-d444-ede6-e49f-fce4aad8a1cd FMD-8000-6U Resolved
TIME CLASS
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
Server responds to pings, but that's it. All iSCSI, NFS and ssh connections are cut. That's consistent with my findings, adding that SMB is cut as well. During one vain attempt to destroy the data...@snapshot I got a [ID 224711 kern.warning] WARNING: Memory pressure: TCP defensive mode on. If I had a separate ssh session open with 'top' running, I could watch the CPU load go through the roof before that session died along with everything else. For info, I have exactly the same situation here with a snapshot that cannot be deleted that results in the same symptoms. That would rule out an empty dataset being a relevant side condition. I'll try the workaround once I get some spare space to migrate the contents. If your final aim isn't the destruction of the dataset, that exacerbates the situation. After I had understood the issue with snapshots, my choice was to deactivate all automatic snapshots on non-rpools. Specifically, I have different backup protocols in place anyhow. Automatic snapshots are on by default. Regards, Tonmaus
Re: [zfs-discuss] zero out block / sectors
On 2010-01-27 at 09:50 + Darren J Moffat sent off: The whole point of the original question wasn't about consumers of ZFS but where ZFS is the consumer of block storage provided by something else that expects to see zeros on disk. This thread is about thin provisioning *to* ZFS not *on* it. You're right; indeed the original question is a different problem, one that unwritten-extent support wouldn't address. Björn
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
If you choose the AOC-USAS-L8i controller route, don't worry too much about the exotic-looking nature of these SAS/SATA controllers. These controllers drive SAS drives and also SATA drives. As you will be using SATA drives, you'll just get cables that plug into the card. The card has 2 ports; you buy a cable that plugs into a port and fans out into 4 SATA connectors. Just buy 2 cables if you need to drive 8 drives, or at least more than 4. Supermicro sells a few different lengths for these cables, so once you've measured, you can choose. Take a look at this post of mine and look for the card, cables and text where I also remarked on the scariness factor of dealing with 'exotic' hardware. http://breden.org.uk/2009/08/29/home-fileserver-mirrored-ssd-zfs-root-boot/ And cables are here: http://supermicro.com/products/accessories/index.cfm http://64.174.237.178/products/accessories/index.cfm (DNS failed so I gave the IP address version too) Then select 'cables' from the list. From the cables listed, search for 'IPASS to 4 SATA Cable' and you will find they have a 23cm version (CBL-0118L-02) and a 50cm version (CBL-0097L-02). Sounds like your larger case will probably need the 50cm version. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
Hi David, On Mon, Jan 25, 2010 at 11:16 AM, David Dyer-Bennet d...@dd-b.net wrote: My current home fileserver (running Open Solaris 111b and ZFS) has an ASUS M2N-SLI DELUXE motherboard. This has 6 SATA connections, which are currently all in use (mirrored pair of 80GB for system zfs pool, two mirrors of 400GB both in my data pool). I've got two more hot-swap drive bays. And I'm getting up towards 90% full on the data pool. So, it's time to expand, right? I have two approaches in contention: #1, I can just swap drives for bigger drives, waiting for resilver and taking the risk that the other drive will fail during the resilver (I do have backups, plus I've got the old removed drive as well, so I could recover from a failure during resilver with some downtime). #2, I can find or install two additional SATA ports and put two more drives in the open bays. I've even got two 400GB drives sitting available; that's a 50% increase on current storage, so I'm not inclined to spend money for new drives yet, even though these are quite small. (I picked up a pile of free Sun-badged Hitachi 400GB drives when the project I was on at the time decided they were too small to use and put them out for people to take home. I grabbed two right away, and very conscientiously stayed away for a while to give other people a good shot too. But I took another drive every hour, and left with 7 of them. There were still some there when I left, so I feel virtuous rather than greedy.) I prefer approach two. Three pair gives me more flexibility and more performance than two, plus I don't have to pay for new drives right away since I've got spare 400GB drives around. Plus it probably bothers me more than it should that I'm wasting two of the fairly expensive hot-swap bays. So, with regard to option #2, I have two questions. First, there's some sign that this motherboard has an integral raid controller. Can it also be used to drive bare drives? 
If I could just find two more usable controller ports (with good drivers and hot-swap support), I'd be happy without spending any money. Anybody understand this motherboard? Second, if I have to buy an additional controller, what should I buy for driving two (or at most 4; I suppose it might make sense to reduce the load on the motherboard controller) SATA drives from this motherboard? I believe I have a free PCI-Express x16 slot and two x1 slots (and don't understand these new-fangled ports very well). I want stability, +- 10% performance is not at all important. Cheap is good :-) (paying my own money here!). (Obvious additional choices like replacing the whole box are not interesting; its performance is fine for my needs, and it can easily handle increased disk capacity.) Also, I probably should upgrade to more recent code than snv_111b, eh? What's a demonstrated-to-be-stable code level I could upgrade to? I'm not desperately missing any of the newer features, but I'm looking for bug fixes, especially any that relate to zfs send-receive, which I'm attempting to use to transfer incremental backups to an external USB drive (set up as a single-disk pool). Also I will put more memory in while I've got it open, but I can figure out what memory it takes for myself :-). I'd greatly appreciate motherboard expertise, controller advice, and code version advice from people with experience. Thanks! -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss I have the same motherboard. Though I haven't used all of the SATA ports, I also have an ST Lab PCIe SATA II 300 RAID Card, 2+2 (uses a PCIe 1X port, has Sil3132 chip). I've had the card for almost 2 years now, so I'm not sure if you can still buy these. 
The key thing: the Sil3132 is supported in OpenSolaris. Hope this helps!
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Cindy, It does not list our SAN (LSI/STK/NetApp)... I'm confused about disabling cache from the wiki entries. Should we disable it globally by turning off ZFS cache syncs via echo zfs_nocacheflush/W0t1 | mdb -kw, or specify it per storage device via the sd.conf method, where the array ignores cache flushes from ZFS? Brad
[zfs-discuss] raidz using partitions
hi there, maybe this is a stupid question, yet I haven't found an answer anywhere ;) Let's say I've got 3x 1.5 TB HDDs: can I create equal partitions out of each and make a raid5 out of them? Sure, the safety would drop, but that is not that important to me. With roughly 500 GB partitions and the raid5 formula of (n-1) * smallest drive, I should be able to get 4 TB of storage instead of the 3 TB I'd get when using 3x 1.5 TB in a normal raid5. Thanks for your answers, greetings
Re: [zfs-discuss] raidz using partitions
On Wed, Jan 27, 2010 at 1:55 PM, Albert Frenz y...@zockbar.de wrote: hi there, maybe this is a stupid question, yet i haven't found an answer anywhere ;) let say i got 3x 1,5tb hdds, can i create equal partitions out of each and make a raid5 out of it? sure the safety would drop, but that is not that important to me. with roughly 500gb partitions and the raid5 forumla of n-1*smallest drive i should be able to get 4tb storage instead of 3tb when using 3x 1,5tb in a normal raid5. thanks for you answers greetings -- 3 drives is enough to make a raidz already, but yes, you can use slices. I have a friend who did that: he had 2x 1.5 TB drives, 2x 1 TB drives and a 2 TB drive, so he made 2 raidzs, one with 5x 1 TB slices and one with 3x 500 GB slices.
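To sanity-check the (n-1) arithmetic from the question, here is a hypothetical back-of-envelope helper (not a ZFS tool). Bear in mind the safety caveat: with several slices of the same physical disk in one raidz, a single disk failure takes out several members at once, defeating single parity.

```python
def raidz1_usable_gb(members_gb):
    # Single-parity raidz: usable space is (n - 1) * smallest member.
    n = len(members_gb)
    if n < 3:
        raise ValueError("raidz1 wants at least 3 members")
    return (n - 1) * min(members_gb)

# Whole-disk raidz1 of 3 x 1500 GB drives:
whole_disks = raidz1_usable_gb([1500, 1500, 1500])   # 3000 GB
# Nine 500 GB slices (three per 1.5 TB disk) in one raidz1:
sliced = raidz1_usable_gb([500] * 9)                 # 4000 GB
```

This reproduces the 4 TB-versus-3 TB figure from the question, and also the friend's mixed-drive layouts (five 1 TB slices give 4 TB usable).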
[zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
Has anyone done research into the performance of swap on a traditional partition-based swap device as compared to a swap area set up on a ZFS zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single swap zvol within a rootpool and then mirroring the rootpool across two separate disks?
Re: [zfs-discuss] raidz using partitions
ok nice to know :) thank you very much for your quick answer
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
RayLicon wrote: Has anyone done research into the performance of SWAP on the traditional partitioned based SWAP device as compared to a SWAP area set up on ZFS with a zvol? I can find no best practices for this issue. In the old days it was considered important to separate the swap devices onto individual disks (controllers) and select the outer cylinder groups for the partition (to gain some read speed). How does this compare to creating a single SWAP zvol within a rootpool and then mirroring the rootpool across two separate disks? Best practice nowadays is to design a system so it doesn't need to swap. Then it doesn't matter what the performance of the swap device is. -- Andrew Gabriel
Re: [zfs-discuss] zfs destroy hangs machine if snapshot exists- workaround found
This sounds like yet another instance of 6910767 deleting large holey objects hangs other I/Os I have a module based on 130 that includes this fix if you would like to try it. -tim Hi Tim, 6910767 seems to be about ZVOLs. The dataset here was not a ZVOL. I had a 1.4 TB ZVOL on the same pool that also wasn't easy to kill. It hung the machine as well, but only once: it was gone after a forced reboot. Regards, Tonmaus
Re: [zfs-discuss] backing this up
Hello All, I read through the attached threads and found a solution by a poster and decided to try it. The solution was to use 3 files (in my case I made them sparse); I then created a raidz2 pool across these 3 files and started a zfs send | recv. The performance is horrible: around 5.62 MB/s. When I am backing up the other system to this failover system over a network connection I can get around 40 MB/s. Is it because I am backing it up onto files rather than physical disks? Am I doing this all wrong? This pool is temporary, as it will be sent to tape, deleted and recreated. Is it possible to zfs send to two destinations simultaneously? Or am I stuck? Any pointers would be great! I am using OpenSolaris snv_129 and the disks are SATA WD 1 TB 7200 rpm disks. Thanks All! Greg On Mon, Jan 25, 2010 at 3:41 PM, Gregory Durham gregory.dur...@gmail.com wrote: Well I guess I am glad I am not the only one. Thanks for the heads up! On Mon, Jan 25, 2010 at 3:39 PM, David Magda dma...@ee.ryerson.ca wrote: On Jan 25, 2010, at 18:28, Gregory Durham wrote: One option I have seen is zfs send zfs_s...@1 /some_dir/some_file_name. Then I can back this up to tape. This seems easy, as I have already created a script that does just this, but I am worried that this is not the best or most secure way to do it. Does anyone have a better solution? We've been talking about this for the last week and a half. :) http://mail.opensolaris.org/pipermail/zfs-discuss/2010-January/thread.html#35929 http://opensolaris.org/jive/thread.jspa?threadID=121797 (They're the same thread, just different interfaces.) I was thinking about then gzip'ing this but that would take an enormous amount of time... If you have a decent amount of CPU, you can parallelize compression: http://www.zlib.net/pigz/ http://blogs.sun.com/timc/entry/tamp_a_lightweight_multi_threaded The LZMA algorithm (as used in 7-Zip) is supposed to beat gzip in many benchmarks, and supposedly parallelizes well. 
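For reference, sparse backing files like the ones described above can be created without writing any data at all. A small sketch with hypothetical paths and sizes (truncate() extends the logical size, but blocks are only allocated as the pool actually writes to them):

```python
import os

def make_sparse(path, size_bytes):
    # Create a file with logical size size_bytes that occupies (almost)
    # no space on disk until data is written into it.
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    st = os.stat(path)
    # st_blocks counts 512-byte blocks actually allocated on disk
    return st.st_size, st.st_blocks * 512

# e.g. three 100 GB backing files for a pool of files (assumed layout):
# for i in range(3):
#     make_sparse("/staging/backing%d" % i, 100 * 2**30)
# followed by something like:
#     zpool create backuppool raidz2 /staging/backing0 /staging/backing1 ...
```

The returned pair (logical size, allocated bytes) makes the sparseness visible: immediately after creation, the allocated figure is near zero.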
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Brad, It depends on the Solaris release. What Solaris release are you running? Thanks, Cindy On 01/27/10 11:43, Brad wrote: Cindy, It does not list our SAN (LSI/STK/NetApp)...I'm confused about disabling cache from the wiki entries. Should we disable it by turning off zfs cache syncs via echo zfs_nocacheflush/W0t1 | mdb -kw or specify it by storage device via the sd.conf method where the array ignores cache flushes from zfs? Brad
Re: [zfs-discuss] ARC not using all available RAM?
I am interested in this as well. My machine has 5 GB RAM, and will soon have an 80 GB SSD device. My free memory hovers around 750 MB, and the ARC around 3 GB. This machine doesn't do anything other than iSCSI/CIFS; I wouldn't mind using the extra 500 MB for caching. And this becomes especially important if the kernel will need to consume such large amounts of memory for managing the L2ARC. CPU cache thrashing, although an important topic, is of no importance in such cases IMO. I.e., I don't mind my CPU caches being thrashed if I fire up a GNOME desktop occasionally. But I do mind having 750 MB of RAM sitting unused.
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
Ok ... Given that ... yes, we all know that swapping is bad (thanks for the enlightenment). To swap or not to swap isn't related to this question, and besides, even if you don't page swap, other mechanisms can still claim swap space, such as the tmp filesystem. The question is simple: IF you have to swap (for whatever reason), which of the two alternatives is better (separate disk partitions on multiple disks, or zvols on ZFS stripes or mirrors), and why? If no one has any data on this issue then fine, but I didn't waste my time posting to this site to get responses that simply say don't swap.
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
On 1/25/2010 6:23 PM, Simon Breden wrote: By mixing randomly purchased drives of unknown quality, people are taking unnecessary chances. But often, they refuse to see that, thinking that all drives are the same and they will all fail one day anyway... I would say, though, that buying different drives isn't inherently either random or drives of unknown quality. Most of the time, I know no reason other than price to prefer one major manufacturer to another. And, over and over again, I've heard of bad batches of drives: small manufacturing or design or component-sourcing errors. Given how the resilvering process can be quite long (on modern large drives) and quite stressful (when the system remains in production use during resilvering, so that load is on top of the normal load), I'd rather not have all the drives in the set be from the same bad batch! Google works heavily from the philosophy that things WILL fail, so they plan for it and have enough redundancy to survive it -- and then save lots of money by not paying for premium components. I like that approach. -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
LICON, RAY (ATTPB) wrote: Thanks for the reply. In many situations, the hardware design isn't up to me and budgets tend to dictate everything these days. True, nobody wants to swap, but the question is if you had to -- what design serves you best: independent swap slices, or putting it all under control of zfs? It depends why you need to swap, i.e. why are you using more memory than you have, and is your working set size bigger than memory (thrashing), or is swapping likely to be just a once-off event or infrequently repeated? You probably need to forget most of what you learned about swapping 25 years ago, when systems routinely swapped, and technology was very different. Disks have got faster over that period, probably of the order of 100 times faster. However, CPUs have got 100,000 times faster, so in reality a disk looks to be 1000 times slower from the CPU's standpoint than it did 25 years ago. This means that CPU cycles lost due to swapping will appear to have a proportionally much more dire effect on performance than they did many years back. There are lots more options available today than there were when systems routinely swapped. A couple of examples that spring to mind... ZFS has been explicitly designed to swap its own cache data, only we don't call it swapping - we call it an L2ARC or ReadZilla. So if you have a system where the application is going to struggle with main memory, you might configure ZFS to significantly reduce its memory buffer (ARC), and instead give it an L2ARC on a fast solid state disk. This might result in less performance degradation in some systems where memory is short, depending heavily on the behaviour of the application. If you do have to go with brute-force old-style swapping, then you might want to invest in solid state disk swap devices, which will go some way towards reducing the factor of 1000 I mentioned above. (Take note of aligning swap to the 4k flash i/o boundaries.) 
Probably lots of other possibilities too, given more than a couple of minutes thought. -- Andrew
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboard
On 1/27/2010 7:29 AM, Simon Breden wrote: And cables are here: http://supermicro.com/products/accessories/index.cfm http://64.174.237.178/products/accessories/index.cfm (DNS failed so I gave IP address version too) Then select 'cables' from the list. From the cables listed, search for 'IPASS to 4 SATA Cable' and you will find they have a 23cm version (CBL-0118L-02) and a 50cm version (CBL-0097L-02). Sounds like your larger case will probably need the 50cm version. And those seem to be half the price of the others I've found. I'll still have to check the length first, though. And they're listed on Amazon. (Supermicro either doesn't let you buy direct from their web site, or even check a price, or at least makes it very hard.) (This is a big Chenbro case; I think it's really a rack 4U system being used as a tower.) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
Re: [zfs-discuss] backing this up
On Wed, Jan 27, 2010 at 12:01:36PM -0800, Gregory Durham wrote: Hello All, I read through the attached threads and found a solution by a poster and decided to try it. That may have been mine - good to know it helped, or at least started to. The solution was to use 3 files (in my case I made them sparse) yep - writes to allocate space for them up front are pointless with CoW. I then created a raidz2 pool across these 3 files Really? If you want one tape's worth of space, written to 3 tapes, you might as well just write the same file to three tapes, I think. (I'm assuming here the files are the size you expect to write to a single tape - otherwise I'm even more confused about this bit). Perhaps it's easier to let zfs cope with repairing small media errors here and there, but the main idea of using a redundant pool of files was to cope with loss or damage to whole tapes, for a backup that already needed to span multiple tapes. If you want this three-way copy of a single tape, plus easy recovery from bad spots by reading back multiple tapes, then use a 3-way mirror. But consider the error-recovery mode of whatever you're using to write to tape - some skip to the next file on a read error. I expect similar ratios of data to parity files/tapes as would be used in typical disk setups, at least for wide stripes. Say raidz2 in sets of 10, 8+2, or so. (As an aside, I like this for disks, too - since striping 128k blocks to a power-of-two wide data stripe has to be more efficient) and started a zfs send | recv. The performance is horrible There can be several reasons for this, and we'd need to know more about your setup. The first critical thing is going to be the setup of the staging filesystem that holds your pool files. If this is itself a raidz, perhaps you're iops limited - you're expecting 3 disk-files worth of concurrency from a pool that may not have it, though it should be a write-mostly workload so less sensitive. You'll be seeking a lot either way, though. 
If this is purely staging to tape, consider making the staging pool out of non-redundant single-disk vdevs. Alternately, if the staging pool is safe, there's another trick you might consider: create the pool, then offline 2 files while you recv, leaving the pool-of-files degraded. Then when you're done, you can let the pool resilver and fill in the redundancy. This might change the IO pattern enough to take less time overall, or at least allow you some flexibility with windows to schedule backup and tapes. Next is dedup - make sure you have the memory and l2arc capacity to dedup the incoming write stream. Dedup within the pool of files if you want and can (because this will dedup your tapes), but don't dedup under it as well. I've found this to produce completely pathological disk thrashing, in a related configuration (pool on lofi crypto file). Stacking dedup like this doubles the performance cliff under memory pressure we've been talking about recently. (If you really do want 3-way-mirror files, then by all means dedup them in the staging pool.) Related to this is arc usage - I haven't investigated this carefully myself, but you may well be double-caching: the backup pool's data, as well as the staging pool's view of the files. Again, since it's a write mostly workload zfs should hopefully figure out that few blocks are being re-read, but you might experiment with primarycache=metadata for the staging pool holding the files. Perhaps zpool-on-files is smart enough to use direct io bypassing cache anyway, I'm not sure. How's your cpu usage? Check that you're not trying to double-compress the files (again, within the backup pool but not outside) and consider using a lightweight checksum rather than sha256 outside. Then there's streaming and concurrency - try piping through buffer and using bigger socket and tcp buffers. TCP stalls and slow-start will amplify latency many-fold. 
A good zil device on the staging pool might also help, the backup pool will be doing sync writes to close its txgs, though probably not too many others. I haven't experimented here, either. This pool is temporary as it will be sent to tape, deleted and recreated. I tend not to do that, since I can incrementally update the pool contents before rewriting tapes. This helps hide the performance issues dramatically since much less data is transferred and written to the files, after the first time. Is it possible to zfs send to two destinations simultaneously? Yes, though it's less convenient than using -R on the top of the pool, since you have to solve any dependencies (including clone renames) yourself. Whether this helps or hurts depends on your bottleneck: it will help with network and buffering issues, but hurt (badly) if you're limited by thrashing seeks (at the writer, since you already know the reader can sustain higher rates). Or am I stuck. Any pointers would be great! Never. Always! :-) -- Dan.
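The "bigger socket and tcp buffers" suggestion can be tried from the sending side without touching system-wide tunables. A minimal Python sketch (the 1 MiB figure is just an illustrative starting point, not a recommendation; the kernel is free to clamp or round whatever you request):

```python
import socket

def bump_buffers(sock, size=1 << 20):
    # Ask for larger send/receive buffers on the socket that carries the
    # zfs send | recv stream; the kernel may clamp or adjust the value.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    # Read back what the kernel actually granted.
    return (sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
            sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
snd, rcv = bump_buffers(s)
s.close()
```

The same effect is more usually had by inserting a userland buffer program in the pipeline between send and recv; the point either way is to keep the sender from stalling on TCP round trips.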
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboa
On Jan 27, 2010, at 12:34 PM, David Dyer-Bennet wrote: Google is working heavily with the philosophy that things WILL fail, so they plan for it, and have enough redundance to survive it -- and then save lots of money by not paying for premium components. I like that approach. Yes, it does work reasonably well. But many people on this forum complain that mirroring disks is too expensive, so they would never consider mirroring the whole box, let alone triple or quadruple mirroring the whole box :-) -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboa
On Wed, Jan 27, 2010 at 02:34:29PM -0600, David Dyer-Bennet wrote: Google is working heavily with the philosophy that things WILL fail, so they plan for it, and have enough redundance to survive it -- and then save lots of money by not paying for premium components. I like that approach. So do I, and most other zfs fans. Google, unlike most of us, is also big enough to buy a whole pallet of disks at a time, and still spread them around to avoid common faults taking out all copies. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance of partition based SWAP vs. ZFS zvol SWAP
ag == Andrew Gabriel andrew.gabr...@sun.com writes: ag is your working set size bigger than memory (thrashing), n...no, not...not exactly. :) ag or is swapping likely to be just a once-off event or ag infrequently repeated? once-off! or...well...repeated, every time the garbage collector runs. ag You probably need to forget most of what you learned about ag swapping 25 years ago, when systems routinely swapped, and ag technology was very different. yes, some Lisp machines had integrated swapper/garbagecollectors. Now we have sbrk() + gc. dumb! We used to not worry about overcommitting because refusing to overcommit just meant some of the allocated swap space would never get written. It was a little bit foolish because the threat of thrashing means, whenever swap's involved, you're basically overcommitted, but it let us feel better. Now that we're not using swap, failure to overcommit seems rather wasteful. At the very least you should allow the ARC cache to grow into memory reserved for an allocation, then boot the ARC out of it if the process actually writes to more than you thought it would and you need to keep a commitment you thought you wouldn't. ag solid state disk swap devices, smart! it might turn out to be good for ebooks and other power-constrained devices, too, because DRAM uses battery: swapping to conserve energy rather than RAM. It might be worth tracking pages in a more complicated way than we're now doing if the goal is to evacuate RAM and power it down, so maybe holding onto ancient swap wisdom and code isn't as helpful to this as it might seem. The point, keep swap on ZFS so you can grow/shrink/delete it as fashion changes, is good. But the OP's question still stands: does ZFS swap perform almost as well as raw device swap, or is it worth partitioning disks if you insist on actually using swap? I guess no one knows. 
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] cannot attach c5d0s0 to c4d0s0: device is too small
cannot attach c5d0s0 to c4d0s0: device is too small So I guess I installed OpenSolaris onto the smallest disk. Now I cannot create a mirrored root, because the device is smaller. What is the best way to correct this except starting all over with two disks of the same size (which I don't have)? Do I zfs send the stream to the smallest disk and will the bigger one attach itself? Or is there another way? I need redundancy, so I hope to get answers soon. ;-) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARC Ghost lists, why have them and how much ram is used to keep track of them? [long]
I have the exact same questions. I am very interested in the answers to those. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] cannot attach c5d0s0 to c4d0s0: device is too small
Hi Dick, Based on this message: cannot attach c5d0s0 to c4d0s0: device is too small c5d0s0 is the disk you are trying to attach, so it must be smaller than c4d0s0. Is it possible that c5d0s0 is just partitioned so that its s0 is smaller than s0 on c4d0s0? On some disks, the default partitioning is not optimal and you have to modify it so that the bulk of the disk space is in slice 0. I would confirm this first as it's the easiest solution by far. Another thought is that a recent improvement was that you can attach a disk that is an equivalent size, but not exactly the same geometry. Which OpenSolaris release is this? Thanks, Cindy On 01/27/10 15:26, dick hoogendijk wrote: cannot attach c5d0s0 to c4d0s0: device is too small So I guess I installed OpenSolaris onto the smallest disk. Now I cannot create a mirrored root, because the device is smaller. What is the best way to correct this except starting all over with two disks of the same size (which I don't have)? Do I zfs send the stream to the smallest disk and will the bigger one attach itself? Or is there another way? I need redundancy, so I hope to get answers soon. ;-) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
We're running 10/09 on the dev box but 11/06 is prodqa. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] primarycache=off, secondarycache=all
In the case of a ZVOL with the following settings: primarycache=off, secondarycache=all How does the L2ARC get populated if the data never makes it to ARC? Is this even a valid configuration? The reason I ask is I have iSCSI volumes for NTFS, and I intend to use an SSD for l2arc. If something is read from the iSCSI device, then chances are Windows (or whatever OS) will cache it for a while in its own cache. It is unlikely that the data will be needed soon (under normal circumstances). Thus I would like to avoid polluting the ARC with non-relevant data, but then the question is, how will that data make it to the L2ARC? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Instructions for ignoring ZFS write cache flushing on intelligent arrays
Hi Brad, You should see better performance on the dev box running 10/09 with the sd and ssd drivers as is, because they should properly handle the SYNC_NV bit in this release. If you have determined that the 11/06 system is affected by this issue, then the best method is to set this parameter in the /kernel/drv/*conf file. I'm unclear whether you understand all the implications of disabling this parameter because we're discussing this over email. Someone with more experience with tuning this parameter should weigh in. Brad is using a SAN (LSI/STK/NetApp). Thanks, Cindy On 01/27/10 15:47, Brad wrote: We're running 10/09 on the dev box but 11/06 is prodqa. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On 27 jan 2010, at 10.44, Björn JACKE wrote: On 2010-01-25 at 08:31 -0600 Mike Gerdts sent off: You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. a filesystem that is able to do that fast would have to implement something like unwritten extents. Rather, what is needed is files with holes, as what is expected here is more free space in the file system when the unused parts of the file are punched out. With F_ALLOCSP, you would still not be able to use the space and there would be no gain. Some days ago I experimented to create and allocate huge files on ZFS on top of OpenSolaris using fcntl and F_ALLOCSP, which is basically the same thing that you want to do when you zero out space. It takes ages because it actually writes zeroes to the disk. A filesystem that knows the concept of unwritten extents finishes the job immediately. There are no real zeros on the disk but the extent is tagged as unwritten (you get zeros when you read it). Files with holes are implemented, and as far as I know they are fast too:

-bash-4.0$ cat hole.py
f = open('foo', 'w')
f.write('x')
f.seek(2**62)
f.write('y')
f.close()
-bash-4.0$ time python hole.py

real    0m0.019s
user    0m0.010s
sys     0m0.009s

-bash-4.0$ ls -la foo
-rw-r--r--   1 ragge  staff  4611686018427387905 Jan 28 00:26 foo

Are there any plans to add unwritten extent support into ZFS or any reason why not? I have no idea, but just out of curiosity - when do you want that? /ragge ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
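The claim above - huge apparent size, almost nothing allocated - can be checked by comparing st_size against st_blocks. A small sketch along the same lines as hole.py (POSIX semantics; how many blocks actually get allocated is filesystem-dependent):

```python
import os
import tempfile

# Create a file with a large hole: seek far past EOF and write one byte.
# On filesystems that support holes, almost no blocks are allocated.
path = os.path.join(tempfile.mkdtemp(), 'sparse')
with open(path, 'wb') as f:
    f.seek(2**30)          # leave a 1 GiB hole
    f.write(b'y')

st = os.stat(path)
apparent = st.st_size            # 2**30 + 1 bytes, like the ls -la above
allocated = st.st_blocks * 512   # typically a few KiB when the hole is real
```

Reading back through the hole returns zeros, which is exactly the unwritten-extent behaviour being asked about - except that here no space is reserved for the hole at all.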
Re: [zfs-discuss] primarycache=off, secondarycache=all
On Wed, Jan 27, 2010 at 02:47:47PM -0800, Christo Kutrovsky wrote: In the case of a ZVOL with the following settings: primarycache=off, secondarycache=all How does the L2ARC get populated if the data never makes it to ARC ? Is this even a valid configuration? It's valid, I assume, in the sense that it can be set. However, I've also assumed that if the data never gets into primary cache, it will never be evicted into L2. That's glossing over the details, which may be important - for example, I don't think ZFS is structured to work with data that's *not* in ARC, so it may be that primarycache=off basically marks data for immediate eviction - where it still may be a candidate for l2. The reason I ask is I have iSCSI volumes for NTFS, I intend to use an SSD for l2arc. If something is read from the iSCSI device, then chances are Windows (or whatever OS) will cache it for a while in its own cache. It is unlikely that the data will be needed soon (under normal circumstances). Thus I would like it to avoid polluting the ARC with non-relevant data, but then the question is, how will that data make it to the L2ARC. With the setup above, I suspect it won't. It would be nice to get an authoritative confirmation of that, of course. Regardless, to your original requirement, it sounds like you're looking for a tuning knob to give further hints to the ARC algorithm, about which pages to evict first. More knobs are not always better. ARC should in theory already do a good job of telling the difference between accessed recently and accessed frequently. Evictees from both states can go to l2arc. Look at it another way: If the client cache in the windows machine works as you expect (and I expect it would, at least for some data), the best hint you can give to ARC that these blocks are not needed is to access *other* data. So, measure and analyse. -- Dan. 
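To illustrate Dan's last point - that touching *other* data is itself the eviction hint - here is a toy recency-only cache with an "L2" spillover. This is purely a hypothetical sketch (the real ARC balances recency and frequency across four lists and is far more subtle); the class and its names are invented for illustration:

```python
from collections import OrderedDict

class ToyCache:
    """Toy LRU cache whose evictees spill into a second-level dict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.arc = OrderedDict()   # key -> value, least-recently-used first
        self.l2 = {}               # evicted entries land here

    def access(self, key, value=None):
        if key in self.arc:
            self.arc.move_to_end(key)      # refresh recency
            return self.arc[key]
        if key in self.l2:
            value = self.l2.pop(key)       # promote back from "L2"
        self.arc[key] = value
        if len(self.arc) > self.capacity:
            old, v = self.arc.popitem(last=False)
            self.l2[old] = v               # evictee goes to L2, per the thread
        return value
```

With primarycache=off there is nothing to evict, so (on the reasoning above) nothing ever reaches L2 - which is the suspected behaviour of the configuration in question.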
pgpmFuXepzig7.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz using partitions
On Wed, Jan 27, 2010 at 10:55:21AM -0800, Albert Frenz wrote: hi there, maybe this is a stupid question, yet i haven't found an answer anywhere ;) let say i got 3x 1.5tb hdds, can i create equal partitions out of each and make a raid5 out of it? sure the safety would drop, but that is not that important to me. with roughly 500gb partitions and the raid5 formula of (n-1) * smallest drive i should be able to get 4tb storage instead of 3tb when using 3x 1.5tb in a normal raid5. The only way you can use more than 3TB is if your RAID5 is not protecting data on different disks. By saying 500gb partitions, it sounds like you want to create a 9-column raid on 3 disks. The safety would definitely drop. It would drop so much that it's not really buying you anything. The failure of any drive would mean loss of the data. So if that's already true, why not just put all the disks in a pool and not mess with a raid? You'd get 4.5TB. Partitioning it into pieces and trying to put them all into a single RAID set just makes the setup more complex, probably slower, and adds almost no extra protection. -- Darren ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
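Darren's arithmetic can be made explicit. With the classic raid5 formula the partitioned layout does look bigger on paper, which is exactly the trap - a back-of-envelope sketch (function name is mine, not anything in ZFS):

```python
def raid5_usable(members, member_size_tb):
    # classic raid5 capacity: (n - 1) * smallest member
    return (members - 1) * member_size_tb

# 3 whole 1.5 TB disks, one raid5:
whole_disks = raid5_usable(3, 1.5)    # 3.0 TB usable

# the same 3 disks cut into 9 x 0.5 TB partitions, one 9-column raid5:
partitions = raid5_usable(9, 0.5)     # 4.0 TB on paper

# ...but each physical disk now backs 3 of the 9 columns, so one disk
# failure removes 3 members at once - more than the single column of
# parity can cover, and the array is lost anyway.
columns_lost_per_disk = 9 // 3
```

So the extra terabyte buys layouts that a single disk failure destroys, which is Darren's point: at that level of protection, a plain striped pool of all three disks (4.5 TB) is strictly simpler.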
Re: [zfs-discuss] Strange random errors getting automatically repaired
Hi Giovanni, I have seen these while testing the mpt timeout issue, and on other systems during resilvering of failed disks and while running a scrub. Once so far on this test scrub, and several on yesterday's. I checked the iostat errors, and they weren't that high on that device, compared to other disks.

c2t34d0  ONLINE  0  0  1  25.5K repaired

errors ---
s/w  h/w  trn  tot  device
  0    8   61   69  c2t30d0
  0    2   17   19  c2t31d0
  0    5   41   46  c2t32d0
  0    5   33   38  c2t33d0
  0    3   31   34  c2t34d0
  0   10   81   91  c2t35d0
  0    4   22   26  c2t36d0
  0    6   44   50  c2t37d0
  0    3   21   24  c2t38d0
  0    5   49   54  c2t39d0
  0    9   77   86  c2t40d0
  0    6   58   64  c2t41d0
  0    5   50   55  c2t42d0
  0    4   34   38  c2t43d0
  0    6   37   43  c2t44d0
  0    9   75   84  c2t45d0
  0   13   82   95  c2t46d0
  0    7   57   64  c2t47d0

-- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Going from 6 to 8 disks on ASUS M2N-SLI Deluxe motherboa
On 1/25/2010 6:23 PM, Simon Breden wrote: By mixing randomly purchased drives of unknown quality, people are taking unnecessary chances. But often, they refuse to see that, thinking that all drives are the same and they will all fail one day anyway... My use of the word random was a little joke to refer to drives that are bought without checking basic failure reports made by users, and then the purchaser later says 'oh no, these drives are c**p'. A little checking goes a long way IMO. But each to his own. I would say, though, that buying different drives isn't inherently either random or drives of unknown quality. Most of the time, I know no reason other than price to prefer one major manufacturer to another. Price is an important choice driver I think we all use. But the 'drives of unknown quality' bit is still possible to mitigate by checking, if one is willing to spend the time and knows where to look. We're never going to be 100% certain, but if I read widely of numerous reports that drives of a particular revision number are seriously substandard then I am going to take that info onboard to help me steer away from purchasing them. That's all. And, over and over again, I've heard of bad batches of drives. Small manufacturing or design or component sourcing errors. Given how the resilvering process can be quite long (on modern large drives) and quite stressful (when the system remains in production use during resilvering, so that load is on top of the normal load), I'd rather not have all my drives in the set be from the same bad batch! Indeed. This is why it's good to research, buy what you think is a good drive revision, then load your data onto them and test them out over a period of time. But one has to keep original data safely backed up. Google is working heavily with the philosophy that things WILL fail, so they plan for it, and have enough redundance to survive it -- and then save lots of money by not paying for premium components. I like that approach. 
Yep, as mentioned elsewhere, Google have enormous resources to be hugely redundant and safe. And yes, we all try to use our common sense to build in as much redundancy as we deem necessary and we are able to reasonably afford. And we have backups. Cheers, Simon http://breden.org.uk/2008/03/02/a-home-fileserver-using-zfs/ -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs streams
On 01/25/10 16:08, Daniel Carosone wrote: On Mon, Jan 25, 2010 at 05:42:59PM -0500, Miles Nordin wrote: et You cannot import a stream into a zpool of earlier revision, et though the reverse is possible. This is very bad, because it means if your backup server is pool version 22, then you cannot use it to back up pool version 15 clients: you can backup, but then you can never restore. It would be, yes. Correct. It would be bad if it were true, but it's not. What matters when doing receives of streams is that the version of the dataset (which can differ between datasets on the same system and between datasets in the same replication stream) be less than or equal to the version of the zfs filesystem supported on the receiving system. The zfs filesystem version supported on a system can be displayed with the command zfs upgrade (with no further arguments). The zfs filesystem version is different than the zpool version (displayed by `zpool get version poolname`). You can send a stream from one system to another even if the zpool version is lower on the receiving system or pool. I verified that this works by replicating a dataset from a system running build 129 (zpool version 22 and zfs version 4) to a system running S10 update (zpool version 15 and zfs version 4). Since they agree on the file system version, it works. But when I try to send a stream from build 120 to S10 U6 (zfs version = 3), I get: # zfs recv rpool/new < /net/x4200-brm-16/export/out.zar Jan 27 17:44:36 v20z-brm-03 zfs: Mismatched versions: File system is version 4 on-disk format, which is incompatible with this software version 3! The version of a zfs dataset (i.e. filesystem or zvol) is preserved unless modified. So, I just did zfs send from S10 U6 (zfs version 3) to S10 U8 (zfs version 4). This created a dataset and its snapshot on the build 129 system. 
Then I checked the version of the dataset and snapshot that was created:

# zfs get -r version rpool/new
NAME          PROPERTY  VALUE  SOURCE
rpool/new     version   3      -
rpool/new@s1  version   3      -

So even though the current version of the zfs filesystem on the target system is 4, the dataset created by the receive is 3, because that's the version that was sent. Then I tried sending that dataset back to the U6 system, and it worked. So as long as the version of the *filesystem* is compatible with the target system, you can do sends from, say, S10U8 to S10U6, even though U8 has a higher zfs filesystem version number than U6. Also, as someone pointed out, the stream version has to match too. So if you use dedup (the -D option), that sets the dedup feature flag in the stream header, which makes the stream only receivable on systems that support stream dedup. But if you don't use dedup, the stream can still be read on earlier versions of zfs. Lori

For backup to work the zfs send format needs to depend on the zfs version only, not the pool version in which it's stored nor the kernel version doing the sending. I can send from b130 to b111, zpool 22 to 14. (Though not with the new dedup send -D option, of course). I don't have S10 to test. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
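Lori's rules reduce to a couple of comparisons. A hypothetical sketch of the compatibility check - not actual ZFS code, just the logic of the thread spelled out:

```python
def can_receive(dataset_version, target_fs_version,
                stream_uses_dedup=False, target_reads_dedup_streams=False):
    # A receive works when the dataset (filesystem) version carried in the
    # stream is <= the zfs filesystem version the receiver supports; pool
    # versions don't enter into it at all.  A -D (dedup) stream additionally
    # needs a receiver that understands the dedup stream feature.
    if stream_uses_dedup and not target_reads_dedup_streams:
        return False
    return dataset_version <= target_fs_version

s10u6_to_u8 = can_receive(3, 4)   # dataset v3 into fs v4: works
b120_to_u6 = can_receive(4, 3)    # the "Mismatched versions" failure above
dedup_to_old = can_receive(3, 4, stream_uses_dedup=True)  # blocked by -D flag
```

The asymmetry Miles worried about thus doesn't exist at the pool level; what matters is keeping dataset versions no newer than the oldest receiver you care about.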
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
I have a Supermicro 936E1 (X28 expander chip) and LSI 1068 HBA. I never got the timeout issue, but I'm using Seagate 15K.7 SAS. SATA might be different, as it handles errors and I/O timeouts differently. If you still want volume, you may take a look at the 7200 RPM SAS versions. SAS disks are more expensive. Besides, there are no 2 TB 7200 RPM SAS drives on the market yet. If you can wait, better to wait for a 6 Gbit SAS expander based product. Do you think it makes sense if we use SATA II (3 Gbit/s) disks? I heard there were problems with SAS1 expanders in Supermicro chassis after they came out. I don't want to debug a new product. BTW, I'd get the Supermicro X8DTH-6F motherboard, as this gives enough expansion slots. Thanks for the tip about the motherboard. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zvol being charged for double space
In a thread elsewhere, trying to analyse why the zfs auto-snapshot cleanup code was cleaning up more aggressively than expected, I discovered some interesting properties of a zvol. http://mail.opensolaris.org/pipermail/zfs-auto-snapshot/2010-January/000232.html The zvol is not thin-provisioned. The entire volume has been written to (it was dd'd off a physical disk), and: volsize = refreservation referenced = usedbydataset = (volsize + a little overhead) This is as expected. Not expected is that: usedbyrefreservation = refreservation I would expect this to be 0, since all the reserved space has been allocated. As a result, used is over twice the size of the volume (+ a few small snapshots as well). I think others may have seen similar problems; it may be the root cause behind several other complaints that time-slider-cleanup deleted snapshots to free up space, when the pool still had plenty free. A quick followup test shows that usedbyrefreservation behaves as expected, for a new test zvol. http://mail.opensolaris.org/pipermail/zfs-auto-snapshot/2010-January/000233.html So apparently it may be a problem picked up along the upgrade path through many zpool version upgrades. The pool, and the zvol, would first have been created on b111 or shortly after. It has been used with both xvm kernels, and native kernels running virtualbox, in that time. Who can help me figure out what's going on with the older zvol? Any useful zdb info I can dump out? I could fix it by copying and replacing the zvol, getting compression and dedup in the process, but before I do I don't want to destroy what may be useful debug info. I'll check later whether the send|recv snapshots of this zvol on my backup server show similar problems, but I doubt they will. -- Dan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZPOOL somehow got same physical drive assigned twice
Guys, Need your help. My DEV131 OSOL build with my 21TB disk system somehow got really screwed. This is what my zpool status looks like:

NAME             STATE     READ WRITE CKSUM
rzpool2          DEGRADED     0     0     0
  raidz2-0       DEGRADED     0     0     0
    replacing-0  DEGRADED     0     0     0
      c6t1d0     OFFLINE      0     0     0
      c6t16d0    ONLINE       0     0     0  256M resilvered
    c6t2d0s2     ONLINE       0     0     0
    c6t3d0p0     ONLINE       0     0     0
    c6t4d0p0     ONLINE       0     0     0
    c6t5d0p0     ONLINE       0     0     0
    c6t6d0p0     ONLINE       0     0     0
    c6t7d0p0     ONLINE       0     0     0
    c6t8d0p0     ONLINE       0     0     0
    c6t9d0       ONLINE       0     0     0
  raidz2-1       DEGRADED     0     0     0
    c6t0d0       ONLINE       0     0     0
    c6t1d0       UNAVAIL      0     0     0  cannot open
    c6t10d0      ONLINE       0     0     0
    c6t11d0      ONLINE       0     0     0
    c6t12d0      ONLINE       0     0     0
    c6t13d0      ONLINE       0     0     0
    c6t14d0      ONLINE       0     0     0
    c6t15d0      ONLINE       0     0     0

Check drive c6t1d0 - it appears in both raidz2-0 and raidz2-1! How do I *remove* the drive from raidz2-1 (with edit/hexedit or anything else)? It is clearly a bug in ZFS that allowed me to assign the drive twice. Again: running DEV131 OSOL. Please HELP me. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
We use the following for our storage servers:

Chenbro 5U chassis (24 hot-swap drive bays)
1350 watt 4-way redundant PSU
Tyan h200M motherboard (S3992)
2x dual-core AMD Opteron 2200-series CPUs
8 GB ECC DDR2-SDRAM
4-port Intel PRO/1000MT NIC (PCIe)
3Ware 9550SXU PCI-X RAID controller (12-port, multi-lane)
3Ware 9650SE PCIe RAID controller (12-port, multi-lane)
24x 500 GB harddrive (either Seagate ES2 or Western Digital RE2)

Comes out to under $10,000 CDN, and gives 10 TB of disk space (3x 8-drive raidz2). If you use multiple 8-port SATA/SAS controllers instead of the RAID controllers, and a 3-way PSU, it should come out to under $8,000 CDN. Fully supported by FreeBSD, so everything should work with OpenSolaris. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zvol being charged for double space
On 01/27/10 21:17, Daniel Carosone wrote: This is as expected. Not expected is that: usedbyrefreservation = refreservation I would expect this to be 0, since all the reserved space has been allocated. This would be the case if the volume had no snapshots. As a result, used is over twice the size of the volume (+ a few small snapshots as well). I'm seeing essentially the same thing with a recently-created zvol with snapshots that I export via iscsi for time machine backups on a mac.

% zfs list -r -o name,refer,used,usedbyrefreservation,refreservation,volsize z/tm/mcgarrett
NAME            REFER   USED  USEDREFRESERV  REFRESERV  VOLSIZE
z/tm/mcgarrett  26.7G  88.2G            60G        60G      60G

The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with incompressible data without any writes failing due to the pool being out of space. If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect USEDBYREFRESERVATION to shrink as the reserved blocks were actually used. If you want to allow for overcommit, you need to delete the refreservation. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Building big cheap storage system. What hardware to use?
On Wed, Jan 27, 2010 at 08:25:48PM -0800, borov wrote: SAS disks more expensive. Besides, there is no 2Tb SAS 7200 drives on market yet. Seagate released a 2 TB SAS drive last year. http://www.seagate.com/ww/v/index.jsp?locale=en-USvgnextoid=c7712f655373f110VgnVCM10f5ee0a0aRCRD -- Jason Fortezzo forte...@mechanicalism.net ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] backing this up
Yep Dan, Thank you very much for the idea, and helping me with my implementation issues. haha. I can see that raidz2 is not needed in this case. My question now lies as to full system recovery. Say all hell brakes loose and all is lost except tapes. If I use what you said and just add snapshots to a already standing zfs filesystem. I guess in this case I can do full backups to tapes as well as partial backups, what is the best way to accomplish this if data is all standing on a file. Note I will be using bacula (hopefully) unless a better is recommended. And finally, should I tar this file prior to sending it to tape or is this not needed in this case? Just a note, all of this data will fit on the tapes currently but what if it doesn't in the future? Thanks and sorry for all of the questions... Greg On Wed, Jan 27, 2010 at 1:08 PM, Daniel Carosone d...@geek.com.au wrote: On Wed, Jan 27, 2010 at 12:01:36PM -0800, Gregory Durham wrote: Hello All, I read through the attached threads and found a solution by a poster and decided to try it. That may have been mine - good to know it helped, or at least started to. The solution was to use 3 files (in my case I made them sparse) yep - writes to allocate space for them up front are pointless with CoW. I then created a raidz2 pool across these 3 files Really? If you want one tape's worth of space, written to 3 tapes, you might as well just write the same file to three tapes, I think. (I'm assuming here the files are the size you expect to write to a single tape - otherwise I'm even more confused about this bit). Perhaps it's easier to let zfs cope with repairing small media errors here and there, but the main idea of using a redundant pool of files was to cope with loss or damage to whole tapes, for a backup that already needed to span multiple tapes. If you want this three-way copy of a single tape, plus easy recovery from bad spots by reading back multiple tapes, then use a 3-way mirror. 
But consider the error-recovery mode of whatever you're using to write to tape - some skip to the next file on a read error. I expect similar ratios of data to parity files/tapes as would be used in typical disk setups, at least for wide stripes. Say raidz2 in sets of 10, 8+2, or so. (As an aside, I like this for disks, too - since striping 128k blocks to a power-of-two wide data stripe has to be more efficient) and started a zfs send | recv. The performance is horrible There can be several reasons for this, and we'd need to know more about your setup. The first critical thing is going to be the setup of the staging filesystem tha holds your pool files. If this is itself a raidz, perhaps you're iops limited - you're expecting 3 disk-files worth of concurrency from a pool that may not have it, though it should be a write-mostly workload so less sensitive. You'll be seeking a lot either way, though. If this is purely staging to tape, consider making the staging pool out of non-redundant single-disk vdevs. Alternately, if the staging pool is safe, there's another trick you might consider: create the pool, then offline 2 files while you recv, leaving the pool-of-files degraded. Then when you're done, you can let the pool resilver and fill in the redundancy. This might change the IO pattern enough to take less time overall, or at least allow you some flexibility with windows to schedule backup and tapes. Next is dedup - make sure you have the memory and l2arc capacity to dedup the incoming write stream. Dedup within the pool of files if you want and can (because this will dedup your tapes), but don't dedup under it as well. I've found this to produce completely pathological disk thrashing, in a related configuration (pool on lofi crypto file). Stacking dedup like this doubles the performance cliff under memory pressure we've been talking about recently. (If you really do want 3-way-mirror files, then by all means dedup them in the staging pool.) 
Related to this is ARC usage - I haven't investigated this carefully myself, but you may well be double-caching: the backup pool's data, as well as the staging pool's view of the files. Again, since it's a write-mostly workload, zfs should hopefully figure out that few blocks are being re-read, but you might experiment with primarycache=metadata for the staging pool holding the files. Perhaps zpool-on-files is smart enough to use direct IO, bypassing the cache anyway; I'm not sure.

How's your cpu usage? Check that you're not trying to double-compress the files (again, compress within the backup pool but not outside), and consider using a lightweight checksum rather than sha256 outside.

Then there's streaming and concurrency - try piping through buffer and using bigger socket and TCP buffers. TCP stalls and slow-start will amplify latency many-fold. A good zil device on the staging pool might also help; the backup pool will be doing sync writes to close
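Pulling the caching, compression, checksum and dedup advice above together, the property settings might look like this sketch (dataset names are hypothetical; the principle is to do the expensive work once, inside the backup pool where it lands on tape, and keep the staging layer cheap):

```shell
# Staging filesystem that holds the pool files - keep it lightweight:
zfs set primarycache=metadata staging/files   # avoid double-caching file data
zfs set compression=off       staging/files   # compress inside backuppool instead
zfs set checksum=fletcher4    staging/files   # lighter than sha256 on the outside
zfs set dedup=off             staging/files   # never stack dedup on both layers

# Backup pool built on those files - this is what reaches the tapes:
zfs set compression=on backuppool
zfs set dedup=on       backuppool   # only with enough RAM/L2ARC for the DDT
```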
Re: [zfs-discuss] zvol being charged for double space
On Wed, Jan 27, 2010 at 09:57:08PM -0800, Bill Sommerfeld wrote: Hi Bill! :-)

On 01/27/10 21:17, Daniel Carosone wrote: This is as expected. Not expected is that: usedbyrefreservation = refreservation. I would expect this to be 0, since all the reserved space has been allocated.

This would be the case if the volume had no snapshots.

Hmm.

The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with uncompressible data without any writes failing due to the pool being out of space.

Hmm, this is new (to me) and undescribed (in the manpage) behaviour, but it does explain what I observed. In other words, usedbyrefreservation includes blocks currently shared with snapshots, representing a reservation for potential future CoW of those blocks. Does this happen for filesystems, or only volumes? I hope it's both, just more commonly encountered because refreservation is more commonly used with volumes.

If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect USEDBYREFRESERVATION to shrink as the reserved blocks were actually used.

Right. If I repeat my quick test with snapshots, then when the first snapshot is taken I should see usedbyrefreservation jump back up to the full size of the volume; at that point the whole volume is shared with the snapshot. As data is overwritten, the space for the retained copy would be added to usedbysnapshots, and the space that's now unique to the dataset would come off usedbyrefreservation, with the used total staying constant - until another snapshot is taken. I'll do that for my own interest; it now makes perfect sense and is quite reasonable. The trouble is that the documentation doesn't point this out, so it's surprising and unexpected.
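The quick test being described can be sketched as follows (pool and volume names and the 10g size are hypothetical; zfs create -V sets a refreservation equal to the volume size by default):

```shell
# Create a zvol; the refreservation defaults to the full volume size:
zfs create -V 10g tank/vol
zfs get used,usedbydataset,usedbysnapshots,usedbyrefreservation tank/vol

# Take a snapshot: every block of the zvol is now shared with the
# snapshot and may need CoW space on overwrite, so
# usedbyrefreservation jumps back up toward the full volume size:
zfs snapshot tank/vol@snap1
zfs get usedbyrefreservation tank/vol

# As blocks are overwritten, the retained copies accrue to
# usedbysnapshots and the newly unique blocks come off
# usedbyrefreservation; 'used' stays roughly constant until the
# next snapshot.
```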
There's text in the description of the refreservation property saying that snapshots will only be allowed if there is enough free space. What needs to be made clear is that this is achieved through the behaviour of usedbyrefreservation - partly by additional text in the description of that property (that it includes space shared with snapshots), and partly by improving the wording about free space here. I'll see if I can knock together some better wording later.

If you want to allow for overcommit, you need to delete the refreservation.

Of course. I just wasn't thinking of taking a snapshot as having this cost, though of course it does. -- Dan.
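Deleting the refreservation to allow overcommit is a one-liner (names hypothetical). The trade-off is the one discussed above: with the reservation gone, snapshots are cheap to take, but writes to the zvol can fail with ENOSPC if the pool fills.

```shell
# Allow overcommit / thin provisioning by dropping the reservation:
zfs set refreservation=none tank/vol
```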