[zfs-discuss] strangeness after resilvering disk from raidz1 on disks with no EFI GPTs
I have a zpool of five 1.5TB disks in raidz1. They are on c?t?d?p0 devices - using the full disk, not any slice or partition - because the pool was created under zfs-fuse on Linux and no partition tables were ever created. (For the full saga of my move from that to OpenSolaris, anyone who missed out on the fun can read the thread http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg34813.html - but I will try to include all relevant information here so that's not necessary.)

When I had gotten things working in OpenSolaris and done a scrub, I got some errors on one disk, so I offlined it, overwrote the whole disk with random-looking data, and read the data back to check that the disk was behaving. It was, so I resilvered, and things have seemed fine since. I just noticed, though (some time later, with things working correctly in the meantime), that I now have an EFI partition table on the disk that I resilvered. None of the others have any partition table.

This confuses me greatly, for a few reasons. One, why did zfs create a partition table at all? I thought it only did that when you gave it a shorthand disk name in the form c?t?d? with no slice or partition number - I did the replace giving it the full path /dev/dsk/c9t4d0p0. Doesn't this mean that zfs must actually be using s0 of the drive, not p0? Yet c9t4d0p0 is what shows up in the zpool status, along with p0 devices for the other four drives.

Two, given that this one disk has an EFI partition table - including the 8MB reserved slice 8 - the actual device that zfs is using is more than 8MB smaller than the other four. How am I running a raidz1 with unequally sized devices?

Since this has been running without a problem for a few weeks now, I'm not actually concerned about it being a problem - just rather confused. Can anybody explain what's up with this?

Thanks,
-Ethan
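A few stock Solaris commands can confirm what each disk actually carries and which device path the pool recorded (a sketch only - the device name follows the c9t4d0 example above and should be adjusted per disk):

   # dump the fdisk partition table, if any, from the whole-disk device
   fdisk -W - /dev/rdsk/c9t4d0p0
   # print the SMI/EFI label (slices), if any
   prtvtoc /dev/rdsk/c9t4d0s0
   # show the ZFS vdev labels, including the path ZFS recorded for the vdev
   zdb -l /dev/dsk/c9t4d0p0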
Re: [zfs-discuss] why L2ARC device is used to store files ?
Hi All

I might be a little bit confused!!! I will try to ask my question in a simple way...

Why would a 16GB L2ARC device get filled by running a benchmark that uses a 2GB working set while having a 2GB ARC max? I know I am missing something here!

Thanks

-- Abdullah Al-Dahlawi PhD Candidate George Washington University Department of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide http://www.top500.org/list/2009/11/100
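The arcstats kstats make it easy to compare what the ARC and L2ARC actually hold against the working set (a sketch; zfs:0:arcstats is the standard kstat path for these counters):

   kstat -p zfs:0:arcstats:size            # current ARC size, in bytes
   kstat -p zfs:0:arcstats:l2_size         # bytes currently held on the L2ARC device
   kstat -p zfs:0:arcstats:l2_write_bytes  # cumulative bytes ever written to the L2ARC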
Re: [zfs-discuss] ZFS aclmode property
On Sat, 6 Mar 2010, Ralf Utermann wrote:

> we recently started to look at a ZFS based solution as a possible
> replacement for our DCE/DFS based campus filesystem (yes, this is still
> in production here).

Hey, a fellow DFS shop :)... We finally migrated the last production files off of DFS last month; I'm actually going to pull the plug on the infrastructure within a couple of weeks. It will be nice not to have to worry that software that's been unsupported for years will go blooey :(.

> The ACL model of the combination OpenSolaris+ZFS+in-kernel-CIFS+NFSv4
> looks like a really promising setup, something which could place it high
> up on our list ...

Indeed. While we're currently running S10 with samba (our development started before OpenSolaris support was announced; we're hoping to migrate sometime this year), Solaris/ZFS was the best option we could find to replace our DFS infrastructure. The main thing I miss is the location independence and the ability to migrate data between servers while it's in use. Other than this annoying chmod/ACL issue, our only other major problem is lack of scalability in NFS sharing: it takes a good 45 minutes to share/unshare the 8000 filesystems on each of our X4500's (we have 5), resulting in about a 2 hour reboot cycle :(. There's an open bug on it, but they say it will never be addressed in Solaris 10 - hopefully someday in OpenSolaris.

> So from this site: we very much support the idea of adding ignore and
> deny values for the aclmode property!

If you have a Sun support contract, open a support call and ask to be added to SR #72456444, which is the case I have open to try to get a better solution to the chmod/ACL interaction. If you're thinking of spending a lot of money on Sun hardware, bring this issue up to your sales guy and push for a solution. I think part of the problem is that very few sites actually use ACLs, particularly to the extent people coming from a DFS background are used to :(.

> However, reading PSARC/2010/029, it looks like we will get
> aclmode=discard for everybody and the property removed. I hope this is
> not the end of the story ...

As do I, but so far it's not looking too good. I discussed my proposal with Mark Shellenbaum (the author of that PSARC case), and he was pretty strongly against it. I thought I made some rather good points, but as I'm sure you saw from the threads you referenced, there are quite strong opinions on both sides. He seems to be Sun's main guy when it comes to ACLs; if he were on board it would be a lot more likely to happen, but I never heard back from him on my counter-response to his initial reply detailing the reasons he thought it was a bad idea, and he was conspicuously absent during the recent list free-for-all... As I've offered before, I'll implement it if they'll merge it...

-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | hen...@csupomona.edu California State Polytechnic University | Pomona CA 91768
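The chmod/ACL interaction under discussion is easy to demonstrate (a minimal sketch - the file name and the user in the ACE are made up, and the exact outcome depends on the aclmode setting):

   # add an NFSv4 ACE and display the resulting ACL
   chmod A+user:webadmin:read_data/write_data:allow report.txt
   ls -V report.txt
   # a plain chmod from any application then rewrites the ACL from the
   # mode bits, discarding the entry added above
   chmod 644 report.txt
   ls -V report.txt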
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Mar 6, 2010, at 8:05 PM, Eric D. Mudama wrote:
> On Sat, Mar 6 at 15:04, Richard Elling wrote:
>> On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:
>>> On Sat, Mar 6 at 3:15, Abdullah Al-Dahlawi wrote:
>>>>   hdd         ONLINE 0 0 0
>>>>     c7t0d0p3  ONLINE 0 0 0
>>>>
>>>>   rpool       ONLINE 0 0 0
>>>>     c7t0d0s0  ONLINE 0 0 0
>>>
>>> I trimmed your zpool status output a bit.
>>>
>>> Are those two the same device? I'm barely familiar with solaris
>>> partitioning and labels... what's the difference between a slice and a
>>> partition?
>>
>> In this context, "partition" is an fdisk partition and "slice" is a
>> SMI or EFI labeled slice. The SMI or EFI labeling tools (format,
>> prtvtoc, and fmthard) do not work on partitions. So when you
>> choose to use ZFS on a partition, you have no tools other than
>> fdisk to manage the space. This can lead to confusion... a bad
>> thing.
>
> So in that context, is the above 'zpool status' snippet a "bad thing
> to do"?

If the partition containing c7t0d0s0 was p3, then it could be exceedingly bad. Normally, if you try to create a zpool on a slice which already has a zpool, you will get an error message to that effect, which you can override with the "-f" flag. However, that checking is done on slices, not fdisk partitions. Hence, there is an opportunity for confusion... a bad thing.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Sat, Mar 6 at 15:04, Richard Elling wrote:
> On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:
>> On Sat, Mar 6 at 3:15, Abdullah Al-Dahlawi wrote:
>>>   hdd         ONLINE 0 0 0
>>>     c7t0d0p3  ONLINE 0 0 0
>>>
>>>   rpool       ONLINE 0 0 0
>>>     c7t0d0s0  ONLINE 0 0 0
>>
>> I trimmed your zpool status output a bit.
>>
>> Are those two the same device? I'm barely familiar with solaris
>> partitioning and labels... what's the difference between a slice and a
>> partition?
>
> In this context, "partition" is an fdisk partition and "slice" is a
> SMI or EFI labeled slice. The SMI or EFI labeling tools (format,
> prtvtoc, and fmthard) do not work on partitions. So when you
> choose to use ZFS on a partition, you have no tools other than
> fdisk to manage the space. This can lead to confusion... a bad
> thing.

So in that context, is the above 'zpool status' snippet a "bad thing to do"?

-- Eric D. Mudama edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] zpool on sparse files
> You are running into this bug:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6929751
> Currently, building a pool from files is not fully supported.

I think Cindy and I interpreted the question differently. If you want the zpool inside a file to stay mounted while the system is running, and come up again after reboot, then I think she's right: you're running into that bug.

If you want to dismount your zpool for the sake of backing it up to tape or something like that, and you're seeing this error on reboot, then I think you need to export your pool before you do your backups or reboot. When you want to mount it again, you just import it.
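For the second case, the round trip looks roughly like this (a sketch; the backing-file path and pool name are illustrative):

   mkfile 1g /export/poolfile           # create the backing file
   zpool create filepool /export/poolfile
   ...
   zpool export filepool                # quiesce it before backup or reboot
   zpool import -d /export filepool     # -d points import at the directory holding the file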
Re: [zfs-discuss] [osol-discuss] WriteBack versus SSD-ZIL
> From everything I've seen, an SSD wins simply because it's 20-100x the
> size. HBAs almost never have more than 512MB of cache, and even fancy
> SAN boxes generally have 1-2GB max. So, HBAs are subject to being
> overwhelmed with heavy I/O. The SSD ZIL has a much better chance of
> being able to weather a heavy I/O period without being filled. Thus,
> SSDs are better at "average" performance - they provide a relatively
> steady performance profile, whereas HBA cache is very spiky.

This is a really good point. So you think I may actually get better performance by disabling the WriteBack on all the spindle disks, and enabling it on the SSD instead. This is precisely the opposite of what I was thinking.

I'm planning to publish some more results soon, but haven't gathered it all yet. But see these:

Just naked disks, no acceleration:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteThrough.txt

Same configuration as above, but WriteBack enabled:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack.txt

Same configuration as the naked disks, but a ramdrive was created for ZIL:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-ramZIL.txt

Using the ramdrive for ZIL, and also WriteBack enabled on PERC:
http://nedharvey.com/iozone/iozone-DellPE2970-32G-3-mirrors-striped-WriteBack_and_ramZIL.txt

This result shows that WriteBack enabled makes a huge performance difference (3-4x higher) for writes, compared to the naked disks. I don't think it's because an entire write operation fits into the HBA DRAM, or because the HBA remains unsaturated: the PERC has 256M, but the test includes 8 threads all simultaneously writing separate 4G files in various sized chunks and patterns. I think that when the PERC RAM is full of stuff queued for write to disk, it's simply able to order, organize, and optimize the write operations to leverage the disks as much as possible.
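A multi-threaded run along those lines would look something like this (a sketch only - the record size and test selection in the published results may differ):

   # throughput mode: 8 threads, each writing then reading its own 4G file
   iozone -t 8 -s 4g -r 64k -i 0 -i 1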
Re: [zfs-discuss] Thoughts pls. : Create 3 way rpool mirror and shelve one mirror as a backup
On Mar 6, 2010, at 5:38 PM, tomwaters wrote:
> Hi guys,
> On my home server (2009.6) I have 2 HDD's in a mirrored rpool.
>
> I just added a 3rd to the mirror and made all disks bootable (ie. installgrub
> on the mirror disks).
>
> My thought is this: I remove the 3rd mirror disk and offsite it as a backup.
>
> That way if I mess up the rpool, I can get back the offsite HDD, boot from it
> and re-mirror this to the other 2 HDD's and I am back in business.
>
> I plan to leave the 3rd mirror device in the rpool (just no HDD loaded, so it
> will show as degraded all the time). On a monthly basis, I'll physically
> insert the 3rd HDD and get it to resilver, then remove the 3rd HDD offsite
> again - ie. refresh the backup.
>
> Anyone see any flaws in this plan?

To do this, either:
1. upgrade to a later version where the "zpool split" command is available
2. zfs send/receive to the disk to be stored offsite

IMHO, splitting mirrors for backups is a waste of time, but it is a popular way of backup for non-ZFS file systems.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
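Both options look roughly like this (a sketch; the device, pool, and snapshot names are illustrative, and zpool split requires a build newer than 2009.06):

   # 1. detach the third side of the mirror as its own importable pool
   zpool split rpool rpoolbackup c5t2d0
   # 2. or replicate all datasets to a pool living on the third disk
   zfs snapshot -r rpool@offsite
   zfs send -R rpool@offsite | zfs receive -Fd backuppool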
Re: [zfs-discuss] Does OpenSolaris mpt driver support LSI 2008 controller
I'm about to try it! My LSI SAS 9211-8i should arrive Monday or Tuesday. I bought the cable-less version, opting instead to save a few $ and buy Adaptec 2247000-R SAS to SATA cables. My rig will be based off of fairly new kit, so it should be interesting to see how 2009.06 deals with it all :)
[zfs-discuss] Thoughts pls. : Create 3 way rpool mirror and shelve one mirror as a backup
Hi guys,

On my home server (2009.6) I have 2 HDD's in a mirrored rpool.

I just added a 3rd to the mirror and made all disks bootable (ie. installgrub on the mirror disks).

My thought is this: I remove the 3rd mirror disk and offsite it as a backup.

That way if I mess up the rpool, I can get back the offsite HDD, boot from it and re-mirror this to the other 2 HDD's and I am back in business.

I plan to leave the 3rd mirror device in the rpool (just no HDD loaded, so it will show as degraded all the time). On a monthly basis, I'll physically insert the 3rd HDD and get it to resilver, then remove the 3rd HDD offsite again - ie. refresh the backup.

Anyone see any flaws in this plan?
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Mar 6, 2010, at 2:42 PM, Eric D. Mudama wrote:
> On Sat, Mar 6 at 3:15, Abdullah Al-Dahlawi wrote:
>>
>>   hdd         ONLINE 0 0 0
>>     c7t0d0p3  ONLINE 0 0 0
>>
>>   rpool       ONLINE 0 0 0
>>     c7t0d0s0  ONLINE 0 0 0
>
> I trimmed your zpool status output a bit.
>
> Are those two the same device? I'm barely familiar with solaris
> partitioning and labels... what's the difference between a slice and a
> partition?

In this context, "partition" is an fdisk partition and "slice" is a SMI or EFI labeled slice. The SMI or EFI labeling tools (format, prtvtoc, and fmthard) do not work on partitions. So when you choose to use ZFS on a partition, you have no tools other than fdisk to manage the space. This can lead to confusion... a bad thing.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Re: [zfs-discuss] Monitoring my disk activity
On Mar 6, 2010, at 1:02 PM, Edward Ned Harvey wrote:
> Recently, I'm benchmarking all kinds of stuff on my systems. And one
> question I can't intelligently answer is what blocksize I should use in these
> tests.
>
> I assume there is something which monitors present disk activity, that I
> could run on my production servers, to give me some statistics of the block
> sizes that the users are actually performing on the production server. And
> then I could use that information to make an informed decision about block
> size to use while benchmarking.
>
> Is there a man page I should read, to figure out how to monitor and get
> statistics on my real life users' disk activity?

It all depends on how they are connecting to the storage: iSCSI, CIFS, NFS, database, rsync, ...? The reason I say this is because ZFS will coalesce writes, so just looking at iostat data (ops versus size) will not be appropriate. You need to look at the data flowing between ZFS and the users. fsstat works for file systems, but won't work for zvols, as an example.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
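For the file-level view, fsstat reports per-interval operation counts and transfer sizes (a sketch; the mount point is illustrative):

   fsstat zfs 1          # aggregate across all mounted ZFS file systems, every second
   fsstat /tank/home 1   # or a single mounted file system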
Re: [zfs-discuss] WriteBack versus SSD-ZIL
On Mar 6, 2010, at 1:38 AM, Zhu Han wrote:
> On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble wrote:
>> This is true. SSDs and HDs differ little in their ability to handle raw
>> throughput. However, we often still see problems in ZFS associated with
>> periodic system "pauses" where ZFS effectively monopolizes the HDs to write
>> out its current buffered I/O. People have been complaining about this for
>> quite awhile. SSDs have a huge advantage where IOPS are concerned, and given
>> that the backing store HDs have to service both read and write requests,
>> they're severely limited on the number of IOPS they can give to incoming data.
>>
>> You have a good point, but I'd still be curious to see what an async cache
>> would do. After all, that is effectively what the HBA cache is, and we see a
>> significant improvement with it, and not just for sync write.
>
> I might see what you mean here. Because ZFS has to aggregate some write data
> during a short period (txn alive time) to avoid generating too many random
> write HDD requests, the bandwidth of the HDD during this time is wasted. For
> write-heavy streaming workloads, especially those which can saturate the HDD
> pool bandwidth easily, ZFS will make the performance worse than legacy
> file systems, i.e. UFS or EXT3. The IOPS of the HDD is not the limitation
> here. The bandwidth of the HDD is the root cause.

This statement is too simple, and thus does not represent reality very well. For a fully streaming workload where the load is near the capacity of the storage, the algorithms in ZFS will work to optimize the match. There is still some work to be done, but I don't believe UFS has beaten ZFS on Solaris in a significant streaming benchmark for several years now.

What we do see is that high performance SSDs can saturate the SAS/SATA link for extended periods of time. For example, a Western Digital SiliconEdge Blue (a new, midrange model) can read at 250 MB/sec, in contrast to a WD RE4 which has a media transfer rate of 138 MB/sec. High-speed SSDs are already putting the hurt on 6Gbps SAS/SATA - the Micron models claim 370 MB/sec sustained. Since this can be easily parallelized, expect that the high-end SSDs will saturate whatever you can connect them to. This is one reason why the F5100 has 64 SAS channels for host connections.

> This is the design choice of ZFS. Reducing the length of the period between
> txn commits can alleviate the problem, so that the amount of data needing to
> be flushed to the disk each time could be smaller. Replacing the HDD with
> some high-end FC disks may solve this problem.

Properly matching I/O source and sink is still important; no file system can relieve you of that duty :-)

> I also don't know what the threshold is in ZFS for it to consider it time to
> do an async buffer flush. Is it time based? % of RAM based? Absolute amount?
> All of that would impact whether an SSD async cache would be useful.

The answer is "yes" to all of these questions, but there are many variables to consider, so YMMV.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Sat, Mar 6 at 3:15, Abdullah Al-Dahlawi wrote:
>
>   hdd         ONLINE 0 0 0
>     c7t0d0p3  ONLINE 0 0 0
>
>   rpool       ONLINE 0 0 0
>     c7t0d0s0  ONLINE 0 0 0

I trimmed your zpool status output a bit.

Are those two the same device? I'm barely familiar with solaris partitioning and labels... what's the difference between a slice and a partition?

-- Eric D. Mudama edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] Fishworks 2010Q1 and dedup bug?
On Mar 5, 2010, at 5:10 PM, James Dickens wrote:
> On Fri, Mar 5, 2010 at 4:48 PM, Tonmaus wrote:
>> Hi,
>>
>> so, what would be a critical test size in your opinion? Are there any other
>> side conditions?
>
> when your dedup hash table (a table that holds a checksum of every block
> seen on filesystems/zvols after dedup was enabled) exceeds memory, your
> performance degrades exponentially

Probably before that. More important is the small, random I/O performance of your pool. For fast devices, like 15krpm disks, SSDs, or array controllers with nonvolatile caches, performance should be good. For big, slow JBOD drives, the small, random I/O performance is poor and you pay for that cost savings with time spent waiting.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
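A rough back-of-the-envelope illustrates why (an illustration only - per-entry DDT sizes vary by build, and figures from roughly 250 to 400 bytes per entry have been quoted):

   1 TB of unique data / 128 KB average block size  =  ~8 million DDT entries
   8 million entries x ~300 bytes per entry         =  ~2.4 GB of dedup table

Once that table no longer fits in the ARC (or L2ARC), each duplicate write turns into small, random reads against the pool, which is exactly where big, slow JBOD drives hurt.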
[zfs-discuss] Monitoring my disk activity
Recently, I'm benchmarking all kinds of stuff on my systems. And one question I can't intelligently answer is what blocksize I should use in these tests.

I assume there is something which monitors present disk activity, that I could run on my production servers, to give me some statistics of the block sizes that the users are actually performing on the production server. And then I could use that information to make an informed decision about block size to use while benchmarking.

Is there a man page I should read, to figure out how to monitor and get statistics on my real life users' disk activity?

Thanks.
Re: [zfs-discuss] ZFS aclmode property
On 6-3-2010 18:41, Ralf Utermann wrote:
> So from this site: we very much support the idea of adding ignore and
> deny values for the aclmode property! However, reading PSARC/2010/029,
> it looks like we will get aclmode=discard for everybody and the property
> removed. I hope this is not the end of the story ...

+1

Carefully constructed ACLs should -never- be destroyed by an (unwanted/unexpected) chmod. Extra aclmode property values should not be so hard to implement.

-- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | OpenSolaris 2010.03 b131
+ All that's really worth doing is what we do for others (Lewis Carroll)
Re: [zfs-discuss] why L2ARC device is used to store files ?
Hi

Okay, it's not what I feared. It is probably caching every bit of data and metadata you have written so far - why shouldn't it? You have the space in the L2 cache, and it can't offer to return data if it's not in the cache. After the cache is full or nearly full, it will choose more carefully what to keep and what to throw away.

James Dickens
http://uadmin.blogspot.com

On Sat, Mar 6, 2010 at 2:15 AM, Abdullah Al-Dahlawi wrote:
> hi James
>
> here is the output you've requested
[zfs-discuss] ZFS aclmode property
we recently started to look at a ZFS based solution as a possible replacement for our DCE/DFS based campus filesystem (yes, this is still in production here). The ACL model of the combination OpenSolaris+ZFS+in-kernel-CIFS+NFSv4 looks like a really promising setup, something which could place it high up on our list ...

So we had our test system installed (build 133) and were happily manipulating ACLs from Windows and also from our standard Debian client using the Linux nfsv4 utilities ... transparently! We were impressed ... until an application issued a chmod and destroyed the ACL.

We then of course found Paul Henson's proposal for aclmode ignore and deny values [http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html] and the ZFS ACL thread he started in http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037863.html .

So from this site: we very much support the idea of adding ignore and deny values for the aclmode property!

However, reading PSARC/2010/029, it looks like we will get aclmode=discard for everybody and the property removed. I hope this is not the end of the story ...

- Ralf

-- Ralf Utermann _ Universität Augsburg, Institut für Physik -- EDV-Betreuer Universitätsstr.1 D-86135 Augsburg Phone: +49-821-598-3231 SMTP: ralf.uterm...@physik.uni-augsburg.de Fax: -3411
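For reference, the property in question is set per dataset (a sketch; the dataset name is illustrative, and the accepted values today are discard, groupmask, and passthrough - the ignore and deny values exist only as a proposal):

   zfs get aclmode tank/export
   zfs set aclmode=passthrough tank/export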
Re: [zfs-discuss] why L2ARC device is used to store files ?
Hello,

On Mar 6, 2010, at 6:02 PM, Andrey Kuzmin wrote:
> This is purely tactical, to avoid l2arc write penalty on eviction. You seem
> to have missed the very next paragraph:
>
> 3644 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
> 3645 * It does this by periodically scanning buffers from the eviction-end of
> 3646 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
> 3647 * not already there.

My point was just that nothing is evicted from the ARC to the L2ARC. Of course, things that have been evicted can be available in the L2ARC, but they are not pushed there when evicted. I commented on the question "Is not the L2ARC used to absorb the evicted data from the ARC?" Then no, the L2ARC absorbs non-evicted data from the ARC, which possibly gets evicted later. But it's just semantics.

Regards

Henrik
http://sparcv9.blogspot.com
Re: [zfs-discuss] why L2ARC device is used to store files ?
This is purely tactical, to avoid the L2ARC write penalty on eviction. You seem to have missed the very next paragraph:

3644 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
3645 * It does this by periodically scanning buffers from the eviction-end of
3646 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
3647 * not already there.

Regards,
Andrey

On Sat, Mar 6, 2010 at 3:58 PM, Henrik Johansson wrote:
> Hello,
>
> On Mar 5, 2010, at 10:46 AM, Abdullah Al-Dahlawi wrote:
>> Is not the L2ARC used to absorb the evicted data from the ARC?
>
> No, it is not. If we look in the source there is a very good description of
> the L2ARC behavior:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c
>
> "1. There is no eviction path from the ARC to the L2ARC. Evictions
> from the ARC behave as usual, freeing buffers and placing headers on
> ghost lists. The ARC does not send buffers to the L2ARC during eviction
> as this would add inflated write latencies for all ARC memory pressure."
>
> Regards
>
> Henrik
> http://sparcv9.blogspot.com
Re: [zfs-discuss] why L2ARC device is used to store files ?
Hello,

On Mar 5, 2010, at 10:46 AM, Abdullah Al-Dahlawi wrote:
> Greeting All
>
> I have created a pool that consists of a hard disk and an SSD as a cache:
>
> zpool create hdd c11t0d0p3
> zpool add hdd cache c8t0d0p0 - cache device
>
> I ran an OLTP benchmark to emulate a DBMS.
>
> Once I ran the benchmark, the pool started creating the database file on the
> SSD cache device ???
>
> Can anyone explain why this is happening?
>
> Is not the L2ARC used to absorb the evicted data from the ARC?

No, it is not. If we look in the source there is a very good description of the L2ARC behavior:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

"1. There is no eviction path from the ARC to the L2ARC. Evictions
from the ARC behave as usual, freeing buffers and placing headers on
ghost lists. The ARC does not send buffers to the L2ARC during eviction
as this would add inflated write latencies for all ARC memory pressure."

Regards

Henrik
http://sparcv9.blogspot.com
Re: [zfs-discuss] why L2ARC device is used to store files ?
On Sat, Mar 6, 2010 at 3:15 PM, Abdullah Al-Dahlawi wrote:
> abdul...@hp_hdx_16:~/Downloads# zpool iostat -v hdd
>               capacity     operations    bandwidth
> pool        used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> hdd         1.96G  17.7G     10     64  1.27M  7.76M
>   c7t0d0p3  1.96G  17.7G     10     64  1.27M  7.76M

you only have 17.7GB free space there, not the 50GB you said earlier.

-- Fajar
Re: [zfs-discuss] Hardware for high-end ZFS NAS file server - 2010 March edition
2010/3/4 Michael Shadle :
> Typically rackmounts are not designed for quiet. He said quietness is
> #2 in his priorities...

I have a Supermicro 743 case, also 4U. The one I used is the "Super Quiet" variant, which uses fewer & slower PWM fans. It's got 8 hot swap bays and an additional 3x 5.25" bays which you can put an additional hot swap bay in. It's quiet enough to have in my home office without being a distraction.

-B

-- Brandon High : bh...@freaks.com
Re: [zfs-discuss] WriteBack versus SSD-ZIL
On Sat, Mar 6, 2010 at 12:50 PM, Erik Trimble wrote:
> This is true. SSDs and HDs differ little in their ability to handle raw
> throughput. However, we often still see problems in ZFS associated with
> periodic system "pauses" where ZFS effectively monopolizes the HDs to write
> out its current buffered I/O. People have been complaining about this for
> quite awhile. SSDs have a huge advantage where IOPS are concerned, and
> given that the backing store HDs have to service both read and write
> requests, they're severely limited on the number of IOPS they can give to
> incoming data.
>
> You have a good point, but I'd still be curious to see what an async cache
> would do. After all, that is effectively what the HBA cache is, and we see
> a significant improvement with it, and not just for sync write.

I might see what you mean here. Because ZFS has to aggregate some write data during a short period (txn alive time) to avoid generating too many random write HDD requests, the bandwidth of the HDD during this time is wasted. For write-heavy streaming workloads, especially those which can saturate the HDD pool bandwidth easily, ZFS will make the performance worse than legacy file systems, i.e. UFS or EXT3. The IOPS of the HDD is not the limitation here. The bandwidth of the HDD is the root cause.

This is the design choice of ZFS. Reducing the length of the period between txn commits can alleviate the problem, so that the amount of data needing to be flushed to the disk each time could be smaller. Replacing the HDD with some high-end FC disks may solve this problem.

> I also don't know what the threshold is in ZFS for it to consider it time
> to do an async buffer flush. Is it time based? % of RAM based? Absolute
> amount? All of that would impact whether an SSD async cache would be useful.

IMHO, ZFS flushes the data back to disk asynchronously every 5 seconds, which is the default configuration of the txn commit period. ZFS will also flush the data back to disk before the 5-second period ends, based on an estimate of the amount of memory used by the current txn. This is called the write throttle. See this link:

http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
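The commit cadence is visible from the outside (a sketch; the pool name is illustrative, and tunable names vary across builds, so treat the /etc/system line as an assumption to verify against your release):

   # watch write bursts arrive at the pool as each txn group commits
   zpool iostat hdd 1
   # where the tunable exists, the commit interval can be shortened in /etc/system:
   set zfs:zfs_txg_timeout = 1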
Re: [zfs-discuss] why L2ARC device is used to store files ?
hi James

here is the output you've requested

abdul...@hp_hdx_16:~/Downloads# zpool status -v
  pool: hdd
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE   READ WRITE CKSUM
        hdd         ONLINE     0     0     0
          c7t0d0p3  ONLINE     0     0     0
        cache
          c8t0d0p0  ONLINE     0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE   READ WRITE CKSUM
        rpool       ONLINE     0     0     0
          c7t0d0s0  ONLINE     0     0     0

---

abdul...@hp_hdx_16:~/Downloads# zpool iostat -v hdd
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
hdd         1.96G  17.7G     10     64  1.27M  7.76M
  c7t0d0p3  1.96G  17.7G     10     64  1.27M  7.76M
cache           -      -      -      -      -      -
  c8t0d0p0  2.87G  12.0G      0     17    103  2.19M
----------  -----  -----  -----  -----  -----  -----

abdul...@hp_hdx_16:~/Downloads# kstat -m zfs
module: zfs    instance: 0
name: arcstats    class: misc
        c                         2147483648
        c_max                     2147483648
        c_min                     268435456
        crtime                    34.558539423
        data_size                 2078015488
        deleted                   9816
        demand_data_hits          382992
        demand_data_misses        20579
        demand_metadata_hits      74629
        demand_metadata_misses    6434
        evict_skip                21073
        hash_chain_max            5
        hash_chains               7032
        hash_collisions           31409
        hash_elements             36568
        hash_elements_max         36568
        hdr_size                  7827792
        hits                      481410
        l2_abort_lowmem           0
        l2_cksum_bad              0
        l2_evict_lock_retry       0
        l2_evict_reading          0
        l2_feeds                  1157
        l2_free_on_write          475
        l2_hdr_size               0
        l2_hits                   0
        l2_io_error               0
        l2_misses                 14997
        l2_read_bytes             0
        l2_rw_clash               0
        l2_size                   588342784
        l2_write_bytes            3085701632
        l2_writes_done            194
        l2_writes_error           0
        l2_writes_hdr_miss        0
        l2_writes_sent            194
        memory_throttle_count     0
        mfu_ghost_hits            9410
        mfu_hits                  343112
        misses                    33011
        mru_ghost_hits            4609
        mru_hits                  116739
        mutex_miss                90
        other_size                51590832
        p                         1320449024
        prefetch_data_hits        4775
        prefetch_data_misses      1694
        prefetch_metadata_hits    19014
        prefetch_metadata_misses  4304
        recycle_miss              484
        size                      2137434112
        snaptime                  1945.241664714

module: zfs    instance: 0
name: vdev_cache_stats    class: misc
        crtime         34.558587713
        delegations    3415
        hits           5578
        misses         3647
        snaptime       1945.243484925

On Fri, Mar 5, 2010 at 9:02 PM, James Dickens wrote:
> please post the output of zpool status -v.
>
> Thanks
>
> James Dickens

-- Abdullah Al-Dahlawi PhD Candidate George Washington University Department of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide http://www.top500.org/list/2009/11/100