Re: [zfs-discuss] Best practice for setting ACL
So Cindy, Simon (or anyone else)... now that we are over a year past when Simon wrote his excellent blog introduction, is there an updated best practice for ACLs with CIFS? Or is this blog entry still the best word on the street? In my case, I am supporting multiple PCs (Workgroup) and Macs, running OpenSolaris B134. Thanks, Craig -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to migrate to 4KB sector drives?
On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote:

No replies. Does this mean that you should avoid large drives with 4KB sectors, that is, new drives? ZFS does not handle new drives?

Solaris 10u9 handles 4K sectors, so it might be in a post-b134 release of osol. Build 118 adds support for 4K sectors with the following putback:

PSARC 2008/769 Multiple disk sector size support.
6710930 Solaris needs to support large sector size hard drive disk

But already in build 38 there is some support for large-sector disks in ZFS:

6407365 large-sector disk support in ZFS

When new features are added, they are typically created for the next release and then backported to the current release.

Casper
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com]

This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, and multi-threaded environments.

I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading or multiprocessing, it's absolutely possible to have contiguous files and/or file fragmentation. That's not a characteristic which depends on the threading model. Also, regardless of RAID, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.
Re: [zfs-discuss] file recovery on lost RAIDZ array
That sounds strange. What happened? You used raidz1? You can roll your zpool back to an earlier snapshot. Have you tried that? Or you can recover your pool as it was within the last 30 seconds or so, I think.
Re: [zfs-discuss] Hang on zpool import (dedup related)
On Sun, Sep 12, 2010 at 11:24:06AM -0700, Chris Murray wrote:

Absolutely spot on, George. The import with -N took seconds. Working on the assumption that esx_prod is the one with the problem, I bumped that to the bottom of the list. Each mount was done in a second:

# zfs mount zp
# zfs mount zp/nfs
# zfs mount zp/nfs/esx_dev
# zfs mount zp/nfs/esx_hedgehog
# zfs mount zp/nfs/esx_meerkat
# zfs mount zp/nfs/esx_meerkat_dedup
# zfs mount zp/nfs/esx_page
# zfs mount zp/nfs/esx_skunk
# zfs mount zp/nfs/esx_temp
# zfs mount zp/nfs/esx_template

And those directories have the content in them that I'd expect. Good! So now I try to mount esx_prod, and the influx of reads has started in zpool iostat zp 1. This is the filesystem with the issue, but what can I do now?

You could try to snapshot it (but keep it unmounted), then zfs send it and zfs recv it to e.g. zp/foo. Use the -u option for zfs recv too, then try to mount what you received.

-- Pawel Jakub Dawidek http://www.wheelsystems.com p...@freebsd.org http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Re: [zfs-discuss] resilver = defrag?
I was thinking to delete all zfs snapshots before zfs send receive to another new zpool. Then everything would be defragmented, I thought. (I assume snapshots work this way: I snapshot once and do some changes, say delete file A and edit file B. When I delete the snapshot, file A is still deleted and file B is still edited. In other words, deletion of a snapshot does not revert the changes. Therefore I just delete all snapshots and make my filesystem up to date before zfs send receive.)
Re: [zfs-discuss] resilver = defrag?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Orvar Korvar

I was thinking to delete all zfs snapshots before zfs send receive to another new zpool. Then everything would be defragmented, I thought.

You don't need to delete snaps before zfs send if your goal is to defragment your filesystem. Just perform a single zfs send, and don't do any incrementals afterward. The receiving filesystem will lay out the data as it wishes.

(I assume snapshots works this way: I snapshot once and do some changes, say delete file A and edit file B. When I delete the snapshot, the file A is still deleted and file B is still edited. In other words, deletion of snapshot does not revert back the changes.

You are correct. A snapshot is a read-only image of the filesystem as it was at some time in the past. If you destroy the snapshot, you've only destroyed the snapshot. You haven't destroyed the most recent live version of the filesystem. If you wanted to, you could roll back, which destroys the live version of the filesystem and restores you back to some snapshot. But that is a very different operation. Rollback is not at all similar to destroying a snapshot; the two operations are basically opposites of each other.

All of this is discussed in the man pages. I suggest man zpool and man zfs. Everything you need to know is written there.
Re: [zfs-discuss] resilver = defrag?
On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote:

From: Richard Elling [mailto:rich...@nexenta.com]

This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, and multi-threaded environments.

I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.

Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers.

Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.

RAID works against the efforts to gain performance by contiguous access, because the access becomes non-contiguous.

-- richard
Re: [zfs-discuss] resilver = defrag?
From: Richard Elling [mailto:rich...@nexenta.com]

Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model.

Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers.

Process A is writing. Suppose it starts writing at block 10,000 out of my 1,000,000-block device. Process B is also writing. Suppose it starts writing at block 50,000. These two processes write simultaneously, and no fragmentation occurs unless Process A writes more than 40,000 blocks. In that case, A's file gets fragmented, and the 2nd fragment might begin at block 300,000. The concept which causes fragmentation (not counting COW) is the size of the span of unallocated blocks. Most filesystems will allocate blocks from the largest unallocated contiguous area of the physical device, so as to minimize fragmentation. I can't say how ZFS behaves authoritatively, but I'd be extremely surprised if two processes writing different files as fast as possible resulted in all their blocks interleaved with each other on the physical disk. I think this is possible if you have multiple processes lazily writing at less than full speed, because then ZFS might remap a bunch of small writes into a single contiguous write.

Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.

RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous.

These might as well have been words randomly selected from the dictionary to me - I recognize that it's a complete sentence, but you might have said processors aren't needed in computers anymore, or something equally illogical. Suppose you have a 3-disk raid stripe set, using traditional simple striping, because it's very easy to explain. 
Suppose a process is writing as fast as it can, and suppose it's going to write block 0 through block 99 of a virtual device.

virtual block 0 = block 0 of disk 0
virtual block 1 = block 0 of disk 1
virtual block 2 = block 0 of disk 2
virtual block 3 = block 1 of disk 0
virtual block 4 = block 1 of disk 1
virtual block 5 = block 1 of disk 2
virtual block 6 = block 2 of disk 0
virtual block 7 = block 2 of disk 1
virtual block 8 = block 2 of disk 2
virtual block 9 = block 3 of disk 0
...
virtual block 96 = block 32 of disk 0
virtual block 97 = block 32 of disk 1
virtual block 98 = block 32 of disk 2
virtual block 99 = block 33 of disk 0

Thanks to buffering and command queueing, the OS tells the RAID controller to write blocks 0-8, and the RAID controller tells disk 0 to write blocks 0-2, tells disk 1 to write blocks 0-2, and tells disk 2 to write blocks 0-2, simultaneously. So the total throughput is the sum of all 3 disks writing continuously and contiguously to sequential blocks. This accelerates performance for continuous sequential writes. It does not work against efforts to gain performance by contiguous access.

The same concept is true for raid-5 or raidz, but it's more complicated. The filesystem or raid controller does in fact know how to write sequential filesystem blocks to sequential physical blocks on the physical devices, for the sake of performance enhancement on contiguous read/write. If you don't believe me, there's a very easy test to prove it: Create a zpool with 1 disk in it, and time writing 100G (or some amount of data larger than RAM). Create a zpool with several disks in a raidz set, and time writing 100G. The speed scales up linearly with the number of disks, until you reach some other hardware bottleneck, such as bus speed or something like that.
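The virtual-to-physical mapping in the table above follows from simple modular arithmetic. A quick shell sketch (a plain striping model for illustration only; real RAID controllers may number blocks differently) reproduces those rows:

```shell
# Simple striping model: on an N-disk stripe set, virtual block v
# lands on disk (v mod N) at physical block (v div N).
ndisks=3
for v in 0 1 2 3 4 5 96 97 98 99; do
  disk=$((v % ndisks))
  block=$((v / ndisks))
  echo "virtual block $v = block $block of disk $disk"
done
# last line prints: virtual block 99 = block 33 of disk 0
```

This matches the table: consecutive virtual blocks rotate across the disks, which is why a sequential write keeps all three spindles busy at once.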
[zfs-discuss] ZFS archive image
I have a flash archive that is stored in a ZFS snapshot stream. Is there a way to mount this image so I can read files from it?
Re: [zfs-discuss] ZFS archive image
On 09/13/10 09:40 AM, Buck Huffman wrote:

I have a flash archive that is stored in a ZFS snapshot stream. Is there a way to mount this image so I can read files from it?

No, but you can use the flar split command to split the flash archive into its constituent parts, one of which will be a zfs send stream that you can unpack with zfs recv.

Lori
Re: [zfs-discuss] Intermittent ZFS hang
Charles,

Just like UNIX, there are several ways to drill down on the problem. I would probably start with a live crash dump (savecore -L) when you see the problem. Another method would be to grab multiple stats commands during the problem to see where you can drill down later. I would probably use this method if the problem lasts for a while, and drill down with dtrace based on what I saw. But each method is going to depend on your skill when looking at the problem.

Dave

Dave,

After running clean since my last post, the problem occurred again today. This time I was able to gather some data while it was going on. The only thing that jumps out at me so far is the output of echo ::zio_state | mdb -k.

Under normal operations this usually looks like this:

ADDRESS          TYPE  STAGE            WAITER
ff090eb69328     NULL  OPEN             -
ff090eb69c88     NULL  OPEN             -

Here are a couple of samples while the issue was happening:

ADDRESS          TYPE  STAGE            WAITER
ff0bfe8c59b0     NULL  CHECKSUM_VERIFY  ff003e2f2c60
ff090eb69328     NULL  OPEN             -
ff090eb69c88     NULL  OPEN             -

ADDRESS          TYPE  STAGE            WAITER
ff09bb12a040     NULL  CHECKSUM_VERIFY  ff003d6acc60
ff0bfe8c59b0     NULL  CHECKSUM_VERIFY  ff003e2f2c60
ff090eb69328     NULL  OPEN             -
ff090eb69c88     NULL  OPEN             -

Operating under the assumption that the WAITER column is referencing kernel threads, I went looking for those addresses in the thread list. Here are the threadlist entries for ff003d6acc60 and ff003e2f2c60 from the example directly above, taken at about the same time as that output:

ff003d6acc60 ff0930d8c700 ff09172f9de0 2 0 ff09bb12a348
  PC: _resume_from_idle+0xf1    CMD: zpool-pool0
  stack pointer for thread ff003d6acc60: ff003d6ac360
  [ ff003d6ac360 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait+0x61()
    zio_wait+0x5d()
    dbuf_read+0x1e8()
    dmu_buf_hold+0x93()
    zap_get_leaf_byblk+0x56()
    zap_deref_leaf+0x78()
    fzap_length+0x42()
    zap_length_uint64+0x84()
    ddt_zap_lookup+0x4b()
    ddt_object_lookup+0x6d()
    ddt_lookup+0x115()
    zio_ddt_free+0x42()
    zio_execute+0x8d()
    taskq_thread+0x248()
    thread_start+8()

ff003e2f2c60 fbc2dbb00 0 60 ff0bfe8c5cb8
  PC: _resume_from_idle+0xf1    THREAD: txg_sync_thread()
  stack pointer for thread ff003e2f2c60: ff003e2f2a40
  [ ff003e2f2a40 _resume_from_idle+0xf1() ]
    swtch+0x145()
    cv_wait+0x61()
    zio_wait+0x5d()
    spa_sync+0x40c()
    txg_sync_thread+0x24a()
    thread_start+8()

Not sure if any of that sheds any light on the problem. I also have a live dump from the period when the problem was happening, a bunch of iostats, mpstats, and ::arc, ::spa, ::zio_state, and ::threadlist -v from mdb -k at several points during the issue.

If you have any advice on how to proceed from here in debugging this issue, I'd greatly appreciate it. So you know, I'm generally very comfortable with unix, but dtrace and the solaris kernel are unfamiliar territory.

In any event, thanks again for all the help thus far.

-Charles
Re: [zfs-discuss] resilver = defrag?
On Mon, September 13, 2010 07:14, Edward Ned Harvey wrote:

From: Richard Elling [mailto:rich...@nexenta.com]

This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, and multi-threaded environments.

I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks.

The attitude that it *matters* seems to me to have developed in, and be relevant only to, single-user computers. Regardless of whether a file is contiguous or not, by the time you read the next chunk of it, in the multi-user world some other user is going to have moved the access arm of that drive. Hence, it doesn't matter if the file is contiguous or not.

-- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
[zfs-discuss] Proper procedure when device names have changed
I am running zfs-fuse on an Ubuntu 10.04 box. I have a dual mirrored pool: mirror sdd sde, mirror sdf sdg. Recently the device names shifted on my box, and the devices are now sdc, sdd, sde and sdf. The pool is of course very unhappy: the mirrors are no longer matched up and one device is missing. What is the proper procedure to deal with this? -brian
Re: [zfs-discuss] Proper procedure when device names have changed
Try exporting and then importing the zpool.

On 9/13/2010 1:26 PM, Brian wrote:

I am running zfs-fuse on an Ubuntu 10.04 box. I have a dual mirrored pool: mirror sdd sde, mirror sdf sdg. Recently the device names shifted on my box, and the devices are now sdc, sdd, sde and sdf. The pool is of course very unhappy: the mirrors are no longer matched up and one device is missing. What is the proper procedure to deal with this? -brian
Re: [zfs-discuss] resilver = defrag?
To summarize: A) resilver does not defrag. B) zfs send/receive to a new zpool means it will be defragged. Correctly understood?
Re: [zfs-discuss] Proper procedure when device names have changed
That seems to have done the trick. I was worried because in the past I've had problems importing faulted file systems.
Re: [zfs-discuss] Intermittent ZFS hang
At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD log disks, but we've seen the problem with those removed, as well.

Did you have dedup enabled and then disabled it? If so, data can (or will) still be deduplicated on the drives. Currently the only way of un-deduplicating it is to recopy the data after disabling dedup.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/
[In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every pedagogue to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.]
Re: [zfs-discuss] resilver = defrag?
On Sep 13, 2010, at 10:54 AM, Orvar Korvar wrote:

To summarize: A) resilver does not defrag. B) zfs send/receive to a new zpool means it will be defragged.

Define fragmentation? If you follow the wikipedia definition of defragmentation, then the answer is no, zfs send/receive does not change the location of files. Why? Because zfs sends objects, not files. The objects can be allocated in a (more) contiguous form on the receiving side, or maybe not, depending on the configuration and use of the receiving side. A file may be wholly contained in an object, or not, depending on how it was created. For example, if a file is less than 128KB (by default) and is created at one time, then it will be wholly contained in one object. By contrast, UFS, which has an 8KB max block size, will use up to 16 different blocks to store the same file. These blocks may or may not be contiguous in UFS. http://en.wikipedia.org/wiki/Defragmentation

Correctly understood?

Clear as mud. I suggest deprecating the use of the term defragmentation.

-- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com
Richard Elling rich...@nexenta.com +1-760-896-4422
Enterprise class storage for everyone www.nexenta.com
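Richard's block-count comparison is plain arithmetic, easy to check in shell (sizes here are the defaults he cites: a 128KB ZFS recordsize vs an 8KB UFS max block size; the 128KB file size is just the illustrative boundary case):

```shell
# How many allocation units does a 128KB file need under each scheme?
filesize=$((128 * 1024))      # a 128KB file
zfs_record=$((128 * 1024))    # default ZFS recordsize
ufs_block=$((8 * 1024))       # UFS maximum block size
# ceiling division: (size + unit - 1) / unit
zfs_blocks=$(( (filesize + zfs_record - 1) / zfs_record ))
ufs_blocks=$(( (filesize + ufs_block - 1) / ufs_block ))
echo "ZFS records: $zfs_blocks, UFS blocks: $ufs_blocks"
# prints: ZFS records: 1, UFS blocks: 16
```

One contiguous ZFS record versus up to 16 UFS blocks that may land anywhere, which is the heart of his point.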
Re: [zfs-discuss] [mdb-discuss] mdb -k - I/O usage
This is snv_128 x86.

> ::arc
hits                     = 39811943
misses                   = 630634
demand_data_hits         = 29398113
demand_data_misses       = 490754
demand_metadata_hits     = 10413660
demand_metadata_misses   = 133461
prefetch_data_hits       = 0
prefetch_data_misses     = 0
prefetch_metadata_hits   = 170
prefetch_metadata_misses = 6419
mru_hits                 = 2933011
mru_ghost_hits           = 43202
mfu_hits                 = 36878818
mfu_ghost_hits           = 45361
deleted                  = 1299527
recycle_miss             = 46526
mutex_miss               = 355
evict_skip               = 25539
evict_l2_cached          = 0
evict_l2_eligible        = 77011188736
evict_l2_ineligible      = 76253184
hash_elements            = 278135
hash_elements_max        = 279843
hash_collisions          = 1653518
hash_chains              = 75135
hash_chain_max           = 9
p                        = 4787 MB
c                        = 5722 MB
c_min                    = 715 MB
c_max                    = 5722 MB
size                     = 5428 MB
hdr_size                 = 56535840
data_size                = 5158287360
other_size               = 477726560
l2_hits                  = 0
l2_misses                = 0
l2_feeds                 = 0
l2_rw_clash              = 0
l2_read_bytes            = 0
l2_write_bytes           = 0
l2_writes_sent           = 0
l2_writes_done           = 0
l2_writes_error          = 0
l2_writes_hdr_miss       = 0
l2_evict_lock_retry      = 0
l2_evict_reading         = 0
l2_free_on_write         = 0
l2_abort_lowmem          = 0
l2_cksum_bad             = 0
l2_io_error              = 0
l2_size                  = 0
l2_hdr_size              = 0
memory_throttle_count    = 0
arc_no_grow              = 0
arc_tempreserve          = 0 MB
arc_meta_used            = 1288 MB
arc_meta_limit           = 1430 MB
arc_meta_max             = 1288 MB

> ::memstat
Page Summary         Pages      MB    %Tot
Kernel              789865    3085     19%
ZFS File Data      1406055    5492     34%
Anon                396297    1548      9%
Exec and libs         7178      28      0%
Page cache            8428      32      0%
Free (cachelist)    117928     460      3%
Free (freelist)    1464224    5719     35%
Total              4189975   16367
Physical           4189974   16367

> ::spa -ev
ADDR             STATE    NAME
ff04f0eb4500     ACTIVE   data

  ADDR             STATE    AUX  DESCRIPTION
  ff04f2f52940     HEALTHY  -    root
                   READ  WRITE  FREE  CLAIM  IOCTL
    OPS            0     0      0     0      0
    BYTES          0     0      0     0      0
    EREAD          0
    EWRITE         0
    ECKSUM         0
  ff050a2fd980     HEALTHY  -    raidz
                   READ  WRITE  FREE  CLAIM  IOCTL
    OPS            0x57090 0x37436a000
    BYTES          0x8207f3c00 0x22345d0800000
    EREAD          0
    EWRITE         0
    ECKSUM         0
  ff050a2fa0c0     HEALTHY  -    /dev/dsk/c7t2d0s0
                   READ  WRITE  FREE  CLAIM  IOCTL
    OPS            0x4416e 0x10564000 0x74326
    BYTES          0x10909da00 0x45089d600000
    EREAD          0
    EWRITE         0
    ECKSUM         0
  ff050a2fa700     HEALTHY  -    /dev/dsk/c7t3d0s0
                   READ  WRITE  FREE  CLAIM  IOCTL
    OPS            0x43fca 0x1055fa00 0x74326
    BYTES          0x108e14400 0x45087a400000
    EREAD          0
    EWRITE         0
    ECKSUM         0
  ff050a2fad40     HEALTHY  -    /dev/dsk/c7t4d0s0
                   READ  WRITE  FREE  CLAIM  IOCTL
    OPS            0x44221 0x10553300 0x74326
    BYTES          0x108a56c00 0x4508c8a00000
    EREAD          0
    EWRITE         0
    ECKSUM         0
  ff050a2fb380     HEALTHY  -
Re: [zfs-discuss] Suggested RaidZ configuration...
Makes sense. My understanding is not good enough to confidently make my own decisions, and I'm learning as I go. The BPG says:

- The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.

If there was a reason leading up to this statement, I didn't follow it. However, a few paragraphs later, their RaidZ2 example says [4x(9+2), 2 hot spares, 18.0 TB]. So I guess 8+2 should be quite acceptable, especially since performance is the lowest priority.

On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey sh...@nedharvey.com wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of hatish

I have just read the Best Practices guide, and it says your group shouldn't have more than 9 disks.

I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it? Anything that is a rule of thumb either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
Re: [zfs-discuss] [mdb-discuss] onnv_142 - vfs_mountroot: cannot mount root
On 09/07/10 23:26, Piotr Jasiukajtis wrote:

Hi, After upgrade from snv_138 to snv_142 or snv_145 I'm unable to boot the system. Here is what I get. Any idea why it's not able to import rpool? I saw this issue also on older builds on different machines.

This sounds (based on the presence of cpqary) not unlike:

6972328 Installation of snv_139+ on HP BL685c G5 fails due to panic during auto install process

which was introduced into onnv_139 by the fix for this:

6927876 For 4k sector support, ZFS needs to use DKIOCGMEDIAINFOEXT

The fix is in onnv_148 after the external push switch-off, fixed via:

6967658 sd_send_scsi_READ_CAPACITY_16() needs to handle SBC-2 and SBC-3 response formats

I experienced this on data pools rather than the rpool, but I suspect on the rpool you'd get the vfs_mountroot panic you see when rpool import fails. My workaround was to compile a zfs with the fix for 6927876 changed to force the default physical block size of 512, and drop that into the BE before booting to it. There was no simpler workaround available.

Gavin
Re: [zfs-discuss] Suggested RaidZ configuration...
Mattias, what you say makes a lot of sense. When I saw *Both of the above situations resilver in equal time*, I was like no way! But like you said, assuming no bus bottlenecks. This is my exact breakdown (cheap disks on cheap bus :P):

PCI-E 8X 4-port ESata Raid Controller.
4 x ESata to 5-Sata Port multipliers (each connected to an ESata port on the controller).
20 x Samsung 1TB HDDs (each connected to a Port Multiplier).

The PCI-E 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time. So the bus is going to be quite a bottleneck.

Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB. Best case scenario, we can read 7.2TB at 3Gbps = 57.6Tb at 3Gbps = 57600Gb at 3Gbps = 19200 seconds = 320 minutes = 5 hours 20 minutes. Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 disks per PM] x 2, [3 disks per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs. I've never actually sat and done the math before. Hope it's decently accurate :)

On Wed, Sep 8, 2010 at 3:27 PM, Edward Ned Harvey sh...@nedharvey.com wrote:

From: pantz...@gmail.com [mailto:pantz...@gmail.com] On Behalf Of Mattias Pantzare

It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2 vdevs you have to read half the data compared to 1 vdev to resilver a disk.

Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. 
Then 11 disks will each read 100G, and the new disk will write 100G.

Let's suppose you have 1T of data. You have 2 vdevs that are each 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.

Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.

In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and cheap disks on a crap bus, so approx 5 disks max out approx 1 bus.) In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots.

So what it really comes down to is... Instead of the Best Practices Guide saying Don't put more than ___ disks into a single vdev, the BPG should say Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses.
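Hatish's 5h20m bus-bandwidth estimate earlier in the thread is easy to reproduce in shell (the inputs are his assumptions: 9 surviving drives with 800GB each to read, through a shared 3Gbps link):

```shell
# Bus-limited resilver estimate: total bits to read / link bandwidth.
per_drive_gb=800    # GB to read from each surviving drive (80% full)
drives=9            # surviving drives in the raidz vdev
link_gbps=3         # shared eSATA bottleneck, Gbit/s
total_gbit=$((per_drive_gb * drives * 8))   # 57600 Gb, i.e. 7.2TB
secs=$((total_gbit / link_gbps))            # 19200 s
echo "$((secs / 3600))h $((secs % 3600 / 60))m"
# prints: 5h 20m
```

Note this is the bandwidth-bound best case; as discussed later in the thread, a resilver can just as easily be bound by per-drive IOPS instead.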
Re: [zfs-discuss] Suggested RaidZ configuration...
Ah, I see. But I think your math is a bit out: 62.5e6 IOPS @ 100 IOPS = 625000 seconds = 10416m = 173h = 7d6h. So 7 days 6 hours. That's long, but I can live with it. This isn't for an enterprise environment. While the length of time is a worry in terms of increasing the chance another drive will fail, in my mind that is mitigated by the fact that the drives won't be under major stress during that time. It's a workable solution.

On Thu, Sep 9, 2010 at 3:03 PM, Erik Trimble erik.trim...@oracle.com wrote:

On 9/9/2010 5:49 AM, hatish wrote:

Very interesting... Well, let's see if we can do the numbers for my setup. From a previous post of mine:

[i]This is my exact breakdown (cheap disks on cheap bus :P):

PCI-E 8X 4-port ESata Raid Controller.
4 x ESata to 5-Sata Port multipliers (each connected to an ESata port on the controller).
20 x Samsung 1TB HDDs (each connected to a Port Multiplier).

The PCI-E 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time. So the bus is going to be quite a bottleneck.

Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB. Best case scenario, we can read 7.2TB at 3Gbps = 57.6Tb at 3Gbps = 57600Gb at 3Gbps = 19200 seconds = 320 minutes = 5 hours 20 minutes. Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 disks per PM] x 2, [3 disks per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs. 
I've never actually sat and done the math before. Hope it's decently accurate :)[/i]

My scenario, as from Erik's post:

Scenario: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOP gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other IO while a rebuild is going. Best case: I'll read at 12Gbps, write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB.

Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it. So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
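The IOPS-bound estimate, with the unit conversion as corrected at the top of this message (625,000 seconds is about 7 days, not weeks), works out like this in shell, using Erik's figures:

```shell
# IOPS-limited rebuild estimate: one ~16KB chunk reconstructed per IOP.
disk_bytes=1000000000000   # 1 TB, decimal (matches the 62.5e6 figure)
chunk=16000                # ~16k reconstructed per IOP
iops=100                   # what a 7200RPM SATA drive sustains
ops=$((disk_bytes / chunk))   # 62,500,000 IOPS needed
secs=$((ops / iops))          # 625,000 s
echo "$((secs / 3600)) hours, roughly $((secs / 86400)) days"
# prints: 173 hours, roughly 7 days
```

So the same rebuild that looked like 5-11 hours when bandwidth-bound becomes about a week when random IOPS are the limit, which is the real lesson of this sub-thread.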
Re: [zfs-discuss] NetApp/Oracle-Sun lawsuit done
Run away! Run fast little Netapp. Don't anger the sleeping giant - Oracle! David Magda wrote: Seems that things have been cleared up: NetApp (NASDAQ: NTAP) today announced that both parties have agreed to dismiss their pending patent litigation, which began in 2007 between Sun Microsystems and NetApp. Oracle and NetApp seek to have the lawsuits dismissed without prejudice. The terms of the agreement are confidential. http://tinyurl.com/39qkzgz http://www.netapp.com/us/company/news/news-rel-20100909-oracle-settlement.html A recap of the history at: http://www.theregister.co.uk/2010/09/09/oracle_netapp_zfs_dismiss/ -- Craig Cory, Senior Instructor :: ExitCertified
Re: [zfs-discuss] Suggested RaidZ configuration...
Hi, *The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore 12Gbps limit on the controller.* I was simply listing the bandwidth available at the different stages of the data cycle. The PCIE port gives me 32Gbps. The Sata card gives me a possible 12Gbps. I'd rather be cautious and assume I'll get more like 6Gbps; it is a cheap card after all. *I guarantee you this is not a sustainable speed for 7.2krpm sata disks.* (I am well aware :) ) *Which is 333% of the PM's capability.* Assuming that it is, 5 drives at that speed will max out my PM 3 times over. So my PM will automatically throttle the drives' speed to a third of that, on account of the PM being maxed out. Thanks for the rough IO speed check :) On Thu, Sep 9, 2010 at 3:20 PM, Edward Ned Harvey sh...@nedharvey.com wrote: From: Hatish Narotam [mailto:hat...@gmail.com] PCI-E 8X 4-port ESata Raid Controller. 4 x ESata to 5Sata Port multipliers (each connected to a ESata port on the controller). 20 x Samsung 1TB HDD's. (each connected to a Port Multiplier). Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm sata disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration. You think that your sata card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your sata controller, but please prove me wrong. ;-) I think the backplane of the sata controller is more likely either 3G or 6G. If it's 3G, then you should use 4 groups of raidz1. If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit). If it's 12G or higher, then you can make all of your drives one big vdev of raidz3. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives gives you 10Gbps. 
I guarantee you this is not a sustainable speed for 7.2krpm sata disks. You can get a decent measure of sustainable speed by doing something like:

(write 1G byte; beware: you might get an inaccurate speed measurement here due to RAM buffering. See below.)
time dd if=/dev/zero of=/some/file bs=1024k count=1024

(reboot to ensure nothing is in cache)

(read 1G byte)
time dd if=/some/file of=/dev/null bs=1024k

(Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
[zfs-discuss] zpool upgrade and zfs upgrade behavior on b145
Not sure what the best list to send this to is right now, so I have selected a few, apologies in advance. A couple of questions. First, I have a physical host (call him bob) that was just installed with b134 a few days ago. I upgraded to b145 using the instructions on the Illumos wiki yesterday. The pool has been upgraded (27) and the zfs file systems have been upgraded (5). ch...@bob:~# zpool upgrade rpool This system is currently running ZFS pool version 27. Pool 'rpool' is already formatted using the current version. ch...@bob:~# zfs upgrade rpool 7 file systems upgraded The file systems have been upgraded according to zfs get version rpool. Looks ok to me. However, I now get an error when I run zdb -D. I can't remember exactly when I turned dedup on, but I moved some data on rpool, and zpool list shows a 1.74x ratio. ch...@bob:~# zdb -D rpool zdb: can't open 'rpool': No such file or directory Also, running zdb by itself returns the expected output, but still says my rpool is version 22. Is that expected? I never ran zdb before the upgrade, since it was a clean install from the b134 iso to go straight to b145. One thing I will mention is that the hostname of the machine was changed too (using these instructions: http://wiki.genunix.org/wiki/index.php/Change_hostname_HOWTO). bob used to be eric. I don't know if that matters, but I can't open up the Users and Groups from Gnome anymore (unable to su), so something is still not right there. Moving on, I have another fresh install of b134 from iso inside a virtualbox virtual machine, on a totally different physical machine. This machine is named weston and was upgraded to b145 using the same Illumos wiki instructions. His name has never changed. When I run the same zdb -D command I get the expected output. 
ch...@weston:~# zdb -D rpool
DDT-sha256-zap-unique: 11 entries, size 558 on disk, 744 in core
dedup = 1.00, compress = 7.51, copies = 1.00, dedup * compress / copies = 7.51

However, after zpool and zfs upgrades *on both machines*, they still say the rpool is version 22. Is that expected/correct? I added a new virtual disk to the vm weston to see what would happen if I made a new pool on the new disk.

ch...@weston:~# zpool create test c5t1d0

Well, the new test pool shows version 27, but rpool is still listed at 22 by zdb. Is this expected/correct behavior? See the output below for the rpool and test pool version numbers according to zdb on the host weston. Can anyone provide any insight into what I'm seeing? Do I need to delete my b134 boot environments for rpool to show as version 27 in zdb? Why does zdb -D rpool give me can't open on the host bob? Thank you in advance, -Chris

ch...@weston:~# zdb
rpool:
    version: 22
    name: 'rpool'
    state: 0
    txg: 7254
    pool_guid: 17616386148370290153
    hostid: 8413798
    hostname: 'weston'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 17616386148370290153
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 14826633751084073618
            path: '/dev/dsk/c5t0d0s0'
            devid: 'id1,s...@sata_vbox_harddiskvbf6ff53d9-49330fdb/a'
            phys_path: '/p...@0,0/pci8086,2...@d/d...@0,0:a'
            whole_disk: 0
            metaslab_array: 23
            metaslab_shift: 28
            ashift: 9
            asize: 32172408832
            is_log: 0
            create_txg: 4
test:
    version: 27
    name: 'test'
    state: 0
    txg: 26
    pool_guid: 13455895622924169480
    hostid: 8413798
    hostname: 'weston'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 13455895622924169480
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 7436238939623596891
            path: '/dev/dsk/c5t1d0s0'
            devid: 'id1,s...@sata_vbox_harddiskvba371da65-169e72ea/a'
            phys_path: '/p...@0,0/pci8086,2...@d/d...@1,0:a'
            whole_disk: 1
            metaslab_array: 30
            metaslab_shift: 24
            ashift: 9
            asize: 3207856128
            is_log: 0
            create_txg: 4
Re: [zfs-discuss] [osol-code] What happened to ZFS bp rewrite?
On Fri, Sep 10, 2010 at 08:36:13AM -0700, Steeve Roy wrote: I am currently preparing a big SAN deployment using ZFS. As I will start with 60 TB of data with a growth rate of 25% per year, I need some online defrag, data redistribution across drives as the storage pool increases, etc... When can we expect to get the bp rewrite feature in ZFS? Thanks! I'm thinking zfs-discuss@opensolaris.org is a better place to ask (cc'ed). -- Will Fiveash Oracle http://opensolaris.org/os/project/kerberos/ Sent using mutt, a sweet text based e-mail app: http://www.mutt.org/
[zfs-discuss] What happened to ZFS bp rewrite?
I am currently preparing a big SAN deployment using ZFS. As I will start with 60 TB of data with a growth rate of 25% per year, I need some online defrag, data redistribution across drives as the storage pool increases, etc... When can we expect to get the bp rewrite feature in ZFS? Thanks! Steeve Roy IT Manager Coveo 2800 St-Jean Baptiste Suite 212 Québec, Qc G2E 6J5 Office: +1-418-263- ext:330 FAX: +1-418-263-1221 Mobile: +1-418-802-5440 s...@coveo.com www.coveo.com Information Access at the Speed of Business(tm)
Re: [zfs-discuss] Proper procedure when device names have changed
Or you can go into udev's persistent rules and set things up such that the drives always get the correct names. I'd guess you'll probably find them somewhere under /etc/udev/rules.d or something similar. It will likely save you trouble in the long run, as they are likely getting shuffled with either a kernel or udev upgrade. Robert On 9/13/10 10:31 AM, LaoTsao 老曹 wrote: try export and import the zpool On 9/13/2010 1:26 PM, Brian wrote: I am running zfs-fuse on an Ubuntu 10.04 box. I have a dual mirrored pool: mirror sdd sde mirror sdf sdg Recently the device names shifted on my box and the devices are now sdc, sdd, sde and sdf. The pool is of course very unhappy: the mirrors are no longer matched up and one device is missing. What is the proper procedure to deal with this? -brian
Re: [zfs-discuss] Intermittent ZFS hang
At first we blamed de-dupe, but we've disabled that. Next we suspected the SSD log disks, but we've seen the problem with those removed, as well. Did you have dedup enabled and then disabled it? If so, data can (or will) be deduplicated on the drives. Currently the only way of de-deduping them is to recopy them after disabling dedup. That's a good point. There is deduplicated data still present on disk. Do you think the issue we're seeing may be related to the existing deduped data? I'm not against copying the contents of the pool over to a new pool, but considering the effort/disruption I'd want to make sure it's not just a shot in the dark. If I don't have a good theory in another week, that's when I start shooting in the dark... -Charles
Re: [zfs-discuss] Configuration questions for Home File Server (CPU cores, dedup, checksum)?
On Tue, September 7, 2010 15:58, Craig Stevenson wrote: 3. Should I consider using dedup if my server has only 8 GB of RAM? Or, will that not be enough to hold the DDT? In which case, should I add L2ARC / ZIL or am I better off just skipping dedup on a home file server? I would not consider using dedup in the current state of the code. I hear too many horror stories. Also, why do you think you'd get much benefit? It takes pretty big blocks of exact bit-for-bit duplication to actually trigger the code, and you're not going to find them in compressed image (including motion picture / video) or audio files, for example (the main things that take up much space on most home servers). -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
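To put a rough number on the 8 GB question, here is a sketch using the commonly quoted planning figure of a few hundred bytes of ARC per unique block for the in-core DDT. The 320 bytes/entry and 128 KiB average block size below are illustrative assumptions, not numbers from this thread:

```python
# Hypothetical DDT sizing sketch -- 320 bytes/entry and a 128 KiB average
# block size are assumed planning numbers, not measured values.

def ddt_ram_bytes(unique_data_bytes, avg_block=128 * 1024, entry_bytes=320):
    """Estimate ARC needed to hold the whole dedup table in RAM."""
    entries = unique_data_bytes // avg_block  # one DDT entry per unique block
    return entries * entry_bytes

unique = 4 * 10**12  # say 4 TB of unique data on the home server
need = ddt_ram_bytes(unique)
print(f"~{need / 2**30:.1f} GiB of ARC for the DDT alone")
```

At 4 TB of unique data that already works out to roughly 9 GiB just for the table, which is one reason dedup on an 8 GB box usually implies adding an L2ARC device, or skipping dedup entirely.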
[zfs-discuss] zfs compression with Oracle - anyone implemented?
Hi! I've been scouring the forums and web for admins/users who have deployed ZFS with compression enabled on Oracle, backed by storage array LUNs. Any problems with cpu/memory overhead?
Re: [zfs-discuss] zfs compression with Oracle - anyone implemented?
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Brad Hi! I'd been scouring the forums and web for admins/users who deployed zfs with compression enabled on Oracle backed by storage array luns. Any problems with cpu/memory overhead? I don't think your question is clear. What do you mean by Oracle backed by storage LUNs? Do you mean on Oracle hardware? Do you mean you plan to run an Oracle database on the server, with ZFS under the database? Generally speaking, you can enable compression on any zfs filesystem; the cpu overhead is not very big, and the compression level is not very strong by default. However, if the data you have is generally incompressible, any overhead is a waste.
Re: [zfs-discuss] ZFS online device management
Can anyone elaborate on the zpool split command? I have not seen any examples of it in use and I am very curious about it. Say I have 12 disks in a pool named tank: 6 in a RAIDZ2 + another 6 in a RAIDZ2. All is well, and I'm not even close to maximum capacity in the pool. Say I want to swap out 6 of the 12 SATA disks for faster SAS disks, and make a new 6 disk pool with just the SAS disks, leaving the existing pool with the SATA disks intact. Can I run something like:

zpool split tank dozer c4t8d0 c4t9d0 c4t10d0 c4t11d0 c4t12d0 c4t13d0
zpool export dozer

Now, turn off the server, remove the 6 SATA disks. Put in the 6 SAS disks. Power on the server. echo | format to get the disk IDs of the new SAS disks.

zpool create speed raidz disk1 disk2 disk3 disk4 disk5 disk6

Thanks in advance, -Chris On Sat, Sep 11, 2010 at 4:37 PM, besson3c j...@netmusician.org wrote: Ahhh, I figured you could always do that, I guess I was wrong...
Re: [zfs-discuss] ZFS online device management
On Sep 13, 2010, at 4:40 PM, Chris Mosetick wrote: Can anyone elaborate on the zpool split command? I have not seen any examples of it in use and I am very curious about it. Say I have 12 disks in a pool named tank: 6 in a RAIDZ2 + another 6 in a RAIDZ2. All is well, and I'm not even close to maximum capacity in the pool. Say I want to swap out 6 of the 12 SATA disks for faster SAS disks, and make a new 6 disk pool with just the SAS disks, leaving the existing pool with the SATA disks intact. zpool split only works on mirrors. For examples, see the section Creating a New Pool By Splitting a Mirrored ZFS Storage Pool in the ZFS Admin Guide. -- richard -- OpenStorage Summit, October 25-27, Palo Alto, CA http://nexenta-summit2010.eventbrite.com Richard Elling rich...@nexenta.com +1-760-896-4422 Enterprise class storage for everyone www.nexenta.com
Re: [zfs-discuss] ZFS online device management
So are there now any methods to achieve the scenario I described, to shrink a pool's size with existing ZFS tools? I don't see a definitive way listed on the old shrinking thread: http://www.opensolaris.org/jive/thread.jspa?threadID=8125 Thank you, -Chris On Mon, Sep 13, 2010 at 4:55 PM, Richard Elling rich...@nexenta.com wrote: On Sep 13, 2010, at 4:40 PM, Chris Mosetick wrote: Can anyone elaborate on the zpool split command? I have not seen any examples of it in use and I am very curious about it. Say I have 12 disks in a pool named tank: 6 in a RAIDZ2 + another 6 in a RAIDZ2. All is well, and I'm not even close to maximum capacity in the pool. Say I want to swap out 6 of the 12 SATA disks for faster SAS disks, and make a new 6 disk pool with just the SAS disks, leaving the existing pool with the SATA disks intact. zpool split only works on mirrors. For examples, see the section Creating a New Pool By Splitting a Mirrored ZFS Storage Pool in the ZFS Admin Guide. -- richard
Re: [zfs-discuss] ZFS online device management
On Sep 13, 2010, at 5:51 PM, Chris Mosetick wrote: So are there now any methods to achieve the scenario I described to shrink a pools size with existing ZFS tools? I don't see a definitive way listed on the old shrinking thread. Today, there is no way to accomplish what you want without copying. -- richard
Re: [zfs-discuss] file recovery on lost RAIDZ array
I don't know what happened. I was in the process of copying files onto my new file server when the copy process from the other machine failed. I turned on the monitor for the file server and found that it had rebooted by itself at some point (machine fault maybe?), and when I remounted the drives every last thing was gone. I am new to zfs. How do you take snapshots? Does the system do it automagically for you?
Re: [zfs-discuss] file recovery on lost RAIDZ array
Oh and yes, raidz1.
Re: [zfs-discuss] file recovery on lost RAIDZ array
On Sep 12, 2010, at 7:49 PM, Michael Eskowitz wrote: I recently lost all of the data on my single parity raidz array. Each of the drives was encrypted, with the zfs array built within the encrypted volumes. I am not exactly sure what happened. Murphy strikes again! The files were there and accessible and then they were all gone. The server apparently crashed and rebooted and everything was lost. After the crash I remounted the encrypted drives and the zpool was still reporting that roughly 3TB of the 7TB array were used, but I could not see any of the files through the array's mount point. I unmounted the zpool and then remounted it and suddenly zpool was reporting 0TB were used. Were you using zfs send/receive? If so, then this is the behaviour expected when a session is interrupted. Since the snapshot did not completely arrive at the receiver, the changes are rolled back. It can take a few minutes for terabytes to be freed. I did not remap the virtual device. The only thing of note that I saw was that the name of the storage pool had changed. Originally it was Movies and then it became Movita. I am guessing that the file system became corrupted somehow. (zpool status did not report any errors) So, my questions are these... Is there any way to undelete data from a lost raidz array? It depends entirely on the nature of the loss. In the case I describe above, there is nothing lost because nothing was there (!) If I build a new virtual device on top of the old one and the drive topology remains the same, can we scan the drives for files from old arrays? The short answer is no. Also, is there any way to repair a corrupted storage pool? Yes, but it depends entirely on the nature of the corruption. Is it possible to backup the file table or whatever partition index zfs maintains? The ZFS configuration data is stored redundantly in the pool and checksummed. I imagine that you all are going to suggest that I scrub the array, but that is not an option at this point. 
I had a backup of all of the lost data, as I am moving between file servers, so at a certain point I gave up and decided to start fresh. This doesn't give me a warm fuzzy feeling about zfs, though. AFAICT, ZFS appears to be working as designed. Are you trying to kill the canary? :-) -- richard
Re: [zfs-discuss] resilver = defrag?
Richard Elling wrote: On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote: From: Richard Elling [mailto:rich...@nexenta.com] This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments. I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers. With appropriate write caching and write grouping or re-ordering algorithms, it should be possible to minimize the amount of file interleaving and fragmentation on write that takes place. (Or at least optimize the amount of file interleaving. Years ago MFM hard drives had configurable sector interleave factors to better optimize performance, when no interleaving meant the drive had spun the platter far enough to be ready to give the next sector to the CPU before the CPU was ready, with the result that the platter had to be spun a second time around to wait for the CPU to catch up.) Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks. RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous. From what I've seen, defragmentation offers its greatest benefit when the tiniest reads are eliminated by grouping them into larger contiguous reads. 
Once the contiguous areas reach a certain size (somewhere in the few Mbytes to a few hundred Mbytes range), further defragmentation offers little additional benefit. Full defragmentation is a useful goal when the option of using file-carving-based data recovery is desirable. Also remember that defragmentation is not limited to space used by files. It can also apply to free, unused space, which should also be defragmented to prevent future writes from being fragmented on write. With regard to multiuser systems and how that negates the need to defragment, I think that is only partially true. As long as the files are defragmented enough so that each particular read request only requires one seek before it is time to service the next read request, further defragmentation may offer only marginal benefit. On the other hand, if files have been fragmented down to each sector being stored separately on the drive, then each read request is going to take that much longer to complete (or will be interrupted by another read request because it has taken too long). -hk
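The diminishing-returns point above can be sketched with a toy model: read time = number of fragments x seek time + file size / sequential bandwidth. The 8 ms average seek and 100 MB/s streaming rate below are illustrative assumptions for a 7200rpm disk, not measurements:

```python
import math

SEEK_S = 0.008    # assumed average seek + rotational latency, seconds
SEQ_BPS = 100e6   # assumed sequential throughput, bytes/s

def read_time(file_bytes, extent_bytes):
    """Seconds to read a file laid out in contiguous extents of a given size."""
    fragments = math.ceil(file_bytes / extent_bytes)
    return fragments * SEEK_S + file_bytes / SEQ_BPS

GB = 10**9
for ext in (4 * 1024, 64 * 1024, 1024**2, 16 * 1024**2, 256 * 1024**2):
    print(f"{ext // 1024:>8} KiB extents: {read_time(GB, ext):8.1f} s per GB read")
```

Below roughly 1 MiB extents the seek term dominates (thousands of seconds per GB at 4 KiB); past a few tens of MiB the curve is already flat at the 10 s sequential floor, matching the observation that a few MB of contiguity is enough.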
Re: [zfs-discuss] resilver = defrag?
On Sep 13, 2010, at 9:41 PM, Haudy Kazemi wrote: Richard Elling wrote: On Sep 13, 2010, at 5:14 AM, Edward Ned Harvey wrote: From: Richard Elling [mailto:rich...@nexenta.com] This operational definition of fragmentation comes from the single-user, single-tasking world (PeeCees). In that world, only one thread writes files from one application at one time. In those cases, there is a reasonable expectation that a single file's blocks might be contiguous on a single disk. That isn't the world we live in, where we have RAID, multi-user, or multi-threaded environments. I don't know what you're saying, but I'm quite sure I disagree with it. Regardless of multithreading, multiprocessing, it's absolutely possible to have contiguous files, and/or file fragmentation. That's not a characteristic which depends on the threading model. Possible, yes. Probable, no. Consider that a file system is allocating space for multiple, concurrent file writers. With appropriate write caching and write grouping or re-ordering algorithms, it should be possible to minimize the amount of file interleaving and fragmentation on write that takes place. To some degree, ZFS already does this. The dynamic block sizing tries to ensure that a file is written into the largest block[1]. (Or at least optimize the amount of file interleaving. Years ago MFM hard drives had configurable sector interleave factors to better optimize performance, when no interleaving meant the drive had spun the platter far enough to be ready to give the next sector to the CPU before the CPU was ready, with the result that the platter had to be spun a second time around to wait for the CPU to catch up.) Reason #526 why SSDs kill HDDs on performance. Also regardless of raid, it's possible to have contiguous or fragmented files. The same concept applies to multiple disks. RAID works against the efforts to gain performance by contiguous access because the access becomes non-contiguous. 
From what I've seen, defragmentation offers its greatest benefit when the tiniest reads are eliminated by grouping them into larger contiguous reads. Once the contiguous areas reach a certain size (somewhere in the few Mbytes to a few hundred Mbytes range), further defragmentation offers little additional benefit. For the Wikipedia definition of defragmentation, this can only occur when the files themselves are hundreds of megabytes in size. This is not the general case for which I see defragmentation used. Also, ZFS has an intelligent prefetch algorithm that can hide some performance aspects of fragmentation on HDDs. Full defragmentation is a useful goal when the option of using file-carving-based data recovery is desirable. Also remember that defragmentation is not limited to space used by files. It can also apply to free, unused space, which should also be defragmented to prevent future writes from being fragmented on write. This is why ZFS uses a first fit algorithm until space becomes too low, when it changes to a best fit algorithm. As long as available space is big enough for the block, then it will be used. With regard to multiuser systems and how that negates the need to defragment, I think that is only partially true. As long as the files are defragmented enough so that each particular read request only requires one seek before it is time to service the next read request, further defragmentation may offer only marginal benefit. On the other hand, if files have been fragmented down to each sector being stored separately on the drive, then each read request is going to take that much longer to complete (or will be interrupted by another read request because it has taken too long). Yes, so try to avoid running your ZFS pool at more than 96% full. 
-- richard
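Richard's first-fit-then-best-fit remark can be illustrated with a toy allocator over a free list of (offset, length) holes. This is only a sketch of the two policies; the real ZFS metaslab allocator is far more involved:

```python
# Toy free-space allocator illustrating first-fit vs best-fit policies.
# The free list is a list of (offset, length) holes, lowest offset first.

def _take(free, i, size):
    """Carve `size` out of hole i and return its offset."""
    off, length = free[i]
    if length == size:
        free.pop(i)                        # hole consumed entirely
    else:
        free[i] = (off + size, length - size)  # shrink the hole
    return off

def first_fit(free, size):
    """Fast path: grab the first hole that is big enough."""
    for i, (_, length) in enumerate(free):
        if length >= size:
            return _take(free, i, size)
    return None

def best_fit(free, size):
    """Slow path: scan everything, take the tightest fit to limit fragmentation."""
    fits = [(length, i) for i, (_, length) in enumerate(free) if length >= size]
    if not fits:
        return None
    _, i = min(fits)
    return _take(free, i, size)

holes = [(0, 100), (200, 16), (300, 50)]
print(first_fit(list(holes), 16))  # 0   - first big-enough hole, leaves a sliver
print(best_fit(list(holes), 16))   # 200 - exact-size hole, no leftover sliver
```

First fit is cheap per allocation, which is why it makes sense while a pool has plenty of room; best fit pays a full scan per allocation in exchange for chewing up awkward small holes first, which matters once free space is scarce.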