Re: [zfs-discuss] how to know available disk space
This is one of the greatest annoyances of ZFS. I don't really understand how a zvol's space cannot be accurately enumerated from top to bottom of the tree in 'df' output etc. Why does a zvol divorce the space used from the root of the volume?

Gregg Wonderly

On Feb 6, 2013, at 5:26 PM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

I have a bunch of VMs, and some samba shares, etc., on a pool. I created the VMs using zvols, specifically so they would have an appropriate refreservation and never run out of disk space, even with snapshots. Today, I ran out of disk space, and all the VMs died. So obviously it didn't work.

When I used zpool list after the system crashed, I saw this:

    NAME      SIZE  ALLOC   FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
    storage   928G   568G   360G         -  61%  1.00x  ONLINE  -

I did some cleanup, so I could turn things back on... Freed up about 4G. Now, when I use zpool list I see this:

    NAME      SIZE  ALLOC   FREE  EXPANDSZ  CAP  DEDUP  HEALTH  ALTROOT
    storage   928G   564G   364G         -  60%  1.00x  ONLINE  -

When I use zfs list storage I see this:

    NAME      USED  AVAIL  REFER  MOUNTPOINT
    storage   909G  4.01G  32.5K  /storage

So I guess the lesson is (a) refreservation and zvol alone aren't enough to ensure your VMs will stay up, and (b) if you want to know how much room is *actually* available (as in usable: how much can I write before I run out of space?) you should use zfs list and not zpool list.
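For anyone comparing the two views, the short version is that zpool list reports raw pool capacity (before RAID-Z parity, reservations, and refreservations are charged), while zfs list reports what is actually writable. A minimal check, using the pool name from the post (output omitted; values will differ):

    # zpool list -o name,size,alloc,free storage              # raw capacity; ignores reservations
    # zfs list -o name,used,avail,refer storage               # writable space after reservations
    # zfs get -r refreservation,usedbyrefreservation storage  # see where the space went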
Re: [zfs-discuss] pool metadata has duplicate children
Have you tried importing the pool with that drive completely unplugged? Which HBA are you using? How many of these disks are on the same or separate HBAs?

Gregg Wonderly

On Jan 8, 2013, at 12:05 PM, John Giannandrea j...@meer.net wrote:

I seem to have managed to end up with a pool that is confused about its children disks. The pool is faulted with corrupt metadata:

      pool: d
     state: FAULTED
    status: The pool metadata is corrupted and the pool cannot be opened.
    action: Destroy and re-create the pool from a backup source.
       see: http://illumos.org/msg/ZFS-8000-72
      scan: none requested
    config:

        NAME                     STATE     READ WRITE CKSUM
        d                        FAULTED       0     0     1
          raidz1-0               FAULTED       0     0     6
            da1                  ONLINE        0     0     0
            3419704811362497180  OFFLINE       0     0     0  was /dev/da2
            da3                  ONLINE        0     0     0
            da4                  ONLINE        0     0     0
            da5                  ONLINE        0     0     0

But if I look at the labels on all the online disks I see this:

    # zdb -ul /dev/da1 | egrep '(children|path)'
        children[0]:
                path: '/dev/da1'
        children[1]:
                path: '/dev/da2'
        children[2]:
                path: '/dev/da2'
        children[3]:
                path: '/dev/da3'
        children[4]:
                path: '/dev/da4'
    ...

But the offline disk (da2) shows the older correct label:

        children[0]:
                path: '/dev/da1'
        children[1]:
                path: '/dev/da2'
        children[2]:
                path: '/dev/da3'
        children[3]:
                path: '/dev/da4'
        children[4]:
                path: '/dev/da5'

zpool import -F doesn't help because none of the labels on the unfaulted disks seem to have the right label. And unless I can import the pool I can't replace the bad drive. Also, zpool seems to really not want to import a raidz1 pool with one faulted drive, even though that should be readable. I have read about the undocumented -V option but don't know if that would help.

I got into this state when I noticed the pool was DEGRADED and was trying to replace the bad disk. I am debugging it under FreeBSD 9.1.

Suggestions of things to try welcome. I'm more interested in learning what went wrong than restoring the pool. I don't think I should have been able to go from one offline drive to an unrecoverable pool this easily.

-jg
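A quick way to see how far apart the disks' views of the pool are is to compare the uberblock transaction groups recorded in each label. A sketch using the device names from the post (the egrep pattern is only illustrative):

    # for d in da1 da2 da3 da4 da5; do
    >   echo "== $d =="
    >   zdb -ul /dev/$d | egrep 'txg|timestamp' | sort -u | tail -4
    > done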
Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64
I have seen some drives not be recognized on a hot plug, but cfgadm seemed to always fix that. I don't recall a cold boot not recognizing the drives. Does the BIOS boot of the card show all of the drives connected? I did not update the firmware in the cards that I bought.

Gregg

On Nov 19, 2012, at 10:45 AM, Jerry Kemp sun.mail.lis...@oryx.cc wrote:

Hello Gregg,

I acquired one of these Intel RAID Controller Card SATA/SAS PCI-E x8 8-internal-port (SASUC8I) cards from your newegg link below, and then acquired the necessary cables to get everything hooked up. After multiple executions of devfsadm and reconfigure boots, the OS sees one of my 4 drives. The drives are 2 TB Seagate drives.

Did you need to do anything special to get your card to work correctly? Did you need to do a firmware upgrade or anything? I am running an up-to-date version of OpenIndiana b151a7.

Thank you,

Jerry

On 10/26/12 10:02 AM, Gregg Wonderly wrote:

I've been using this card http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157 for my Solaris/OpenIndiana installations because it has 8 ports. One of the issues that this card seems to have is that certain failures can cause secondary problems in other drives on the same SAS connector. I use mirrors for my storage machines with 4 pairs, and just put half of each mirror on one side and the other drive on the other side. This, in general, has solved my problems. When a drive fails, I might see more than one drive not functioning. I can remove a drive (I use hot swap bays such as http://www.newegg.com/Product/Product.aspx?Item=N82E16817994097) and restore the other to the pool to find which of the failed drives is actually the problem. What had happened before was that my case was not moving enough air, and the hot drives had caused odd problems with failure.

For the money, and the experience I have with these controllers, I'd still use them; they are 3Gb/s controllers. If you want 6Gb/s controllers, then some of the other suggestions might be a better choice for you.

Gregg
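For the hot-plug case, the usual recovery sequence on Solaris/OpenIndiana is to re-probe the attachment point and rebuild the /dev links. A sketch (the sata0/3 attachment point is hypothetical; take the real one from the cfgadm listing):

    # cfgadm -al                   # list attachment points and their occupant state
    # cfgadm -c configure sata0/3  # bring an unconfigured disk online
    # devfsadm -Cv                 # rebuild /dev links and prune stale ones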
Re: [zfs-discuss] Forcing ZFS options
Do you move the pools between machines, or just on the same physical machine? Could you just use symlinks from the new root to the old root so that the names work until you can reboot? It might be more practical to always use symlinks if you do a lot of moving things around; then you wouldn't have to figure out how to do the reboot shuffle. Instead, you could just shuffle the symlinks.

Gregg Wonderly

On Nov 9, 2012, at 10:47 AM, Jim Klimov jimkli...@cos.ru wrote:

There are times when ZFS options cannot be applied at the moment, i.e. changing desired mountpoints of active filesystems (or setting a mountpoint over a filesystem location that is currently not empty). Such attempts now bail out with messages like:

    cannot unmount '/var/adm': Device busy
    cannot mount '/export': directory is not empty

and such. Is it possible to force the new values to be saved into ZFS dataset properties, so they do take effect upon next pool import?

I currently work around the harder of such situations with a reboot into a different boot environment, or even into a livecd/failsafe, just so that the needed datasets or paths won't be busy and so I can set, verify and apply these mountpoint values. This is not a convenient way to do things :)

Thanks,
//Jim Klimov
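A minimal sketch of the symlink shuffle being suggested, with hypothetical dataset and path names: the dataset keeps its old (busy) mountpoint until a convenient reboot, while the new name resolves through a symlink in the meantime.

    # zfs get -H -o value mountpoint tank/proj    # currently /export/proj, busy
    # ln -s /export/proj /newhome/proj            # new name points at the old location
    (later, from a reboot or another boot environment)
    # rm /newhome/proj
    # zfs set mountpoint=/newhome/proj tank/proj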
Re: [zfs-discuss] suggestions for e-SATA HBA card on x86/x64
I've been using this card http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157 for my Solaris/OpenIndiana installations because it has 8 ports. One of the issues that this card seems to have is that certain failures can cause secondary problems in other drives on the same SAS connector. I use mirrors for my storage machines with 4 pairs, and just put half of each mirror on one connector and the other drive on the other. This, in general, has solved my problems. When a drive fails, I might see more than one drive not functioning. I can remove a drive (I use hot swap bays such as http://www.newegg.com/Product/Product.aspx?Item=N82E16817994097) and restore the other to the pool to find which of the failed drives is actually the problem. What had happened before was that my case was not moving enough air, and the hot drives had caused odd problems with failure.

For the money, and the experience I have with these controllers, I'd still use them; they are 3Gb/s controllers. If you want 6Gb/s controllers, then some of the other suggestions might be a better choice for you.

Gregg

On Oct 24, 2012, at 10:59 PM, Jerry Kemp sun.mail.lis...@oryx.cc wrote:

I have just acquired a new JBOD box that will be used as a media center/storage for home use only, on my x86/x64 box currently running OpenIndiana b151a7. It's strictly a JBOD, no HW RAID options, with an eSATA port to each drive.

I am looking for suggestions for an HBA card with at least (2), but (4) external eSATA ports would be nice. I know enough to stay away from the port expander things. I do not need the HBA to support any internal drives.

In reviewing the archives/past posts, it seems that LSI is the way to go. I would like to spend USD $200-$300, but would spend more if necessary for a good, trouble-free HBA. I made this comment as I went to look at some of the LSI cards previously mentioned, and found they were priced $500-$600 and up.

TIA for any pointers,

Jerry
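A sketch of the layout being described, with hypothetical device names: each mirror vdev pairs one disk from each SAS connector, so a cascading failure on one connector degrades every mirror rather than breaking any of them.

    # zpool create tank \
        mirror c5t0d0 c6t0d0 \
        mirror c5t1d0 c6t1d0 \
        mirror c5t2d0 c6t2d0 \
        mirror c5t3d0 c6t3d0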
Re: [zfs-discuss] cannot replace X with Y: devices have different sector alignment
What is the error message you are seeing on the replace? This sounds like a slice size/placement problem, but clearly prtvtoc seems to think that everything is the same. Are you certain that you ran prtvtoc on the correct drive, and not one of the active disks by mistake?

Gregg Wonderly

As does fdisk -G:

    root@nas:~# fdisk -G /dev/rdsk/c16t5000C5002AA08E4Dd0
    * Physical geometry for device /dev/rdsk/c16t5000C5002AA08E4Dd0
    * PCYL   NCYL   ACYL  BCYL  NHEAD  NSECT  SECSIZ
      60800  60800  0     0     255    252    512
    root@nas:~# fdisk -G /dev/rdsk/c16t5000C5005295F727d0
    * Physical geometry for device /dev/rdsk/c16t5000C5005295F727d0
    * PCYL   NCYL   ACYL  BCYL  NHEAD  NSECT  SECSIZ
      60800  60800  0     0     255    252    512

On Mon, Sep 24, 2012 at 9:01 AM, LIC mesh licm...@gmail.com wrote:

Yet another weird thing: prtvtoc shows both drives as having the same sector size, etc.:

    root@nas:~# prtvtoc /dev/rdsk/c16t5000C5002AA08E4Dd0
    * /dev/rdsk/c16t5000C5002AA08E4Dd0 partition map
    *
    * Dimensions:
    *     512 bytes/sector
    *     3907029168 sectors
    *     3907029101 accessible sectors
    *
    * Flags:
    *   1: unmountable
    *  10: read-only
    *
    * Unallocated space:
    *       First     Sector    Last
    *       Sector    Count     Sector
    *           34       222       255
    *
    *                          First       Sector      Last
    * Partition  Tag  Flags    Sector      Count       Sector      Mount Directory
           0      4    00           256  3907012495  3907012750
           8     11    00    3907012751       16384  3907029134

    root@nas:~# prtvtoc /dev/rdsk/c16t5000C5005295F727d0
    * /dev/rdsk/c16t5000C5005295F727d0 partition map
    *
    * Dimensions:
    *     512 bytes/sector
    *     3907029168 sectors
    *     3907029101 accessible sectors
    *
    * Flags:
    *   1: unmountable
    *  10: read-only
    *
    * Unallocated space:
    *       First     Sector    Last
    *       Sector    Count     Sector
    *           34       222       255
    *
    *                          First       Sector      Last
    * Partition  Tag  Flags    Sector      Count       Sector      Mount Directory
           0      4    00           256  3907012495  3907012750
           8     11    00    3907012751       16384  3907029134

On Mon, Sep 24, 2012 at 12:20 AM, Timothy Coalson tsc...@mst.edu wrote:

I think you can fool a recent Illumos kernel into thinking a 4k disk is 512 (incurring a performance hit for that disk, and therefore the vdev and pool, but to save a raidz1 it might be worth it): http://wiki.illumos.org/display/illumos/ZFS+and+Advanced+Format+disks , see "Overriding the Physical Sector Size". I don't know what you might have to do to coax it to do the replace with a hot spare (zpool replace? export/import?).

Perhaps there should be a feature in ZFS that notifies you when a pool is created or imported with a hot spare that can't be automatically used in one or more vdevs? The whole point of hot spares is to have them automatically swap in when you aren't there to fiddle with things, which is a bad time to find out it won't work.

Tim

On Sun, Sep 23, 2012 at 10:52 PM, LIC mesh licm...@gmail.com wrote:

Well, this is a new one. Illumos/OpenIndiana let me add a device as a hot spare that evidently has a different sector alignment than all of the other drives in the array. So now I'm at the point that I *need* a hot spare, and it doesn't look like I have it. And, worse, the other spares I have are all the same model as said hot spare.

Is there anything I can do with this, or am I just going to be up the creek when any one of the other drives in the raidz1 fails?
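For reference, the override Timothy mentions is configured through sd-config-list in /kernel/drv/sd.conf, per the linked illumos wiki page. A sketch; the vendor/model string here is hypothetical and must match the drive's inquiry data exactly (vendor padded to 8 characters):

    # Fragment of /kernel/drv/sd.conf: make sd report 512-byte physical sectors
    sd-config-list = "ATA     ST2000DM001", "physical-block-size:512";

    # then reload the sd driver configuration:
    # update_drv -vf sd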
Re: [zfs-discuss] deleting a link in ZFS
On Aug 28, 2012, at 6:01 AM, Murray Cullen themurma...@gmail.com wrote:

I've copied an old home directory from an install of OS 134 to the data pool on my OI install. OpenSolaris apparently had wine installed, as I now have a link to / in my data pool. I've tried everything I can think of to remove this link, with one exception: I have not tried mounting the pool on a different OS yet; I'm trying to avoid that. Does anyone have any advice or suggestions? Unlink and rm error out as root.

What is the error? Is it "permission denied", "I/O error", or what?

Gregg
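When rm fails for no obvious reason, tracing the failing system call usually names the errno directly. A sketch with a hypothetical path (wine normally keeps a 'z:' symlink pointing at / under .wine/dosdevices, which may be the link in question):

    # ls -li /tank/oldhome/.wine/dosdevices               # inspect the suspect entries
    # truss rm /tank/oldhome/.wine/dosdevices/z: 2>&1 | grep -i unlink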
Re: [zfs-discuss] unable to import the zpool
My experience has always been that ZFS tries hard to keep you from doing something wrong when devices are failing or otherwise unavailable. With mirrors, it will import with a device missing from a mirror vdev. I don't use cache or log devices in my mainly-storage pools, so I've not seen a failure with a required device like that missing. But I've seen problems with a raidz device missing and the pool not coming online. As Richard says, it would seem there is a cache or log vdev missing, since it is showing 1 of 2 mirrored devices in that vdev missing but still complaining about a missing device. The older OS and ZFS version may in fact have a misbehavior due to some error condition not being correctly managed.

Gregg Wonderly

On Aug 2, 2012, at 4:49 PM, Richard Elling richard.ell...@gmail.com wrote:

On Aug 1, 2012, at 12:21 AM, Suresh Kumar wrote:

Dear ZFS users,

I am using Solaris x86 10u10. All the devices which belong to my zpool are in an available state, but I am unable to import the zpool.

    #zpool import tXstpool
    cannot import 'tXstpool': one or more devices is currently unavailable

    bash-3.2# zpool import
      pool: tXstpool
        id: 13623426894836622462
     state: UNAVAIL
    status: One or more devices are missing from the system.
    action: The pool cannot be imported. Attach the missing devices and try again.
       see: http://www.sun.com/msg/ZFS-8000-6X
    config:

        tXstpool                     UNAVAIL   missing device
          mirror-0                   DEGRADED
            c2t210100E08BB2FC85d0s0  FAULTED   corrupted data
            c2t21E08B92FC85d2        ONLINE

    Additional devices are known to be part of this pool, though their exact configuration cannot be determined.

This message is your clue. The pool is missing a device. In most of the cases where I've seen this, it occurs on older ZFS implementations and the missing device is an auxiliary device: cache or spare.
 -- richard

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
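One way to find out what the missing device actually was is to dump a surviving member's ZFS label and look at the vdev tree recorded there, which lists every child (including spares and cache devices) by path and GUID. A sketch using the device name from the post:

    # zdb -l /dev/rdsk/c2t210100E08BB2FC85d0s0 | egrep 'type|path|guid'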
Re: [zfs-discuss] Can the ZFS copies attribute substitute HW disk redundancy?
On Jul 29, 2012, at 3:12 PM, opensolarisisdeadlongliveopensolaris opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov

I wondered if the copies attribute can be considered sort of equivalent to the number of physical disks - limited to seek times though. Namely, for the same amount of storage on a 4-HDD box I could use raidz1 and 4*1tb@copies=1 or 4*2tb@copies=2 or even 4*3tb@copies=3, for example.

The first question - reliability... copies might be on the same disk. So it's not guaranteed to help if you have a disk failure.

I thought I understood that copies would not be on the same disk. I guess I need to go read up on this again.

Gregg Wonderly
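For what it's worth, as I understand the ditto-block placement policy, ZFS puts the extra copies on different vdevs when the pool has several, but on a single-vdev pool they land on the same disk (merely spread apart on the platter). So copies=N mitigates localized corruption well but is not a guaranteed substitute for disk redundancy. Setting and checking it (dataset name hypothetical):

    # zfs create -o copies=2 tank/important
    # zfs get copies tank/important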
Re: [zfs-discuss] stop sparing process
Well, I hate to do it, but sometimes I've just unplugged the power on my SATA drives, or ejected them if hot-plug, to stop nonsense that I could not stop. As a matter of fact, I recently received a replacement drive for one I RMA'd. I attached it to the mate drive, and it floundered for more than 5 minutes, pretty much disabling 'format' and 'zpool status'.

If there was anything I'd change about zpool activities, it's that I'd change user ioctl operations into async activities on kernel threads, and have a complete view of the pools stored in RAM that could be read (as updates occurred) via a unix domain socket by a status-reporting tool. That tool, in a GUI desktop environment, would post a read and, when satisfied, report the details of the error, transition, etc. as a popup on the desktop. Whether a GUI was present or not, it should log the data to syslog. That would make ZFS much nicer to use, because admins could always take action on multiple pools and devices without being burdened by the constant problem of failing devices locking you out of system administration activities.

Gregg Wonderly

On Jul 28, 2012, at 6:45 AM, Antonio S. Cofiño antonio.cof...@unican.es wrote:

Hello everyone,

Does anybody know how to stop a sparing process? I have tried:

    ad...@seal.macc.unican.es:~$ pfexec zpool detach oceano c8t24d0
    cannot detach c8t24d0: no valid replicas

without success. I know that the drive being spared is OK and I want to stop the process. Here you can see my zpool status (the failing disk which originated a general failure is being replaced):

    admin@seal:~$ zpool status
      pool: oceano
     state: DEGRADED
    status: One or more devices is currently being resilvered. The pool will
            continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
     scrub: resilver in progress for 0h3m, 0.01% done, 883h56m to go
    config:

        NAME                          STATE     READ WRITE CKSUM
        oceano                        DEGRADED     0     0     0
          raidz2-0                    ONLINE       0     0     0
            c5t5000CCA369C5A416d0     ONLINE       0     0     0
            c5t5000CCA369C5A420d0     ONLINE       0     0     0
            c5t5000CCA369C5A432d0     ONLINE       0     0     0
            c10t5000CCA369C505D5d0    ONLINE       0     0     0
            spare-4                   ONLINE       0     0     0
              c10t5000CCA369C506AFd0  ONLINE       0     0     0
              c8t24d0                 ONLINE       0     0     0  131M resilvered
            c10t5000CCA369C506BBd0    ONLINE       0     0     0
            c5t5000CCA369C5C19Ad0     ONLINE       0     0     0
            c10t5000CCA369C508C9d0    ONLINE       0     0     0
            c5t5000CCA369C52E05d0     ONLINE       0     0     0
            c10t5000CCA369C508E0d0    ONLINE       0     0     0
            c10t5000CCA369C50609d0    ONLINE       0     0     0
          raidz2-1                    ONLINE       0     0     0
            c4t5d0                    ONLINE       0     0     0
            c4t6d0                    ONLINE       0     0     0
            c4t7d0                    ONLINE       0     0     0
            c8t10d0                   ONLINE       0     0     0
            c8t11d0                   ONLINE       0     0     0
            c8t12d0                   ONLINE       0     0     0
            c8t13d0                   ONLINE       0     0     0
            c8t14d0                   ONLINE       0     0     0
            c8t15d0                   ONLINE       0     0     0
            c8t16d0                   ONLINE       0     0     0
            c8t17d0                   ONLINE       0     0     0
          raidz2-2                    ONLINE       0     0     0
            c4t8d0                    ONLINE       0     0     0
            c4t9d0                    ONLINE       0     0     0
            c4t10d0                   ONLINE       0     0     0
            c4t11d0                   ONLINE       0     0     0
            c8t6d0                    ONLINE       0     0     0
            c8t18d0                   ONLINE       0     0     0
            c8t19d0                   ONLINE       0     0     0
            c8t20d0                   ONLINE       0     0     0
            c8t21d0                   ONLINE       0     0     0
            c8t22d0                   ONLINE       0     0     0
            c8t23d0                   ONLINE       0     0     0
          raidz2-3                    ONLINE       0     0     0
            c5t5000CCA369C5A41Dd0     ONLINE       0     0     0
            c10t5000CCA369C4E90Bd0    ONLINE       0     0     0
            c5t5000CCA369C5A42Dd0     ONLINE       0     0     0
            c10t5000CCA369C4F888d0    ONLINE       0     0     0
            c5t5000CCA369C5A374d0     ONLINE       0     0     0
            c10t5000CCA369C50F1Fd0    ONLINE       0     0     0
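For the question itself: detaching the spare is the supported way to cancel a sparing operation, so the "no valid replicas" error above looks like state confusion rather than a real constraint. One commonly suggested (but by no means guaranteed) sequence is to retry the detach after an export/import cycle:

    # zpool detach oceano c8t24d0                  # the normal way to cancel a spare-in
    # zpool export oceano && zpool import oceano   # sometimes clears stuck spare state
    # zpool detach oceano c8t24d0                  # retry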
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Since there is a finite number of bit patterns per block, have you tried to just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if there is ever a collision? If you found an algorithm that produced no collisions for any possible block bit pattern, wouldn't that be the win?

Gregg Wonderly

On Jul 11, 2012, at 5:56 AM, Sašo Kiselkov wrote:

On 07/11/2012 12:24 PM, Justin Stringfellow wrote:

Suppose you find a weakness in a specific hash algorithm; you use this to create hash collisions, and now imagine you store the hash collisions in a zfs dataset with dedup enabled using the same hash algorithm.

Sorry, but isn't this what dedup=verify solves? I don't see the problem here. Maybe all that's needed is a comment in the manpage saying hash algorithms aren't perfect.

It does solve it, but at a cost to normal operation. Every write gets turned into a read. Assuming a big enough and reasonably busy dataset, this leads to tremendous write amplification.

Cheers,
--
Saso
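To put numbers on why this enumeration is out of reach even for small blocks, compare the count of distinct 4K inputs against the SHA-256 output space:

\[
\underbrace{2^{8 \times 4096}}_{\text{distinct 4K blocks}} = 2^{32768} \approx 10^{9864}
\qquad \text{vs.} \qquad
2^{256} \approx 1.2 \times 10^{77}\ \text{digests.}
\]

By pigeonhole, collisions certainly exist; the rest of the thread is an argument about how likely anyone is to ever hit one.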
Re: [zfs-discuss] New fast hash algorithm - is it needed?
But this is precisely the kind of observation that some people seem to miss the importance of. As Tomas suggested in his post, if this were true, then we could have a huge compression ratio as well. And even if 10% of the bit patterns created non-unique hashes, you could use the fact that a block hashed to a known bit pattern that didn't have collisions to compress the other 90% of your data.

I'm serious about this from a number of perspectives. We worry about the time it would take to reverse SHA or RSA hashes to passwords, not even thinking: what if someone has been quietly computing all possible hashes for the past 10-20 years into a database somewhere, with every 5-16 character password, and now has an instantly searchable hash-to-password database? Sometimes we ignore the scale of time, thinking that only the immediately visible details are what we have to work with.

If no one has computed the hashes for every single 4K and 8K block, then fine. But if that was done, and we had that data, we'd know for sure which algorithm was going to work the best for the number of bits we are considering. Speculating based on the theory of the algorithms for a random number of bits is just silly. Where's the real data that tells us what we need to know?

Gregg Wonderly

On Jul 11, 2012, at 9:02 AM, Sašo Kiselkov wrote:

On 07/11/2012 03:57 PM, Gregg Wonderly wrote:

Since there is a finite number of bit patterns per block, have you tried to just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if there is ever a collision? If you found an algorithm that produced no collisions for any possible block bit pattern, wouldn't that be the win?

Don't you think that, if you can think of this procedure, the crypto security guys at universities have thought about it as well? Of course they have. No, simply generating a sequence of random patterns and hoping to hit a match won't do the trick.

P.S. I really don't mean to sound smug or anything, but I know one thing for sure: the crypto researchers who propose these algorithms are some of the brightest minds on this topic on the planet, so I would hardly think they didn't consider trivial problems.

Cheers,
--
Saso
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Unfortunately, the government imagines that people are using their home computers to compute hashes and try to decrypt stuff. Look at what is happening with GPUs these days. People are hooking up 4 GPUs in their computers and getting huge performance gains. A 5-6 character password space is covered in a few days; 12 or so characters would take one machine a couple of years, if I recall. So if we had 20 people with that class of machine, we'd be down to a few months. I'm just suggesting that while the compute space is still huge, it's not actually undoable; it just requires some thought into how to approach the problem, and then some time to do the computations. Huge space, but still finite…

Gregg Wonderly

On Jul 11, 2012, at 9:13 AM, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Gregg Wonderly

Since there is a finite number of bit patterns per block, have you tried to just calculate the SHA-256 or SHA-512 for every possible bit pattern to see if there is ever a collision? If you found an algorithm that produced no collisions for any possible block bit pattern, wouldn't that be the win?

Maybe I misunderstand what you're saying, but if I got it right, what you're saying is physically impossible to do in the time of the universe... and guaranteed to fail even if you had all the computational power of God.

I think you're saying: in a block of 128k, sequentially step through all the possible values (starting with 0, 1, 2, ..., 2^128k), compute the hash of each value, and see if you ever find a hash collision.

If this is indeed what you're saying, recall that the above operation will require on the order of 2^128k operations to complete. But present national security standards accept 2^256 operations as satisfactory to protect data from brute force attacks over the next 30 years. Furthermore, in a 128k block there exist 2^128k possible values, while in a 512-bit hash there exist only 2^512 possible values (which is still a really huge number). This means there will exist at least 2^127.5k collisions. However, these numbers are so astronomically, universally, magnanimously huge that it will still take more than a lifetime to find any one of those collisions. So it's impossible to perform such a computation, and if you could, you would be guaranteed to find a LOT of collisions.
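Making the pigeonhole step explicit (a 128 KiB block is \(2^{20}\) bits, which tightens the loose "2^128k" shorthand above), the average number of distinct blocks mapping to each 512-bit digest is

\[
\frac{2^{2^{20}}}{2^{512}} = 2^{1048576-512} = 2^{1048064},
\]

so collisions exist in absurd abundance, yet the chance that two *particular* random blocks share a digest is still only \(2^{-512}\): the abundance of collisions and the odds of ever encountering one are very different things.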
Re: [zfs-discuss] New fast hash algorithm - is it needed?
This is exactly the issue for me. It's vital to always have verify on. If you don't have the data to prove that every possible block combination hashes uniquely for the small bit space we are talking about, then how in the world can you say that verify is not necessary? That just seems ridiculous to propose.

Gregg Wonderly

On Jul 11, 2012, at 9:22 AM, Bob Friesenhahn wrote:

On Wed, 11 Jul 2012, Sašo Kiselkov wrote:

the hash isn't used for security purposes. We only need something that's fast and has a good pseudo-random output distribution. That's why I looked toward Edon-R. Even though it might have security problems in itself, it's by far the fastest algorithm in the entire competition.

If an algorithm is not 'secure' and zfs is not set to verify, doesn't that mean that a knowledgeable user will be able to cause intentional data corruption if deduplication is enabled? A user with very little privilege might be able to cause intentional harm by writing the magic data block before some other known block (which produces the same hash) is written. This allows one block to substitute for another. It does seem that security is important because, with a human element, data is not necessarily random.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
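The setting being argued about is per-dataset. With verify enabled, a hash match is only a hint: ZFS still does a byte-for-byte comparison before sharing blocks, so even a colliding (or attacker-constructed) hash cannot silently merge different data. A sketch with a hypothetical dataset name:

    # zfs set dedup=sha256,verify tank/vmimages   # dedup only after byte comparison passes
    # zfs get dedup tank/vmimages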
Re: [zfs-discuss] New fast hash algorithm - is it needed?
Yes, but from the other angle: the number of unique 128K blocks that you can actually store on your ZFS pool is finitely small compared to the total space of patterns. So the patterns you need to actually consider are not more than the physical limits of the universe.

Gregg Wonderly

On Jul 11, 2012, at 9:39 AM, Sašo Kiselkov wrote:

On 07/11/2012 04:27 PM, Gregg Wonderly wrote:

Unfortunately, the government imagines that people are using their home computers to compute hashes and try to decrypt stuff. [...] Huge space, but still finite…

There are certain physical limits which one cannot exceed. For instance, you cannot store 2^256 units of 32-byte quantities on Earth. Even if you used proton spin (or some other quantum property) to store a bit, there simply aren't enough protons in the entire visible universe to do it. You will never ever be able to search a 256-bit memory space using a simple exhaustive search. The reason why our security hashes are so long (256 bits, 512 bits, more...) is because attackers *don't* do an exhaustive search.
--
Saso
Re: [zfs-discuss] New fast hash algorithm - is it needed?
So, if I had a block collision on my ZFS pool that used dedup, and it had my bank balance of $3,212.20 on it, and you tried to write your bank balance of $3,292,218.84 and got the same hash, no verify, and thus you got my block/balance, and now your bank balance was reduced by three orders of magnitude, would you be okay with that? What assurances would you be content with using my ZFS pool?

Gregg Wonderly

On Jul 11, 2012, at 9:43 AM, Sašo Kiselkov wrote:

On 07/11/2012 04:30 PM, Gregg Wonderly wrote:

This is exactly the issue for me. It's vital to always have verify on. If you don't have the data to prove that every possible block combination hashes uniquely for the small bit space we are talking about, then how in the world can you say that verify is not necessary? That just seems ridiculous to propose.

Do you need assurances that in the next 5 seconds a meteorite won't fall to Earth and crush you? No. And yet the Earth puts on thousands of tons of weight each year from meteoric bombardment, and people have been hit and killed by them (not to speak of mass extinction events). Nobody has ever demonstrated being able to produce a hash collision in any suitably long hash (128 bits plus) using a random search. All hash collisions have been found by attacking weaknesses in the mathematical definition of these functions (i.e. some part of the input didn't get obfuscated well in the hash function machinery and spilled over into the result, resulting in a slight, but usable, non-randomness).

Cheers,
--
Saso
Re: [zfs-discuss] New fast hash algorithm - is it needed?
I'm just suggesting that the time frame for 256-bit or 512-bit hashes becoming less safe is closing faster than one might actually think, because social elements of the internet allow a lot more effort to be focused on a single problem than one might consider.

Gregg Wonderly

On Jul 11, 2012, at 9:50 AM, Edward Ned Harvey wrote:

From: Gregg Wonderly [mailto:gr...@wonderly.org]
Sent: Wednesday, July 11, 2012 10:28 AM

Unfortunately, the government imagines that people are using their home computers to compute hashes and try to decrypt stuff. Look at what is happening with GPUs these days.

heheheh. I guess the NSA didn't think of that. ;-)

(That's sarcasm, in case anyone didn't get it.)
Re: [zfs-discuss] New fast hash algorithm - is it needed?
You're entirely sure that there could never be two different blocks that hash to the same value? Wow, can you just send me the cash now and we'll call it even?

Gregg

On Jul 11, 2012, at 9:59 AM, Sašo Kiselkov wrote:

On 07/11/2012 04:56 PM, Gregg Wonderly wrote:

So, if I had a block collision on my ZFS pool that used dedup, and it had my bank balance of $3,212.20 on it, and you tried to write your bank balance of $3,292,218.84 and got the same hash, no verify, and thus you got my block/balance, and now your bank balance was reduced by three orders of magnitude, would you be okay with that? What assurances would you be content with using my ZFS pool?

I'd feel entirely safe. There, I said it.
--
Saso
Re: [zfs-discuss] New fast hash algorithm - is it needed?
What I'm saying is that I am getting conflicting information from your rebuttals here.

I (and others) say there will be collisions that will cause data loss if verify is off. You say it would be so rare as to be impossible, from your perspective. Tomas says: well, then let's just use the hash value for a 4096x compression. You fluff around his argument, calling him names. I say: well, then compute all the possible hashes for all possible bit patterns and demonstrate no dupes. You say it's not possible to do that. I illustrate a way that loss of data could cost you money. You say it's impossible for there to be a chance of me constructing a block that has the same hash but different content. Several people have illustrated that 128K to 32 bytes is a huge and lossy ratio of compression, yet you still say it's viable to leave verify off. I say, in fact, that the total number of unique patterns that can exist on any pool is small compared to the total, illustrating that I understand how the key space for the algorithm is small when looking at a ZFS pool, and thus could have a non-collision opportunity.

So I can see what perspective you are drawing your confidence from, but I, and others, are not confident that the risk has zero probability. I'm pushing you to find a way to demonstrate that there is zero risk, because if you do that, then you've in fact created the ultimate compression factor to date for random bit patterns (but enlarged the keys that could collide, because the pool is now virtually larger), and you've also demonstrated that the particular algorithm is very good for dedup. That would indicate to me that you could then take that algorithm and run it inside of ZFS dedup to automatically manage when verify is necessary, by detecting when a collision occurs.

I appreciate the push back. I'm trying to drive thinking about this in the direction of what is known and finite, away from what is infinitely complex and thus impossible to explore. Maybe all the work has already been done…

Gregg

On Jul 11, 2012, at 11:02 AM, Sašo Kiselkov wrote:

On 07/11/2012 05:58 PM, Gregg Wonderly wrote:

You're entirely sure that there could never be two different blocks that hash to the same value? Wow, can you just send me the cash now and we'll call it even?

You're the one making the positive claim and I'm calling bullshit. So the onus is on you to demonstrate the collision (and that you arrived at it via your brute force method as described). Until then, my money stays safely in my bank account. Put up or shut up, as the old saying goes.
--
Saso
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On Jul 11, 2012, at 12:06 PM, Sašo Kiselkov wrote:

I say, in fact, that the total number of unique patterns that can exist on any pool is small compared to the total, illustrating that I understand how the key space for the algorithm is small when looking at a ZFS pool, and thus could have a non-collision opportunity.

This is so profoundly wrong that it leads me to suspect you never took courses on cryptography and/or information theory. The size of your storage pool DOESN'T MATTER ONE BIT to the size of the key space. Even if your pool were the size of a single block, we're talking here about the *mathematical* possibility of hitting on a random block that hashes to the same value. Given a stream of random data blocks (thus simulating an exhaustive brute-force search) and a secure pseudo-random hash function (which has a roughly equal chance of producing any output value for a given input block), you've got only a 10^-77 chance of getting a hash collision. If you don't understand how this works, read a book on digital coding theory.

The size of the pool does absolutely matter, because it represents the total number of possible bit patterns you can involve in the mapping (through the math). If the size of the ZFS pool is limited, the total number of unique blocks is in fact limited by the size of the pool. This affects how many collisions are possible, and thus how effective dedup can be. Over time, if the bit patterns can change on each block, at some point you can arrive at one of the collisions. Yes, it's rare; I'm not disputing that. I am disputing that the risk is discardable in computer applications where data integrity matters: for example, losing money, as in the example I used.

I'm pushing you to find a way to demonstrate that there is zero risk, because if you do that, then you've in fact created the ultimate compression factor to date for random bit patterns (but enlarged the keys that could collide, because the pool is now virtually larger), and you've also demonstrated that the particular algorithm is very good for dedup. That would indicate to me that you could then take that algorithm and run it inside of ZFS dedup to automatically manage when verify is necessary, by detecting when a collision occurs.

Do you know what a dictionary is in compression algorithms?

Yes, I am familiar with this kind of compression.

Do you even know how things like Huffman coding or LZW work, at least in principle?

Yes.

If not, then I can see why you didn't understand my earlier explanations of why hashes aren't usable for compression.

With zero collisions in a well-defined key space, they would work perfectly for compression. To wit, you are saying that you are comfortable enough using them for dedup, which is exactly a form of compression. I'm agreeing that the keyspace is huge, but the collision possibilities mean I'm not comfortable with verify=no. If there wasn't a sufficiently small keyspace in a ZFS pool, then dedup would never succeed. There are some block contents that are recurring: usually blocks filled with 00, FF, or some pattern from a power-up memory state, etc. So those few common patterns are easily dedup'd out.

I appreciate the push back. I'm trying to drive thinking about this in the direction of what is known and finite, away from what is infinitely complex and thus impossible to explore.

If you don't understand the mathematics behind my arguments, just say so.

I understand the math.
I'm not convinced it's nothing to worry about, because my data is valuable enough to me that I am using ZFS. If I were using dedup, I'd for sure turn verify on…

Gregg
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
Use 'dd' to replicate as much of lofi/2 as you can onto another device, and then cable that into place? It looks like you just need to put a functioning, working, but not correct device in that slot so that it will import, and then you can 'zpool replace' the new disk into the pool, perhaps?

Gregg Wonderly

On 6/16/2012 2:02 AM, Scott Aitken wrote:

On Sat, Jun 16, 2012 at 08:54:05AM +0200, Stefan Ring wrote:

when you say remove the device, I assume you mean simply make it unavailable for import (I can't remove it from the vdev).

Yes, that's what I meant.

    root@openindiana-01:/mnt# zpool import -d /dev/lofi
      pool: ZP-8T-RZ1-01
        id: 9952605666247778346
     state: FAULTED
    status: One or more devices are missing from the system.
    action: The pool cannot be imported. Attach the missing devices and try again.
       see: http://www.sun.com/msg/ZFS-8000-3C
    config:

        ZP-8T-RZ1-01              FAULTED  corrupted data
          raidz1-0                DEGRADED
            12339070507640025002  UNAVAIL  cannot open
            /dev/lofi/5           ONLINE
            /dev/lofi/4           ONLINE
            /dev/lofi/3           ONLINE
            /dev/lofi/1           ONLINE

It's interesting that even though 4 of the 5 disks are available, it still can't import it as DEGRADED.

I agree that it's interesting. Now someone really knowledgeable will need to have a look at this. I can only imagine that somehow the devices contain data from different points in time, and that it's too far apart for the aggressive txg rollback that was added in PSARC 2009/479. Btw, did you try that? Try: zpool import -d /dev/lofi -FVX ZP-8T-RZ1-01.

Hi again,

that got slightly further, but still no dice:

    root@openindiana-01:/mnt# zpool import -d /dev/lofi -FVX ZP-8T-RZ1-01
    root@openindiana-01:/mnt# zpool list
    NAME           SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH   ALTROOT
    ZP-8T-RZ1-01     -      -      -    -      -   FAULTED  -
    rpool         15.9G  2.17G  13.7G  13%  1.00x  ONLINE   -
    root@openindiana-01:/mnt# zpool status
      pool: ZP-8T-RZ1-01
     state: FAULTED
    status: One or more devices could not be used because the label is missing or
            invalid. There are insufficient replicas for the pool to continue
            functioning.
    action: Destroy and re-create the pool from a backup source.
       see: http://www.sun.com/msg/ZFS-8000-5E
      scan: none requested
    config:

        NAME                      STATE     READ WRITE CKSUM
        ZP-8T-RZ1-01              FAULTED      0     0     1  corrupted data
          raidz1-0                ONLINE       0     0     6
            12339070507640025002  UNAVAIL      0     0     0  was /dev/lofi/2
            /dev/lofi/5           ONLINE       0     0     0
            /dev/lofi/4           ONLINE       0     0     0
            /dev/lofi/3           ONLINE       0     0     0
            /dev/lofi/1           ONLINE       0     0     0

    root@openindiana-01:/mnt# zpool scrub ZP-8T-RZ1-01
    cannot scrub 'ZP-8T-RZ1-01': pool is currently unavailable

Thanks for your tenacity Stefan.

Scott
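A sketch of the dd salvage being suggested: copy everything readable from the failing image onto a fresh file, padding unreadable sectors with zeros so offsets stay aligned, then attach the copy in place of lofi/2 (paths and block size are illustrative):

    # dd if=/dev/lofi/2 of=/spare/disk2.img bs=512 conv=noerror,sync
    # lofiadm -a /spare/disk2.img            # attach the copy as a new lofi device
    # zpool import -d /dev/lofi ZP-8T-RZ1-01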
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Jun 16, 2012, at 9:49 AM, Scott Aitken wrote:

On Sat, Jun 16, 2012 at 09:09:53AM -0500, Gregg Wonderly wrote:

[...]

Hi Greg,

lofi/2 is a dd of a real disk. I am using disk images because I can roll back, clone, etc. without using the original drives (which are long gone anyway). I have tried making /2 unavailable for import, and zfs just moans that it can't be opened. It fails to import even though I have only one disk missing of a RAIDZ array.

My experience is that ZFS will not import a pool with a missing disk. There has to be something in that slot before the import will occur. Even if the disk is corrupt, it needs to be there. I think this is a failsafe mechanism that tries to keep a pool from going live when you have mistakenly not connected all the drives. That keeps the disks from becoming chronologically/txg-misaligned, which can result in data loss in the right combinations, I believe.

Gregg Wonderly
Re: [zfs-discuss] Recovery of RAIDZ with broken label(s)
On Jun 16, 2012, at 10:13 AM, Scott Aitken wrote:

On Sat, Jun 16, 2012 at 09:58:40AM -0500, Gregg Wonderly wrote:

[...]

My experience is that ZFS will not import a pool with a missing disk. There has to be something in that slot before the import will occur. Even if the disk is corrupt, it needs to be there. I think this is a failsafe mechanism that tries to keep a pool from going live when you have mistakenly not connected all the drives. That keeps the disks from becoming chronologically/txg-misaligned, which can result in data loss in the right combinations, I believe.

Gregg Wonderly

Hi again Gregg,

Not sure if I should be top posting this... Given I am working with images, it's hard to put just anything in place of lofi/2. ZFS scans all of the files in the directory for ZFS labels, so just replacing lofi/2 with an empty file (for example) just means ZFS skips it, which is the same result as deleting lofi/2 altogether. I did this, but to no avail; ZFS complains about having insufficient replicas.

I don't really know much about the total space layout of a ZFS disk surface, because I
Re: [zfs-discuss] What is your data error rate?
What I've noticed is that when I have my drives in a situation of small airflow, and hence hotter operating temperatures, my disks will drop quite quickly. I've now moved my systems into large cases with large amounts of airflow, using the Icy Dock brand of removable drive enclosures:

    http://www.newegg.com/Product/Product.aspx?Item=N82E16817994097
    http://www.newegg.com/Product/Product.aspx?Item=N82E16817994113

I use the SASUC8I SATA/SAS controller to access 8 drives:

    http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157

I put it in PCI-e x16 slots on graphics-heavy motherboards, which might have as many as 4x PCI-e x16 slots. I am replacing an old motherboard with this one:

    http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1124780

The case that I found to be a good match for my needs is the Raven:

    http://www.newegg.com/Product/Product.aspx?Item=N82E16811163180

It has enough slots (7) to put 2x 3-in-2 and 1x 4-in-3 Icy Dock bays in, to provide 10 drives in hot swap bays. I really think that the big issue is that you must move the air. The drives really need to stay cool, or else you will see degraded performance and/or data loss much more often.

Gregg Wonderly

On 1/24/2012 9:50 AM, Stefan Ring wrote:

After having read this mailing list for a little while, I get the impression that there are at least some people who regularly experience on-disk corruption that ZFS should be able to report and handle. I've been running a raidz1 on three 1TB consumer disks for approx. 2 years now (about 90% full), and I scrub the pool every 3-4 weeks and have never had a single error. From the oft-quoted 10^14 error rate that consumer disks are rated at, I should have seen an error by now; the scrubbing process is not the only activity on the disks, after all, and the data transfer volume from that alone clocks in at almost exactly 10^14 by now. Not that I'm worried, of course, but it comes as a slight surprise to me. Or does the 10^14 rating just reflect the strength of the on-disk ECC algorithm?
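On the monitoring side, Solaris keeps per-device soft/hard/transport error counters that tend to climb as an overheating drive degrades, and they are cheap to check:

    # iostat -En | egrep 'Errors|Vendor'     # cumulative error counters per device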
Re: [zfs-discuss] Can I create a mirror for a root rpool?
On 12/19/2011 8:51 PM, Frank Cusack wrote:

If you don't detach the smaller drive, the pool size won't increase. Even if the remaining smaller drive fails, that doesn't mean you have to detach it. So yes, the pool size might increase, but it won't be unexpectedly. It will be because you detached all smaller drives. Also, even if a smaller drive is failed, it can still be attached.

If you don't have a controller slot to connect the replacement drive through, then you have to physically remove the smaller drive. You can then attach the replacement drive, but will 'replace' work then, or must you remove and then add it because it is the same disk?

It doesn't make sense for attach to do anything with partition tables, IMHO.

I understand that in some cases it might be more problematic for attach to assume some things about partitioning. I don't know that I have the answer, but I know from experience that there is nothing I hate more than having to figure out how to partition disks on Solaris. It's just too painful, with so many steps and conditions of use.

I *always* order the spare when I order the original drives, to have it on hand, even for my home system. Drive sizes change more frequently than they fail, for me. Sure, when I use the spare I may not be able to order a new spare of the same size, but at least at that time I have time to prepare and am not scrambling.

Most of the time, I have spares ready too. I have returned 4 drives of one manufacturer, and 2 of another, with 2 more disks showing signs of failure. These are all SATA disks on my home server. At this point, with drive prices so high, it's not simple to pick up a couple more spares to have on hand. For my root pool, I had no remaining 250GB disks of the kind I've been using for root. So I put in one of my 1.5TB spares for the moment, until I decide whether or not to order a new small drive.

On Mon, Dec 19, 2011 at 3:55 PM, Gregg Wonderly gregg...@gmail.com wrote:

That's why I'm asking. I think it should always mirror the partition table and allocate exactly the same amount of space, so that the pool doesn't suddenly change sizes unexpectedly and require a disk size that I don't have at hand to put the mirror back up.
Re: [zfs-discuss] Can I create a mirror for a root rpool?
That's why I'm asking. I think it should always mirror the partition table and allocate exactly the same amount of space, so that the pool doesn't suddenly change sizes unexpectedly and require a disk size that I don't have at hand to put the mirror back up.

Gregg

On 12/18/2011 4:08 PM, Nathan Kroenert wrote:

Do note that, though Frank is correct, you have to be a little careful around what might happen should you drop your original disk and only the large mirror half is left... ;)

On 12/16/11 07:09 PM, Frank Cusack wrote:

You can just do fdisk to create a single large partition. The attached mirror doesn't have to be the same size as the first component.

On Thu, Dec 15, 2011 at 11:27 PM, Gregg Wonderly gregg...@gmail.com wrote:

Cindy, will it ever be possible to just have attach mirror the surfaces, including the partition tables? I spent an hour today trying to get a new mirror on my root pool. There was a 250GB disk that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... | fmthard does not work in this case, and so you have to do the partitioning by hand, which is just silly to fight with anyway.

Gregg

Sent from my iPhone

On Dec 15, 2011, at 6:13 PM, Tim Cook t...@cook.ms wrote:

Do you still need to do the grub install?

On Dec 15, 2011 5:40 PM, Cindy Swearingen cindy.swearin...@oracle.com wrote:

Hi Anon,

The disk that you attach to the root pool will need an SMI label and a slice 0. The syntax to attach a disk to create a mirrored root pool is like this, for example:

    # zpool attach rpool c1t0d0s0 c1t1d0s0

Thanks,
Cindy

On 12/15/11 16:20, Anonymous Remailer (austria) wrote:

On Solaris 10, if I install using ZFS root on only one drive, is there a way to add another drive as a mirror later? Sorry if this was discussed already. I searched the archives and couldn't find the answer. Thank you.
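On Tim's grub question: attaching a root-pool mirror does not by itself make the second disk bootable; on Solaris 10/OpenSolaris x86 the boot blocks must be installed on the new half explicitly (SPARC uses installboot instead). Reusing Cindy's example device:

    # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0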
Re: [zfs-discuss] Very poor pool performance - no zfs/controller errors?!
On 12/18/2011 4:23 PM, Jan-Aage Frydenbø-Bruvoll wrote: Hi, On Sun, Dec 18, 2011 at 22:14, Nathan Kroenert nat...@tuneunix.com wrote: I know some others may already have pointed this out - but I can't see it and not say something... Do you realise that losing a single disk in that pool could pretty much render the whole thing busted? At least for me - at the rate at which _I_ seem to lose disks, it would be worth considering something different ;) Yeah, I have thought that thought myself. I am pretty sure I have a broken disk, however I cannot for the life of me find out which one. zpool status gives me nothing to work on, MegaCli reports that all virtual and physical drives are fine, and iostat gives me nothing either. What other tools are there out there that could help me pinpoint what's going on? One choice would be to take a single drive that you believe is in good working condition, and attach it as a mirror to each single disk in turn. If there is a bad disk, the resilver will reveal it through read errors on the source drive. Scrub, though, should really be telling you everything you need to know about disk failures, once the surface becomes corrupted enough that it can't be corrected by re-reading enough times. It looks like you've started mirroring some of the drives. That's really what you should be doing for the other non-mirrored drives. Gregg Wonderly ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
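A sketch of that triage, with a hypothetical pool named tank, suspect disk c1t2d0, and known-good spare c1t9d0 (all names illustrative):

  # zpool scrub tank                  (forces every allocated block to be read and checksummed)
  # zpool status -v tank              (per-device read/write/cksum counters after the scrub)
  # iostat -En                        (driver-level soft/hard/transport error counts per disk)
  # fmdump -e                         (the FMA error log often flags a dying disk before zpool status does)
  # zpool attach tank c1t2d0 c1t9d0   (temporary mirror; read errors during resilver implicate the source disk)
  # zpool detach tank c1t9d0          (drop the temporary mirror before testing the next disk)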
Re: [zfs-discuss] Can I create a mirror for a root rpool?
The issue is really quite simple. The Solaris install, on x86 at least, chooses to use slice 0 for the root partition. That slice is not created by a default format/fdisk, and so we have the web strewn with prtvtoc path/to/old/slice2 | fmthard -s - path/to/new/slice2 as a way to cause the two commands to access the entire disk. If you have to use dissimilar-sized disks because 1) that's the only media you have, or 2) you want to increase the size of your root pool, then all we end up with is an error message about overlapping partitions and no ability to make progress. If I then use dd if=/dev/zero to erase the front of the disk, and then fire up format, select fdisk, say yes to create Solaris2 partitioning, and then use partition to add a slice 0, I will still have problems getting the whole disk in play. So the end result is that I have to jump through hoops, when in the end, I'd really like to just add the whole disk, every time. If I say zpool attach rpool c8t0d0s0 c12d1, I really do mean the whole disk, and I'm not sure why it can't just happen. Failing to type a slice reference is no worse a 'typo' than typing 's2' by accident, because that's what I've been typing with all the other commands to try and get the disk partitioned. I just really think there's not a lot of value in all of this, especially with ZFS, where we can, in fact, add more disks/vdevs to keep expanding space, and extremely rarely is that going to be done, for the root pool, with fractions of disks. The use of SMI labels, and the absolute refusal to use EFI partitioning, plus all of this, just stacks up to a pretty large barrier to simple and/or easy administration. I'm very nervous when I have a simplex filesystem sitting there, and when a disk has died, I'm doubly nervous that the other half is going to fall over. I'm not trying to be hard-nosed about this, I'm just trying to share my angst and frustration with the details that drove me in that direction. Gregg Wonderly On 12/16/2011 2:56 AM, Andrew Gabriel wrote: On 12/16/11 07:27 AM, Gregg Wonderly wrote: Cindy, will it ever be possible to just have attach mirror the surfaces, including the partition tables? I spent an hour today trying to get a new mirror on my root pool. There was a 250GB disk that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... | fmthard does not work in this case Can you be more specific about why it fails? I have seen a couple of cases, and I'm wondering if you're hitting the same thing. Can you post the prtvtoc output of your original disk please? and so you have to do the partitioning by hand, which is just silly to fight with anyway. Gregg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
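Written out, the hoop-jumping above looks something like this; the device names come from the message, the rest is an illustrative sketch, and the dd step destroys anything on the target disk:

  # dd if=/dev/zero of=/dev/rdsk/c12d1p0 bs=1024k count=10   (wipe stale labels at the front of the disk)
  # fdisk -B /dev/rdsk/c12d1p0                               (write a default Solaris2 partition spanning the whole disk)
  # format -d c12d1                                          (then create slice 0 by hand in the partition menu)
  # zpool attach rpool c8t0d0s0 c12d1s0                      (and finally attach the slice, not the whole disk)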
Re: [zfs-discuss] Can I create a mirror for a root rpool?
Cindy, will it ever be possible to just have attach mirror the surfaces, including the partition tables? I spent an hour today trying to get a new mirror on my root pool. There was a 250GB disk that failed. I only had a 1.5TB handy as a replacement. prtvtoc ... | fmthard does not work in this case, and so you have to do the partitioning by hand, which is just silly to fight with anyway. Gregg Sent from my iPhone On Dec 15, 2011, at 6:13 PM, Tim Cook t...@cook.ms wrote: Do you still need to do the grub install? On Dec 15, 2011 5:40 PM, Cindy Swearingen cindy.swearin...@oracle.com wrote: Hi Anon, The disk that you attach to the root pool will need an SMI label and a slice 0. The syntax to attach a disk to create a mirrored root pool is like this, for example: # zpool attach rpool c1t0d0s0 c1t1d0s0 Thanks, Cindy On 12/15/11 16:20, Anonymous Remailer (austria) wrote: On Solaris 10, if I install using ZFS root on only one drive, is there a way to add another drive as a mirror later? Sorry if this was discussed already. I searched the archives and couldn't find the answer. Thank you. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] grrr, How to get rid of mis-touched file named `-c'
On 11/26/2011 5:30 AM, Brandon High wrote: On Wed, Nov 23, 2011 at 11:43 AM, Harry Putnam rea...@newsguy.com wrote: OK, I'm out of escapes. or other tricks... other than using emacs, but I haven't installed emacs as yet. I can just ignore them of course, until such time as I do get emacs installed, but by now I just want to know how it might be done from a shell prompt. rm ./-c ./-O ./-k And many versions of getopt support the use of -- as the end-of-options indicator, so you can do rm -- -c -O -k to remove those as well. Gregg Wonderly ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
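Spelled out, with a third variant for cases where rm predates -- support (the find example assumes an implementation with -maxdepth, such as GNU find; the stock Solaris find may lack it):

  $ rm ./-c ./-O ./-k                                (a leading ./ keeps rm from parsing the names as options)
  $ rm -- -c -O -k                                   (-- marks the end of options for POSIX-style utilities)
  $ find . -maxdepth 1 -name '-c' -exec rm {} \;     (sidesteps rm's option parsing entirely)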
Re: [zfs-discuss] how to set up solaris os and cache within one SSD
On 11/10/2011 7:42 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of darkblue 1 * XEON 5606 1 * Supermicro X8DT3-LN4F 6 * 4G RECC RAM 22 * WD RE3 1T harddisk 4 * Intel 320 (160G) SSD 1 * Supermicro 846E1-900B chassis I just want to say, this isn't supported hardware, and although many people will say they do this without problem, I've heard just as many people (including myself) saying it's unstable that way. I recommend buying either the Oracle hardware, or Nexenta on whatever hardware they recommend. Definitely DO NOT run the free version of Solaris without updates and expect it to be reliable. But that's a separate issue. I'm also emphasizing that even if you pay for Solaris support on non-Oracle hardware, don't expect it to be great. But maybe it will be. I think the key issue here is whether this hardware will corrupt a pool or not. Ultimately, the promise of ZFS, for me anyway, is that I can take disks to new hardware if/when needed. I am not dependent on a controller or motherboard which provides some feature key to access the data on the disks. Companies which sell key software that you depend on working have generally proven that software to work reliably on hardware which they might sell to make use of said software. Apple's business model and success, for example, is based on this fact, because they have a much smaller bug pool to consider. Oracle hardware works out the same way. I think supporting the development of ZFS is key to the next generation of storage solutions... But I don't need the class of hardware that Oracle wants me to pay for. I need disks with 24/7 reliability. I can wait till tomorrow to store something onto my server from my laptop/desktop. Consumer/non-enterprise needs are quite different, and I don't think Oracle understands how to deal in the 1,000,000,000-potential-customer marketplace. They've had a hard enough time just working in the 100,000-customer marketplace. Gregg ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Data distribution not even between vdevs
On 11/9/2011 8:05 AM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ding Honghui But now, as shown below, the first 2 raidz1 vdevs' usage is about 78% and the last 2 raidz1 vdevs' usage is about 93%. In this case, when you write, it should be writing to the first two vdevs, not the last two. So the fact that the last two are over 93% full should be irrelevant in terms of write performance. All my files are small files of about 150KB. That's too bad. Raidz performs well with large sequential data, and performs poorly with small random files. Now the questions are: 1. Should I balance the data between the vdevs by copying the data and removing the data located in the last 2 vdevs? If you want to. But most people wouldn't bother. Especially since you're talking about 75% versus 90%. It's difficult to balance it so *precisely* as to get them both around 85%. 2. Is there any method to automatically re-balance the data? There is no automatic way to do it. For me, this is a key issue. If there were an automatic rebalancing mechanism, that same mechanism would work perfectly to allow pools to have disk sets removed. It would provide the needed basic mechanism of just moving stuff around to eliminate the use of a particular part of the pool that you wanted to remove. Gregg Wonderly ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
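The closest thing to a manual rebalance is rewriting the data, since new allocations favor the emptier vdevs. A hand-rolled sketch, assuming a hypothetical dataset tank/data with enough free space in the pool for a second copy (local snapshots, clones, and non-default properties need extra care):

  # zfs snapshot tank/data@rebalance
  # zfs send tank/data@rebalance | zfs receive tank/data.new   (receive rewrites every block, spread by current free space)
  # zfs destroy -r tank/data                                   (removes the old, unbalanced copy and its snapshot)
  # zfs rename tank/data.new tank/data
  # zfs destroy tank/data@rebalance                            (clean up the snapshot carried over by the send)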
[zfs-discuss] Lycom has lots of hardware that looks interesting, is it supported
I've been building a few 6-disk boxes for VirtualBox servers, and I am also surveying how I will add more disks as these boxes need it. Looking around on the HCL, I see the Lycom PE-103 is supported. That's just 2 more disks; I'm typically going to want to add a raid-z w/spare to my zpools, so I need at least 4 disks, and I'd prefer to build boxes with multi-lane eSATA expansion and put either 5 or 10 disks in them for expansion. There are lots of devices on the Lycom web site at http://www.lycom.com.tw. The device at http://www.lycom.com.tw/st126rm.htm looks very attractive for bolting onto computer cases that are housing additional drives. That page says that the PE-102 can be used for multi-lane connectivity. Is multi-lane working in Solaris, and since the PE-102 seems to have the same chipset as the PE-103, would it work on OpenSolaris? -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss