Re: [zfs-discuss] ZFS Storage server hardware
On 26.08.2010 at 04:38, Edward Ned Harvey wrote: There is no such thing as a reliable external disk. Not unless you want to pay $1000 each, which is dumb. You have to scrap your mini and use internal (or hot-swappable) disks. Never expect a mini to be reliable. They're designed to be small and cute, not reliable.

The Mac mini and the disks themselves are just fine. The problem seems to be the SATA-to-USB/FireWire bridges: they just stall when the load gets heavy. Martin
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
Hello, actually this is bad news. I always assumed that the mirror redundancy of the ZIL could also be used to handle bad blocks on the ZIL device (just as the main pool's self-healing does for data blocks). I don't actually know how SSDs die; given their wear-out characteristics I can imagine an increased number of bad blocks / bit errors at the EOL of such a device - probably undiscovered. Because the ZIL is write-only, you only find out whether it worked when you need it - which is bad. So my suggestion was always to run with one ZIL device during pre-production, and add the ZIL mirror 2 weeks later when production starts. That way they don't age exactly the same, and zil2 has 2 more weeks of expected lifetime (or even more, assuming the usual heavier writes during stress testing). I would call this pre-aging. However, if the second ZIL device is not used to recover from bad blocks, this does not make a lot of sense. So I would say there are 2 bugs / missing features here: 1) the ZIL needs to report truncated transactions on ZIL corruption; 2) the ZIL should use its mirrored counterpart to recover from bad block checksums. Now with OpenSolaris being closed by Oracle and Illumos being just started, I don't know how to handle bug filing :) - is bugs.opensolaris.org still maintained? Regards, Robert
Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome
This paper is exactly what is needed -- giving a wide audience an overview of the fundamental ZFS components and benefits. I found several grammar errors -- to be expected in a draft -- and I think at least one technical error. The paper seems to imply that multiple vdevs will induce striping across the vdevs, a la RAIDx0. Though I haven't looked at the code, my understanding is that records are contained within a single vdev. The clarification that each vdev gives IOPS roughly equivalent to a single disk is useful information that is not generally understood; I was glad to see it there. Overall, this is a terrific step forward for understanding ZFS and encouraging its adoption. Now if only SRSS would work under Nexenta...
Re: [zfs-discuss] shrink zpool
Is it currently possible, or will it be in the near future, to shrink a zpool / remove a disk?

As others have noted, no, not until the mythical bp_rewrite() function is introduced. So far I have found no documentation on bp_rewrite(), other than that it is the solution to evacuating a vdev, restriping a vdev, defragmenting a vdev, solving world hunger and bringing peace to the Middle East. If you search the forums you will find all sorts of discussion around this elusive feature, but nothing concrete. I think it's hiding behind the unicorn located at the end of the rainbow. With Oracle withdrawing/inhousing/whatever development, it's a safe bet that bp_rewrite() now rests in the hands of the community, possibly to be born in Nexenta-land. Maybe it's time for me to quit whining, dust off my K&R book, and spend the coming weekends coming up with an honest implementation plan. Anyone want to join a task force for getting bp_rewrite() implemented as a community effort?
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: Neil Perrin [mailto:neil.per...@oracle.com] Hmm, I need to check, but if we get a checksum mismatch then I don't think we try other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. (It will be a week and a bit before I can report back on this, as I'm on vacation.)

Thanks... If that is indeed the behavior, then I would conclude: * Call it a bug; it needs a bug fix. * Prior to log device removal (zpool version 19) it is critical to mirror log devices. * After the introduction of log device removal, and before this bug fix is available, it is pointless to mirror log devices. * After this bug fix is introduced, it is again recommended to mirror slogs.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of StorageConcepts So I would say there are 2 bugs / missing features here: 1) the ZIL needs to report truncated transactions on ZIL corruption; 2) the ZIL should use its mirrored counterpart to recover from bad block checksums.

Add to that: during scrubs, perform some reads on log devices (even if there's nothing to read). In fact, during scrubs, perform some reads on every device (even if it's actually empty).
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote: * After the introduction of log device removal, and before this bug fix is available, it is pointless to mirror log devices.

That's a bit of an overstatement. Mirrored logs protect against a wide variety of failure modes. Neil just isn't sure whether the right thing happens for checksum errors, which is a very small subset of the possible device failure modes. - Eric -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote: 1) the ZIL needs to report truncated transactions on ZIL corruption

As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the last ZIL block without incurring additional writes for every block. If it's even possible to implement this paranoid-ZIL tunable, are you willing to take a 2-5x performance hit to be able to detect this failure mode? - Eric -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome
This paper is exactly what is needed -- giving a wide audience an overview of the fundamental ZFS components and benefits.

Thanks :)

I found several grammar errors -- to be expected in a draft -- and I think at least one technical error.

Will be fixed :)

The paper seems to imply that multiple vdevs will induce striping across the vdevs, a la RAIDx0. Though I haven't looked at the code, my understanding is that records are contained within a single vdev.

Well, according to http://www.filibeto.org/~aduritz/truetrue/solaris10/zfs-uth_3_v1.1_losug.pdf and other sources, ZFS uses so-called dynamic striping here: all data is spread across all top-level vdevs. This is also why the failure of a single vdev is critical to pool availability. Robert
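To make the dynamic-striping point concrete, here is a toy allocator model in Python (a deliberate oversimplification, nothing like the real metaslab allocator; the "most free space" policy and vdev names are invented for illustration): each record lands on whichever top-level vdev the allocator currently favors, so one file's records end up spread across all vdevs, and losing any single vdev loses part of every file.

    # Toy model of dynamic striping (NOT the real ZFS metaslab allocator).
    vdevs = [{"name": "raidz-%d" % i, "free": 1000} for i in range(3)]
    placement = {}

    for record in range(9):                           # pretend a file = 9 records
        target = max(vdevs, key=lambda v: v["free"])  # naive "most free" policy
        target["free"] -= 1
        placement.setdefault(target["name"], []).append(record)

    print(placement)                 # records interleave across raidz-0/1/2
    failed = "raidz-1"
    print("losing", failed, "loses records", placement[failed])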
Re: [zfs-discuss] pool scrub clean, filesystem broken
Thanks for the response, Victor. It is certainly still relevant in the sense that I am hoping to recover the data (although I've been informed the odds are strongly against me). My understanding is that Nexenta has been backporting ZFS code changes post build 134. I suppose that it could be an error they somehow introduced, or perhaps I've found a unique codepath that is also relevant pre-134 as well. Earlier today I was able to send some zdb dump information to Cindy, which hopefully will shed some light on the situation (I would be happy to send it to you as well). -brian

On Tue, Aug 17, 2010 at 10:37 AM, Victor Latushkin victor.latush...@sun.com wrote: Hi Brian, is it still relevant?

On 02.08.10 21:07, Brian Merrell wrote: Cindy, Thanks for the quick response. Consulting ZFS history I note the following actions: imported my three-disk raid-z pool, originally created on the most recent version of OpenSolaris but now running NexentaStor 3.03

Then we need to know what changes are in NexentaStor 3.03 on top of build 134. Nexenta folks are reading this list, so I hope they'll chime in. regards victor

upgraded my pool; destroyed two file systems I was no longer using (neither of these was, of course, the file system at issue); destroyed a snapshot on another filesystem; played around with permissions (these were my only actions directly on the file system). None of these actions seemed to have a negative impact on the filesystem, and it was working well when I gracefully shut down (to physically move the computer). I am a bit at a loss. With copy-on-write and a clean pool, how can I have corruption? -brian

On Mon, Aug 2, 2010 at 12:52 PM, Cindy Swearingen cindy.swearin...@oracle.com wrote: Brian, You might try using zpool history -il to see what ZFS operations, if any, might have led up to this problem. If zpool history doesn't provide any clues, then what other operations might have occurred prior to this state? It looks like something trampled this file system... Thanks, Cindy

On 08/02/10 10:26, Brian wrote: Thanks Preston. I am actually using ZFS locally, connected directly to 3 SATA drives in a raid-z pool. The filesystem is ZFS and it mounts without complaint, and the pool is clean. I am at a loss as to what is happening. -brian

-- Brian Merrell, Director of Technology, Backstop LLP, 1455 Pennsylvania Ave., N.W. Suite 400, Washington, D.C. 20004, 202-628-BACK (2225), merre...@backstopllp.com, www.backstopllp.com
Re: [zfs-discuss] NFS issue with ZFS
Peter, Here is where I am at right now. I can obviously read/write when using anon=0; that for sure works. But as you pointed out, it is also a security risk.

NFS-Server# zfs get sharenfs backup
NAME    PROPERTY  VALUE                               SOURCE
backup  sharenfs  rw=x.x.x.x,root=x.x.x.x,nosuid      local

This is how I have it set up with a direct setting. I'm actually using IP addresses, and that makes no difference, because doing that bypasses DNS services. This is what I get on the client:

# mount -F nfs NFS-SERVER:/backup /nfs/backup
nfs mount: NFS-SERVER:/backup: Permission denied

NFS-SERVER# id
uid=0(root) gid=0(root)
# cat /etc/passwd | grep root
root:x:0:0:Super-User:/:/sbin/sh

CLIENT# id
uid=0(root) gid=0(root)
# cat /etc/passwd | grep root
root:x:0:0:Super-User:/:/usr/bin/bash

As you can see, the only difference is that the client uses bash for its shell while the other uses sh. As I have mentioned before, UID and GID are not the issue. The only thing I have come up with is that there are 2 NFS patches that need updating - patch IDs 122300 and 117179 - to see if that fixes my issue; the others seem to be up to date. I guess this is as good a time as any to learn dtrace. Any suggestion on a dtrace script to use to try to see what is going on? Phillip

From: Phillip Bruce (Mindsource) Sent: Saturday, August 14, 2010 2:29 PM To: Peter Karlsson Cc: zfs-discuss@opensolaris.org Subject: RE: [zfs-discuss] NFS issue with ZFS

Peter, Thanks for the suggestions; I'm getting closer to solving the problem. It definitely works when using the anon setting - I can read/write to the filesystem all day long. But as you mentioned, using anon is a bad idea and a security risk, something I'd get my hand slapped for keeping in that configuration. I tried setting it directly as root but I keep getting permission denied. I will try this as the oracle user and see if I get the same thing. It doesn't make sense, as I'm right now using a Linux (CentOS) client and getting the same thing. Phillip

From: Peter Karlsson [peter.k.karls...@oracle.com] Sent: Friday, August 13, 2010 9:21 PM To: Phillip Bruce (Mindsource) Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] NFS issue with ZFS

On 8/14/10 11:49, Phillip Bruce (Mindsource) wrote: Peter, what would you expect for root? That is the user I am at.

root is by default mapped to anon, if you don't specifically export with the option to allow root on one or more clients to be mapped to local root on the server:

zfs set sharenfs=rw,root=host zpool/fs/to/export

where host is a ':'-separated list of hosts. Alternatively, if you want root from any host to be mapped to root on the server (bad idea), you can do something like this:

zfs set sharenfs=rw,anon=0 zpool/fs/to/export

to allow root access from all hosts. /peter

Like I already stated, it is NOT a UID or GID issue. Both systems are the same.

Try as a different user that has the same uid on both systems and has write access to the directory in question. Phillip

From: Peter Karlsson [peter.k.karls...@oracle.com] Sent: Friday, August 13, 2010 7:23 PM To: zfs-discuss@opensolaris.org; Phillip Bruce (Mindsource) Subject: Re: [zfs-discuss] NFS issue with ZFS

Hi Phillip, What are the permissions on the directory you are trying to write to, and what user are you using on the client system? It's most likely a UID mapping issue between the client and the server. /peter

On 8/14/10 3:19, Phillip Bruce wrote: I have Solaris 10 U7 that is exporting a ZFS filesystem. The client is Solaris 9 U7. I can mount the filesystem just fine but I am unable to write to it. showmount -e shows my mount is set for everyone, and the dfstab file has the rw option set. So what gives? Phillip
Re: [zfs-discuss] NFS issue with ZFS
Problem solved. Using the FQDN on the server end did the trick; the client did not have to use the FQDN.

zfs set sharenfs=rw=nfsclient.domain.com,rw=nfsclient.domain.com,nosuid backup

That worked. Both systems have nsswitch.conf set correctly for DNS, so this is an issue when trying to use DNS. But what boggles me is why, when I explicitly used the IP - which bypasses DNS - it did not work. Phillip

-----Original Message----- From: Peter Karlsson [mailto:peter.k.karls...@oracle.com] Sent: Friday, August 13, 2010 9:22 PM To: Phillip Bruce (Mindsource) Cc: zfs-discuss@opensolaris.org Subject: Re: [zfs-discuss] NFS issue with ZFS

On 8/14/10 11:49, Phillip Bruce (Mindsource) wrote: Peter, what would you expect for root? That is the user I am at.

root is by default mapped to anon, if you don't specifically export with the option to allow root on one or more clients to be mapped to local root on the server:

zfs set sharenfs=rw,root=host zpool/fs/to/export

where host is a ':'-separated list of hosts. Alternatively, if you want root from any host to be mapped to root on the server (bad idea), you can do something like this:

zfs set sharenfs=rw,anon=0 zpool/fs/to/export

to allow root access from all hosts. /peter

Like I already stated, it is NOT a UID or GID issue. Both systems are the same.

Try as a different user that has the same uid on both systems and has write access to the directory in question. Phillip

From: Peter Karlsson [peter.k.karls...@oracle.com] Sent: Friday, August 13, 2010 7:23 PM To: zfs-discuss@opensolaris.org; Phillip Bruce (Mindsource) Subject: Re: [zfs-discuss] NFS issue with ZFS

Hi Phillip, What are the permissions on the directory you are trying to write to, and what user are you using on the client system? It's most likely a UID mapping issue between the client and the server. /peter

On 8/14/10 3:19, Phillip Bruce wrote: I have Solaris 10 U7 that is exporting a ZFS filesystem. The client is Solaris 9 U7. I can mount the filesystem just fine but I am unable to write to it. showmount -e shows my mount is set for everyone, and the dfstab file has the rw option set. So what gives? Phillip
Re: [zfs-discuss] NFS issue with ZFS
Peter, I ran truss from the client side. Below is what I am getting. What strikes me as odd is that the client does a stat(64) call on the remote; it cannot find the NFS-SERVER:/backup volume at all. You get the IOCTL error just before that for the same reason. Keep in mind that when I use the anon=0 setting on the NFS server, I do not see this issue. The only remaining lead is maybe the 2 patches that may correct this. Again: NFS-SERVER is Solaris 10 U7 and NFS-CLIENT is Solaris 9 U7. I'll try installing the 2 patches I see missing for NFS and see if that corrects this issue.

root[...@nfs-client# truss -v all mount -F nfs NFS-SERVER:/backup /mnt
execve(/usr/sbin/mount, 0xFFBFFD2C, 0xFFBFFD44) argc = 5
resolvepath(/usr/lib/ld.so.1, /usr/lib/ld.so.1, 1023) = 16
resolvepath(/usr/sbin/mount, /usr/sbin/mount, 1023) = 15
stat(/usr/sbin/mount, 0xFFBFFB00) = 0
    d=0x0154 i=1562 m=0100555 l=1 u=0 g=2 sz=27448
    at = Aug 19 10:02:00 PDT 2010 [ 1282237320 ]
    mt = Oct 15 12:36:57 PDT 2002 [ 1034710617 ]
    ct = Aug 14 11:48:50 PDT 2005 [ 1124045330 ]
    bsz=8192 blks=54 fs=ufs
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
stat(/usr/lib/libcmd.so.1, 0xFFBFF608) = 0
    d=0x0154 i=2791 m=0100755 l=1 u=0 g=2 sz=22920
    at = Aug 19 10:01:45 PDT 2010 [ 1282237305 ]
    mt = Apr 6 12:47:04 PST 2002 [ 1018126024 ]
    ct = Aug 14 11:50:14 PDT 2005 [ 1124045414 ]
    bsz=8192 blks=46 fs=ufs
resolvepath(/usr/lib/libcmd.so.1, /usr/lib/libcmd.so.1, 1023) = 20
open(/usr/lib/libcmd.so.1, O_RDONLY) = 3
mmap(0x0001, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 0xFF3A
mmap(0x0001, 90112, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF38
mmap(0xFF38, 10440, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF38
mmap(0xFF394000, 1131, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 16384) = 0xFF394000
munmap(0xFF384000, 65536) = 0
memcntl(0xFF38, 3720, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
stat(/usr/lib/libc.so.1, 0xFFBFF608) = 0
    d=0x0154 i=3811 m=0100755 l=1 u=0 g=2 sz=867448
    at = Aug 19 10:02:00 PDT 2010 [ 1282237320 ]
    mt = Mar 6 13:44:23 PST 2006 [ 1141681463 ]
    ct = May 19 15:06:59 PDT 2006 [ 1148076419 ]
    bsz=8192 blks=1712 fs=ufs
resolvepath(/usr/lib/libc.so.1, /usr/lib/libc.so.1, 1023) = 18
open(/usr/lib/libc.so.1, O_RDONLY) = 3
mmap(0xFF3A, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF3A
mmap(0x0001, 802816, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF28
mmap(0xFF28, 702900, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF28
mmap(0xFF33C000, 24688, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 704512) = 0xFF33C000
munmap(0xFF32C000, 65536) = 0
memcntl(0xFF28, 117444, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3) = 0
stat(/usr/lib/libdl.so.1, 0xFFBFF608) = 0
    d=0x0154 i=2771 m=0100755 l=1 u=0 g=2 sz=3984
    at = Aug 19 10:02:00 PDT 2010 [ 1282237320 ]
    mt = Oct 30 22:51:47 PST 2005 [ 1130741507 ]
    ct = May 19 15:24:40 PDT 2006 [ 1148077480 ]
    bsz=8192 blks=8 fs=ufs
resolvepath(/usr/lib/libdl.so.1, /usr/lib/libdl.so.1, 1023) = 19
open(/usr/lib/libdl.so.1, O_RDONLY) = 3
mmap(0xFF3A, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF3A
mmap(0x2000, 8192, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF3FA000
mmap(0xFF3FA000, 1894, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF3FA000
close(3) = 0
stat(/usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1, 0xFFBFF318) = 0
    d=0x0154 i=3240 m=0100755 l=1 u=0 g=2 sz=16768
    at = Aug 19 10:02:00 PDT 2010 [ 1282237320 ]
    mt = Apr 6 14:27:58 PST 2002 [ 1018132078 ]
    ct = Aug 14 11:50:23 PDT 2005 [ 1124045423 ]
    bsz=8192 blks=34 fs=ufs
resolvepath(/usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1, /usr/platform/sun4u/lib/libc_psr.so.1, 1023) = 37
open(/usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1, O_RDONLY) = 3
mmap(0xFF3A, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF3A
mmap(0x2000, 16384, PROT_NONE, MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF3E6000
mmap(0xFF3E6000, 13544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0xFF3E6000
mmap(0x, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF37
close(3) = 0
munmap(0xFF3A, 8192) = 0
getustack(0xFFBFF944)
getrlimit(RLIMIT_STACK, 0xFFBFF93C) = 0
    cur = 8388608 max =
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corrupted area is lost, because the checksum of the first corrupted block doesn't match? Regards, Markus

Neil Perrin neil.per...@oracle.com wrote on 23 August 2010 at 19:44: This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means the minimum number of writes needed to add an intent log record is just one write. So corruption of an intent log is not going to generate any errors. Neil.

On 08/23/10 10:41, StorageConcepts wrote: Hello, we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases. The headline above all our tests is: do we still need to mirror the ZIL with all current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't export the pool, and with the latest upstream you can also import a pool with a missing ZIL)? This question is especially interesting with RAM-based devices, because they don't wear out, have a very low bit error rate, and use one PCIx slot - which are rare. Price is another aspect here :)

During our tests we found a strange behaviour on ZFS ZIL failures which are not device related, and we are looking for help from the ZFS gurus here :) The test in question is called "offline ZIL corruption". The question is: what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down? For this we do:

- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS, create a pool and add the ZIL
- ProductOS: issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)
- ProductOS: power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL, ~300 MB from the start of the disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the transaction number against the one recorded

We ran the test, and it seems that with modern snv_134 the pool comes up after the corruption with everything appearing OK, while ~1 transactions (some seconds of writes with the DDRX1) are missing and nobody knows about it. We ran a scrub, and scrub does not even detect this. ZFS automatically repairs the labels on the ZIL, but no error is reported about the missing data. While it is clear to us that if we do not have a mirrored ZIL, the data we have overwritten in the ZIL is lost, we are really wondering why ZFS does not REPORT this corruption instead of silently ignoring it. Is this a bug or ... ahem ... a feature :) ? Regards, Robert

-- StorageConcepts Europe GmbH - Storage: Beratung. Realisierung. Support. Markus Keil k...@storageconcepts.de http://www.storageconcepts.de
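Neil's description reduces to roughly the following read loop - a minimal Python sketch under invented names (read_block, the block layout, and the checksum choice are all assumptions for illustration, not the actual zil.c logic):

    import hashlib

    def read_log_chain(read_block, head_ptr):
        # Walk a chained intent log. Each block embeds its own checksum
        # and the pointer to the next block. An I/O error is reported,
        # but a checksum mismatch is treated as the normal end of the
        # chain -- so silent corruption mid-chain truncates the replay
        # without producing any error.
        records, ptr = [], head_ptr
        while ptr is not None:
            payload, stored_sum, next_ptr = read_block(ptr)  # IOError -> reported
            if hashlib.sha256(payload).digest() != stored_sum:
                break                 # assumed end of log... or corruption
            records.append(payload)
            ptr = next_ptr
        return records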
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum is detected, it is taken to be the end of the log, but this kind of defeats the checksum's original purpose, which is to detect device failure. Thus we would first need to change this behavior so the checksum is only used for failure detection. This leaves the question of how to detect the end of the log, which I think could be done by using a monotonically incrementing counter on the ZIL entries. Once we find an entry where the counter != n+1, we know that the previous block was the last one in the sequence. Now that we can use checksums to detect device failure, it would be possible to implement ZIL scrub, allowing an environment to detect ZIL device degradation before it actually results in a catastrophe. -- Saso

On 08/26/2010 03:22 PM, Eric Schrock wrote: On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote: 1) the ZIL needs to report truncated transactions on ZIL corruption

As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the last ZIL block without incurring additional writes for every block. If it's even possible to implement this paranoid-ZIL tunable, are you willing to take a 2-5x performance hit to be able to detect this failure mode? - Eric -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock
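Saso's counter scheme, sketched in Python (the entry fields are a hypothetical on-disk format, purely for illustration): a stale or reused block would still checksum correctly but carry a non-consecutive counter, which marks the true end of the log, so a checksum mismatch can then be reported as real corruption instead of being swallowed as end-of-chain.

    def walk_with_seq(entries):
        # entries: iterable of objects with .checksum_ok and .seq fields
        valid, expected = [], None
        for e in entries:
            if not e.checksum_ok:
                # checksum is now purely a failure detector
                raise IOError("ZIL block corrupt; try mirror / flag error")
            if expected is not None and e.seq != expected:
                break                 # stale counter: genuine end of log
            valid.append(e)
            expected = e.seq + 1
        return valid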
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
Actually - I can't read ZFS code, so the following assumptions are more or less based on brainware - excuse me in advance :) How does ZFS detect up-to-date ZILs? With the tnx check of the uberblock - right? In our corruption case, we had 2 valid uberblocks at the end and ZFS used those to import the pool; this is what the end-uberblock is for. OK, so the uberblock contains the pointer to the start of the ZIL chain - right? Assume we add the tnx number of the current transaction this ZIL is part of to the blocks written to the ZIL (specially packaged ZIL blocks). The ZIL blocks then become a little bit bigger than the data blocks, but the transaction count is the same. OK, for SSDs, block alignment might be an issue ... agreed. For DRAM-based ZILs this is not a problem - except for bandwidth.

Logic on ZIL import: check whether the pointer to the ZIL chain is empty. If yes - clean pool. If not - we need to replay. Now, if the block the root pointer points to is OK (checksum), the ZIL is used and replayed. At the end, the tnx of the last ZIL block must equal the pool tnx. If equal, then OK; if not, report an error about missing ZIL parts and switch to the mirror (if available).

As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the last ZIL block without incurring additional writes for every block. If it's even possible to implement this paranoid-ZIL tunable, are you willing to take a 2-5x performance hit to be able to detect this failure mode?

Robert
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
On 26/08/2010 15:08, Saso Kiselkov wrote: If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum

It is NOT circular, since that implies a limited number of entries that get overwritten.

is detected, it is taken to be the end of the log, but this kind of defeats the checksum's original purpose, which is to detect device failure. Thus we would first need to change this behavior so the checksum is only used for failure detection. This leaves the question of how to detect the end of the log, which I think could be done by using a monotonically incrementing counter on the ZIL entries. Once we find an entry where the counter != n+1, we know that the previous block was the last one in the sequence.

See the comment partway down zil_read_log_block about how we do something pretty much like that for checking the chain of log blocks: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block This is the checksum in the BP checksum field. But before we even get there we check the ZILOG2 checksum as part of doing the zio (in the zio_checksum_verify() stage): http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error A ZILOG2 checksum is an embedded-in-the-block version of fletcher4 (at the start of the block; the original ZILOG put it at the end). If that fails - i.e. the block is corrupt - we return an error back through the dsl_read() of the log block. -- Darren J Moffat
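For reference, fletcher-4 itself is tiny. A Python sketch of the algorithm as commonly described - four 64-bit running sums over little-endian 32-bit words - with no attempt to match the interface of the authoritative implementation in the ZFS source:

    import struct

    def fletcher4(data):
        # data length must be a multiple of 4 bytes
        a = b = c = d = 0
        mask = (1 << 64) - 1              # accumulators wrap mod 2**64
        for (w,) in struct.iter_unpack("<I", data):
            a = (a + w) & mask
            b = (b + a) & mask
            c = (c + b) & mask
            d = (d + c) & mask
        return (a, b, c, d)               # 256-bit checksum as 4 words

    print(fletcher4(b"ZFS!" * 8))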
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
On Wed, August 25, 2010 23:00, Neil Perrin wrote: On 08/25/10 20:33, Edward Ned Harvey wrote: It's commonly stated that even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and only detect that the device has failed upon read. So ... if an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data?

- Indeed, we wouldn't know we lost data.

Does a scrub go through the slog and/or L2ARC devices, or only the primary storage components? If it doesn't go through these secondary devices, that may be a useful RFE, as one would ideally want to test the data on every component of a storage system.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
I see, thank you for the clarification. So it is possible to have something equivalent to main-storage self-healing on the ZIL, with a ZIL scrub to activate it. Or is that already implemented as well? (Sorry for asking these obvious questions, but I'm not familiar with the ZFS source code.) -- Saso

On 08/26/2010 04:31 PM, Darren J Moffat wrote: On 26/08/2010 15:08, Saso Kiselkov wrote: If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum

It is NOT circular, since that implies a limited number of entries that get overwritten.

is detected, it is taken to be the end of the log, but this kind of defeats the checksum's original purpose, which is to detect device failure. Thus we would first need to change this behavior so the checksum is only used for failure detection. This leaves the question of how to detect the end of the log, which I think could be done by using a monotonically incrementing counter on the ZIL entries. Once we find an entry where the counter != n+1, we know that the previous block was the last one in the sequence.

See the comment partway down zil_read_log_block about how we do something pretty much like that for checking the chain of log blocks: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block This is the checksum in the BP checksum field. But before we even get there we check the ZILOG2 checksum as part of doing the zio (in the zio_checksum_verify() stage): http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error A ZILOG2 checksum is an embedded-in-the-block version of fletcher4 (at the start of the block; the original ZILOG put it at the end). If that fails - i.e. the block is corrupt - we return an error back through the dsl_read() of the log block.
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Neil Perrin This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means the minimum number of writes needed to add an intent log record is just one write. So corruption of an intent log is not going to generate any errors.

I didn't know that. Very interesting. This raises another question ... It's commonly stated that even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and only detect that the device has failed upon read. So ... if an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data?

If the drive's firmware isn't returning a write error of any kind, then there isn't much that ZFS can really do here (regardless of whether this is an SSD or not). Turning every write into a read/write operation would totally defeat the purpose of the ZIL. It's my understanding that SSDs will eventually transition to read-only devices once they've exceeded their spare reallocation blocks. This should propagate to the OS as an EIO, which means that ZFS will instead store the ZIL data on the main storage pool.

Worse yet ... In preparation for the above SSD failure mode, it's commonly recommended to still mirror your log device, even if you have log device removal. If you have a mirror, and the data on each half of the mirror doesn't match the other (one device failed, and the other device is good) ... Do you read the data from *both* sides of the mirror, in order to discover the corrupted log device, and correctly move forward without data loss?

Yes, we read all sides of the mirror when we claim (i.e. read) the log blocks for a log device. This is exactly what a scrub would do for a mirrored data device. - George
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
David Magda wrote: Does a scrub go through the slog and/or L2ARC devices, or only the primary storage components?

A scrub will go through slogs and primary storage devices. The L2ARC device is considered volatile, and data loss is not possible should it fail. - George
Re: [zfs-discuss] ZFS offline ZIL corruption not detected
Edward Ned Harvey wrote: Add to that: during scrubs, perform some reads on log devices (even if there's nothing to read).

We do read from log devices if there is data stored on them.

In fact, during scrubs, perform some reads on every device (even if it's actually empty).

Reading from the data portion of an empty device wouldn't really show us much, as we would be reading a bunch of non-checksummed data. The best we can do is to probe the device's label region to determine its health, which is exactly what we do today. - George
Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome
On Wed, Aug 25, 2010 at 10:57 PM, StorageConcepts presa...@storageconcepts.de wrote: Thanks for the feedback. The idea of it is to give people new to ZFS an understanding of the terms and mode of operation, to avoid common problems (wide-stripe pools etc.). Also agreed that it is a little NexentaStor-tweaked :) I think I have to rework the ZIL section anyhow because of http://opensolaris.org/jive/thread.jspa?threadID=133294&tstart=0 - I have to do some experiments here - and I will also adopt a dual-command strategy, showing NexentaStor commands AND OpenSolaris commands whenever a command is shown.

I haven't finished reading it yet (okay, barely read through the contents list), but would you be interested in the FreeBSD equivalents for the commands, if they differ? -- Freddie Cash fjwc...@gmail.com
Re: [zfs-discuss] slog and TRIM support [SEC=UNCLASSIFIED]
On Wed, Aug 25, 2010 at 6:18 PM, Wilkinson, Alex alex.wilkin...@dsto.defence.gov.au wrote: On Wed, Aug 25, 2010 at 02:54:42PM -0400, LaoTsao wrote: IMHO, U want -E for ZIL and -M for L2ARC

Why?

-E uses SLC flash, which is optimised for fast writes - ideal for a ZIL, which is (basically) write-only. -M uses MLC flash, which is optimised for fast reads - ideal for an L2ARC, which is (basically) read-only. -E tends to have smaller capacities, which is fine for a ZIL. -M tends to have larger capacities, which is perfect for an L2ARC. -- Freddie Cash fjwc...@gmail.com
[zfs-discuss] Write Once Read Many on ZFS
Hi, I'd like to know if there is a way to use a WORM property on ZFS. Thanks Douglas
Re: [zfs-discuss] ZFS Storage server hardware
On Wed, Aug 25, 2010 at 12:29 PM, Dr. Martin Mundschenk m.mundsch...@me.com wrote: Well, I wonder what are the components to build a stable system without having an enterprise solution: eSATA, USB, FireWire, FibreChannel?

If it's possible to get a card to fit into a Mac mini, eSATA would be a lot better than USB or FireWire. If there's any way to run cables from inside the case, you can make do with plain SATA and longer cables. Otherwise, you'll need to look into something other than a Mac mini for your storage box. If bandwidth is a concern, consider these napkin estimates:

bare 7200rpm SATA drive: typically 60 MB/s
SATA II: 300 MB/s
Gigabit Ethernet: 60 MB/s
100 Mbit Ethernet: 11.4 MB/s
USB 2.0: 30 MB/s
FireWire 400: 35 MB/s
FireWire 800: 55 MB/s

I usually see 17 MB/s max on an external USB 2.0 drive. For 300 GB, USB will take at best 5.5 hrs; SATA will be 1.5 hrs. If you have more than one process at a time hitting the drive, your speeds go through the floor. For home use, that may be OK. For a business, not so much.
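For anyone who wants to plug in their own numbers, here is the napkin math behind those transfer times (in Python; the 5.5 h figure evidently assumes a sustained rate somewhat below the 17 MB/s peak):

    def hours(gigabytes, mb_per_s):
        # GB -> MiB, then seconds -> hours
        return gigabytes * 1024.0 / mb_per_s / 3600.0

    print(round(hours(300, 17), 1))   # ~5.0 h over USB 2.0 at 17 MB/s
    print(round(hours(300, 60), 1))   # ~1.4 h over bare SATA at 60 MB/s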
Re: [zfs-discuss] NFS issue with ZFS
pb( == Phillip Bruce (Mindsource) v-phb...@microsoft.com writes:

pb( Problem solved. Using the FQDN on the server end worked. The client did not have to use the FQDN.

1. Your syntax is wrong. You must use netgroup syntax to specify an IP, otherwise it will think you mean the hostname made up of those numbers and dots as characters:

NAME             PROPERTY  VALUE
andaman/arrchive sharenfs  rw=@10.100.100.0/23:@192.168.2.3/32

2. There's a bug in mountd - well, there are many bugs in mountd, but this is the one I ran into, and it makes the netgroup syntax mostly useless: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832 One workaround is to give every IP a reverse lookup, e.g. using a BIND $GENERATE directive or something. I just use a big /etc/hosts covering every IP to which I've exported. I suppose actually fixing mountd would be what a good sysadmin would have done: it can't be that hard.
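The big-/etc/hosts workaround can be scripted. A hypothetical Python sketch (the subnet and host-naming scheme are made up) that emits one dummy entry per address in an exported network, so mountd's reverse lookup never comes up empty:

    import ipaddress

    # Paste the output into /etc/hosts on the NFS server.
    for ip in ipaddress.ip_network("10.100.100.0/23").hosts():
        print("%s\thost-%s" % (ip, str(ip).replace(".", "-")))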
[zfs-discuss] zfs/iSCSI: 0000 = SNS Error Type: Current Error (0x70)
Hi, I'm trying to track down an error with a 64-bit x86 OpenSolaris 2009.06 ZFS volume shared via iSCSI to an Ubuntu 10.04 client. The client can successfully log in, but no device node appears. I captured a session with Wireshark. When the client attempts a SCSI Inquiry LUN: 0x00, OpenSolaris sends a SCSI Response (Check Condition) LUN: 0x00 that contains the following:

.111 0000 = SNS Error Type: Current Error (0x70)
Filemark: 0, EOM: 0, ILI: 0
.... 0100 = Sense Key: Hardware Error (0x04)

The ZFS being exported is a 400 GB chunk of a 1 TB ZFS mirror. The underlying OS reports no hardware errors, and zpool status looks OK. Why would OpenSolaris give this error? Is there anything I can do about it? Any suggestions would be appreciated. (I discussed this with the open-iscsi people at http://groups.google.com/group/open-iscsi/browse_thread/thread/06b83227ffc6a31a/2e58a163e21ec74e#2e58a163e21ec74e.) Thanks, ==ml
Re: [zfs-discuss] ZFS Storage server hardware
On Thu, August 26, 2010 13:58, Tom Buskey wrote: I usually see 17 MB/s max on an external USB 2.0 drive.

Interesting; I routinely see 27 MB/s, peaking at 30 MB/s, on the cheap WD 1TB external drives I use for backups. (Backup is probably the best case; the only user of that drive is a zfs receive process.) -- David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Re: [zfs-discuss] Networker Dedup @ ZFS
Hi Daniel, We were looking into very much the same solution you've tested. Thanks for your advice; I think we will look for something else. :) Just out of curiosity, what ZFS tweaking did you do? And what much pricier competitor solution did you end up with in the end? Regards, Sigbjorn

Daniel Whitener wrote: Sigbjorn, Stop! Don't do it... it's a waste of time. We tried exactly what you're thinking of... we bought two Sun/Oracle 7000 series storage units with 20 TB of ZFS storage each, planning to use them as a backup target for Networker. We ran into several issues and eventually gave up on the ZFS + Networker combo. We've used other storage devices in the past (virtual tape libraries) that had deduplication, and we were used to seeing dedup ratios better than 20x on our backup data. The ZFS filesystem only gave us 1.03x, and it had regular issues because it couldn't easily dedup such large filesystems. We didn't know it ahead of time, but VTL solutions use something called variable-length block dedup, whereas ZFS uses fixed-block-length dedup. Like one of the other posters mentioned, things just don't line up right and the dedup ratio suffers. Yes, compression works to some degree - I think we got 2 or 3x on that, but it was a far cry from the 20x that we were used to seeing on our old VTL. We recently ditched the 7000 series boxes in favor of a much pricier competitor. It's about double the cost, but dedup ratios are better than 20x. Personally I love ZFS and I use it in many other places, but we were very disappointed with the dedup ability for that type of data. We went to Sun with our problems and they ran it up the food chain, and word came back down from the developers that this was the way it was designed and it's not going to change anytime soon. The type of files that Networker writes out is just not friendly at all to the dedup mechanism used in ZFS. They gave us a few ideas and things to tweak in Networker, but no measurable gains ever came from any of the tweaks. If you are considering a home-grown ZFS solution for budget reasons, go for it - just do yourself a favor and save yourself the overhead of trying to dedup. When we disabled dedup on our 7000 series boxes, everything worked great, and compression was fine with next to no overhead. Unfortunately, we NEEDED at least a 10x ratio to keep the 3 weeks of backups we were trying to do. We couldn't even keep 1 week of backups with the dedup performance of ZFS. If you need more details, I'm happy to help. We went through months of pain trying to make it work, and it just doesn't for Networker data. best wishes Daniel

2010/8/18 Sigbjorn Lie sigbj...@nixtra.com: Hi, We are considering using ZFS-based storage as a staging disk for Networker. We're aiming at providing enough storage to be able to keep 3 months' worth of backups on disk before they're moved to tape. To provide storage for 3 months of backups, we want to utilize the dedup functionality in ZFS. I've searched around for these topics and found no success stories; however, those who have tried did not mention whether they had attempted to change the blocksize to anything smaller than the default of 128k. Does anyone have any experience with this kind of setup? Regards, Sigbjorn
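The fixed-vs-variable-block point is easy to demonstrate. A self-contained Python sketch (the content-defined chunker here is a deliberately crude toy, nothing like any real VTL's algorithm): prepending a single byte to a stream shifts every fixed-size block boundary, so almost nothing dedups between the two copies, while content-defined boundaries resynchronize just after the insertion.

    import hashlib, os

    def fixed_chunks(data, size=128 * 1024):
        return [data[i:i + size] for i in range(0, len(data), size)]

    def cdc_chunks(data, mask=(1 << 12) - 1, win=16):
        # Toy content-defined chunking: cut where a hash of the last
        # `win` bytes hits a magic value (average chunk ~4 KiB here).
        chunks, start = [], 0
        for i in range(win, len(data)):
            h = int.from_bytes(hashlib.md5(data[i - win:i]).digest()[:4], "big")
            if h & mask == mask:
                chunks.append(data[start:i])
                start = i
        chunks.append(data[start:])
        return chunks

    def unique_ratio(a, b):
        hashes = lambda cs: [hashlib.sha256(c).digest() for c in cs]
        all_chunks = hashes(a) + hashes(b)
        return len(set(all_chunks)), len(all_chunks)

    base = os.urandom(1 << 20)     # 1 MiB "backup stream"
    shifted = b"X" + base          # same data, one byte prepended

    print("fixed 128K:      %d unique of %d chunks"
          % unique_ratio(fixed_chunks(base), fixed_chunks(shifted)))
    print("content-defined: %d unique of %d chunks"
          % unique_ratio(cdc_chunks(base), cdc_chunks(shifted)))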
Re: [zfs-discuss] Networker Dedup @ ZFS
IMHO, if you use backup software that supports dedup in the software itself, then ZFS is still a viable solution.

On 8/26/2010 6:13 PM, Sigbjørn Lie wrote: Hi Daniel, We were looking into very much the same solution you've tested. Thanks for your advice; I think we will look for something else. :) Just out of curiosity, what ZFS tweaking did you do? And what much pricier competitor solution did you end up with in the end? Regards, Sigbjorn [...]
[zfs-discuss] ZFS with SAN's and HA
Hey all, I currently work for a company that has purchased a number of different SAN solutions (whatever was cheap at the time!), and I want to set up an HA ZFS file store over fibre channel. Basically, I've taken slices from each of the SANs and added them to a ZFS pool on this box (which I'm calling a 'ZFS proxy'). I've then carved out LUNs from this pool and assigned them to other servers. I then have snapshots taken on each of the LUNs and replication off-site for DR. This all works perfectly (backups for ESXi!). However, I'd like to be able to a) expand and b) make it HA. All the documentation I can find on setting up an HA cluster for file stores replicates data between 2 servers and then serves from those computers (I trust the SANs to take care of the data and don't want to replicate anything - cost!). Basically, all I want is for the node that serves the ZFS pool to be HA (if this were to be put into production, we have around 128 TB and are looking to expand to a PB). We have a couple of IBM SVCs that seem to handle the HA node setup in some obscure proprietary IBM way, so logically it seems possible. Clients would only be making changes via a single 'ZFS proxy' at a time (multipathing set up for failover only), so I don't believe I'd need to OCFS the setup? If I do need to set up OCFS, can I put ZFS on top of that? (I want snapshotting/rollback and replication to an off-site location, as well as all the goodness of thin provisioning and de-duplication.) However, when I imported the ZFS pool onto the 2nd box, I got large warnings about it being mounted elsewhere and I needed to force the import; then, when importing the LUNs, I saw that the GUID was different, so multipathing doesn't detect that the LUNs are the same. Can I change a GUID via stmfadm? Is any of this even possible over fibre channel? Is anyone able to point me at some documentation? Am I simply crazy? Any input would be most welcome. Thanks in advance,
Re: [zfs-discuss] ZFS with SAN's and HA
Be very careful here!!

On 8/26/2010 9:16 PM, Michael Dodwell wrote: Hey all, I currently work for a company that has purchased a number of different SAN solutions (whatever was cheap at the time!), and I want to set up an HA ZFS file store over fibre channel. [...] However, when I imported the ZFS pool onto the 2nd box, I got large warnings about it being mounted elsewhere and I needed to force the import; then, when importing the LUNs, I saw that the GUID was different, so multipathing doesn't detect that the LUNs are the same. Can I change a GUID via stmfadm?

If you force the import and the ZFS pool is mounted by two hosts, your pool could become corrupted!!! Recovery is not easy.

Is any of this even possible over fibre channel?

Please at least take a look at the documentation for the Oracle Solaris Cluster software; it details the way to use ZFS in a cluster environment: http://docs.sun.com/app/docs/prod/sun.cluster32?l=en&a=view and the ZFS-specific part: http://docs.sun.com/app/docs/doc/820-7359/gbspx?l=en&a=view

Is anyone able to point me at some documentation? Am I simply crazy? Any input would be most welcome. Thanks in advance,
Re: [zfs-discuss] ZFS with SAN's and HA
Lao, I had a look at HAStoragePlus etc., and from what I understand that's for mirroring local storage across 2 nodes so services can access it, 'DRBD style'. Having read through the documentation on the Oracle site, the cluster software, from what I gather, is about clustering services together (Oracle/Apache etc.), and again any documentation I've found on storage is about duplicating local storage to multiple hosts for HA failover. I can't really see anything on clustering services to use shared storage / ZFS pools.
[zfs-discuss] VM's on ZFS - 7210
We are using a 7210 - 44 disks, I believe, in 11 stripes of RAIDZ sets. When I installed, I selected the best bang for the buck on the speed-vs-capacity chart. We run about 30 VMs on it, across 3 ESX 4 servers. Right now it's all running NFS, and it sucks... sooo slow. iSCSI was no better. I am wondering how I can increase the performance, because they want to add more VMs. The good news is most are idle-ish, but even idle VMs create a lot of random chatter to the disks! So, a few options, maybe (weighed below):

1) Change to iSCSI mounts for ESX, and enable write cache on the LUNs, since the 7210 is on a UPS.
2) Get a Logzilla SSD mirror. (Do SSDs fail? Do I really need a mirror?)
3) Reconfigure the NAS to RAID10 instead of RAIDZ.

Obviously all 3 would be ideal, though with an SSD can I keep using NFS at the same performance, since the sync writes would be satisfied by the SSD? I dread getting the OK to spend the $$,$$$ on SSDs and then not getting the performance increase we want. How would you weigh these? I noticed in testing on a 5-disk OpenSolaris box that changing from a single RAIDZ pool to RAID10 netted a larger IOPS increase than adding an Intel SSD as a Logzilla. That's not going to scale the same, though, with a 44-disk, 11-vdev striped RAIDZ set. Some thoughts? Would simply moving to write-cache-enabled iSCSI LUNs without an SSD speed things up a lot by itself?
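Rough arithmetic for option 3, using the rule of thumb from earlier in this digest that each RAIDZ vdev delivers about one disk's worth of random IOPS (a Python back-of-envelope; the 100 IOPS per 7200rpm spindle and the 11 x 4-disk layout are assumptions):

    disks, iops_per_disk = 44, 100      # assumed 7200rpm-class spindles

    raidz_vdevs = 11                    # current: assumed 11 x 4-disk RAIDZ
    mirror_vdevs = disks // 2           # option 3: 22 x 2-way mirrors

    print("RAIDZ  random-write IOPS ~", raidz_vdevs * iops_per_disk)   # ~1100
    print("RAID10 random-write IOPS ~", mirror_vdevs * iops_per_disk)  # ~2200
    # Random reads from mirrors can do better still, since both sides
    # of a mirror can service independent reads.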