Re: [zfs-discuss] ZFS Storage server hardware

2010-08-26 Thread Dr. Martin Mundschenk

Am 26.08.2010 um 04:38 schrieb Edward Ned Harvey:

 There is no such thing as reliable external disks.  Not unless you want to
 pay $1000 each, which is dumb.  You have to scrap your mini, and use
 internal (or hotswappable) disks.
 
 Never expect a mini to be reliable.  They're designed to be small and cute.
 Not reliable.


The MacMini and the disks themselves are just fine. The problem seems to be the
SATA-to-USB/FireWire bridges: they just stall when the load gets heavy.

Martin


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Hello,
actually this is bad news.

I always assumed that the mirror redundancy of the ZIL can also be used to handle
bad blocks on the ZIL device (just as the main pool's self-healing does for data
blocks).

I actually don't know how SSDs die; because of the wear-out characteristics
I can imagine an increased number of bad blocks / bit errors at the EOL of such
a device - probably undiscovered.

Because the ZIL is write-only, you only know whether it worked when you need it -
which is bad. So my suggestion was always to run with one ZIL device during
pre-production, and add the ZIL mirror 2 weeks later when production starts.
This way they don't age exactly the same, and the second ZIL device has 2 more
weeks of expected lifetime (or even more, assuming the usual heavier writes
during stress testing).

I would call this pre-aging. However, if the second ZIL device is not used to recover
from bad blocks, this does not make a lot of sense.
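
(For what it's worth, the staggered-mirror approach is easy to set up after the fact;
a minimal sketch, assuming a hypothetical pool 'tank' and made-up device names:)

# pre-production: start with a single log device
zpool add tank log c4t0d0
# when production starts, attach the second (younger) device to form the log mirror
zpool attach tank c4t0d0 c5t0d0
# confirm the log now shows up as a mirror and has resilvered
zpool status tank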

So I would say there are 2 bugs / missing features in this:

1) The ZIL needs to report truncated transactions on ZIL corruption.
2) The ZIL should use its mirrored counterpart to recover from bad block checksums.

Now with OpenSolaris being closed by Oracle and Illumos being just started, I
don't know where to file bugs :) - is bugs.opensolaris.org still maintained?

Regards, 
Robert


Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome

2010-08-26 Thread Marty Scholes
This paper is exactly what is needed -- giving an overview to a wide audience 
of the ZFS fundamental components and benefits.

I found several grammar errors -- to be expected in a draft -- and I think at
least one technical error.

The paper seems to imply that multiple vdevs will induce striping across the
vdevs, a la RAID X0.  Though I haven't looked at the code, my understanding is
that individual records are confined to a single vdev.

The clarification that each vdev gives iops roughly equivalent to a single disk 
is useful information not generally understood.  I was glad to see it there.

Overall, this is a terrific step forward for understanding ZFS and encouraging 
its adoption.

Now if only SRSS would work under Nexenta...


Re: [zfs-discuss] shrink zpool

2010-08-26 Thread Marty Scholes
 Is it currently possible, or will it be in the near future, to shrink a
 zpool / remove a disk?

As others have noted, no - not until the mythical bp_rewrite() function is
introduced.

So far I have found no documentation on bp_rewrite(), other than that it is the
solution to evacuating a vdev, restriping a vdev, defragmenting a vdev, solving
world hunger and bringing peace to the Middle East.

If you search the forums you will find all sorts of discussion around this
elusive feature, but nothing concrete.  I think it's hiding behind the unicorn
at the end of the rainbow.

With Oracle withdrawing/inhousing/whatever development, it's a safe bet that 
bp_rewrite() now rests in the hands of the community, possibly to be born in 
Nexenta-land.

Maybe it's time for me to quit whining, dust off my K&R book and spend the
coming weekends working up an honest implementation plan.

Anyone want to join a task force for getting bp_rewrite() implemented as a 
community effort?


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
 From: Neil Perrin [mailto:neil.per...@oracle.com]
 
 Hmm, I need to check, but if we get a checksum mismatch then I don't
 think we try other
 mirror(s). This is automatic for the 'main pool', but of course the ZIL
 code is different
 by necessity. This problem can of course be fixed. (It will be  a week
 and a bit before I can
 report back on this, as I'm on vacation).

Thanks...

If indeed that is the behavior, then I would conclude:
* Call it a bug.  It needs a bug fix.
* Prior to log device removal (zpool version 19) it is critical to mirror the log
device.
* After the introduction of log device removal, and before this bug fix is available,
it is pointless to mirror log devices.
* After this bug fix is introduced, it is again recommended to mirror slogs.
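
(For reference, a rough sketch of the operations being discussed, with a hypothetical
pool 'tank' and made-up device names:)

# mirror the slog from the start
zpool create tank mirror c0t0d0 c0t1d0 log mirror c0t2d0 c0t3d0
# from pool version 19 on, the log can also be removed again if needed
# (the vdev name, e.g. mirror-1, is whatever zpool status reports for it)
zpool remove tank mirror-1
zpool upgrade -v   # lists pool versions; 19 is the log device removal feature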



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of StorageConcepts
 
 So would say there are 2 bugs / missing features in this:
 
 1) zil needs to report truncated transactions on zilcorruption
 2) zil should need mirrored counterpart to recover bad block checksums

Add to that:

During scrubs, perform some reads on log devices (even if there's nothing to
read).
In fact, during scrubs, perform some reads on every device (even if it's
actually empty.)
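
(Until something like that exists, the closest approximation is an ordinary scrub
followed by a status check; pool name is hypothetical:)

zpool scrub tank
zpool status -v tank   # per-device read/write/checksum error counters, log devices included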



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock

On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote:
 * After introduction of ldr, before this bug fix is available, it is
 pointless to mirror log devices.

That's a bit of an overstatement.  Mirrored logs protect against a wide variety 
of failure modes.  Neil just isn't sure if it does the right thing for checksum 
errors.  That is a very small subset of possible device failure modes.

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Eric Schrock

On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:
 
 1) zil needs to report truncated transactions on zilcorruption

As Neil outlined, this isn't possible while preserving current ZIL performance. 
 There is no way to distinguish the last ZIL block without incurring 
additional writes for every block.  If it's even possible to implement this 
paranoid ZIL tunable, are you willing to take a 2-5x performance hit to be 
able to detect this failure mode?

- Eric

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock



Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome

2010-08-26 Thread StorageConcepts
 This paper is exactly what is needed -- giving an
 overview to a wide audience of the ZFS fundamental
 components and benefits.

Thanks :)

 I found several grammar errors -- to be expected in a
 draft and I think at least one technical error.

Will be fixed :)

 The paper seems to imply that multiple vdevs will
 induce striping across the vdevs, ala RAIDx0.  Though
 I haven't looked at the code, my understanding is
 that records are contained to a single vdev.

Well, according to
http://www.filibeto.org/~aduritz/truetrue/solaris10/zfs-uth_3_v1.1_losug.pdf
and other sources, ZFS uses so-called dynamic striping here: all data is spread
across all top-level vdevs. This is also why the failure of a single vdev is critical to
pool availability.
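
(You can watch the dynamic striping happen with the standard tools; pool name hypothetical:)

# per-vdev breakdown of IOPS and bandwidth, refreshed every 5 seconds; under a
# streaming write load every top-level vdev should show activity
zpool iostat -v tank 5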
 
Robert


Re: [zfs-discuss] pool scrub clean, filesystem broken

2010-08-26 Thread Brian Merrell
Thanks for the response Victor.  It is certainly still relevant in the sense
that I am hoping to recover the data (although I've been informed the odds
are strongly against me)

My understanding is that Nexenta has been backporting ZFS code changes post
134.  I suppose that it could be an error they somehow introduced or perhaps
I've found a unique codepath that is also relevant pre-134 as well.
Earlier today I was able to send some zdb dump information to Cindy which
hopefully will shed some light on the situation (I would be happy to send to
you as well)

-brian

On Tue, Aug 17, 2010 at 10:37 AM, Victor Latushkin victor.latush...@sun.com
 wrote:

 Hi Brian,

 is it still relevant?


 On 02.08.10 21:07, Brian Merrell wrote:

 Cindy,

 Thanks for the quick response.  Consulting ZFS history I note the
 following actions:

 imported my three-disk raid-z pool, originally created on the most
 recent version of OpenSolaris but now running NexentaStor 3.03


 Then we need to know what changes are there in NexentaStor 3.03 on top of
 build 134. Nexenta folks are reading this list, so I hope they'll chime in.

 regards
 victor


  upgraded my pool
 destroyed two file systems I was no longer using (neither of these were
 of course the file system at issue)
 destroyed a snapshot on another filesystem
 played around with permissions (these were my only actions directly on the
 file system)

 None of these actions seemed to have a negative impact on the filesystem
 and it was working well when I gracefully shutdown (to physically move the
 computer).

 I am a bit at a loss.  With copy-on-write and a clean pool how can I have
 corruption?

 -brian



 On Mon, Aug 2, 2010 at 12:52 PM, Cindy Swearingen 
 cindy.swearin...@oracle.com mailto:cindy.swearin...@oracle.com wrote:

Brian,

You might try using zpool history -il to see what ZFS operations,
if any, might have led up to this problem.

If zpool history doesn't provide any clues, then what other
operations might have occurred prior to this state?

It looks like something trampled this file system...

Thanks,

Cindy

On 08/02/10 10:26, Brian wrote:

Thanks Preston.  I am actually using ZFS locally, connected
directly to 3 sata drives in a raid-z pool. The filesystem is
ZFS and it mounts without complaint and the pool is clean.  I am
at a loss as to what is happening.
-brian




 --
 Brian Merrell, Director of Technology
 Backstop LLP
 1455 Pennsylvania Ave., N.W.
 Suite 400
 Washington, D.C.  20004
 202-628-BACK (2225)
 merre...@backstopllp.com mailto:merre...@backstopllp.com
 www.backstopllp.com http://www.backstopllp.com


 

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



 --
 --
 Victor Latushkin   phone: x11467 / +74959370467
 TSC-Kernel EMEAmobile: +78957693012
 Sun Services, Moscow   blog: http://blogs.sun.com/vlatushkin
 Sun Microsystems




-- 
Brian Merrell, Director of Technology
Backstop LLP
1455 Pennsylvania Ave., N.W.
Suite 400
Washington, D.C.  20004
202-628-BACK (2225)
merre...@backstopllp.com
www.backstopllp.com


Re: [zfs-discuss] NFS issue with ZFS

2010-08-26 Thread Phillip Bruce (Mindsource)
Peter,

Here is where I am at right now.

I can obviously read/write when using anon=0; that for sure works.
But as you pointed out, it is also a security risk.

NFS-Server# zfs get sharenfs backup
NAMEPROPERTY  VALUE SOURCE
backup  sharenfs  rw=x.x.x.x,root=x.x.x.x,nosuid  local
#

This is how I have it set up using direct settings. I'm actually using the IP
address, and that makes no difference because I'm bypassing DNS services by
doing that. This is what I get on the client below:

# mount -F nfs NFS-SERVER:/backup /nfs/backup
nfs mount: NFS-SERVER:/backup: Permission denied

NFS-SERVER# id
uid=0(root) gid=0(root)

# cat /etc/passwd | grep root
root:x:0:0:Super-User:/:/sbin/sh


CLIENT# id
uid=0(root) gid=0(root)

# cat /etc/passwd | grep root
root:x:0:0:Super-User:/:/usr/bin/bash

As you can see, the only difference is that the client is using bash for its shell
while the other uses sh.
As I have mentioned before, UID and GID are not the issue.

The only other thing I have come up with is that there are two NFS patches that need
updating, patch IDs 122300 and 117179. I'll apply those and see if that fixes my
issue; the others seem to be up to date.

I guess this is as good a time as any to learn DTrace. Any suggestions on a DTrace
script to use to see what is going on?

Phillip
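
(Two starting points I'd try, both run on the server; the interface name below is a
placeholder:)

# capture the MOUNT request and the reply status the client actually gets back
snoop -d e1000g0 host NFS-CLIENT
# watch what mountd does (name-service lookups etc.) while the client retries the mount
truss -f -p `pgrep mountd`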



From: Phillip Bruce (Mindsource)
Sent: Saturday, August 14, 2010 2:29 PM
To: Peter Karlsson
Cc: zfs-discuss@opensolaris.org
Subject: RE: [zfs-discuss] NFS issue with ZFS

Peter,

Thanks for the suggestions, I'm getting closer to solving the problem.
It definitely works when using the anon setting; I can read / write to the
filesystem all day long. But as you mentioned, using anon is a bad idea and a
security risk - something I'd get my hand slapped for if I kept it configured
that way.

I tried setting it directly as root but I keep getting permission denied.
I will try this as the oracle user and see if I get the same thing.

It doesn't make sense, as right now I'm using a Linux (CentOS) client and getting
the same thing.

Phillip

From: Peter Karlsson [peter.k.karls...@oracle.com]
Sent: Friday, August 13, 2010 9:21 PM
To: Phillip Bruce (Mindsource)
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] NFS issue with ZFS

On 8/14/10 11:49 , Phillip Bruce (Mindsource) wrote:
 Peter,

 what would you expect for root?
 That is the user I am at.

root is by default mapped to anon, if you don't specifically export it
with the option to allow root on one or more clients to be mapped to
local root on the server.

zfs set sharenfs=rw,root=host zpool/fs/to/export

where host is a ':' separated list of hosts.

Alternatively, if you want root from any host to be mapped to root on
the server (bad idea), you can do something like this

zfs set sharenfs=rw,anon=0 zpool/fs/to/export

to allow root access to all hosts.

/peter

 Like I already stated it is NOT a UID or GUID issue.
 Both systems are the same.

Try as a different user that has the same uid on both systems and has
write access to the directory in question.


 Phillip
 
 From: Peter Karlsson [peter.k.karls...@oracle.com]
 Sent: Friday, August 13, 2010 7:23 PM
 To: zfs-discuss@opensolaris.org; Phillip Bruce (Mindsource)
 Subject: Re: [zfs-discuss] NFS issue with ZFS

 Hi Phillip,

 What's the permissions on the directory where you try to write to, and
 what user are you using on the client system, it's most likely a UID
 mapping issue between the client and the server.

 /peter

 On 8/14/10 3:19 , Phillip Bruce wrote:
 I have Solaris 10 U7 that is exporting a ZFS filesystem.
 The client is Solaris 9 U7.

 I can mount the filesystem just fine but I am unable to write to it.
 showmount -e shows my mount is set for everyone.
 The dfstab file has the rw option set.

 So what gives?

 Phillip




Re: [zfs-discuss] NFS issue with ZFS

2010-08-26 Thread Phillip Bruce (Mindsource)
Problem solved..

Using the FQDN on the server end did the trick; the client did not have to use
the FQDN.

zfs set sharenfs=rw=nfsclient.domain.com,rw=nfsclient.domain.com,nosuid backup

That worked.

Both systems have nsswitch.conf set correctly for DNS, so this is an issue with
DNS lookups. But it boggles me why it did not work when I explicitly used the IP
address, which bypasses DNS.

Phillip

-Original Message-
From: Peter Karlsson [mailto:peter.k.karls...@oracle.com] 
Sent: Friday, August 13, 2010 9:22 PM
To: Phillip Bruce (Mindsource)
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] NFS issue with ZFS



On 8/14/10 11:49 , Phillip Bruce (Mindsource) wrote:
 Peter,

 what would you expect for root?
 That is the user I am at.

root is by default mapped to anon, if you don't specifically export it
with the option to allow root on one or more clients to be mapped to
local root on the server.

zfs set sharenfs=rw,root=host zpool/fs/to/export

where host is a ':' separated list of hosts.

Alternatively, if you want root from any host to be mapped to root on 
the server (bad idea), you can do something like this

zfs set sharenfs=rw,anon=0 zpool/fs/to/export

to allow root access to all hosts.

/peter

 Like I already stated it is NOT a UID or GUID issue.
 Both systems are the same.

Try as a different user that has the same uid on both systems and has
write access to the directory in question.


 Phillip
 
 From: Peter Karlsson [peter.k.karls...@oracle.com]
 Sent: Friday, August 13, 2010 7:23 PM
 To: zfs-discuss@opensolaris.org; Phillip Bruce (Mindsource)
 Subject: Re: [zfs-discuss] NFS issue with ZFS

 Hi Phillip,

 What's the permissions on the directory where you try to write to, and
 what user are you using on the client system, it's most likely a UID
 mapping issue between the client and the server.

 /peter

 On 8/14/10 3:19 , Phillip Bruce wrote:
 I have Solaris 10 U7 that is exporting ZFS filesytem.
 The client is Solaris 9 U7.

 I can mount the filesystem just fine but I am unable to write to it.
 showmount -e shows my mount is set for everyone.
 The dfstab file has the rw option set.

 So what gives?

 Phillip




Re: [zfs-discuss] NFS issue with ZFS

2010-08-26 Thread Phillip Bruce (Mindsource)
Peter,

I ran truss from the client side. Below is what I am getting.
What strikes me as odd is that the client does a stat(64) call on the remote side
and cannot find the NFS-SERVER:/backup volume at all. Just before that
you get the IOCTL error, for the same reason.

Keep in mind that when I use the anon=0 setting on the NFS server, I do not
see this issue. The only other lead is maybe the 2 patches that may correct this.

Again: NFS-SERVER is Solaris 10 U7 and NFS-CLIENT is Solaris 9 U7.

I'll try installing the 2 NFS patches I see missing and see if that will correct
this issue.

root[...@nfs-client# truss -v all mount -F nfs NFS-SERVER:/backup /mnt 
execve(/usr/sbin/mount, 0xFFBFFD2C, 0xFFBFFD44)  argc = 5 
resolvepath(/usr/lib/ld.so.1, /usr/lib/ld.so.1, 1023) = 16 
resolvepath(/usr/sbin/mount, /usr/sbin/mount, 1023) = 15
stat(/usr/sbin/mount, 0xFFBFFB00) = 0
d=0x0154 i=1562  m=0100555 l=1  u=0 g=2 sz=27448
at = Aug 19 10:02:00 PDT 2010  [ 1282237320 ]
mt = Oct 15 12:36:57 PDT 2002  [ 1034710617 ]
ct = Aug 14 11:48:50 PDT 2005  [ 1124045330 ]
bsz=8192  blks=54fs=ufs
open(/var/ld/ld.config, O_RDONLY) Err#2 ENOENT
stat(/usr/lib/libcmd.so.1, 0xFFBFF608)= 0
d=0x0154 i=2791  m=0100755 l=1  u=0 g=2 sz=22920
at = Aug 19 10:01:45 PDT 2010  [ 1282237305 ]
mt = Apr  6 12:47:04 PST 2002  [ 1018126024 ]
ct = Aug 14 11:50:14 PDT 2005  [ 1124045414 ]
bsz=8192  blks=46fs=ufs
resolvepath(/usr/lib/libcmd.so.1, /usr/lib/libcmd.so.1, 1023) = 20
open(/usr/lib/libcmd.so.1, O_RDONLY)  = 3
mmap(0x0001, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_ALIGN, 3, 0) = 
0xFF3A mmap(0x0001, 90112, PROT_NONE, 
MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF38 
mmap(0xFF38, 10440, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 
0xFF38 mmap(0xFF394000, 1131, PROT_READ|PROT_WRITE|PROT_EXEC, 
MAP_PRIVATE|MAP_FIXED, 3, 16384) = 0xFF394000
munmap(0xFF384000, 65536)   = 0
memcntl(0xFF38, 3720, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3)= 0
stat(/usr/lib/libc.so.1, 0xFFBFF608)  = 0
d=0x0154 i=3811  m=0100755 l=1  u=0 g=2 sz=867448
at = Aug 19 10:02:00 PDT 2010  [ 1282237320 ]
mt = Mar  6 13:44:23 PST 2006  [ 1141681463 ]
ct = May 19 15:06:59 PDT 2006  [ 1148076419 ]
bsz=8192  blks=1712  fs=ufs
resolvepath(/usr/lib/libc.so.1, /usr/lib/libc.so.1, 1023) = 18
open(/usr/lib/libc.so.1, O_RDONLY)= 3
mmap(0xFF3A, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 
0xFF3A mmap(0x0001, 802816, PROT_NONE, 
MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF28 
mmap(0xFF28, 702900, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 
0xFF28 mmap(0xFF33C000, 24688, PROT_READ|PROT_WRITE|PROT_EXEC, 
MAP_PRIVATE|MAP_FIXED, 3, 704512) = 0xFF33C000
munmap(0xFF32C000, 65536)   = 0
memcntl(0xFF28, 117444, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
close(3)= 0
stat(/usr/lib/libdl.so.1, 0xFFBFF608) = 0
d=0x0154 i=2771  m=0100755 l=1  u=0 g=2 sz=3984
at = Aug 19 10:02:00 PDT 2010  [ 1282237320 ]
mt = Oct 30 22:51:47 PST 2005  [ 1130741507 ]
ct = May 19 15:24:40 PDT 2006  [ 1148077480 ]
bsz=8192  blks=8 fs=ufs
resolvepath(/usr/lib/libdl.so.1, /usr/lib/libdl.so.1, 1023) = 19
open(/usr/lib/libdl.so.1, O_RDONLY)   = 3
mmap(0xFF3A, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 
0xFF3A mmap(0x2000, 8192, PROT_NONE, 
MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF3FA000 
mmap(0xFF3FA000, 1894, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 
3, 0) = 0xFF3FA000
close(3)= 0
stat(/usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1, 0xFFBFF318) = 0
d=0x0154 i=3240  m=0100755 l=1  u=0 g=2 sz=16768
at = Aug 19 10:02:00 PDT 2010  [ 1282237320 ]
mt = Apr  6 14:27:58 PST 2002  [ 1018132078 ]
ct = Aug 14 11:50:23 PDT 2005  [ 1124045423 ]
bsz=8192  blks=34fs=ufs
resolvepath(/usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1, 
/usr/platform/sun4u/lib/libc_psr.so.1, 1023) = 37 
open(/usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1, O_RDONLY) = 3 
mmap(0xFF3A, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 
0xFF3A mmap(0x2000, 16384, PROT_NONE, 
MAP_PRIVATE|MAP_NORESERVE|MAP_ANON|MAP_ALIGN, -1, 0) = 0xFF3E6000 
mmap(0xFF3E6000, 13544, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0) = 
0xFF3E6000 mmap(0x, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, 
MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF37
close(3)= 0
munmap(0xFF3A, 8192)= 0
getustack(0xFFBFF944)
getrlimit(RLIMIT_STACK, 0xFFBFF93C) = 0
cur = 8388608  max = 

Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Markus Keil
Does that mean that when the beginning of the intent log chain gets corrupted, all
other intent log data after the corrupted area is lost, because the checksum of
the first corrupted block doesn't match?
 
Regards,
Markus

Neil Perrin neil.per...@oracle.com hat am 23. August 2010 um 19:44
geschrieben:

 This is a consequence of the design for performance of the ZIL code.
 Intent log blocks are dynamically allocated and chained together.
 When reading the intent log we read each block and checksum it
 with the embedded checksum within the same block. If we can't read
 a block due to an IO error then that is reported, but if the checksum does
 not match then we assume it's the end of the intent log chain.
 Using this design means the minimum number of writes needed to add
 an intent log record is just one write.

 So corruption of an intent log is not going to generate any errors.

 Neil.

 On 08/23/10 10:41, StorageConcepts wrote:
  Hello,
 
  we are currently extensively testing the DDRX1 drive for ZIL and we are going
  through all the corner cases.
 
  The headline above all our tests is: do we still need to mirror the ZIL with
  all current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't
  export the pool; with the latest upstream you can also import a pool with a
  missing ZIL)? This question is especially interesting with RAM-based
  devices, because they don't wear out, have a very low bit error rate and use
  one PCIx slot - which are rare. Price is another aspect here :)
 
  During our tests we found a strange behaviour of ZFS ZIL failures which is
  not device related, and we are looking for help from the ZFS gurus here :)
 
  The test in question is called offline ZIL corruption. The question is:
  what happens if my ZIL data is corrupted while a server is transported or
  moved and not properly shut down? For this we do:
 
  - Prepare 2 OS installations (ProdudctOS and CorruptOS)
  - Boot ProductOS and create a pool and add the ZIL
  - ProductOS: Issue synchronous I/O with an increasing TNX number (and print
  the latest committed transaction)
  - ProductOS: Power off the server and record the last committed transaction
  - Boot CorruptOS
  - Write random data to the beginning of the ZIL (dd if=/dev/urandom of=<ZIL
  device>, ~ 300 MB from the start of the disk, overwriting the first two disk labels)
  - Boot ProductOS
  - Verify that the data corruption is detected by checking the file with the
  transaction number against the one recorded
 
  We ran the test and it seems that with modern snv_134 the pool comes up after the
  corruption with everything apparently OK, while ~1 transactions (this is some
  seconds of writes with the DDRX1) are missing and nobody knows about it. We
  ran a scrub and the scrub does not even detect this. ZFS automatically repairs
  the labels on the ZIL, however no error is reported about the missing data.
 
  While it is clear to us that if we do not have a mirrored ZIL the data we
  have overwritten in the ZIL is lost, we are really wondering why ZFS does
  not REPORT this corruption instead of silently ignoring it.
 
  Is this is a bug or .. aehm ... a feature  :) ?
 
  Regards,
  Robert
    


--
StorageConcepts Europe GmbH
    Storage: Beratung. Realisierung. Support     

Markus Keil            k...@storageconcepts.de
                       http://www.storageconcepts.de
Wiener Straße 114-116  Telefon:   +49 (351) 8 76 92-21
01219 Dresden          Telefax:   +49 (351) 8 76 92-99
Handelregister Dresden, HRB 28281
Geschäftsführer: Robert Heinzmann, Gerd Jelinek
--


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov

If I might add my $0.02: it appears that the ZIL is implemented as a
kind of circular log buffer. As I understand it, when a corrupt checksum
is detected, it is taken to be the end of the log, but this kind of
defeats the checksum's original purpose, which is to detect device
failure. Thus we would first need to change this behavior to only be
used for failure detection. This leaves the question of how to detect
the end of the log, which I think could be done by using a monotonically
incrementing counter on the ZIL entries. Once we find an entry where the
counter != n+1, then we know that the block is the last one in the sequence.

Now that we can use checksums to detect device failure, it would be
possible to implement ZIL-scrub, allowing an environment to detect ZIL
device degradation before it actually results in a catastrophe.

- --
Saso

On 08/26/2010 03:22 PM, Eric Schrock wrote:
 
 On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:

 1) zil needs to report truncated transactions on zilcorruption
 
 As Neil outlined, this isn't possible while preserving current ZIL 
 performance.  There is no way to distinguish the last ZIL block without 
 incurring additional writes for every block.  If it's even possible to 
 implement this paranoid ZIL tunable, are you willing to take a 2-5x 
 performance hit to be able to detect this failure mode?
 
 - Eric
 
 --
 Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
 



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread StorageConcepts
Actually - I can't read ZFS code, so the next assumptions are more or less based
on brainware - excuse me in advance :)

How does ZFS detect up-to-date ZILs? With the tnx check of the uberblock - right?

In our corruption case, we had 2 valid uberblocks at the end and ZFS used
those to import the pool; this is what the end-uberblock is for. OK, so the
uberblock contains the pointer to the start of the ZIL chain - right?

Assume we are adding the tnx number of the current transaction this ZIL entry is part
of to the blocks written to the ZIL (specially packaged ZIL blocks). So the ZIL
blocks are a little bit bigger than the data blocks, however the transaction
count is the same. OK, for SSDs block alignment might be an issue ... agreed.
For DRAM-based ZILs this is not a problem - except for bandwidth.

Logic:

On ZIL import, check:
  - Is the pointer to the ZIL chain empty?
    If yes - clean pool.
    If not - we need to replay.

  - Now, if the block the root pointer points to is OK (checksum), the ZIL is
    used and replayed. At the end, the tnx of the last ZIL block must be = the pool
    tnx. If it is, then OK; if not, report an error about missing ZIL parts and
    switch to the mirror (if available).

 As Neil outlined, this isn't possible while
 preserving current ZIL performance.  There is no way
 to distinguish the last ZIL block without incurring
 additional writes for every block.  If it's even
 possible to implement this paranoid ZIL tunable,
 are you willing to take a 2-5x performance hit to be
 able to detect this failure mode?
 
Robert


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Darren J Moffat

On 26/08/2010 15:08, Saso Kiselkov wrote:

If I might add my $0.02: it appears that the ZIL is implemented as a
kind of circular log buffer. As I understand it, when a corrupt checksum


It is NOT circular, since that implies a limited number of entries that get
overwritten.



is detected, it is taken to be the end of the log, but this kind of
defeats the checksum's original purpose, which is to detect device
failure. Thus we would first need to change this behavior to only be
used for failure detection. This leaves the question of how to detect
the end of the log, which I think could be done by using a monotonously
incrementing counter on the ZIL entries. Once we find an entry where the
counter != n+1, then we know that the block is the last one in the sequence.


See the comment part way down zil_read_log_block about how we do 
something pretty much like that for checking the chain of log blocks:


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block

This is the checksum in the BP checksum field.

But before we even got there we checked the ZILOG2 checksum as part of 
doing the zio (in zio_checksum_verify() stage):


http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error

A ZILOG2 checksum is a version of fletcher4 embedded in the block (at the
start; the original ZILOG checksum was at the end).  If that failed -
i.e. the block was corrupt - we would have returned an error back through
the dsl_read() of the log block.


--
Darren J Moffat


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread David Magda
On Wed, August 25, 2010 23:00, Neil Perrin wrote:
 On 08/25/10 20:33, Edward Ned Harvey wrote:

 It's commonly stated, that even with log device removal supported, the
 most common failure mode for an SSD is to blindly write without reporting
 any errors, and only detect that the device is failed upon read.  So ...
 If an SSD is in this failure mode, you won't detect it?  At bootup, the
 checksum will simply mismatch, and we'll chug along forward, having lost
 the data ... (nothing can prevent that) ... but we don't know that we've
 lost data?

 - Indeed, we wouldn't know we lost data.

Does a scrub go through the slog and/or L2ARC devices, or only the
primary storage components?

If it doesn't go through these secondary devices, that may be a useful
RFE, as one would ideally want to test the data on every component of a
storage system.



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread Saso Kiselkov

I see, thank you for the clarification. So it is possible to have
something equivalent to main storage self-healing on ZIL, with ZIL-scrub
to activate it. Or is that already implemented also? (Sorry for asking
these obvious questions, but I'm not familiar with ZFS source code.)

- --
Saso

On 08/26/2010 04:31 PM, Darren J Moffat wrote:
 On 26/08/2010 15:08, Saso Kiselkov wrote:
 If I might add my $0.02: it appears that the ZIL is implemented as a
 kind of circular log buffer. As I understand it, when a corrupt checksum
 
 It is NOT circular since that implies limited number of entries that get
 overwritten.
 
 is detected, it is taken to be the end of the log, but this kind of
 defeats the checksum's original purpose, which is to detect device
 failure. Thus we would first need to change this behavior to only be
 used for failure detection. This leaves the question of how to detect
 the end of the log, which I think could be done by using a monotonously
 incrementing counter on the ZIL entries. Once we find an entry where the
 counter != n+1, then we know that the block is the last one in the
 sequence.
 
 See the comment part way down zil_read_log_block about how we do
 something pretty much like that for checking the chain of log blocks:
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block
 
 
 This is the checksum in the BP checksum field.
 
 But before we even got there we checked the ZILOG2 checksum as part of
 doing the zio (in zio_checksum_verify() stage):
 
 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error
 
 
 A ZILOG2 checksum is a embedded  in the block (at the start, the
 original ZILOG was at the end) version of fletcher4.  If that failed -
 ie the block was corrupt we would have returned an error back through
 the dsl_read() of the log block.
 



Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson

Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Neil Perrin

This is a consequence of the design for performance of the ZIL code.
Intent log blocks are dynamically allocated and chained together.
When reading the intent log we read each block and checksum it
with the embedded checksum within the same block. If we can't read
a block due to an IO error then that is reported, but if the checksum
does
not match then we assume it's the end of the intent log chain.
Using this design means the minimum number of writes needed to add
an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.


I didn't know that.  Very interesting.  This raises another question ...

It's commonly stated, that even with log device removal supported, the most
common failure mode for an SSD is to blindly write without reporting any
errors, and only detect that the device is failed upon read.  So ... If an
SSD is in this failure mode, you won't detect it?  At bootup, the checksum
will simply mismatch, and we'll chug along forward, having lost the data ...
(nothing can prevent that) ... but we don't know that we've lost data?


If the drive's firmware isn't returning back a write error of any kind 
then there isn't much that ZFS can really do here (regardless of whether 
this is an SSD or not). Turning every write into a read/write operation 
would totally defeat the purpose of the ZIL. It's my understanding that 
SSDs will eventually transition to read-only devices once they've 
exceeded their spare reallocation blocks. This should propagate to the 
OS as an EIO which means that ZFS will instead store the ZIL data on the 
main storage pool.




Worse yet ... In preparation for the above SSD failure mode, it's commonly
recommended to still mirror your log device, even if you have log device
removal.  If you have a mirror, and the data on each half of the mirror
doesn't match each other (one device failed, and the other device is good)
... Do you read the data from *both* sides of the mirror, in order to
discover the corrupted log device, and correctly move forward without data
loss?


Yes, we read all sides of the mirror when we claim (i.e. read) the log 
blocks for a log device. This is exactly what a scrub would do for a 
mirrored data device.


- George





Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson

David Magda wrote:

On Wed, August 25, 2010 23:00, Neil Perrin wrote:

Does a scrub go through the slog and/or L2ARC devices, or only the
primary storage components?


A scrub will go through slogs and primary storage devices. The L2ARC 
device is considered volatile and data loss is not possible should it fail.


- George


Re: [zfs-discuss] ZFS offline ZIL corruption not detected

2010-08-26 Thread George Wilson

Edward Ned Harvey wrote:


Add to that:

During scrubs, perform some reads on log devices (even if there's nothing to
read).


We do read from log device if there is data stored on them.

In fact, during scrubs, perform some reads on every device (even if it's
actually empty.)


Reading from the data portion of an empty device wouldn't really show us
much, as we're going to be reading a bunch of non-checksummed data. The
best we can do is to probe the device's label region to determine its
health. This is exactly what we do today.


- George





Re: [zfs-discuss] (preview) Whitepaper - ZFS Pools Explained - feedback welcome

2010-08-26 Thread Freddie Cash
On Wed, Aug 25, 2010 at 10:57 PM, StorageConcepts
presa...@storageconcepts.de wrote:
 Thanks for the feedback. The idea of it is to give people new to ZFS an
 understanding of the terms and modes of operation, to avoid common problems
 (wide-stripe pools etc.). Also agreed that it is a little NexentaStor
 tweaked :)

 I think I have to rework the ZIL section anyhow because of
 http://opensolaris.org/jive/thread.jspa?threadID=133294&tstart=0 - I have to do
 some experiments here - and I will also use a dual-command strategy, showing
 NexentaStor commands AND OpenSolaris commands whenever a command is shown.

I haven't finished reading it yet (okay, I've barely read through the
table of contents), but would you be interested in the FreeBSD equivalents
for the commands, if they differ?

-- 
Freddie Cash
fjwc...@gmail.com


Re: [zfs-discuss] slog and TRIM support [SEC=UNCLASSIFIED]

2010-08-26 Thread Freddie Cash
On Wed, Aug 25, 2010 at 6:18 PM, Wilkinson, Alex
alex.wilkin...@dsto.defence.gov.au wrote:
    On Wed, Aug 25, 2010 at 02:54:42PM -0400, LaoTsao 老曹 wrote:
    IMHO, U want -E for ZIL and -M for L2ARC

 Why ?

-E uses SLC flash, which is optimised for fast writes.  Ideal for a
ZIL which is (basically) write-only.
-M uses MLC flash, which is optimised for fast reads.  Ideal for an
L2ARC which is (basically) read-only.

-E tends to have smaller capacities, which is fine for ZIL.
-M tends to have larger capacities, which is perfect for L2ARC.

-- 
Freddie Cash
fjwc...@gmail.com


[zfs-discuss] Write Once Read Many on ZFS

2010-08-26 Thread Douglas Silva
Hi,

I'd like to know if there is a way to use WORM property on ZFS.

Thanks

Douglas


Re: [zfs-discuss] ZFS Storage server hardware

2010-08-26 Thread Tom Buskey
 On Wed, Aug 25, 2010 at 12:29 PM, Dr. Martin
 Mundschenk
 m.mundsch...@me.com wrote:
  Well, I wonder what are the components to build a
 stable system without having an enterprise solution:
 eSATA, USB, FireWire, FibreChannel?
 
 If possible to get a card to fit into a MacMini,
 eSATA would be a lot
 better than USB or FireWire.
 
 If there's any way to run cables from inside the
 case, you can make
 do with plain SATA and longer cables.
 
 Otherwise, you'll need to look into something other
 than a MacMini for
 your storage box.

If bandwidth is a concern, consider these napkin estimates:

A bare 7200rpm SATA drive will typically get 60 MB/s
SATA II is 300 MB/s
Gigabit ethernet 60 MB/s
100Mbit 11.4 MB/s
USB 2.0 is 30 MB/s
Firewire 400 35 MB/s
Firewire 800 55 MB/s
I usually see 17 MB/s max on an external USB 2.0 drive.

For 300 GB, the USB will take at best 5.5 hrs.  SATA will be 1.5 hrs.
If you have more than one process at a time hitting the drive, your speeds go
through the floor.  For home use, that may be ok.  For a business, not so much.


Re: [zfs-discuss] NFS issue with ZFS

2010-08-26 Thread Miles Nordin
>>>>> "pb" == Phillip Bruce (Mindsource) v-phb...@microsoft.com writes:

    pb> Problem solved..  Try using FQDN on the server end and that
    pb> work.  The client did not have to use FQDN.

1. your syntax is wrong.  You must use the netgroup syntax to specify an
   IP, otherwise it will think you mean the hostname made up of those
   numbers and dots as characters.

NAME              PROPERTY  VALUE
andaman/arrchive  sharenfs  rw=@10.100.100.0/23:@192.168.2.3/32

2. there's a bug in mountd.  well, there are many bugs in mountd, but
   this is the one I ran into, which makes the netgroup syntax mostly
   useless:

   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6901832

one workaround is to give every IP a reverse lookup, e.g. using BIND
$GENERATE or something.  I just use a big /etc/hosts covering every IP
to which I've exported.  I suppose actually fixing mountd would be
what a good sysadmin would have done: it can't be that hard.
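
(For the archives, the working IP/netgroup form looks roughly like this; the dataset
name and addresses are examples only:)

# '@' marks an IP address or network (optionally with a /prefix) instead of a hostname
zfs set sharenfs='rw=@192.168.2.0/24,root=@192.168.2.3/32,nosuid' tank/backup
share   # confirm the access list the server is actually exporting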




[zfs-discuss] zfs/iSCSI: 0000 = SNS Error Type: Current Error (0x70)

2010-08-26 Thread Michael W Lucas
Hi,

I'm trying to track down an error between a 64-bit x86 OpenSolaris 2009.06 box
sharing ZFS via iSCSI and an Ubuntu 10.04 client.  The client can successfully log
in, but no device node appears.  I captured a session with Wireshark.  When the
client attempts a SCSI Inquiry on LUN 0x00, OpenSolaris sends a SCSI Response
(Check Condition) for LUN 0x00 that contains the following:

.111  = SNS Error Type: Current Error (0x70)
Filemark: 0, EOM: 0, ILI: 0
 0100 = Sense Key: Hardware Error (0x04)

The ZFS volume being exported is a 400GB chunk of a 1TB ZFS mirror.  The underlying
OS reports no hardware errors, and zpool status looks OK.  Why would OpenSolaris
give this error?  Is there anything I can do about it?  Any suggestions would be
appreciated.

(I discussed this with the open-iscsi people at 
http://groups.google.com/group/open-iscsi/browse_thread/thread/06b83227ffc6a31a/2e58a163e21ec74e#2e58a163e21ec74e.)

Thanks,
==ml
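
(Hard to say from the sense data alone; I would first confirm how the LUN is exported
on the OpenSolaris side. Which command applies depends on whether the target is the
legacy iscsitgt (shareiscsi) or COMSTAR:)

iscsitadm list target -v   # legacy iSCSI target daemon
stmfadm list-lu -v         # COMSTAR logical units
zpool status -x            # double-check that the backing pool really is healthy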


Re: [zfs-discuss] ZFS Storage server hardware

2010-08-26 Thread David Dyer-Bennet

On Thu, August 26, 2010 13:58, Tom Buskey wrote:

 I usually see 17 MB/s max on an external USB 2.0 drive.

Interesting; I routinely see 27 MB/s peaking to 30 MB/s on the cheap WD 1TB
external drives I use for backups.  (Backup is probably the best case; the
only user of that drive is a zfs receive process.)

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Networker Dedup @ ZFS

2010-08-26 Thread Sigbjørn Lie

Hi Daniel,

We were looking into very much the same solution you've tested. Thanks
for your advice. I think we will look for something else. :)


Just out of curiosity, what ZFS tweaking did you do?  And which much
pricier competitor solution did you end up with in the end?



Regards,
Sigbjorn



Daniel Whitener wrote:

Sigbjorn

Stop! Don't do it... it's a waste of time.  We tried exactly what
you're thinking of... we bought two Sun/Oracle 7000 series storage
units with 20TB of ZFS storage each planning to use them as a backup
target for Networker.  We ran into several issues and eventually gave up on
the ZFS/Networker combo.  We've used other storage devices in the past
(virtual tape libraries) that had deduplication.  We were used to
seeing dedup ratios better than 20x on our backup data.  The ZFS
filesystem only gave us 1.03x, and it had regular issues because it
couldn't do dedup for such large filesystems very easily.  We didn't
know it ahead of time, but VTL solutions use something called
variable length block dedup, whereas ZFS uses fixed block length
dedup. Like one of the other posters mentioned, things just don't line
up right and the dedup ratio suffers.  Yes, compression works to some
degree -- I think we got 2 or 3x on that, but it was a far cry from
the 20x that we were used to seeing on our old VTL.

We recently ditched the 7000 series boxes in favor of a much pricier
competitor.  It's about double the cost, but dedup ratios are better
than 20x.  Personally I love ZFS and I use it in many other places,
but we were very disappointed with the dedup ability for that type of
data.  We went to Sun with our problems and they ran it up the food
chain and word came back down from the developers that this was the
way it was designed, and it's not going to change anytime soon.  The
type of files that Networker writes out just are not friendly at all
with the dedup mechanism used in ZFS.  They gave us a few ideas and
things to tweak in Networker, but no measurable gains ever came from
any of the tweaks.

If you are considering a home-grown ZFS solution for budget reasons, go
for it - just do yourself a favor and save yourself the overhead of
trying to dedup.  When we disabled dedup on our 7000 series boxes,
everything worked great and compression was fine with next to no
overhead.  Unfortunately, we NEEDED at least a 10x ratio to keep the 3
week backups we were trying to do.  We couldn't even keep a 1 week
backup with the dedup performance of ZFS.

If you need more details, I'm happy to help.  We went through months
of pain trying to make it work and it just doesn't for Networker data.

best wishes
Daniel








2010/8/18 Sigbjorn Lie sigbj...@nixtra.com:
  

Hi,

We are considering using a ZFS based storage as a staging disk for Networker. 
We're aiming at
providing enough storage to be able to keep 3 months worth of backups on disk, 
before it's moved
to tape.

To provide storage for 3 months of backups, we want to utilize the dedup 
functionality in ZFS.

I've searched around for these topics and found no success stories, however
those who have tried
did not mention if they had attempted to change the blocksize to any smaller 
than the default of
128k.

Does anyone have any experience with this kind of setup?


Regards,
Sigbjorn
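
(For reference, the kind of tuning being asked about - a smaller recordsize plus dedup
and compression on a dedicated staging dataset - would look roughly like this; the pool
and dataset names are hypothetical, and as Daniel reports above the resulting ratio may
still be disappointing for Networker savesets:)

zfs create -o recordsize=8k -o dedup=on -o compression=on tank/nw-staging
# after some backups have landed, see what dedup actually achieved
zpool list tank   # the DEDUP column shows the pool-wide ratio
zdb -S tank       # simulates dedup over existing data and prints an estimated ratio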




Re: [zfs-discuss] Networker Dedup @ ZFS

2010-08-26 Thread LaoTsao 老曹
 IMHO, if U use backup SW that supports dedupe in the SW, then ZFS is
still a viable solution.



On 8/26/2010 6:13 PM, Sigbjørn Lie wrote:

Hi Daniel,

We we're looking into very much the same solution you've tested. 
Thanks for your advise. I think we will look for something else. :)


Just out of curiosity, what  ZFS tweaking did you do?  And what much 
pricier competitor solution did you end up with in the end?



Regards,
Sigbjorn



Daniel Whitener wrote:

Sigbjorn

Stop! Don't do it... it's a waste of time.  We tried exactly what
you're thinking of... we bought two Sun/Oracle 7000 series storage
units with 20TB of ZFS storage each planning to use them as a backup
target for Networker.  We ran into several issues and eventually gave up on
the ZFS/Networker combo.  We've used other storage devices in the past
(virtual tape libraries) that had deduplication.  We were used to
seeing dedup ratios better than 20x on our backup data.  The ZFS
filesystem only gave us 1.03x, and it had regular issues because it
couldn't do dedup for such large filesystems very easily.  We didn't
know it ahead of time, but VTL solutions use something called
variable length block dedup, whereas ZFS uses fixed block length
dedup. Like one of the other posters mentioned, things just don't line
up right and the dedup ratio suffers.  Yes, compression works to some
degree -- I think we got 2 or 3x on that, but it was a far cry from
the 20x that we were used to seeing on our old VTL.

We recently ditched the 7000 series boxes in favor of a much pricier
competitor.  It's about double the cost, but dedup ratios are better
than 20x.  Personally I love ZFS and I use it in many other places,
but we were very disappointed with the dedup ability for that type of
data.  We went to Sun with our problems and they ran it up the food
chain and word came back down from the developers that this was the
way it was designed, and it's not going to change anytime soon.  The
type of files that Networker writes out just are not friendly at all
with the dedup mechanism used in ZFS.  They gave us a few ideas and
things to tweak in Networker, but no measurable gains ever came from
any of the tweaks.

If you are considering a home-grown ZFS solution for budget reasons, go
for it - just do yourself a favor and save yourself the overhead of
trying to dedup.  When we disabled dedup on our 7000 series boxes,
everything worked great and compression was fine with next to no
overhead.  Unfortunately, we NEEDED at least a 10x ratio to keep the 3
week backups we were trying to do.  We couldn't even keep a 1 week
backup with the dedup performance of ZFS.

If you need more details, I'm happy to help.  We went through months
of pain trying to make it work and it just doesn't for Networker data.

best wishes
Daniel








2010/8/18 Sigbjorn Lie sigbj...@nixtra.com:

Hi,

We are considering using a ZFS based storage as a staging disk for 
Networker. We're aiming at
providing enough storage to be able to keep 3 months worth of 
backups on disk, before it's moved

to tape.

To provide storage for 3 months of backups, we want to utilize the 
dedup functionality in ZFS.


I've searched around for these topics and found no success stories, 
however those who has tried
did not mention if they had attempted to change the blocksize to any 
smaller than the default of

128k.

Does anyone have any experience with this kind of setup?


Regards,
Sigbjorn




[zfs-discuss] ZFS with SAN's and HA

2010-08-26 Thread Michael Dodwell
Hey all,

I currently work for a company that has purchased a number of different SAN
solutions (whatever was cheap at the time!) and I want to set up an HA ZFS file
store over Fibre Channel.

Basically I've taken slices from each of the SANs and added them to a ZFS pool
on this box (which I'm calling a 'ZFS proxy'). I've then carved out LUNs from
this pool and assigned them to other servers. I then have snapshots taken on
each of the LUNs, and replication off site for DR. This all works perfectly
(backups for ESXi!)

However, I'd like to be able to a) expand and b) make it HA. All the
documentation I can find on setting up an HA cluster for file stores replicates
data between 2 servers and then serves from those machines (I trust the SANs to
take care of the data and don't want to replicate anything -- cost!). Basically
all I want is for the node that serves the ZFS pool to be HA (if this were to be
put into production we have around 128 TB and are looking to expand to a PB). We
have a couple of IBM SVCs that seem to handle the HA node setup in some
obscure proprietary IBM way, so logically it seems possible.

Clients would only be making changes via a single 'ZFS proxy' at a time
(multi-pathing set up for failover only), so I don't believe I'd need to OCFS
the setup? If I do need to set up OCFS, can I put ZFS on top of that? (I want
snapshotting/rollback and replication to an off-site location, as well as all
the goodness of thin provisioning and de-duplication.)

However, when I imported the ZFS pool onto the 2nd box I got large warnings about
it being mounted elsewhere and I needed to force the import; then, when
importing the LUNs, I saw that the GUID was different, so multi-pathing doesn't
detect that the LUNs are the same. Can I change a GUID via stmfadm? Is any of
this even possible over Fibre Channel? Is anyone able to point me at some
documentation? Am I simply crazy?

Any input would be most welcome.

Thanks in advance,


Re: [zfs-discuss] ZFS with SAN's and HA

2010-08-26 Thread LaoTsao 老曹

 be very careful here!!

On 8/26/2010 9:16 PM, Michael Dodwell wrote:

Hey all,

I currently work for a company that has purchased a number of different SAN 
solutions (whatever was cheap at the time!) and i want to setup a HA ZFS file 
store over fiber channel.

Basically I've taken slices from each of the sans and added them to a ZFS pool 
on this box (which I'm calling a 'ZFS proxy'). I've then carved out LUN's from 
this pool and assigned them to other servers. I then have snapshots taken on 
each of the LUN's and replication off site for DR. This all works perfectly 
(backups for ESXi!)

However, I'd like to be able to a) expand and b) make it HA. All the 
documentation i can find on setting up a HA cluster for file stores replicates 
data from 2 servers and then serves from these computers (i trust the SAN's to 
take care of the data and don't want to replicate anything -- cost!). Basically 
all i want is for the node that serves the ZFS pool to be HA (if this was to be 
put into production we have around 128tb and are looking to expand to a pb). We 
have a couple of IBM SVC's that seem to handle the HA node setup in some 
obscure property IBM way so logically it seems possible.

Clients would only be making changes via a single 'zfs proxy' at a time 
(multi-pathing setup for fail over only) so i don't believe I'd need to OCFS 
the setup? If i do need to setup OCFS can i put ZFS on top of that? (want 
snap-shotting/rollback and replication to a off site location, as well as all 
the goodness of thin provisioning and de-duplication)

However when i import the ZFS pool onto the 2nd box i got large warnings about 
it being mounted elsewhere and i needed to force the import, then when 
importing the LUN's i saw that the GUUID was different so multi-pathing doesn't 
pick that the LUN's are the same? can i change a GUUID via smtfadm?
If U force the import and the ZFS pool is mounted by two hosts, Ur ZFS could
become corrupted!!!

Recovery is not easy.
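
(A minimal sketch of the safe hand-over being alluded to, outside of a real cluster
framework; 'sanpool' is a hypothetical pool name. Nothing below stops both nodes
importing the pool at once - that protection is exactly what cluster software is for:)

# on the node that currently owns the pool
zpool export sanpool
# on the standby node
zpool import            # lists visible pools and warns if one looks active elsewhere
zpool import sanpool    # use -f only if you are certain the other node is really down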

Is any of this even possible over fiber channel?

Please at least take a look at the documentation for the Oracle Solaris Cluster
software; it details the way to use ZFS in a cluster environment:
http://docs.sun.com/app/docs/prod/sun.cluster32?l=en&a=view
and for ZFS specifically:
http://docs.sun.com/app/docs/doc/820-7359/gbspx?l=en&a=view




  Is anyone able to point me at some documentation? Am i simply crazy?

Any input would be most welcome.

Thanks in advance,


Re: [zfs-discuss] ZFS with SAN's and HA

2010-08-26 Thread Michael Dodwell
Lao,

I had a look at HAStoragePlus etc. and from what I understand that's to
mirror local storage across 2 nodes for services to be able to access, 'DRBD
style'.

Having read through the documentation on the Oracle site, the cluster software,
from what I gather, is about how to cluster services together (Oracle/Apache etc.),
and again any documentation I've found on storage is about how to duplicate local
storage to multiple hosts for HA failover. I can't really see anything on clustering
services to use shared storage / ZFS pools.


[zfs-discuss] VM's on ZFS - 7210

2010-08-26 Thread Mark
We are using a 7210, 44 disks I believe, 11 stripes of RAIDZ sets.  When I
installed I selected the best bang for the buck on the speed vs. capacity chart.

We run about 30 VMs on it, across 3 ESX 4 servers.  Right now, it's all running
NFS, and it sucks... sooo slow.

iSCSI was no better.

I am wondering how I can increase the performance, because they want to add more
VMs... the good news is most are idle-ish, but even idle VMs create a lot of
random chatter to the disks!

So a few options maybe...

1) Change to iSCSI mounts on ESX, and enable write-cache on the LUNs since the
7210 is on a UPS.
2) Get a Logzilla SSD mirror.  (Do SSDs fail? Do I really need a mirror?)
3) Reconfigure the NAS to RAID10 instead of RAIDZ.

Obviously all 3 would be ideal, though with an SSD can I keep using NFS and get the
same performance, since the synchronous writes would be satisfied by the SSD?

I dread getting the OK to spend $$,$$$ on SSDs and then not get the
performance increase we want.

How would you weight these?  I noticed in testing on a 5-disk OpenSolaris box that
changing from a single RAIDZ pool to RAID10 netted a larger IOPS increase than
adding an Intel SSD as a Logzilla.  That's not going to scale the same way, though,
with a 44-disk, 11-vdev striped RAIDZ set.

Some thoughts?  Would simply moving to write-cache-enabled iSCSI LUNs without
an SSD speed things up a lot by itself?
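
(To make options 2 and 3 concrete on a plain OpenSolaris box, a pool of mirrored pairs
with a mirrored slog would look something like the sketch below; device names are made
up, and on the 7210 appliance itself the equivalent choice is made through its
storage-profile GUI rather than at a shell:)

# RAID10-style layout: many small mirrors give many independent spindle sets for random IOPS
zpool create vmpool \
  mirror c1t0d0 c1t1d0 \
  mirror c1t2d0 c1t3d0 \
  mirror c1t4d0 c1t5d0 \
  log mirror c2t0d0 c2t1d0   # SSD pair absorbing the NFS synchronous writes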