Re: [zfs-discuss] Re: Big IOs overhead due to ZFS?
On Thu, Jun 01, 2006 at 02:46:32PM +0200, Robert Milkowski wrote:
> btw: what differences will there be between raidz1 and raidz2? I guess two
> checksums will be stored, so one loses approximately the space of two disks
> in a raidz2 group. Any other things?

The difference between raidz1 and raidz2 is just that the latter is resilient
against losing 2 disks rather than just 1. If you have a total of 5 disks in a
raidz1 stripe, your optimal capacity will be 4/5ths of the raw capacity of the
disks, whereas it would be 3/5ths with raidz2. Consider, however, that you'll
typically use larger stripes with raidz2, so you aren't necessarily going to
lose any capacity depending on how you configure your pool.

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
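The capacity arithmetic above is easy to check for yourself; this is just the fraction math from the paragraph, not anything ZFS computes for you (the function name is made up for illustration):

```python
def usable_fraction(total_disks, parity_disks):
    """Fraction of raw capacity left for data in one raidz vdev."""
    return (total_disks - parity_disks) / total_disks

# 5-disk raidz1 vs. 5-disk raidz2, and a wider raidz2 stripe that
# recovers the same efficiency as the narrow raidz1.
print(usable_fraction(5, 1))   # 0.8 (4/5ths)
print(usable_fraction(5, 2))   # 0.6 (3/5ths)
print(usable_fraction(10, 2))  # 0.8 again
```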
Re: [zfs-discuss] Expanding raidz2
On Wed, Jul 12, 2006 at 02:45:40PM -0700, Darren Dunham wrote:
> There may be several parity sectors per row, so adding another column
> doesn't work. But presumably it would be possible to use additional
> columns for future writes?

I guess that could be made to work, but then the data on disk becomes much
(much much) more difficult to interpret, because you have some rows which are
effectively one width and others which are another (ad infinitum). It also
doesn't really address the issue: presumably you want to add space because
the disks are getting full, but this scheme, as you mention, only applies the
new width to empty rows.

I'm not sure I even agree with the notion that this is a real problem (and if
it is, I don't think it's easily solved). Stripe widths are a function of the
expected failure rate and fault domains of the system, which tend to be
static in nature. A coarser solution would be to create a new pool and zfs
send/zfs recv the filesystems from the old pool.

Adam
Re: [zfs-discuss] Apple Time Machine
Needless to say, this was a pretty interesting piece of the keynote from a
technical point of view, and it had quite a few of us scratching our heads.
After talking to some Apple engineers, it seems like what they're doing is
more or less this: when a file is modified, the kernel fires off an event
which a user-land daemon listens for. Every so often, the user-land daemon
does something like a snapshot of the affected portions of the filesystem
with hard links (including hard links to directories -- I'm not making this
up). That might be a bit off, but it's the impression I was left with.

Anyhow: very slick UI, somewhat dubious back end, and an interesting
possibility for integration with ZFS.

Adam

On Mon, Aug 07, 2006 at 12:08:17PM -0700, Eric Schrock wrote:
> Yeah, I just noticed this line:
>
>   Backup Time: Time Machine will back up every night at midnight, unless
>   you select a different time from this menu.
>
> So this is just standard backups, with a (very) slick GUI layered on top.
> From the impression of the text-only rumor feed, it sounded more
> impressive from a filesystem implementation perspective. Still, the GUI
> integration is pretty nice, and implies that their backups are in some
> easily accessed form. Otherwise, extracting hundreds of files from a
> compressed stream would induce too much delay for the interactive stuff
> they describe.
>
> - Eric
>
> On Mon, Aug 07, 2006 at 08:58:15AM -1000, David J. Orman wrote:
> > Reading that site, it sounds EXACTLY like snapshots. It doesn't sound
> > like it requires a second disk; it just gives you the option of backing
> > up to one. Sounds like it snapshots once a day (configurable) and then
> > sends the snapshot to another drive/server if you request it to do so.
> > Looks like they just made snapshots accessible to desktop users. Pretty
> > impressive how they did the GUI work, too.
Re: [zfs-discuss] in-kernel gzip compression
On Thu, Aug 17, 2006 at 10:00:32AM -0700, Matthew Ahrens wrote:
> (Actually, I think that an RLE compression algorithm for metadata is a
> higher priority, but if someone from the community wants to step up, we
> won't turn your code away!)

Is RLE likely to be more efficient for metadata? Have you taken a stab at
estimating the comparative benefits?

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 10:05:01AM +, Ceri Davies wrote:
> On Wed, Nov 01, 2006 at 01:33:33AM -0800, Adam Leventhal wrote:
> > Rick McNeal and I have been working on building support for sharing
> > ZVOLs as iSCSI targets directly into ZFS. Below is the proposal I'll be
> > submitting to PSARC. Comments and suggestions are welcome.
>
> It looks great and I'd love to see it implemented.

It's implemented! This is the end of the process, not the beginning ;-)
I expect it will be in OpenSolaris by the end of November.

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 12:18:36PM +0200, Cyril Plisko wrote:
> > Note again that all configuration information is stored with the
> > dataset. As with NFS shared filesystems, iSCSI targets imported on a
> > different system will be shared appropriately.
>
> Does that mean that if I manage the iSCSI target via iscsitadm after it
> is shared via zfs shareiscsi=on, and then 'zpool export' and 'zpool
> import' on some other host, all the customization done via iscsitadm
> will be preserved?

No. Modifications to the target must be made through zfs(1M), not through
iscsitadm(1M), if you want them to be persistent. This is similar to sharing
ZFS filesystems via NFS: you can use share(1M), but it doesn't affect the
persistent properties of the dataset.

What properties are you specifically interested in modifying?

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
> > What properties are you specifically interested in modifying?
>
> The LUN, for example. How would I configure the LUN via the zfs command?

You can't. Forgive my ignorance about how iSCSI is deployed, but why would
you want/need to change the LUN?

Adam

On Wed, Nov 01, 2006 at 01:36:05PM +0200, Cyril Plisko wrote:
> [...]
>
> --
> Regards, Cyril
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 01:17:02PM -0500, Torrey McMahon wrote:
> Is there going to be a method to override that on the import? I can see a
> situation where you want to import the pool for some kind of maintenance
> procedure, but you don't want the iSCSI target to fire up automagically.

There isn't -- to my knowledge -- a way to do this today for NFS shares.
This would be a reasonable RFE that would apply to both NFS and iSCSI.

> Also, what if I don't have the iSCSI target packages on the node I'm
> importing to? Error messages? Nothing?

You'll get an error message reporting that the dataset could not be shared.

Adam
Re: [zfs-discuss] Re: [storage-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 12:22:46PM -0500, Matty wrote:
> This is super useful! Will ACLs and aliases be stored as properties?
> Could you post the list of available iSCSI properties to the list?

We're still investigating ACL and iSNS support. The alias will always be the
name of the dataset, but we've considered making that an option you could set
in the 'shareiscsi' property ('alias=blah', for example). The iSCSI
properties I was referring to are the private metadata for the target daemon,
such as the IQN.

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 09:25:26PM +0200, Cyril Plisko wrote:
> On 11/1/06, Adam Leventhal <[EMAIL PROTECTED]> wrote:
> > > What properties are you specifically interested in modifying?
> >
> > The LUN, for example. How would I configure the LUN via the zfs command?
> >
> > You can't. Forgive my ignorance about how iSCSI is deployed, but why
> > would you want/need to change the LUN?
>
> Well, with iSCSI specifically it is of less importance, since one can
> easily create multiple units identified by means other than the LUN. I am,
> however, trying to look ahead to FC SCSI target functionality mirroring
> that of iSCSI (AFAIK it is on Rick's roadmap [and I really do not mind
> helping]). In the FC world it is essentially the only way to have multiple
> units on a particular FC port.
>
> Can we do something similar to the NFS case, where sharenfs can be on,
> off, or something else, in which case it is a list of options? Would this
> technique be applicable to shareiscsi too?

Absolutely. We would, however, like to be conservative about adding options,
only doing so when it meets a specific need. As you noted, there's no real
requirement to be able to set the LUN.

Adam
Re: [zfs-discuss] ZFS/iSCSI target integration
On Wed, Nov 01, 2006 at 04:00:43PM -0500, Torrey McMahon wrote:
> Let's say server A has the pool with NFS shared, or iSCSI shared, volumes.
> Server A exports the pool or goes down. Server B imports the pool. Which
> clients would still be active on the filesystem(s)? The ones that were
> mounting it when it was on Server A?

Clients would need to explicitly change the server they're contacting unless
the new server also took over the IP address, hostname, etc.

Adam
[zfs-discuss] Re: [storage-discuss] ZFS/iSCSI target integration
On Thu, Nov 02, 2006 at 12:10:06AM -0800, eric kustarz wrote:
> > Like the 'sharenfs' property, 'shareiscsi' indicates if a ZVOL should be
> > exported as an iSCSI target. The acceptable values for this property are
> > 'on', 'off', and 'direct'. In the future, we may support other target
> > types (for example, 'tape'). The default is 'off'. This property may be
> > set on filesystems, but has no direct effect; this is to allow ZVOLs
> > created under the ZFS hierarchy to inherit a default. For example, an
> > administrator may want ZVOLs to be shared by default, and so set
> > 'shareiscsi=on' for the pool.
>
> hey adam, what's 'direct' mean?

It's iSCSI target lingo for vanilla disk emulation.

Adam
Re: [zfs-discuss] raid-z random read performance
I don't think you'd see the same performance benefits on RAID-Z, since parity
isn't always on the same disk. Are you seeing hot/cool disks?

Adam

On Sun, Nov 05, 2006 at 04:03:18PM +0100, Pawel Jakub Dawidek wrote:
> In my opinion RAID-Z is closer to RAID-3 than to RAID-5. In RAID-3 you do
> only full-stripe writes/reads, which is also the case for RAID-Z.
>
> What I found while working on the RAID-3 implementation for FreeBSD was
> that for small RAID-3 arrays there is a way to speed up random reads by up
> to 40% by using the parity component in a round-robin fashion. For example
> (DiskP stands for the parity component):
>
>   Disk0 Disk1 Disk2 Disk3 DiskP
>
> And now when I get a read request I do:
>
>   Request number  Components
>   0               Disk0+Disk1+Disk2+Disk3
>   1               Disk1+Disk2+Disk3+(Disk1^Disk2^Disk3^DiskP)
>   2               Disk2+Disk3+(Disk2^Disk3^DiskP^Disk0)+Disk0
>   3               Disk3+(Disk3^DiskP^Disk0^Disk1)+Disk0+Disk1
>   etc.
>
>   + - concatenation
>   ^ - XOR
>
> In other words, for every read request a different component is skipped.
> It was still a bit slower than RAID-5, though. And of course writes in
> RAID-3 (and probably RAID-Z) are much, much faster.
>
> -- Pawel Jakub Dawidek
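Pawel's round-robin trick can be sketched in a few lines. This is a purely illustrative model (single-byte "components" with made-up values, one "sector" per disk) showing why skipping a different data component per request still returns the right data: the skipped component is recoverable as the XOR of the parity component and the remaining data components.

```python
from functools import reduce
from operator import xor

# Four data components plus a parity component, as in the 4+1 example above.
data = [0x11, 0x22, 0x33, 0x44]   # one "sector" per data disk
parity = reduce(xor, data)        # DiskP = Disk0 ^ Disk1 ^ Disk2 ^ Disk3

def read_stripe(request_number):
    """Full-stripe read that skips one data component per request,
    rebuilding it from parity so all five spindles share the load."""
    skip = request_number % len(data)
    rebuilt = reduce(xor, [parity] + [d for i, d in enumerate(data) if i != skip])
    return [rebuilt if i == skip else d for i, d in enumerate(data)]

# Every request returns the same stripe contents, whichever disk was skipped.
assert all(read_stripe(n) == data for n in range(8))
```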
Re: [zfs-discuss] ZFS/iSCSI target integration
Thanks for all the feedback. This PSARC case was approved yesterday and will
be integrated relatively soon.

Adam

On Wed, Nov 01, 2006 at 01:33:33AM -0800, Adam Leventhal wrote:
> Rick McNeal and I have been working on building support for sharing ZVOLs
> as iSCSI targets directly into ZFS. Below is the proposal I'll be
> submitting to PSARC. Comments and suggestions are welcome.
>
> Adam
>
> ---8<---
>
> iSCSI/ZFS Integration
>
> A. Overview
>
> The goal of this project is to couple ZFS with the iSCSI target in
> Solaris, specifically to make it as easy to create and export ZVOLs via
> iSCSI as it is to create and export ZFS filesystems via NFS. We will add
> two new ZFS properties to support this feature.
>
> shareiscsi
>
>   Like the 'sharenfs' property, 'shareiscsi' indicates if a ZVOL should be
>   exported as an iSCSI target. The acceptable values for this property are
>   'on', 'off', and 'direct'. In the future, we may support other target
>   types (for example, 'tape'). The default is 'off'. This property may be
>   set on filesystems, but has no direct effect; this is to allow ZVOLs
>   created under the ZFS hierarchy to inherit a default. For example, an
>   administrator may want ZVOLs to be shared by default, and so set
>   'shareiscsi=on' for the pool.
>
> iscsioptions
>
>   This property, which is hidden by default, is used by the iSCSI target
>   daemon to store persistent information such as the IQN. The contents are
>   not intended for users or external consumers.
>
> B. Examples
>
> iSCSI targets are simple to create with the zfs(1M) command:
>
>   # zfs create -V 100M pool/volumes/v1
>   # zfs set shareiscsi=on pool/volumes/v1
>   # iscsitadm list target
>   Target: pool/volumes/v1
>       iSCSI Name: iqn.1986-03.com.sun:02:4db92521-f5dc-cde4-9cd5-a3f6f567220a
>       Connections: 0
>
> Renaming the ZVOL has the expected result for the iSCSI target:
>
>   # zfs rename pool/volumes/v1 pool/volumes/stuff
>   # iscsitadm list target
>   Target: pool/volumes/stuff
>       iSCSI Name: iqn.1986-03.com.sun:02:4db92521-f5dc-cde4-9cd5-a3f6f567220a
>       Connections: 0
>
> Note that per the iSCSI specification (RFC 3720), the iSCSI Name is
> unchanged after the ZVOL is renamed.
>
> Exporting a pool containing a shared ZVOL will cause the target to be
> removed; importing a pool containing a shared ZVOL will cause the target
> to be shared:
>
>   # zpool export pool
>   # iscsitadm list target
>   # zpool import pool
>   # iscsitadm list target
>   Target: pool/volumes/stuff
>       iSCSI Name: iqn.1986-03.com.sun:02:4db92521-f5dc-cde4-9cd5-a3f6f567220a
>       Connections: 0
>
> Note again that all configuration information is stored with the dataset.
> As with NFS shared filesystems, iSCSI targets imported on a different
> system will be shared appropriately.
>
> ---8<---
Re: [zfs-discuss] Solaris 11/06 + iscsi integration
Hey Robert,

The iSCSI target is targeting Solaris 10 update 4. There wasn't any issue
with the target; rather, it was the timing of its integration into Nevada,
and the sheer quantity of projects targeting update 3.

Adam

On Thu, Dec 14, 2006 at 02:39:17PM -0500, Robert Petkus wrote:
> Folks, just wondering why iSCSI target disk support didn't make it into
> the latest Solaris release. Were there any problems?
>
> Robert
Re: [zfs-discuss] odd versus even
Hey Peter,

If I recall correctly, the result was that there is a very slight
space-efficiency benefit to using a multiple of 2 disks for raidz1 vdevs and
a multiple of 3 for raidz2 -- doing this can reduce the number of 'skipped'
blocks. That said, the advantage is very slight and is only really relevant
when the blocksize or recordsize is relatively close to the number of bytes
in a stripe.

Adam

On Thu, Jan 04, 2007 at 11:17:26PM +, Peter Tribble wrote:
> I'm being a bit of a dunderhead at the moment, and neither the site search
> nor google are picking up the information I seek...
>
> I'm setting up a thumper, and I'm sure I recall some discussion of the
> optimal number of drives in raidz1 and raidz2 vdevs. I also recall that it
> was something like you would want an even number of disks for raidz1 and
> an odd number for raidz2 (so you always have an odd number of data
> drives). Have I remembered this correctly, or am I going delusional? And,
> if it is the case, what is the reasoning behind it?
>
> Thanks,
> -Peter Tribble
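The 'skipped block' accounting can be sketched roughly as follows. This is a simplified model, not the actual allocator code: it assumes one parity sector per stripe row and that allocations are rounded up to a multiple of (nparity + 1) sectors so freed segments are never unusably small. Treat the numbers as illustrative.

```python
def skipped_sectors(data_sectors, ndisks, nparity):
    """Padding ('skipped') sectors for one raidz block, under the
    simplified accounting described above."""
    ndata = ndisks - nparity
    nrows = -(-data_sectors // ndata)          # ceil: rows the data spans
    asize = data_sectors + nrows * nparity     # data plus parity sectors
    mult = nparity + 1
    return -(-asize // mult) * mult - asize    # round-up waste

# A 15-sector block on raidz1: with 3 data disks the allocation comes out
# even, while 4 data disks forces a pad sector.
assert skipped_sectors(15, 4, 1) == 0   # 4-disk raidz1: no padding
assert skipped_sectors(15, 5, 1) == 1   # 5-disk raidz1: one skipped sector
```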
Re: [zfs-discuss] Can you turn on zfs compression when the fs is already populated?
For what it's worth, there is a plan to allow data to be rewritten during a
scrub so that you can enable compression for extant data. No ETA, but it's
on the roadmap. In fact, I was recently reminded that I filed a bug on this
in 2004:

  5029294 there should be a way to compress an extant file system

Adam

On Wed, Jan 24, 2007 at 06:50:22PM +0100, [EMAIL PROTECTED] wrote:
> > I have an 800GB raidz2 zfs filesystem. It already has approx 142GB of
> > data. Can I simply turn on compression at this point, or do you need to
> > start with compression at creation time? If I turn on compression now,
> > what happens to the existing data?
>
> Yes. Nothing.
>
> Casper
Re: [zfs-discuss] Re: Re: Adding my own compression to zfs
On Mon, Jan 29, 2007 at 02:39:13PM -0800, roland wrote:
> > # zfs get compressratio
> > NAME       PROPERTY       VALUE  SOURCE
> > pool/gzip  compressratio  3.27x  -
> > pool/lzjb  compressratio  1.89x  -
>
> this looks MUCH better than i would have ever expected for smaller files.
> any real-world data on how good or bad compressratio gets with lots of
> very small but well-compressible files, for example some (evil for those
> solaris evangelists) untarred linux source tree? i'm rather excited how
> effectively gzip will compress here.
>
> for comparison:
>
>   sun1:/comptest # bzcat /tmp/linux-2.6.19.2.tar.bz2 | tar xvf -
>   --snip--
>   sun1:/comptest # du -s -k *
>   143895  linux-2.6.19.2
>   1       pax_global_header
>   sun1:/comptest # du -s -k --apparent-size *
>   224282  linux-2.6.19.2
>   1       pax_global_header
>   sun1:/comptest # zfs get compressratio comptest
>   NAME      PROPERTY       VALUE  SOURCE
>   comptest  compressratio  1.79x  -

Don't start sending me your favorite files to compress (it really should
work about the same as gzip), but here's the result for the above (I found
a tar file that's about 235MB uncompressed):

  # du -ks linux-2.6.19.2/
  80087   linux-2.6.19.2
  # zfs get compressratio pool/gzip
  NAME       PROPERTY       VALUE  SOURCE
  pool/gzip  compressratio  3.40x  -

Doing a gzip with the default compression level (6 -- the same setting I'm
using in ZFS) yields a file that's about 52MB. The small files are hurting a
bit here, but it's still pretty good -- and considerably better than LZJB.

Adam
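If you want to estimate what gzip will do for your own data without building a pool, plain zlib at level 6 is the same deflate algorithm at the same level as ZFS's 'gzip' setting. A quick sketch (the sample text is made up; real ratios depend entirely on your data, so treat this as a method, not a benchmark):

```python
import zlib

# Some repetitive, source-code-like input.
sample = b"static int zfs_ioctl(dev_t dev, int cmd, intptr_t arg);\n" * 200
compressed = zlib.compress(sample, 6)   # level 6 == ZFS's plain 'gzip'
print(f"compressratio ~ {len(sample) / len(compressed):.2f}x")
```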
Re: [zfs-discuss] Re: Need help making lsof work with ZFS
On Wed, Feb 14, 2007 at 01:56:33PM -0700, Matthew Ahrens wrote:
> These files are not shipped with Solaris 10. You can find them in
> OpenSolaris: usr/src/uts/common/fs/zfs/sys/
>
> The interfaces in these files are not supported, and may change without
> notice at any time.

Even if they're not supported, shouldn't the header files be shipped so
people can make sense of kernel data structure types?

Adam
Re: [zfs-discuss] zfs received vol not appearing on iscsi target list
On Sat, Feb 24, 2007 at 09:29:48PM +1300, Nicholas Lee wrote:
> I'm not really a Solaris expert, but I would have expected vol4 to appear
> on the iSCSI target list automatically. Is there a way to refresh the
> target list? Or is this a bug?

Hi Nicholas,

This is a bug either in ZFS or in the iSCSI target. Please file a bug.

Adam
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Tue, Mar 20, 2007 at 06:01:28PM -0400, Brian H. Nelson wrote:
> Why does this happen? Is it a bug? I know there is a recommendation of 20%
> free space for good performance, but that thought never occurred to me
> when this machine was set up (zvols only, no zfs proper).

It sounds like this bug:

  6430003 record size needs to affect zvol reservation size on RAID-Z

Adam
Re: [zfs-discuss] ZFS overhead killed my ZVOL
On Wed, Mar 21, 2007 at 01:36:10AM +0100, Robert Milkowski wrote:
> btw: I assume that the compression level will be hard-coded after all,
> right?

Nope. You'll be able to choose from gzip-N with N ranging from 1 to 9, just
like gzip(1).

Adam
[zfs-discuss] gzip compression support
I recently integrated this fix into ON:

  6536606 gzip compression for ZFS

With this, ZFS now supports gzip compression. To enable gzip compression,
just set the 'compression' property to 'gzip' (or 'gzip-N' where N=1..9).
Existing pools will need to be upgraded in order to use this feature, and,
yes, this is the second ZFS version number update this week. Recall that
once you've upgraded a pool, older software will no longer be able to access
it, regardless of whether you're using the gzip compression algorithm.

I did some very simple tests to look at relative size and time requirements:

  http://blogs.sun.com/ahl/entry/gzip_for_zfs_update

I've also asked Roch Bourbonnais and Richard Elling to do some more
extensive tests.

Adam

From zfs(1M):

  compression=on | off | lzjb | gzip | gzip-N

    Controls the compression algorithm used for this dataset. The lzjb
    compression algorithm is optimized for performance while providing
    decent data compression. Setting compression to on uses the lzjb
    compression algorithm. The gzip compression algorithm uses the same
    compression as the gzip(1) command. You can specify the gzip level by
    using the value gzip-N, where N is an integer from 1 (fastest) to 9
    (best compression ratio). Currently, gzip is equivalent to gzip-6
    (which is also the default for gzip(1)).

    This property can also be referred to by its shortened column name,
    compress.
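To get a feel for how the N in gzip-N trades time for size before setting it on a dataset, you can sweep zlib levels 1 through 9 over a sample of your data (zlib level N is the same deflate setting as gzip-N; the payload here is synthetic, so only the method carries over):

```python
import time
import zlib

payload = bytes(range(256)) * 512   # 128 KiB of highly repetitive data

sizes = {}
for level in range(1, 10):          # gzip-1 .. gzip-9
    start = time.perf_counter()
    sizes[level] = len(zlib.compress(payload, level))
    elapsed = time.perf_counter() - start
    print(f"gzip-{level}: {sizes[level]:6d} bytes, {elapsed * 1e3:6.2f} ms")
```

On most inputs the gains flatten out well before level 9, which is why gzip-6 is the usual default.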
Re: [zfs-discuss] gzip compression support
On Fri, Mar 23, 2007 at 11:41:21AM -0700, Rich Teer wrote:
> > I recently integrated this fix into ON:
> >
> >   6536606 gzip compression for ZFS
>
> Cool! Can you recall into which build it went?

I put it back yesterday, so it will be in build 62.

Adam
Re: [zfs-discuss] ZFS over iSCSI question
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
> > I'm in a way still hoping that it's an iSCSI-related problem, as
> > detecting dead hosts in a network can be a non-trivial problem and it
> > takes quite some time for TCP to time out and inform the upper layers.
> > Just a guess/hope here that FC-AL, etc. do better in this case.
>
> iscsi doesn't use TCP, does it? Anyway, the problem is really transport
> independent.

It does use TCP. Were you thinking of UDP?

Adam
Re: [zfs-discuss] Convert raidz
On Mon, Apr 02, 2007 at 12:37:24AM -0700, homerun wrote:
> Is it possible to convert a live 3-disk zpool from raidz to raidz2? And is
> it possible to add 1 new disk to a raidz configuration without backups and
> recreating the zpool from scratch?

The reason that's not possible is that RAID-Z uses a variable stripe width.
This solves some problems (notably the RAID-5 write hole [1]), but it means
that a given 'stripe' over N disks in a raidz1 configuration may contain as
many as floor(N/2) parity blocks -- clearly a single additional disk
wouldn't be sufficient to grow the stripe properly.

It would be possible to have a different type of RAID-Z where stripes were
variable-width to avoid the RAID-5 write hole, but the remainder of the
stripe was left unused. This would allow users to add an additional parity
disk (or several, if we ever implement further redundancy) to an existing
configuration, BUT would potentially make much less efficient use of
storage.

Adam

[1] http://blogs.sun.com/bonwick/entry/raid_z
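To see where the floor(N/2) figure comes from: under variable stripe width every block carries its own parity, so a row packed with the smallest possible blocks on raidz1 alternates parity and data sectors. A toy illustration of that packing (not the real allocator):

```python
def max_parity_in_row(ndisks, nparity=1):
    """Pack one raidz row with the smallest possible blocks: each block is
    `nparity` parity sectors plus one data sector. Returns how many of the
    row's sectors end up being parity."""
    per_block = nparity + 1            # sectors consumed by each tiny block
    blocks = ndisks // per_block       # how many such blocks fit in the row
    return blocks * nparity

# A row across 5 disks can hold floor(5/2) = 2 parity sectors on raidz1.
assert max_parity_in_row(5) == 5 // 2
```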
Re: [zfs-discuss] Setting up for zfsboot
On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:
> - RAID-Z is _very_ slow when one disk is broken.

Do you have data on this? The reconstruction should be relatively cheap,
especially when compared with the initial disk access.

Adam
Re: [zfs-discuss] Setting up for zfsboot
On Wed, Apr 04, 2007 at 11:04:06PM +0200, Robert Milkowski wrote:
> If I stop all activity to the x4500 with a pool made of several raidz2
> vdevs and then issue a spare attach, I get really poor performance
> (1-2MB/s) on a pool with lots of relatively small files.

Does that mean the spare is resilvering while you collect the performance
data? I think a fair test would be to compare the performance of a fully
functional RAID-Z stripe against one with a missing (absent) device.

Adam
Re: [zfs-discuss] ZFS and Linux
On Thu, Apr 12, 2007 at 06:59:45PM -0300, Toby Thain wrote:
> > Hey, then just don't *keep on* asking to relicense ZFS (and anything
> > else) to GPL.
>
> I never would. But it would be horrifying to imagine it relicensed to BSD.
> (Hello, Microsoft, you just got yourself a competitive filesystem.)

There's nothing today preventing Microsoft (or Apple) from sticking ZFS into
their OS. They'd just have to release the (minimal) diffs to ZFS-specific
files.

Adam
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
On Thu, May 03, 2007 at 11:43:49AM -0500, [EMAIL PROTECTED] wrote:
> I think this may be a premature leap -- it is still undetermined if we are
> running up against an as-yet-unknown bug in the kernel implementation of
> gzip used for this compression type. From my understanding, the gzip code
> has been reused from an older kernel implementation; it may be possible
> that this code has some issues with kernel stuttering when used for zfs
> compression that were not exposed by its original usage. If it turns out
> that it is just a case of a high-CPU trade-off for buying faster
> compression times, then the talk of a tunable may make sense (if it is
> even possible given the constraints of the gzip code in kernelspace).

The in-kernel version of zlib is the latest version (1.2.3). It's not
surprising that we're spending all of our time in zlib if the machine is
being driven by I/O. There are outstanding problems with compression in the
ZIO pipeline that may contribute to the bursty behavior.

Adam
Re: [zfs-discuss] iscsitadm local_name in ZFS
That would be a great RFE. Currently the iSCSI alias is the dataset name,
which should help with identification.

Adam

On Fri, May 04, 2007 at 02:02:34PM +0200, cedric briner wrote:
> cedric briner wrote:
> > hello dear community,
> >
> > Is there a way to have a ``local_name'', as defined in iscsitadm.1m,
> > when you shareiscsi a zvol? This would give an even easier way to
> > identify a device through its IQN.
> >
> > Ced.
>
> Okay, no reply from you, so... maybe I didn't make myself well understood.
> Let me try to re-explain what I mean: when you use a zvol and enable
> shareiscsi, could you add a suffix to the IQN (iSCSI Qualified Name)?
> This suffix would be given by me and would help me to identify which IQN
> corresponds to which zvol: it is just a more human-readable tag on an IQN.
>
> Similarly, this tag is also given when you use iscsitadm, and in the man
> page of iscsitadm it is called a local_name:
>
>   iscsitadm create target -b /dev/dsk/c0d0s5 tiger
>   iscsitadm create target -b /dev/dsk/c0d0s5 hd-1
>
> tiger and hd-1 are local_names.
>
> Ced.
> --
> Cedric BRINER
> Geneva - Switzerland
Re: [zfs-discuss] Re: ZFS over a layered driver interface
Try 'trace((int)arg1);' -- 4294967295 is the unsigned representation of -1. Adam On Mon, May 14, 2007 at 09:57:23AM -0700, Shweta Krishnan wrote: Thanks Eric and Manoj. Here's what ldi_get_size() returns: bash-3.00# dtrace -n 'fbt::ldi_get_size:return{trace(arg1);}' -c 'zpool create adsl-pool /dev/layerzfsminor1' dtrace: description 'fbt::ldi_get_size:return' matched 1 probe cannot create 'adsl-pool': invalid argument for this pool operation dtrace: pid 2582 has exited CPU ID FUNCTION:NAME 0 20927 ldi_get_size:return 4294967295 This is strange because I looked at the code for ldi_get_size() and the only possible return values in the code are DDI_SUCCESS (0) and DDI_FAILURE (-1). Looks like what I'm looking at either isn't the return value, or some bad address is being reached. Any hints? Thanks, Swetha. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
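The unsigned-versus-signed confusion Adam points out is worth spelling out: DDI_FAILURE (-1) stored in a 32-bit value reads back as 2^32 - 1 when interpreted as unsigned. A quick illustration (in Python, for brevity):

```python
# DDI_FAILURE is -1; masked to 32 bits it reads back as 2**32 - 1
# when interpreted as unsigned -- the 4294967295 in the dtrace output.
unsigned = -1 & 0xFFFFFFFF
print(unsigned)  # 4294967295

# Reinterpreting the value as signed recovers -1, which is what the
# (int) cast in 'trace((int)arg1);' accomplishes in the D script.
signed = unsigned - 2**32 if unsigned >= 2**31 else unsigned
print(signed)  # -1
```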
Re: [zfs-discuss] RFE: ISCSI alias when shareiscsi=on
Right now -- as I'm sure you have noticed -- we use the dataset name for the alias. To let users explicitly set the alias we could add a new property as you suggest or allow other options for the existing shareiscsi property: shareiscsi='alias=potato' This would sort of match what we do for the sharenfs property. Adam On Thu, May 24, 2007 at 02:39:24PM +0200, cedric briner wrote: Starting from this thread: http://www.opensolaris.org/jive/thread.jspa?messageID=118786#118786 I would love to have the possibility to set an iSCSI alias when doing a shareiscsi=on on ZFS. This would greatly facilitate identifying where an IQN is hosted. The iSCSI alias is defined in RFC 3721, e.g. http://www.apps.ietf.org/rfc/rfc3721.html#sec-2 and the CLI could be something like: zfs set shareiscsi=on shareiscsiname=iscsi_alias tank Ced. -- Cedric BRINER Geneva - Switzerland ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mac OS X Leopard to use ZFS
On Thu, Jun 07, 2007 at 08:38:10PM -0300, Toby Thain wrote: When should we expect Solaris kernel under OS X? 10.6? 10.7? :-) I'm sure Jonathan will be announcing that soon. ;-) Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: LZO compression?
Those are interesting results. Does this mean you've already written LZO support into ZFS? If not, that would be a great next step -- licensing issues can be sorted out later... Adam On Sat, Jun 16, 2007 at 04:40:48AM -0700, roland wrote: btw - is there some way to directly compare lzjb vs lzo compression - to see which performs better and uses less cpu? here are the numbers from my little benchmark:

            time         ratio
    lzo     6m39.603s    2.99x
    gzip    7m46.875s    3.41x
    lzjb    7m7.600s     1.79x

i'm just curious about these numbers - with lzo i got better speed and better compression in comparison to lzjb. nothing against lzjb compression - it's pretty nice - but why not take a closer look here? maybe there is some room for improvement. roland This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Mac OS X 10.5 read-only support for ZFS
On Sun, Jun 17, 2007 at 09:38:51PM -0700, Anton B. Rang wrote: Sector errors on DVD are not uncommon. Writing a DVD in ZFS format with duplicated data blocks would help protect against that problem, at the cost of 50% or so disk space. That sounds like a lot, but with BluRay etc. coming along, maybe paying a 50% penalty isn't too bad. (And if ZFS eventually supports RAID on a single disk, the penalty would be less.) It would be an interesting project to create some software that took a directory (or ZFS filesystem) to be written to a CD or DVD and optimized the layout for redundancy. That is, choose the compression method (if any), and then, in effect, partition the CD for RAID-Z or mirroring to stretch the data to fill the entire disc. It wouldn't necessarily be all that efficient to access, but it would give you resiliency against media errors. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to take advantage of PSARC 2007/171: ZFS Separate Intent Log
Flash SSDs typically boast a huge number of _read_ IOPS (thousands), but very few write IOPS (tens). The write throughput numbers quoted are almost certainly for non-synchronous writes whose latency can easily be in the millisecond range. STEC makes an interesting device which offers fast _synchronous_ writes on an SSD, but at a pretty steep cost. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Compression algorithms - Project Proposal
This is a great idea. I'd like to add a couple of suggestions: It might be interesting to focus on compression algorithms which are optimized for particular workloads and data types, an Oracle database for example. It might be worthwhile to have some sort of adaptive compression whereby ZFS could choose a compression algorithm based on its detection of the type of data being stored. Adam On Thu, Jul 05, 2007 at 08:29:38PM -0300, Domingos Soares wrote: Below follows a proposal for a new opensolaris project. Of course, this is open to change since I just wrote down some ideas I had months ago, while researching the topic as a graduate student in Computer Science, and since I'm not an opensolaris/ZFS expert at all. I would really appreciate any suggestions or comments. PROJECT PROPOSAL: ZFS Compression Algorithms. The main purpose of this project is the development of new compression schemes for the ZFS file system. We plan to start with the development of a fast implementation of a Burrows-Wheeler Transform (BWT) based algorithm. BWT is an outstanding tool and the currently known lossless compression algorithms based on it outperform the compression ratio of algorithms derived from the well-known Ziv-Lempel algorithm, while being a little more time and space expensive. Therefore, there is space for improvement: recent results show that the running time and space needs of such algorithms can be significantly reduced, and the same results suggest that BWT is likely to become the new standard in compression algorithms[1]. Suffix sorting (i.e. the problem of sorting the suffixes of a given string) is the main bottleneck of BWT, and really significant progress has been made in this area since the first algorithms of Manber and Myers[2] and Larsson and Sadakane[3], notably the new linear-time algorithms of Karkkainen and Sanders[4]; Kim, Sim and Park[5]; and Ko and Aluru[6], and also the promising O(n log n) algorithm of Karkkainen and Burkhardt[7]. 
As a conjecture, we believe that some intrinsic properties of ZFS and file systems in general (e.g. sparseness and data entropy in blocks) could be exploited in order to produce brand new and really efficient compression algorithms, as well as the adaptation of existing ones to the task. The study might be extended to the analysis of data in specific applications (e.g. web servers, mail servers and others) in order to develop compression schemes for specific environments and/or modify the existing Ziv-Lempel based scheme to deal better with such environments. [1] The Burrows-Wheeler Transform: Theory and Practice. Manzini, Giovanni. Proc. 24th Int. Symposium on Mathematical Foundations of Computer Science [2] Suffix Arrays: A New Method for On-Line String Searches. Manber, Udi and Myers, Eugene W. SIAM Journal on Computing, Vol. 22 Issue 5. 1990 [3] Faster suffix sorting. Larsson, N Jasper and Sadakane, Kunihiko. Technical report, Department of Computer Science, Lund University, 1999 [4] Simple Linear Work Suffix Array Construction. Karkkainen, Juha and Sanders, Peter. Proc. 13th International Conference on Automata, Languages and Programming, 2003 [5] Linear-time construction of suffix arrays. D.K. Kim, J.S. Sim, H. Park, K. Park. CPM, LNCS, Vol. 2676, 2003 [6] Space efficient linear time construction of suffix arrays. P. Ko and S. Aluru. CPM 2003 [7] Fast Lightweight Suffix Array Construction and Checking. Burkhardt, Stefan and Kärkkäinen, Juha. 14th Annual Symposium, CPM 2003. Domingos Soares Neto University of Sao Paulo Institute of Mathematics and Statistics ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
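Since the proposal centers on the Burrows-Wheeler Transform, a minimal sketch may help readers unfamiliar with it. This naive version materializes and sorts every rotation -- the suffix-array constructions cited above ([2]-[7]) exist precisely to avoid that cost -- and assumes the input contains no NUL byte:

```python
def bwt(s: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform: sort all rotations, take the last column."""
    s += sentinel  # unique terminator makes the transform invertible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def ibwt(t: str, sentinel: str = "\0") -> str:
    """Invert the BWT by repeatedly prepending the transform and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

text = "banana"
transformed = bwt(text)
print(transformed.replace("\0", "$"))  # clusters of equal characters
assert ibwt(transformed) == text
```

The transform itself compresses nothing; its value is that it groups similar characters together, so a simple back-end coder (move-to-front plus run-length or entropy coding) becomes very effective.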
Re: [zfs-discuss] zfs vol issue?.
On Thu, Aug 16, 2007 at 05:20:25AM -0700, ramprakash wrote: #zfs mount -a 1. mounts c again. 2. but not vol1.. [ ie /dev/zvol/dsk/mytank/b/c does not contain vol1 ] Is this the normal behavior or is it a bug? That looks like a bug. Please file it. Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Mirrored zpool across network
On Sun, Aug 19, 2007 at 05:45:18PM -0700, Mark wrote: Basically, the setup is a large volume of Hi-Def video being streamed from a camera onto an editing timeline. This will be written to a network share. Due to the large amounts of data, ZFS is a really good option for us. But we need a backup. We need to do it on generic hardware (I was thinking AMD64 with an array of large 7200rpm hard drives), and therefore I think I'm going to have one box mirroring the other box. They will be connected by gigabit ethernet. So my question is how do I mirror one raidz array across the network to the other? One big decision you need to make in this scenario is whether you want true synchronous replication or if asynchronous replication, possibly with some time bound, is acceptable. For the former, each byte must traverse the network before it is acknowledged to the client; for the latter, data is written locally and then transmitted shortly after that. Synchronous replication obviously imposes a much larger performance hit, but asynchronous replication means you may lose data over some recent period (but the data will always be consistent). Adam -- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RAIDZ vs. RAID5.
On Mon, Sep 10, 2007 at 12:41:24PM +0200, Pawel Jakub Dawidek wrote: And here are the results: RAIDZ: Number of READ requests: 4. Number of WRITE requests: 0. Number of bytes to transmit: 695678976. Number of processes: 8. Bytes per second: 1305213 Requests per second: 75 RAID5: Number of READ requests: 4. Number of WRITE requests: 0. Number of bytes to transmit: 695678976. Number of processes: 8. Bytes per second: 2749719 Requests per second: 158 I'm a bit surprised by these results. Assuming relatively large blocks written, RAID-Z and RAID-5 should be laid out on disk very similarly resulting in similar read performance. Did you compare the I/O characteristic of both? Was the bottleneck in the software or the hardware? Very interesting experiment... Adam -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On Wed, Nov 07, 2007 at 01:47:04PM -0800, can you guess? wrote: I do consider the RAID-Z design to be somewhat brain-damaged [...] How so? In my opinion, it seems like a cure for the brain damage of RAID-5. Adam -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Yager on ZFS
On Thu, Nov 08, 2007 at 07:28:47PM -0800, can you guess? wrote: How so? In my opinion, it seems like a cure for the brain damage of RAID-5. Nope. A decent RAID-5 hardware implementation has no 'write hole' to worry about, and one can make a software implementation similarly robust with some effort (e.g., by using a transaction log to protect the data-plus-parity double-update or by using COW mechanisms like ZFS's in a more intelligent manner). Can you reference a software RAID implementation which implements a solution to the write hole and performs well? My understanding (and this is based on what I've been told by people more knowledgeable in this domain than I am) is that software RAID has suffered from being unable to provide both correctness and acceptable performance. The part of RAID-Z that's brain-damaged is its concurrent-small-to-medium-sized-access performance (at least up to request sizes equal to the largest block size that ZFS supports, and arguably somewhat beyond that): while conventional RAID-5 can satisfy N+1 small-to-medium read accesses or (N+1)/2 small-to-medium write accesses in parallel (though the latter also take an extra rev to complete), RAID-Z can satisfy only one small-to-medium access request at a time (well, plus a smidge for read accesses if it doesn't verify the parity) - effectively providing RAID-3-style performance. Brain damage seems a bit of an alarmist label. While you're certainly right that for a given block we do need to access all disks in the given stripe, it seems like a rather quaint argument: aren't most environments that matter trying to avoid waiting for the disk at all? Intelligent prefetch and large caches -- I'd argue -- are far more important for performance these days. 
The easiest way to fix ZFS's deficiency in this area would probably be to map each group of N blocks in a file as a stripe with its own parity - which would have the added benefit of removing any need to handle parity groups at the disk level (this would, incidentally, not be a bad idea to use for mirroring as well, if my impression is correct that there's a remnant of LVM-style internal management there). While this wouldn't allow use of parity RAID for very small files, in most installations they really don't occupy much space compared to that used by large files so this should not constitute a significant drawback. I don't really think this would be feasible given how ZFS is stratified today, but go ahead and prove me wrong: here are the instructions for bringing over a copy of the source code: http://www.opensolaris.org/os/community/tools/scm - ahl -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expanding a RAIDZ based Pool...
On Mon, Dec 10, 2007 at 03:59:22PM +, Karl Pielorz wrote: e.g. If I build a RAIDZ pool with 5 * 400Gb drives, and later add a 6th 400Gb drive to this pool, will its space instantly be available to volumes using that pool? (I can't quite see this working myself) Hi Karl, You can't currently expand the width of a RAID-Z stripe. It has been considered, but implementing that would require a fairly substantial change in the way RAID-Z works. Sun's current ZFS priorities are elsewhere, but there's nothing preventing an interested member of the community from undertaking this project... Adam -- Adam Leventhal, FishWorkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz in zfs questions
2. in a raidz do all the disks have to be the same size? Disks don't have to be the same size, but only as much space will be used on the larger disks as is available on the smallest disk. In other words, there's no benefit to be gained from this approach. Related question: Does a raidz have to be either only full disks or only slices, or can it be mixed? E.g., can you do a 3-way raidz with 2 complete disks and one slice (of equal size to the disks) on a 3rd, larger, disk? Sure. One could do this, but it's kind of a hack. I imagine you'd like to do something like match a disk of size N with another disk of size 2N and use RAID-Z to turn them into a single vdev. At that point it's probably a better idea to build a striped vdev and use ditto blocks to do your data redundancy by setting copies=2. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
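The capacity rule described here -- each member contributes only as much as the smallest disk -- reduces to simple arithmetic. A sketch (a simplified model that ignores metadata, padding, and allocation overhead):

```python
def raidz_usable_bytes(disk_sizes, parity=1):
    """Approximate usable capacity of a RAID-Z vdev with mixed disk sizes.

    Every member contributes only as much space as the smallest disk,
    and `parity` disks' worth of that goes to parity.
    """
    n = len(disk_sizes)
    if n <= parity:
        raise ValueError("need more disks than parity devices")
    per_disk = min(disk_sizes)
    return per_disk * (n - parity)

GB = 10**9
# Two 500 GB disks plus a 1 TB disk: the extra 500 GB on the large
# disk is simply unused, so usable space is (3 - 1) * 500 GB.
print(raidz_usable_bytes([500 * GB, 500 * GB, 1000 * GB]) // GB)  # 1000
```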
Re: [zfs-discuss] Mixing RAIDZ and RAIDZ2 zvols in the same zpool
On Wed, Mar 12, 2008 at 09:59:53PM +0000, A Darren Dunham wrote: It's not *bad*, but as far as I'm concerned, it's wasted space. You have to deal with the pool as a whole as having single-disk redundancy for failure modes. So the fact that one section of it has two-disk redundancy doesn't give you anything in failure planning. And you can't assign filesystems or particular data to that vdev, so the added redundancy can't be concentrated anywhere. Well, one can imagine a situation where two different types of disks have different failure probabilities such that the same reliability could be garnered with one using single-parity RAID as with the other using double-parity RAID. That said, it would be a fairly uncommon scenario. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Per filesystem scrub
On Mar 31, 2008, at 10:41 AM, kristof wrote: I would be very happy having a filesystem-based zfs scrub. We have an 18TB zpool; it takes more than 2 days to do the scrub. Since we cannot take snapshots during the scrub, this is unacceptable. While per-dataset scrubbing would certainly be a coarse-grained solution to your problem, work is underway to address the problematic interaction between scrubs and snapshots. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Algorithm for expanding RAID-Z
After hearing many vehement requests for expanding RAID-Z vdevs, Matt Ahrens and I sat down a few weeks ago to figure out a mechanism that would work. While Sun isn't committing resources to implementing a solution, I've written up our ideas here: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z I'd encourage anyone interested in getting involved with ZFS development to take a look. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Periodic ZFS maintenance?
On Mon, Apr 21, 2008 at 10:41:35AM +1200, Ian Collins wrote: Sam wrote: I have a 10x500 disc file server with ZFS+, do I need to perform any sort of periodic maintenance to the filesystem to keep it in tip top shape? No, but if there are problems, a periodic scrub will tip you off sooner rather than later. Well, tip you off _and_ correct the problems if possible. I believe a long- standing RFE has been to scrub periodically in the background to ensure that correctable problems don't turn into uncorrectable ones. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in S10U6 vs openSolaris 05/08
On Fri, May 16, 2008 at 03:12:02PM -0700, Zlotnick Fred wrote: The issue with CIFS is not just complexity; it's the total amount of incompatible change in the kernel that we had to make in order to make the CIFS protocol a first class citizen in Solaris. This includes changes in the VFS layer which would break all S10 file systems. So in a very real sense CIFS simply cannot be backported to S10. However, the same arguments were made explaining the difficulty of backporting ZFS and GRUB boot to Solaris 10. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD reliability, wear levelling, warranty period
On Jun 11, 2008, at 1:16 AM, Al Hopper wrote: But... if you look broadly at the current SSD product offerings, you see: a) lower than expected performance - particularly in regard to write IOPS (I/O Ops per Second) True. Flash is quite asymmetric in its performance characteristics. That said, the L2ARC has been specially designed to play well with the natural strengths and weaknesses of flash. and b) warranty periods that are typically 1 year - with the (currently rare) exception of products that are offered with a 5 year warranty. You'll see a new class of SSDs -- eSSDs -- designed for the enterprise with longer warranties and more write/erase cycles. Further, ZFS will do its part by not killing the write/erase cycles of the L2ARC by constantly streaming as fast as possible. You should see lifetimes in the 3-5 year range on typical flash. Obviously, for SSD products to live up to the current marketing hype, they need to deliver superior performance and *reliability*. Everyone I know *wants* one or more SSD devices - but they also have the expectation that those devices will come with a warranty at least equivalent to current hard disk drives (3 or 5 years). I don't disagree entirely, but as a cache device flash actually can be fairly unreliable and we'll pick it up in ZFS. So ... I'm interested in learning from anyone on this list, and, in particular, from Team ZFS, what the reality is regarding SSD reliability. Obviously Sun employees are not going to compromise their employment and divulge upcoming product specific data - but there must be *some* data (white papers etc) in the public domain that would provide some relevant technical data?? A typical high-end SSD can sustain 100k write/erase cycles so you can do some simple math to see that a 128GB device written to at a rate of 150M/s will last nearly 3 years. 
Again, note that unreliable devices will result in a performance degradation when you fail a checksum in the L2ARC, but the data will still be valid out of the main storage pool. You're going to see much more on this in the next few months. I made a post to my blog that probably won't answer your questions directly, but may help inform you about what we have in mind. http://blogs.sun.com/ahl/entry/flash_hybrid_pools_and_future Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
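The back-of-the-envelope endurance math in the previous message is easy to check (a simplified model that assumes perfect wear leveling and continuous writing at the stated rate):

```python
def ssd_lifetime_years(capacity_bytes, cycles, write_rate_bps):
    """Years until the rated write/erase cycles are exhausted, assuming
    perfect wear leveling and sustained writes at write_rate_bps."""
    total_writable = capacity_bytes * cycles  # total bytes the device can absorb
    seconds = total_writable / write_rate_bps
    return seconds / (365 * 24 * 3600)

# The example from the message: a 128 GB device rated for 100k cycles,
# written to continuously at 150 MB/s.
years = ssd_lifetime_years(128 * 10**9, 100_000, 150 * 10**6)
print(f"{years:.1f} years")  # 2.7 years -- "nearly 3"
```

Real workloads rarely sustain anything close to a device's full write bandwidth, which is why practical lifetimes land comfortably in the 3-5 year range quoted above.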
Re: [zfs-discuss] SSD reliability, wear levelling, warranty period
On Wed, Jun 11, 2008 at 01:51:17PM -0500, Al Hopper wrote: I think that I'll (personally) avoid the initial rush-to-market comsumer level products by vendors with no track record of high tech software development - let alone those who probably can't afford the PhD level talent it takes to get the wear leveling algorithms correct - and then to implement them correctly. Instead I'll wait for a Sun product - from a company with a track record of proven design and *implementation* for enterprise level products (software and hardware). Wear leveling is actually a fairly mature technology. I'm more concerned with what will happen as people continue pushing these devices out of the consumer space and into the enterprise where stuff like failure modes and reliability matters in a completely different way. If my iPod sucks that's a hassle, but it's a different matter if an SSD hangs an I/O request on my enterprise system. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Ideal Setup: RAID-5, Areca, etc!
But, is there a performance boost with mirroring the drives? That is what I'm unsure of. Mirroring will provide a boost on reads, since the system can read from both sides of the mirror. It will not provide an increase on writes, since the system needs to wait for both halves of the mirror to finish. It could be slightly slower than a single raid5. That's not strictly correct. Mirroring will, in fact, deliver better IOPS for both reads and writes. For reads, as Brandon stated, mirroring will deliver better performance because it can distribute the reads between both devices. For writes, however, RAID-Z with an N+1 wide stripe will divide the data into N+1 chunks, and reads will need to access the N data chunks. This reduces the total IOPS by a factor of N+1 for writes and N for reads, whereas mirroring reduces the IOPS by a factor of 2 for writes and not at all for reads. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
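The IOPS factors in that argument can be sketched as a toy model (the per-disk number is made up for illustration; real pools also benefit from caching, prefetch, and write aggregation, none of which this captures):

```python
def pool_iops(per_disk_iops, disks, layout):
    """Random-IOPS model for a pool of `disks` identical drives.

    mirror: reads fan out across both halves of each pair; every
            write must land on both halves.
    raidz:  an N+1-wide stripe ties up all members per logical I/O,
            so the whole group behaves like one spindle.
    """
    if layout == "mirror":
        pairs = disks // 2
        return {"read": disks * per_disk_iops,   # each side serves reads
                "write": pairs * per_disk_iops}  # both halves per write
    if layout == "raidz":
        return {"read": per_disk_iops,           # full-stripe reads
                "write": per_disk_iops}          # full-stripe writes
    raise ValueError(f"unknown layout: {layout}")

# 6 disks at an assumed 150 random IOPS per spindle
print(pool_iops(150, 6, "mirror"))  # {'read': 900, 'write': 450}
print(pool_iops(150, 6, "raidz"))   # {'read': 150, 'write': 150}
```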
Re: [zfs-discuss] Which is better for root ZFS: mlc or slc SSD?
For a root device it doesn't matter that much. You're not going to be writing to the device at a high data rate so write/erase cycles don't factor much (SLC can sustain about a factor of 10 more). With MLC you'll get 2-4x the capacity for the same price, but again that doesn't matter much for a root device. Performance is typically a bit better with SLC -- especially on the write side -- but it's not such a huge difference. The reason you'd use a flash SSD for a boot device is power (with maybe a dash of performance), and either SLC or MLC will do just fine. Adam On Sep 24, 2008, at 11:41 AM, Erik Trimble wrote: I was under the impression that MLC is the preferred type of SSD, but I want to prevent myself from having a think-o. I'm looking to get (2) SSDs to use as my boot drive. It looks like I can get 32GB SSDs composed of either SLC or MLC for roughly equal pricing. Which would be the better technology? (I'll worry about rated access times/etc of the drives, I'm just wondering about general tech for an OS boot drive usage...) -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] An slog experiment (my NAS can beat up your NAS)
So what are the downsides to this? If both nodes were to crash and I used the same technique to recreate the ramdisk I would lose any transactions in the slog at the time of the crash, but the physical disk image is still in a consistent state right (just not from my apps point of view)? You would lose transactions, but the pool would still reflect a consistent state. So is this idea completely crazy? On the contrary; it's very clever. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
On Nov 11, 2008, at 9:38 AM, Bryan Cantrill wrote: Just to throw some ice-cold water on this: 1. It's highly unlikely that we will ever support the x4500 -- only the x4540 is a real possibility. And to warm things up a bit: there's already an upgrade path from the x4500 to the x4540 so that would be required before any upgrade to the equivalent of the Sun Storage 7210. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
On Nov 11, 2008, at 10:41 AM, Brent Jones wrote: Wish I could get my hands on a beta of this GUI... Take a look at the VMware version that you can run on any machine: http://www.sun.com/storage/disk_systems/unified_storage/resources.jsp Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenStorage GUI
Is this software available for people who already have thumpers? We're considering offering an upgrade path for people with existing thumpers. Given the feedback we've been hearing, it seems very likely that we will. No word yet on pricing or availability. Adam -- Adam Leventhal, Fishworkshttp://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] continuous replication
On Fri, Nov 14, 2008 at 10:48:25PM +0100, Mattias Pantzare wrote: That is _not_ active-active, that is active-passive. If you have a active-active system I can access the same data via both controllers at the same time. I can't if it works like you just described. You can't call it active-active just because different volumes are controlled by different controllers. Most active-passive RAID controllers can do that. The data sheet talks about active-active clusters, how does that work? What the Sun Storage 7000 Series does would more accurately be described as dual active-passive. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Storage 7000
On Mon, Nov 17, 2008 at 12:35:38PM -0600, Tim wrote: I'm not sure if this is the right place for the question or not, but I'll throw it out there anyways. Does anyone know, if you create your pool(s) with a system running fishworks, can that pool later be imported by a standard solaris system? IE: If for some reason the head running fishworks were to go away, could I attach the JBOD/disks to a system running snv/mainline solaris/whatever, and import the pool to get at the data? Or is the zfs underneath fishworks proprietary as well? Yes. The Sun Storage 7000 Series uses the same ZFS that's in OpenSolaris today. A pool created on the appliance could potentially be imported on an OpenSolaris system; that is, of course, not explicitly supported in the service contract. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Storage 7000
Would be interesting to hear more about how Fishworks differs from OpenSolaris, what build it is based on, what package mechanism you are using (IPS already?), and other differences... I'm sure these details will be examined in the coming weeks on the blogs of members of the Fishworks team. Keep an eye on blogs.sun.com/fishworks. A little off topic: Do you know when the SSDs used in the Storage 7000 will be available for the rest of us? I don't think they will be, but it will be possible to purchase them as replacement parts. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Storage 7000
On Tue, Nov 18, 2008 at 09:09:07AM -0800, Andre Lue wrote: Is the web interface on the appliance available for download or will it make it to opensolaris sometime in the near future? It's not, and it's unlikely to make it to OpenSolaris. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Comparison between the S-TEC Zeus and the Intel X25-E ??
The Intel part does about a fourth as many synchronous write IOPS at best. Adam On Jan 16, 2009, at 5:34 PM, Erik Trimble wrote: I'm looking at the newly-orderable (via Sun) STEC Zeus SSDs, and they're outrageously priced. http://www.stec-inc.com/product/zeusssd.php I just looked at the Intel X25-E series, and they look comparable in performance. At about 20% of the cost. http://www.intel.com/design/flash/nand/extreme/index.htm Can anyone enlighten me as to any possible difference between an STEC Zeus and an Intel X25-E ? I mean, other than those associated with the fact that you can't get the Intel one orderable through Sun right now. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800) ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
Right, which is an absolutely piss poor design decision and why every major storage vendor right-sizes drives. What happens if I have an old maxtor drive in my pool whose 500g is just slightly larger than every other mfg on the market? You know, the one who is no longer making their own drives since being purchased by seagate. I can't replace the drive anymore? *GREAT*. Sun does right-size our drives. Are we talking about replacing a device bought from Sun with another device bought from Sun? If these are just drives that fell off the back of some truck, you may not have that assurance. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks in each RAIDZ group
The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups. Odd that the Sun Unified Storage 7000 products do not allow you to control this, it appears to put all the hdd's into one group. At least on the 7110 we are evaluating there is no control to allow multiple groups/different raid types. Our experience has shown that that initial guess of 3-9 per parity device was surprisingly narrow. We see similar performance out to much wider stripes which, of course, offer the user more usable capacity. We don't allow you to manually set the RAID stripe widths on the 7000 series boxes because frankly the stripe width is an implementation detail. If you want the best performance, choose mirroring; capacity, double-parity RAID; for something in the middle, we offer 3+1 single-parity RAID. Other than that you're micro-optimizing for gains that would hardly be measurable given the architecture of the Hybrid Storage Pool. Recall that unlike other products in the same space, we get our IOPS from flash rather than from a bazillion spindles spinning at 15,000 RPM. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
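As a back-of-the-envelope illustration of the capacity trade-off described above, here is a sketch of the usable-capacity fraction for a few redundancy profiles. The group sizes are examples chosen for illustration, not the appliance's actual layouts.

```python
# Usable-capacity fraction of one RAID group; group sizes below are
# illustrative examples, not the 7000 series' actual stripe layouts.
def usable_fraction(data_disks, parity_disks):
    """Fraction of raw capacity available for data."""
    return data_disks / (data_disks + parity_disks)

print(usable_fraction(1, 1))   # mirroring: 0.5
print(usable_fraction(3, 1))   # 3+1 single-parity RAID: 0.75
print(usable_fraction(9, 2))   # a wider double-parity stripe, e.g. 9+2: ~0.818
```

Wider stripes give the user more usable capacity, which is the point made above about the 3-9 guess being surprisingly narrow.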
Re: [zfs-discuss] replace same sized disk fails with too small error
Since it's done in software by HDS, NetApp, and EMC, that's complete bullshit. Forcing people to spend 3x the money for a Sun drive that's identical to the seagate OEM version is also bullshit and a piss-poor answer. I didn't know that HDS, NetApp, and EMC all allow users to replace their drives with stuff they've bought at Fry's. Is this still covered by their service plan or would this only be in an unsupported config? Thanks. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
Since it's done in software by HDS, NetApp, and EMC, that's complete bullshit. Forcing people to spend 3x the money for a Sun drive that's identical to the seagate OEM version is also bullshit and a piss-poor answer. I didn't know that HDS, NetApp, and EMC all allow users to replace their drives with stuff they've bought at Fry's. Is this still covered by their service plan or would this only be in an unsupported config? So because an enterprise vendor requires you to use their drives in their array, suddenly zfs can't right-size? Vendor requirements have absolutely nothing to do with their right-sizing, and everything to do with them wanting your money. Sorry, I must have missed your point. I thought that you were saying that HDS, NetApp, and EMC had a different model. Were you merely saying that the software in those vendors' products operates differently than ZFS? Are you telling me zfs is deficient to the point it can't handle basic right-sizing like a 15$ sata raid adapter? How do their $15 SATA RAID adapters solve the problem? The more details you can provide, the better, obviously. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks in each RAIDZ group
BWAHAHAHAHA. That's a good one. You don't need to setup your raid, that's micro-managing, we'll do that. Remember that one time when I talked about limiting snapshots to protect a user from themselves, and you joined into the fray of people calling me a troll? I don't remember this, but I don't doubt it. Can you feel the irony oozing out between your lips, or are you completely oblivious to it? The irony would be that on one hand I object to artificial limitations to business-critical features while on the other hand I think that users don't need to tweak settings that add complexity and little to no value? They seem very different to me, so I suppose the answer to your question is: no I cannot feel the irony oozing out between my lips, and yes I'm oblivious to the same. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
On Mon, Jan 19, 2009 at 01:35:22PM -0600, Tim wrote: Are you telling me zfs is deficient to the point it can't handle basic right-sizing like a 15$ sata raid adapter? How do their $15 SATA RAID adapters solve the problem? The more details you can provide, the better, obviously. They short-stroke the disk so that when you buy a new 500GB drive that isn't the exact same number of blocks you aren't screwed. It's a design choice to be both sane, and to make the end-user's life easier. You know, sort of like you not letting people choose their raid layout... Drive vendors, it would seem, have an incentive to make their 500GB drives as small as possible. Should ZFS then choose some amount of padding at the end of each device and chop it off as insurance against a slightly smaller drive? How much of the device should it chop off? Conversely, should users have the option to use the full extent of the drives they've paid for, say, if they're using a vendor that already provides that guarantee? You know, sort of like you not letting people choose their raid layout... Yes, I'm not saying it shouldn't be done. I'm asking what the right answer might be. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] replace same sized disk fails with too small error
And again, I say take a look at the market today, figure out a percentage, and call it done. I don't think you'll find a lot of users crying foul over losing 1% of their drive space when they don't already cry foul over the false advertising that is drive sizes today. Perhaps it's quaint, but 5GB still seems like a lot to me to throw away. In any case, you might as well can ZFS entirely because it's not really fair that users are losing disk space to raid and metadata... see where this argument is going? Well, I see where this _specious_ argument is going. I have two disks in one of my systems... both maxtor 500GB drives, purchased at the same time shortly after the buyout. One is a rebadged Seagate, one is a true, made in China Maxtor. Different block numbers... same model drive, purchased at the same time. Wasn't zfs supposed to be about using software to make up for deficiencies in hardware? It would seem this request is exactly that... That's a fair point, and I do encourage you to file an RFE, but a) Sun has already solved this problem in a different way as a company with our products and b) users already have the ability to right-size drives. Perhaps a better solution would be to handle the procedure of replacing a disk with a slightly smaller one by migrating data and then treating the extant disks as slightly smaller as well. This would have the advantage of being far more dynamic and of only applying the space tax in situations where it actually applies. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
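The "chop off a percentage" idea being debated amounts to something like the sketch below. The 1% reserve is the figure from the discussion and the 512-byte sector is an assumption for illustration; this is not anything ZFS actually does.

```python
# Sketch of right-sizing: reserve a fixed percentage at the end of each
# device so a replacement drive that is a few blocks smaller still fits.
# The 1% default is from the discussion above, not from any ZFS source.
SECTOR = 512

def right_sized_sectors(raw_bytes, reserve_pct=1):
    """Usable sectors after chopping off a safety margin."""
    usable_bytes = raw_bytes * (100 - reserve_pct) // 100
    return usable_bytes // SECTOR

# A nominal 500GB drive with a 1% reserve.
print(right_sized_sectors(500 * 10**9))   # 966796875 sectors
```

The 5GB given up on that 500GB drive is exactly the cost Adam calls quaint but real.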
Re: [zfs-discuss] SSD - slow down with age
On Feb 14, 2009, at 12:45 PM, Nicholas Lee wrote: A useful article about long term use of the Intel SSD X25-M: http://www.pcper.com/article.php?aid=669 - Long-term performance analysis of Intel Mainstream SSDs. Would a zfs cache (ZIL or ARC) based on a SSD device see this kind of issue? Maybe a periodic scrub via a full disk erase would be a useful process. Indeed SSDs can have certain properties that would cause their performance to degrade over time. We've seen this to varying degrees with different devices we've tested in our lab. We're working on adapting our use of SSDs with ZFS as a ZIL device, an L2ARC device, and eventually as primary storage. We'll first focus on the specific SSDs we certify for use in our general purpose servers and the Sun Storage 7000 series, and help influence the industry to move to standards that we can then use. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SAS 15K drives as L2ARC
After all this discussion, I am not sure if anyone adequately answered the original poster's question as to whether a 2540 with SAS 15K drives would provide substantial synchronous write throughput improvement when used as an L2ARC device. I was under the impression that the L2ARC was to speed up reads, as it allows things to be cached on something faster than disks (usually MLC SSDs). Offloading the ZIL is what handles synchronous writes, isn't it? How would adding an L2ARC speed up writes? You're absolutely right. The L2ARC is for accelerating reads only and will not affect write performance. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110 questions
On Thu, Jun 18, 2009 at 11:51:44AM -0400, Dan Pritts wrote: I'm curious about a couple things that would be unsupported. Specifically, whether they are not supported if they have specifically been crippled in the software. We have not crippled the software in any way, but we have designed an appliance with some specific uses. Doing things from the Solaris shell by hand may damage your system and void your support contract. 1) SSD's I can imagine buying an intel SSD, slotting it into the 7110, and using it as a ZFS L2ARC (? i mean the equivalent of readzilla) That's not supported, it won't work easily, and if you get it working you'll be out of luck if you have a problem. 2) expandability I can imagine buying a SAS card and a JBOD and hooking it up to the 7110; it has plenty of PCI slots. Ditto. finally, one question - I presume that I need to devote a pair of disks to the OS, so I really only get 14 disks for data. Correct? That's right. We market the 7110 as either 2TB = 146GB x 14 or 4.2TB = 300GB x 14 raw capacity. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110 questions
Hey Lawrence, Make sure you're running the latest software update. Note that this forum is not the appropriate place to discuss support issues. Please contact your official Sun support channel. Adam On Thu, Jun 18, 2009 at 12:06:02PM -0700, lawrence ho wrote: We have a 7110 on try and buy program. We tried using the 7110 with XEN Server 5 over iSCSI and NFS. Nothing seems to solve the slow write problem. Within the VM, we observed around 8MB/s on writes. Read performance is fantastic. Some troubleshooting was done with local SUN rep. The conclusion is that 7110 does not have write cache in forms of SSD or controller DRAM write cache. The solution from SUN is to buy StorageTek or 7000 series model with SSD write cache. Adam, please advise if there are any fixes for 7110. I am still shopping for SAN and would rather buy a 7110 than a StorageTek or something else. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] triple-parity: RAID-Z3
Hey Bob, MTTDL analysis shows that given normal environmental conditions, the MTTDL of RAID-Z2 is already much longer than the life of the computer or the attendant human. Of course sometimes one encounters unusual conditions where additional redundancy is desired. To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver? I do think that it is worthwhile to be able to add another parity disk to an existing raidz vdev but I don't know how much work that entails. It entails a bunch of work: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z Matt Ahrens is working on a key component after which it should all be possible. Zfs development seems to be overwhelmed with marketing-driven requirements lately and it is time to get back to brass tacks and make sure that the parts already developed are truly enterprise-grade. While I don't disagree that the focus for ZFS should be ensuring enterprise-class reliability and performance, let me assure you that requirements are driven by the market and not by marketing. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
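The question about a second failure during resilver can be sketched with simple probability. The 3% annualized failure rate (AFR) below is an assumed figure for illustration, not a number from the thread; the 24-hour window is the one mentioned above.

```python
# Back-of-the-envelope sketch: given an assumed per-drive annualized
# failure rate (AFR) -- the 3% is illustrative, not from the thread --
# what are the odds that at least one surviving drive fails during a
# 24-hour resilver window?
def p_second_failure(n_remaining, afr=0.03, window_hours=24):
    # Per-drive probability of failing within the window.
    p_one = afr * window_hours / (365 * 24)
    # Probability that at least one of n_remaining drives fails.
    return 1 - (1 - p_one) ** n_remaining

# A group that lost one of ten drives: nine survivors at risk.
print(p_second_failure(9))
```

Note this ignores latent bit errors, which the message above argues are already the more likely way to lose data during a long resilver.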
Re: [zfs-discuss] triple-parity: RAID-Z3
which gap? 'RAID-Z should mind the gap on writes' ? I believe this is in reference to the raid 5 write hole, described here: http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance It's not. So I'm not sure what the 'RAID-Z should mind the gap on writes' comment is getting at either. Clarification? I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly. Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about. Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks. I hope that's clear; if it's not, stay tuned for the aforementioned blog post. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
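The rounding just described can be sketched in a few lines. This is an illustration of the arithmetic, not the actual ZFS allocation code; parity here is computed per row of data columns, and the total is rounded up to a multiple of nparity + 1.

```python
import math

# Sketch of RAID-Z allocation rounding (not the actual ZFS code):
# the total of data + parity sectors is rounded up to a multiple of
# nparity + 1, and the remainder shows up as "skipped" sectors.
def raidz_allocation(data_sectors, ndisks, nparity):
    """Return (total sectors allocated, skipped sectors)."""
    ndata = ndisks - nparity
    parity = math.ceil(data_sectors / ndata) * nparity
    used = data_sectors + parity
    total = math.ceil(used / (nparity + 1)) * (nparity + 1)
    return total, total - used

# On a 3-disk raidz1, a 2-sector write needs 2 data + 1 parity = 3
# sectors, rounded up to 4 -- one skipped sector.
print(raidz_allocation(2, 3, 1))   # (4, 1)
```

Small synchronous writes hit this rounding constantly, which is why they interact badly with write aggregation.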
Re: [zfs-discuss] triple-parity: RAID-Z3
Don't hear about triple-parity RAID that often: Author: Adam Leventhal Repository: /hg/onnv/onnv-gate Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651 Total changesets: 1 Log message: 6854612 triple-parity RAID-Z http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612 (Via Blog O' Matty.) Would be curious to see performance characteristics. I just blogged about triple-parity RAID-Z (raidz3): http://blogs.sun.com/ahl/entry/triple_parity_raid_z As for performance, on the system I was using (a max config Sun Storage 7410), I saw about a 25% improvement to 1GB/s for a streaming write workload. YMMV, but I'd be interested in hearing your results. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] triple-parity: RAID-Z3
Don't hear about triple-parity RAID that often: I agree completely. In fact, I have wondered (probably in these forums), why we don't bite the bullet and make a generic raidzN, where N is any number >= 0. I agree, but raidzN isn't simple to implement and it's potentially difficult to get it to perform well. That said, it's something I intend to bring to ZFS in the next year or so. If memory serves, the second parity is calculated using Reed-Solomon which implies that any number of parity devices is possible. True; it's a degenerate case. In fact, get rid of mirroring, because it clearly is a variant of raidz with two devices. Want three way mirroring? Call that raidz2 with three devices. The truth is that a generic raidzN would roll up everything: striping, mirroring, parity raid, double parity, etc. into a single format with one parameter. That's an interesting thought, but there are some advantages to calling out mirroring for example as its own vdev type. As has been pointed out, reading from either side of the mirror involves no computation whereas reading from a RAID-Z 1+2 for example would involve more computation. This would complicate the calculus of balancing read operations over the mirror devices. Let's not stop there, though. Once we have any number of parity devices, why can't I add a parity device to an array? That should be simple enough with a scrub to set the parity. In fact, what is to stop me from removing a parity device? Once again, I think the code would make this rather easy. With RAID-Z stripes can be of variable width meaning that, say, a single row in a 4+2 configuration might have two stripes of 1+2. In other words, there might not be enough space in the new parity device. I did write up the steps that would be needed to support RAID-Z expansion; you can find it here: http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z Ok, back to the real world. 
The one downside to triple parity is that I recall the code discovered the corrupt block by excluding it from the stripe, reconstructing the stripe and comparing that with the checksum. In other words, for a given cost of X to compute a stripe and a number P of corrupt blocks, the cost of reading a stripe is approximately X^P. More corrupt blocks would radically slow down the system. With raidz2, the maximum number of corrupt blocks would be two, putting a cap on how costly the read can be. Computing the additional parity of triple-parity RAID-Z is slightly more expensive, but not much -- it's just bitwise operations. Recovering from a read failure is identical (and performs identically) to raidz1 or raidz2 until you actually have sustained three failures. In that case, performance is slower as more computation is involved -- but aren't you just happy to get your data back? If there is silent data corruption, then and only then can you encounter the O(n^3) algorithm that you alluded to, but only as a last resort. If we don't know what drives failed, we try to reconstruct your data by assuming that one drive, then two drives, then three drives are returning bad data. For raidz1, this was a linear operation; raidz2, quadratic; now raidz3 is N-cubed. There's really no way around it. Fortunately with proper scrubbing encountering data corruption in one stripe on three different drives is highly unlikely. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
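The linear/quadratic/cubic growth just described comes from trying combinations of candidate bad drives. A small sketch of the worst-case count, for illustration only:

```python
from math import comb

# Sketch of the combinatorial search described above: when the corrupt
# drives aren't known, reconstruction tries every combination of 1, then
# 2, then up to nparity candidate bad drives until the checksum matches.
def reconstruction_attempts(nchildren, nparity):
    """Worst-case number of drive combinations to try."""
    return sum(comb(nchildren, nbad) for nbad in range(1, nparity + 1))

print(reconstruction_attempts(9, 1))   # 9: linear in the stripe width
print(reconstruction_attempts(9, 2))   # 45: adds C(9,2) = 36
print(reconstruction_attempts(9, 3))   # 129: adds C(9,3) = 84
```

As noted above, this worst case is only reached on silent corruption, as a last resort.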
Re: [zfs-discuss] triple-parity: RAID-Z3
Robert, On Fri, Jul 24, 2009 at 12:59:01AM +0100, Robert Milkowski wrote: To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver? I wish it was so good with raid-zN. In real life, at least from my experience, it can take several days to resilver a disk for vdevs in raid-z2 made of 11x sata disk drives with real data. While the way zfs synchronizes data is way faster under some circumstances it is also much slower under others. IIRC some builds ago there were some fixes integrated so maybe it is different now. Absolutely. I was talking more or less about optimal timing. I realize that due to the priorities within ZFS and real world loads that it can take far longer. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD (SLC) for cache...
My question is about SSD, and the differences between using SLC for readzillas instead of MLC. Sun uses MLCs for Readzillas for their 7000 series. I would think that if SLCs (which are generally more expensive) were really needed, they would be used. That's not entirely accurate. In the 7410 and 7310 today (the members of the Sun Storage 7000 series that support Readzilla) we use SLC SSDs. We're exploring the use of MLC. Perhaps someone on the Fishworks team could give more details, but going by what I've read and seen, MLCs should be sufficient for the L2ARC. Save your money. That's our assessment, but it's highly dependent on the specific characteristics of the MLC NAND itself, the SSD controller, and, of course, the workload. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Hey Gary, There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience. Adam On Aug 25, 2009, at 5:29 AM, Gary Gendel wrote: I have a 5-500GB disk Raid-Z pool that has been producing checksum errors right after upgrading SXCE to build 121. They seem to be randomly occurring on all 5 disks, so it doesn't look like a disk failure situation. Repeatedly running a scrub on the pool randomly repairs between 20 and a few hundred checksum errors. Since I hadn't physically touched the machine, it seems a very strong coincidence that it started right after I upgraded to 121. This machine is a SunFire v20z with a Marvell SATA 8-port controller (the same one as in the original thumper). I've seen this kind of problem way back around build 40-50 ish, but haven't seen it after that until now. Anyone else experiencing this problem or knows how to isolate the problem definitively? Thanks, Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?
Will BP rewrite allow adding a drive to raidz1 to get raidz2? And what is the status of BP rewrite? Far away? Not started yet? Planning? BP rewrite is an important component technology, but there's a bunch beyond that. It's not a high priority right now for us at Sun. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] change raidz1 to raidz2 with BP rewrite?
Hi David, BP rewrite is an important component technology, but there's a bunch beyond that. It's not a high priority right now for us at Sun. What's the bug / RFE number for it? (So those of us with contracts can add a request for it.) I don't have the number handy, but while it might be satisfying to add another request for it, Matt is already cranking on it as fast as he can and more requests for it are likely to have the opposite of the intended effect. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Hi James, After investigating this problem a bit I'd suggest avoiding deploying RAID-Z until this issue is resolved. I anticipate having it fixed in build 124. Apologies for the inconvenience. Adam On Aug 28, 2009, at 8:20 PM, James Lever wrote: On 28/08/2009, at 3:23 AM, Adam Leventhal wrote: There appears to be a bug in the RAID-Z code that can generate spurious checksum errors. I'm looking into it now and hope to have it fixed in build 123 or 124. Apologies for the inconvenience. Are the errors being generated likely to cause any significant problem running 121 with a RAID-Z volume or should users of RAID-Z* wait until this issue is resolved? cheers, James -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snv_110 - snv_121 produces checksum errors on Raid-Z pool
Hey Bob, I have seen few people more prone to unsubstantiated conjecture than you. The raidz checksum code was recently reworked to add raidz3. It seems likely that a subtle bug was added at that time. That appears to be the case. I'm investigating the problem and hope to have an update to the list either later today or tomorrow. Adam -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 7110: Would it self upgrade the system zpool?
Hi Trevor, We intentionally install the system pool with an old ZFS version and don't provide the ability to upgrade. We don't need or use (or even expose) any of the features of the newer versions so using a newer version would only create problems rolling back to earlier releases. Adam On Sep 2, 2009, at 7:01 PM, Trevor Pretty wrote: Just Curious The 7110 I've on loan has an old zpool. I *assume* because it's been upgraded and it gives me the ability to downgrade. Anybody know if I delete the old version of Amber Road whether the pool would then upgrade (I don't want to do it as I want to show the up/downgrade feature). OS pool:- pool: system state: ONLINE status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable. And yes I may have invalidated my support. If you have a 7000 box don't ask me how to access the system like this, you can see the warning. Remember I've a loan box and am just being nosey, a sort of looking under the bonnet and going OOOHHH an engine, but being too scared to even pull the dip stick :-)

+---------------------------------------------------------------------------+
| You are entering the operating system shell. By confirming this action in |
| the appliance shell you have agreed that THIS ACTION MAY VOID ANY SUPPORT |
| AGREEMENT. If you do not agree to this -- or do not otherwise understand  |
| what you are doing -- you should type exit at the shell prompt. EVERY     |
| COMMAND THAT YOU EXECUTE HERE IS AUDITED, and support personnel may use   |
| this audit trail to substantiate invalidating your support contract. The  |
| operating system shell is NOT a supported mechanism for managing this     |
| appliance, and COMMANDS EXECUTED HERE MAY DO IRREPARABLE HARM.            |
|                                                                           |
| NOTHING SHOULD BE ATTEMPTED HERE BY UNTRAINED SUPPORT PERSONNEL UNDER ANY |
| CIRCUMSTANCES. This appliance is a non-traditional operating system       |
| environment, and expertise in a traditional operating system environment  |
| in NO WAY constitutes training for supporting this appliance. THOSE WITH  |
| EXPERTISE IN OTHER SYSTEMS -- HOWEVER SUPERFICIALLY SIMILAR -- ARE MORE   |
| LIKELY TO MISTAKENLY EXECUTE OPERATIONS HERE THAT WILL DO IRREPARABLE     |
| HARM. Unless you have been explicitly trained on supporting this          |
| appliance via the operating system shell, you should immediately return   |
| to the appliance shell.                                                   |
|                                                                           |
| Type exit now to return to the appliance shell.                           |
+---------------------------------------------------------------------------+

Trevor www.eagle.co.nz This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
 ___ ___ ___
|   |   |   |        P = parity
| P | D | D |  LBAs  D = data
|___|___|___|   |    X = skipped sector
|   |   |   |   |
| X | P | D |   v
|___|___|___|
|   |   |   |
| D | X |   |
|___|___|___|

The logic for the optional IOs effectively (though not literally) in this case would fill in the next LBA on the disk with a 0:

 ___ ___ ___
|   |   |   |        P = parity
| P | D | D |  LBAs  D = data
|___|___|___|   |    X = skipped sector
|   |   |   |   |    0 = zero-data from aggregation
| 0 | P | D |   v
|___|___|___|
|   |   |   |
| D | X |   |
|___|___|___|

We can see the problem when the parity undergoes the swap described above:

    disks
  0   1   2
 ___ ___ ___
|   |   |   |        P = parity
| D | P | D |  LBAs  D = data
|___|___|___|   |    X = skipped sector
|   |   |   |   |    0 = zero-data from aggregation
| X | 0 | P |   v
|___|___|___|
|   |   |   |
| D | X |   |
|___|___|___|

Note that the 0 incorrectly is also swapped, thus inadvertently overwriting a data sector in the subsequent stripe. This only occurs if there is IO aggregation, making it much more likely with small, synchronous IOs. It's also only possible with an odd number (N) of child vdevs since, to induce the problem, the size of the data written must consume a multiple of N-1 sectors _and_ the total number of sectors used for data and parity must be odd (to create the need for a skipped sector). The number of data sectors is simply size / 512 and the number of parity sectors is ceil(size / 512 / (N-1)).

  1) size / 512 = K * (N-1)
  2) size / 512 + ceil(size / 512 / (N-1)) is odd

therefore

  K * (N-1) + K = K * N is odd

If N is even, K * N cannot be odd and therefore the situation cannot arise. If N is odd, it is possible to satisfy (1) and (2). -- Adam Leventhal, Fishworks http://blogs.sun.com/ahl ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
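Conditions (1) and (2) above can be checked by brute force. This is a sketch of the parity argument only, not the RAID-Z code itself:

```python
import math

# Brute-force check of the argument above: with N child vdevs and single
# parity, a data + parity total that is odd (requiring a skipped sector)
# can only arise when N is odd.
def needs_skipped_sector(data_sectors, n):
    """True if data + parity sectors total an odd count."""
    parity = math.ceil(data_sectors / (n - 1))
    return (data_sectors + parity) % 2 == 1

for n in (3, 4, 5, 6, 7, 8):
    # Condition (1): data must be a multiple of N-1 sectors.
    possible = any(needs_skipped_sector(k * (n - 1), n) for k in range(1, 21))
    print(n, possible)   # True only for odd N
```

This matches the conclusion above: even child counts cannot hit the bug, odd ones can.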
Re: [zfs-discuss] Problem with RAID-Z in builds snv_120 - snv_123
Hey Simon,

Thanks for the info on this. Some people, including myself, reported seeing checksum errors within mirrors too. Is it considered that these checksum errors within mirrors could also be related to this bug, or is there another bug related to checksum errors within mirrors that I should take a look at?

Absolutely not. That is an unrelated issue. This problem is isolated to RAID-Z.

And good luck with the fix for build 124. Are we talking days or weeks for the fix to be available, do you think? :)

Days or hours.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] RAIDZ versus mirrored
On Thu, Sep 17, 2009 at 01:32:43PM +0200, Eugen Leitl wrote:

reasons), you will lose 2 disks worth of storage to parity, leaving 12 disks worth of data. With raid10 you will lose half, 7 disks to parity/redundancy. With two raidz2 sets, you will get (5+2)+(5+2), that is 5+5 disks worth of storage and 2+2 disks worth of redundancy. The actual redundancy/parity is spread over all disks, not like raid3 which has a dedicated parity disk.

So raidz3 has a dedicated parity disk? I couldn't see that from skimming http://blogs.sun.com/ahl/entry/triple_parity_raid_z

Note that Tomas was talking about RAID-3, not raidz3. To summarize the RAID levels:

  RAID-0  striping
  RAID-1  mirror
  RAID-2  ECC (basically not used)
  RAID-3  bit-interleaved parity (basically not used)
  RAID-4  block-interleaved parity
  RAID-5  block-interleaved distributed parity
  RAID-6  block-interleaved double distributed parity

raidz1 is most like RAID-5; raidz2 is most like RAID-6. There's no RAID level that covers more than two parity disks; raidz3 is most like RAID-6, but with triple distributed parity.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] Checksums
On Fri, Oct 23, 2009 at 06:55:41PM -0500, Tim Cook wrote:

So, from what I gather, even though the documentation appears to state otherwise, default checksums have been changed to SHA256. Making that assumption, I have two questions.

That's false. The default checksum has changed from fletcher2 to fletcher4; that is to say, the definition of the value of 'on' has changed.

First, is the default updated from fletcher2 to SHA256 automatically for a pool that was created with an older version of zfs and then upgraded to the latest? Second, would all of the blocks be re-checksummed with a zfs send/receive on the receiving side?

As with all property changes, new writes get the new properties. Old data is not rewritten.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] Checksums
Thank you for the correction. My next question is, do you happen to know what the overhead difference between fletcher4 and SHA256 is? Is the checksumming multi-threaded in nature? I know my fileserver has a lot of spare cpu cycles, but it would be good to know if I'm going to take a substantial hit in throughput moving from one to the other.

Tim,

That all really depends on your specific system and workload. As with any performance-related matter, experimentation is vital for making your final decision.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] zfs code and fishworks fork
With that said I'm concerned that there appears to be a fork between the opensource version of ZFS and ZFS that is part of the Sun/Oracle FishWorks 7nnn series appliances. I understand (implicitly) that Sun (/Oracle) as a commercial concern, is free to choose their own priorities in terms of how they use their own IP (Intellectual Property) -- in this case, the source for the ZFS filesystem.

Hey Al,

I'm unaware of specific plans for management either at Sun or at Oracle, but from an engineering perspective suffice it to say that it is simpler and therefore more cost effective to develop for a single, unified code base, to amortize the cost of testing those modifications, and to leverage the enthusiastic ZFS community to assist with the development and testing of ZFS. Again, this isn't official policy, just the simple facts on the ground from engineering.

I'm not sure what would lead you to believe that there is a fork between the open source / OpenSolaris ZFS and what we have in Fishworks. Indeed, we've made efforts to make sure there is a single ZFS for the reason stated above. Any differences that exist are quickly migrated to ON, as you can see from the consistent work of Eric Schrock.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] will deduplication know about old blocks?
Hi Kjetil,

Unfortunately, dedup will only apply to data written after the setting is enabled. That also means that new blocks cannot dedup against old blocks regardless of how they were written. There is therefore no way to prepare your pool for dedup -- you just have to enable it when you have the new bits.

Adam

On Dec 9, 2009, at 3:40 AM, Kjetil Torgrim Homme wrote:

I'm planning to try out deduplication in the near future, but started wondering if I can prepare for it on my servers. One thing which struck me was that I should change the checksum algorithm to sha256 as soon as possible. But I wonder -- is that sufficient? Will the dedup code know about old blocks when I store new data? Let's say I have an existing file img0.jpg. I turn on dedup, and copy it twice, to img0a.jpg and img0b.jpg. Will all three files refer to the same block(s), or will only img0a and img0b share blocks?

-- Kjetil T. Homme, Redpill Linpro AS - Changing the game

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] will deduplication know about old blocks?
What happens if you snapshot, send, destroy, recreate (with dedup on this time around) and then write the contents of the cloned snapshot to the various places in the pool -- which properties are in the ascendancy here? The host pool or the contents of the clone? The host pool, I assume, because clone contents are (in this scenario) just some new data?

The dedup property applies to all writes, so the settings for the pool of origin don't matter -- just those on the destination pool.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings
Hi Giridhar,

The size reported by ls can include things like holes in the file. What space usage does the zfs(1M) command report for the filesystem?

Adam

On Dec 16, 2009, at 10:33 PM, Giridhar K R wrote:

Hi, Reposting as I have not gotten any response. Here is the issue. I created a zpool with 64k recordsize and enabled dedup on it:

  zpool create -O recordsize=64k TestPool device1
  zfs set dedup=on TestPool

I copied files onto this pool over nfs from a windows client. Here is the output of zpool list:

  NAME       SIZE  ALLOC   FREE   CAP  DEDUP  HEALTH  ALTROOT
  TestPool   696G  19.1G   677G    2%  1.13x  ONLINE  -

I ran ls -l /TestPool and saw the total size reported as 51,193,782,290 bytes. The alloc size reported by zpool along with the DEDUP of 1.13x does not add up to 51,193,782,290 bytes. According to the DEDUP (dedup ratio) the amount of data copied is 21.58G (19.1G * 1.13). Here is the output from zdb -DD:

  DDT-sha256-zap-duplicate: 33536 entries, size 272 on disk, 140 in core
  DDT-sha256-zap-unique: 278241 entries, size 274 on disk, 142 in core

  DDT histogram (aggregated over all DDTs):

  bucket             allocated                       referenced
  ______  ______________________________  ______________________________
  refcnt  blocks  LSIZE   PSIZE   DSIZE   blocks  LSIZE   PSIZE   DSIZE
  ------  ------  -----   -----   -----   ------  -----   -----   -----
       1    272K  17.0G   17.0G   17.0G     272K  17.0G   17.0G   17.0G
       2   32.7K  2.05G   2.05G   2.05G    65.6K  4.10G   4.10G   4.10G
       4      15   960K    960K    960K       71  4.44M   4.44M   4.44M
       8       4   256K    256K    256K       53  3.31M   3.31M   3.31M
      16       1    64K     64K     64K       16     1M      1M      1M
     512       1    64K     64K     64K      854  53.4M   53.4M   53.4M
      1K       1    64K     64K     64K    1.08K  69.1M   69.1M   69.1M
      4K       1    64K     64K     64K    5.33K   341M    341M    341M
   Total    304K  19.0G   19.0G   19.0G     345K  21.5G   21.5G   21.5G

  dedup = 1.13, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.13

Am I missing something? Your inputs are much appreciated.
Thanks, Giri

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] ZFS Dedupe reporting incorrect savings
Thanks for the response, Adam. Are you talking about zfs list? It displays 19.6 as allocated space. What does ZFS treat as a hole, and how does it identify one?

ZFS will compress blocks of zeros down to nothing and treat them like sparse files. 19.6 is pretty close to your computed value. Does your pool happen to be 10+1 RAID-Z?

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
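For what it's worth, the Total row of the zdb -DD histogram already accounts for the reported ratio; a quick sketch of the arithmetic (numbers copied from that output):

```python
# Totals from the DDT histogram: space actually allocated after dedup
# (Total, PSIZE) vs. space referenced by all block pointers (Total, LSIZE).
alloc_gb = 19.0
refd_gb = 21.5

# The dedup ratio is simply referenced over allocated.
ratio = refd_gb / alloc_gb
print(f"dedup = {ratio:.2f}")   # 1.13, matching zpool list and zdb -DD
```

The gap between this 21.5G and the ~47.7 GiB that ls reported is what the hole/sparse-file question above is probing at.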
Re: [zfs-discuss] raidz data loss stories?
Hey James,

Personally, I think mirroring is safer (and 3-way mirroring) than raidz/z2/5. All my boot-from-zfs systems have 3-way mirrored root/usr/var disks (using 9 disks) but all my data partitions are 2-way mirrors (usually 8 disks or more and a spare.)

Double-parity (or triple-parity) RAID is certainly more resilient against some failure modes than 2-way mirroring. For example, bit errors can arise at a certain rate from disks. In the case of a disk failure in a mirror, it's possible to encounter a bit error such that data is lost. I recently wrote an article for ACM Queue that examines recent trends in hard drives and makes the case for triple-parity RAID. It's at least peripherally relevant to this conversation:

http://blogs.sun.com/ahl/entry/acm_triple_parity_raid

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
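To put rough numbers on the bit-error concern, here is a back-of-the-envelope sketch (the 1-in-10^15 unrecoverable-error rate is an illustrative spec-sheet figure I'm assuming, not a measurement of any particular drive):

```python
def p_error_during_resilver(capacity_bytes, bits_per_error=1e15):
    """Chance of hitting at least one unrecoverable bit error while
    reading the entire surviving half of a 2-way mirror during a
    resilver, assuming independent per-bit errors."""
    bits = capacity_bytes * 8
    per_bit = 1.0 / bits_per_error
    return 1.0 - (1.0 - per_bit) ** bits

# Reading a full 1 TB mirror half: roughly a 0.8% chance per resilver.
# The risk grows with drive capacity, which is the article's point.
print(p_error_during_resilver(1e12))
```

With double or triple parity, a single such error during reconstruction is still correctable, which is what makes the extra parity worth its cost as drives get larger.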
Re: [zfs-discuss] raidz data loss stories?
Applying classic RAID terms to zfs is just plain wrong and misleading, since zfs does not directly implement these classic RAID approaches even though it re-uses some of the algorithms for data recovery. Details do matter.

That's not entirely true, is it?

* RAIDZ is RAID5 + checksum + COW
* RAIDZ2 is RAID6 + checksum + COW
* A stack of mirror vdevs is RAID10 + checksum + COW

Others have noted that RAID-Z isn't really the same as RAID-5, and RAID-Z2 isn't the same as RAID-6, because RAID-5 and RAID-6 define not just the number of parity disks (which would have made far more sense in my mind) but also include in the definition a notion of how the data and parity are laid out. The RAID levels were used to describe groupings of existing implementations and conflate things like the number of parity devices with, say, how parity is distributed across devices. For example, RAID-Z1 lays out data most like RAID-3 (a single block is carved up and spread across many disks) but distributes parity as RAID-5 requires, though in a different manner. It's an unfortunate state of affairs, which is why further RAID levels should identify only the most salient aspect (the number of parity devices), or we should use unambiguous terms like single-parity and double-parity RAID.

If we can compare apples and oranges, would your recommendation (use raidz2 and/or raidz3) be the same when comparing to a mirror with the same number of drives? In other words, does a 2-drive mirror compare to raidz1 the same as a 3-drive mirror compares to raidz2 and a 4-drive mirror compares to raidz3? If you were an enterprise (in other words, cared about perf), why would you ever use raidz instead of throwing more drives at the problem and doing mirroring with identical parity?

You're right that a mirror is a degenerate form of raidz1, for example, but mirrors allow for specific optimizations. While the redundancy would be the same, the performance would not.
Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Re: [zfs-discuss] raidz stripe size (not stripe width)
Hi Brad,

RAID-Z will carve up the 8K block into chunks at the granularity of the sector size -- today 512 bytes, but soon going to 4K. In this case a 9-disk RAID-Z vdev will look like this:

  | P | D00 | D01 | D02 | D03 | D04 | D05 | D06 | D07 |
  | P | D08 | D09 | D10 | D11 | D12 | D13 | D14 | D15 |

1K per device, with an additional 1K for parity.

Adam

On Jan 4, 2010, at 3:17 PM, Brad wrote:

If an 8K file system block is written on a 9-disk raidz vdev, how is the data distributed (written) between all devices in the vdev, since a zfs write is one continuous IO operation? Is it distributed evenly (1.125KB) per device?

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl
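The carving above can be sketched as arithmetic (a simplified model of the layout shown; it ignores the skipped-sector padding discussed elsewhere on this list):

```python
import math

def raidz_chunks(block_bytes, n_disks, nparity=1, sector=512):
    """How raidz splits one logical block: rows of n_disks columns,
    each row holding nparity parity sectors plus data sectors."""
    data_sectors = block_bytes // sector         # 8K / 512 = 16
    data_cols = n_disks - nparity                # 8 data columns per row
    rows = math.ceil(data_sectors / data_cols)   # 2 rows
    parity_sectors = rows * nparity              # 2 parity sectors
    return rows, data_sectors, parity_sectors

# 2 rows of 16 data sectors (1K per data disk) plus 2 parity sectors (1K)
print(raidz_chunks(8192, 9))   # (2, 16, 2)
```

So the data is spread at sector granularity rather than in nine equal 1.125KB slices, which is what the question was getting at.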
Re: [zfs-discuss] New ZFS Intent Log (ZIL) device available - Beta program now open!
Hey Chris,

The DDRdrive X1 OpenSolaris device driver is now complete; please join us in our first-ever ZFS Intent Log (ZIL) beta test program. A select number of X1s are available for loan; preferred candidates would have a validation background and/or a true passion for torturing new hardware/drivers :-) We are singularly focused on the ZIL device market, so a test environment bound by synchronous writes is required. The beta program will provide extensive technical support and a unique opportunity to have direct interaction with the product designers.

Congratulations! This is great news for ZFS. I'll be very interested to see the results members of the community can get with your device as part of their pool. COMSTAR iSCSI performance should be dramatically improved in particular.

Adam

-- Adam Leventhal, Fishworks http://blogs.sun.com/ahl