Re: [zfs-discuss] Any company willing to support a 7410 ?

2012-07-19 Thread Gordon Ross
On Thu, Jul 19, 2012 at 5:38 AM, sol a...@yahoo.com wrote:
 Other than Oracle do you think any other companies would be willing to take
 over support for a clustered 7410 appliance with 6 JBODs?

 (Some non-Oracle names which popped out of google:
 Joyent/Coraid/Nexenta/Greenbytes/NAS/RackTop/EraStor/Illumos/???)


I'm not sure, but I think there are people running NexentaStor on that h/w.
If not, then on something pretty close.  NS supports clustering, etc.


-- 
Gordon Ross g...@nexenta.com
Nexenta Systems, Inc.  www.nexenta.com
Enterprise class storage for everyone
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Creating NFSv4/ZFS XATTR through dirfd through /proc not allowed?

2012-07-13 Thread Gordon Ross
On Fri, Jul 13, 2012 at 2:16 AM, ольга крыжановская
olga.kryzhanov...@gmail.com wrote:
 Can some one here explain why accessing a NFSv4/ZFS xattr directory
 through proc is forbidden?

[...]
 truss says the syscall fails with
 open(/proc/3988/fd/10/myxattr, O_WRONLY|O_CREAT|O_TRUNC, 0666) Err#13 EACCES

 Accessing files or directories through /proc/$$/fd/ from a shell
 otherwise works, only the xattr directories cause trouble. Native C
 code has the same problem.

 Olga

Does runat let you see those xattr files?
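For what it's worth, runat(1) runs a command inside a file's extended attribute
namespace, so a quick check (file name is just an example) would be:

  runat somefile ls -l        # list the file's xattr directory
  runat somefile cat myxattr  # read an individual xattr as a plain file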

-- 
Gordon Ross g...@nexenta.com
Nexenta Systems, Inc.  www.nexenta.com
Enterprise class storage for everyone
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Entire client hangs every few seconds

2011-07-26 Thread Gordon Ross
Are the disk active lights typically ON when this happens?

On Tue, Jul 26, 2011 at 3:27 PM, Garrett D'Amore garr...@damore.org wrote:
 This is actually a recently known problem, and a fix for it is in the
 3.1 version, which should be available any minute now, if it isn't
 already available.

 The problem has to do with some allocations which are sleeping, and jobs
 in the ZFS subsystem get backed behind some other work.

 If you have adequate system memory, you are less likely to see this
 problem, I think.

         - Garrett


 On Tue, 2011-07-26 at 08:29 -0700, Rocky Shek wrote:
 Ian,

 Did you enable DeDup?

 Rocky


 -Original Message-
 From: zfs-discuss-boun...@opensolaris.org
 [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian D
 Sent: Tuesday, July 26, 2011 7:52 AM
 To: zfs-discuss@opensolaris.org
 Subject: [zfs-discuss] Entire client hangs every few seconds

 Hi all-
 We've been experiencing a very strange problem for two days now.

 We have three client (Linux boxes) connected to a ZFS box (Nexenta) via
 iSCSI.  Every few seconds (seems random), iostat shows the clients go from
 a normal 80K+ IOPS to zero.  It lasts up to a few seconds and things are
 fine again.  When that happens, I/O on the local disks stops too, even the
 totally unrelated ones. How can that be?  All three clients show the same
 pattern and everything was fine prior to Sunday.  Nothing has changed on
 either the clients or the server. The ZFS box is not even close to being
 saturated, nor is the network.

 We don't even know where to start... any advice?
 Ian


 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] SSD vs hybrid drive - any advice?

2011-07-21 Thread Gordon Ross
I'm looking to upgrade the disk in a high-end laptop (so called
desktop replacement type).  I use it for development work,
running OpenIndiana (native) with lots of ZFS data sets.

These hybrid drives look kind of interesting, i.e. for about $100,
one can get:
 Seagate Momentus XT ST95005620AS 500GB 7200 RPM 2.5" SATA 3.0Gb/s
with NCQ Solid State Hybrid Drive
 http://www.newegg.com/Product/Product.aspx?Item=N82E16822148591
And then for about $400 one can get a 256GB SSD, such as:
 Crucial M4 CT256M4SSD2 2.5" 256GB SATA III MLC Internal Solid State
Drive (SSD)
 http://www.newegg.com/Product/Product.aspx?Item=N82E16820148443

Anyone have experience with either one?  (good or bad)

Any opinions on whether the lower capacity and higher cost of
the SSD are justified in terms of performance for things
like software builds, etc.?

Thanks,
Gordon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [illumos-Developer] revisiting aclmode options

2011-07-19 Thread Gordon Ross
On Mon, Jul 18, 2011 at 9:44 PM, Paul B. Henson hen...@acm.org wrote:
 Now that illumos has restored the aclmode option to zfs, I would like to
 revisit the topic of potentially expanding the suite of available modes.
[...]

At one point, I was experimenting with some code for smbfs that would
invent the mode bits (remember, smbfs does not get mode bits from
the remote server, only the ACL).  I ended up discarding it there due to
objections from reviewers, but the idea might be useful for people who
really don't care about mode bits.  I'll attempt a description below.


The idea:  A new aclmode setting called "discard", meaning that the
users don't care at all about the traditional mode bits.  A dataset with
aclmode=discard would have the chmod system call and NFS setattr
do absolutely nothing to the mode bits.  The getattr call would return
mode bits derived from the ACL.  (This derivation would actually happen
when an ACL is stored, not during getattr.)  The mode bits would be
derived from the ACL such that the mode represents the greatest
possible access that might be allowed by the ACL, without any
consideration of deny entries or group memberships.

In detail, that mode derivation might be:

The mode's owner part would be the union of access granted by any
owner type ACEs in the ACL and any ACEs where the ACE owner
matches the file owner.  The mode's group part would be the union
of access granted by any group ACEs and any ACEs whose type is
unknown (all SIDs are of unknown type).  The mode's other part
would be the access granted by an Everyone ACE, if present.

Would that be of any use?

Gordon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-17 Thread Ross Walker
On Jun 17, 2011, at 7:06 AM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 I will only say, that regardless of whether or not that is or ever was true,
 I believe it's entirely irrelevant.  Because your system performs read and
 write caching and buffering in ram, the tiny little ram on the disk can't
 possibly contribute anything.

You would be surprised.

The on-disk buffer is there so data is ready when the hard drive head lands; 
without it, the drive's average rotational latency will trend higher due to 
missed landings, because the data wasn't in the buffer at the right time.

The read buffer allows the disk to continuously read sectors whether the 
system bus is ready to transfer or not. Without it, sequential reads wouldn't 
last long enough to reach max throughput before they would have to pause 
because of bus contention, and would then suffer a rotational latency hit which 
would kill read performance.

Try disabling the on-board write or read cache and see how your sequential IO 
performs; you'll see just how valuable those puny caches are.
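For example, on a Linux test box (device name is just an example) the experiment
would look something like this; hdparm's -W and -A flags toggle the drive's write
cache and read look-ahead:

  hdparm -W0 /dev/sda                            # disable the drive's write cache
  hdparm -A0 /dev/sda                            # disable the drive's read look-ahead
  dd if=/dev/sda of=/dev/null bs=1M count=4096   # re-run the sequential read test
  hdparm -W1 -A1 /dev/sda                        # turn both back on afterwards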

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-17 Thread Ross Walker
On Jun 16, 2011, at 7:23 PM, Erik Trimble erik.trim...@oracle.com wrote:

 On 6/16/2011 1:32 PM, Paul Kraus wrote:
 On Thu, Jun 16, 2011 at 4:20 PM, Richard Elling
 richard.ell...@gmail.com  wrote:
 
 You can run OpenVMS :-)
 Since *you* brought it up (I was not going to :-), how does VMS'
 versioning FS handle those issues ?
 
 It doesn't, per se.  VMS's filesystem has a versioning concept (i.e. every 
 time you do a close() on a file, it creates a new file with the version 
 number appended, e.g.  foo;1  and foo;2  are the same file, different 
 versions).  However, it is completely missing the rest of the features we're 
 talking about, like data *consistency* in that file. It's still up to the app 
 using the file to figure out what data consistency means, and such.  Really, 
 all VMS adds is versioning, nothing else (no API, no additional features, 
 etc.).

I believe NTFS was built on the same concept of file streams the VMS FS used 
for versioning.

It's a very simple versioning system.

Personally I use SharePoint, but there are other content management systems 
out there that provide what you're looking for, so no need to bring out the crypt 
keeper.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dual protocal on one file system?

2011-03-16 Thread Ross Walker
On Mar 16, 2011, at 8:13 AM, Paul Kraus p...@kraus-haus.org wrote:

 On Tue, Mar 15, 2011 at 11:00 PM, Edward Ned Harvey
 opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
 
 BTW, what is the advantage of the kernel cifs server as opposed to samba?
 It seems, years ago, somebody must have been standing around and saying
 "There is a glaring deficiency in samba, and we need to solve it."
 
Complete integration with AD/NTFS from the client perspective. In
 other words, the Sun CIFS server really does look like a genuine NTFS
 volume shared via CIFS in terms of ACLs. Snapshots even show up as
 previous versions in explorer.
 
I have never seen SAMBA provide more than just authentication
 integration with AD.
 
The in kernel CIFS server is also supposed to be much faster,
 although I have not tested that yet.

Samba has all those features as well. It has native support for different 
platform ACLs (Linux/Solaris/BSD) and supports mapping POSIX perms with 
platform ACLs to present a quasi NT ACL that reflects the native permissions of 
the host.

Samba even has modules for mapping NT RIDs to *nix UIDs/GIDs, as well as a module 
that supports Previous Versions using the host's native snapshot method.

The one glaring deficiency Samba has though, in Sun's eyes not mine, is that it 
runs in user space, though I believe that's just the cover song for "it wasn't 
invented here."

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-25 Thread Ross Walker
On Dec 24, 2010, at 1:21 PM, Richard Elling richard.ell...@gmail.com wrote:

 Latency is what matters most.  While there is a loose relationship between 
 IOPS
 and latency, you really want low latency.  For 15krpm drives, the average 
 latency
 is 2ms for zero seeks.  A decent SSD will beat that by an order of magnitude.

Actually I'd say that latency has a direct relationship to IOPS, because it's 
the time it takes to perform an IO that determines how many IOs per second can 
be performed.
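As a back-of-the-envelope example (my numbers, not any vendor's): a 15krpm drive
with ~2ms average rotational latency and ~3.5ms average seek spends about 5.5ms
per random IO:

  echo "1000 / (2 + 3.5)" | bc -l    # ~180 random IOPS per spindle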

Ever notice how storage vendors list their max IOPS for 512-byte sequential IO 
workloads and sustained throughput for 1MB+ sequential IO workloads? Only SSD 
makers list their random IOPS numbers and their 4K IO workload numbers.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ... open source moving forward?

2010-12-15 Thread Ross Walker
On Dec 15, 2010, at 6:48 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
wrote:

 On Wed, 15 Dec 2010, Linder, Doug wrote:
 
 But it sure would be nice if they spared everyone a lot of effort and 
 annoyance and just GPL'd ZFS.  I think the goodwill generated
 
 Why do you want them to GPL ZFS?  In what way would that save you annoyance?

I actually think Doug was trying to say he wished Oracle would open the 
development and make the source code open-sourced, not necessarily GPL'd.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OpenIndiana-discuss] iops...

2010-12-08 Thread Ross Walker
On Dec 7, 2010, at 9:49 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Ross Walker [mailto:rswwal...@gmail.com]
 
 Well besides databases there are VM datastores, busy email servers, busy
 ldap servers, busy web servers, and I'm sure the list goes on and on.
 
 I'm sure it is much harder to list servers that are truly sequential in IO
 then
 random. This is especially true when you have thousands of users hitting
 it.
 
 Depends on the purpose of your server.  For example, I have a ZFS server
 whose sole purpose is to receive a backup data stream from another machine,
 and then write it to tape.  This is a highly sequential operation, and I use
 raidz.
 
 Some people have video streaming servers.  And http/ftp servers with large
 files.  And a fileserver which is the destination for laptop whole-disk
 backups.  And a repository that stores iso files and rpm's used for OS
 installs on other machines.  And data capture from lab equipment.  And
 packet sniffer / compliance email/data logger.
 
 and I'm sure the list goes on and on.  ;-)

Ok, single-stream backup servers are one type, but as soon as you have multiple 
streams, even for large files, IOPS trumps throughput to a degree; of course, if 
throughput is very bad then that's no good either.

Knowing your workload is key, or have enough $$ to implement RAID10 everywhere.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-12-08 Thread Ross Walker
On Dec 8, 2010, at 11:41 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 For anyone who cares:
 
 I created an ESXi machine.  Installed two guest (centos) machines and
 vmware-tools.  Connected them to each other via only a virtual switch.  Used
 rsh to transfer large quantities of data between the two guests,
 unencrypted, uncompressed.  Have found that ESXi virtual switch performance
 peaks around 2.5Gbit.
 
 Also, if you have a NFS datastore, which is not available at the time of ESX
 bootup, then the NFS datastore doesn't come online, and there seems to be no
 way of telling ESXi to make it come online later.  So you can't auto-boot
 any guest, which is itself stored inside another guest.
 
 So basically, if you want a layer of ZFS in between your ESX server and your
 physical storage, then you have to have at least two separate servers.  And
 if you want anything resembling actual disk speed, you need infiniband,
 fibre channel, or 10G ethernet.  (Or some really slow disks.)   ;-)

Besides the chicken-and-egg scenario that Ed mentions, there is also the CPU 
usage of running the storage virtualized. You might find that as you get more 
machines on the storage, performance decreases a lot faster than it otherwise 
would standalone, as the storage VM competes with the very machines it is 
supposed to be serving.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OpenIndiana-discuss] iops...

2010-12-07 Thread Ross Walker
On Dec 7, 2010, at 12:46 PM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote:

 Bear a few things in mind:
 
 iops is not iops.
 snip/
 
 I am totally aware of these differences, but it seems some people think RAIDz 
 is nonsense unless you don't need speed at all. My testing shows (so far) 
 that the speed is quite good, far better than single drives. Also, as Eric 
 said, those speeds are for random i/o. I doubt there is very much out there 
 that is truly random i/o except perhaps databases, but then, I would never 
 use raid5/raidz for a DB unless at gunpoint.

Well besides databases there are VM datastores, busy email servers, busy ldap 
servers, busy web servers, and I'm sure the list goes on and on.

I'm sure it is much harder to list servers that are truly sequential in IO than 
random. This is especially true when you have thousands of users hitting it.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-17 Thread Ross Walker
On Wed, Nov 17, 2010 at 3:00 PM, Pasi Kärkkäinen pa...@iki.fi wrote:
 On Wed, Nov 17, 2010 at 10:14:10AM +, Bruno Sousa wrote:
    Hi all,

    Let me tell you all that the MC/S *does* make a difference...I had a
    windows fileserver using an ISCSI connection to a host running snv_134
    with an average speed of 20-35 mb/s...After the upgrade to snv_151a
    (Solaris 11 express) this same fileserver got a performance boost and now
    has an average speed of 55-60mb/s.

    Not double performance, but WAY better , specially if we consider that
    this performance boost was purely software based :)


 Did you verify you're using more connections after the update?
 Or was is just *other* COMSTAR (and/or kernel) updates making the difference..

This is true. If someone wasn't utilizing 1Gbps before MC/S then going
to MC/S won't give you more, as you weren't using what you had (in
fact the added latency of MC/S may give you less!).

I am going to say that the speed improvement from snv_134 to snv_151a was due to
OS and COMSTAR improvements and not MC/S.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-16 Thread Ross Walker
On Nov 16, 2010, at 4:04 PM, Tim Cook t...@cook.ms wrote:

 
 
 On Wed, Nov 17, 2010 at 7:56 AM, Miles Nordin car...@ivy.net wrote:
  tc == Tim Cook t...@cook.ms writes:
 
tc Channeling Ethernet will not make it any faster. Each
tc individual connection will be limited to 1gbit.  iSCSI with
tc mpxio may work, nfs will not.
 
 well...probably you will run into this problem, but it's not
 necessarily totally unsolved.
 
 I am just regurgitating this list again, but:
 
  need to include L4 port number in the hash:
  
 http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb
  port-channel load-balance mixed  -- for L2 etherchannels
  mls ip cef load-sharing full -- for L3 routing (OSPF ECMP)
 
  nexus makes all this more complicated.  there are a few ways that
  seem they'd be able to accomplish ECMP:
   FTag flow markers in ``FabricPath'' L2 forwarding
   LISP
   MPLS
  the basic scheme is that the L4 hash is performed only by the edge
  router and used to calculate a label.  The routing protocol will
  either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of
  per-entire-path ECMP for LISP and MPLS.  unfortunately I don't
   understand these tools well enough to lead you further, but if
  you're not using infiniband and want to do 10way ECMP this is
  probably where you need to look.
 
  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942
  feature added in snv_117, NFS client connections can be spread over multiple 
 TCP connections
  When rpcmod:clnt_max_conns is set to a value  1
  however Even though the server is free to return data on different
  connections, [it does not seem to choose to actually do so] --
  6696163 fixed snv_117
 
  nfs:nfs3_max_threads=32
  in /etc/system, which changes the default 8 async threads per mount to
  32.  This is especially helpful for NFS over 10Gb and sun4v
 
  this stuff gets your NFS traffic onto multiple TCP circuits, which
  is the same thing iSCSI multipath would accomplish.  From there, you
  still need to do the cisco/??? stuff above to get TCP circuits
  spread across physical paths.
 
  
 http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html
-- suspect.  it advises ``just buy 10gig'' but many other places
   say 10G NIC's don't perform well in real multi-core machines
   unless you have at least as many TCP streams as cores, which is
   honestly kind of obvious.  lego-netadmin bias.
 
 
 
 AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.

For iSCSI one just needs to have a second (third or fourth...) iSCSI session on 
a different IP to the target and run mpio/mpxio/mpath, or whatever your OS calls 
multi-pathing.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-16 Thread Ross Walker
On Nov 16, 2010, at 7:49 PM, Jim Dunham james.dun...@oracle.com wrote:

 On Nov 16, 2010, at 6:37 PM, Ross Walker wrote:
 On Nov 16, 2010, at 4:04 PM, Tim Cook t...@cook.ms wrote:
 AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.
 
 For iSCSI one just needs to have a second (third or fourth...) iSCSI session 
 on a different IP to the target and run mpio/mpxio/mpath whatever your OS 
 calls multi-pathing.
 
 MC/S (Multiple Connections per Sessions) support was added to the iSCSI 
 Target in COMSTAR, now available in Oracle Solaris 11 Express. 

Good to know.

The only initiator I know of that supports that is Windows, but with MC/S one 
at least doesn't need MPIO as the initiator handles the multiplexing over the 
multiple connections itself.

Doing multiple sessions and MPIO is supported almost universally though.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-11-01 Thread Ross Walker
On Nov 1, 2010, at 5:09 PM, Ian D rewar...@hotmail.com wrote:

 Maybe you are experiencing this:
 http://opensolaris.org/jive/thread.jspa?threadID=11942
 
 It does look like this... Is this really the expected behaviour?  That's just 
 unacceptable.  It is so bad it sometimes drop connection and fail copies and 
 SQL queries...

Then set the zfs_write_limit_override to a reasonable value.

Depending on the speed of your ZIL and/or backing store (for async IO) you will 
need to limit the write size in such a way that TXG 1 is fully committed before 
TXG 2 fills.

Myself, with a RAID controller with a 512MB BBU write-back cache, I set the 
write limit to 512MB, which allows my setup to commit before it fills.

It also prevents ARC from discarding good read cache data in favor of write 
cache.

Others may have a good calculation based on ARC execution plan timings, disk 
seek and sustained throughput that gives an accurate figure for one's setup; 
otherwise start with a reasonable value, say 1GB, and decrease until the pauses 
stop.
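For reference, this is how I set it on my box; the 512MB value is just my setup's
number, and this tunable only exists on older ZFS releases that still have it:

  # set it live (0x20000000 = 512MB)
  echo "zfs_write_limit_override/Z 0x20000000" | mdb -kw
  # make it persist across reboots
  echo "set zfs:zfs_write_limit_override = 0x20000000" >> /etc/system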

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Excruciatingly slow resilvering on X4540 (build 134)

2010-11-01 Thread Ross Walker
On Nov 1, 2010, at 3:33 PM, Mark Sandrock mark.sandr...@oracle.com wrote:

 Hello,
 
   I'm working with someone who replaced a failed 1TB drive (50% utilized),
 on an X4540 running OS build 134, and I think something must be wrong.
 
 Last Tuesday afternoon, zpool status reported:
 
 scrub: resilver in progress for 306h0m, 63.87% done, 173h7m to go
 
 and a week being 168 hours, that put completion at sometime tomorrow night.
 
 However, he just reported zpool status shows:
 
 scrub: resilver in progress for 447h26m, 65.07% done, 240h10m to go
 
 so it's looking more like 2011 now. That can't be right.
 
 I'm hoping for a suggestion or two on this issue.
 
 I'd search the archives, but they don't seem searchable. Or am I wrong about 
 that?

Some zpool versions have an issue where snapshot creation/deletion during a 
resilver causes it to start over.

Try suspending all snapshot activity during the resilver.
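If the snapshots are coming from the time-slider auto-snapshot services, something
like the following keeps them quiet while the resilver runs (your snapshot source
may differ, e.g. a cron job or a remote zfs send/receive):

  svcadm disable svc:/system/filesystem/zfs/auto-snapshot:frequent
  svcadm disable svc:/system/filesystem/zfs/auto-snapshot:hourly
  svcadm disable svc:/system/filesystem/zfs/auto-snapshot:daily
  zpool status <pool>    # watch the resilver; re-enable the services afterwards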

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vdev failure - pool loss ?

2010-10-19 Thread Ross Walker
On Oct 19, 2010, at 4:33 PM, Tuomas Leikola tuomas.leik...@gmail.com wrote:

 On Mon, Oct 18, 2010 at 8:18 PM, Simon Breden sbre...@gmail.com wrote:
 So are we all agreed then, that a vdev failure will cause pool loss ?
 --
 
 unless you use copies=2 or 3, in which case your data is still safe
 for those datasets that have this option set.

This doesn't prevent pool loss in the face of a vdev failure; it merely reduces 
the likelihood of file loss due to block corruption.

A loss of a vdev (mirror, raidz or non-redundant disk) means the loss of the 
pool.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-15 Thread Ross Walker
On Oct 15, 2010, at 9:18 AM, Stephan Budach stephan.bud...@jvm.de wrote:

 Am 14.10.10 17:48, schrieb Edward Ned Harvey:
 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Toby Thain
 
 I don't want to heat up the discussion about ZFS managed discs vs.
 HW raids, but if RAID5/6 would be that bad, no one would use it
 anymore.
 It is. And there's no reason not to point it out. The world has
 Well, neither one of the above statements is really fair.
 
 The truth is:  radi5/6 are generally not that bad.  Data integrity failures
 are not terribly common (maybe one bit per year out of 20 large disks or
 something like that.)
 
 And in order to reach the conclusion nobody would use it, the people using
 it would have to first *notice* the failure.  Which they don't.  That's kind
 of the point.
 
 Since I started using ZFS in production, about a year ago, on three servers
 totaling approx 1.5TB used, I have had precisely one checksum error, which
 ZFS corrected.  I have every reason to believe, if that were on a raid5/6,
 the error would have gone undetected and nobody would have noticed.
 
 Point taken!
 
 So, what would you suggest, if I wanted to create really big pools? Say in 
 the 100 TB range? That would be quite a number of single drives then, 
 especially when you want to go with zpool raid-1.

A pool consisting of 4-disk raidz vdevs (25% overhead) or 6-disk raidz2 vdevs 
(33% overhead) should deliver the storage and performance for a pool that size, 
versus a pool of mirrors (50% overhead).

You need a lot of spindles to reach 100TB.
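As a rough sketch of the 6-disk raidz2 option, assuming 2TB drives (each vdev
yields ~8TB usable, so roughly 13 vdevs / 78 data disks for ~100TB; device names
are made up):

  zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
    raidz2 c1t6d0 c1t7d0 c2t0d0 c2t1d0 c2t2d0 c2t3d0
  # ...and so on for the remaining raidz2 vdevs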

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-15 Thread Ross Walker
On Oct 15, 2010, at 5:34 PM, Ian D rewar...@hotmail.com wrote:

 Has anyone suggested either removing L2ARC/SLOG
 entirely or relocating them so that all devices are
 coming off the same controller? You've swapped the
 external controller but the H700 with the internal
 drives could be the real culprit. Could there be
 issues with cross-controller IO in this case? Does
 the H700 use the same chipset/driver as the other
 controllers you've tried? 
 
 We'll try that.  We have a couple other devices we could use for the SLOG 
 like a DDRDrive X1 and an OCZ Z-Drive which are both PCIe cards and don't use 
 the local controller.

What mount options are you using on the Linux client for the NFS share?
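For comparison, a fairly typical Linux-side mount for this kind of workload looks
something like the following (export path and sizes are just examples; hard vs
soft, NFSv3 vs v4 and the rsize/wsize values all make a difference):

  mount -t nfs -o rw,hard,intr,tcp,vers=3,rsize=32768,wsize=32768 \
    zfsbox:/tank/share /mnt/share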

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Ross Walker
On Oct 12, 2010, at 8:21 AM, Edward Ned Harvey sh...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach
 
  c3t211378AC0253d0  ONLINE   0 0 0
 
 How many disks are there inside of c3t211378AC0253d0?
 
 How are they configured?  Hardware raid 5?  A mirror of two hardware raid
 5's?  The point is:  This device, as seen by ZFS, is not a pure storage
 device.  It is a high level device representing some LUN or something, which
 is configured  controlled by hardware raid.
 
 If there's zero redundancy in that device, then scrub would probably find
 the checksum errors consistently and repeatably.
 
 If there's some redundancy in that device, then all bets are off.  Sometimes
 scrub might read the good half of the data, and other times, the bad half.
 
 
 But then again, the error might not be in the physical disks themselves.
 The error might be somewhere in the raid controller(s) or the interconnect.
 Or even some weird unsupported driver or something.

If it were a parity based raid set then the error would most likely be 
reproducible, if not detected by the raid controller.

The biggest problem is with hardware mirrors, where the hardware can't detect 
which side of the mirror holds the error.

For mirrors it's always best to use ZFS's built-in mirrors. Otherwise, if I were 
to use HW RAID I would use RAID5/6/50/60, since errors encountered can be 
reproduced; two parity RAIDs mirrored in ZFS would probably provide the best of 
both worlds, though at a steep cost.
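The ZFS-mirror layout would look something like this, assuming the controllers
export the two halves as plain (non-RAID) LUNs; device names are made up:

  zpool create tank mirror c3t0d0 c4t0d0   # let ZFS do the mirroring and self-healing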

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Ross Walker
On Sep 9, 2010, at 8:27 AM, Fei Xu twinse...@hotmail.com wrote:

 
 Service times here are crap. Disks are malfunctioning
 in some way. If
 your source disks can take seconds (or 10+ seconds)
 to reply, then of
 course your copy will be slow. Disk is probably
 having a hard time
 reading the data or something.
 
 
 
 Yeah, that should not go over 15ms.  I just cannot understand why it starts 
 ok with hundred GB files transfered and then suddenly fall to sleep.
 by the way,  WDIDLE time is already disabled which might cause some issue.  
 I've changed to another system to test ZFS send between 8*1TB pool and 4*1TB 
 pool.  hope everythings OK on this case.

This might be the dreaded WD TLER issue. Basically the drive keeps retrying a 
read operation over and over after a bit error, trying to recover from the read 
error itself. With ZFS one really needs to disable this behavior and have the 
drives fail immediately.

Check your drives to see if they have this feature; if so, think about replacing 
the drives in the source pool that have long service times, and make sure this 
behavior is disabled on the destination pool drives.
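If the drives support SCT ERC and you have smartmontools handy, you can check and
cap the recovery time yourself (device path is just an example; 70 = 7.0 seconds):

  smartctl -l scterc /dev/sda          # show the current read/write recovery timers
  smartctl -l scterc,70,70 /dev/sda    # cap error recovery at 7 seconds for read and write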

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ross Walker
On Aug 27, 2010, at 1:04 AM, Mark markwo...@yahoo.com wrote:

 We are using a 7210, 44 disks I believe, 11 stripes of RAIDz sets.  When I 
 installed I selected the best bang for the buck on the speed vs capacity 
 chart.
 
 We run about 30 VM's on it, across 3 ESX 4 servers.  Right now, its all 
 running NFS, and it sucks... sooo slow.

I have a Dell 2950 server with a PERC6 controller with 512MB of write back 
cache and a pool of mirrors made out of 14 15K SAS drives. ZIL is integrated.

This is serving 30 VMs on 3 ESXi hosts and performance is good.

I find the #1 operation is random reads, so I doubt the ZIL will make as much 
difference as a very large L2ARC will. I'd hit that first; it's a cheaper buy. 
Random reads across a theoretically infinitely sized (in comparison to system 
RAM) 7200RPM device are a killer. Cache as much as possible in the hope of 
hitting cache rather than disk.
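On the 7210 that means adding a Readzilla SSD through the appliance UI; on a plain
Solaris/Nexenta box the equivalent is just (device name made up):

  zpool add tank cache c2t0d0    # add an SSD as an L2ARC device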

Breaking your pool into two or three, using different vdev types with different 
types of disks, and tiering your VMs based on their performance profiles would 
also help.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Ross Walker

I'm planning on setting up an NFS server for our ESXi hosts and plan on using a 
virtualized Solaris or Nexenta host to serve ZFS over NFS.

The storage I have available is provided by Equallogic boxes over 10Gbe iSCSI.

I am trying to figure out the best way to provide both performance and 
resiliency given the Equallogic provides the redundancy.

Since I am hoping to provide a 2TB datastore I am thinking of carving out 
either 3 x 1TB LUNs or 6 x 500GB LUNs that will be RDM'd to the storage VM, and 
within the storage server setting up either one raidz vdev with the 1TB LUNs 
(fewer RDMs) or two raidz vdevs with the 500GB LUNs (more fine-grained 
expandability, working in 1TB increments).
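The single-raidz option would look roughly like this, assuming the three RDM'd
LUNs show up in the storage VM under these (made-up) device names:

  zpool create eqlpool raidz c2t0d0 c2t1d0 c2t2d0
  zfs create -o sharenfs=on eqlpool/esx_datastore   # the NFS datastore for the ESXi hosts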

Given the 2GB of write-back cache on the Equallogic I think the integrated ZIL 
would work fine (needs benchmarking though).

The vmdk files themselves won't be backed up (more data than I can store), just 
the essential data contained within, so I would think resiliency would be 
important here.

My questions are these.

Does this setup make sense?

Would I be better off forgoing resiliency for simplicity, putting all my faith 
into the Equallogic to handle data resiliency?

Will this setup perform? Anybody with experience in this type of setup?

-Ross


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Ross Walker
On Aug 21, 2010, at 2:14 PM, Bill Sommerfeld bill.sommerf...@oracle.com wrote:

 On 08/21/10 10:14, Ross Walker wrote:
 I am trying to figure out the best way to provide both performance and 
 resiliency given the Equallogic provides the redundancy.
 
 (I have no specific experience with Equallogic; the following is just generic 
 advice)
 
 Every bit stored in zfs is checksummed at the block level; zfs will not use 
 data or metadata if the checksum doesn't match.

I understand that much and is the reason I picked ZFS for persistent data 
storage.

 zfs relies on redundancy (storing multiple copies) to provide resilience; if 
 it can't independently read the multiple copies and pick the one it likes, it 
 can't recover from bitrot or failure of the underlying storage.

It can't auto-recover, but it will report the failure so it can be restored from 
backup; but since the vmdk files are too big to back up...

 if you want resilience, zfs must be responsible for redundancy.

It must have responsibility for redundancy, yes, but not necessarily full control 
of the disks.

 You imply having multiple storage servers.  The simplest thing to do is 
 export one large LUN from each of two different storage servers, and have ZFS 
 mirror them.

Well... You need to know that the multiple storage servers are acting as a 
single pool with tiered storage levels (SAS 15K in RAID10 and SATA in RAID6), 
and LUNs are auto-tiered across these based on demand performance, so a pool of 
mirrors won't really provide any more performance than a raidz (same physical 
RAID), and raidz will only waste 33% as opposed to 50%.

 While this reduces the available space, depending on your workload, you can 
 make some of it back by enabling compression.
 
 And, given sufficiently recent software, and sufficient memory and/or ssd for 
 l2arc, you can enable dedup.

The host is a blade server with no room for SSDs, but if SSD investment is 
needed in the future I can add an SSD Equallogic box to the storage pool.

 Of course, the effectiveness of both dedup and compression depends on your 
 workload.
 
 Would I be better off forgoing resiliency for simplicity, putting all my 
 faith into the Equallogic to handle data resiliency?
 
 IMHO, no; the resulting system will be significantly more brittle.

Exactly how brittle I guess depends on the Equallogic system.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Ross Walker
On Aug 21, 2010, at 4:40 PM, Richard Elling rich...@nexenta.com wrote:

 On Aug 21, 2010, at 10:14 AM, Ross Walker wrote:
 I'm planning on setting up an NFS server for our ESXi hosts and plan on 
 using a virtualized Solaris or Nexenta host to serve ZFS over NFS.
 
 Please follow the joint EMC+NetApp best practices for VMware ESX servers.
 The recommendations apply to any NFS implementation for ESX.

Thanks, I'll check that out! Always looking for advice on how best to tweak NFS 
for ESX.

I have a current ZFS over NFS implementation, but on direct attached storage 
using Sol10. I will be interested to see how Nexenta compares.

 The storage I have available is provided by Equallogic boxes over 10Gbe 
 iSCSI.
 
 I am trying to figure out the best way to provide both performance and 
 resiliency given the Equallogic provides the redundancy.
 
 Since I am hoping to provide a 2TB datastore I am thinking of carving out 
 either 3 1TB luns or 6 500GB luns that will be RDM'd to the storage VM and 
 within the storage server setting up either 1 raidz vdev with the 1TB luns 
 (less RDMs) or 2 raidz vdevs with the 500GB luns (more fine grained 
 expandability, work in 1TB increments).
 
 Given the 2GB of write-back cache on the Equallogic I think the integrated 
 ZIL would work fine (needs benchmarking though).
 
 This should work fine.
 
 The vmdk files themselves won't be backed up (more data then I can store), 
 just the essential data contained within, so I would think resiliency would 
 be important here.
 
 My questions are these.
 
 Does this setup make sense?
 
 Yes, it is perfectly reasonable.
 
 Would I be better off forgoing resiliency for simplicity, putting all my 
 faith into the Equallogic to handle data resiliency?
 
 I don't have much direct experience with Equillogic, but I would expect that
 they do a reasonable job of protecting data, or they would be out of business.
 
 You can also use the copies parameter to set extra redundancy for the 
 important
 files. ZFS will also tell you if corruption is found in a single file, so 
 that you can 
 recover just the file and not be forced to recover everything else. I think 
 this fits
 into your back strategy.

I thought of the copies parameter, but figured a raidz laid on top of the 
storage pool would only waste 33% instead of 50%; and since this is on top of a 
conceptually single RAID volume, the usual raidz IOPS bottleneck won't come into 
play, because any single LUN's IOPS will be equal to the array's IOPS as a whole.

 Will this setup perform? Anybody with experience in this type of setup?
 
 Many people are quite happy with RAID arrays and still take advantage of 
 the features of ZFS: checksums, snapshots, clones, send/receive, VMware
 integration, etc. The decision of where to implement data protection (RAID) 
 is not as important as the decision to protect your data.  
 
 My advice: protect your data.

Always good advice.

So I suppose this just confirms my analysis.

Thanks,

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-18 Thread Ross Walker
On Aug 18, 2010, at 10:43 AM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
wrote:

 On Wed, 18 Aug 2010, Joerg Schilling wrote:
 
 Linus is right with his primary decision, but this also applies for static
 linking. See Lawrence Rosen for more information, the GPL does not distinct
 between static and dynamic linking.
 
 GPLv2 does not address linking at all and only makes vague references to the 
 program.  There is no insinuation that the program needs to occupy a single 
 address space or mention of address spaces at all. The program could 
 potentially be a composition of multiple cooperating executables (e.g. like 
 GCC) or multiple modules.  As you say, everything depends on the definition 
 of a derived work.
 
 If a shell script may be dependent on GNU 'cat', does that make the shell 
 script a derived work?  Note that GNU 'cat' could be replaced with some 
 other 'cat' since 'cat' has a well defined interface.  A very similar 
 situation exists for loadable modules which have well defined interfaces 
 (like 'cat').  Based on the argument used for 'cat', the mere injection of a 
 loadable module into an execution environment which includes GPL components 
 should not require that module to be distributable under GPL.  The module 
 only needs to be distributable under GPL if it was developed in such a way 
 that it specifically depends on GPL components.

This is how I see it as well.

The big problem is not the insmod'ing of the blob but how it is distributed.

As far as I know this can be circumvented by not including it in the main 
distribution, but instead shipping it through a separate repo to be installed 
afterwards, a la Debian non-free.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-17 Thread Ross Walker
On Aug 16, 2010, at 11:17 PM, Frank Cusack frank+lists/z...@linetwo.net wrote:

 On 8/16/10 9:57 AM -0400 Ross Walker wrote:
 No, the only real issue is the license and I highly doubt Oracle will
 re-release ZFS under GPL to dilute it's competitive advantage.
 
 You're saying Oracle wants to keep zfs out of Linux?

I would if I were them, wouldn't you?

Linux has already eroded the low end of the Solaris business model; if Linux 
had ZFS it could possibly erode the middle tier as well.

Solaris with only high-end customers wouldn't be very profitable (unless 
seriously marked up in price), and thus unsustainable as a business.

Sun didn't get this, but Oracle does.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ross Walker
On Aug 16, 2010, at 9:06 AM, Edward Ned Harvey sh...@nedharvey.com wrote:

 ZFS does raid, and mirroring, and resilvering, and partitioning, and NFS, and 
 CIFS, and iSCSI, and device management via vdev's, and so on.  So ZFS steps 
 on a lot of linux peoples' toes.  They already have code to do this, or that, 
 why should they kill off all these other projects, and turn the world upside 
 down, and bow down and acknowledge that anyone else did anything better than 
 what they did?

Actually ZFS doesn't do NFS/CIFS/iSCSI itself; those share* options merely kick 
off the appropriate OS sharing services to do the work.
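For example (dataset and share names are just illustrative), the share properties
hand the work off to the OS's own NFS and CIFS services:

  zfs set sharenfs=on tank/home            # shared by the OS NFS server
  zfs set sharesmb=name=home tank/home     # shared by the in-kernel CIFS service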

BTRFS also handles the RAID of the hard disks as ZFS does.

No, the only real issue is the license, and I highly doubt Oracle will 
re-release ZFS under the GPL and dilute its competitive advantage.

I think the market NEEDS file system competition in order to drive innovation, 
so it would be beneficial for both FSs to continue together into the future.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ross Walker
On Aug 15, 2010, at 9:44 PM, Peter Jeremy peter.jer...@alcatel-lucent.com 
wrote:

 Given that both provide similar features, it's difficult to see why
 Oracle would continue to invest in both.  Given that ZFS is the more
 mature product, it would seem more logical to transfer all the effort
 to ZFS and leave btrfs to die.

I can see Oracle ejecting BTRFS from its fold, but I seriously doubt it will 
die. BTRFS is now mainlined into the Linux kernel, and I will bet that a lot of 
its development is already coming from outside parties, with Oracle simply 
acting as the commit maintainer.

Linux is an evolving OS; what determines an FS's continued existence is the 
public's adoption rate of that FS. If nobody ends up using it then the kernel 
will drop it, in which case it will eventually die.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and VMware

2010-08-14 Thread Ross Walker
On Aug 14, 2010, at 8:26 AM, Edward Ned Harvey sh...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 #3  I previously believed that vmfs3 was able to handle sparse files
 amazingly well, like, when you create a new vmdk, it appears almost
 instantly regardless of size, and I believed you could copy sparse
 vmdk's
 efficiently, not needing to read all the sparse consecutive zeroes.  I
 was
 wrong.  
 
 Correction:  I was originally right.  ;-)  
 
 In ESXi, if you go to command line (which is busybox) then sparse copies are
 not efficient.
 If you go into vSphere, and browse the datastore, and copy vmdk files via
 gui, then it DOES copy efficiently.
 
 The behavior is the same, regardless of NFS vs iSCSI.
 
 You should always copy files via GUI.  That's the lesson here.

Technically you should always copy vmdk files via vmkfstools on the command 
line. That will give you wire-speed transfers.
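For example, a thin-provisioned clone from the ESX(i) shell (datastore paths are
made up):

  vmkfstools -i /vmfs/volumes/ds1/vm/vm.vmdk /vmfs/volumes/ds2/vm/vm.vmdk -d thin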

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-05 Thread Ross Walker
On Aug 5, 2010, at 11:10 AM, Roch roch.bourbonn...@sun.com wrote:

 
 Ross Walker writes:
 On Aug 4, 2010, at 12:04 PM, Roch roch.bourbonn...@sun.com wrote:
 
 
 Ross Walker writes:
 On Aug 4, 2010, at 9:20 AM, Roch roch.bourbonn...@sun.com wrote:
 
 
 
 Ross Asks: 
 So on that note, ZFS should disable the disks' write cache,
 not enable them  despite ZFS's COW properties because it
 should be resilient. 
 
 No, because ZFS builds resiliency on top of unreliable parts. it's able 
 to deal
 with contained failures (lost state) of the disk write cache. 
 
 It can then export LUNS that have WC enabled or
 disabled. But if we enable the WC on the exported LUNS, then
 the consumer of these LUNS must be able to say the same.
 The discussion at that level then needs to focus on failure groups.
 
 
 Ross also Said :
 I asked this question earlier, but got no answer: while an
 iSCSI target is presented WCE does it respect the flush
 command? 
 
 Yes. I would like to say obviously but it's been anything
 but.
 
 Sorry to probe further, but can you expand on but...
 
 Just if we had a bunch of zvols exported via iSCSI to another Solaris
 box which used them to form another zpool and had WCE turned on would
 it be reliable? 
 
 
 Nope. That's because all the iSCSI are in the same fault
 domain as they share a unified back-end cache. What works,
 in principle, is mirroring SCSI channels hosted on 
 different storage controllers (or N SCSI channels on N
 controller in a raid group).
 
 Which is why keeping the WC set to the default, is really
 better in general.
 
 Well I was actually talking about two backend Solaris storage servers 
 serving up storage over iSCSI to a front-end Solaris server serving ZFS over 
 NFS, so I have redundancy there, but want the storage to be performant, so I 
 want the iSCSI to have WCE, yet I want it to be reliable and have it honor 
 cache flush requests from the front-end NFS server.
 
 Does that make sense? Is it possible?
 
 
 Well in response to a commit (say after a file creation) then the
 front end server will end up sending flush write caches on
 both side of the iscsi mirror which will reach the backend server
 which will flush disk write caches. This will all work but
 probably  not unleash performance the way you would like it
 to.



 If you setup to have the backend server not honor the
 backend disk flush write caches, then the 2 backend pools become at
 risk of corruption, mostly because the ordering of IOs
 around the ueberblock updates. If you have faith, then you
 could consider that you won't hit 2 backend pool corruption
 together and rely on the frontend resilvering to rebuild the
 corrupted backend.

So you are saying setting WCE disables cache flush on the target and setting 
WCD forces a flush for every WRITE?

How about a way to enable WCE on the target, yet still perform cache flush when 
the initiator requests one, like a real SCSI target should do, or is that just 
not possible with ZVOLs today?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-05 Thread Ross Walker
On Aug 5, 2010, at 2:24 PM, Roch Bourbonnais roch.bourbonn...@sun.com wrote:

 
 Le 5 août 2010 à 19:49, Ross Walker a écrit :
 
 On Aug 5, 2010, at 11:10 AM, Roch roch.bourbonn...@sun.com wrote:
 
 
 Ross Walker writes:
 On Aug 4, 2010, at 12:04 PM, Roch roch.bourbonn...@sun.com wrote:
 
 
 Ross Walker writes:
 On Aug 4, 2010, at 9:20 AM, Roch roch.bourbonn...@sun.com wrote:
 
 
 
 Ross Asks: 
 So on that note, ZFS should disable the disks' write cache,
 not enable them  despite ZFS's COW properties because it
 should be resilient. 
 
 No, because ZFS builds resiliency on top of unreliable parts. it's able 
 to deal
 with contained failures (lost state) of the disk write cache. 
 
 It can then export LUNS that have WC enabled or
 disabled. But if we enable the WC on the exported LUNS, then
 the consumer of these LUNS must be able to say the same.
 The discussion at that level then needs to focus on failure groups.
 
 
 Ross also Said :
 I asked this question earlier, but got no answer: while an
 iSCSI target is presented WCE does it respect the flush
 command? 
 
 Yes. I would like to say obviously but it's been anything
 but.
 
 Sorry to probe further, but can you expand on but...
 
 Just if we had a bunch of zvols exported via iSCSI to another Solaris
 box which used them to form another zpool and had WCE turned on would
 it be reliable? 
 
 
 Nope. That's because all the iSCSI are in the same fault
 domain as they share a unified back-end cache. What works,
 in principle, is mirroring SCSI channels hosted on 
 different storage controllers (or N SCSI channels on N
 controller in a raid group).
 
 Which is why keeping the WC set to the default, is really
 better in general.
 
 Well I was actually talking about two backend Solaris storage servers 
 serving up storage over iSCSI to a front-end Solaris server serving ZFS 
 over NFS, so I have redundancy there, but want the storage to be 
 performant, so I want the iSCSI to have WCE, yet I want it to be reliable 
 and have it honor cache flush requests from the front-end NFS server.
 
 Does that make sense? Is it possible?
 
 
 Well in response to a commit (say after a file creation) then the
 front end server will end up sending flush write caches on
 both side of the iscsi mirror which will reach the backend server
 which will flush disk write caches. This will all work but
 probably  not unleash performance the way you would like it
 to.
 
 
 
 If you setup to have the backend server not honor the
 backend disk flush write caches, then the 2 backend pools become at
 risk of corruption, mostly because the ordering of IOs
 around the ueberblock updates. If you have faith, then you
 could consider that you won't hit 2 backend pool corruption
 together and rely on the frontend resilvering to rebuild the
 corrupted backend.
 
 So you are saying setting WCE disables cache flush on the target and setting 
 WCD forces a flush for every WRITE?
 
 Nope. Setting WC either way has not implication on the response to a flush 
 request. We flush the cache in response to a request to do so,
 unless one sets the unsupported zfs_nocacheflush, if set then the pool is at 
 risk
 
 How about a way to enable WCE on the target, yet still perform cache flush 
 when the initiator requests one, like a real SCSI target should do, or is 
 that just not possible with ZVOLs today?
 
 I hope I've cleared that up. Not sure what I said that implicated otherwise.
 
 But if you honor the flush write cache request all the way to the disk 
 device, then 1, 2 or 3 layers of ZFS won't make a dent in the performance of 
 NFS tar x. 
 Only a device accepting low latency writes which survives power outtage can 
 do that.

Understood, and thanks for the clarification. If the NFS synchronicity has too 
much of a negative impact, it can be alleviated with an SSD or NVRAM slog device 
on the head server.
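On the head server that's just a separate log vdev, ideally mirrored (device
names made up):

  zpool add tank log mirror c3t0d0 c3t1d0   # SSD/NVRAM slog for the front-end pool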

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-04 Thread Ross Walker
On Aug 4, 2010, at 3:52 AM, Roch roch.bourbonn...@sun.com wrote:

 
 Ross Walker writes:
 
 On Aug 3, 2010, at 12:13 PM, Roch Bourbonnais roch.bourbonn...@sun.com 
 wrote:
 
 
 Le 27 mai 2010 à 07:03, Brent Jones a écrit :
 
 On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
 matt.connolly...@gmail.com wrote:
 I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
 
 sh-4.0# zfs create rpool/iscsi
 sh-4.0# zfs set shareiscsi=on rpool/iscsi
 sh-4.0# zfs create -s -V 10g rpool/iscsi/test
 
 The underlying zpool is a mirror of two SATA drives. I'm connecting from 
 a Mac client with global SAN initiator software, connected via Gigabit 
 LAN. It connects fine, and I've initialiased a mac format volume on that 
 iScsi volume.
 
 Performance, however, is terribly slow, about 10 times slower than an SMB 
 share on the same pool. I expected it would be very similar, if not 
 faster than SMB.
 
 Here's my test results copying 3GB data:
 
 iScsi:  44m01s  1.185MB/s
 SMB share:  4m2711.73MB/s
 
 Reading (the same 3GB) is also worse than SMB, but only by a factor of 
 about 3:
 
 iScsi:  4m3611.34MB/s
 SMB share:  1m4529.81MB/s
 
 
 cleaning up some old mail 
 
 Not unexpected. Filesystems have readahead code to prefetch enough to cover 
 the latency of the read request. iSCSI only responds to the request.
 Put a filesystem on top of iscsi and try again.
 
 For writes, iSCSI is synchronous and SMB is not. 
 
  It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
  simply SCSI over IP.
 
 
 Hey Ross,
 
 Nothing to do with ZFS here, but you're right to point out
 that iSCSI is neither. It was just that in the context of
 this test (and 99+% of iSCSI usage) it will be. SMB is
 not. Thus a large discrepancy on the write test.
 
 Resilient storage, by default, should expose iSCSI channels
 with write caches disabled.


So on that note, ZFS should disable the disks' write caches, not enable them, 
despite ZFS's COW properties, because it should be resilient.


 It is the application using the iSCSI protocol that
 determines whether it is synchronous, issue a flush after
 write, or asynchronous, wait until target flushes.
 
 
 True.
 
 I think the ZFS developers didn't quite understand that
 and wanted strict guidelines like NFS has, but iSCSI doesn't
 have those, it is a lower level protocol than NFS is, so
 they forced guidelines on it and violated the standard. 
 
 -Ross
 
 
 Not True. 
 
 
 ZFS exposes LUNS (or ZVOL) and while at first we didn't support
 DKIOCSETWCE, we now do. So a ZFS LUN can be whatever you
 need it to be.

I asked this question earlier, but got no answer: while an iSCSI target is 
presented as WCE, does it respect the flush command?

 Now in the context of iSCSI luns hosted by a resilient
 storage system, enabling write caches is to be used only in
 very specific circumstances. The situation is not symmetrical
 with WCE in disks of a JBOD since that can be setup with
 enough redundancy to deal with potential data loss. When
 using a resilient storage, you need to trust the storage for
 persistence of SCSI commands and building a resilient system
 on top of write cache enabled SCSI channels is not trivial.

Not true: advertise WCE, support flush and tagged command queuing, and the 
initiator will be able to use the resilient storage appropriately for its needs.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-04 Thread Ross Walker
On Aug 4, 2010, at 9:20 AM, Roch roch.bourbonn...@sun.com wrote:

 
 
  Ross Asks: 
  So on that note, ZFS should disable the disks' write cache,
  not enable them  despite ZFS's COW properties because it
  should be resilient. 
 
 No, because ZFS builds resiliency on top of unreliable parts. it's able to 
 deal
 with contained failures (lost state) of the disk write cache. 
 
 It can then export LUNS that have WC enabled or
 disabled. But if we enable the WC on the exported LUNS, then
 the consumer of these LUNS must be able to say the same.
 The discussion at that level then needs to focus on failure groups.
 
 
  Ross also Said :
  I asked this question earlier, but got no answer: while an
  iSCSI target is presented WCE does it respect the flush
  command? 
 
 Yes. I would like to say obviously but it's been anything
 but.

Sorry to probe further, but can you expand on but...

Just if we had a bunch of zvols exported via iSCSI to another Solaris box which 
used them to form another zpool and had WCE turned on would it be reliable?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-04 Thread Ross Walker
On Aug 4, 2010, at 12:04 PM, Roch roch.bourbonn...@sun.com wrote:

 
 Ross Walker writes:
 On Aug 4, 2010, at 9:20 AM, Roch roch.bourbonn...@sun.com wrote:
 
 
 
 Ross Asks: 
 So on that note, ZFS should disable the disks' write cache,
 not enable them  despite ZFS's COW properties because it
 should be resilient. 
 
 No, because ZFS builds resiliency on top of unreliable parts. it's able to 
 deal
 with contained failures (lost state) of the disk write cache. 
 
 It can then export LUNS that have WC enabled or
 disabled. But if we enable the WC on the exported LUNS, then
 the consumer of these LUNS must be able to say the same.
 The discussion at that level then needs to focus on failure groups.
 
 
 Ross also Said :
 I asked this question earlier, but got no answer: while an
 iSCSI target is presented WCE does it respect the flush
 command? 
 
 Yes. I would like to say obviously but it's been anything
 but.
 
 Sorry to probe further, but can you expand on but...
 
 Just if we had a bunch of zvols exported via iSCSI to another Solaris
 box which used them to form another zpool and had WCE turned on would
 it be reliable? 
 
 
 Nope. That's because all the iSCSI LUNs are in the same fault
 domain, as they share a unified back-end cache. What works,
 in principle, is mirroring SCSI channels hosted on 
 different storage controllers (or N SCSI channels on N
 controllers in a raid group).
 
 Which is why keeping the WC set to the default, is really
 better in general.

Well I was actually talking about two backend Solaris storage servers serving 
up storage over iSCSI to a front-end Solaris server serving ZFS over NFS, so I 
have redundancy there, but want the storage to be performant, so I want the 
iSCSI to have WCE, yet I want it to be reliable and have it honor cache flush 
requests from the front-end NFS server.

Does that make sense? Is it possible?

-Ross
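
For reference, a rough sketch of the layout described above, assuming the
front end sees one LUN from each backend (device names are placeholders;
c3t*/c4t* stand for the LUNs presented by backend A and backend B):

  # pair one LUN from each backend so every mirror spans both fault domains
  zpool create tank \
      mirror c3t0d0 c4t0d0 \
      mirror c3t1d0 c4t1d0

  # serve it out over NFS from the front end
  zfs create tank/export
  zfs set sharenfs=rw tank/export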

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-03 Thread Ross Walker
On Aug 3, 2010, at 5:56 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 03/08/2010 22:49, Ross Walker wrote:
 On Aug 3, 2010, at 12:13 PM, Roch Bourbonnaisroch.bourbonn...@sun.com  
 wrote:
 
   
 On May 27, 2010, at 07:03, Brent Jones wrote:
 
 
 On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
 matt.connolly...@gmail.com  wrote:
   
 I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
 
 sh-4.0# zfs create rpool/iscsi
 sh-4.0# zfs set shareiscsi=on rpool/iscsi
 sh-4.0# zfs create -s -V 10g rpool/iscsi/test
 
 The underlying zpool is a mirror of two SATA drives. I'm connecting from 
 a Mac client with global SAN initiator software, connected via Gigabit 
 LAN. It connects fine, and I've initialised a mac format volume on that 
 iScsi volume.
 
 Performance, however, is terribly slow, about 10 times slower than an SMB 
 share on the same pool. I expected it would be very similar, if not 
 faster than SMB.
 
 Here's my test results copying 3GB data:
 
 iScsi:  44m01s  1.185MB/s
 SMB share:  4m27    11.73MB/s
 
 Reading (the same 3GB) is also worse than SMB, but only by a factor of 
 about 3:
 
 iScsi:  4m36    11.34MB/s
 SMB share:  1m45    29.81MB/s
 
 
 cleaning up some old mail
 
 Not unexpected. Filesystems have readahead code to prefetch enough to cover 
 the latency of the read request. iSCSI only responds to the request.
 Put a filesystem on top of iscsi and try again.
 
 For writes, iSCSI is synchronous and SMB is not.
 
 It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
 simply SCSI over IP.
 
 It is the application using the iSCSI protocol that determines whether it is 
 synchronous (issue a flush after write) or asynchronous (wait until the target 
 flushes).
 
 I think the ZFS developers didn't quite understand that and wanted strict 
 guidelines like NFS has, but iSCSI doesn't have those, it is a lower level 
 protocol than NFS is, so they forced guidelines on it and violated the 
 standard.
 
   
 Nothing has been violated here.
 Look for WCE flag in COMSTAR where you can control how a given zvol  should 
 behave (synchronous or asynchronous). Additionally in recent build you have 
 zfs set sync={disabled|default|always} which also works with zvols.
 
 So you do have a control over how it is supposed to behave and to make it 
 nice it is even on per zvol basis.
 It is just that the default is synchronous.

Ah, ok, my experience has been with Solaris and the iscsitgt which, correct me 
if I am wrong, is still synchronous only.

-Ross
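
For reference, the per-zvol sync control mentioned above looks roughly like
this in recent builds (dataset name is a placeholder):

  zfs get sync tank/iscsi/test
  zfs set sync=always tank/iscsi/test     # force a ZIL commit on every write
  zfs set sync=disabled tank/iscsi/test   # never commit synchronously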

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-03 Thread Ross Walker
On Aug 3, 2010, at 5:56 PM, Robert Milkowski mi...@task.gda.pl wrote:

 On 03/08/2010 22:49, Ross Walker wrote:
 On Aug 3, 2010, at 12:13 PM, Roch Bourbonnaisroch.bourbonn...@sun.com  
 wrote:
 
   
 On May 27, 2010, at 07:03, Brent Jones wrote:
 
 
 On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
 matt.connolly...@gmail.com  wrote:
   
 I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
 
 sh-4.0# zfs create rpool/iscsi
 sh-4.0# zfs set shareiscsi=on rpool/iscsi
 sh-4.0# zfs create -s -V 10g rpool/iscsi/test
 
 The underlying zpool is a mirror of two SATA drives. I'm connecting from 
 a Mac client with global SAN initiator software, connected via Gigabit 
 LAN. It connects fine, and I've initialised a mac format volume on that 
 iScsi volume.
 
 Performance, however, is terribly slow, about 10 times slower than an SMB 
 share on the same pool. I expected it would be very similar, if not 
 faster than SMB.
 
 Here's my test results copying 3GB data:
 
 iScsi:  44m01s  1.185MB/s
 SMB share:  4m27    11.73MB/s
 
 Reading (the same 3GB) is also worse than SMB, but only by a factor of 
 about 3:
 
 iScsi:  4m36    11.34MB/s
 SMB share:  1m45    29.81MB/s
 
 
 cleaning up some old mail
 
 Not unexpected. Filesystems have readahead code to prefetch enough to cover 
 the latency of the read request. iSCSI only responds to the request.
 Put a filesystem on top of iscsi and try again.
 
 For writes, iSCSI is synchronous and SMB is not.
 
 It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
 simply SCSI over IP.
 
 It is the application using the iSCSI protocol that determines whether it is 
 synchronous (issue a flush after write) or asynchronous (wait until the target 
 flushes).
 
 I think the ZFS developers didn't quite understand that and wanted strict 
 guidelines like NFS has, but iSCSI doesn't have those, it is a lower level 
 protocol than NFS is, so they forced guidelines on it and violated the 
 standard.
 
   
 Nothing has been violated here.
 Look for WCE flag in COMSTAR where you can control how a given zvol  should 
 behave (synchronous or asynchronous). Additionally in recent build you have 
 zfs set sync={disabled|default|always} which also works with zvols.
 
 So you do have a control over how it is supposed to behave and to make it 
 nice it is even on per zvol basis.
 It is just that the default is synchronous.

I forgot to ask, if the ZVOL is set async with WCE will it still honor a flush 
command from the initiator and flush those TXGs held for the ZVOL?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirrored raidz

2010-07-26 Thread Ross Walker
On Jul 26, 2010, at 2:51 PM, Dav Banks davba...@virginia.edu wrote:

 I wanted to test it as a backup solution. Maybe that's crazy in itself but I 
 want to try it.
 
 Basically, once a week detach the 'backup' pool from the mirror, replace the 
 drives, add the new raidz to the mirror and let it resilver and sit for a 
 week.

If that's the case why not create a second pool called 'backup' and 'zfs send' 
periodically to the backup pool?

-Ross
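
A rough sketch of that approach (pool and snapshot names are placeholders;
the first run is a full send, later runs can be incremental):

  zfs snapshot -r tank@weekly-1
  zfs send -R tank@weekly-1 | zfs recv -Fd backup

  # the following week
  zfs snapshot -r tank@weekly-2
  zfs send -R -i tank@weekly-1 tank@weekly-2 | zfs recv -Fd backup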

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision

2010-07-25 Thread Ross Walker
On Jul 23, 2010, at 10:14 PM, Edward Ned Harvey sh...@nedharvey.com wrote:

 From: Arne Jansen [mailto:sensi...@gmx.net]
 
 Can anyone else confirm or deny the correctness of this statement?
 
 As I understand it that's the whole point of raidz. Each block is its
 own
 stripe. 
 
 Nope, that doesn't count for confirmation.  It is at least theoretically
 possible to implement raidz using techniques that would (a) unintelligently
 stripe all blocks (even small ones) across multiple disks, thus hurting
 performance on small operations, or (b) implement raidz such that striping
 of blocks behaves differently for small operations (plus parity).  So the
 confirmation I'm looking for would be somebody who knows the actual source
 code, and the actual architecture that was chosen to implement raidz in this
 case.

Maybe this helps?

http://blogs.sun.com/ahl/entry/what_is_raid_z

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] File cloning

2010-07-22 Thread Ross Walker
On Jul 22, 2010, at 2:41 PM, Miles Nordin car...@ivy.net wrote:

 sw == Saxon, Will will.sa...@sage.com writes:
 
sw 'clone' vs. a 'copy' would be very easy since we have
sw deduplication now
 
 dedup doesn't replace the snapshot/clone feature for the
 NFS-share-full-of-vmdk use case because there's no equivalent of 
 'zfs rollback'
 
 
 I'm tempted to say, ``vmware needs to remove their silly limit'' but
 there are takes-three-hours-to-boot problems with thousands of Solaris
 NFS exports so maybe their limit is not so silly after all.
 
 What is the scenario, you have?  Is it something like 40 hosts with
 live migration among them, and 40 guests on each host?  so you need
 1600 filesystems mounted even though only 40 are actually in use?
 
 'zfs set sharenfs=absorb dataset' would be my favorite answer, but
 lots of people have asked for such a feature, and answer is always
 ``wait for mirror mounts'' (which BTW are actually just-works for me
 on very-recent linux, even with plain 'mount host:/fs /fs', without
 saying 'mount -t nfs4', in spite of my earlier rant complaining they
 are not real).  Of course NFSv4 features are no help to vmware, but
 hypothetically I guess mirror-mounting would work if vmware supported
 it, so long as they were careful not to provoke the mounting of guests
 not in use.  The ``implicit automounter'' on which the mirror mount
 feature's based would avoid the boot delay of mounting 1600
 filesystems.
 
 and BTW I've not been able to get the Real Automounter in Linux to do
 what this implicit one already can with subtrees.  Why is it so hard
 to write a working automounter?
 
 The other thing I've never understood is, if you 'zfs rollback' an
 NFS-exported filesystem, what happens to all the NFS clients?  It
 seems like this would cause much worse corruption than the worry when
 people give fire-and-brimstone speeches about never disabling
 zil-writing while using the NFS server.  but it seems to mostly work
 anyway when I do this, so I'm probably confused about something.

To add to Miles' comments, what you are trying to accomplish isn't possible via 
NFS to ESX, but could be accomplished with iSCSI zvols I believe. If I 
understand you can thin-provision a zvol and clone it as many times as you wish 
and present all the clones over iSCSI. Haven't tried it myself, but would be 
worth testing.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision

2010-07-20 Thread Ross Walker
On Jul 20, 2010, at 6:12 AM, v victor_zh...@hotmail.com wrote:

 Hi,
 for zfs raidz1, I know that for random I/O the IOPS of a raidz1 vdev equal one 
 physical disk's IOPS, since raidz1 is like raid5. So does raid5 have the same 
 performance as raidz1, i.e. random IOPS equal to one physical disk's IOPS?

On reads, no, any part of the stripe width can be read without reading the 
whole stripe width, giving performance equal to raid0 of non-parity disks.

On writes it could be worse than raidz1 depending on whether whole stripe 
widths are being written (same performance) or partial stripe widths are being 
written (worse performance). If it's a partial stripe width then the remaining 
data needs to be read off disk which doubles the IOs.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need ZFS master!

2010-07-13 Thread Ross Walker

The whole disk layout should be copied from disk 1 to 2, then the slice on disk 
2 that corresponds to the slice on disk 1 should be attached to the rpool which 
forms an rpool mirror (attached not added).

Then you need to add the grub bootloader to disk 2.

When it finishes resilvering then you have an rpool mirror.

-Ross
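
A rough sketch of those steps on Solaris/OpenSolaris (device names are
placeholders; c0t0d0 is the existing root disk, c0t1d0 the new one):

  # copy the SMI slice layout from the first disk to the second
  prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

  # attach (not add) the matching slice to form the rpool mirror
  zpool attach rpool c0t0d0s0 c0t1d0s0

  # make the second disk bootable
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

  # wait for the resilver to finish
  zpool status rpool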



On Jul 12, 2010, at 6:30 PM, Beau J. Bechdol bbech...@gmail.com wrote:

 I do apologise but I am completely lost here. Maybe I am just not 
 understanding. Are you saying that a slice has to be created on the second 
 drive before it can be added to the pool?
 
 Thanks
 
 On Mon, Jul 12, 2010 at 4:22 PM, Cindy Swearingen 
 cindy.swearin...@oracle.com wrote:
 Hi John,
 
 Follow the steps in this section:
 
 http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide
 
 Replacing/Relabeling the Root Pool Disk
 
 If the disk is correctly labeled with an SMI label, then you can skip
 down to steps 5-8 of this procedure.
 
 Thanks,
 
 Cindy
 
 
 On 07/12/10 16:06, john wrote:
 Hello all. I am new...very new to opensolaris and I am having an issue and 
 have no idea what is going wrong. So I have 5 drives in my machine, all 
 500GB. I installed open solaris on the first drive and rebooted. Now what I 
 want to do is add a second drive so they are mirrored. How does one do this!!! 
 I am getting nowhere and need some help.
 
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption?

2010-07-11 Thread Ross Walker
On Jul 11, 2010, at 5:11 PM, Freddie Cash fjwc...@gmail.com wrote:

 ZFS-FUSE is horribly unstable, although that's more an indication of
 the stability of the storage stack on Linux.

Not really, more an indication of the pseudo-VFS layer implemented in fuse. 
Remember fuse provides its own VFS API separate from the Linux VFS API so file 
systems can be implemented in user space. Fuse needs a little more work to 
handle ZFS as a file system.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should i enable Write-Cache ?

2010-07-10 Thread Ross Walker
On Jul 10, 2010, at 5:46 AM, Erik Trimble erik.trim...@oracle.com wrote:

 On 7/10/2010 1:14 AM, Graham McArdle wrote:
 Instead, create Single Disk arrays for each disk.
 
 I have a question related to this but with a different controller: If I'm 
 using a RAID controller to provide non-RAID single-disk volumes, do I still 
 lose out on the hardware-independence advantage of software RAID that I 
 would get from a basic non-RAID HBA?
 In other words, if the controller dies, would I still need an identical 
 controller to recognise the formatting of 'single disk volumes', or is more 
 'standardised' than the typical proprietary implementations of hardware RAID 
 that makes it impossible to switch controllers on  hardware RAID?
   
 
 Yep. You're screwed.  :-)
 
 single-disk volumes are still RAID volumes to the controller, so they'll have 
 the extra controller-specific bits on them. You'll need an identical 
 controller (or, possibly, just one from the same OEM) to replace a broken 
 controller with.
 
 Even in JBOD mode, I wouldn't trust a RAID controller to not write 
 proprietary bits onto the disks.  It's one of the big reasons to chose a HBA 
 and not a RAID controller.

Not always: with my Dell PERC, with the drives set up as single-disk RAID0 
volumes, I was able to successfully import the pool on a regular LSI SAS 
(non-RAID) controller.

The only change the PERC made was to coerce the disk size down by 128MB, 
leaving 128MB unused at the end of the drive, which would mean new disks would 
be slightly bigger.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 5:40 AM, Robert Milkowski mi...@task.gda.pl wrote:

 On 23/06/2010 18:50, Adam Leventhal wrote:
 Does it mean that for dataset used for databases and similar environments 
 where basically all blocks have fixed size and there is no other data all 
 parity information will end-up on one (z1) or two (z2) specific disks?
 
 No. There are always smaller writes to metadata that will distribute parity. 
 What is the total width of your raidz1 stripe?
 
   
 
 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

From what I gather each 16KB record (plus parity) is spread across the raidz 
disks. This causes the total random IOPS (write AND read) of the raidz to be 
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns not random. To get good 
random IO with raidz you need a zpool with X raidz vdevs where X = desired 
IOPS/IOPS of single drive.

-Ross
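
As a rough illustration of that rule of thumb (device names are placeholders):
each raidz vdev contributes roughly one disk's worth of random IOPS, so getting
~240 random IOPS out of ~80-IOPS SATA drives needs about three vdevs:

  zpool create tank \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
      raidz c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0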


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 10:42 AM, Robert Milkowski mi...@task.gda.pl wrote:

 On 24/06/2010 14:32, Ross Walker wrote:
 On Jun 24, 2010, at 5:40 AM, Robert Milkowskimi...@task.gda.pl  wrote:
 
   
 On 23/06/2010 18:50, Adam Leventhal wrote:
 
 Does it mean that for dataset used for databases and similar environments 
 where basically all blocks have fixed size and there is no other data all 
 parity information will end-up on one (z1) or two (z2) specific disks?
 
 
 No. There are always smaller writes to metadata that will distribute 
 parity. What is the total width of your raidz1 stripe?
 
 
   
 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
 
 From what I gather each 16KB record (plus parity) is spread across the raidz 
 disks. This causes the total random IOPS (write AND read) of the raidz to be 
 that of the slowest disk in the raidz.
 
 Raidz is definitely made for sequential IO patterns not random. To get good 
 random IO with raidz you need a zpool with X raidz vdevs where X = desired 
 IOPS/IOPS of single drive.
   
 
 I know that and it wasn't mine question.

Sorry, for the OP...


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Ross Walker
On Jun 23, 2010, at 1:48 PM, Robert Milkowski mi...@task.gda.pl wrote:

 
 128GB.
 
 Does it mean that for dataset used for databases and similar environments 
 where basically all blocks have fixed size and there is no other data all 
 parity information will end-up on one (z1) or two (z2) specific disks?

What's the record size on those datasets?

8k?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SLOG striping? (Bob Friesenhahn)

2010-06-22 Thread Ross Walker
On Jun 22, 2010, at 8:40 AM, Jeff Bacon ba...@walleyesoftware.com wrote:

 The term 'stripe' has been so outrageously severely abused in this
 forum that it is impossible to know what someone is talking about when
 they use the term.  Seemingly intelligent people continue to use wrong
 terminology because they think that protracting the confusion somehow
 helps new users.  We are left with no useful definition of
 'striping'.
 
 There is no striping. 
 (I'm sorry, I couldn't resist.)

There is no spoon


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable

2010-06-22 Thread Gordon Ross
Anyone know why my ZFS filesystem might suddenly start
giving me an error when I try to ls -d the top of it?
i.e.: ls -d /tank/ws/fubar
/tank/ws/fubar: Operation not applicable

zpool status says all is well.  I've tried snv_139 and snv_137
(my latest and previous installs).  It's an amd64 box.
Both OS versions show the same problem.

Do I need to run a scrub?  (will take days...)

Other ideas?

Thanks,
Gordon
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable

2010-06-22 Thread Gordon Ross
lstat64(/tank/ws/fubar, 0x080465D0)   Err#89 ENOSYS
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup... still in beta status

2010-06-16 Thread Ross Walker
On Jun 16, 2010, at 9:02 AM, Carlos Varela carlos.var...@cibc.ca  
wrote:




Does the machine respond to ping?


Yes



If there is a gui does the mouse pointer move?



There is no GUI (nexentastor)


Does the keyboard numlock key respond at all ?


Yes



I just find it very hard to believe that such a
situation could exist as I
have done some *abusive* tests on a SunFire X4100
with Sun 6120 fibre
arrays ( in HA config ) and I could not get it to
become a warm brick like
you describe.

How many processors does your machine have ?


Full data:

Motherboard: Asus m2n68-CM
Initial memory: 3 Gb DDR2 ECC
Actual memory: 8 GB DDR2 800
CPU: Athlon X2 5200
HD: 2 Seagate 1 WD (1,5 TB each)
Pools: 1 RAIDZ pool
datasets: 5 (ftp: 30 GB, varios: 170 GB, multimedia:
1,7TB, segur: 80 Gb prueba: 50 Mb)
ZFS ver: 22

The pool was created with EON-NAS 0.6 ... dedupe on,


Similar situation but with Opensolaris b133. Can ping machine but  
it's been frozen for about 24 hours. I was deleting 25GB of dedup data. If I  
move 1-2 GB of data then the machine stops responding for 1 hour but  
comes back after that. I have munin installed and the graphs stop  
updating during that time and you can not use ssh. I agree that  
memory seems to not be enough as I see a lot of 20kb reads before it  
stops responding (reading DDT entries I guess). Maybe dedup has to  
be redesigned for low memory machines (a batch process instead of  
inline ?)
This is my home machine so I can wait but businesses would not be so  
happy if the machine becomes so unresponsive that you can not access  
your data.


The unresponsiveness that people report when deleting large dedup zfs  
objects is due to ARC memory pressure and long service times accessing  
other zfs objects while it is busy resolving the deleted object's  
dedup references.


Set a max size the ARC can grow to, saving room for system services,  
get an SSD drive to act as an L2ARC, run a scrub first to prime the  
L2ARC (actually probably better to run something targeting just those  
datasets in question), then delete the dedup objects, smallest to  
largest.


-Ross
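
A rough sketch of those steps (the ARC cap, pool and device names below are
placeholders):

  # /etc/system -- cap the ARC at e.g. 4 GB; takes effect after a reboot
  set zfs:zfs_arc_max = 0x100000000

  # add an SSD as L2ARC and warm it up before the big deletes
  zpool add tank cache c3t0d0
  zpool scrub tank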

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving ba

2010-06-14 Thread Ross Walker
On Jun 13, 2010, at 2:14 PM, Jan Hellevik  
opensola...@janhellevik.com wrote:


Well, for me it was a cure. Nothing else I tried got the pool back.  
As far as I can tell, the way to get it back should be to use  
symlinks to the fdisk partitions on my SSD, but that did not work  
for me. Using -V got the pool back. What is wrong with that?


If you have a better suggestion as to how I should have recovered my  
pool I am certainly interested in hearing it.


I would take this time to offline one disk at a time, wipe all its  
tables/labels and re-attach it as an EFI whole disk to avoid hitting  
this same problem again in the future.


-Ross
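
A rough sketch of doing that one disk at a time (pool and device names are
placeholders; wait for each resilver to finish before touching the next disk):

  zpool offline tank c0t1d0
  # wipe the old fdisk/slice labels (e.g. with format -e), then hand the
  # whole disk back so ZFS relabels it EFI and resilvers it
  zpool replace tank c0t1d0
  zpool status tank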

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please trim posts

2010-06-11 Thread Ross Walker
On Jun 11, 2010, at 2:07 AM, Dave Koelmeyer davekoelme...@me.com  
wrote:


I trimmed, and then got complained at by a mailing list user that  
the context of what I was replying to was missing. Can't win :P


If at a minimum one trims the disclaimers, footers and signatures,  
that's better than nothing.


On long threads with inlined comments, think about keeping the  
previous 2 comments before or trimming anything 3 levels of indents or  
more.


Of course that's just my general rule of thumb and different  
discussions require different quotings, but just being mindful is  
often enough.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Ross Walker
On Jun 10, 2010, at 5:54 PM, Richard Elling richard.ell...@gmail.com  
wrote:



On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:


Andrey Kuzmin wrote:

Well, I'm more accustomed to  sequential vs. random, but YMMW.
As to 67000 512 byte writes (this sounds suspiciously close to  
32Mb fitting into cache), did you have write-back enabled?


It's a sustained number, so it shouldn't matter.


That is only 34 MB/sec.  The disk can do better for sequential writes.


Not doing sector sized IO.

Besides this was a max IOPS number not max throughput number. If it  
were the OP might have used a 1M bs or better instead.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-09 Thread Ross Walker

On Jun 8, 2010, at 1:33 PM, besson3c j...@netmusician.org wrote:



Sure! The pool consists of 6 SATA drives configured as RAID-Z. There  
are no special read or write cache drives. This pool is shared to  
several VMs via NFS, these VMs manage email, web, and a Quickbooks  
server running on FreeBSD, Linux, and Windows.


Ok, well RAIDZ is going to be a problem here. Because each record is  
spread across the whole pool (each read/write will hit all drives in  
the pool) which has the side effect of making the total number of IOPS  
equal to the total number of IOPS of the slowest drive in the pool.


Since these are SATA let's say the total number of IOPS will be 80  
which is not good enough for what is a mostly random workload.


If it were a 6 drive pool of mirrors then it would be able to handle  
240 IOPS write and up to 480 IOPS read (can read from either side of  
mirror).


I would probably rethink the setup.

ZIL will not buy you much here and if your VM software is like VMware  
then each write over NFS will be marked FSYNC which will force the  
lack of IOPS to the surface.


-Ross
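
For comparison, the same six SATA disks laid out as three 2-way mirrors would
look something like this (device names are placeholders):

  zpool create tank \
      mirror c1t0d0 c1t1d0 \
      mirror c1t2d0 c1t3d0 \
      mirror c1t4d0 c1t5d0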

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Ross Walker
On Jun 7, 2010, at 2:10 AM, Erik Trimble erik.trim...@oracle.com  
wrote:



Comments in-line.


On 6/6/2010 9:16 PM, Ken wrote:


I'm looking at VMWare, ESXi 4, but I'll take any advice offered.

On Sun, Jun 6, 2010 at 19:40, Erik Trimble  
erik.trim...@oracle.com wrote:

On 6/6/2010 6:22 PM, Ken wrote:


Hi,

I'm looking to build a virtualized web hosting server environment  
accessing files on a hybrid storage SAN.  I was looking at using  
the Sun X-Fire x4540 with the following configuration:
6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA  
drives)

2 Intel X-25 32GB SSD's as a mirrored ZIL
4 Intel X-25 64GB SSD's as the L2ARC.
De-duplification
LZJB compression
The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:
Should I use NFS with all five VM's accessing the exports, or one  
LUN for each VM, accessed over iSCSI?


Generally speaking, it depends on your comfort level with running  
iSCSI  Volumes to put the VMs in, or serving everything out via NFS  
(hosting the VM disk file in an NFS filesystem).


If you go the iSCSI route, I would definitely go the one iSCSI  
volume per VM route - note that you can create multiple zvols per  
zpool on the X4540, so it's not limiting in any way to volume-ize a  
VM.  It's a lot simpler, easier, and allows for nicer management  
(snapshots/cloning/etc. on the X4540 side) if you go with a VM per  
iSCSI volume.


With NFS-hosted VM disks, do the same thing:  create a single  
filesystem on the X4540 for each VM.


Vmware has a 32 mount limit which may limit the OP somewhat here.


Performance-wise, I'd have to test, but I /think/ the iSCSI route  
will be faster. Even with the ZIL SSDs.


Actually, properly tuned they are about the same, but VMware NFS  
datastores are FSYNC on all operations, which isn't the best for data  
vmdk files; it's best to serve the data directly to the VM using either  
iSCSI or NFS.







Are the FSYNC speed issues with NFS resolved?


The ZIL SSDs will compensate for synchronous write issues in NFS.   
Not completely eliminate them, but you shouldn't notice issues with  
sync writing until you're up at pretty heavy loads.


You will need this with VMware as every NFS operation (not just file  
open/close) coming out of VMware will be marked FSYNC (for VM data  
integrity in the face of server failure).











If it were me (and, given what little I know of your data), I'd go  
like this:


(1) pool for VMs:
8 disks, MIRRORED
1 SSD for L2ARC
one Zvol per VM instance, served via iSCSI, each with:
DD turned ON,  Compression turned OFF

(1) pool for clients to write data to (log files, incoming data, etc.)
6 or 8 disks, MIRRORED
2 SSDs for ZIL, mirrored
Ideally, As many filesystems as you have webSITES, not just  
client VMs.  As this might be unwieldy for 100s of websites, you  
should segregate them into obvious groupings, taking care with write/ 
read permissions.

NFS served
DD OFF, Compression ON  (or OFF, if you seem to be  
having CPU overload on the X4540)


(1) pool for client read-only data
All the rest of the disks, split into 7 or 8-disk RAIDZ2 vdevs
All the remaining SSDs for L2ARC
As many filesystems as you have webSITES, not just client  
VMs.  (however, see above)

NFS served
DD on for selected websites (filesystems),  
Compression ON for everything


(2) Global hot spares.


Make your life easy and use NFS for VMs and data. If you need high  
performance data such as databases, use iSCSI zvols directly into the  
VM, otherwise NFS/CIFS into the VM should be good enough.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating to ZFS

2010-06-02 Thread Ross Walker

On Jun 2, 2010, at 12:03 PM, zfsnoob4 zfsnoob...@hotmail.co.uk wrote:


Wow thank you very much for the clear instructions.

And Yes, I have another 120GB drive for the OS, separate from A, B  
and C. I will repartition the drive and install Solaris. Then maybe  
at some point I'll delete the entire drive and just install a single  
OS.



I have a question about step 6, Step 6: create a dummy drive as a  
sparse file: mkfile -n 1500G /foo


I understand that I need to create a dummy drive and then immediately  
remove it to run the raidz in degraded mode. But by creating the  
file with mkfile, will it allocate the 1.5TB right away on the OS  
drive? I was wondering because my OS drive is only 120GB, so won't  
it have a problem with creating a 1.5TB sparse file?


There is one potential pitfall in this method: if your Windows mirror  
is using dynamic disks, you can't access a dynamic disk with the NTFS  
driver under Solaris.


To get around this create a basic NTFS partition on the new third  
drive, copy the data to that drive and blow away the dynamic mirror.  
Then build the degraded raidz pool out of the two original mirror  
disks and copy the data back off the new third disk on to the raidz,  
then wipe the disk labels off that third drive and resilver the raidz.


A safer approach is to get a 2TB eSATA drive (a mirrored device to be  
extra safe) and copy the data there, then build a complete raidz and  
copy the data off the eSATA device to the raidz.


The risk and time it takes to copy data on to a degraded raidz isn't  
worth it. The write throughput on a degraded raidz will be horrible  
and the time it takes to copy the data over plus the time it takes in  
the red zone where it resilvers the raidz with no backup available...   
There is a high potential for tears here.


Get an external disk for your own sanity.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-21 Thread Ross Walker

On May 20, 2010, at 7:17 PM, Ragnar Sundblad ra...@csc.kth.se wrote:



On 21 maj 2010, at 00.53, Ross Walker wrote:


On May 20, 2010, at 6:25 PM, Travis Tabbal tra...@tabbal.net wrote:


use a slog at all if it's not durable?  You should
disable the ZIL
instead.



This is basically where I was going. There only seems to be one  
SSD that is considered working, the Zeus IOPS. Even if I had the  
money, I can't buy it. As my application is a home server, not a  
datacenter, things like NFS breaking if I don't reboot the clients  
is a non-issue. As long as the on-disk data is consistent so I  
don't have to worry about the entire pool going belly-up, I'm  
happy enough. I might lose 30 seconds of data, worst case, as a  
result of running without ZIL. Considering that I can't buy a  
proper ZIL at a cost I can afford, and an improper ZIL is not  
worth much, I don't see a reason to bother with ZIL at all. I'll  
just get a cheap large SSD for L2ARC, disable ZIL, and call it a  
day.


For my use, I'd want a device in the $200 range to even consider  
an slog device. As nothing even remotely close to that price range  
exists that will work properly at all, let alone with decent  
performance, I see no point in ZIL for my application. The  
performance hit is just too severe to continue using it without an  
slog, and there's no slog device I can afford that works properly,  
even if I ignore performance.


Just buy a caching RAID controller and run it in JBOD mode and have  
the ZIL integrated with the pool.


A 512MB-1024MB card with battery backup should do the trick. It  
might not have the capacity of an SSD, but in my experience it  
works well in the 1TB data moderately loaded range.


Have more data/activity? Then try more cards and more pools,  
otherwise pony up the $$$ for a capacitor-backed SSD.


It - again - depends on what problem you are trying to solve.

If the RAID controller goes bad on you so that you lose the
data in the write cache, your file system could be in pretty bad
shape. Most RAID controllers can't be mirrored. That would hardly
make a good replacement for a mirrored ZIL.

As far as I know, there is no single silver bullet to this issue.


That is true, and there at finite budgets as well and as all things in  
life one must make a trade-off somewhere.


If you have 2 mirrored SSDs that don't support cache flush and your  
power goes out your file system will be in the same bad shape.  
Difference is in the first place you paid a lot less to have your data  
hosed.


-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-20 Thread Ross Walker

On May 20, 2010, at 6:25 PM, Travis Tabbal tra...@tabbal.net wrote:


use a slog at all if it's not durable?  You should
disable the ZIL
instead.



This is basically where I was going. There only seems to be one SSD  
that is considered working, the Zeus IOPS. Even if I had the  
money, I can't buy it. As my application is a home server, not a  
datacenter, things like NFS breaking if I don't reboot the clients  
is a non-issue. As long as the on-disk data is consistent so I don't  
have to worry about the entire pool going belly-up, I'm happy  
enough. I might lose 30 seconds of data, worst case, as a result of  
running without ZIL. Considering that I can't buy a proper ZIL at a  
cost I can afford, and an improper ZIL is not worth much, I don't  
see a reason to bother with ZIL at all. I'll just get a cheap large  
SSD for L2ARC, disable ZIL, and call it a day.


For my use, I'd want a device in the $200 range to even consider an  
slog device. As nothing even remotely close to that price range  
exists that will work properly at all, let alone with decent  
performance, I see no point in ZIL for my application. The  
performance hit is just too severe to continue using it without an  
slog, and there's no slog device I can afford that works properly,  
even if I ignore performance.


Just buy a caching RAID controller and run it in JBOD mode and have  
the ZIL integrated with the pool.


A 512MB-1024MB card with battery backup should do the trick. It might  
not have the capacity of an SSD, but in my experience it works well in  
the 1TB data moderately loaded range.


Have more data/activity? Then try more cards and more pools, otherwise  
pony up the $$$ for a capacitor-backed SSD.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-13 Thread Ross Walker
On May 12, 2010, at 7:12 PM, Richard Elling richard.ell...@gmail.com  
wrote:



On May 11, 2010, at 10:17 PM, schickb wrote:

I'm looking for input on building an HA configuration for ZFS. I've  
read the FAQ and understand that the standard approach is to have a  
standby system with access to a shared pool that is imported during  
a failover.


The problem is that we use ZFS for a specialized purpose that  
results in 10's of thousands of filesystems (mostly snapshots and  
clones). All versions of Solaris and OpenSolaris that we've tested  
take a long time (> 1 hour) to import that many filesystems.


I've read about replication through AVS, but that also seems  
require an import during failover. We'd need something closer to an  
active-active configuration (even if the second active is only  
modified through replication). Or some way to greatly speedup  
imports.


Any suggestions?


The import is fast, but two other operations occur during import  
that will

affect boot time:
   + for each volume (zvol) and its snapshots, a device tree entry is
  made in /devices
   + for each NFS share, the file system is (NFS) exported

When you get into the thousands of datasets and snapshots range, this
takes some time. Several RFEs have been implemented over the past few
years to help improve this.

NB.  Running in a VM doesn't improve the share or device enumeration  
time.


The idea I propose is to use VMs in a manner such that the server does  
not have to be restarted in the event of a hardware failure, thus  
avoiding the enumerations, by using VMware's hot-spare VM technology.


Of course using VMs could also mean the OP could have multiple ZFS  
servers such that the datasets could be spread evenly between them.


This could conceivably be done in containers within the 2 original VMs  
so as to maximize ARC space.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-12 Thread Ross Walker

On May 12, 2010, at 1:17 AM, schickb schi...@gmail.com wrote:

I'm looking for input on building an HA configuration for ZFS. I've  
read the FAQ and understand that the standard approach is to have a  
standby system with access to a shared pool that is imported during  
a failover.


The problem is that we use ZFS for a specialized purpose that  
results in 10's of thousands of filesystems (mostly snapshots and  
clones). All versions of Solaris and OpenSolaris that we've tested  
take a long time (> 1 hour) to import that many filesystems.


I've read about replication through AVS, but that also seems require  
an import during failover. We'd need something closer to an active- 
active configuration (even if the second active is only modified  
through replication). Or some way to greatly speedup imports.


Any suggestions?


Bypass the complexities of AVS and the start-up times by implementing  
a ZFS head server in a pair of ESX/ESXi with Hot-spares using  
redundant back-end storage (EMC, NetApp, Equalogics).


Then, if there is a hardware or software failure of the head server or  
the host it is on, the hot-spare automatically kicks in with the same  
running state as the original.


There should be no interruption of services in this setup.

This type of arrangement provides for oodles of flexibility in testing/ 
upgrading deployments as well.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-12 Thread Ross Walker
On May 12, 2010, at 3:06 PM, Manoj Joseph manoj.p.jos...@oracle.com  
wrote:



Ross Walker wrote:

On May 12, 2010, at 1:17 AM, schickb schi...@gmail.com wrote:


I'm looking for input on building an HA configuration for ZFS. I've
read the FAQ and understand that the standard approach is to have a
standby system with access to a shared pool that is imported during
a failover.

The problem is that we use ZFS for a specialized purpose that
results in 10's of thousands of filesystems (mostly snapshots and
clones). All versions of Solaris and OpenSolaris that we've tested
take a long time (> 1 hour) to import that many filesystems.

I've read about replication through AVS, but that also seems require
an import during failover. We'd need something closer to an active-
active configuration (even if the second active is only modified
through replication). Or some way to greatly speedup imports.

Any suggestions?


Bypass the complexities of AVS and the start-up times by implementing
a ZFS head server in a pair of ESX/ESXi with Hot-spares using
redundant back-end storage (EMC, NetApp, Equalogics).

Then, if there is a hardware or software failure of the head server or
the host it is on, the hot-spare automatically kicks in with the same
running state as the original.


By hot-spare here, I assume you are talking about a hot-spare ESX
virtual machine.

If there is a software issue and the hot-spare server comes up with the
same state, is it not likely to fail just like the primary server? If it
does not, can you explain why it would not?

That's a good point and worth looking into. I guess it would fail as  
well, since a VMware hot-spare is like a VM in constant vMotion where  
active memory is mirrored between the two.


I suppose one would need a hot-spare for hardware failure and a cold-spare  
for software failure. Both scenarios are possible with ESX; the cold spare,  
I suppose, in this instance would be the original VM rebooting.


Recovery time would be about the same in this instance as an AVS  
solution that has to mount all those filesystems though, so it wins with a  
hardware failure and ties with a software failure; it also wins with ease  
of setup and maintenance, but loses with additional cost. Guess it  
all depends on your risk analysis whether it is worth it.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Ross Walker
On May 6, 2010, at 8:34 AM, Edward Ned Harvey solar...@nedharvey.com  
wrote:



From: Pasi Kärkkäinen [mailto:pa...@iki.fi]


In neither case do you have data or filesystem corruption.



ZFS probably is still OK, since it's designed to handle this (?),
but the data can't be OK if you lose 30 secs of writes.. 30 secs of
writes
that have been ack'd being done to the servers/applications..


What I meant was:  Yes there's data loss.  But no corruption.  In  
other
filesystems, if you have an ungraceful shutdown while the filesystem  
is
writing, since filesystems such as EXT3 perform file-based (or inode- 
based)
block write operations, then you can have files whose contents have  
been
corrupted...  Some sectors of the file still in their old state,  
and some
sectors of the file in their new state.  Likewise, in something  
like EXT3,

you could have some file fully written, while another one hasn't been
written yet, but should have been.  (AKA, some files written out of  
order.)


In the case of EXT3, since it is a journaled filesystem, the journal  
only
keeps the *filesystem* consistent after a crash.  It's still  
possible to

have corrupted data in the middle of a file.


I believe ext3 has an option to journal data as well as metadata, it  
just defaults to metadata.


I don't believe out-of-order writes are so much an issue any more  
since Linux gained write barrier support (and most file systems and  
block devices now support it).
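
(For reference, the ext3 data-journaling mode referred to above is selected at  
mount time; a minimal example, where the device and mount point are  
placeholders:

  mount -o data=journal,barrier=1 /dev/sdb1 /export/data
)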



These things don't happen in ZFS.  ZFS takes journaling to a whole new
level.  Instead of just keeping your filesystem consistent, it also  
keeps
your data consistent.  Yes, data loss is possible when a system  
crashes, but
the filesystem will never have any corruption.  These are separate  
things

now, and never were before.


ZFS does NOT have a journal, it has an intent log, which is completely  
different. A journal logs operations that are to be performed later  
(the journal is read, then the operation performed); an intent log logs  
operations that are being performed now: when the disk flushes, the  
intent entry is marked complete.


ZFS is consistent by the nature of COW which means a partial write  
will not become part of the file system (the old block pointer isn't  
updated till the new block completes the write).


In ZFS, losing n-seconds of writes leading up to the crash will  
never result
in files partially written, or written out of order.  Every atomic  
write to
the filesystem results in a filesystem-consistent and data- 
consistent view

of *some* valid form of all the filesystem and data within it.


The ZFS file system will always be consistent, but if an application  
doesn't flush its data, then it can definitely have partially written  
data.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots and Data Loss

2010-04-23 Thread Ross Walker

On Apr 22, 2010, at 11:03 AM, Geoff Nordli geo...@grokworx.com wrote:


From: Ross Walker [mailto:rswwal...@gmail.com]
Sent: Thursday, April 22, 2010 6:34 AM

On Apr 20, 2010, at 4:44 PM, Geoff Nordli geo...@grokworx.com  
wrote:



If you combine the hypervisor and storage server and have students
connect to the VMs via RDP or VNC or XDM then you will have the
performance of local storage and even script VirtualBox to take a
snapshot right after a save state.

A lot less difficult to configure on the client side, and allows you
to deploy thin clients instead of full desktops where you can get  
away

with it.

It also allows you to abstract the hypervisor from the client.

Need a bigger storage server with lots of memory, CPU and storage
though.

Later, if need be, you can break out the disks to a storage appliance
with an 8GB FC or 10Gbe iSCSI interconnect.



Right, I am in the process now of trying to figure out what the load  
looks

like with a central storage box and how ZFS needs to be configured to
support that load.  So far what I am seeing is very exciting :)

We are currently porting over our existing Learning Lab Infrastructure
platform from MS Virtual Server to VBox + ZFS.  When students  
connect into
their lab environment it dynamically creates their VMs and load  
balances

them across physical servers.


You can also check out OpenSolaris' Xen implementation, which if you  
use Linux VMs will allow PV VMs as well as hardware-assisted fully  
virtualized Windows VMs. There are public domain Windows Xen drivers  
out there.


The advantage of using Xen is its VM live migration and XMLRPC  
management API. As it runs as a bare-metal hypervisor it also allows  
fine granularity of CPU scheduling between guests and the host VM, but  
unfortunately its remote display technology leaves something to be  
desired. For Windows VMs I use the built-in remote desktop, and for  
Linux VMs I use XDM and use something like 'thinstation' on the client  
side.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots and Data Loss

2010-04-22 Thread Ross Walker

On Apr 20, 2010, at 4:44 PM, Geoff Nordli geo...@grokworx.com wrote:


From: matthew patton [mailto:patto...@yahoo.com]
Sent: Tuesday, April 20, 2010 12:54 PM

Geoff Nordli geo...@grokworx.com wrote:


With our particular use case we are going to do a save
state on their
virtual machines, which is going to write  100-400 MB
per VM via CIFS or
NFS, then we take a snapshot of the volume, which
guarantees we get a
consistent copy of their VM.


maybe you left out a detail or two but I can't see how your ZFS  
snapshot

is going to be consistent UNLESS every VM on that ZFS volume is
prevented from doing any and all I/O from the time it finishes save
state and you take your ZFS snapshot.

If by save state you mean something akin to VMWare's disk snapshot,
why would you even bother with a ZFS snapshot in addition?



We are using VirtualBox as our hypervisor.  When it does a save  
state it
generates a memory file.  The memory file plus the volume snapshot  
creates a

consistent state.

In our platform each student's VM points to a unique backend volume  
via

iscsi using VBox's built-in iscsi initiator.  So there is a one-to-one
relationship between VM and Volume.  Just for clarity, a single VM  
could
have multiple disks attached to it.  In that scenario, then a VM  
would have

multiple volumes.



end we could have
maybe 20-30 VMs getting saved at the same time, which could
mean several GB
of data would need to get written in a short time frame and
would need to
get committed to disk.

So it seems the best case would be to get those save
state writes as sync
and get them into a ZIL.


That I/O pattern is vastly >32kb and so will hit the 'rust' ZIL  
(which
ALWAYS exists) and if you were thinking an SSD would help you, I  
don't

see any/much evidence it will buy you anything.




If I set the logbias (b122) to latency, then it will direct all sync  
IO to
the log device, even if it exceeds the zfs_immediate_write_sz  
threshold.


If you combine the hypervisor and storage server and have students  
connect to the VMs via RDP or VNC or XDM then you will have the  
performance of local storage and even script VirtualBox to take a  
snapshot right after a save state.


A lot less difficult to configure on the client side, and allows you  
to deploy thin clients instead of full desktops where you can get away  
with it.


It also allows you to abstract the hypervisor from the client.

Need a bigger storage server with lots of memory, CPU and storage  
though.


Later, if need be, you can break out the disks to a storage appliance  
with an 8GB FC or 10Gbe iSCSI interconnect.


-Ross
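
A rough sketch of the save-state-then-snapshot step described above, assuming
each VM's disks live in their own dataset (VM, pool and dataset names are
placeholders):

  VM=student42
  VBoxManage controlvm "$VM" savestate
  zfs snapshot tank/vbox/$VM@$(date +%Y%m%d-%H%M)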

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can RAIDZ disks be slices ?

2010-04-21 Thread Ross Walker

On Apr 20, 2010, at 12:13 AM, Sunil funt...@yahoo.com wrote:


Hi,

I have a strange requirement. My pool consists of 2 500GB disks in  
stripe which I am trying to convert into a RAIDZ setup without data  
loss but I have only two additional disks: 750GB and 1TB. So, here  
is what I thought:


1. Carve a 500GB slice (A) in 750GB and 2 500GB slices (B,C) in 1TB.
2. Create a RAIDZ pool out of these 3 slices. Performance will be  
bad because of seeks in the same disk for B and C but its just  
temporary.

3. zfs send | recv my current pool data into the new pool.
4. Destroy the current pool.
5. In the new pool, replace B with the 500GB disk freed by the  
destruction of the current pool.
6. Optionally, replace C with second 500GB to free up the 750GB  
completely.


So, essentially I have slices out of 3 separate disks giving me my  
needed 1TB space. Additional 500GB on the 1TB drive can be used for  
scratch non-important data or may be even mirrored with a slice from  
750GB disk.


Will this work as I am hoping it should?

Any potential gotchas?


Wouldn't it just be easier to zfs send to a file on the 1TB, build  
your raidz, then zfs recv into the new raidz from this file?


-Ross
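
A rough sketch of that route (pool, path and device names are placeholders;
the raidz is built from the disks that are not holding the dump file):

  zfs snapshot -r oldpool@move
  zfs send -R oldpool@move > /onetb/oldpool.zsend

  zpool destroy oldpool
  zpool create newpool raidz c1t1d0 c1t2d0 c1t3d0
  zfs recv -Fd newpool < /onetb/oldpool.zsend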

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Ross Walker

On Apr 19, 2010, at 12:50 PM, Don d...@blacksun.org wrote:


Now I'm simply confused.

Do you mean one cachefile shared between the two nodes for this  
zpool? How, may I ask, would this work?


The rpool should be in /etc/zfs/zpool.cache.

The shared pool should be in /etc/cluster/zpool.cache (or wherever  
you prefer to put it) so it won't come up on system start.


What I don't understand is how the second node is either a) supposed  
to share the first nodes cachefile or b) create it's own without  
importing the pool.


You say this is the job of the cluster software- does ha-cluster  
already handle this with their ZFS modules?


I've asked this question 5 different ways and I either still haven't  
gotten an answer- or still don't understand the problem.


Is there a way for a passive node to generate its _own_ zpool.cache  
without importing the file system? If so, how? If not, why is this  
unimportant?


I don't run the cluster suite, but I'd be surprised if the software  
doesn't copy the cache to the passive node whenever it's updated.


-Ross
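
For reference, a rough sketch of how the shared pool can be kept out of the
boot-time cache (pool name and path are placeholders):

  # on the active node: import with an alternate cachefile so the pool is
  # not auto-imported from /etc/zfs/zpool.cache at boot
  zpool import -o cachefile=/etc/cluster/zpool.cache sharedpool

  # or change it on an already-imported pool
  zpool set cachefile=/etc/cluster/zpool.cache sharedpool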

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Ross Walker
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey
solar...@nedharvey.com wrote:
  Seriously, all disks configured WriteThrough (spindle and SSD disks
  alike)
  using the dedicated ZIL SSD device, very noticeably faster than
  enabling the
  WriteBack.

 What do you get with both SSD ZIL and WriteBack disks enabled?

 I mean if you have both why not use both? Then both async and sync IO
 benefits.

 Interesting, but unfortunately false.  Soon I'll post the results here.  I
 just need to package them in a way suitable to give the public, and stick it
 on a website.  But I'm fighting IT fires for now and haven't had the time
 yet.

 Roughly speaking, the following are approximately representative.  Of course
 it varies based on tweaks of the benchmark and stuff like that.
        Stripe 3 mirrors write through:  450-780 IOPS
        Stripe 3 mirrors write back:  1030-2130 IOPS
        Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
        Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS

 Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
 ZIL is 3-4 times faster than naked disk.  And for some reason, having the
 WriteBack enabled while you have SSD ZIL actually hurts performance by
 approx 10%.  You're better off to use the SSD ZIL with disks in Write
 Through mode.

 That result is surprising to me.  But I have a theory to explain it.  When
 you have WriteBack enabled, the OS issues a small write, and the HBA
 immediately returns to the OS:  Yes, it's on nonvolatile storage.  So the
 OS quickly gives it another, and another, until the HBA write cache is full.
 Now the HBA faces the task of writing all those tiny writes to disk, and the
 HBA must simply follow orders, writing a tiny chunk to the sector it said it
 would write, and so on.  The HBA cannot effectively consolidate the small
 writes into a larger sequential block write.  But if you have the WriteBack
 disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
 SSD, and immediately return to the process:  Yes, it's on nonvolatile
 storage.  So the application can issue another, and another, and another.
 ZFS is smart enough to aggregate all these tiny write operations into a
 single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test was the ZIL SSD included in the
write-back?

What I was proposing was write-back only on the disks, and ZIL SSD
with no write-back.

Not all operations hit the ZIL, so it would still be nice to have the
non-ZIL operations return quickly.
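
As a sketch of that split, assuming an LSI MegaRAID-style controller managed
with MegaCli (flag syntax varies between MegaCli versions; logical drive and
adapter numbers are placeholders):

  MegaCli -LDSetProp WB -L0 -a0    # data-disk logical drive: write-back
  MegaCli -LDSetProp WT -L1 -a0    # SSD ZIL logical drive: write-through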

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey  
solar...@nedharvey.com wrote:



A MegaRAID card with write-back cache? It should also be cheaper than
the F20.


I haven't posted results yet, but I just finished a few weeks of  
extensive

benchmarking various configurations.  I can say this:

WriteBack cache is much faster than naked disks, but if you can  
buy an SSD
or two for ZIL log device, the dedicated ZIL is yet again much  
faster than

WriteBack.

It doesn't have to be F20.  You could use the Intel X25 for  
example.  If
you're running solaris proper, you better mirror your ZIL log  
device.  If

you're running opensolaris ... I don't know if that's important.  I'll
probably test it, just to be sure, but I might never get around to it
because I don't have a justifiable business reason to build the  
opensolaris

machine just for this one little test.

Seriously, all disks configured WriteThrough (spindle and SSD disks  
alike)
using the dedicated ZIL SSD device, very noticeably faster than  
enabling the

WriteBack.


What do you get with both SSD ZIL and WriteBack disks enabled?

I mean if you have both why not use both? Then both async and sync IO  
benefits.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey  
solar...@nedharvey.com wrote:


We ran into something similar with these drives in an X4170 that  
turned

out to
be  an issue of the preconfigured logical volumes on the drives. Once
we made
sure all of our Sun PCI HBAs were running the exact same version of
firmware
and recreated the volumes on new drives arriving from Sun we got back
into sync
on the X25-E devices sizes.


Can you elaborate?  Just today, we got the replacement drive that has
precisely the right version of firmware and everything.  Still, when we
plugged in that drive and created a simple volume in the StorageTek RAID
utility, the new drive is 0.001 GB smaller than the old drive.  I'm still
hosed.

Are you saying I might benefit by sticking the SSD into some laptop,  
and

zero'ing the disk?  And then attach to the sun server?

Are you saying I might benefit by finding some other way to make the  
drive

available, instead of using the storagetek raid utility?


I know it is way after the fact, but I find it best to coerce each
drive down to a whole-GB boundary using format (create a Solaris
partition just up to the boundary). Then if you ever get a drive a
little smaller, it should still fit.
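
As a rough outline of that, assuming the SSD is handed straight to ZFS as a
log device (an interactive format session sketched from memory; device name
and size are placeholders):

  format -e c1t1d0              # select the SSD
    partition                   # enter the partition menu
    0                           # edit slice 0; set its size to e.g. 29gb,
                                # rounded down to a whole-GB boundary
    label                       # write the label
  zpool add tank log c1t1d0s0   # give ZFS the slice, not the whole disk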


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker

On Apr 1, 2010, at 8:42 AM, casper@sun.com wrote:




Is that what sync means in Linux?


A sync write is one in which the application blocks until the OS  
acks that
the write has been committed to disk.  An async write is given to  
the OS,
and the OS is permitted to buffer the write to disk at its own  
discretion.
Meaning the async write function call returns sooner, and the  
application is

free to continue doing other stuff, including issuing more writes.

Async writes are faster from the point of view of the application.   
But sync
writes are done by applications which need to satisfy a race  
condition for
the sake of internal consistency.  Applications which need to know  
their

next commands will not begin until after the previous sync write was
committed to disk.



We're talking about the sync for NFS exports in Linux; what do  
they mean

with sync NFS exports?


See section A1 in the FAQ:

http://nfs.sourceforge.net/

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat
darr...@opensolaris.org wrote:
 On 01/04/2010 14:49, Ross Walker wrote:

 We're talking about the sync for NFS exports in Linux; what do they
 mean
 with sync NFS exports?

 See section A1 in the FAQ:

 http://nfs.sourceforge.net/

 I think B4 is the answer to Casper's question:

  BEGIN QUOTE 
 Linux servers (although not the Solaris reference implementation) allow this
 requirement to be relaxed by setting a per-export option in /etc/exports.
 The name of this export option is [a]sync (note that there is also a
 client-side mount option by the same name, but it has a different function,
 and does not defeat NFS protocol compliance).

 When set to sync, Linux server behavior strictly conforms to the NFS
 protocol. This is default behavior in most other server implementations.
 When set to async, the Linux server replies to NFS clients before flushing
 data or metadata modifying operations to permanent storage, thus improving
 performance, but breaking all guarantees about server reboot recovery.
  END QUOTE 

 For more info the whole of section B4 though B6.

True, I was thinking more of the protocol summary.

 Is that what sync means in Linux?  As NFS doesn't use close or
 fsync, what exactly are the semantics.

 (For NFSv2/v3 each *operation* is sync and the client needs to make sure
 it can continue; for NFSv4, some operations are async and the client
 needs to use COMMIT)

Actually the COMMIT command was introduced in NFSv3.

The full details:

NFS Version 3 introduces the concept of safe asynchronous writes. A
Version 3 client can specify that the server is allowed to reply
before it has saved the requested data to disk, permitting the server
to gather small NFS write operations into a single efficient disk
write operation. A Version 3 client can also specify that the data
must be written to disk before the server replies, just like a Version
2 write. The client specifies the type of write by setting the
stable_how field in the arguments of each write operation to UNSTABLE
to request a safe asynchronous write, and FILE_SYNC for an NFS Version
2 style write.

Servers indicate whether the requested data is permanently stored by
setting a corresponding field in the response to each NFS write
operation. A server can respond to an UNSTABLE write request with an
UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the
requested data resides on permanent storage yet. An NFS
protocol-compliant server must respond to a FILE_SYNC request only
with a FILE_SYNC reply.

Clients ensure that data that was written using a safe asynchronous
write has been written onto permanent storage using a new operation
available in Version 3 called a COMMIT. Servers do not send a response
to a COMMIT operation until all data specified in the request has been
written to permanent storage. NFS Version 3 clients must protect
buffered data that has been written using a safe asynchronous write
but not yet committed. If a server reboots before a client has sent an
appropriate COMMIT, the server can reply to the eventual COMMIT
request in a way that forces the client to resend the original write
operation. Version 3 clients use COMMIT operations when flushing safe
asynchronous writes to the server during a close(2) or fsync(2) system
call, or when encountering memory pressure.
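
If you want to watch this from a Linux client, nfsstat breaks out the
per-operation counts, including commit (a sketch):

  nfsstat -c -3    # compare the v3 write and commit counters
                   # before and after an application fsync()s a file
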
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Ross Walker

On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl wrote:




On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
  Use something other than Open/Solaris with ZFS as an NFS  
server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying for more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.




Well, for lots of environments disabling ZIL is perfectly acceptable.
And frankly the reason you get better performance out of the box on  
Linux as NFS server is that it actually behaves like with disabled  
ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using  
Linux here or any other OS which behaves in the same manner.  
Actually it makes it better as even if ZIL is disabled ZFS  
filesystem is always consistent on a disk and you still get all the
other benefits from ZFS.


What would be useful though is to be able to easily disable the ZIL per
dataset instead of an OS-wide switch.
This feature has already been coded and tested and awaits a formal  
process to be completed in order to get integrated. Should be rather  
sooner than later.


Well, being fair to Linux, the default for NFS exports is to export them
'sync' now, which syncs to disk on close or fsync. It has been many
years since they exported 'async' by default. Now if Linux admins set
their shares 'async' and lose important data then it's operator error
and not Linux's fault.
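
For reference, that per-export switch lives in /etc/exports on the Linux
server, e.g. (paths and option lists are placeholders):

  /export/data     *(rw,sync,no_subtree_check)    # protocol-compliant
  /export/scratch  *(rw,async,no_subtree_check)   # faster, but unsynced data
                                                  # is lost if the server crashes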


If apps don't care about their data consistency and don't sync their
data, I don't see why the file server has to care for them. I mean if
it were a local file system and the machine rebooted, the data would be
lost too. Should we care more for data written remotely than locally?


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Ross Walker
On Mar 31, 2010, at 10:25 PM, Richard Elling  
richard.ell...@gmail.com wrote:




On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:

On Mar 31, 2010, at 5:39 AM, Robert Milkowski mi...@task.gda.pl  
wrote:





On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
Use something other than Open/Solaris with ZFS as an NFS  
server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying for more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.




Well, for lots of environments disabling ZIL is perfectly  
acceptable.
And frankly the reason you get better performance out of the box  
on Linux as NFS server is that it actually behaves like with  
disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse  
than using Linux here or any other OS which behaves in the same  
manner. Actually it makes it better as even if ZIL is disabled ZFS  
filesystem is always consistent on a disk and you still get all the
other benefits from ZFS.


What would be useful though is to be able to easily disable ZIL  
per dataset instead of OS wide switch.
This feature has already been coded and tested and awaits a formal  
process to be completed in order to get integrated. Should be  
rather sooner than later.


Well, being fair to Linux, the default for NFS exports is to export
them 'sync' now, which syncs to disk on close or fsync. It has been
many years since they exported 'async' by default. Now if Linux
admins set their shares 'async' and lose important data then it's
operator error and not Linux's fault.


If apps don't care about their data consistency and don't sync
their data, I don't see why the file server has to care for them. I
mean if it were a local file system and the machine rebooted, the
data would be lost too. Should we care more for data written
remotely than locally?


This is not true for sync data written locally, unless you disable  
the ZIL locally.


No, of course, if it's written sync with the ZIL; it just seems that over
Solaris NFS all writes are delayed, not just sync writes.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ISCSI + RAID-Z + OpenSolaris HA

2010-03-20 Thread Ross Walker

On Mar 20, 2010, at 10:18 AM, vikkr psi...@gmail.com wrote:


Hi, sorry for bad English and the picture :).

Could such a setup work?

3 OpenFiler servers each give their drives (2 x 1 TB) over iSCSI to
OpenSolaris.

On OpenSolaris a RAID-Z with double parity is assembled from them.
The OpenSolaris server provides NFS access to this array, and is
duplicated by means of Open HA Cluster.


Yes, you can.

With three servers you want to provide resiliency against the loss
of any one server.


I guess these are mirrors in each server?

If so, you will get better performance and more usable capacity by
exporting each drive individually over iSCSI and setting up the 6 drives
as a raidz2 or even raidz3, which will give 3-4 drives of capacity;
raidz3 will provide resiliency against a drive failure during a server
failure.
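
A sketch of what that could look like on the OpenSolaris head once the six
iSCSI LUNs are visible (device names are placeholders):

  zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0
  # or, trading one more drive of capacity for extra resilience:
  zpool create tank raidz3 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0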


-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ISCSI + RAID-Z + OpenSolaris HA

2010-03-20 Thread Ross Walker

On Mar 20, 2010, at 11:48 AM, vikkr psi...@gmail.com wrote:


THX Ross, I plan on exporting each drive individually over iSCSI.
In this case, writes, as well as reads, will go to all 6 discs
at once, right?


The only question is how to calculate the fault tolerance of such a
system if the discs are all different in size.

Maybe there is such a tool? or check?


They should all be the same size.

You can make them the same size on the iSCSI target.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can we get some documentation on iSCSI sharing after comstar took over?

2010-03-17 Thread Ross Walker





On Mar 17, 2010, at 2:30 AM, Erik Ableson eable...@mac.com wrote:



On 17 mars 2010, at 00:25, Svein Skogen sv...@stillbilde.net wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 16.03.2010 22:31, erik.ableson wrote:


On 16 mars 2010, at 21:00, Marc Nicholas wrote:

On Tue, Mar 16, 2010 at 3:16 PM, Svein Skogen sv...@stillbilde.net
mailto:sv...@stillbilde.net wrote:



I'll write you a Perl script :)


  I think there are ... several people that'd like a script that  
gave us
  back some of the ease of the old shareiscsi one-off, instead of  
having

  to spend time on copy-and-pasting GUIDs they have ... no real use
  for. ;)


I'll try and knock something up in the next few days, then!


Try this :

http://www.infrageeks.com/groups/infrageeks/wiki/56503/zvol2iscsi.html



Thank you! :)

Mind if I (after some sleep) look at extending your script a  
little? Of

course with feedback of the changes I make?

//Svein

Certainly! I just whipped that up since I was testing out a pile of  
clients with different volumes and got tired of going through all  
the steps so anything to make it more complete would be useful.


How about a perl script that emulates the functionality of iscsitadm
so shareiscsi=on works as expected?
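
For anyone attempting it, the COMSTAR plumbing such a script has to wrap for
each zvol looks roughly like this (a sketch; names and sizes are placeholders,
and the GUID comes from the create-lu output):

  zfs create -V 100G tank/vol1
  sbdadm create-lu /dev/zvol/rdsk/tank/vol1   # prints the LU GUID
  stmfadm add-view <GUID>                     # expose the LU to all initiators
  itadm create-target                         # one-time: create an iSCSI target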


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker
On Mar 15, 2010, at 10:55 AM, Gabriele Bulfon gbul...@sonicle.com  
wrote:



Hello,
I'd like to check for any guidance about using zfs on iscsi storage  
appliances.
Recently I had an unlucky situation with an unlucky storage machine  
freezing.
Once the storage was up again (rebooted) all other iscsi clients  
were happy, while one of the iscsi clients (a sun solaris sparc,  
running Oracle) did not mount the volume marking it as corrupted.
I had no way to get back my zfs data: had to destroy and recreate  
from backups.

So I have some questions regarding this nice story:
- I remember sysadmins being able to almost always recover data on  
corrupted ufs filesystems by magic of superblocks. Is there  
something similar on zfs? Is there really no way to access data of a  
corrupted zfs filesystem?
- In this case, the storage appliance is a legacy system based on  
linux, so raids/mirrors are managed at the storage side its own way.  
Being an iscsi target, this volume was mounted as a single iscsi  
disk from the solaris host, and prepared as a zfs pool consisting of  
this single iscsi target. ZFS best practices tell me that to be
safe in case of corruption, pools should always be mirrors or raidz
on 2 or more disks. In this case, I considered it all safe, because the
mirror and raid were managed by the storage machine. But from the
solaris host point of view, the pool was just one! And maybe this  
has been the point of failure. What is the correct way to go in this  
case?
- Finally, looking forward to run new storage appliances using  
OpenSolaris and its ZFS+iscsitadm and/or comstar, I feel a bit  
confused by the possibility of having a double zfs situation: in  
this case, I would have the storage zfs filesystem divided into zfs  
volumes, accessed via iscsi by a possible solaris host that creates  
his own zfs pool on it (...is it too redundant??) and again I would  
fall in the same previous case (host zfs pool connected to one only  
iscsi resource).


Any guidance would be really appreciated :)
Thanks a lot
Gabriele.


What iSCSI target was this?

If it was IET I hope you were NOT using the write-back option on it as  
it caches write data in volatile RAM.


IET does support cache flushes, but if you cache in RAM (bad idea) a
system lockup or panic will ALWAYS lose data.
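
For IET specifically, the caching mode is set per LUN in ietd.conf; something
like this keeps the export write-through (a sketch; the target name and
backing path are placeholders):

  Target iqn.2010-03.com.example:storage.lun1
      Lun 0 Path=/dev/vg0/lun1,Type=fileio,IOMode=wt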


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker
On Mar 15, 2010, at 12:19 PM, Ware Adams rwali...@washdcmail.com  
wrote:




On Mar 15, 2010, at 12:13 PM, Gabriele Bulfon wrote:

Well, I actually don't know what implementation is inside this  
legacy machine.
This machine is an AMI StoreTrends ITX, but maybe it has been built  
around IET, don't know.
Well, maybe I should disable write-back on every zfs host  
connecting on iscsi?

How do I check this?


I think this would be a property of the NAS, not the clients.


Yes, Ware's right the setting should be on the AMI device.

I don't know what target it's using either, but if it has an option to
disable write-back caching, then at least your data should still be safe
even if it doesn't honor flushing.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker

On Mar 15, 2010, at 7:11 PM, Tonmaus sequoiamo...@gmx.net wrote:


Being an iscsi
target, this volume was mounted as a single iscsi
disk from the solaris host, and prepared as a zfs
pool consisting of this single iscsi target. ZFS best
practices, tell me that to be safe in case of
corruption, pools should always be mirrors or raidz
on 2 or more disks. In this case, I considered all
safe, because the mirror and raid was managed by the
storage machine.


As far as I understand Best Practises, redundancy needs to be within  
zfs in order to provide full protection. So, actually Best Practises  
says that your scenario is rather one to be avoided.


There is nothing saying redundancy can't be provided below ZFS; it's just that
if you want auto recovery you need redundancy within ZFS itself as well.


You can have 2 separate raid arrays served up via iSCSI to ZFS which  
then makes a mirror out of the storage.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker

On Mar 15, 2010, at 11:10 PM, Tim Cook t...@cook.ms wrote:




On Mon, Mar 15, 2010 at 9:10 PM, Ross Walker rswwal...@gmail.com  
wrote:

On Mar 15, 2010, at 7:11 PM, Tonmaus sequoiamo...@gmx.net wrote:

Being an iscsi
target, this volume was mounted as a single iscsi
disk from the solaris host, and prepared as a zfs
pool consisting of this single iscsi target. ZFS best
practices, tell me that to be safe in case of
corruption, pools should always be mirrors or raidz
on 2 or more disks. In this case, I considered all
safe, because the mirror and raid was managed by the
storage machine.

As far as I understand Best Practises, redundancy needs to be within  
zfs in order to provide full protection. So, actually Best Practises  
says that your scenario is rather one to be avoided.


There is nothing saying redundancy can't be provided below ZFS; it's just
that if you want auto recovery you need redundancy within ZFS itself as
well.


You can have 2 separate raid arrays served up via iSCSI to ZFS which  
then makes a mirror out of the storage.


-Ross


Perhaps I'm remembering incorrectly, but I didn't think mirroring  
would auto-heal/recover, I thought that was limited to the raidz*  
implementations.


Mirroring auto-heals; in fact, copies=2 on a single-disk vdev can auto-heal
(if it isn't a disk failure).


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS - VMware ESX -- vSphere Upgrade : Zpool Faulted

2010-03-11 Thread Ross Walker

On Mar 11, 2010, at 8:27 AM, Andrew acmcomput...@hotmail.com wrote:


Ok,

The fault appears to have occurred regardless of the attempts to  
move to vSphere as we've now moved the host back to ESX 3.5 from  
whence it came and the problem still exists.


Looks to me like the fault occurred as a result of a reboot.

Any help and advice would be greatly appreciated.


It appears the RDM might have had something to do with this.

Try a different RDM setting than physical, like virtual. Try mounting
the disk via the iSCSI initiator inside the VM instead of RDM.


If you tried fiddling with the ESX RDM options and it still doesn't
work... Inside the Solaris VM, dump the first 128K of the disk to a
file using dd, then using a hex editor find out which LBA contains the
MBR. It should be LBA 0, but I suspect it will be offset. Then the
GPT will run from MBR LBA + 1 to MBR LBA + 33. Use the Wikipedia entry
for the MBR; there is a unique identifier in there somewhere to search for.


There is a backup GPT also in the last 33 sectors of the disk.

Once you find the offset it is best to just dump those 34 sectors  
(0-33) to another file. Edit each MBR and GPT entry to take into  
account the offset then copy those 34 sectors into the first 34  
sectors of the disk, and the last 33 sectors of the file to the last  
33 sectors of the disk. Rescan, and hopefully it will see the disk.


If the offset is in the other direction then it means it's been  
padded, probably with metainfo? And you will need to get rid of the  
RDM and use the iSCSI initiator in the solaris vm to mount the volume.  
See how the first 34 sectors look, and if they are damaged take the  
backup GPT to reconstruct the primary GPT and recreate the MBR.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS - VMware ESX -- vSphere Upgrade : Zpool Faulted

2010-03-11 Thread Ross Walker

On Mar 11, 2010, at 12:31 PM, Andrew acmcomput...@hotmail.com wrote:


Hi Ross,

Ok - as a Solaris newbie.. i'm going to need your help.

Format produces the following:-

c8t4d0 (VMware-Virtualdisk-1.0 cyl 65268 alt 2 hd 255 sec 126) / 
p...@0,0/pci15ad,1...@10/s...@4,0


What dd command do I need to run to reference this disk? I've tried
/dev/rdsk/c8t4d0 and /dev/dsk/c8t4d0 but neither of them is valid.


dd if=/dev/rdsk/c8t4d0p0 of=~/disk.out bs=512 count=256

That should get you the first 128K.

As for a hex editor, try bvi; it's like vi but for binary and supports much
of the vi command set.


Search for the boot signature bytes 0x55 0xAA, which should be bytes 511
and 512 of the MBR sector.
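
One quick way to hunt for it in the dump (a sketch):

  od -A d -t x1 ~/disk.out | grep '55 aa'
  # a 512-byte MBR sector ends with 55 aa, so look for a hit whose decimal
  # offset sits at the tail of a 512-byte-aligned sector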


There is also the possibility that these were wiped somehow, or even  
cached in vmware and lost during a vm reset.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)

2010-03-09 Thread Ross Walker
On Mar 8, 2010, at 11:46 PM, ольга крыжановская olga.kryzhanov...@gmail.com wrote:



tmpfs lacks features like quota and NFSv4 ACL support. May not be the
best choice if such features are required.


True, but if the OP is looking for those features they are more than
likely not looking for an in-memory file system.


This would be more for something like temp databases in a RDBMS or a  
cache of some sort.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)

2010-03-09 Thread Ross Walker
On Mar 9, 2010, at 1:42 PM, Roch Bourbonnais  
roch.bourbonn...@sun.com wrote:




I think this is highlighting that there is an extra CPU requirement to
manage small blocks in ZFS.
The table would probably turn over if you go to 16K zfs records and
16K reads/writes from the application.


The next step for you is to figure out how many read/write IOPS you
expect to take in the real workloads and whether or not the
filesystem portion will represent a significant drain on CPU resources.


I think it highlights more the problem of ARC vs ramdisk, or  
specifically ZFS on ramdisk while ARC is fighting with ramdisk for  
memory.


It is a wonder it didn't deadlock.

If I were to put a ZFS file system on a ramdisk, I would limit the
size of the ramdisk and the ARC so that both, plus the kernel, fit nicely in
memory with room to spare for user apps.
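
Something along these lines, for example (a sketch; names and sizes are
placeholders):

  ramdiskadm -a rd1 2g                    # create a 2 GB ramdisk
  zpool create ramtank /dev/ramdisk/rd1
  # and in /etc/system, cap the ARC (here at 1 GB), then reboot:
  set zfs:zfs_arc_max = 0x40000000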


-Ross

 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris

2010-02-25 Thread Ross Walker
On Feb 25, 2010, at 9:11 AM, Giovanni Tirloni gtirl...@sysdroid.com  
wrote:


On Thu, Feb 25, 2010 at 9:47 AM, Jacob Ritorto jacob.rito...@gmail.com 
 wrote:

It's a kind gesture to say it'll continue to exist and all, but
without commercial support from the manufacturer, it's relegated to
hobbyist curiosity status for us.  If I even mentioned using an
unsupported operating system to the higherups here, it'd be considered
absurd.  I like free stuff to fool around with in my copious spare
time as much as the next guy, don't get me wrong, but that's not the
issue.  For my company, no support contract equals 'Death of
OpenSolaris.'

OpenSolaris is not dying just because there is no support contract  
available for it, yet.


Last time I looked Red Hat didn't offer support contracts for Fedora  
and that project is doing quite well.


The difference here is that Red Hat doesn't claim Fedora is a production OS.

CentOS, while a derivative of RHEL that also comes with no support
contracts, just recompiles the RHEL source, so one gets the inherited
binary support through this and technical support through the community.


OpenSolaris, not being as transparent and being more leading edge, doesn't get
the stability of binary support that Solaris has, and the community is
always playing catch-up on the technical details, which makes it about
as suitable for production use as Fedora.


The commercial support contracts attempted to bridge the gap between
the lack of knowledge due to the newness and the binary stability with
patches. Without them, OpenSolaris is no longer really production quality.


A little scattered in my reasoning but I think I get the main idea  
across.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-19 Thread Ross Walker

On Feb 19, 2010, at 4:57 PM, Ragnar Sundblad ra...@csc.kth.se wrote:



On 18 feb 2010, at 13.55, Phil Harman wrote:

...
Whilst the latest bug fixes put the world to rights again with  
respect to correctness, it may be that some of our performance  
workarounds are still unsafe (i.e. if my iSCSI client assumes all
writes are synchronised to nonvolatile storage, I'd better be  
pretty sure of the failure modes before I work around that).


But are there any clients that assume that an iSCSI volume is  
synchronous?


Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put in another way, isn't is the OS/file systems responsibility to
use the SCSI disk responsibly regardless of the underlying
protocol?


That was my argument a while back.

If you use /dev/dsk then all writes should be asynchronous, WCE
should be on, and the initiator should issue a 'sync' to make sure data is
in NV storage; if you use /dev/rdsk all writes should be synchronous
and WCE should be off. RCD should be off in all cases and the ARC
should cache all it can.


Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the
initiator flags write cache is the wrong way to go about it. It's more
complicated than it needs to be and it leaves setting the storage
policy up to the system admin rather than the storage admin.


It would be better to put effort into supporting the FUA and DPO options
in the target than dynamically changing a volume's cache policy from
the initiator side.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced

2010-02-10 Thread Ross Walker

On Feb 9, 2010, at 1:55 PM, matthew patton patto...@yahoo.com wrote:

The cheapest solution out there that isn't a Supermicro-like server
chassis is DAS in the form of the HP or Dell MD-series, which top out at
15 or 16 3.5" drives. I can only chain 3 units per SAS port off an HBA
in either case.


The new Dell MD11XX series is 24 2.5" drives and you can chain 3 of
them together off a single controller. If your drives are dual-ported
you can use both HBA ports for redundant paths.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS access by OSX clients (was Cores vs. Speed?)

2010-02-09 Thread Ross Walker
On Feb 8, 2010, at 4:58 PM, Edward Ned Harvey macenterpr...@nedharvey.com 
 wrote:


How are you managing UID's on the NFS server?  If user eharvey  
connects to
server from client Mac A, or Mac B, or Windows 1, or Windows 2, or  
any of
the linux machines ... the server has to know it's eharvey, and  
assign the
correct UID's etc.  When I did this in the past, I maintained a list  
of
users in AD, and duplicate list of users in OD, so the mac clients  
could
resolve names to UID's via OD.  And a third duplicate list in NIS so  
the
linux clients could resolve.  It was terrible.  You must be doing  
something

better?


The way I did this type of integration in my environment was to set up
a Linux box with winbind and have the NIS maps just pull out the UID
ranges I wanted shared over NIS, with all passwords blanked out. Then
all *nix-based systems use NIS+Kerberos.


I suppose one could do the same with LDAP, but winbind has the  
advantage of auto-creating UIDs based on the user's RID+mapping range  
which saves A LOT of work in creating UIDs in AD.
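
The winbind side of that boils down to a few smb.conf settings, roughly (a
sketch; the exact idmap syntax varies with the Samba version, and the domain
name and ranges are placeholders):

  [global]
      security = ads
      workgroup = EXAMPLE
      idmap config * : backend = tdb
      idmap config * : range = 100000-199999
      idmap config EXAMPLE : backend = rid
      idmap config EXAMPLE : range = 10000-99999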


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cores vs. Speed?

2010-02-05 Thread Ross Walker

On Feb 5, 2010, at 10:49 AM, Robert Milkowski mi...@task.gda.pl wrote:


Actually, there is.
One difference is that when writing to a raid-z{1|2} pool compared  
to raid-10 pool you should get better throughput if at least 4  
drives are used. Basically it is due to the fact that in RAID-10 the  
maximum you can get in terms of write throughput is a total  
aggregated throughput of half the number of used disks and only  
assuming there are no other bottlenecks between the OS and disks  
especially as you need to take into account that you are doubling the
bandwidth requirements due to mirroring. In the case of RAID-Zn you have
some extra overhead for writing the additional checksum, but other than
that you should get a write throughput closer to T-N (where N is
the RAID-Z level) instead of T/2 in RAID-10.


That hasn't been my experience with raidz. I get a max read and write  
IOPS of the slowest drive in the vdev.


Which makes sense because each write spans all drives and each read  
spans all drives (except the parity drives) so they end up having the  
performance characteristics of a single drive.


Now if you have enough drives you can create multiple raidz vdevs to
get the IOPS up, but you need a lot more drives than what multiple
mirror vdevs can provide IOPS-wise with the same number of spindles.
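
To make that concrete, with eight spindles the two layouts would look roughly
like this (a sketch; device names are placeholders):

  # one 8-disk raidz2 vdev: roughly single-disk IOPS, ~6 disks of capacity
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0
  # four 2-way mirrors: roughly 4x the IOPS, ~4 disks of capacity
  zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
      mirror c1t4d0 c1t5d0 mirror c1t6d0 c1t7d0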


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-04 Thread Ross Walker





On Feb 4, 2010, at 2:00 AM, Tomas Ögren st...@acc.umu.se wrote:


On 03 February, 2010 - Frank Cusack sent me these 0,7K bytes:

On February 3, 2010 12:04:07 PM +0200 Henu henrik.he...@tut.fi  
wrote:

Is there a possibility to get a list of changed files between two
snapshots? Currently I do this manually, using basic file system
functions offered by OS. I scan every byte in every file manually  
and it

 ^^^

On February 3, 2010 10:11:01 AM -0500 Ross Walker rswwal...@gmail.com 


wrote:
Not a ZFS method, but you could use rsync with the dry run option  
to list

all changed files between two file systems.


That's exactly what the OP is already doing ...


rsync by default compares metadata first, and only checks through  
every

byte if you add the -c (checksum) flag.

I would say rsync is the best tool here.

The find -newer blah suggested in other posts won't catch newer  
files

with an old timestamp (which could happen for various reasons, like
being copied with kept timestamps from somewhere else).


Find -newer doesn't catch files added or removed; it assumes identical
trees.


I would be interested in comparing ddiff, bart and rsync (local
comparison only) to see empirically how they match up.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-03 Thread Ross Walker

On Feb 3, 2010, at 9:53 AM, Henu henrik.he...@tut.fi wrote:

Okay, so first of all, it's true that send is always fast and 100%  
reliable because it uses blocks to see differences. Good, and thanks  
for this information. If everything else fails, I can parse the  
information I want from send stream :)


But am I right that there are no other methods to get the list of
changed files other than the send command?


And in my situation I do not need to create snapshots. They are  
already created. The only thing that I need to do, is to get list of  
all the changed files (and maybe the location of difference in them,  
but I can do this manually if needed) between two already created  
snapshots.


Not a ZFS method, but you could use rsync with the dry run option to  
list all changed files between two file systems.
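
For two snapshots of the same filesystem that would look something like this
(a sketch; dataset and snapshot names are placeholders):

  rsync -avn --delete /tank/fs/.zfs/snapshot/new/ /tank/fs/.zfs/snapshot/old/
  # lists files added or changed in 'new', and (as 'deleting ...') files
  # removed since 'old'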


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-03 Thread Ross Walker
On Feb 3, 2010, at 12:35 PM, Frank Cusack frank+lists/ 
z...@linetwo.net wrote:


On February 3, 2010 12:19:50 PM -0500 Frank Cusack frank+lists/z...@linetwo.net 
 wrote:

If you do need to know about deleted files, the find method still may
be faster depending on how ddiff determines whether or not to do a
file diff.  The docs don't explain the heuristics so I wouldn't want
to guess on that.


An improvement on finding deleted files with the find method would
be to not limit your find criteria to files.  Directories with
deleted files will be newer than in the snapshot so you only need
to look at those directories.  I think this would be faster than
ddiff in most cases.


So was there a final consensus on the best way to find the differences
between two snapshots (files/directories added, files/directories
deleted and files/directories changed)?


Find won't do it, ddiff won't do it, I think the only real option is  
rsync. Of course you can zfs send the snap to another system and do  
the rsync there against a local previous version.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-03 Thread Ross Walker
On Feb 3, 2010, at 8:59 PM, Frank Cusack frank+lists/z...@linetwo.net  
wrote:


On February 3, 2010 6:46:57 PM -0500 Ross Walker  
rswwal...@gmail.com wrote:

So was there a final consensus on the best way to find the difference
between two snapshots (files/directories added, files/directories  
deleted

and file/directories changed)?

Find won't do it, ddiff won't do it, I think the only real option is
rsync.


I think you misread the thread.  Either find or ddiff will do it and
either will be better than rsync.


Find can find files that have been added or removed between two  
directory trees?


How?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Home ZFS NAS - 2 drives or 3?

2010-01-30 Thread Ross Walker

On Jan 30, 2010, at 2:53 PM, Mark white...@gmail.com wrote:

I have a 1U server that supports 2 SATA drives in the chassis. I  
have 2 750 GB SATA drives. When I install opensolaris, I assume it  
will want to use all or part of one of those drives for the install.  
That leaves me with the remaining part of disk 1, and all of disk 2.


Question is, how do I best install OS to maximize my ability to use  
ZFS snapshots and recover if one drive fails?


Alternatively, I guess I could add a small USB drive to use solely  
for the OS and then have all of the 2 750 drives for ZFS. Is that a  
bad idea since the OS drive will be standalone?


Just install the OS on the first drive and add the second drive to  
form a mirror. There are wikis and blogs on how to add the second  
drive to form an rpool mirror.


You'll then have a 750GB rpool which you can use for your media and  
rest safely knowing your data is protected in the event of a disk  
failure.
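
The usual recipe is roughly the following (a sketch; device names are
placeholders, the label is copied from the first disk, and the installgrub
step applies to x86):

  prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
  zpool attach rpool c0t0d0s0 c0t1d0s0
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0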


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 2gig file limit on ZFS?

2010-01-21 Thread Ross Walker

On Jan 21, 2010, at 6:47 PM, Daniel Carosone d...@geek.com.au wrote:


On Thu, Jan 21, 2010 at 02:54:21PM -0800, Richard Elling wrote:

+ support file systems larger than 2GiB include 32-bit UIDs and GIDs


file systems, but what about individual files within?


I think the original author meant files bigger than 2GiB and file
systems bigger than 2TiB.


I don't know why that wasn't built in from the start; it's been out for
a long, long time now, between 5 and 10 years if I had to guess.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] 4 Internal Disk Configuration

2010-01-14 Thread Ross Walker
On Jan 14, 2010, at 10:44 AM, Mr. T Doodle tpsdoo...@gmail.com  
wrote:



Hello,

I have played with ZFS but not deployed any production systems using  
ZFS and would like some opinions


I have a T-series box with 4 internal drives and would like to  
deploy ZFS with availability and performance in mind ;)


What would some recommended configurations be?
Example: use internal RAID controller to mirror boot drives, and ZFS  
the other 2?


Can I create one pool with the 3 or 4 drives, install Solaris, and  
use this pool for other apps?

Also, what happens if a drive fails?

Thanks for any tips and gotchas.


Here's my .02

Have two small disks for rpool mirror and 2 large disks for your data  
pool mirror.


Raidz will only give you the IOPS of a single disk, so why not mirror? You
have lots of memory for the ARC read cache and you should get the same
performance and redundancy as a raidz.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] I/O Read starvation

2010-01-11 Thread Ross Walker
On Jan 11, 2010, at 2:23 PM, Bob Friesenhahn bfrie...@simple.dallas.tx.us 
 wrote:



On Mon, 11 Jan 2010, bank kus wrote:


Are we still trying to solve the starvation problem?


I would argue the disk I/O model is fundamentally broken on Solaris
if there is no fair I/O scheduling between multiple read sources;
until that is fixed, individual I_am_systemstalled_while_doing_xyz
problems will crop up. Started a new thread focusing on just this
problem.


While I will readily agree that zfs has a I/O read starvation  
problem (which has been discussed here many times before), I doubt  
that it is due to the reasons you are thinking.


A true fair I/O scheduling model would severely hinder overall  
throughput in the same way that true real-time task scheduling  
cripples throughput.  ZFS is very much based on its ARC model.  ZFS  
is designed for maximum throughput with minimum disk accesses in  
server systems.  Most reads and writes are to and from its ARC.   
Systems with sufficient memory hardly ever do a read from disk and  
so you will only see writes occuring in 'zpool iostat'.


The most common complaint is read stalls while zfs writes its  
transaction group, but zfs may write this data up to 30 seconds  
after the application requested the write, and the application might  
not even be running any more.


Maybe what's needed is an IO scheduler like Linux's 'deadline' IO scheduler,
whose only purpose is to reduce the effect of writers starving readers while
providing some form of guaranteed latency.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

