Re: [zfs-discuss] Any company willing to support a 7410 ?

2012-07-19 Thread Gordon Ross
On Thu, Jul 19, 2012 at 5:38 AM, sol  wrote:
> Other than Oracle do you think any other companies would be willing to take
> over support for a clustered 7410 appliance with 6 JBODs?
>
> (Some non-Oracle names which popped out of google:
> Joyent/Coraid/Nexenta/Greenbytes/NAS/RackTop/EraStor/Illumos/???)
>

I'm not sure, but I think there are people running NexentaStor on that h/w.
If not, then on something pretty close.  NS supports clustering, etc.


-- 
Gordon Ross 
Nexenta Systems, Inc.  www.nexenta.com
Enterprise class storage for everyone
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Creating NFSv4/ZFS XATTR through dirfd through /proc not allowed?

2012-07-13 Thread Gordon Ross
On Fri, Jul 13, 2012 at 2:16 AM, ольга крыжановская
 wrote:
> Can someone here explain why accessing an NFSv4/ZFS xattr directory
> through /proc is forbidden?
>
[...]
> truss says the syscall fails with
> open("/proc/3988/fd/10/myxattr", O_WRONLY|O_CREAT|O_TRUNC, 0666) Err#13 EACCES
>
> Accessing files or directories through /proc/$$/fd/ from a shell
> otherwise works, only the xattr directories cause trouble. Native C
> code has the same problem.
>
> Olga

Does "runat" let you see those xattr files?

-- 
Gordon Ross 
Nexenta Systems, Inc.  www.nexenta.com
Enterprise class storage for everyone
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [illumos-Developer] revisiting aclmode options

2011-08-02 Thread Gordon Ross
On Thu, Jul 21, 2011 at 9:58 PM, Paul B. Henson  wrote:
> On 7/19/2011 7:10 PM, Gordon Ross wrote:
>
>> The idea:  A new "aclmode" setting called "discard", meaning that
>> the users don't care at all about the traditional mode bits.  A
>> dataset with aclmode=discard would have the chmod system call and NFS
>> setattr do absolutely nothing to the mode bits.
>
> The caveat to that are the suid/sgid/sticky bits, which have no
> corresponding bits in the ACL, and potentially will still need to be
> manipulated. The details on that still need to be worked out :).

It seems consistent to me that a "discard" mode would simply
never present suid/sgid/sticky.  (It discards mode settings.)
After all, the suid/sgid/sticky bits don't have any counterpart in
Windows security descriptors, and Windows ACLs use inherited
$CREATOR_OWNER ACEs to do the equivalent of the sticky bit.

>> The mode bits would be derived from the ACL such that the mode
>> represents the greatest possible access that might be allowed by the
>> ACL, without any consideration of deny entries or group memberships.
>
> Is this description different than how the mode bits are currently derived
> when a ZFS acl is set on an object?

I think it's pretty much the same, though I haven't looked recently
at the code that derives the mode from an ACL.

Gordon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Entire client hangs every few seconds

2011-07-26 Thread Gordon Ross
Are the "disk active" lights typically ON when this happens?

On Tue, Jul 26, 2011 at 3:27 PM, Garrett D'Amore  wrote:
> This is actually a recently known problem, and a fix for it is in the
> 3.1 version, which should be available any minute now, if it isn't
> already available.
>
> The problem has to do with some allocations which are sleeping, and jobs
> in the ZFS subsystem get backed behind some other work.
>
> If you have adequate system memory, you are less likely to see this
> problem, I think.
>
>         - Garrett
>
>
> On Tue, 2011-07-26 at 08:29 -0700, Rocky Shek wrote:
>> Ian,
>>
>> Did you enable DeDup?
>>
>> Rocky
>>
>>
>> -Original Message-
>> From: zfs-discuss-boun...@opensolaris.org
>> [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian D
>> Sent: Tuesday, July 26, 2011 7:52 AM
>> To: zfs-discuss@opensolaris.org
>> Subject: [zfs-discuss] Entire client hangs every few seconds
>>
>> Hi all-
>> We've been experiencing a very strange problem for two days now.
>>
>> We have three clients (Linux boxes) connected to a ZFS box (Nexenta) via
>> iSCSI.  Every few seconds (seems random), iostat shows the clients go from
>> a normal 80K+ IOPS to zero.  It lasts up to a few seconds and things are
>> fine again.  When that happens, I/O on the local disks stops too, even the
>> totally unrelated ones. How can that be?  All three clients show the same
>> pattern, and everything was fine prior to Sunday.  Nothing has changed on
>> either the clients or the server. The ZFS box is not even close to being
>> saturated, nor is the network.
>>
>> We don't even know where to start... any advice?
>> Ian
>
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] SSD vs "hybrid" drive - any advice?

2011-07-21 Thread Gordon Ross
I'm looking to upgrade the disk in a high-end laptop (a so-called
"desktop replacement" type).  I use it for development work,
running OpenIndiana (native) with lots of ZFS data sets.

These "hybrid" drives look kind of interesting, i.e. for about $100,
one can get:
 Seagate Momentus XT ST95005620AS 500GB 7200 RPM 2.5" SATA 3.0Gb/s
with NCQ Solid State Hybrid Drive
 http://www.newegg.com/Product/Product.aspx?Item=N82E16822148591
And then for about $400 one can get a 256GB SSD, such as:
 Crucial M4 CT256M4SSD2 2.5" 256GB SATA III MLC Internal Solid State
Drive (SSD)
 http://www.newegg.com/Product/Product.aspx?Item=N82E16820148443

Anyone have experience with either one?  (good or bad)

Any opinions on whether the lower capacity and higher cost of
the SSD are justified in terms of performance for things
like software builds, etc.?

Thanks,
Gordon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [illumos-Developer] revisiting aclmode options

2011-07-19 Thread Gordon Ross
On Mon, Jul 18, 2011 at 9:44 PM, Paul B. Henson  wrote:
> Now that illumos has restored the aclmode option to zfs, I would like to
> revisit the topic of potentially expanding the suite of available modes.
[...]

At one point, I was experimenting with some code for smbfs that would
"invent" the mode bits (remember, smbfs does not get mode bits from
the remote server, only the ACL).  I ended up discarding it there due to
objections from reviewers, but the idea might be useful for people who
really don't care about mode bits.  I'll attempt a description below.


The idea:  A new "aclmode" setting called "discard", meaning that the
users don't care at all about the traditional mode bits.  A dataset with
aclmode=discard would have the chmod system call and NFS setattr
do absolutely nothing to the mode bits.  The getattr call would receive
mode bits derived from the ACL.  (this derivation would actually happen
when and acl is stored, not during getattr)  The mode bits would be
derived from the ACL such that the mode represents the greatest
possible access that might be allowed by the ACL, without any
consideration of deny entries or group memberships.

In detail, that mode derivation might be:

The mode's "owner" part would be the union of access granted by any
"owner" type ACEs in the ACL and any ACEs where the ACE owner
matches the file owner.  The mode's "group" part would be the union
of access granted by any group ACEs and any ACEs whose type is
unknown (all SIDs are of unknown type).  The mode's "other" part
would be the access granted by an "Everyone" ACE, if present.
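To make that derivation concrete, here is a toy sketch in C; the toy_ace_t
below is a simplified stand-in (plain rwx permission bits and a collapsed
"who" field), not the real NFSv4/ZFS ACE layout:

/*
 * Toy illustration of deriving mode bits for aclmode=discard:
 * union the allow ACEs, ignore deny entries and group memberships.
 */
#include <sys/types.h>

enum who_type { WHO_OWNER, WHO_GROUP, WHO_EVERYONE,
                WHO_NAMED_USER, WHO_UNKNOWN_SID };

typedef struct {
    enum who_type who;
    unsigned int  uid;      /* valid when who == WHO_NAMED_USER */
    unsigned int  perms;    /* 04 read, 02 write, 01 execute */
    int           is_allow; /* deny ACEs are ignored entirely */
} toy_ace_t;

mode_t
derive_mode(const toy_ace_t *aces, int n, unsigned int file_owner_uid)
{
    mode_t u = 0, g = 0, o = 0;

    for (int i = 0; i < n; i++) {
        if (!aces[i].is_allow)
            continue;
        switch (aces[i].who) {
        case WHO_OWNER:                 /* owner@ ACEs */
            u |= aces[i].perms;
            break;
        case WHO_NAMED_USER:            /* ACE owner matches file owner */
            if (aces[i].uid == file_owner_uid)
                u |= aces[i].perms;
            break;
        case WHO_GROUP:                 /* group ACEs ... */
        case WHO_UNKNOWN_SID:           /* ... and unknown-type (SID) ACEs */
            g |= aces[i].perms;
            break;
        case WHO_EVERYONE:              /* "Everyone" ACE, if present */
            o |= aces[i].perms;
            break;
        }
    }
    return ((u << 6) | (g << 3) | o);
}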

Would that be of any use?

Gordon
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about COW and snapshots

2011-06-17 Thread Ross Walker
On Jun 16, 2011, at 7:23 PM, Erik Trimble  wrote:

> On 6/16/2011 1:32 PM, Paul Kraus wrote:
>> On Thu, Jun 16, 2011 at 4:20 PM, Richard Elling
>>   wrote:
>> 
>>> You can run OpenVMS :-)
>> Since *you* brought it up (I was not going to :-), how does VMS'
>> versioning FS handle those issues ?
>> 
> It doesn't, per se.  VMS's filesystem has a "versioning" concept (i.e. every 
> time you do a close() on a file, it creates a new file with the version 
> number appended, e.g.  foo;1  and foo;2  are the same file, different 
> versions).  However, it is completely missing the rest of the features we're 
> talking about, like data *consistency* in that file. It's still up to the app 
> using the file to figure out what data consistency means, and such.  Really, 
> all VMS adds is versioning, nothing else (no API, no additional features, 
> etc.).

I believe NTFS was built on the same concept of file streams the VMS FS used 
for versioning.

It's a very simple versioning system.

Personally I use SharePoint, but there are other content management systems 
out there that provide what you're looking for, so no need to bring out the crypt 
keeper.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] write cache partial-disk pools (was Server with 4 drives, how to configure ZFS?)

2011-06-17 Thread Ross Walker
On Jun 17, 2011, at 7:06 AM, Edward Ned Harvey 
 wrote:

> I will only say, that regardless of whether or not that is or ever was true,
> I believe it's entirely irrelevant.  Because your system performs read and
> write caching and buffering in ram, the tiny little ram on the disk can't
> possibly contribute anything.

You would be surprised.

The on-disk buffer is there so data is ready when the hard drive head lands; 
without it, the drive's average rotational latency will trend higher due to 
missed landings because the data wasn't in the buffer at the right time.

The read buffer is there to allow the disk to continuously read sectors whether 
the system bus is ready to transfer or not. Without it, sequential reads wouldn't 
last long enough to reach max throughput before they would have to pause 
because of bus contention and then suffer a full rotation of latency, which would 
kill read performance.
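To put a number on that rotational hit: a missed revolution costs one full
rotation, i.e. 60000 ms / RPM (simple arithmetic, nothing vendor-specific):

/* Cost of missing a revolution because the buffer couldn't hide bus
 * contention: one full rotation = 60000 ms / rpm. */
#include <stdio.h>

int
main(void)
{
    int rpms[] = { 5400, 7200, 10000, 15000 };

    for (int i = 0; i < 4; i++)
        printf("%5d rpm: full rotation = %.2f ms\n",
            rpms[i], 60000.0 / rpms[i]);
    return (0);
}

That's roughly 8.3 ms per miss on a 7200 rpm drive, which is why even a small
buffer matters.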

Try disabling the on-board write or read cache and see how your sequential IO 
performs and you'll see just how valuable those puny caches are.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] dual protocal on one file system?

2011-03-16 Thread Ross Walker
On Mar 16, 2011, at 8:13 AM, Paul Kraus  wrote:

> On Tue, Mar 15, 2011 at 11:00 PM, Edward Ned Harvey
>  wrote:
> 
>> BTW, what is the advantage of the kernel cifs server as opposed to samba?
>> It seems, years ago, somebody must have been standing around and saying
>> "There is a glaring deficiency in samba, and we need to solve it."
> 
>Complete integration with AD/NTFS from the client perspective. In
> other words, the Sun CIFS server really does look like a genuine NTFS
> volume shared via CIFS in terms of ACLs. Snapshots even show up as
> "previous versions" in explorer.
> 
>I have never seen SAMBA provide more than just authentication
> integration with AD.
> 
>The in kernel CIFS server is also supposed to be much faster,
> although I have not tested that yet.

Samba has all those features as well. It has native support for different 
platform ACLs (Linux/Solaris/BSD) and supports mapping POSIX perms with 
platform ACLs to present a quasi NT ACL that reflects the native permissions of 
the host.

Samba even has modules for mapping NT RIDs to *nix UIDs/GIDs, as well as a module 
that supports "Previous Versions" using the host's native snapshot method.

The one glaring deficiency Samba has, though (in Sun's eyes, not mine), is that it 
runs in user space, though I believe that's just the cover story for "it wasn't 
invented here".

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SAS/short stroking vs. SSDs for ZIL

2010-12-25 Thread Ross Walker
On Dec 24, 2010, at 1:21 PM, Richard Elling  wrote:

> Latency is what matters most.  While there is a loose relationship between 
> IOPS
> and latency, you really want low latency.  For 15krpm drives, the average 
> latency
> is 2ms for zero seeks.  A decent SSD will beat that by an order of magnitude.

Actually, I'd say that latency has a direct relationship to IOPS, because it's 
the time it takes to perform an I/O that determines how many I/Os per second 
can be performed.
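To make that concrete, single-threaded IOPS is roughly 1 / average service
time; a quick illustration (the latencies below are ballpark assumptions, not
measurements):

/* Rough single-threaded IOPS = 1 / average service time. */
#include <stdio.h>

int
main(void)
{
    struct { const char *dev; double ms; } d[] = {
        { "15k rpm disk, random IO", 5.5 },   /* ~2ms rotation + seek */
        { "7200 rpm disk, random IO", 12.0 },
        { "decent SSD, random IO", 0.2 },
    };

    for (int i = 0; i < 3; i++)
        printf("%-28s ~%6.0f IOPS\n", d[i].dev, 1000.0 / d[i].ms);
    return (0);
}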

Ever notice how storage vendors list their max IOPS for 512-byte sequential IO 
workloads and their sustained throughput for 1MB+ sequential IO workloads? Only 
SSD makers list their random IOPS numbers and their 4K IO workload numbers.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ... open source moving forward?

2010-12-15 Thread Ross Walker
On Dec 15, 2010, at 6:48 PM, Bob Friesenhahn  
wrote:

> On Wed, 15 Dec 2010, Linder, Doug wrote:
> 
>> But it sure would be nice if they spared everyone a lot of effort and 
>> annoyance and just GPL'd ZFS.  I think the goodwill generated
> 
> Why do you want them to "GPL" ZFS?  In what way would that save you annoyance?

I actually think Doug was trying to say he wished Oracle would open the 
development and make the source code open-sourced, not necessarily GPL'd.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-12-08 Thread Ross Walker
On Dec 8, 2010, at 11:41 PM, Edward Ned Harvey 
 wrote:

> For anyone who cares:
> 
> I created an ESXi machine.  Installed two guest (centos) machines and
> vmware-tools.  Connected them to each other via only a virtual switch.  Used
> rsh to transfer large quantities of data between the two guests,
> unencrypted, uncompressed.  Have found that ESXi virtual switch performance
> peaks around 2.5Gbit.
> 
> Also, if you have a NFS datastore, which is not available at the time of ESX
> bootup, then the NFS datastore doesn't come online, and there seems to be no
> way of telling ESXi to make it come online later.  So you can't auto-boot
> any guest, which is itself stored inside another guest.
> 
> So basically, if you want a layer of ZFS in between your ESX server and your
> physical storage, then you have to have at least two separate servers.  And
> if you want anything resembling actual disk speed, you need infiniband,
> fibre channel, or 10G ethernet.  (Or some really slow disks.)   ;-)

Besides the chicken-and-egg scenario that Ed mentions, there is also the CPU 
cost of running the storage virtualized. You might find that as you put more 
machines on the storage, performance decreases a lot faster than it otherwise 
would standalone, since the storage competes for CPU with the very machines it 
is supposed to be serving.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OpenIndiana-discuss] iops...

2010-12-08 Thread Ross Walker
On Dec 7, 2010, at 9:49 PM, Edward Ned Harvey 
 wrote:

>> From: Ross Walker [mailto:rswwal...@gmail.com]
>> 
>> Well besides databases there are VM datastores, busy email servers, busy
>> ldap servers, busy web servers, and I'm sure the list goes on and on.
>> 
>> I'm sure it is much harder to list servers that are truly sequential in IO
>> than random. This is especially true when you have thousands of users hitting
>> it.
> 
> Depends on the purpose of your server.  For example, I have a ZFS server
> whose sole purpose is to receive a backup data stream from another machine,
> and then write it to tape.  This is a highly sequential operation, and I use
> raidz.
> 
> Some people have video streaming servers.  And http/ftp servers with large
> files.  And a fileserver which is the destination for laptop whole-disk
> backups.  And a repository that stores iso files and rpm's used for OS
> installs on other machines.  And data capture from lab equipment.  And
> packet sniffer / compliance email/data logger.
> 
> and I'm sure the list goes on and on.  ;-)

Ok, single-stream backup servers are one type, but as soon as you have multiple 
streams, even for large files, IOPS trumps throughput to a degree; of course, 
if throughput is very bad then that's no good either.

Knowing your workload is key, or else have enough $$ to implement RAID10 everywhere.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [OpenIndiana-discuss] iops...

2010-12-07 Thread Ross Walker
On Dec 7, 2010, at 12:46 PM, Roy Sigurd Karlsbakk  wrote:

>> Bear a few things in mind:
>> 
>> iops is not iops.
> 
> 
> I am totally aware of these differences, but it seems some people think RAIDz 
> is nonsense unless you don't need speed at all. My testing shows (so far) 
> that the speed is quite good, far better than single drives. Also, as Eric 
> said, those speeds are for random i/o. I doubt there is very much out there 
> that is truly random i/o except perhaps databases, but then, I would never 
> use raid5/raidz for a DB unless at gunpoint.

Well, besides databases there are VM datastores, busy email servers, busy LDAP 
servers, busy web servers, and I'm sure the list goes on and on.

I'm sure it is much harder to list servers that are truly sequential in IO than 
random. This is especially true when you have thousands of users hitting it.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-17 Thread Ross Walker
On Wed, Nov 17, 2010 at 3:00 PM, Pasi Kärkkäinen  wrote:
> On Wed, Nov 17, 2010 at 10:14:10AM +, Bruno Sousa wrote:
>>    Hi all,
>>
>>    Let me tell you all that the MC/S *does* make a difference...I had a
>>    windows fileserver using an ISCSI connection to a host running snv_134
>>    with an average speed of 20-35 mb/s...After the upgrade to snv_151a
>>    (Solaris 11 express) this same fileserver got a performance boost and now
>>    has an average speed of 55-60mb/s.
>>
>>    Not double performance, but WAY better , specially if we consider that
>>    this performance boost was purely software based :)
>>
>
> Did you verify you're using more connections after the update?
> Or was it just *other* COMSTAR (and/or kernel) updates making the difference?

This is true. If someone wasn't utilizing 1Gbps before MC/S, then going
to MC/S won't give you more, as you weren't using what you had (in
fact, the added latency of MC/S may give you less!).

I am going to say that the speed improvement from 134->151a was due to
OS and COMSTAR improvements and not to MC/S.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-16 Thread Ross Walker
On Nov 16, 2010, at 7:49 PM, Jim Dunham  wrote:

> On Nov 16, 2010, at 6:37 PM, Ross Walker wrote:
>> On Nov 16, 2010, at 4:04 PM, Tim Cook  wrote:
>>> AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.
>> 
>> For iSCSI one just needs to have a second (third or fourth...) iSCSI session 
>> on a different IP to the target and run mpio/mpxio/mpath whatever your OS 
>> calls multi-pathing.
> 
> MC/S (Multiple Connections per Sessions) support was added to the iSCSI 
> Target in COMSTAR, now available in Oracle Solaris 11 Express. 

Good to know.

The only initiator I know of that supports that is Windows, but with MC/S one 
at least doesn't need MPIO as the initiator handles the multiplexing over the 
multiple connections itself.

Doing multiple sessions and MPIO is supported almost universally though.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

2010-11-16 Thread Ross Walker
On Nov 16, 2010, at 4:04 PM, Tim Cook  wrote:

> 
> 
> On Wed, Nov 17, 2010 at 7:56 AM, Miles Nordin  wrote:
> >>>>> "tc" == Tim Cook  writes:
> 
>tc> Channeling Ethernet will not make it any faster. Each
>tc> individual connection will be limited to 1gbit.  iSCSI with
>tc> mpxio may work, nfs will not.
> 
> well...probably you will run into this problem, but it's not
> necessarily totally unsolved.
> 
> I am just regurgitating this list again, but:
> 
>  need to include L4 port number in the hash:
>  
> http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb
>  port-channel load-balance mixed  -- for L2 etherchannels
>  mls ip cef load-sharing full -- for L3 routing (OSPF ECMP)
> 
>  nexus makes all this more complicated.  there are a few ways that
>  seem they'd be able to accomplish ECMP:
>   FTag flow markers in ``FabricPath'' L2 forwarding
>   LISP
>   MPLS
>  the basic scheme is that the L4 hash is performed only by the edge
>  router and used to calculate a label.  The routing protocol will
>  either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of
>  per-entire-path ECMP for LISP and MPLS.  unfortunately I don't
>  understand these tools well enough to lead you further, but if
>  you're not using infiniband and want to do >10way ECMP this is
>  probably where you need to look.
> 
>  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942
>  feature added in snv_117, NFS client connections can be spread over multiple 
> TCP connections
>  When rpcmod:clnt_max_conns is set to a value > 1
>  however Even though the server is free to return data on different
>  connections, [it does not seem to choose to actually do so] --
>  6696163 fixed snv_117
> 
>  nfs:nfs3_max_threads=32
>  in /etc/system, which changes the default 8 async threads per mount to
>  32.  This is especially helpful for NFS over 10Gb and sun4v
> 
>  this stuff gets your NFS traffic onto multiple TCP circuits, which
>  is the same thing iSCSI multipath would accomplish.  From there, you
>  still need to do the cisco/??? stuff above to get TCP circuits
>  spread across physical paths.
> 
>  
> http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html
>-- suspect.  it advises ``just buy 10gig'' but many other places
>   say 10G NIC's don't perform well in real multi-core machines
>   unless you have at least as many TCP streams as cores, which is
>   honestly kind of obvious.  lego-netadmin bias.
> 
> 
> 
> AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.

For iSCSI one just needs to have a second (third or fourth...) iSCSI session on 
a different IP to the target and run MPIO/MPxIO/mpath, whatever your OS calls 
multipathing.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Excruciatingly slow resilvering on X4540 (build 134)

2010-11-01 Thread Ross Walker
On Nov 1, 2010, at 3:33 PM, Mark Sandrock  wrote:

> Hello,
> 
>   I'm working with someone who replaced a failed 1TB drive (50% utilized),
> on an X4540 running OS build 134, and I think something must be wrong.
> 
> Last Tuesday afternoon, zpool status reported:
> 
> scrub: resilver in progress for 306h0m, 63.87% done, 173h7m to go
> 
> and a week being 168 hours, that put completion at sometime tomorrow night.
> 
> However, he just reported zpool status shows:
> 
> scrub: resilver in progress for 447h26m, 65.07% done, 240h10m to go
> 
> so it's looking more like 2011 now. That can't be right.
> 
> I'm hoping for a suggestion or two on this issue.
> 
> I'd search the archives, but they don't seem searchable. Or am I wrong about 
> that?

Some zpool versions have an issue where snapshot creation/deletion during a 
resilver causes it to start over.

Try suspending all snapshot activity during the resilver.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-11-01 Thread Ross Walker
On Nov 1, 2010, at 5:09 PM, Ian D  wrote:

>> Maybe you are experiencing this:
>> http://opensolaris.org/jive/thread.jspa?threadID=11942
> 
> It does look like this... Is this really the expected behaviour?  That's just 
> unacceptable.  It is so bad it sometimes drops connections and fails copies and 
> SQL queries...

Then set the zfs_write_limit_override to a reasonable value.

Depending on the speed of your ZIL and/or backing store (for async IO) you will 
need to limit the write size in such a way that TXG 1 is fully committed before 
TXG 2 fills.

Myself, with a RAID controller with a 512MB BBU write-back cache I set the 
write limit to 512MB which allows my setup to commit-before-fill.

It also prevents ARC from discarding good read cache data in favor of write 
cache.

Others may have a good calculation based on ARC execution plan timings, disk 
seek and sustained throughput to give an accurate figure for one's setup; 
otherwise, start with a reasonable value, say 1GB, and decrease it until the 
pauses stop.
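For anyone tuning this, a rough back-of-the-envelope for picking a starting
value (this is just the arithmetic behind the advice above, not a ZFS
interface, and every number in it is a placeholder for your own measurements):

/* Size the write limit so one txg can be committed within the txg sync
 * interval, and never above the controller's write-back cache. */
#include <stdio.h>

int
main(void)
{
    double commit_mb_s    = 300.0; /* measured sustained commit rate */
    double txg_interval_s = 5.0;   /* assumed txg sync interval; check your build */
    double bbu_cache_mb   = 512.0; /* controller write-back cache */

    double limit_mb = commit_mb_s * txg_interval_s;
    if (limit_mb > bbu_cache_mb)
        limit_mb = bbu_cache_mb;

    printf("zfs_write_limit_override ~= %.0f MB (%.0f bytes)\n",
        limit_mb, limit_mb * 1024 * 1024);
    return (0);
}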

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] vdev failure -> pool loss ?

2010-10-19 Thread Ross Walker
On Oct 19, 2010, at 4:33 PM, Tuomas Leikola  wrote:

> On Mon, Oct 18, 2010 at 8:18 PM, Simon Breden  wrote:
>> So are we all agreed then, that a vdev failure will cause pool loss ?
>> --
> 
> unless you use copies=2 or 3, in which case your data is still safe
> for those datasets that have this option set.

This doesn't prevent pool loss in the face of a vdev failure, merely reduces 
the likelihood of file loss due to block corruption.

A loss of a vdev (mirror, raidz or non-redundant disk) means the loss of the 
pool.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance issues with iSCSI under Linux

2010-10-15 Thread Ross Walker
On Oct 15, 2010, at 5:34 PM, Ian D  wrote:

>> Has anyone suggested either removing L2ARC/SLOG
>> entirely or relocating them so that all devices are
>> coming off the same controller? You've swapped the
>> external controller but the H700 with the internal
>> drives could be the real culprit. Could there be
>> issues with cross-controller IO in this case? Does
>> the H700 use the same chipset/driver as the other
>> controllers you've tried? 
> 
> We'll try that.  We have a couple other devices we could use for the SLOG 
> like a DDRDrive X1 and an OCZ Z-Drive which are both PCIe cards and don't use 
> the local controller.

What mount options are you using on the Linux client for the NFS share?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-15 Thread Ross Walker
On Oct 15, 2010, at 9:18 AM, Stephan Budach  wrote:

> Am 14.10.10 17:48, schrieb Edward Ned Harvey:
>> 
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>>> boun...@opensolaris.org] On Behalf Of Toby Thain
>>> 
>>>> I don't want to heat up the discussion about ZFS managed discs vs.
>>>> HW raids, but if RAID5/6 would be that bad, no one would use it
>>>> anymore.
>>> It is. And there's no reason not to point it out. The world has
>> Well, neither one of the above statements is really fair.
>> 
>> The truth is:  radi5/6 are generally not that bad.  Data integrity failures
>> are not terribly common (maybe one bit per year out of 20 large disks or
>> something like that.)
>> 
>> And in order to reach the conclusion "nobody would use it," the people using
>> it would have to first *notice* the failure.  Which they don't.  That's kind
>> of the point.
>> 
>> Since I started using ZFS in production, about a year ago, on three servers
>> totaling approx 1.5TB used, I have had precisely one checksum error, which
>> ZFS corrected.  I have every reason to believe, if that were on a raid5/6,
>> the error would have gone undetected and nobody would have noticed.
>> 
> Point taken!
> 
> So, what would you suggest, if I wanted to create really big pools? Say in 
> the 100 TB range? That would be quite a number of single drives then, 
> especially when you want to go with zpool raid-1.

A pool consisting of 4-disk raidz vdevs (25% overhead) or 6-disk raidz2 vdevs 
(33% overhead) should deliver the storage and performance for a pool that size, 
versus a pool of mirrors (50% overhead).

You need a lot of spindles to reach 100TB.
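For example, a rough spindle count for 100TB usable, assuming 2TB drives (the
drive size and layouts are only illustrative, and slop/metadata overhead is
ignored):

/* Spindles needed for 100 TB usable under different vdev layouts.
 * Link with -lm for ceil(). */
#include <stdio.h>
#include <math.h>

int
main(void)
{
    double usable_tb = 100.0, drive_tb = 2.0;
    struct { const char *layout; int data, total; } v[] = {
        { "4-disk raidz  (25% overhead)", 3, 4 },
        { "6-disk raidz2 (33% overhead)", 4, 6 },
        { "2-disk mirror (50% overhead)", 1, 2 },
    };

    for (int i = 0; i < 3; i++) {
        int vdevs = (int)ceil(usable_tb / (v[i].data * drive_tb));
        printf("%-31s %3d vdevs, %3d drives\n",
            v[i].layout, vdevs, vdevs * v[i].total);
    }
    return (0);
}

With 2TB drives that works out to roughly 68, 78 and 100 drives respectively.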

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Ross Walker
On Oct 12, 2010, at 8:21 AM, "Edward Ned Harvey"  wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Stephan Budach
>> 
>>  c3t211378AC0253d0  ONLINE   0 0 0
> 
> How many disks are there inside of c3t211378AC0253d0?
> 
> How are they configured?  Hardware raid 5?  A mirror of two hardware raid
> 5's?  The point is:  This device, as seen by ZFS, is not a pure storage
> device.  It is a high level device representing some LUN or something, which
> is configured & controlled by hardware raid.
> 
> If there's zero redundancy in that device, then scrub would probably find
> the checksum errors consistently and repeatably.
> 
> If there's some redundancy in that device, then all bets are off.  Sometimes
> scrub might read the "good half" of the data, and other times, the bad half.
> 
> 
> But then again, the error might not be in the physical disks themselves.
> The error might be somewhere in the raid controller(s) or the interconnect.
> Or even some weird unsupported driver or something.

If it were a parity-based RAID set then the error would most likely be 
reproducible, if not detected by the RAID controller.

The biggest problem is with hardware mirrors, where the hardware can't detect 
an error on one side vs. the other.

For mirrors it's always best to use ZFS's built-in mirrors; otherwise, if I were 
to use HW RAID I would use RAID5/6/50/60, since errors encountered can be 
reproduced. Two parity RAIDs mirrored in ZFS would probably provide the best of 
both worlds, though at a steep cost.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] performance leakage when copy huge data

2010-09-09 Thread Ross Walker
On Sep 9, 2010, at 8:27 AM, Fei Xu  wrote:

>> 
>> Service times here are crap. Disks are malfunctioning
>> in some way. If
>> your source disks can take seconds (or 10+ seconds)
>> to reply, then of
>> course your copy will be slow. Disk is probably
>> having a hard time
>> reading the data or something.
>> 
> 
> 
> Yeah, that should not go over 15ms.  I just cannot understand why it starts 
> ok with hundreds of GB of files transferred and then suddenly falls "asleep".
> By the way, WDIDLE time is already disabled, which might cause some issue.  
> I've changed to another system to test ZFS send between an 8*1TB pool and a 
> 4*1TB pool.  Hope everything's OK in this case.

This might be the dreaded WD TLER issue. Basically the drive keeps retrying a 
read operation over and over after a bit error, trying to recover from the read 
error itself. With ZFS one really needs to disable this behavior and have the 
drives fail the read immediately.

Check your drives to see if they have this feature; if so, think about replacing 
the drives in the source pool that have long service times, and make sure this 
feature is disabled on the destination pool drives.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] VM's on ZFS - 7210

2010-08-27 Thread Ross Walker
On Aug 27, 2010, at 1:04 AM, Mark  wrote:

> We are using a 7210, 44 disks I believe, 11 stripes of RAIDz sets.  When I 
> installed I selected the best bang for the buck on the speed vs capacity 
> chart.
> 
> We run about 30 VM's on it, across 3 ESX 4 servers.  Right now, its all 
> running NFS, and it sucks... sooo slow.

I have a Dell 2950 server with a PERC6 controller with 512MB of write-back 
cache and a pool of mirrors made out of 14 15K SAS drives. The ZIL is integrated.

This is serving 30 VMs on 3 ESXi hosts and performance is good.

I find the #1 operation is random reads, so I doubt the ZIL will make as much 
difference as a very large L2ARC will. I'd hit that first; it's a cheaper buy. 
Random reads across a theoretically infinitely sized (in comparison to system 
RAM) 7200RPM device are a killer. Cache as much as possible in the hope of 
hitting cache rather than disk.
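To put the caching argument in numbers (the latencies below are rough
assumptions): effective read latency is roughly
hit_rate * cache_latency + (1 - hit_rate) * disk_latency, so every extra point
of hit rate the L2ARC buys you pays off directly:

/* Effective random-read latency vs. cache hit rate.  Assumed latencies:
 * ~0.1 ms for an ARC/L2ARC hit, ~12 ms for a 7200 rpm random read. */
#include <stdio.h>

int
main(void)
{
    double cache_ms = 0.1, disk_ms = 12.0;
    double rates[] = { 0.50, 0.80, 0.90, 0.95, 0.99 };

    for (int i = 0; i < 5; i++) {
        double eff = rates[i] * cache_ms + (1.0 - rates[i]) * disk_ms;
        printf("hit rate %2.0f%%: ~%5.2f ms effective (~%5.0f IOPS)\n",
            rates[i] * 100, eff, 1000.0 / eff);
    }
    return (0);
}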

Breaking your pool into two or three pools, using different vdev types with 
different types of disks, and tiering your VMs based on their performance 
profiles would help.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Ross Walker
On Aug 21, 2010, at 4:40 PM, Richard Elling  wrote:

> On Aug 21, 2010, at 10:14 AM, Ross Walker wrote:
>> I'm planning on setting up an NFS server for our ESXi hosts and plan on 
>> using a virtualized Solaris or Nexenta host to serve ZFS over NFS.
> 
> Please follow the joint EMC+NetApp best practices for VMware ESX servers.
> The recommendations apply to any NFS implementation for ESX.

Thanks, I'll check that out! Always looking for advice on how best to tweak NFS 
for ESX.

I have a current ZFS over NFS implementation, but on direct attached storage 
using Sol10. I will be interested to see how Nexenta compares.

>> The storage I have available is provided by Equallogic boxes over 10Gbe 
>> iSCSI.
>> 
>> I am trying to figure out the best way to provide both performance and 
>> resiliency given the Equallogic provides the redundancy.
>> 
>> Since I am hoping to provide a 2TB datastore I am thinking of carving out 
>> either 3 1TB luns or 6 500GB luns that will be RDM'd to the storage VM and 
>> within the storage server setting up either 1 raidz vdev with the 1TB luns 
>> (less RDMs) or 2 raidz vdevs with the 500GB luns (more fine grained 
>> expandability, work in 1TB increments).
>> 
>> Given the 2GB of write-back cache on the Equallogic I think the integrated 
>> ZIL would work fine (needs benchmarking though).
> 
> This should work fine.
> 
>> The vmdk files themselves won't be backed up (more data then I can store), 
>> just the essential data contained within, so I would think resiliency would 
>> be important here.
>> 
>> My questions are these.
>> 
>> Does this setup make sense?
> 
> Yes, it is perfectly reasonable.
> 
>> Would I be better off forgoing resiliency for simplicity, putting all my 
>> faith into the Equallogic to handle data resiliency?
> 
> I don't have much direct experience with Equillogic, but I would expect that
> they do a reasonable job of protecting data, or they would be out of business.
> 
> You can also use the copies parameter to set extra redundancy for the 
> important
> files. ZFS will also tell you if corruption is found in a single file, so 
> that you can 
> recover just the file and not be forced to recover everything else. I think 
> this fits
> into your backup strategy.

I thought of the copies parameter, but figured a raidz laid on top of the 
storage pool would only waste 33% instead of 50%, and since this is on top of a 
conceptually single RAID volume, the IOPS bottleneck won't come into play, since 
any single drive's IOPS will be equal to the array's IOPS as a whole.

>> Will this setup perform? Anybody with experience in this type of setup?
> 
> Many people are quite happy with RAID arrays and still take advantage of 
> the features of ZFS: checksums, snapshots, clones, send/receive, VMware
> integration, etc. The decision of where to implement data protection (RAID) 
> is not as important as the decision to protect your data.  
> 
> My advice: protect your data.

Always good advice.

So I suppose this just confirms my analysis.

Thanks,

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Ross Walker
On Aug 21, 2010, at 2:14 PM, Bill Sommerfeld  wrote:

> On 08/21/10 10:14, Ross Walker wrote:
>> I am trying to figure out the best way to provide both performance and 
>> resiliency given the Equallogic provides the redundancy.
> 
> (I have no specific experience with Equallogic; the following is just generic 
> advice)
> 
> Every bit stored in zfs is checksummed at the block level; zfs will not use 
> data or metadata if the checksum doesn't match.

I understand that much and is the reason I picked ZFS for persistent data 
storage.

> zfs relies on redundancy (storing multiple copies) to provide resilience; if 
> it can't independently read the multiple copies and pick the one it likes, it 
> can't recover from bitrot or failure of the underlying storage.

Can't auto-recover, but it will report the failure so it can be restored from 
backup; though since the vmdk files are too big to back up...

> if you want resilience, zfs must be responsible for redundancy.

It must have redundancy, but it doesn't necessarily need full control of it.

> You imply having multiple storage servers.  The simplest thing to do is 
> export one large LUN from each of two different storage servers, and have ZFS 
> mirror them.

Well... You need to know that the multiple storage servers are acting as a 
single pool with tiered storage levels (15K SAS in RAID10 and SATA in RAID6), 
and LUNs are auto-tiered across these based on demand performance, so a pool of 
mirrors won't really provide any more performance than a raidz (same physical 
RAID), and raidz will only "waste" 33% as opposed to 50%.

> While this reduces the available space, depending on your workload, you can 
> make some of it back by enabling compression.
> 
> And, given sufficiently recent software, and sufficient memory and/or ssd for 
> l2arc, you can enable dedup.

The host is a blade server with no room for SSDs, but if SSD investment is 
needed in the future I can add an SSD Equallogic box to the storage pool.

> Of course, the effectiveness of both dedup and compression depends on your 
> workload.
> 
>> Would I be better off forgoing resiliency for simplicity, putting all my 
>> faith into the Equallogic to handle data resiliency?
> 
> IMHO, no; the resulting system will be significantly more brittle.

Exactly how brittle I guess depends on the Equallogic system.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS with Equallogic storage

2010-08-21 Thread Ross Walker

I'm planning on setting up an NFS server for our ESXi hosts and plan on using a 
virtualized Solaris or Nexenta host to serve ZFS over NFS.

The storage I have available is provided by Equallogic boxes over 10Gbe iSCSI.

I am trying to figure out the best way to provide both performance and 
resiliency given the Equallogic provides the redundancy.

Since I am hoping to provide a 2TB datastore I am thinking of carving out 
either 3 1TB LUNs or 6 500GB LUNs that will be RDM'd to the storage VM and, 
within the storage server, setting up either 1 raidz vdev with the 1TB LUNs 
(fewer RDMs) or 2 raidz vdevs with the 500GB LUNs (more fine-grained 
expandability; grow in 1TB increments).

Given the 2GB of write-back cache on the Equallogic I think the integrated ZIL 
would work fine (needs benchmarking though).

The vmdk files themselves won't be backed up (more data than I can store), just 
the essential data contained within, so I would think resiliency would be 
important here.

My questions are these.

Does this setup make sense?

Would I be better off forgoing resiliency for simplicity, putting all my faith 
into the Equallogic to handle data resiliency?

Will this setup perform? Anybody with experience in this type of setup?

-Ross


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS in Linux (was Opensolaris is apparently dead)

2010-08-19 Thread Ross Walker
On Aug 19, 2010, at 9:26 AM, joerg.schill...@fokus.fraunhofer.de (Joerg 
Schilling) wrote:

> "Edward Ned Harvey"  wrote:
> 
>> The reasons for ZFS not in Linux must be more than just the license issue.
> 
> If Linux has ZFS, then it would be possible to do 
> 
> -I/O performance analysis based on the same FS implementation
> 
> -stability analysis for data, crashes, ...
> 
> and a lot more. It may be that the Linux people are in fear of becoming 
> comparable.

I really think that ZFS on Linux implemented against the block layer, instead 
of hooking into the VFS layer (which would need work in the kernel to support 
it, and thus kernel adoption of that work), would provide comparable performance 
to FreeBSD/Solaris on comparable hardware.

This means a lot more work on the port, as it will need to reimplement at the 
lower-level block layer a lot of the routines that used to be handled by the 
OS's VFS layer, but this would assure both reliability and performance.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-18 Thread Ross Walker
On Aug 18, 2010, at 10:43 AM, Bob Friesenhahn  
wrote:

> On Wed, 18 Aug 2010, Joerg Schilling wrote:
>> 
>> Linus is right with his primary decision, but this also applies for static
>> linking. See Lawrence Rosen for more information, the GPL does not distinct
>> between static and dynamic linking.
> 
> GPLv2 does not address linking at all and only makes vague references to the 
> "program".  There is no insinuation that the program needs to occupy a single 
> address space or mention of address spaces at all. The "program" could 
> potentially be a composition of multiple cooperating executables (e.g. like 
> GCC) or multiple modules.  As you say, everything depends on the definition 
> of a "derived work".
> 
> If a shell script may be dependent on GNU 'cat', does that make the shell 
> script a "derived work"?  Note that GNU 'cat' could be replaced with some 
> other 'cat' since 'cat' has a well defined interface.  A very similar 
> situation exists for loadable modules which have well defined interfaces 
> (like 'cat').  Based on the argument used for 'cat', the mere injection of a 
> loadable module into an execution environment which includes GPL components 
> should not require that module to be distributable under GPL.  The module 
> only needs to be distributable under GPL if it was developed in such a way 
> that it specifically depends on GPL components.

This is how I see it as well.

The big problem is not the insmod'ing of the blob but how it is distributed.

As far as I know this can be circumvented by not including it in the main 
distribution but instead shipping it through a separate repo to be installed 
afterwards, à la Debian non-free.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-17 Thread Ross Walker
On Aug 17, 2010, at 5:44 AM, joerg.schill...@fokus.fraunhofer.de (Joerg 
Schilling) wrote:

> Frank Cusack  wrote:
> 
>> On 8/16/10 9:57 AM -0400 Ross Walker wrote:
>>> No, the only real issue is the license and I highly doubt Oracle will
>>> re-release ZFS under GPL to dilute it's competitive advantage.
>> 
>> You're saying Oracle wants to keep zfs out of Linux?
> 
> In order to get zfs into Linux, you don't need to change the license for ZFS 
> but the mind of the Linux folks.

I'm afraid you will have better luck catching a moonbeam in your hands than 
convincing the likes of RS. I'd bet even Charlie Manson would say, "Dude, that 
guy is crazy."

And therein lies the problem: you need the agreement of all copyright holders in 
a GPL project to change its licensing terms, and some just will not budge.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-17 Thread Ross Walker
On Aug 16, 2010, at 11:17 PM, Frank Cusack  wrote:

> On 8/16/10 9:57 AM -0400 Ross Walker wrote:
>> No, the only real issue is the license and I highly doubt Oracle will
>> re-release ZFS under GPL to dilute it's competitive advantage.
> 
> You're saying Oracle wants to keep zfs out of Linux?

I would if I were them, wouldn't you?

Linux has already eroded the low end of the Solaris business model; if Linux 
had ZFS it could possibly erode the middle tier as well.

Solaris with only high-end customers wouldn't be very profitable (unless 
seriously marked up in price), and thus unsustainable as a business.

Sun didn't get this, but Oracle does.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ross Walker
On Aug 15, 2010, at 9:44 PM, Peter Jeremy  
wrote:

> Given that both provide similar features, it's difficult to see why
> Oracle would continue to invest in both.  Given that ZFS is the more
> mature product, it would seem more logical to transfer all the effort
> to ZFS and leave btrfs to die.

I can see Oracle ejecting BTRFS from its fold, but I seriously doubt it will 
die. BTRFS is now mainlined into the Linux kernel, and I will bet that currently 
a lot of its development is already coming from outside parties and Oracle is 
simply acting as the commit maintainer.

Linux is an evolving OS; what determines an FS's continued existence is the 
public's adoption rate of that FS. If nobody ends up using it then the kernel 
will drop it, in which case it will eventually die.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Opensolaris is apparently dead

2010-08-16 Thread Ross Walker
On Aug 16, 2010, at 9:06 AM, "Edward Ned Harvey"  wrote:

> ZFS does raid, and mirroring, and resilvering, and partitioning, and NFS, and 
> CIFS, and iSCSI, and device management via vdev's, and so on.  So ZFS steps 
> on a lot of linux peoples' toes.  They already have code to do this, or that, 
> why should they kill off all these other projects, and turn the world upside 
> down, and bow down and acknowledge that anyone else did anything better than 
> what they did?

Actually ZFS doesn't do NFS/CIFS/iSCSI itself; those shareX options merely 
execute scripts to perform the appropriate OS operations.

BTRFS also handles the "RAID" of the hard disks as ZFS does.

No, the only real issue is the license, and I highly doubt Oracle will 
re-release ZFS under the GPL and dilute its competitive advantage.

I think the market NEEDS file system competition in order to drive innovation, 
so it would be beneficial for both FSs to continue together into the future.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and VMware

2010-08-14 Thread Ross Walker
On Aug 14, 2010, at 8:26 AM, "Edward Ned Harvey"  wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>> 
>> #3  I previously believed that vmfs3 was able to handle sparse files
>> amazingly well, like, when you create a new vmdk, it appears almost
>> instantly regardless of size, and I believed you could copy sparse
>> vmdk's
>> efficiently, not needing to read all the sparse consecutive zeroes.  I
>> was
>> wrong.  
> 
> Correction:  I was originally right.  ;-)  
> 
> In ESXi, if you go to command line (which is busybox) then sparse copies are
> not efficient.
> If you go into vSphere, and browse the datastore, and copy vmdk files via
> gui, then it DOES copy efficiently.
> 
> The behavior is the same, regardless of NFS vs iSCSI.
> 
> You should always copy files via GUI.  That's the lesson here.

Technically you should always copy vmdk files via vmkfstools on the command 
line. That will give you wire-speed transfers.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-05 Thread Ross Walker
On Aug 5, 2010, at 2:24 PM, Roch Bourbonnais  wrote:

> 
> On Aug 5, 2010, at 7:49 PM, Ross Walker wrote:
> 
>> On Aug 5, 2010, at 11:10 AM, Roch  wrote:
>> 
>>> 
>>> Ross Walker writes:
>>>> On Aug 4, 2010, at 12:04 PM, Roch  wrote:
>>>> 
>>>>> 
>>>>> Ross Walker writes:
>>>>>> On Aug 4, 2010, at 9:20 AM, Roch  wrote:
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Ross Asks: 
>>>>>>> So on that note, ZFS should disable the disks' write cache,
>>>>>>> not enable them  despite ZFS's COW properties because it
>>>>>>> should be resilient. 
>>>>>>> 
>>>>>>> No, because ZFS builds resiliency on top of unreliable parts. it's able 
>>>>>>> to deal
>>>>>>> with contained failures (lost state) of the disk write cache. 
>>>>>>> 
>>>>>>> It can then export LUNS that have WC enabled or
>>>>>>> disabled. But if we enable the WC on the exported LUNS, then
>>>>>>> the consumer of these LUNS must be able to say the same.
>>>>>>> The discussion at that level then needs to focus on failure groups.
>>>>>>> 
>>>>>>> 
>>>>>>> Ross also Said :
>>>>>>> I asked this question earlier, but got no answer: while an
>>>>>>> iSCSI target is presented WCE does it respect the flush
>>>>>>> command? 
>>>>>>> 
>>>>>>> Yes. I would like to say "obviously" but it's been anything
>>>>>>> but.
>>>>>> 
>>>>>> Sorry to probe further, but can you expand on but...
>>>>>> 
>>>>>> Just if we had a bunch of zvols exported via iSCSI to another Solaris
>>>>>> box which used them to form another zpool and had WCE turned on would
>>>>>> it be reliable? 
>>>>>> 
>>>>> 
>>>>> Nope. That's because all the iSCSI are in the same fault
>>>>> domain as they share a unified back-end cache. What works,
>>>>> in principle, is mirroring SCSI channels hosted on 
>>>>> different storage controllers (or N SCSI channels on N
>>>>> controller in a raid group).
>>>>> 
>>>>> Which is why keeping the WC set to the default, is really
>>>>> better in general.
>>>> 
>>>> Well I was actually talking about two backend Solaris storage servers 
>>>> serving up storage over iSCSI to a front-end Solaris server serving ZFS 
>>>> over NFS, so I have redundancy there, but want the storage to be 
>>>> performant, so I want the iSCSI to have WCE, yet I want it to be reliable 
>>>> and have it honor cache flush requests from the front-end NFS server.
>>>> 
>>>> Does that make sense? Is it possible?
>>>> 
>>> 
>>> Well in response to a commit (say after a file creation) then the
>>> front end server will end up sending flush write caches on
>>> both side of the iscsi mirror which will reach the backend server
>>> which will flush disk write caches. This will all work but
>>> probably  not unleash performance the way you would like it
>>> to.
>> 
>> 
>> 
>>> If you setup to have the backend server not honor the
>>> backend disk flush write caches, then the 2 backend pools become at
>>> risk of corruption, mostly because the ordering of IOs
>>> around the ueberblock updates. If you have faith, then you
>>> could consider that you won't hit 2 backend pool corruption
>>> together and rely on the frontend resilvering to rebuild the
>>> corrupted backend.
>> 
>> So you are saying setting WCE disables cache flush on the target and setting 
>> WCD forces a flush for every WRITE?
> 
> Nope. Setting WC either way has no implication on the response to a flush 
> request. We flush the cache in response to a request to do so,
> unless one sets the unsupported zfs_nocacheflush, if set then the pool is at 
> risk
>> 
>> How about a way to enable WCE on the target, yet still perform cache flush 
>> when the initiator requests one, like a real SCSI target should do, or is 
>> that just not possible with ZVOLs today?
>> 
> I hope I've cleared that up. Not sure what I said that implicated otherwise.
> 
> But if you honor the flush write cache request all the way to the disk 
> device, then 1, 2 or 3 layers of ZFS won't make a dent in the performance of 
> NFS tar x. 
> Only a device accepting low latency writes which survives power outtage can 
> do that.

Understood, and thanks for the clarification. If the NFS synchronicity has too 
much of a negative impact, then that can be alleviated with an SSD or NVRAM 
slog device on the head server.
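For what it's worth, on Solaris that "honor the flush write cache request all
the way to the disk device" step ultimately becomes the dkio(7I) flush ioctl; a
minimal sketch of issuing it directly (the device path is only an example, and
this needs sufficient privileges on the raw device):

/* Explicitly ask a disk to flush its write cache via DKIOCFLUSHWRITECACHE.
 * Passing NULL instead of a dk_callback makes the call synchronous. */
#include <sys/types.h>
#include <sys/dkio.h>
#include <fcntl.h>
#include <stropts.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
    int fd = open("/dev/rdsk/c0t0d0s0", O_RDWR);  /* example device path */

    if (fd < 0) {
        perror("open");
        return (1);
    }
    if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) < 0)
        perror("DKIOCFLUSHWRITECACHE");
    (void) close(fd);
    return (0);
}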

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-05 Thread Ross Walker
On Aug 5, 2010, at 11:10 AM, Roch  wrote:

> 
> Ross Walker writes:
>> On Aug 4, 2010, at 12:04 PM, Roch  wrote:
>> 
>>> 
>>> Ross Walker writes:
>>>> On Aug 4, 2010, at 9:20 AM, Roch  wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> Ross Asks: 
>>>>> So on that note, ZFS should disable the disks' write cache,
>>>>> not enable them  despite ZFS's COW properties because it
>>>>> should be resilient. 
>>>>> 
>>>>> No, because ZFS builds resiliency on top of unreliable parts. it's able 
>>>>> to deal
>>>>> with contained failures (lost state) of the disk write cache. 
>>>>> 
>>>>> It can then export LUNS that have WC enabled or
>>>>> disabled. But if we enable the WC on the exported LUNS, then
>>>>> the consumer of these LUNS must be able to say the same.
>>>>> The discussion at that level then needs to focus on failure groups.
>>>>> 
>>>>> 
>>>>> Ross also Said :
>>>>> I asked this question earlier, but got no answer: while an
>>>>> iSCSI target is presented WCE does it respect the flush
>>>>> command? 
>>>>> 
>>>>> Yes. I would like to say "obviously" but it's been anything
>>>>> but.
>>>> 
>>>> Sorry to probe further, but can you expand on but...
>>>> 
>>>> Just if we had a bunch of zvols exported via iSCSI to another Solaris
>>>> box which used them to form another zpool and had WCE turned on would
>>>> it be reliable? 
>>>> 
>>> 
>>> Nope. That's because all the iSCSI are in the same fault
>>> domain as they share a unified back-end cache. What works,
>>> in principle, is mirroring SCSI channels hosted on 
>>> different storage controllers (or N SCSI channels on N
>>> controller in a raid group).
>>> 
>>> Which is why keeping the WC set to the default, is really
>>> better in general.
>> 
>> Well I was actually talking about two backend Solaris storage servers 
>> serving up storage over iSCSI to a front-end Solaris server serving ZFS over 
>> NFS, so I have redundancy there, but want the storage to be performant, so I 
>> want the iSCSI to have WCE, yet I want it to be reliable and have it honor 
>> cache flush requests from the front-end NFS server.
>> 
>> Does that make sense? Is it possible?
>> 
> 
> Well in response to a commit (say after a file creation) then the
> front end server will end up sending flush write caches on
> both side of the iscsi mirror which will reach the backend server
> which will flush disk write caches. This will all work but
> probably  not unleash performance the way you would like it
> to.



> If you setup to have the backend server not honor the
> backend disk flush write caches, then the 2 backend pools become at
> risk of corruption, mostly because the ordering of IOs
> around the ueberblock updates. If you have faith, then you
> could consider that you won't hit 2 backend pool corruption
> together and rely on the frontend resilvering to rebuild the
> corrupted backend.

So you are saying setting WCE disables cache flush on the target and setting 
WCD forces a flush for every WRITE?

How about a way to enable WCE on the target, yet still perform cache flush when 
the initiator requests one, like a real SCSI target should do, or is that just 
not possible with ZVOLs today?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-04 Thread Ross Walker
On Aug 4, 2010, at 12:04 PM, Roch  wrote:

> 
> Ross Walker writes:
>> On Aug 4, 2010, at 9:20 AM, Roch  wrote:
>> 
>>> 
>>> 
>>> Ross Asks: 
>>> So on that note, ZFS should disable the disks' write cache,
>>> not enable them  despite ZFS's COW properties because it
>>> should be resilient. 
>>> 
>>> No, because ZFS builds resiliency on top of unreliable parts. it's able to 
>>> deal
>>> with contained failures (lost state) of the disk write cache. 
>>> 
>>> It can then export LUNS that have WC enabled or
>>> disabled. But if we enable the WC on the exported LUNS, then
>>> the consumer of these LUNS must be able to say the same.
>>> The discussion at that level then needs to focus on failure groups.
>>> 
>>> 
>>> Ross also Said :
>>> I asked this question earlier, but got no answer: while an
>>> iSCSI target is presented WCE does it respect the flush
>>> command? 
>>> 
>>> Yes. I would like to say "obviously" but it's been anything
>>> but.
>> 
>> Sorry to probe further, but can you expand on but...
>> 
>> Just if we had a bunch of zvols exported via iSCSI to another Solaris
>> box which used them to form another zpool and had WCE turned on would
>> it be reliable? 
>> 
> 
> Nope. That's because all the iSCSI are in the same fault
> domain as they share a unified back-end cache. What works,
> in principle, is mirroring SCSI channels hosted on 
> different storage controllers (or N SCSI channels on N
> controller in a raid group).
> 
> Which is why keeping the WC set to the default, is really
> better in general.

Well I was actually talking about two backend Solaris storage servers serving 
up storage over iSCSI to a front-end Solaris server serving ZFS over NFS, so I 
have redundancy there, but want the storage to be performant, so I want the 
iSCSI to have WCE, yet I want it to be reliable and have it honor cache flush 
requests from the front-end NFS server.

Does that make sense? Is it possible?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-04 Thread Ross Walker
On Aug 4, 2010, at 9:20 AM, Roch  wrote:

> 
> 
>  Ross Asks: 
>  So on that note, ZFS should disable the disks' write cache,
>  not enable them  despite ZFS's COW properties because it
>  should be resilient. 
> 
> No, because ZFS builds resiliency on top of unreliable parts. it's able to 
> deal
> with contained failures (lost state) of the disk write cache. 
> 
> It can then export LUNS that have WC enabled or
> disabled. But if we enable the WC on the exported LUNS, then
> the consumer of these LUNS must be able to say the same.
> The discussion at that level then needs to focus on failure groups.
> 
> 
>  Ross also Said :
>  I asked this question earlier, but got no answer: while an
>  iSCSI target is presented WCE does it respect the flush
>  command? 
> 
> Yes. I would like to say "obviously" but it's been anything
> but.

Sorry to probe further, but can you expand on but...

Just if we had a bunch of zvols exported via iSCSI to another Solaris box which 
used them to form another zpool and had WCE turned on would it be reliable?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-04 Thread Ross Walker
On Aug 4, 2010, at 3:52 AM, Roch  wrote:

> 
> Ross Walker writes:
> 
>> On Aug 3, 2010, at 12:13 PM, Roch Bourbonnais  
>> wrote:
>> 
>>> 
>>> Le 27 mai 2010 à 07:03, Brent Jones a écrit :
>>> 
>>>> On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
>>>>  wrote:
>>>>> I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
>>>>> 
>>>>> sh-4.0# zfs create rpool/iscsi
>>>>> sh-4.0# zfs set shareiscsi=on rpool/iscsi
>>>>> sh-4.0# zfs create -s -V 10g rpool/iscsi/test
>>>>> 
>>>>> The underlying zpool is a mirror of two SATA drives. I'm connecting from 
>>>>> a Mac client with global SAN initiator software, connected via Gigabit 
>>>>> LAN. It connects fine, and I've initialiased a mac format volume on that 
>>>>> iScsi volume.
>>>>> 
>>>>> Performance, however, is terribly slow, about 10 times slower than an SMB 
>>>>> share on the same pool. I expected it would be very similar, if not 
>>>>> faster than SMB.
>>>>> 
>>>>> Here's my test results copying 3GB data:
>>>>> 
>>>>> iScsi:  44m01s  1.185MB/s
>>>>> SMB share:  4m27s  11.73MB/s
>>>>> 
>>>>> Reading (the same 3GB) is also worse than SMB, but only by a factor of 
>>>>> about 3:
>>>>> 
>>>>> iScsi:  4m36s  11.34MB/s
>>>>> SMB share:  1m45s  29.81MB/s
>>>>> 
>>> 
>>>  
>>> 
>>> Not unexpected. Filesystems have readahead code to prefetch enough to cover 
>>> the latency of the read request. iSCSI only responds to the request.
>>> Put a filesystem on top of iscsi and try again.
>>> 
>>> For writes, iSCSI is synchronous and SMB is not. 
>> 
>> It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
>> simply SCSI over IP.
>> 
> 
> Hey Ross,
> 
> Nothing to do with ZFS here, but you're right to point out
> that iSCSI is neither. It was just that in the context of
> this test (and 99+% of iSCSI usage) it will be. SMB is
> not. Thus a large discrepancy on the write test.
> 
> Resilient storage, by default, should expose iSCSI channels
> with write caches disabled.


So on that note, ZFS should disable the disks' write cache, not enable them  
despite ZFS's COW properties because it should be resilient.


>> It is the application using the iSCSI protocol that
> determines whether it is synchronous, issue a flush after
> write, or asynchronous, wait until target flushes.
>> 
> 
> True.
> 
>> I think the ZFS developers didn't quite understand that
> and wanted strict guidelines like NFS has, but iSCSI doesn't
> have those, it is a lower level protocol than NFS is, so
> they forced guidelines on it and violated the standard. 
>> 
>> -Ross
>> 
> 
> Not True. 
> 
> 
> ZFS exposes LUNS (or ZVOL) and while at first we didn't support
> DKIOCSETWCE, we now do. So a ZFS LUN can be whatever you
> need it to be.

I asked this question earlier, but got no answer: when an iSCSI target is 
presented with WCE, does it respect the flush command?

> Now in the context of iSCSI luns hosted by a resilient
> storage system, enabling write caches is to be used only in
> very specific circumstances. The situation is not symmetrical
> with WCE in disks of a JBOD since that can be setup with
> enough redundancy to deal with potential data loss. When
> using a resilient storage, you need to trust the storage for
> persistence of SCSI commands and building a resilient system
> on top of write cache enabled SCSI channels is not trivial.

Not true: advertise WCE, support flush and tagged command queuing, and the 
initiator will be able to use the resilient storage appropriate for its needs.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-03 Thread Ross Walker
On Aug 3, 2010, at 5:56 PM, Robert Milkowski  wrote:

> On 03/08/2010 22:49, Ross Walker wrote:
>> On Aug 3, 2010, at 12:13 PM, Roch Bourbonnais  
>> wrote:
>> 
>>   
>>> Le 27 mai 2010 à 07:03, Brent Jones a écrit :
>>> 
>>> 
>>>> On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
>>>>   wrote:
>>>>   
>>>>> I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
>>>>> 
>>>>> sh-4.0# zfs create rpool/iscsi
>>>>> sh-4.0# zfs set shareiscsi=on rpool/iscsi
>>>>> sh-4.0# zfs create -s -V 10g rpool/iscsi/test
>>>>> 
>>>>> The underlying zpool is a mirror of two SATA drives. I'm connecting from 
>>>>> a Mac client with global SAN initiator software, connected via Gigabit 
>>>>> LAN. It connects fine, and I've initialiased a mac format volume on that 
>>>>> iScsi volume.
>>>>> 
>>>>> Performance, however, is terribly slow, about 10 times slower than an SMB 
>>>>> share on the same pool. I expected it would be very similar, if not 
>>>>> faster than SMB.
>>>>> 
>>>>> Here's my test results copying 3GB data:
>>>>> 
>>>>> iScsi:  44m01s  1.185MB/s
>>>>> SMB share:  4m27s  11.73MB/s
>>>>> 
>>>>> Reading (the same 3GB) is also worse than SMB, but only by a factor of 
>>>>> about 3:
>>>>> 
>>>>> iScsi:  4m36s  11.34MB/s
>>>>> SMB share:  1m45s  29.81MB/s
>>>>> 
>>>>> 
>>> 
>>> 
>>> Not unexpected. Filesystems have readahead code to prefetch enough to cover 
>>> the latency of the read request. iSCSI only responds to the request.
>>> Put a filesystem on top of iscsi and try again.
>>> 
>>> For writes, iSCSI is synchronous and SMB is not.
>>> 
>> It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
>> simply SCSI over IP.
>> 
>> It is the application using the iSCSI protocol that determines whether it is 
>> synchronous, issue a flush after write, or asynchronous, wait until target 
>> flushes.
>> 
>> I think the ZFS developers didn't quite understand that and wanted strict 
>> guidelines like NFS has, but iSCSI doesn't have those, it is a lower level 
>> protocol than NFS is, so they forced guidelines on it and violated the 
>> standard.
>> 
>>   
> Nothing has been violated here.
> Look for WCE flag in COMSTAR where you can control how a given zvol  should 
> behave (synchronous or asynchronous). Additionally in recent build you have 
> zfs set sync={disabled|default|always} which also works with zvols.
> 
> So you do have a control over how it is supposed to behave and to make it 
> nice it is even on per zvol basis.
> It is just that the default is synchronous.

I forgot to ask: if the ZVOL is set async with WCE, will it still honor a flush 
command from the initiator and flush those TXGs held for the ZVOL?

-Ross
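
For reference, a rough sketch of the knobs being discussed; pool/zvol names are
invented, and the COMSTAR property name is from memory, so double-check it
against stmfadm(1M):

  # per-dataset synchronous behaviour (values are standard, always or disabled)
  zfs get sync tank/vol1
  zfs set sync=always tank/vol1

  # COMSTAR side: list the LU properties, then toggle write-back cache via wcd
  stmfadm list-lu -v
  stmfadm modify-lu -p wcd=false 600144f0...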

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-03 Thread Ross Walker
On Aug 3, 2010, at 5:56 PM, Robert Milkowski  wrote:

> On 03/08/2010 22:49, Ross Walker wrote:
>> On Aug 3, 2010, at 12:13 PM, Roch Bourbonnais  
>> wrote:
>> 
>>   
>>> Le 27 mai 2010 à 07:03, Brent Jones a écrit :
>>> 
>>> 
>>>> On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
>>>>   wrote:
>>>>   
>>>>> I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
>>>>> 
>>>>> sh-4.0# zfs create rpool/iscsi
>>>>> sh-4.0# zfs set shareiscsi=on rpool/iscsi
>>>>> sh-4.0# zfs create -s -V 10g rpool/iscsi/test
>>>>> 
>>>>> The underlying zpool is a mirror of two SATA drives. I'm connecting from 
>>>>> a Mac client with global SAN initiator software, connected via Gigabit 
>>>>> LAN. It connects fine, and I've initialiased a mac format volume on that 
>>>>> iScsi volume.
>>>>> 
>>>>> Performance, however, is terribly slow, about 10 times slower than an SMB 
>>>>> share on the same pool. I expected it would be very similar, if not 
>>>>> faster than SMB.
>>>>> 
>>>>> Here's my test results copying 3GB data:
>>>>> 
>>>>> iScsi:  44m01s  1.185MB/s
>>>>> SMB share:  4m27s  11.73MB/s
>>>>> 
>>>>> Reading (the same 3GB) is also worse than SMB, but only by a factor of 
>>>>> about 3:
>>>>> 
>>>>> iScsi:  4m36s  11.34MB/s
>>>>> SMB share:  1m45s  29.81MB/s
>>>>> 
>>>>> 
>>> 
>>> 
>>> Not unexpected. Filesystems have readahead code to prefetch enough to cover 
>>> the latency of the read request. iSCSI only responds to the request.
>>> Put a filesystem on top of iscsi and try again.
>>> 
>>> For writes, iSCSI is synchronous and SMB is not.
>>> 
>> It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
>> simply SCSI over IP.
>> 
>> It is the application using the iSCSI protocol that determines whether it is 
>> synchronous, issue a flush after write, or asynchronous, wait until target 
>> flushes.
>> 
>> I think the ZFS developers didn't quite understand that and wanted strict 
>> guidelines like NFS has, but iSCSI doesn't have those, it is a lower level 
>> protocol than NFS is, so they forced guidelines on it and violated the 
>> standard.
>> 
>>   
> Nothing has been violated here.
> Look for WCE flag in COMSTAR where you can control how a given zvol  should 
> behave (synchronous or asynchronous). Additionally in recent build you have 
> zfs set sync={disabled|default|always} which also works with zvols.
> 
> So you do have a control over how it is supposed to behave and to make it 
> nice it is even on per zvol basis.
> It is just that the default is synchronous.

Ah, ok, my experience has been with Solaris and the iscsitgt which, correct me 
if I am wrong, is still synchronous only.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] iScsi slow

2010-08-03 Thread Ross Walker
On Aug 3, 2010, at 12:13 PM, Roch Bourbonnais  wrote:

> 
> Le 27 mai 2010 à 07:03, Brent Jones a écrit :
> 
>> On Wed, May 26, 2010 at 5:08 AM, Matt Connolly
>>  wrote:
>>> I've set up an iScsi volume on OpenSolaris (snv_134) with these commands:
>>> 
>>> sh-4.0# zfs create rpool/iscsi
>>> sh-4.0# zfs set shareiscsi=on rpool/iscsi
>>> sh-4.0# zfs create -s -V 10g rpool/iscsi/test
>>> 
>>> The underlying zpool is a mirror of two SATA drives. I'm connecting from a 
>>> Mac client with global SAN initiator software, connected via Gigabit LAN. 
>>> It connects fine, and I've initialiased a mac format volume on that iScsi 
>>> volume.
>>> 
>>> Performance, however, is terribly slow, about 10 times slower than an SMB 
>>> share on the same pool. I expected it would be very similar, if not faster 
>>> than SMB.
>>> 
>>> Here's my test results copying 3GB data:
>>> 
>>> iScsi:  44m01s  1.185MB/s
>>> SMB share:  4m27s  11.73MB/s
>>> 
>>> Reading (the same 3GB) is also worse than SMB, but only by a factor of 
>>> about 3:
>>> 
>>> iScsi:  4m36s  11.34MB/s
>>> SMB share:  1m45s  29.81MB/s
>>> 
> 
>  
> 
> Not unexpected. Filesystems have readahead code to prefetch enough to cover 
> the latency of the read request. iSCSI only responds to the request.
> Put a filesystem on top of iscsi and try again.
> 
> For writes, iSCSI is synchronous and SMB is not. 

It may be with ZFS, but iSCSI is neither synchronous nor asynchronous; it is 
simply SCSI over IP.

It is the application using the iSCSI protocol that determines whether it is 
synchronous, issue a flush after write, or asynchronous, wait until target 
flushes.

I think the ZFS developers didn't quite understand that and wanted strict 
guidelines like NFS has, but iSCSI doesn't have those, it is a lower level 
protocol than NFS is, so they forced guidelines on it and violated the standard.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Mirrored raidz

2010-07-26 Thread Ross Walker
On Jul 26, 2010, at 2:51 PM, Dav Banks  wrote:

> I wanted to test it as a backup solution. Maybe that's crazy in itself but I 
> want to try it.
> 
> Basically, once a week detach the 'backup' pool from the mirror, replace the 
> drives, add the new raidz to the mirror and let it resilver and sit for a 
> week.

If that's the case why not create a second pool called 'backup' and 'zfs send' 
periodically to the backup pool?

-Ross
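
A rough sketch of that approach, with invented pool and snapshot names:

  # one-time full replication into the second pool
  zfs snapshot -r tank@week1
  zfs send -R tank@week1 | zfs recv -Fd backup

  # following weeks only send the changes
  zfs snapshot -r tank@week2
  zfs send -R -i tank@week1 tank@week2 | zfs recv -Fd backup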

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision

2010-07-25 Thread Ross Walker
On Jul 23, 2010, at 10:14 PM, Edward Ned Harvey  wrote:

>> From: Arne Jansen [mailto:sensi...@gmx.net]
>>> 
>>> Can anyone else confirm or deny the correctness of this statement?
>> 
>> As I understand it that's the whole point of raidz. Each block is its
>> own
>> stripe. 
> 
> Nope, that doesn't count for confirmation.  It is at least theoretically
> possible to implement raidz using techniques that would (a) unintelligently
> stripe all blocks (even small ones) across multiple disks, thus hurting
> performance on small operations, or (b) implement raidz such that striping
> of blocks behaves differently for small operations (plus parity).  So the
> confirmation I'm looking for would be somebody who knows the actual source
> code, and the actual architecture that was chosen to implement raidz in this
> case.

Maybe this helps?

http://blogs.sun.com/ahl/entry/what_is_raid_z

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] File cloning

2010-07-22 Thread Ross Walker
On Jul 22, 2010, at 2:41 PM, Miles Nordin  wrote:

>>>>>> "sw" == Saxon, Will  writes:
> 
>sw> 'clone' vs. a 'copy' would be very easy since we have
>sw> deduplication now
> 
> dedup doesn't replace the snapshot/clone feature for the
> NFS-share-full-of-vmdk use case because there's no equivalent of 
> 'zfs rollback'
> 
> 
> I'm tempted to say, ``vmware needs to remove their silly limit'' but
> there are takes-three-hours-to-boot problems with thousands of Solaris
> NFS exports so maybe their limit is not so silly after all.
> 
> What is the scenario, you have?  Is it something like 40 hosts with
> live migration among them, and 40 guests on each host?  so you need
> 1600 filesystems mounted even though only 40 are actually in use?
> 
> 'zfs set sharenfs=absorb ' would be my favorite answer, but
> lots of people have asked for such a feature, and answer is always
> ``wait for mirror mounts'' (which BTW are actually just-works for me
> on very-recent linux, even with plain 'mount host:/fs /fs', without
> saying 'mount -t nfs4', in spite of my earlier rant complaining they
> are not real).  Of course NFSv4 features are no help to vmware, but
> hypothetically I guess mirror-mounting would work if vmware supported
> it, so long as they were careful not to provoke the mounting of guests
> not in use.  The ``implicit automounter'' on which the mirror mount
> feature's based would avoid the boot delay of mounting 1600
> filesystems.
> 
> and BTW I've not been able to get the Real Automounter in Linux to do
> what this implicit one already can with subtrees.  Why is it so hard
> to write a working automounter?
> 
> The other thing I've never understood is, if you 'zfs rollback' an
> NFS-exported filesystem, what happens to all the NFS clients?  It
> seems like this would cause much worse corruption than the worry when
> people give fire-and-brimstone speeches about never disabling
> zil-writing while using the NFS server.  but it seems to mostly work
> anyway when I do this, so I'm probably confused about something.

To add to Miles' comments, what you are trying to accomplish isn't possible via 
NFS to ESX, but could be accomplished with iSCSI zvols, I believe. If I 
understand correctly, you can thin-provision a zvol and clone it as many times as you wish 
and present all the clones over iSCSI. Haven't tried it myself, but would be 
worth testing.

-Ross
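
A rough sketch of that idea, with invented names (the COMSTAR/iSCSI sharing
step is left out):

  # thin-provisioned golden zvol, installed once and then frozen
  zfs create -s -V 20g tank/vm-gold
  zfs snapshot tank/vm-gold@template

  # each clone initially shares all of its blocks with the template
  zfs clone tank/vm-gold@template tank/vm01
  zfs clone tank/vm-gold@template tank/vm02

  # and each clone can be snapshotted and rolled back on its own
  zfs snapshot tank/vm01@clean
  zfs rollback tank/vm01@clean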

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs raidz1 and traditional raid 5 perfomrance comparision

2010-07-20 Thread Ross Walker
On Jul 20, 2010, at 6:12 AM, v  wrote:

> Hi,
> for ZFS raidz1, I know that for random IO the IOPS of a raidz1 vdev equal one 
> physical disk's IOPS, since raidz1 is like RAID5; so does RAID5 have the same 
> performance as raidz1, i.e. random IOPS equal to one physical disk's IOPS?

On reads, no, any part of the stripe width can be read without reading the 
whole stripe width, giving performance equal to raid0 of non-parity disks.

On writes it could be worse than raidz1, depending on whether whole stripe 
widths are being written (same performance) or partial stripe widths are being 
written (worse performance). If it's a partial stripe width then the remaining 
data needs to be read off disk which doubles the IOs.

-Ross
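
As a rough worked example of that partial-stripe (read-modify-write) penalty,
assuming ~80 random IOPS per 7200rpm SATA disk (ballpark figures only):

  5-disk RAID5, small random writes:
    each write = read old data + read old parity + write data + write parity = 4 disk IOs
    aggregate random write IOPS ~= (5 x 80) / 4 = ~100
  full-stripe writes avoid the reads, so they behave like the raidz1 case:
    one stripe per write, roughly the IOPS of a single disk (~80)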

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Need ZFS master!

2010-07-13 Thread Ross Walker

The whole disk layout should be copied from disk 1 to 2, then the slice on disk 
2 that corresponds to the slice on disk 1 should be attached to the rpool which 
forms an rpool mirror (attached not added).

Then you need to add the grub bootloader to disk 2.

When it finishes resilvering then you have an rpool mirror.

-Ross
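
A minimal sketch of those steps on x86 (disk names are only examples; use the
names format(1M) reports):

  # copy the partition table from the first disk to the second
  prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

  # attach (not add) the matching slice to form the mirror
  zpool attach rpool c0t0d0s0 c0t1d0s0

  # make the second disk bootable
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0

  # watch the resilver complete
  zpool status rpool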



On Jul 12, 2010, at 6:30 PM, "Beau J. Bechdol"  wrote:

> I do apologise, but I am completely lost here. Maybe I am just not 
> understanding. Are you saying that a slice has to be created on the second 
> drive before it can be added to the pool?
> 
> Thanks
> 
> On Mon, Jul 12, 2010 at 4:22 PM, Cindy Swearingen 
>  wrote:
> Hi John,
> 
> Follow the steps in this section:
> 
> http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide
> 
> Replacing/Relabeling the Root Pool Disk
> 
> If the disk is correctly labeled with an SMI label, then you can skip
> down to steps 5-8 of this procedure.
> 
> Thanks,
> 
> Cindy
> 
> 
> On 07/12/10 16:06, john wrote:
> Hello all. I am new... very new to OpenSolaris and I am having an issue and 
> have no idea what is going wrong. I have 5 drives in my machine, all 
> 500GB. I installed OpenSolaris on the first drive and rebooted. Now what I 
> want to do is add a second drive so they are mirrored. How does one do this?! 
> I am getting nowhere and need some help.
> 
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Encryption?

2010-07-11 Thread Ross Walker
On Jul 11, 2010, at 5:11 PM, Freddie Cash  wrote:

> ZFS-FUSE is horribly unstable, although that's more an indication of
> the stability of the storage stack on Linux.

Not really, more an indication of the pseudo-VFS layer implemented in fuse. 
Remember fuse provides its own VFS API, separate from the Linux VFS API, so file 
systems can be implemented in user space. Fuse needs a little more work to 
handle ZFS as a file system.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Should i enable Write-Cache ?

2010-07-10 Thread Ross Walker
On Jul 10, 2010, at 5:46 AM, Erik Trimble  wrote:

> On 7/10/2010 1:14 AM, Graham McArdle wrote:
>>> Instead, create "Single Disk" arrays for each disk.
>>> 
>> I have a question related to this but with a different controller: If I'm 
>> using a RAID controller to provide non-RAID single-disk volumes, do I still 
>> lose out on the hardware-independence advantage of software RAID that I 
>> would get from a basic non-RAID HBA?
>> In other words, if the controller dies, would I still need an identical 
>> controller to recognise the formatting of 'single disk volumes', or is more 
>> 'standardised' than the typical proprietary implementations of hardware RAID 
>> that makes it impossible to switch controllers on  hardware RAID?
>>   
> 
> Yep. You're screwed.  :-)
> 
> single-disk volumes are still RAID volumes to the controller, so they'll have 
> the extra controller-specific bits on them. You'll need an identical 
> controller (or, possibly, just one from the same OEM) to replace a broken 
> controller with.
> 
> Even in JBOD mode, I wouldn't trust a RAID controller to not write 
> proprietary bits onto the disks.  It's one of the big reasons to chose a HBA 
> and not a RAID controller.

Not always: with my Dell PERC and the drives set up as single-disk RAID0, I was 
able to successfully import the pool on a regular LSI SAS (non-RAID) controller.

The only change the PERC made was to coerce the disk size down by 128MB, which left 
128MB unused at the end of the drive, meaning replacement disks would be 
slightly bigger.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 10:42 AM, Robert Milkowski  wrote:

> On 24/06/2010 14:32, Ross Walker wrote:
>> On Jun 24, 2010, at 5:40 AM, Robert Milkowski  wrote:
>> 
>>   
>>> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> 
>>>>> Does it mean that for dataset used for databases and similar environments 
>>>>> where basically all blocks have fixed size and there is no other data all 
>>>>> parity information will end-up on one (z1) or two (z2) specific disks?
>>>>> 
>>>>> 
>>>> No. There are always smaller writes to metadata that will distribute 
>>>> parity. What is the total width of your raidz1 stripe?
>>>> 
>>>> 
>>>>   
>>> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.
>>> 
>> From what I gather each 16KB record (plus parity) is spread across the raidz 
>> disks. This causes the total random IOPS (write AND read) of the raidz to be 
>> that of the slowest disk in the raidz.
>> 
>> Raidz is definitely made for sequential IO patterns not random. To get good 
>> random IO with raidz you need a zpool with X raidz vdevs where X = desired 
>> IOPS/IOPS of single drive.
>>   
> 
> I know that and it wasn't mine question.

Sorry, for the OP...


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-24 Thread Ross Walker
On Jun 24, 2010, at 5:40 AM, Robert Milkowski  wrote:

> On 23/06/2010 18:50, Adam Leventhal wrote:
>>> Does it mean that for dataset used for databases and similar environments 
>>> where basically all blocks have fixed size and there is no other data all 
>>> parity information will end-up on one (z1) or two (z2) specific disks?
>>> 
>> No. There are always smaller writes to metadata that will distribute parity. 
>> What is the total width of your raidz1 stripe?
>> 
>>   
> 
> 4x disks, 16KB recordsize, 128GB file, random read with 16KB block.

From what I gather each 16KB record (plus parity) is spread across the raidz 
disks. This causes the total random IOPS (write AND read) of the raidz to be 
that of the slowest disk in the raidz.

Raidz is definitely made for sequential IO patterns not random. To get good 
random IO with raidz you need a zpool with X raidz vdevs where X = desired 
IOPS/IOPS of single drive.

-Ross
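
As a rough illustration of that rule of thumb, assuming ~80 random IOPS per
7200rpm SATA drive (ballpark only):

  target random IOPS     ~1600
  IOPS per raidz vdev    ~= IOPS of one member disk ~= 80
  raidz vdevs needed     X = 1600 / 80 = 20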


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] raid-z - not even iops distribution

2010-06-23 Thread Ross Walker
On Jun 23, 2010, at 1:48 PM, Robert Milkowski  wrote:

> 
> 128GB.
> 
> Does it mean that for dataset used for databases and similar environments 
> where basically all blocks have fixed size and there is no other data all 
> parity information will end-up on one (z1) or two (z2) specific disks?

What's the record size on those datasets?

8k?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable

2010-06-22 Thread Gordon Ross
lstat64("/tank/ws/fubar", 0x080465D0)   Err#89 ENOSYS
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ls says: /tank/ws/fubar: Operation not applicable

2010-06-22 Thread Gordon Ross
Anyone know why my ZFS filesystem might suddenly start
giving me an error when I try to "ls -d" the top of it?
i.e.: ls -d /tank/ws/fubar
/tank/ws/fubar: Operation not applicable

zpool status says all is well.  I've tried snv_139 and snv_137
(my latest and previous installs).  It's an amd64 box.
Both OS versions show the same problem.

Do I need to run a scrub?  (will take days...)

Other ideas?

Thanks,
Gordon
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SLOG striping? (Bob Friesenhahn)

2010-06-22 Thread Ross Walker
On Jun 22, 2010, at 8:40 AM, Jeff Bacon  wrote:

>> The term 'stripe' has been so outrageously severely abused in this
>> forum that it is impossible to know what someone is talking about when
>> they use the term.  Seemingly intelligent people continue to use wrong
>> terminology because they think that protracting the confusion somehow
>> helps new users.  We are left with no useful definition of
>> 'striping'.
> 
> "There is no striping." 
> (I'm sorry, I couldn't resist.)

"There is no spoon"


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup... still in beta status

2010-06-16 Thread Ross Walker
On Jun 16, 2010, at 9:02 AM, Carlos Varela   
wrote:




Does the machine respond to ping?


Yes



If there is a gui does the mouse pointer move?



There is no GUI (nexentastor)


Does the keyboard numlock key respond at all ?


Yes



I just find it very hard to believe that such a
situation could exist as I
have done some *abusive* tests on a SunFire X4100
with Sun 6120 fibre
arrays ( in HA config ) and I could not get it to
become a warm brick like
you describe.

How many processors does your machine have ?


Full data:

Motherboard: Asus m2n68-CM
Initial memory: 3 Gb DDR2 ECC
Actual memory: 8 GB DDR2 800
CPU: Athlon X2 5200
HD: 2 Seagate 1 WD (1,5 TB each)
Pools: 1 RAIDZ pool
datasets: 5 (ftp: 30 GB, varios: 170 GB, multimedia:
1,7TB, segur: 80 Gb prueba: 50 Mb)
ZFS ver: 22

The pool was created with EON-NAS 0.6 ... dedupe on,


Similar situation but with OpenSolaris b133. Can ping the machine but  
it's been frozen for about 24 hours. I was deleting 25GB of dedup data. If I  
move 1-2 GB of data then the machine stops responding for 1 hour but  
comes back after that. I have munin installed and the graphs stop  
updating during that time and you can not use ssh. I agree that  
memory seems to not be enough as I see a lot of 20kb reads before it  
stops responding (reading DDT entries I guess). Maybe dedup has to  
be redesigned for low memory machines (a batch process instead of  
inline ?)
This is my home machine so I can wait but businesses would not be so  
happy if the machine becomes so unresponsive that you can not access  
your data.


The unresponsiveness that people report when deleting large dedup zfs  
objects is due to ARC memory pressure and long service times accessing  
other zfs objects while it is busy resolving the deleted object's  
dedup references.


Set a max size the ARC can grow to, saving room for system services,  
get an SSD drive to act as an L2ARC, run a scrub first to prime the  
L2ARC (actually probably better to run something targeting just those  
datasets in question), then delete the dedup objects, smallest to  
largest.


-Ross
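
A rough sketch of those mitigations; the ARC cap and device name are arbitrary
examples:

  # cap the ARC (here ~6GB on an 8GB box): add to /etc/system and reboot
  set zfs:zfs_arc_max = 0x180000000

  # add an SSD as L2ARC for the pool
  zpool add tank cache c2t0d0

  # prime the caches (including the DDT blocks) before attempting the big delete
  zpool scrub tank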

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Moved disks to new controller - cannot import pool even after moving ba

2010-06-14 Thread Ross Walker
On Jun 13, 2010, at 2:14 PM, Jan Hellevik  
 wrote:


Well, for me it was a cure. Nothing else I tried got the pool back.  
As far as I can tell, the way to get it back should be to use  
symlinks to the fdisk partitions on my SSD, but that did not work  
for me. Using -V got the pool back. What is wrong with that?


If you have a better suggestion as to how I should have recovered my  
pool I am certainly interested in hearing it.


I would take this time to offline one disk at a time, wipe all its  
tables/labels and re-attach it as an EFI whole disk to avoid hitting  
this same problem again in the future.


-Ross
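
For one side of a mirror that could look roughly like this (device names
invented; giving attach the bare disk name, with no slice suffix, is what gets
it an EFI whole-disk label):

  # drop one half of the mirror, using the device name exactly as zpool status shows it
  zpool detach tank c1t1d0s0

  # wipe the old fdisk/VTOC labels (e.g. relabel with format -e), then re-attach
  # the whole disk and let ZFS label it itself
  zpool attach tank c1t0d0s0 c1t1d0

  # wait for the resilver before repeating on the other disk
  zpool status tank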

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Please trim posts

2010-06-11 Thread Ross Walker
On Jun 11, 2010, at 2:07 AM, Dave Koelmeyer   
wrote:


I trimmed, and then got complained at by a mailing list user that  
the context of what I was replying to was missing. Can't win :P


If at a minimum one trims the disclaimers, footers and signatures,  
that's better than nothing.


On long threads with inlined comments, think about keeping the  
previous 2 comments before or trimming anything 3 levels of indents or  
more.


Of course that's just my general rule of thumb and different  
discussions require different quotings, but just being mindful is  
often enough.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20

2010-06-10 Thread Ross Walker
On Jun 10, 2010, at 5:54 PM, Richard Elling   
wrote:



On Jun 10, 2010, at 1:24 PM, Arne Jansen wrote:


Andrey Kuzmin wrote:

Well, I'm more accustomed to  "sequential vs. random", but YMMW.
As to 67000 512 byte writes (this sounds suspiciously close to  
32Mb fitting into cache), did you have write-back enabled?


It's a sustained number, so it shouldn't matter.


That is only 34 MB/sec.  The disk can do better for sequential writes.


Not when doing sector-sized IO.

Besides, this was a max IOPS number, not a max throughput number. If it  
were, the OP might have used a 1M bs or larger instead.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] General help with understanding ZFS performance bottlenecks

2010-06-09 Thread Ross Walker

On Jun 8, 2010, at 1:33 PM, besson3c  wrote:



Sure! The pool consists of 6 SATA drives configured as RAID-Z. There  
are no special read or write cache drives. This pool is shared to  
several VMs via NFS, these VMs manage email, web, and a Quickbooks  
server running on FreeBSD, Linux, and Windows.


Ok, well RAIDZ is going to be a problem here, because each record is  
spread across the whole pool (each read/write will hit all drives in  
the pool), which has the side effect of making the total number of IOPS  
equal to the IOPS of the slowest drive in the pool.


Since these are SATA let's say the total number of IOPS will be 80  
which is not good enough for what is a mostly random workload.


If it were a 6 drive pool of mirrors then it would be able to handle  
240 IOPS write and up to 480 IOPS read (can read from either side of  
mirror).


I would probably rethink the setup.

A ZIL will not buy you much here, and if your VM software is like VMware  
then each write over NFS will be marked FSYNC, which will force the  
lack of IOPS to the surface.


-Ross
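
For comparison, the same six disks laid out as three 2-way mirrors (device
names invented):

  # three mirror vdevs: roughly 3x one disk's random write IOPS, up to 6x on reads
  zpool create tank mirror c1t0d0 c1t1d0 \
                    mirror c1t2d0 c1t3d0 \
                    mirror c1t4d0 c1t5d0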

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Homegrown Hybrid Storage

2010-06-07 Thread Ross Walker
On Jun 7, 2010, at 2:10 AM, Erik Trimble   
wrote:



Comments in-line.


On 6/6/2010 9:16 PM, Ken wrote:


I'm looking at VMWare, ESXi 4, but I'll take any advice offered.

On Sun, Jun 6, 2010 at 19:40, Erik Trimble  
 wrote:

On 6/6/2010 6:22 PM, Ken wrote:


Hi,

I'm looking to build a virtualized web hosting server environment  
accessing files on a hybrid storage SAN.  I was looking at using  
the Sun X-Fire x4540 with the following configuration:
6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA  
drives)

2 Intel X-25 32GB SSD's as a mirrored ZIL
4 Intel X-25 64GB SSD's as the L2ARC.
De-duplification
LZJB compression
The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:
Should I use NFS with all five VM's accessing the exports, or one  
LUN for each VM, accessed over iSCSI?


Generally speaking, it depends on your comfort level with running  
iSCSI  Volumes to put the VMs in, or serving everything out via NFS  
(hosting the VM disk file in an NFS filesystem).


If you go the iSCSI route, I would definitely go the "one iSCSI  
volume per VM" route - note that you can create multiple zvols per  
zpool on the X4540, so it's not limiting in any way to volume-ize a  
VM.  It's a lot simpler, easier, and allows for nicer management  
(snapshots/cloning/etc. on the X4540 side) if you go with a VM per  
iSCSI volume.


With NFS-hosted VM disks, do the same thing:  create a single  
filesystem on the X4540 for each VM.


Vmware has a 32 mount limit which may limit the OP somewhat here.


Performance-wise, I'd have to test, but I /think/ the iSCSI route  
will be faster. Even with the ZIL SSDs.


Actually, properly tuned they are about the same, but VMware NFS  
datastores are FSYNC on all operations, which isn't the best for data  
vmdk files; it is best to serve the data directly to the VM using either  
iSCSI or NFS.







Are the FSYNC speed issues with NFS resolved?


The ZIL SSDs will compensate for synchronous write issues in NFS.   
Not completely eliminate them, but you shouldn't notice issues with  
sync writing until you're up at pretty heavy loads.


You will need this with VMware as every NFS operation (not just file  
open/close) coming out of VMware will be marked FSYNC (for VM data  
integrity in the face of server failure).











If it were me (and, given what little I know of your data), I'd go  
like this:


(1) pool for VMs:
8 disks, MIRRORED
1 SSD for L2ARC
one Zvol per VM instance, served via iSCSI, each with:
DD turned ON,  Compression turned OFF

(1) pool for clients to write data to (log files, incoming data, etc.)
6 or 8 disks, MIRRORED
2 SSDs for ZIL, mirrored
Ideally, As many filesystems as you have webSITES, not just  
client VMs.  As this might be unwieldy for 100s of websites, you  
should segregate them into obvious groupings, taking care with write/ 
read permissions.

NFS served
DD OFF, Compression ON  (or OFF, if you seem to be  
having CPU overload on the X4540)


(1) pool for client read-only data
All the rest of the disks, split into 7 or 8-disk RAIDZ2 vdevs
All the remaining SSDs for L2ARC
As many filesystems as you have webSITES, not just client  
VMs.  (however, see above)

NFS served
DD on for selected websites (filesystems),  
Compression ON for everything


(2) Global hot spares.


Make your life easy and use NFS for VMs and data. If you need high  
performance data such as databases, use iSCSI zvols directly into the  
VM, otherwise NFS/CIFS into the VM should be good enough.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Migrating to ZFS

2010-06-02 Thread Ross Walker

On Jun 2, 2010, at 12:03 PM, zfsnoob4  wrote:


Wow thank you very much for the clear instructions.

And Yes, I have another 120GB drive for the OS, separate from A, B  
and C. I will repartition the drive and install Solaris. Then maybe  
at some point I'll delete the entire drive and just install a single  
OS.



I have a question about step 6, "Step 6: create a "dummy" drive as a  
sparse file: mkfile -n 1500G /foo"


I understand that I need to create a dummy drive and then immediatly  
remove it to run the raidz in degraded mode. But by creating the  
file with mkfile, will it allocate the 1.5TB right away on the OS  
drive? I was wondering because my OS drive is only 120GB, so won't  
it have a problem with creating a 1.5TB sparse file?


There is one potential pitfall in this method: if your Windows mirror  
is using dynamic disks, you can't access a dynamic disk with the NTFS  
driver under Solaris.


To get around this create a basic NTFS partition on the new third  
drive, copy the data to that drive and blow away the dynamic mirror.  
Then build the degraded raidz pool out of the two original mirror  
disks and copy the data back off the new third disk on to the raidz,  
then wipe the disk labels off that third drive and resilver the raidz.


A safer approach is to get a 2TB eSATA drive (a mirrored device to be  
extra safe) and copy the data there, then build a complete raidz and  
copy the data off the eSATA device to the raidz.


The risk and time it takes to copy data on to a degraded raidz isn't  
worth it. The write throughput on a degraded raidz will be horrible  
and the time it takes to copy the data over plus the time it takes in  
the red zone where it resilvers the raidz with no backup available...   
There is a high potential for tears here.


Get an external disk for your own sanity.

-Ross
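
On the narrower mkfile question: with -n the file is created sparse, so it
takes essentially no space on the 120GB OS drive until something writes to it,
and the offlined dummy never gets written. A rough sketch (paths and disk names
invented, and the caveats above about the degraded window still apply):

  # -n creates the file without allocating its blocks
  mkfile -n 1500g /var/tmp/fake

  # logical size vs. space actually used
  ls -lh /var/tmp/fake      # reports ~1.5T
  du -h  /var/tmp/fake      # reports ~0K

  # used as the third raidz member and taken offline immediately
  # (-f because zpool objects to mixing whole disks and a file in one vdev)
  zpool create -f newtank raidz c1t0d0 c1t1d0 /var/tmp/fake
  zpool offline newtank /var/tmp/fake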

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-21 Thread Ross Walker

On May 20, 2010, at 7:17 PM, Ragnar Sundblad  wrote:



On 21 maj 2010, at 00.53, Ross Walker wrote:


On May 20, 2010, at 6:25 PM, Travis Tabbal  wrote:


use a slog at all if it's not durable?  You should
disable the ZIL
instead.



This is basically where I was going. There only seems to be one  
SSD that is considered "working", the Zeus IOPS. Even if I had the  
money, I can't buy it. As my application is a home server, not a  
datacenter, things like NFS breaking if I don't reboot the clients  
is a non-issue. As long as the on-disk data is consistent so I  
don't have to worry about the entire pool going belly-up, I'm  
happy enough. I might lose 30 seconds of data, worst case, as a  
result of running without ZIL. Considering that I can't buy a  
proper ZIL at a cost I can afford, and an improper ZIL is not  
worth much, I don't see a reason to bother with ZIL at all. I'll  
just get a cheap large SSD for L2ARC, disable ZIL, and call it a  
day.


For my use, I'd want a device in the $200 range to even consider  
an slog device. As nothing even remotely close to that price range  
exists that will work properly at all, let alone with decent  
performance, I see no point in ZIL for my application. The  
performance hit is just too severe to continue using it without an  
slog, and there's no slog device I can afford that works properly,  
even if I ignore performance.


Just buy a caching RAID controller and run it in JBOD mode and have  
the ZIL integrated with the pool.


A 512MB-1024MB card with battery backup should do the trick. It  
might not have the capacity of an SSD, but in my experience it  
works well in the 1TB data moderately loaded range.


Have more data/activity then try more cards and more pools,  
otherwise pony up the  for a capacitor backed SSD.


It - again - depends on what problem you are trying to solve.

If the RAID controller goes bad on you so that you lose the
data in the write cache, your file system could be in pretty bad
shape. Most RAID controllers can't be mirrored. That would hardly
make a good replacement for a mirrored ZIL.

As far as I know, there is no single silver bullet to this issue.


That is true, and there are finite budgets as well, and as with all things in  
life one must make a trade-off somewhere.


If you have 2 mirrored SSDs that don't support cache flush and your  
power goes out your file system will be in the same bad shape.  
Difference is in the first place you paid a lot less to have your data  
hosed.


-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] New SSD options

2010-05-20 Thread Ross Walker

On May 20, 2010, at 6:25 PM, Travis Tabbal  wrote:


use a slog at all if it's not durable?  You should
disable the ZIL
instead.



This is basically where I was going. There only seems to be one SSD  
that is considered "working", the Zeus IOPS. Even if I had the  
money, I can't buy it. As my application is a home server, not a  
datacenter, things like NFS breaking if I don't reboot the clients  
is a non-issue. As long as the on-disk data is consistent so I don't  
have to worry about the entire pool going belly-up, I'm happy  
enough. I might lose 30 seconds of data, worst case, as a result of  
running without ZIL. Considering that I can't buy a proper ZIL at a  
cost I can afford, and an improper ZIL is not worth much, I don't  
see a reason to bother with ZIL at all. I'll just get a cheap large  
SSD for L2ARC, disable ZIL, and call it a day.


For my use, I'd want a device in the $200 range to even consider an  
slog device. As nothing even remotely close to that price range  
exists that will work properly at all, let alone with decent  
performance, I see no point in ZIL for my application. The  
performance hit is just too severe to continue using it without an  
slog, and there's no slog device I can afford that works properly,  
even if I ignore performance.


Just buy a caching RAID controller and run it in JBOD mode and have  
the ZIL integrated with the pool.


A 512MB-1024MB card with battery backup should do the trick. It might  
not have the capacity of an SSD, but in my experience it works well in  
the 1TB data moderately loaded range.


Have more data/activity then try more cards and more pools, otherwise  
pony up the  for a capacitor backed SSD.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-13 Thread Ross Walker
On May 12, 2010, at 7:12 PM, Richard Elling   
wrote:



On May 11, 2010, at 10:17 PM, schickb wrote:

I'm looking for input on building an HA configuration for ZFS. I've  
read the FAQ and understand that the standard approach is to have a  
standby system with access to a shared pool that is imported during  
a failover.


The problem is that we use ZFS for a specialized purpose that  
results in 10's of thousands of filesystems (mostly snapshots and  
clones). All versions of Solaris and OpenSolaris that we've tested  
take a long time (> hour) to import that many filesystems.


I've read about replication through AVS, but that also seems  
require an import during failover. We'd need something closer to an  
active-active configuration (even if the second active is only  
modified through replication). Or some way to greatly speedup  
imports.


Any suggestions?


The import is fast, but two other operations occur during import  
that will

affect boot time:
   + for each volume (zvol) and its snapshots, a device tree entry is
  made in /devices
   + for each NFS share, the file system is (NFS) exported

When you get into the thousands of datasets and snapshots range, this
takes some time. Several RFEs have been implemented over the past few
years to help improve this.

NB.  Running in a VM doesn't improve the share or device enumeration  
time.


The idea I propose is to use VMs in a manner such that the server does  
not have to be restarted in the event of a hardware failure thus  
avoiding the enumerations by using VMware's hot-spare VM technology.


Of course using VMs could also mean the OP could have multiple ZFS  
servers such that the datasets could be spread evenly between them.


This could conceivably be done in containers within the 2 original VMs  
so as to maximize ARC space.


-Ross
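
A quick way to gauge how much of this applies to a given pool (pool name
invented):

  # how many datasets and snapshots an import has to walk
  zfs list -H -t filesystem,volume,snapshot | wc -l

  # how many filesystems will also need an NFS share at import time
  zfs get -rH -t filesystem -o value sharenfs tank | grep -vc '^off$'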

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-12 Thread Ross Walker
On May 12, 2010, at 3:06 PM, Manoj Joseph   
wrote:



Ross Walker wrote:

On May 12, 2010, at 1:17 AM, schickb  wrote:


I'm looking for input on building an HA configuration for ZFS. I've
read the FAQ and understand that the standard approach is to have a
standby system with access to a shared pool that is imported during
a failover.

The problem is that we use ZFS for a specialized purpose that
results in 10's of thousands of filesystems (mostly snapshots and
clones). All versions of Solaris and OpenSolaris that we've tested
take a long time (> hour) to import that many filesystems.

I've read about replication through AVS, but that also seems require
an import during failover. We'd need something closer to an active-
active configuration (even if the second active is only modified
through replication). Or some way to greatly speedup imports.

Any suggestions?


Bypass the complexities of AVS and the start-up times by implementing
a ZFS head server in a pair of ESX/ESXi with Hot-spares using
redundant back-end storage (EMC, NetApp, Equalogics).

Then, if there is a hardware or software failure of the head server  
or

the host it is on, the hot-spare automatically kicks in with the same
running state as the original.


By hot-spare here, I assume you are talking about a hot-spare ESX
virtual machine.

If there is a software issue and the hot-spare server comes up with  
the
same state, is it not likely to fail just like the primary server?  
If it

does not, can you explain why it would not?


That's a good point and worth looking into. I guess it would fail as  
well, since a VMware hot-spare is like a VM in constant vMotion where  
active memory is mirrored between the two.


I suppose one would need a hot-spare for hardware failure and a cold- 
spare for software failure. Both scenarios are possible with ESX, the  
cold spare I suppose in this instance would be the original VM  
rebooting.


Recovery time would be about the same in this instance as an AVS  
solution that has to mount tens of thousands of filesystems though, so it wins with a  
hardware failure and ties with a software failure, wins with ease  
of setup and maintenance, but loses on additional cost. Guess it  
all depends on your risk analysis whether it is worth it.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS High Availability

2010-05-12 Thread Ross Walker

On May 12, 2010, at 1:17 AM, schickb  wrote:

I'm looking for input on building an HA configuration for ZFS. I've  
read the FAQ and understand that the standard approach is to have a  
standby system with access to a shared pool that is imported during  
a failover.


The problem is that we use ZFS for a specialized purpose that  
results in 10's of thousands of filesystems (mostly snapshots and  
clones). All versions of Solaris and OpenSolaris that we've tested  
take a long time (> hour) to import that many filesystems.


I've read about replication through AVS, but that also seems require  
an import during failover. We'd need something closer to an active- 
active configuration (even if the second active is only modified  
through replication). Or some way to greatly speedup imports.


Any suggestions?


Bypass the complexities of AVS and the start-up times by implementing  
a ZFS head server in a pair of ESX/ESXi with Hot-spares using  
redundant back-end storage (EMC, NetApp, Equalogics).


Then, if there is a hardware or software failure of the head server or  
the host it is on, the hot-spare automatically kicks in with the same  
running state as the original.


There should be no interruption of services in this setup.

This type of arrangement provides for oodles of flexibility in testing/ 
upgrading deployments as well.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance of the ZIL

2010-05-06 Thread Ross Walker
On May 6, 2010, at 8:34 AM, Edward Ned Harvey   
wrote:



From: Pasi Kärkkäinen [mailto:pa...@iki.fi]


In neither case do you have data or filesystem corruption.



ZFS probably is still OK, since it's designed to handle this (?),
but the data can't be OK if you lose 30 secs of writes.. 30 secs of
writes
that have been ack'd being done to the servers/applications..


What I meant was:  Yes there's data loss.  But no corruption.  In  
other
filesystems, if you have an ungraceful shutdown while the filesystem  
is
writing, since filesystems such as EXT3 perform file-based (or inode- 
based)
block write operations, then you can have files whose contents have  
been
corrupted...  Some sectors of the file still in their "old" state,  
and some
sectors of the file in their "new" state.  Likewise, in something  
like EXT3,

you could have some file fully written, while another one hasn't been
written yet, but should have been.  (AKA, some files written out of  
order.)


In the case of EXT3, since it is a journaled filesystem, the journal  
only
keeps the *filesystem* consistent after a crash.  It's still  
possible to

have corrupted data in the middle of a file.


I believe ext3 has an option to journal data as well as metadata; it  
just defaults to metadata.


I don't believe out-of-order writes are so much an issue any more  
since Linux gained write barrier support (and most file systems and  
block devices now support it).



These things don't happen in ZFS.  ZFS takes journaling to a whole new
level.  Instead of just keeping your filesystem consistent, it also  
keeps
your data consistent.  Yes, data loss is possible when a system  
crashes, but
the filesystem will never have any corruption.  These are separate  
things

now, and never were before.


ZFS does NOT have a journal; it has an intent log, which is completely  
different. A journal logs operations that are to be performed later  
(the journal is read, then the operation performed); an intent log logs  
operations that are being performed now, and when the disk flushes, the  
intent entry is marked complete.


ZFS is consistent by the nature of COW which means a partial write  
will not become part of the file system (the old block pointer isn't  
updated till the new block completes the write).


In ZFS, losing n-seconds of writes leading up to the crash will  
never result
in files partially written, or written out of order.  Every atomic  
write to
the filesystem results in a filesystem-consistent and data- 
consistent view

of *some* valid form of all the filesystem and data within it.


The ZFS file system will always be consistent, but if an application  
doesn't flush its data, then it can definitely have partially written  
data.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots and Data Loss

2010-04-23 Thread Ross Walker

On Apr 22, 2010, at 11:03 AM, Geoff Nordli  wrote:


From: Ross Walker [mailto:rswwal...@gmail.com]
Sent: Thursday, April 22, 2010 6:34 AM

On Apr 20, 2010, at 4:44 PM, Geoff Nordli   
wrote:



If you combine the hypervisor and storage server and have students
connect to the VMs via RDP or VNC or XDM then you will have the
performance of local storage and even script VirtualBox to take a
snapshot right after a save state.

A lot less difficult to configure on the client side, and allows you
to deploy thin clients instead of full desktops where you can get  
away

with it.

It also allows you to abstract the hypervisor from the client.

Need a bigger storage server with lots of memory, CPU and storage
though.

Later, if need be, you can break out the disks to a storage appliance
with an 8GB FC or 10Gbe iSCSI interconnect.



Right, I am in the process now of trying to figure out what the load  
looks

like with a central storage box and how ZFS needs to be configured to
support that load.  So far what I am seeing is very exciting :)

We are currently porting over our existing Learning Lab Infrastructure
platform from MS Virtual Server to VBox + ZFS.  When students  
connect into
their lab environment it dynamically creates their VMs and load  
balances

them across physical servers.


You can also check out OpenSolaris' Xen implementation, which, if you  
use Linux VMs, will allow PV VMs as well as hardware-assisted fully  
virtualized Windows VMs. There are public domain Windows Xen drivers  
out there.


The advantage of using Xen is its VM live migration and XMLRPC  
management API. As it runs as a bare-metal hypervisor it also allows  
fine granularity of CPU scheduling between guests and the host VM, but  
unfortunately its remote display technology leaves something to be  
desired. For Windows VMs I use the built-in remote desktop, and for  
Linux VMs I use XDM and use something like 'thinstation' on the client  
side.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Snapshots and Data Loss

2010-04-22 Thread Ross Walker

On Apr 20, 2010, at 4:44 PM, Geoff Nordli  wrote:


From: matthew patton [mailto:patto...@yahoo.com]
Sent: Tuesday, April 20, 2010 12:54 PM

Geoff Nordli  wrote:


With our particular use case we are going to do a "save
state" on their
virtual machines, which is going to write  100-400 MB
per VM via CIFS or
NFS, then we take a snapshot of the volume, which
guarantees we get a
consistent copy of their VM.


maybe you left out a detail or two but I can't see how your ZFS  
snapshot

is going to be consistent UNLESS every VM on that ZFS volume is
prevented from doing any and all I/O from the time it finishes "save
state" and you take your ZFS snapshot.

If by "save state" you mean something akin to VMWare's disk snapshot,
why would you even bother with a ZFS snapshot in addition?



We are using VirtualBox as our hypervisor.  When it does a save state
it generates a memory file.  The memory file plus the volume snapshot
creates a consistent state.

In our platform each student's VM points to a unique backend volume
via iscsi using VBox's built-in iscsi initiator.  So there is a
one-to-one relationship between VM and Volume.  Just for clarity, a
single VM could have multiple disks attached to it.  In that
scenario, then a VM would have multiple volumes.



end we could have
maybe 20-30 VMs getting saved at the same time, which could
mean several GB
of data would need to get written in a short time frame and
would need to
get committed to disk.

So it seems the best case would be to get those "save
state" writes as sync
and get them into a ZIL.


That I/O pattern is vastly >32kb and so will hit the 'rust' ZIL
(which ALWAYS exists) and if you were thinking an SSD would help you,
I don't see any/much evidence it will buy you anything.




If I set the logbias (b122) to latency, then it will direct all sync  
IO to
the log device, even if it exceeds the zfs_immediate_write_sz  
threshold.


If you combine the hypervisor and storage server and have students  
connect to the VMs via RDP or VNC or XDM then you will have the  
performance of local storage and even script VirtualBox to take a  
snapshot right after a save state.
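
A minimal sketch of that kind of scripting (the VM and dataset names
are just placeholders):

  #!/bin/sh
  # save the VM state, then snapshot the zvol backing it
  VBoxManage controlvm "student-vm01" savestate
  zfs snapshot tank/vms/student-vm01@`date +%Y%m%d-%H%M`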


A lot less difficult to configure on the client side, and allows you  
to deploy thin clients instead of full desktops where you can get away  
with it.


It also allows you to abstract the hypervisor from the client.

Need a bigger storage server with lots of memory, CPU and storage  
though.


Later, if need be, you can break out the disks to a storage appliance  
with an 8GB FC or 10Gbe iSCSI interconnect.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can RAIDZ disks be slices ?

2010-04-21 Thread Ross Walker

On Apr 20, 2010, at 12:13 AM, Sunil  wrote:


Hi,

I have a strange requirement. My pool consists of 2 500GB disks in  
stripe which I am trying to convert into a RAIDZ setup without data  
loss but I have only two additional disks: 750GB and 1TB. So, here  
is what I thought:


1. Carve a 500GB slice (A) in 750GB and 2 500GB slices (B,C) in 1TB.
2. Create a RAIDZ pool out of these 3 slices. Performance will be
bad because of seeks in the same disk for B and C but it's just
temporary.
3. zfs send | recv my current pool data into the new pool.
4. Destroy the current pool.
5. In the new pool, replace B with the 500GB disk freed by the  
destruction of the current pool.
6. Optionally, replace C with second 500GB to free up the 750GB  
completely.


So, essentially I have slices out of 3 separate disks giving me my  
needed 1TB space. Additional 500GB on the 1TB drive can be used for  
scratch non-important data or may be even mirrored with a slice from  
750GB disk.


Will this work as I am hoping it should?

Any potential gotchas?


Wouldn't it just be easier to zfs send to a file on the 1TB, build  
your raidz, then zfs recv into the new raidz from this file?
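
Roughly something like this (disk and pool names are placeholders,
and I'm assuming the current pool is called "tank"):

  zpool create scratch c1t3d0                 # the 1TB disk
  zfs snapshot -r tank@migrate
  zfs send -R tank@migrate > /scratch/tank.zfs
  zpool destroy tank
  zpool create newtank raidz c1t0d0 c1t1d0 c1t2d0
  zfs recv -dF newtank < /scratch/tank.zfs

Obviously check that the stream fits on the 1TB disk and verify it
before destroying the old pool.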


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] SSD best practices

2010-04-19 Thread Ross Walker

On Apr 19, 2010, at 12:50 PM, Don  wrote:


Now I'm simply confused.

Do you mean one cachefile shared between the two nodes for this  
zpool? How, may I ask, would this work?


The rpool should be in /etc/zfs/zpool.cache.

The shared pool should be in /etc/cluster/zpool.cache (or wherever  
you prefer to put it) so it won't come up on system start.
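
For example (pool, disks and path are just placeholders):

  zpool create -o cachefile=/etc/cluster/zpool.cache \
      shared mirror c2t0d0 c2t1d0
  # or, for an existing pool:
  zpool import -o cachefile=/etc/cluster/zpool.cache shared
  zpool get cachefile shared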


What I don't understand is how the second node is either a) supposed
to share the first node's cachefile or b) create its own without
importing the pool.


You say this is the job of the cluster software- does ha-cluster  
already handle this with their ZFS modules?


I've asked this question 5 different ways and I either still haven't  
gotten an answer- or still don't understand the problem.


Is there a way for a passive node to generate its _own_ zpool.cache
without importing the file system. If so- how. If not- why is this
unimportant?


I don't run the cluster suite, but I'd be surprised if the software  
doesn't copy the cache to the passive node whenever it's updated.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-02 Thread Ross Walker
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey
 wrote:
>> > Seriously, all disks configured WriteThrough (spindle and SSD disks
>> > alike)
>> > using the dedicated ZIL SSD device, very noticeably faster than
>> > enabling the
>> > WriteBack.
>>
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>>
>> I mean if you have both why not use both? Then both async and sync IO
>> benefits.
>
> Interesting, but unfortunately false.  Soon I'll post the results here.  I
> just need to package them in a way suitable to give the public, and stick it
> on a website.  But I'm fighting IT fires for now and haven't had the time
> yet.
>
> Roughly speaking, the following are approximately representative.  Of course
> it varies based on tweaks of the benchmark and stuff like that.
>        Stripe 3 mirrors write through:  450-780 IOPS
>        Stripe 3 mirrors write back:  1030-2130 IOPS
>        Stripe 3 mirrors write back + SSD ZIL:  1220-2480 IOPS
>        Stripe 3 mirrors write through + SSD ZIL:  1840-2490 IOPS
>
> Overall, I would say WriteBack is 2-3 times faster than naked disks.  SSD
> ZIL is 3-4 times faster than naked disk.  And for some reason, having the
> WriteBack enabled while you have SSD ZIL actually hurts performance by
> approx 10%.  You're better off to use the SSD ZIL with disks in Write
> Through mode.
>
> That result is surprising to me.  But I have a theory to explain it.  When
> you have WriteBack enabled, the OS issues a small write, and the HBA
> immediately returns to the OS:  "Yes, it's on nonvolatile storage."  So the
> OS quickly gives it another, and another, until the HBA write cache is full.
> Now the HBA faces the task of writing all those tiny writes to disk, and the
> HBA must simply follow orders, writing a tiny chunk to the sector it said it
> would write, and so on.  The HBA cannot effectively consolidate the small
> writes into a larger sequential block write.  But if you have the WriteBack
> disabled, and you have a SSD for ZIL, then ZFS can log the tiny operation on
> SSD, and immediately return to the process:  "Yes, it's on nonvolatile
> storage."  So the application can issue another, and another, and another.
> ZFS is smart enough to aggregate all these tiny write operations into a
> single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test was the ZIL SSD included in the
write-back?

What I was proposing was write-back only on the disks, and ZIL SSD
with no write-back.

Not all operations hit the ZIL, so it would still be nice to have the
non-ZIL operations return quickly.

-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat
 wrote:
> On 01/04/2010 14:49, Ross Walker wrote:
>>>
>>> We're talking about the "sync" for NFS exports in Linux; what do they
>>> mean
>>> with "sync" NFS exports?
>>
>> See section A1 in the FAQ:
>>
>> http://nfs.sourceforge.net/
>
> I think B4 is the answer to Casper's question:
>
>  BEGIN QUOTE 
> Linux servers (although not the Solaris reference implementation) allow this
> requirement to be relaxed by setting a per-export option in /etc/exports.
> The name of this export option is "[a]sync" (note that there is also a
> client-side mount option by the same name, but it has a different function,
> and does not defeat NFS protocol compliance).
>
> When set to "sync," Linux server behavior strictly conforms to the NFS
> protocol. This is default behavior in most other server implementations.
> When set to "async," the Linux server replies to NFS clients before flushing
> data or metadata modifying operations to permanent storage, thus improving
> performance, but breaking all guarantees about server reboot recovery.
>  END QUOTE 
>
> For more info the whole of section B4 though B6.

True, I was thinking more of the protocol summary.

> Is that what "sync" means in Linux?  As NFS doesn't use "close" or
> "fsync", what exactly are the semantics.
>
> (For NFSv2/v3 each *operation* is sync and the client needs to make sure
> it can continue; for NFSv4, some operations are async and the client
> needs to use COMMIT)

Actually the COMMIT command was introduced in NFSv3.

The full details:

NFS Version 3 introduces the concept of "safe asynchronous writes." A
Version 3 client can specify that the server is allowed to reply
before it has saved the requested data to disk, permitting the server
to gather small NFS write operations into a single efficient disk
write operation. A Version 3 client can also specify that the data
must be written to disk before the server replies, just like a Version
2 write. The client specifies the type of write by setting the
stable_how field in the arguments of each write operation to UNSTABLE
to request a safe asynchronous write, and FILE_SYNC for an NFS Version
2 style write.

Servers indicate whether the requested data is permanently stored by
setting a corresponding field in the response to each NFS write
operation. A server can respond to an UNSTABLE write request with an
UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the
requested data resides on permanent storage yet. An NFS
protocol-compliant server must respond to a FILE_SYNC request only
with a FILE_SYNC reply.

Clients ensure that data that was written using a safe asynchronous
write has been written onto permanent storage using a new operation
available in Version 3 called a COMMIT. Servers do not send a response
to a COMMIT operation until all data specified in the request has been
written to permanent storage. NFS Version 3 clients must protect
buffered data that has been written using a safe asynchronous write
but not yet committed. If a server reboots before a client has sent an
appropriate COMMIT, the server can reply to the eventual COMMIT
request in a way that forces the client to resend the original write
operation. Version 3 clients use COMMIT operations when flushing safe
asynchronous writes to the server during a close(2) or fsync(2) system
call, or when encountering memory pressure.
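
For contrast, the Linux per-export switch from the FAQ section quoted
earlier looks roughly like this in /etc/exports (paths are
illustrative):

  /export/safe  *(rw,sync,no_subtree_check)   # protocol-compliant
  /export/fast  *(rw,async,no_subtree_check)  # replies before data is stable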
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker

On Apr 1, 2010, at 8:42 AM, casper@sun.com wrote:




Is that what "sync" means in Linux?


A sync write is one in which the application blocks until the OS
acks that the write has been committed to disk.  An async write is
given to the OS, and the OS is permitted to buffer the write to disk
at its own discretion.  Meaning the async write function call returns
sooner, and the application is free to continue doing other stuff,
including issuing more writes.

Async writes are faster from the point of view of the application.
But sync writes are done by applications which need to satisfy a race
condition for the sake of internal consistency.  Applications which
need to know their next commands will not begin until after the
previous sync write was committed to disk.



We're talking about the "sync" for NFS exports in Linux; what do
they mean with "sync" NFS exports?


See section A1 in the FAQ:

http://nfs.sourceforge.net/

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey  
 wrote:


We ran into something similar with these drives in an X4170 that
turned out to be an issue of the preconfigured logical volumes on the
drives. Once we made sure all of our Sun PCI HBAs were running the
exact same version of firmware and recreated the volumes on new
drives arriving from Sun we got back into sync on the X25-E device
sizes.


Can you elaborate?  Just today, we got the replacement drive that has
precisely the right version of firmware and everything.  Still, when
we plugged in that drive, and "create simple volume" in the
storagetek raid utility, the new drive is 0.001 GB smaller than the
old drive.  I'm still hosed.

Are you saying I might benefit by sticking the SSD into some laptop,
and zero'ing the disk?  And then attach to the sun server?

Are you saying I might benefit by finding some other way to make the
drive available, instead of using the storagetek raid utility?


I know it is way after the fact, but I find it best to coerce each  
drive down to the whole GB boundary using format (create Solaris  
partition just up to the boundary). Then if you ever get a drive a  
little smaller it still should fit.
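
From memory the dance looks something like this (the format menus
vary a bit between releases, and the disk names are placeholders):

  format -e c1t2d0
    # -> partition -> 0, set the slice size to a round number
    #    (e.g. 29gb) safely inside the smallest disk you expect,
    #    then label and quit
  zpool create tank mirror c1t2d0s0 c1t3d0s0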


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-04-01 Thread Ross Walker
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey  
 wrote:



A MegaRAID card with write-back cache? It should also be cheaper than
the F20.


I haven't posted results yet, but I just finished a few weeks of
extensive benchmarking various configurations.  I can say this:

WriteBack cache is much faster than "naked" disks, but if you can buy
an SSD or two for ZIL log device, the dedicated ZIL is yet again much
faster than WriteBack.

It doesn't have to be F20.  You could use the Intel X25 for example.
If you're running solaris proper, you better mirror your ZIL log
device.  If you're running opensolaris ... I don't know if that's
important.  I'll probably test it, just to be sure, but I might never
get around to it because I don't have a justifiable business reason
to build the opensolaris machine just for this one little test.

Seriously, all disks configured WriteThrough (spindle and SSD disks
alike) using the dedicated ZIL SSD device, very noticeably faster
than enabling the WriteBack.


What do you get with both SSD ZIL and WriteBack disks enabled?

I mean if you have both why not use both? Then both async and sync IO  
benefits.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Ross Walker
On Mar 31, 2010, at 10:25 PM, Richard Elling  
 wrote:




On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:

On Mar 31, 2010, at 5:39 AM, Robert Milkowski   
wrote:





On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
Use something other than Open/Solaris with ZFS as an NFS  
server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying to more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.




Well, for lots of environments disabling ZIL is perfectly  
acceptable.
And frankly the reason you get better performance out of the box  
on Linux as NFS server is that it actually behaves like with  
disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse  
than using Linux here or any other OS which behaves in the same  
manner. Actually it makes it better as even if ZIL is disabled the
ZFS filesystem is always consistent on disk and you still get all
the other benefits from ZFS.


What would be useful though is to be able to easily disable ZIL  
per dataset instead of OS wide switch.
This feature has already been coded and tested and awaits a formal  
process to be completed in order to get integrated. Should be  
rather sooner than later.


Well being fair to Linux the default for NFS exports is to export
them 'sync' now, which syncs to disk on close or fsync. It has been
many years since they exported 'async' by default. Now if Linux
admins set their shares 'async' and lose important data then it's
operator error and not Linux's fault.


If apps don't care about their data consistency and don't sync
their data I don't see why the file server has to care for them. I
mean if it were a local file system and the machine rebooted the
data would be lost too. Should we care more for data written
remotely than locally?


This is not true for sync data written locally, unless you disable  
the ZIL locally.


No, of course, if it's written sync with the ZIL; it just seems that
over Solaris NFS all writes are delayed, not just sync writes.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sun Flash Accelerator F20 numbers

2010-03-31 Thread Ross Walker

On Mar 31, 2010, at 5:39 AM, Robert Milkowski  wrote:




On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
  Use something other than Open/Solaris with ZFS as an NFS  
server?  :)


I don't think you'll find the performance you paid for with ZFS and
Solaris at this time. I've been trying to more than a year, and
watching dozens, if not hundreds of threads.
Getting half-ways decent performance from NFS and ZFS is impossible
unless you disable the ZIL.




Well, for lots of environments disabling ZIL is perfectly acceptable.
And frankly the reason you get better performance out of the box on  
Linux as NFS server is that it actually behaves like with disabled  
ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using  
Linux here or any other OS which behaves in the same manner.  
Actually it makes it better as even if ZIL is disabled the ZFS
filesystem is always consistent on disk and you still get all the
other benefits from ZFS.


What would be useful though is to be able to easily disable ZIL per  
dataset instead of OS wide switch.
This feature has already been coded and tested and awaits a formal  
process to be completed in order to get integrated. Should be rather  
sooner than later.


Well being fair to Linux the default for NFS exports is to export them
'sync' now, which syncs to disk on close or fsync. It has been many
years since they exported 'async' by default. Now if Linux admins set
their shares 'async' and lose important data then it's operator error
and not Linux's fault.


If apps don't care about their data consistency and don't sync their
data I don't see why the file server has to care for them. I mean if
it were a local file system and the machine rebooted the data would be
lost too. Should we care more for data written remotely than locally?


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ISCSI + RAID-Z + OpenSolaris HA

2010-03-20 Thread Ross Walker

On Mar 20, 2010, at 11:48 AM, vikkr  wrote:


THX Ross, i plan exporting each drive individually over iSCSI.
In this case, the writes, as well as reads, will go to all 6 disks
at once, right?


The only question - how to calculate fault tolerance of such a  
system if the discs are all different in size?

Maybe there is such a tool? or check?


They should all be the same size.

You can make them the same size on the iSCSI target.

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ISCSI + RAID-Z + OpenSolaris HA

2010-03-20 Thread Ross Walker

On Mar 20, 2010, at 10:18 AM, vikkr  wrote:


Hi sorry for bad eng and picture :).

Can such a design work?

3 openfiler servers each give their drives (2 x 1 TB) as iSCSI
targets to OpenSolaris.

On OpenSolaris a RAID-Z with double parity is assembled from them.
The OpenSolaris server provides NFS access to this array, and is
duplicated by means of Open HA Cluster.


Yes, you can.

With three servers you want to provide resiliency against the loss
of any one server.


I guess these are mirrors in each server?

If so, you will get better performance and more usable capacity by
exporting each drive individually over iSCSI and setting the 6 drives
up as a raidz2 or even raidz3, which will give 3-4 drives of capacity;
raidz3 will provide resiliency against a drive failure during a
server failure.


-Ross
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Can we get some documentation on iSCSI sharing after comstar took over?

2010-03-17 Thread Ross Walker





On Mar 17, 2010, at 2:30 AM, Erik Ableson  wrote:



On 17 mars 2010, at 00:25, Svein Skogen  wrote:


-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 16.03.2010 22:31, erik.ableson wrote:


On 16 mars 2010, at 21:00, Marc Nicholas wrote:

On Tue, Mar 16, 2010 at 3:16 PM, Svein Skogen <sv...@stillbilde.net> wrote:



I'll write you a Perl script :)


  I think there are ... several people that'd like a script that
  gave us back some of the ease of the old shareiscsi one-off,
  instead of having to spend time on copy-and-pasting GUIDs they
  have ... no real use for. ;)


I'll try and knock something up in the next few days, then!


Try this :

http://www.infrageeks.com/groups/infrageeks/wiki/56503/zvol2iscsi.html



Thank you! :)

Mind if I (after some sleep) look at extending your script a little?
Of course with feedback of the changes I make?

//Svein

Certainly! I just whipped that up since I was testing out a pile of  
clients with different volumes and got tired of going through all  
the steps so anything to make it more complete would be useful.


How about a perl script that emulates the functionality of iscsitadm  
so share=iscsi works as expected?


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker

On Mar 15, 2010, at 11:10 PM, Tim Cook  wrote:




On Mon, Mar 15, 2010 at 9:10 PM, Ross Walker   
wrote:

On Mar 15, 2010, at 7:11 PM, Tonmaus  wrote:

Being an iscsi
target, this volume was mounted as a single iscsi
disk from the solaris host, and prepared as a zfs
pool consisting of this single iscsi target. ZFS best
practices, tell me that to be safe in case of
corruption, pools should always be mirrors or raidz
on 2 or more disks. In this case, I considered all
safe, because the mirror and raid was managed by the
storage machine.

As far as I understand Best Practises, redundancy needs to be within  
zfs in order to provide full protection. So, actually Best Practises  
says that your scenario is rather one to be avoided.


There is nothing saying redundancy can't be provided below ZFS; it's
just that if you want auto recovery you need redundancy within ZFS
itself as well.


You can have 2 separate raid arrays served up via iSCSI to ZFS which  
then makes a mirror out of the storage.


-Ross


Perhaps I'm remembering incorrectly, but I didn't think mirroring  
would auto-heal/recover, I thought that was limited to the raidz*  
implementations.


Mirroring auto-heals; in fact copies=2 on a single disk vdev can
auto-heal (if it isn't a disk failure).
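
e.g. (pool/dataset names are just examples):

  zfs set copies=2 tank/important
  zfs get copies tank/important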


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker

On Mar 15, 2010, at 7:11 PM, Tonmaus  wrote:


Being an iscsi
target, this volume was mounted as a single iscsi
disk from the solaris host, and prepared as a zfs
pool consisting of this single iscsi target. ZFS best
practices, tell me that to be safe in case of
corruption, pools should always be mirrors or raidz
on 2 or more disks. In this case, I considered all
safe, because the mirror and raid was managed by the
storage machine.


As far as I understand Best Practises, redundancy needs to be within  
zfs in order to provide full protection. So, actually Best Practises  
says that your scenario is rather one to be avoided.


There is nothing saying redundancy can't be provided below ZFS; it's
just that if you want auto recovery you need redundancy within ZFS
itself as well.


You can have 2 separate raid arrays served up via iSCSI to ZFS which  
then makes a mirror out of the storage.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker
On Mar 15, 2010, at 12:19 PM, Ware Adams   
wrote:




On Mar 15, 2010, at 12:13 PM, Gabriele Bulfon wrote:

Well, I actually don't know what implementation is inside this  
legacy machine.
This machine is an AMI StoreTrends ITX, but maybe it has been built  
around IET, don't know.
Well, maybe I should disable write-back on every zfs host  
connecting on iscsi?

How do I check this?


I think this would be a property of the NAS, not the clients.


Yes, Ware's right the setting should be on the AMI device.

I don't know what target it's using either, but if it has an option to  
disable write-back caching at least then if it doesn't honor flushing  
your data should still be safe.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] corruption of ZFS on iScsi storage

2010-03-15 Thread Ross Walker
On Mar 15, 2010, at 10:55 AM, Gabriele Bulfon   
wrote:



Hello,
I'd like to check for any guidance about using zfs on iscsi storage  
appliances.
Recently I had an unlucky situation with an unlucky storage machine  
freezing.
Once the storage was up again (rebooted) all other iscsi clients  
were happy, while one of the iscsi clients (a sun solaris sparc,  
running Oracle) did not mount the volume marking it as corrupted.
I had no way to get back my zfs data: had to destroy and recreate  
from backups.

So I have some questions regarding this nice story:
- I remember sysadmins being able to almost always recover data on  
corrupted ufs filesystems by magic of superblocks. Is there  
something similar on zfs? Is there really no way to access data of a  
corrupted zfs filesystem?
- In this case, the storage appliance is a legacy system based on  
linux, so raids/mirrors are managed at the storage side its own way.  
Being an iscsi target, this volume was mounted as a single iscsi  
disk from the solaris host, and prepared as a zfs pool consisting of  
this single iscsi target. ZFS best practices, tell me that to be  
safe in case of corruption, pools should always be mirrors or raidz  
on 2 or more disks. In this case, I considered all safe, because the  
mirror and raid was managed by the storage machine. But from the  
solaris host point of view, the pool was just one! And maybe this  
has been the point of failure. What is the correct way to go in this  
case?
- Finally, looking forward to run new storage appliances using  
OpenSolaris and its ZFS+iscsitadm and/or comstar, I feel a bit  
confused by the possibility of having a double zfs situation: in  
this case, I would have the storage zfs filesystem divided into zfs  
volumes, accessed via iscsi by a possible solaris host that creates  
his own zfs pool on it (...is it too redundant??) and again I would  
fall in the same previous case (host zfs pool connected to one only  
iscsi resource).


Any guidance would be really appreciated :)
Thanks a lot
Gabriele.


What iSCSI target was this?

If it was IET I hope you were NOT using the write-back option on it as  
it caches write data in volatile RAM.


IET does support cache flushes, but if you cache in RAM (bad idea) a
system lockup or panic will ALWAYS lose data.
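
From memory the knob lives in /etc/ietd.conf, something along these
lines (names are placeholders and the option spellings vary between
IET versions):

  Target iqn.2010-03.local.san:vol1
      # write-through fileio, or Type=blockio to bypass the page cache
      Lun 0 Path=/dev/vg0/vol1,Type=fileio,IOMode=wt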


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS - VMware ESX --> vSphere Upgrade : Zpool Faulted

2010-03-11 Thread Ross Walker

On Mar 11, 2010, at 12:31 PM, Andrew  wrote:


Hi Ross,

Ok - as a Solaris newbie.. i'm going to need your help.

Format produces the following:-

c8t4d0 (VMware-Virtualdisk-1.0 cyl 65268 alt 2 hd 255 sec 126) / 
p...@0,0/pci15ad,1...@10/s...@4,0


What dd command do I need to run to reference this disk? I've tried
/dev/rdsk/c8t4d0 and /dev/dsk/c8t4d0 but neither of them are valid.


dd if=/dev/rdsk/c8t4d0p0 of=~/disk.out bs=512 count=256

That should get you the first 128K.

As for a hex editor, try bvi, like vi but for binary; it supports
many of the vi commands.


Search for signature 0x55AA (little endian) which should be bytes 511  
and 512 of the MBR.


There is also the possibility that these were wiped somehow, or even  
cached in vmware and lost during a vm reset.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS - VMware ESX --> vSphere Upgrade : Zpool Faulted

2010-03-11 Thread Ross Walker

On Mar 11, 2010, at 8:27 AM, Andrew  wrote:


Ok,

The fault appears to have occurred regardless of the attempts to  
move to vSphere as we've now moved the host back to ESX 3.5 from  
whence it came and the problem still exists.


Looks to me like the fault occurred as a result of a reboot.

Any help and advice would be greatly appreciated.


It appears the RDM might have had something to do with this.

Try a different RDM setting than physical, like virtual. Try mounting
the disk via the iSCSI initiator inside the VM instead of RDM.


If you tried fiddling with the ESX RDM options and it still doesn't
work... Inside the Solaris VM, dump the first 128k of the disk to a
file using dd, then using a hex editor find out which LBA contains
the MBR. It should be LBA 0, but I suspect it will be offset. The GPT
will then start at MBR LBA + 1 to MBR LBA + 33. Use the wikipedia
entry for MBR; there is a unique identifier in there somewhere to
search for.


There is a backup GPT also in the last 33 sectors of the disk.

Once you find the offset it is best to just dump those 34 sectors  
(0-33) to another file. Edit each MBR and GPT entry to take into  
account the offset then copy those 34 sectors into the first 34  
sectors of the disk, and the last 33 sectors of the file to the last  
33 sectors of the disk. Rescan, and hopefully it will see the disk.
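
The copy-back step would look roughly like this (device name, file
names and TOTAL_SECTORS are placeholders - triple-check them before
writing anything back):

  dd if=front-fixed.bin of=/dev/rdsk/c8t4d0p0 bs=512 count=34 conv=notrunc
  dd if=back-fixed.bin of=/dev/rdsk/c8t4d0p0 bs=512 \
      seek=`expr $TOTAL_SECTORS - 33` conv=notrunc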


If the offset is in the other direction then it means it's been  
padded, probably with metainfo? And you will need to get rid of the  
RDM and use the iSCSI initiator in the solaris vm to mount the volume.  
See how the first 34 sectors look, and if they are damaged take the  
backup GPT to reconstruct the primary GPT and recreate the MBR.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)

2010-03-09 Thread Ross Walker
On Mar 9, 2010, at 1:42 PM, Roch Bourbonnais  
 wrote:




I think this is highlighting that there is an extra CPU requirement
to manage small blocks in ZFS.
The table would probably turn over if you go to 16K zfs records and
16K reads/writes from the application.


Next step for you is to figure out how many read/write IOPS you
expect to take in the real workloads and whether or not the
filesystem portion will represent a significant drain of CPU
resource.


I think it highlights more the problem of ARC vs ramdisk, or  
specifically ZFS on ramdisk while ARC is fighting with ramdisk for  
memory.


It is a wonder it didn't deadlock.

If I were to put a ZFS file system on a ramdisk, I would limit the  
size of the ramdisk and ARC so both, plus the kernel fit nicely in  
memory with room to spare for user apps.
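
Something like this, with the sizes picked for the box (all numbers
are placeholders):

  # in /etc/system, cap the ARC at 1GB (takes effect after a reboot):
  #   set zfs:zfs_arc_max = 0x40000000
  ramdiskadm -a rdzfs 2g
  zpool create ramtest /dev/ramdisk/rdzfs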


-Ross

 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)

2010-03-09 Thread Ross Walker
On Mar 8, 2010, at 11:46 PM, ольга крыжановская wrote:



tmpfs lacks features like quota and NFSv4 ACL support. May not be the
best choice if such features are required.


True, but if the OP is looking for those features they are more than
likely not looking for an in-memory file system.


This would be more for something like temp databases in a RDBMS or a  
cache of some sort.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [indiana-discuss] future of OpenSolaris

2010-02-25 Thread Ross Walker
On Feb 25, 2010, at 9:11 AM, Giovanni Tirloni   
wrote:


On Thu, Feb 25, 2010 at 9:47 AM, Jacob Ritorto wrote:

It's a kind gesture to say it'll continue to exist and all, but
without commercial support from the manufacturer, it's relegated to
hobbyist curiosity status for us.  If I even mentioned using an
unsupported operating system to the higherups here, it'd be considered
absurd.  I like free stuff to fool around with in my copious spare
time as much as the next guy, don't get me wrong, but that's not the
issue.  For my company, no support contract equals 'Death of
OpenSolaris.'

OpenSolaris is not dying just because there is no support contract  
available for it, yet.


Last time I looked Red Hat didn't offer support contracts for Fedora  
and that project is doing quite well.


The difference here is that Red Hat doesn't claim Fedora as a
production OS.

While CentOS is a derivative of RHEL and also comes with no support
contracts, since it just recompiles the RHEL source one gets the
inherited binary stability through that, and technical support
through the community.

OpenSolaris, not being as transparent and being more leading edge,
doesn't get the binary stability that Solaris has, and the community
is always playing catch-up on the technical details. Which makes it
about as suitable for production use as Fedora.

The commercial support contracts attempted to bridge the gap between
the lack of knowledge due to the newness and the binary stability
with patches. Without them OpenSolaris is no longer really production
quality.


A little scattered in my reasoning but I think I get the main idea  
across.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Abysmal ISCSI / ZFS Performance

2010-02-19 Thread Ross Walker

On Feb 19, 2010, at 4:57 PM, Ragnar Sundblad  wrote:



On 18 feb 2010, at 13.55, Phil Harman wrote:

...
Whilst the latest bug fixes put the world to rights again with  
respect to correctness, it may be that some of our performance  
workarounds are still unsafe (i.e. if my iSCSI client assumes all
writes are synchronised to nonvolatile storage, I'd better be  
pretty sure of the failure modes before I work around that).


But are there any clients that assume that an iSCSI volume is  
synchronous?


Isn't an iSCSI target supposed to behave like any other SCSI disk
(pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
With that I mean: A disk which understands SCSI commands with an
optional write cache that could be turned off, with cache sync
command, and all those things.
Put in another way, isn't is the OS/file systems responsibility to
use the SCSI disk responsibly regardless of the underlying
protocol?


That was my argument a while back.

If you use /dev/dsk then all writes should be asynchronous and WCE
should be on, and the initiator should issue a 'sync' to make sure
it's in NV storage; if you use /dev/rdsk all writes should be
synchronous and WCE should be off. RCD should be off in all cases
and the ARC should cache all it can.


Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if
the initiator flags write cache is the wrong way to go about it. It's
more complicated than it needs to be and it leaves setting the
storage policy up to the system admin rather than the storage admin.


It would be better to put effort into supporting the FUA and DPO
options in the target than dynamically changing a volume's cache
policy from the initiator side.


-Ross
 
___

zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] verging OT: how to buy J4500 w/o overpriced

2010-02-10 Thread Ross Walker

On Feb 9, 2010, at 1:55 PM, matthew patton  wrote:

The cheapest solution out there that isn't a Supermicro-like server  
chassis, is DAS in the form of HP or Dell MD-series which top out at  
15 or 16 3" drives. I can only chain 3 units per SAS port off a HBA  
in either case.


The new Dell MD11XX series is 24 2.5" drives and you can chain 3 of  
them together off a single controller. If your drives are dual ported  
you can use both HBA ports for redundant paths.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS access by OSX clients (was Cores vs. Speed?)

2010-02-09 Thread Ross Walker
On Feb 8, 2010, at 4:58 PM, Edward Ned Harvey wrote:


How are you managing UID's on the NFS server?  If user eharvey  
connects to
server from client Mac A, or Mac B, or Windows 1, or Windows 2, or  
any of
the linux machines ... the server has to know it's eharvey, and  
assign the
correct UID's etc.  When I did this in the past, I maintained a list  
of
users in AD, and duplicate list of users in OD, so the mac clients  
could
resolve names to UID's via OD.  And a third duplicate list in NIS so  
the
linux clients could resolve.  It was terrible.  You must be doing
something better?


The way I did this type of integration in my environment was to set
up a Linux box with winbind and have the NIS maps pull out just the
UID ranges I wanted shared over NIS, with all passwords blanked out.
Then all *nix based systems use NIS+Kerberos.
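
The winbind side of that is only a few lines of smb.conf; from memory
it looked roughly like this (realm and ranges are placeholders, and
the idmap option names have changed across Samba releases):

  [global]
      security = ads
      realm = EXAMPLE.COM
      # algorithmic RID->UID mapping, so no per-user work in AD
      idmap backend = rid:EXAMPLE=10000-99999
      idmap uid = 10000-99999
      idmap gid = 10000-99999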


I suppose one could do the same with LDAP, but winbind has the  
advantage of auto-creating UIDs based on the user's RID+mapping range  
which saves A LOT of work in creating UIDs in AD.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Cores vs. Speed?

2010-02-05 Thread Ross Walker

On Feb 5, 2010, at 10:49 AM, Robert Milkowski  wrote:


Actually, there is.
One difference is that when writing to a raid-z{1|2} pool compared
to a raid-10 pool you should get better throughput if at least 4
drives are used. Basically it is due to the fact that in RAID-10 the
maximum you can get in terms of write throughput is a total
aggregated throughput of half the number of used disks, and only
assuming there are no other bottlenecks between the OS and disks,
especially as you need to take into account that you double the
bandwidth requirements due to mirroring. In case of RAID-Zn you have
some extra overhead for writing additional checksum but other than
that you should get a write throughput closer to T-N (where N is the
RAID-Z level) instead of T/2 in RAID-10.


That hasn't been my experience with raidz. I get a max read and write  
IOPS of the slowest drive in the vdev.


Which makes sense because each write spans all drives and each read  
spans all drives (except the parity drives) so they end up having the  
performance characteristics of a single drive.


Now if you have enough drives you can create multiple raidz vdevs to
get the IOPS up, but you need a lot more drives than what multiple
mirror vdevs can provide IOPS-wise with the same number of spindles.
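
A quick worked example with 12 spindles at roughly 100 IOPS each
(very rough numbers, just to show the shape of it):

  2 x 6-disk raidz2 vdevs : ~2 x 100 = ~200 random IOPS, 8 disks of space
  6 x 2-way mirror vdevs  : ~6 x 100 = ~600 random write IOPS
                            (reads can hit both sides, so up to ~1200),
                            6 disks of space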


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-04 Thread Ross Walker


Interesting, can you explain what zdb is dumping exactly?

I suppose you would be looking for blocks referenced in the snapshot
that have a single reference and print out the associated
file/directory name?


-Ross


On Feb 4, 2010, at 7:29 AM, Darren Mackay  wrote:


Hi Ross,

zdb -  f...@snapshot | grep "path" | nawk '{print $2}'

Enjoy!

Darren Mackay
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-04 Thread Ross Walker





On Feb 4, 2010, at 2:00 AM, Tomas Ögren  wrote:


On 03 February, 2010 - Frank Cusack sent me these 0,7K bytes:

On February 3, 2010 12:04:07 PM +0200 Henu   
wrote:

Is there a possibility to get a list of changed files between two
snapshots? Currently I do this manually, using basic file system
functions offered by OS. I scan every byte in every file manually  
and it

 ^^^

On February 3, 2010 10:11:01 AM -0500 Ross Walker wrote:
Not a ZFS method, but you could use rsync with the dry run option to
list all changed files between two file systems.


That's exactly what the OP is already doing ...


rsync by default compares metadata first, and only checks through
every byte if you add the -c (checksum) flag.

I would say rsync is the best tool here.

The "find -newer blah" suggested in other posts won't catch newer
files with an old timestamp (which could happen for various reasons,
like being copied with kept timestamps from somewhere else).


Find -newer doesn't catch files added or removed; it assumes
identical trees.


I would be interested in comparing ddiff, bart and rsync (local
comparison only) to see empirically how they match up.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-03 Thread Ross Walker
On Feb 3, 2010, at 8:59 PM, Frank Cusack   
wrote:


On February 3, 2010 6:46:57 PM -0500 Ross Walker  
 wrote:

So was there a final consensus on the best way to find the difference
between two snapshots (files/directories added, files/directories  
deleted

and file/directories changed)?

Find won't do it, ddiff won't do it, I think the only real option is
rsync.


I think you misread the thread.  Either find or ddiff will do it and
either will be better than rsync.


Find can find files that have been added or removed between two  
directory trees?


How?

-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How to get a list of changed files between two snapshots?

2010-02-03 Thread Ross Walker
On Feb 3, 2010, at 12:35 PM, Frank Cusack wrote:


On February 3, 2010 12:19:50 PM -0500 Frank Cusack wrote:

If you do need to know about deleted files, the find method still may
be faster depending on how ddiff determines whether or not to do a
file diff.  The docs don't explain the heuristics so I wouldn't want
to guess on that.


An improvement on finding deleted files with the find method would
be to not limit your find criteria to files.  Directories with
deleted files will be newer than in the snapshot so you only need
to look at those directories.  I think this would be faster than
ddiff in most cases.


So was there a final consensus on the best way to find the difference  
between two snapshots (files/directories added, files/directories  
deleted and files/directories changed)?


Find won't do it, ddiff won't do it, I think the only real option is  
rsync. Of course you can zfs send the snap to another system and do  
the rsync there against a local previous version.


-Ross

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

